1,720,973 research outputs found
Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks
A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task that is of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, up to date there is not a fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. Therefore, this lack of benchmark studies prevent the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach over the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.Fil: Fenoy, Luis Emilio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentin
Anc2vec: Embedding gene ontology terms by preserving ancestors relationships
The gene ontology (GO) provides a hierarchical structure with a controlled vocabulary composed of terms describing functions and localization of gene products. Recent works propose vector representations, also known as embeddings, of GO terms that capture meaningful information about them. Significant performance improvements have been observed when these representations are used on diverse downstream tasks, such as the measurement of semantic similarity between GO terms and functional similarity between proteins. Despite the success shown by these approaches, existing embeddings of GO terms still fail to capture crucial structural features of the GO. Here, we present anc2vec, a novel protocol based on neural networks for constructing vector representations of GO terms by preserving three important ontological features: its ontological uniqueness, ancestors hierarchy and sub-ontology membership. The advantages of using anc2vec are demonstrated by systematic experiments on diverse tasks: visualization, sub-ontology prediction, inference of structurally related terms, retrieval of terms from aggregated embeddings, and prediction of protein-protein interactions. In these tasks, experimental results show that the performance of anc2vec representations is better than those of recent approaches. This demonstrates that higher performances on diverse tasks can be achieved by embeddings when the structure of the GO is better represented. Full source code and data are available at https://github.com/sinc-lab/anc2vec.Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Stegmayer, Georgina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentin
Learning Markov Network Structures Constrained by Context-Specific Independences
This work focuses on learning the structure of Markov networks from data. Markov networks are parametric models for compactly representing complex probability distributions. These models are composed by: a structure and numerical weights, where the structure describes independences that hold in the distribution. Depending on which is the goal of structure learning, learning algorithms can be divided into: density estimation algorithms, where structure is learned for answering inference queries; and knowledge discovery algorithms, where structure is learned for describing independences qualitatively. The latter algorithms present an important limitation for describing independences because they use a single graph; a coarse grain structure representation which cannot represent flexible independences. For instance, context-specific independences cannot be described by a single graph. To overcome this limitation, this work proposes a new alternative representation named canonical model as well as the CSPC algorithm; a novel knowledge discovery algorithm for learning canonical models by using context-specific independences as constraints. On an extensive empirical evaluation, CSPC learns more accurate structures than state-of-the-art density estimation and knowledge discovery algorithms. Moreover, for answering inference queries, our approach obtains competitive results against density estimation algorithms, significantly outperforming knowledge discovery algorithms.Fil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; ArgentinaFil: Schluter, Federico Enrique Adolfo. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; ArgentinaFil: Bromberg, Facundo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentina. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de Información; Argentin
Male sterility and somatic hybridization in plant breeding
Plant male sterility refers to the failure in the production of fertile pollen. It occurs spon-taneously in natural populations and may be caused by genes encoded in the nuclear (genicmale sterility; GMS) or mitochondrial (cytoplasmic male sterility; CMS) genomes. Thisfeature has great agronomic value for the production of hybrid seeds, since it prevents self-pollination without the need of emasculation which is time-consuming and cost-intensive.CMS has been widely used in crops, such as corn, rice, wheat, citrus, and several speciesof the family Solanaceae. Mitochondrial genes determining CMS have been uncovered ina wide range of plant species. The modes of action of CMS have been classified in terms ofthe effect they produce in the cell, which ultimately leads to a failure in the production offertile pollen. Male fertility can be restored by nuclear-encoded genes, termed restorer-of-fertility (Rf) factors. CMS from wild plants has been transferred to species of agronomicinterest through somatic hybridization. Somatic hybrids have also been produced togenerate CMS de novo upon recombination of the mitochondrial genomes of two parentalplants or by separating the CMS cytoplasm from the nuclear Rf alleles. As a result, somatichybridization can be used as a highly efficient and useful strategy to incorporate CMS inbreeding programs.Fil: Garcia, Laura Evangelina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina. Universidad Nacional de Cuyo; ArgentinaFil: Edera, Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; ArgentinaFil: Marfil, Carlos Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; ArgentinaFil: Sánchez Puerta, María Virginia. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentin
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis
The IBMAP approach for Markov network structure learning
In this work we consider the problem of learning the structure of Markov networks from data. We present an approach for tackling this problem called IBMAP, together with an efficient instantiation of the approach: the IBMAP-HC algorithm, designed for avoiding important limitations of existing independence-based algorithms. These algorithms proceed by performing statistical independence tests on data, trusting completely the outcome of each test. In practice tests may be incorrect, resulting in potential cascading errors and the consequent reduction in the quality of the structures learned. IBMAP contemplates this uncertainty in the outcome of the tests through a probabilistic maximum-a-posteriori approach. The approach is instantiated in the IBMAP-HC algorithm, a structure selection strategy that performs a polynomial heuristic local search in the space of possible structures. We present an extensive empirical evaluation on synthetic and real data, showing that our algorithm outperforms significantly the current independence-based algorithms, in terms of data efficiency and quality of learned structures, with equivalent computational complexities. We also show the performance of IBMAP-HC in a real-world application of knowledge discovery: EDAs, which are evolutionary algorithms that use structure learning on each generation for modeling the distribution of populations. The experiments show that when IBMAP-HC is used to learn the structure, EDAs improve the convergence to the optimum.Fil: Schluter, Federico Enrique Adolfo. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de información. Laboratorio DHARMa; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; ArgentinaFil: Bromberg, Facundo. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de información. Laboratorio DHARMa; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; ArgentinaFil: Edera, Alejandro. Universidad Tecnológica Nacional. Facultad Regional Mendoza. Departamento de Sistemas de información. Laboratorio DHARMa; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; Argentin
Dispelling the Myths Behind First-author Citation Counts
We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued
use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation
counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more
sophisticated methods
- …
