
    Automated gene function prediction through gene multifunctionality in biological networks

    As the number of sequenced genomes rapidly grows, Automated Prediction of gene Function (AFP) is now a challenging problem. Despite significant progress in the last several years, the accuracy of gene function prediction still needs to improve before it can be used effectively in practice. Two of the main issues of the AFP problem are the imbalance of gene functional annotations and the 'multifunctional properties' of genes. While the former is a well-studied problem in machine learning, the latter has only recently emerged in bioinformatics, and few studies have addressed it. Here we propose a method for AFP which appropriately handles the label imbalance characterizing biological taxonomies, and embeds in the model the property of some genes of being 'multifunctional'. We tested the method in predicting the functions of the Gene Ontology functional hierarchy for genes of yeast and fly model organisms, in a genome-wide approach. The achieved results show that cost-sensitive strategies and 'gene multifunctionality' can be combined to achieve significantly better results than the compared state-of-the-art algorithms for AFP.

    Gene2DisCo : gene to disease using disease commonalities

    OBJECTIVE: Finding the human genes co-causing complex diseases, also known as "disease-genes", is one of the emerging and challenging tasks in biomedicine. This process, termed gene prioritization (GP), is characterized by a scarcity of known disease-genes for most diseases, and by a vast amount of heterogeneous data, usually encoded into networks describing different types of functional relationships between genes. In addition, different diseases may share common profiles (e.g. genetic or therapeutic profiles), and exploiting disease commonalities may significantly enhance the performance of GP methods. This work aims to provide a systematic comparison of several disease similarity measures, and to embed disease similarities and heterogeneous data into a flexible framework for gene prioritization which specifically handles the lack of known disease-genes. METHODS: We present a novel network-based method, Gene2DisCo, based on generalized linear models (GLMs) to effectively prioritize genes by exploiting data regarding disease-genes, gene interaction networks and disease similarities. The scarcity of disease-genes is addressed by applying an efficient negative selection procedure, together with imbalance-aware GLMs. Gene2DisCo is a flexible framework, in the sense that it does not depend on specific types of data or on specific disease ontologies. RESULTS: On a benchmark dataset composed of nine human networks and 708 medical subject headings (MeSH) diseases, Gene2DisCo largely outperformed the best benchmark algorithm, kernelized score functions, in terms of both area under the ROC curve (0.94 against 0.86) and precision at given recall levels (for recall levels from 0.1 to 1 with steps 0.1). Furthermore, we enriched and extended the benchmark data to the whole human genome and provided the top-ranked unannotated candidate genes even for MeSH disease terms without known annotations.
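
The "precision at given recall levels" metric mentioned in the abstract above can be illustrated with a minimal sketch. The function name `precision_at_recall_levels`, its arguments, and the choice of taking precision at the first ranking cutoff that reaches each recall level are assumptions made for illustration; the paper's exact evaluation protocol is not given here.

```python
def precision_at_recall_levels(scores, labels, levels=None):
    """Return {recall_level: precision} for a ranked list of genes.

    scores: predicted scores, higher means more likely a disease-gene.
    labels: 1 for known disease-genes, 0 otherwise.
    levels: recall levels to evaluate (default 0.1, 0.2, ..., 1.0).
    """
    if levels is None:
        levels = [round(0.1 * k, 1) for k in range(1, 11)]
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank genes by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    result = {}
    for level in levels:
        true_pos = 0
        for rank, (_, label) in enumerate(ranked, start=1):
            true_pos += label
            # Small tolerance guards against floating-point rounding.
            if true_pos / n_pos >= level - 1e-12:
                result[level] = true_pos / rank
                break
    return result


if __name__ == "__main__":
    # Toy ranking: 4 genes, 2 of them positive.
    print(precision_at_recall_levels([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

With highly imbalanced labels, this kind of curve tends to be more informative than accuracy, since a classifier predicting only negatives would score near-perfect accuracy while retrieving no disease-genes.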

    A remark on a paper of F. Chiarenza and M. Frasca

    In 1990 F. Chiarenza and M. Frasca published a paper in which they generalized a result of C. Fefferman on estimates of the integral of $|bu|^{p}$ through the integral of $|Du|^{p}$ for $p>1$. Formally their proof is valid only for $d\geq 3$. We present here a further generalization with a different proof in which $D$ is replaced with the fractional power of the Laplacian for any dimension $d\geq 1$. Comment: 4 pages

    COSNet: An R package for label prediction in unbalanced biological networks

    Several problems in computational biology and medicine are modeled as learning problems in graphs, where nodes represent the biological entities to be studied, e.g. proteins, and connections represent different kinds of relationships among them, e.g. protein-protein interactions. In this context, classes are usually characterized by a high imbalance, i.e. positive examples for a class are far fewer than negative ones. Although several works have studied this problem, no graph-based software designed to explicitly take into account the label imbalance in biological networks is available. We propose COSNet, an R package to serve this purpose. COSNet deals with the label imbalance problem by implementing a novel parametric model of Hopfield Network (HN), in which the output levels and activation thresholds of neurons are parameters to be learnt automatically. Thanks to its quasi-linear time complexity, COSNet scales well when the number of instances is large, and application examples to challenging problems in biomedicine show the efficiency and the accuracy of the proposed library.

    Selection of negatives in Hopfield networks

    In this work we propose a novel methodology for graph-based semi-supervised learning which is composed of two main steps: Step 1) a novel strategy for positive-unlabeled (PU) learning specific to Hopfield networks, which can be applied both to structured classes and to hierarchy-less contexts; Step 2) a semi-supervised classifier based on a family of parametric Hopfield networks, which embeds the negative selection performed at Step 1) in the dynamics of the network.

    Multi-Task Label Propagation with Dissimilarity Measures

    Multi-task algorithms typically use task similarity information as a bias to speed up learning. We argue that, when the classification problem is unbalanced, task dissimilarity information provides a more effective bias, as rare class labels tend to be better separated from the frequent class labels. In particular, we show that a multi-task extension of the label propagation algorithm for graph-based classification works much better on protein function prediction problems when the task relatedness information is represented using a dissimilarity matrix as opposed to a similarity matrix
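
For readers unfamiliar with the base algorithm, the following is a minimal single-task sketch of iterative label propagation on a weighted graph. It does not reproduce the multi-task extension or the dissimilarity matrix described in the abstract above; the function name `label_propagation`, the clamping of labeled nodes, and the fixed iteration count are illustrative assumptions.

```python
def label_propagation(weights, labels, n_iter=100):
    """Propagate 0/1 labels over a weighted undirected graph.

    weights: symmetric adjacency matrix as a list of lists.
    labels:  dict mapping labeled node index -> 0 or 1.
    Returns a list of scores in [0, 1], one per node.
    """
    n = len(weights)
    # Unlabeled nodes start at 0; labeled nodes start at their label.
    scores = [float(labels.get(i, 0.0)) for i in range(n)]
    for _ in range(n_iter):
        new_scores = scores[:]
        for i in range(n):
            if i in labels:
                continue  # labeled nodes stay clamped to their label
            degree = sum(weights[i])
            if degree > 0:
                # Weighted average of the neighbours' current scores.
                new_scores[i] = sum(
                    w * s for w, s in zip(weights[i], scores)
                ) / degree
        scores = new_scores
    return scores


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with node 0 positive and node 2 negative.
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(label_propagation(path, {0: 1, 2: 0}))
```

On rare classes the propagated scores of unlabeled nodes are pulled toward zero by the many negative neighbours, which is one way to see why an imbalance-aware bias, such as the task dissimilarity argued for above, may help.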

    GRAPH-BASED APPROACHES FOR IMBALANCED DATA IN FUNCTIONAL GENOMICS

    The Gene Function Prediction (GFP) problem consists in inferring biological properties for the genes whose function is unknown or only partially known, and raises challenging issues from both a machine learning and a computational biology standpoint. The GFP problem can be formalized as a semi-supervised learning problem in an undirected graph. Indeed, given a graph with a partial graph labeling, where nodes represent genes, edges represent functional relationships between genes, and labels represent their membership in functional classes, GFP consists in inferring the unknown functional classes of genes by exploiting the topological relationships of the networks and the available a priori knowledge about the functional properties of genes. Several network-based machine learning algorithms have been proposed for solving this problem, including Hopfield networks and label propagation methods; however, some issues have been only partially considered, e.g. the preservation of the prior knowledge and the imbalance between positive and negative labels. A first contribution of the thesis is the design of a Hopfield-based cost-sensitive neural network algorithm (COSNet) to address these learning issues. The method factorizes the solution of the problem in two parts: 1) the subnetwork composed of the labeled vertices is considered, and the network parameters are estimated through a supervised algorithm; 2) the estimated parameters are extended to the subnetwork composed of the unlabeled vertices, and the attractor reached by the dynamics of this subnetwork is used to predict the labeling of the unlabeled vertices. The proposed method embeds in the neural algorithm the “a priori” knowledge coded in the labeled part of the graph, and separates node labels from neuron states, which makes it possible to weight positive and negative node labels differently and to adopt a learning approach that takes into account the “unbalance problem” that affects GFP.
A second contribution of this thesis is the development of a new algorithm (LSI) which exploits some ideas of COSNet for evaluating the predictive capability of each input network. With this algorithm we can estimate the effectiveness of each source of data for predicting a specific class, and then use this information to integrate multiple networks by weighting them according to an appropriate integration scheme. Both COSNet and LSI are computationally efficient and scale well with the dimension of the data. COSNet and LSI have been applied to the genome-wide prediction of gene functions in the yeast and mouse model organisms, achieving results comparable with those obtained with state-of-the-art semi-supervised and supervised machine learning methods.
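
The second part of the factorization described in this abstract, running the network dynamics on the unlabeled subnetwork only, can be sketched as follows. This is not the thesis's implementation: the parameter-estimation step is omitted, and the names `run_unlabeled_dynamics`, `pos_value`, `neg_value`, and `threshold`, as well as the zero initialisation of unlabeled neurons, are assumptions made for illustration.

```python
def run_unlabeled_dynamics(weights, labeled_state, unlabeled,
                           pos_value, neg_value, threshold,
                           max_sweeps=100):
    """Asynchronous Hopfield updates restricted to unlabeled nodes.

    weights:       symmetric matrix (list of lists) with zero diagonal.
    labeled_state: dict node -> pos_value or neg_value; never updated.
    unlabeled:     list of node indices whose state is to be inferred.
    pos_value, neg_value, threshold: network parameters, assumed to have
        been estimated beforehand on the labeled subnetwork.
    Returns a dict node -> True if predicted positive, else False.
    """
    state = dict(labeled_state)
    for i in unlabeled:
        state[i] = 0.0  # assumed neutral starting value
    for _ in range(max_sweeps):
        changed = False
        for i in unlabeled:
            # Input from every other neuron, labeled or not.
            field = sum(weights[i][j] * state[j] for j in state if j != i)
            new_value = pos_value if field - threshold > 0 else neg_value
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break  # no neuron changed: a fixed point (attractor) is reached
    return {i: state[i] == pos_value for i in unlabeled}


if __name__ == "__main__":
    # Nodes 0 (positive) and 1 (negative) are labeled; 2 and 3 are not.
    w = [[0, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 0],
         [0, 1, 0, 0]]
    print(run_unlabeled_dynamics(w, {0: 1.0, 1: -1.0}, [2, 3],
                                 pos_value=1.0, neg_value=-1.0,
                                 threshold=0.0))
```

Because the labeled neurons are held fixed, each sweep costs time proportional to the edges touching unlabeled nodes, which is consistent with the abstract's point that restricting the dynamics preserves prior knowledge while keeping the computation light.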

    Regularized network-based algorithm for predicting gene functions with high-imbalanced data

    Motivations. The gene function prediction problem is a real-world problem that consists in finding new bio-molecular functions of genes/gene products and is characterized by hundreds or thousands of functional classes structured according to a predefined hierarchy. This problem can be formalized as a semi-supervised multi-class, multi-label classification problem where the biological functions of new genes can be predicted by exploiting their connections with genes whose biological functions are known. Many different approaches have been proposed to address this problem, including "guilt-by-association" [1], "label propagation" [2], module-assisted techniques [3], and SVMs [4]. Nevertheless, these methods usually suffer a decay in performance when input data are highly unbalanced, that is, when positive examples are significantly fewer than negatives. This scenario characterizes in particular the most specific classes of the ontology, which are the classes farthest from the root classes and which best describe the functions of genes. Methods. To address these issues, we propose a regularization of a Hopfield-based cost-sensitive algorithm, COSNet, recently proposed to predict gene functions [5]. This algorithm, although designed to manage the imbalance in labeled data, tends to predict an excessively high proportion of positives when data are particularly unbalanced (that is, in particular on the most specific classes). By adding a term to the energy function of the network, we are able to modify the dynamics so as to prevent the number of positives from becoming too large. This energy term is minimized when the proportion of positive neurons (current positive rate) resembles the rate of positive labels in the training set (expected positive rate). The larger the difference between current and expected positive rates, the larger the penalty added to the energy function. We call this regularized version R-COSNet. Results.
We tested R-COSNet on function prediction for yeast genes, using four different data sets and the classes of the FunCat ontology [6]. This ontology is structured as a forest of trees, in which each node belongs to one of six levels of specificity. Level 1 refers to the root nodes, level i to nodes at distance i from the root. The considered classes are those with at least 20 positives, and they span levels 1 to 5. We compared our method with a label propagation algorithm, LP-Zhu [2], and a Support Vector Machine (SVM) with probabilistic output [4]. In Figure 1 we report the results in terms of F-score averaged across the functional classes belonging to level 4 and level 5 of the hierarchy.
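
The regularization idea in the Methods part of this abstract, an extra energy term that grows with the gap between the current and the expected positive rate, can be sketched as below. The abstract does not give the form of the term, so the quadratic penalty, the weight `eta`, and the function names are assumptions chosen for illustration, not the R-COSNet definition.

```python
def hopfield_energy(weights, state, threshold):
    """Standard Hopfield energy with a single shared threshold.

    E(x) = -sum_{i<j} w_ij * x_i * x_j + threshold * sum_i x_i
    """
    n = len(state)
    pairwise = sum(
        weights[i][j] * state[i] * state[j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return -pairwise + threshold * sum(state)


def regularized_energy(weights, state, threshold,
                       pos_value, expected_rate, eta):
    """Hopfield energy plus an assumed quadratic positive-rate penalty.

    expected_rate: fraction of positive labels in the training set.
    eta:           hypothetical weight of the penalty term.
    """
    current_rate = sum(1 for s in state if s == pos_value) / len(state)
    # Zero when the rates match; grows with their difference.
    penalty = eta * (current_rate - expected_rate) ** 2
    return hopfield_energy(weights, state, threshold) + penalty


if __name__ == "__main__":
    w = [[0.0, 1.0], [1.0, 0.0]]
    # Both neurons positive while only half are expected to be.
    print(regularized_energy(w, [1.0, 1.0], 0.0, 1.0, 0.5, 4.0))
```

In this toy case the all-positive state has plain energy -1, but the penalty raises it to 0, which shows how such a term can make over-predicting positives less attractive to the dynamics.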

    A cost-sensitive neural algorithm to predict gene functions using large biological networks

    Biological networks can represent different types of relationships between biomolecular entities (e.g. genes or proteins), ranging from genetic or physical interactions to gene expression correlations, chemical reactions, or co-occurrences in bio-medical literature. In this context, a central problem is the integration of different networks and the development of algorithms to infer the underlying biological properties of the biological entities, e.g. the functional classes of genes or the potential protein targets of a drug, with relevant applications in functional genomics, proteomics and pharmacogenomics. In particular, the gene function prediction problem can be formalized as a semi-supervised multi-class, multi-label classification problem where the unknown labels of the unlabeled part of the network can be predicted by exploiting the known labels of the labeled part and the relationships connecting the nodes of the network. Several approaches have been proposed to address this problem, including simple “guilt-by-association” methods (Marcotte et al. 1999), “label propagation” algorithms (Zhou et al. 2003), Markov (Deng et al., 2004) and Gaussian Random fields (Tsuda et al. 2005, Mostafavi et al. 2008). Unfortunately, none of these methods has been specifically designed to manage the imbalance which often characterizes gene functional classes, with negative examples that largely outnumber positives. Moreover, most of these methods do not preserve the prior knowledge coded in the labeling of genes. To address these issues, we propose a Hopfield-based cost-sensitive neural algorithm which preserves the prior information and introduces an efficient cost-sensitive strategy to learn the appropriate parameters of the network (neuron states and their thresholds) in order to manage the imbalance between positive and negative examples in functional classes.
Our method factorizes the solution of the problem in two parts: 1) the sub-network composed of the labeled vertices is considered, and the network parameters are estimated through an efficient supervised algorithm; 2) the estimated parameters are extended to the subnetwork composed of the unlabeled vertices, and the attractor reached by the dynamics of this subnetwork is used to predict the labeling of the unlabeled vertices. Moreover, our method can efficiently integrate multiple sources of data and significantly reduces the computational complexity by restricting the network dynamics to the unlabeled part of the network. The algorithm is fast, scales well when new sources of data are added, and can be efficiently applied to large biological networks. We tested our method on the yeast and mouse model organisms at genome-wide level, using both the FunCat and Gene Ontology taxonomies, integrating different data sources, including protein domain, gene expression, and protein interactions. Cross-validated results show that our integrated approach achieves results competitive with state-of-the-art semi-supervised and supervised methods on the MouseFunc benchmark data (Pena-Castillo et al., 2008). Moreover, our cost-sensitive approach significantly outperforms state-of-the-art hierarchical ensemble methods (Cesa-Bianchi et al., 2010) using multiple sources of data and the whole FunCat taxonomy with the yeast model organism.