1,721,000 research outputs found

    TransformerGO: predicting protein–protein interactions by modelling the attention between sets of gene ontology terms

    MOTIVATION: Protein–protein interactions (PPIs) play a key role in diverse biological processes, but only a small subset of the interactions has been experimentally identified. Additionally, high-throughput experimental techniques that detect PPIs are known to suffer from various limitations, such as high false positive and false negative rates. The semantic similarity derived from the Gene Ontology (GO) annotation is regarded as one of the most powerful indicators for protein interactions. However, while computational approaches for prediction of PPIs have gained popularity in recent years, most methods fail to capture the specificity of GO terms. RESULTS: We propose TransformerGO, a model that is capable of capturing the semantic similarity between GO sets dynamically using an attention mechanism. We generate dense graph embeddings for GO terms using node2vec, an algorithmic framework for learning continuous representations of nodes in networks. TransformerGO learns deep semantic relations between annotated terms and can distinguish between negative and positive interactions with high accuracy. TransformerGO outperforms classic semantic similarity measures on gold standard PPI datasets and state-of-the-art machine-learning-based approaches on large datasets from Saccharomyces cerevisiae and Homo sapiens. We show how the neural attention mechanism embedded in the transformer architecture detects relevant functional terms when predicting interactions. AVAILABILITY AND IMPLEMENTATION: https://github.com/Ieremie/TransformerGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
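
    The attention between two sets of GO term embeddings can be illustrated with a minimal sketch. This is not the TransformerGO model: it assumes plain scaled dot-product attention with no learned projections, and it uses random vectors in place of node2vec embeddings. The function name `cross_attention` and the set sizes are hypothetical.

```python
import math
import random


def cross_attention(set_a, set_b):
    """Scaled dot-product attention from each vector in set_a to all vectors in set_b.

    Returns the attended vectors (one per term in set_a) and the attention
    weights (one row per term in set_a, one column per term in set_b).
    """
    dim = len(set_a[0])
    attended, all_weights = [], []
    for query in set_a:
        # Similarity of this term to every term annotated on the other protein.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in set_b]
        # Numerically stable softmax over the other protein's terms.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        # Weighted average of the other protein's term embeddings.
        attended.append([sum(w * value[j] for w, value in zip(weights, set_b))
                         for j in range(dim)])
    return attended, all_weights


# Assumption: random stand-ins for GO term embeddings (3 terms vs. 5 terms, 8 dims).
rng = random.Random(0)
protein_a_terms = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]
protein_b_terms = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
attended, weights = cross_attention(protein_a_terms, protein_b_terms)
```

    Each row of `weights` sums to 1, so a large entry marks a term of the second protein that is most similar to a given term of the first. This is the kind of signal that can be inspected to see which functional terms drive a prediction.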

    Comparative analysis of the Arabidopsis and rice expressed sequence tag (EST) sets

    Large numbers of expressed sequence tags (ESTs) have now been generated from a variety of model organisms. In plants, substantial collections of ESTs are available for Arabidopsis and rice, in each case representing significant proportions of the estimated total numbers of genes. Large-scale comparisons of Arabidopsis and rice sequences are especially interesting because these two species represent the two subclasses of the flowering plants (Dicotyledonae and Monocotyledonae, respectively). Here we present the results of a systematic analysis of the Arabidopsis and rice EST sets. Non-redundant sets of sequences from Arabidopsis and rice were first derived separately and then combined so that gene families common to the two species could be identified. Our results show that 58% of non-singleton ESTs are derived from genes in gene families common to the two species. These gene families constitute the basis of a core set of higher plant genes.

    Protein language models meet reduced amino acid alphabets

    Motivation: Protein language models (PLMs), which borrowed ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvement in several downstream tasks. Clustering amino acids based on their physical-chemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored. Results: Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 proteins out of the 50 targets, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%.
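
    Translating a sequence into a reduced alphabet amounts to mapping every residue to a representative of its cluster. The sketch below shows the idea; the six groups in `GROUPS` are a hypothetical clustering by broad physicochemical class chosen for illustration, not one of the alphabets evaluated in the paper.

```python
# Assumption: an illustrative 6-letter alphabet; each group is represented
# by its first residue. The paper's actual alphabets are not reproduced here.
GROUPS = [
    "AVLIMC",  # aliphatic / hydrophobic
    "FWYH",    # aromatic
    "STNQ",    # polar, uncharged
    "KR",      # positively charged
    "DE",      # negatively charged
    "GP",      # conformationally special
]

# Build a residue -> representative lookup table.
REDUCED = {residue: group[0] for group in GROUPS for residue in group}


def reduce_sequence(sequence):
    """Translate a protein sequence into the reduced alphabet.

    Residues outside the 20 standard amino acids are kept unchanged.
    """
    return "".join(REDUCED.get(residue, residue) for residue in sequence.upper())


example = reduce_sequence("MKTAYIAK")  # "AKSAFAAK"
```

    The translated string uses only six distinct letters, so any model trained on it sees strictly less information than one trained on the full 20-letter sequence.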

    The USP7 protein interaction network and its roles in tumorigenesis

    Ubiquitin-specific protease 7 (USP7), also known as Herpesvirus-associated ubiquitin-specific protease (HAUSP), is a deubiquitinase. There has been significant recent attention on USP7 following the discovery that USP7 is a key regulator of the p53-MDM2 pathway. The USP7 protein is 130 kDa in size and has multiple domains which bind to a diverse set of proteins. These interactions mediate key developmental and homeostatic processes including the cell cycle, immune response, and modulation of transcription factor and epigenetic regulator activity and localization. USP7 also promotes carcinogenesis through aberrant activation of the Wnt signalling pathway and stabilization of HIF-1α. These findings have shown that USP7 may induce tumour progression and be a therapeutic target. Together with interest in developing USP7 as a target, several studies have defined new protein interactions and the regulatory networks within which USP7 functions. In this review, we focus on the protein interactions of USP7 that are most important for its cancer-associated roles.

    Computational framework for analysis of prey–prey associations in interaction proteomics identifies novel human protein–protein interactions and networks

    Large-scale protein-protein interaction data sets have been generated for several species including yeast and human and have enabled the identification, quantification, and prediction of cellular molecular networks. Affinity purification-mass spectrometry (AP-MS) is the preeminent methodology for large-scale analysis of protein complexes, performed by immunopurifying a specific “bait” protein and its associated “prey” proteins. The analysis and interpretation of AP-MS data sets is, however, not straightforward. In addition, although yeast AP-MS data sets are relatively comprehensive, current human AP-MS data sets only sparsely cover the human interactome. Here we develop a framework for analysis of AP-MS data sets that addresses the issues of noise, missing data, and sparsity of coverage in the context of a current, real world human AP-MS data set. Our goal is to extend and increase the density of the known human interactome by integrating bait-prey and cocomplexed preys (prey-prey associations) into networks. Our framework incorporates a score for each identified protein, as well as elements of signal processing to improve the confidence of identified protein-protein interactions. We identify many protein networks enriched in known biological processes and functions. In addition, we show that integrated bait-prey and prey-prey interactions can be used to refine network topology and extend known protein networks.

    The bait compatibility index: computational bait selection for interaction proteomics experiments

    Protein interaction network maps have been generated for multiple species, making use of large-scale methods such as yeast two-hybrid (Y2H) and affinity purification mass spectrometry (AP-MS). These methods take fundamentally different approaches toward characterizing protein networks, and the resulting data sets provide complementary views of the protein interactome. The specific determinants of the outcome of Y2H and AP-MS experiments, in terms of detection of interacting proteins, are, however, poorly understood. Here we show that a statistical model built using sequence- and annotation-based features of bait proteins is able to identify bait features that are significant determinants of the outcome of interaction proteomics experiments. We show that bait features are able to explain in part the disparities observed between Y2H and AP-MS constructed networks and can be used to derive the “bait compatibility index”, a numeric score that assesses the compatibility of bait proteins with each technology. Aside from understanding the bias and limitations of interaction proteomics, our approach provides a rational, data-driven method for prioritization of baits for interaction proteomics experiments, an essential requirement for future proteome-wide applications of these technologies.

    DADA: Degree-Aware Algorithms for Network-Based Disease Gene Prioritization

    Background: High-throughput molecular interaction data have been used effectively to prioritize candidate genes that are linked to a disease, based on the observation that the products of genes associated with similar diseases are likely to interact with each other heavily in a network of protein-protein interactions (PPIs). An important challenge for these applications, however, is the incomplete and noisy nature of PPI data. Information flow based methods alleviate these problems to a certain extent, by considering indirect interactions and multiplicity of paths. Results: We demonstrate that existing methods are likely to favor highly connected genes, making prioritization sensitive to the skewed degree distribution of PPI networks, as well as ascertainment bias in available interaction and disease association data. Motivated by this observation, we propose several statistical adjustment methods to account for the degree distribution of known disease and candidate genes, using a PPI network with associated confidence scores for interactions. We show that the proposed methods can detect loosely connected disease genes that are missed by existing approaches; however, this improvement might come at the price of more false negatives for highly connected genes. Consequently, we develop a suite called DADA, which includes different uniform prioritization methods that effectively integrate existing approaches with the proposed statistical adjustment strategies. Comprehensive experimental results on the Online Mendelian Inheritance in Man (OMIM) database show that DADA outperforms existing methods in prioritizing candidate disease genes. Conclusions: These results demonstrate the importance of employing accurate statistical models and associated adjustment methods in network-based disease gene prioritization, as well as other network-based functional inference applications. DADA is implemented in Matlab and is freely available at http://compbio.case.edu/dada/.
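
    The degree bias described in this abstract can be seen in a toy example. The sketch below is not one of DADA's statistical adjustments: it assumes a simple direct-neighbor score and corrects it by dividing by the candidate's total interaction weight. The network, gene names, and the function `neighbor_scores` are hypothetical.

```python
def neighbor_scores(network, seeds):
    """Score candidate genes by their weighted links to known disease genes.

    network: dict mapping gene -> {neighbor: confidence score}
    seeds:   set of known disease genes
    Returns (raw, adjusted), where adjusted divides the raw score by the
    candidate's total interaction weight (a crude degree correction).
    """
    raw, adjusted = {}, {}
    for gene, neighbors in network.items():
        if gene in seeds:
            continue
        seed_weight = sum(w for n, w in neighbors.items() if n in seeds)
        total_weight = sum(neighbors.values())
        raw[gene] = seed_weight
        adjusted[gene] = seed_weight / total_weight if total_weight else 0.0
    return raw, adjusted


# Assumption: a toy network with one hub and one loosely connected candidate.
toy_network = {
    "HUB": {"S1": 0.9, "S2": 0.9, "X1": 0.9, "X2": 0.9, "X3": 0.9, "X4": 0.9},
    "LOOSE": {"S1": 0.9, "X1": 0.3},
}
raw, adjusted = neighbor_scores(toy_network, seeds={"S1", "S2"})
```

    The raw score ranks `HUB` first simply because it has more interactions, while the adjusted score ranks `LOOSE` first because most of its interaction weight goes to a known disease gene. The same example hints at the trade-off noted in the abstract: a correction of this kind can push genuinely relevant hubs down the ranking.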

    ROCS: a reproducibility index and confidence score for interaction proteomics

    Affinity-Purification Mass-Spectrometry (AP-MS) provides a powerful means of identifying protein complexes and interactions. Several important challenges exist in interpreting the results of AP-MS experiments. First, the reproducibility of AP-MS experimental replicates can be low, due both to technical variability and the dynamic nature of protein interactions in the cell. Second, the identification of true protein-protein interactions in AP-MS experiments is subject to inaccuracy due to high false negative and false positive rates. Several experimental approaches can be used to mitigate these drawbacks, including the use of replicated and control experiments and relative quantification to sensitively distinguish true interacting proteins from false ones. RESULTS: To address the issues of reproducibility and accuracy of protein-protein interactions, we introduce a two-step method, called ROCS, which makes use of Indicator Proteins to select reproducible AP-MS experiments, and of Confidence Scores to select specific protein-protein interactions. The Indicator Proteins account for measures of protein identification as well as protein reproducibility, effectively allowing removal of outlier experiments that contribute noise and affect downstream inferences. The filtered set of experiments is then used in the Protein-Protein Interaction (PPI) scoring step. Prey protein scoring is done by computing a Confidence Score, which accounts for the probability of occurrence of prey proteins in the bait experiments relative to the control experiment, where the significance cutoff parameter is estimated by simultaneously controlling false positives and false negatives against metrics of false discovery rate and biological coherence respectively. In summary, the ROCS method relies on automatic objective criteria for parameter estimation and error-controlled procedures. We illustrate the performance of our method by applying it to five previously published AP-MS experiments, each containing well characterized protein interactions, allowing for systematic benchmarking of ROCS. We show that our method may be used on its own to accurately identify specific, biologically relevant protein-protein interactions or in combination with other AP-MS scoring methods to significantly improve inferences. CONCLUSIONS: Our method addresses important issues encountered in AP-MS datasets, making ROCS a very promising tool for this purpose, either on its own or especially in conjunction with other methods. We anticipate that our methodology may be used more generally in proteomics studies and databases, where experimental reproducibility issues arise. The method is implemented in the R language, and is available as an R package called "ROCS", freely available from the CRAN repository http://cran.r-project.org/

    Network-based approaches for extending the Wnt signaling pathway and identifying context-specific sub-networks

    Wnt signaling is a critically important signaling pathway regulating embryogenesis and differentiation, and is broadly conserved amongst multicellular animals. In addition, dysregulation of Wnt signaling contributes to the pathogenesis of many human cancers, in particular colorectal cancer. Core members of the Wnt signaling pathway are quite well defined, although it has become apparent that a much broader network of interacting proteins regulates Wnt signaling activity. The goal of this paper is first to identify novel members of the Wnt regulatory network and second to identify subnetworks of the larger Wnt signaling network that are active in different biological contexts. We address these two questions using complementary computational approaches and show how these approaches may identify potentially novel Wnt signaling proteins as well as define Wnt sub-networks active in different stages of colorectal cancer.

    iOmicsPASS: network-based integration of multiomics data for predictive subnetwork discovery

    Computational tools for multiomics data integration have usually been designed for unsupervised detection of multiomics features explaining large phenotypic variations. To achieve this, some approaches extract latent signals in heterogeneous data sets from a joint statistical error model, while others use biological networks to propagate differential expression signals and find consensus signatures. However, few approaches directly consider molecular interaction as a data feature, the essential linker between different omics data sets. The increasing availability of genome-scale interactome data connecting different molecular levels motivates a new class of methods to extract interactive signals from multiomics data. Here we developed iOmicsPASS, a tool to search for predictive subnetworks consisting of molecular interactions within and between related omics data types in a supervised analysis setting. Based on user-provided network data and relevant omics data sets, iOmicsPASS computes a score for each molecular interaction, and applies a modified nearest shrunken centroid algorithm to the scores to select densely connected subnetworks that can accurately predict each phenotypic group. iOmicsPASS detects a sparse set of predictive molecular interactions without loss of prediction accuracy compared to alternative methods, and the selected network signature immediately provides mechanistic interpretation of the multiomics profile representing each sample group. Extensive simulation studies demonstrate a clear benefit of interaction-level modeling. iOmicsPASS analysis of TCGA/CPTAC breast cancer data also highlights a new transcriptional regulatory network underlying the basal-like subtype as positive protein markers, a result not seen through analysis of individual omics data.
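
    The shrinkage step at the heart of a nearest shrunken centroid classifier can be sketched as follows. This is the generic soft-thresholding idea only: it omits the per-feature standardization of the standard algorithm and whatever modifications iOmicsPASS applies, and the function names, scores, and threshold are hypothetical.

```python
def soft_threshold(value, delta):
    """Shrink value toward zero by delta; values within delta of zero become 0."""
    magnitude = abs(value) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if value > 0 else -magnitude


def shrink_centroid(class_centroid, overall_centroid, delta):
    """Move a class centroid toward the overall centroid, feature by feature.

    Features whose class-vs-overall difference does not exceed delta collapse
    onto the overall centroid and no longer help separate the classes.
    """
    return [overall + soft_threshold(cls - overall, delta)
            for cls, overall in zip(class_centroid, overall_centroid)]


# Assumption: three interaction scores for one phenotypic group.
overall = [0.0, 0.0, 0.0]
group = [2.0, -0.3, 0.5]
shrunk = shrink_centroid(group, overall, delta=0.5)  # [1.5, 0.0, 0.0]

# Indices of features that still distinguish the group after shrinkage.
selected = [i for i, (s, o) in enumerate(zip(shrunk, overall)) if s != o]  # [0]
```

    Raising `delta` removes more features, which is how this family of methods arrives at a sparse signature.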