
    Computational Protein Function Prediction and its Application to the Missing Enzymes Problem

    Improving the overall annotation level of genomes and completeness of biological pathways with high accuracy is the long term basic goal for this research. Large numbers of proteins are getting sequenced every year, creating a pressing need to build computational techniques for rapidly analyzing genomes to extract relevant knowledge. The purpose of this study is 1) to develop an advanced method to computationally elucidate functions of unannotated proteins, 2) to characterize the relationships between functional terms used to describe the proteins and 3) to further use these relationships to predict missing enzymes in the metabolic pathways. Here we have developed the Extended Similarity Group (ESG) method for protein annotation prediction that iteratively searches the sequence homology space around the query protein and draws consensus from the annotations of proteins in the neighborhood. In terms of prediction accuracy, ESG has been shown to outperform simple PSI-BLAST search and the PFP method previously developed in our lab. Secondly we have designed two scores, Co-occurrence Association Score (CAS) and PubMed Association Score (PAS), that capture the relationship between pairs of Gene Ontology terms used for annotating the proteins. CAS is based on co-occurrence of annotation terms in the database to annotate the same proteins, and PAS is based on co-mentions of annotation terms in the PubMed abstracts. These two scores have been successfully applied to identify functionally coherent groups of proteins that work in coordinated fashion to achieve some biological task. For newly sequenced genomes, metabolic reconstruction often leads to several missing enzymes where a known reaction is not associated with any gene product. 
As the next step, we use the aforementioned function association scores combined with the phylogenetic profile and microarray expression data to find the most likely matches for such missing enzymes, thereby increasing the completeness of biological knowledge. Thus, the principal goal achieved here is to understand and improve the computational characterization of protein annotations, starting from individual proteins and moving towards the systems level.
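
The co-occurrence idea behind CAS can be illustrated with a small sketch: count how often two annotation terms label the same protein and compare that with what independence would predict. The observed/expected ratio used here, the function name `cooccurrence_scores`, and the placeholder term labels are assumptions for illustration only; the abstract does not state the actual CAS formula.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(annotations):
    """Return {(term_a, term_b): observed / expected co-occurrence}.

    annotations maps a protein ID to its list of annotation terms.
    The observed/expected ratio is an illustrative choice, not the
    published CAS definition.
    """
    n_proteins = len(annotations)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        unique = sorted(set(terms))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), observed in pair_counts.items():
        # Expected number of proteins carrying both terms if independent.
        expected = term_counts[a] * term_counts[b] / n_proteins
        scores[(a, b)] = observed / expected
    return scores


# Placeholder term labels, not real Gene Ontology identifiers.
toy_annotations = {
    "P1": ["GO:A", "GO:B"],
    "P2": ["GO:A", "GO:B"],
    "P3": ["GO:A", "GO:C"],
    "P4": ["GO:C"],
}
scores = cooccurrence_scores(toy_annotations)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

A ratio above 1 means the two terms annotate the same proteins more often than chance would suggest, which is the kind of signal a co-occurrence score is meant to capture.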

    Exploring applications of suboptimal alignments in threading protein structure prediction

    Protein structures provide keys to a deep understanding of biological problems and applied research. When experimental structures are unavailable, threading can provide a rough structure prediction that helps biologists accelerate their research. However, current threading faces two challenges: (1) estimating the error of predicted structures; (2) improving the prediction accuracy of threading. The objectives of my Ph.D. project are therefore to develop new techniques that address these challenges. To achieve these objectives, we introduce suboptimal alignments into threading. This idea has rarely been explored in previous studies and makes the following new techniques possible: (1) To predict the threading error: The SPAD score is proposed to quantify the diversity of suboptimal alignments in threading. We then measure the SPAD scores and the errors of 5232 threading predictions made on the L-E dataset. These data show that the logarithms of SPAD scores are linearly correlated with those of threading errors at global and local levels. Seven other error-indicating parameters are collected from the same set of predictions and compared head-to-head with SPAD scores. The comparison indicates that SPAD scores are the best index among these parameters for predicting threading errors, since they have the highest correlation coefficient with prediction errors. We conduct a regression analysis to derive a quantitative relationship between SPAD scores and threading errors. With this relationship, we predicted the errors of 383 CASP threading predictions. The predicted errors match the actual errors well at both global and local levels. (2) To improve the threading accuracy: (i) We propose the reranking strategy and the probabilistic contact strategy to consider two-body contact potentials in threading.
The benchmarking on the SALIGN dataset and the L-E dataset shows that these two strategies improve the template recognition accuracy and the alignment accuracy of threading. (ii) We use the optimal and suboptimal alignments, rather than the optimal alignment alone, to build 3D predicted structures. This technique reduces the RMSD of predicted structures, according to tests on CASP7 targets. (iii) We combine SPAD scores and Z-scores for template recognition, which improves the recognition accuracy on the L-E dataset.
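
The regression step described above, where logarithms of SPAD scores are related linearly to logarithms of threading errors, can be sketched as an ordinary least-squares fit in log-log space. The function names and the data below are assumptions for illustration: the data are synthetic and generated from a known power law, not taken from the thesis.

```python
import math


def fit_loglog(spad_scores, errors):
    """Least-squares fit of log(error) = slope * log(spad) + intercept."""
    xs = [math.log(s) for s in spad_scores]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_error(spad, slope, intercept):
    """Map a SPAD score to a predicted error using the fitted relationship."""
    return math.exp(intercept + slope * math.log(spad))


# Synthetic data following error = 2 * spad ** 0.5 (illustrative only).
spad = [1.0, 4.0, 9.0, 16.0]
err = [2.0 * s ** 0.5 for s in spad]
slope, intercept = fit_loglog(spad, err)
print(slope, math.exp(intercept), predict_error(25.0, slope, intercept))
```

On real predictions the points would scatter around the line, and the correlation coefficient in log space would indicate how reliable the predicted error is.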

    Geometrical analysis of interaction sites of proteins

    Bioinformatics is an interdisciplinary research area between biological science and mathematics, physics, statistics, and computer science to simulate, compute, and predict biological scientific concepts. Proteomics is a branch of bioinformatics research that focuses on the structure and function of proteins. This dissertation focuses on the geometrical analysis of interaction sites of proteins, studying four proteomics research problems using protein surface representations. The major contributions are: 1. Benchmark analysis of a fast protein tertiary structure retrieval method based on global surface shape similarity; 2. Development of a method for characterization of local geometry of protein surfaces using a visibility criterion; 3. Improved protein-protein docking prediction accuracy using predicted protein-protein interface information; and 4. Proposal of a new method for flexible docking using the CABS model.

    Computational Methods for Protein-Protein Interaction Identification

    Understanding protein-protein interactions (PPIs) in a cell is essential for learning protein functions, pathways, and mechanisms of diseases. This dissertation introduces a computational method to predict PPIs. In the first chapter, the history of identifying protein interactions and some experimental methods are introduced. Because interacting proteins share similar functions, protein function similarity can be used as a feature to predict PPIs. The NaviGO server was developed for biologists and bioinformaticians to visualize Gene Ontology relationships and quantify their similarity scores. Furthermore, the computational features used to predict PPIs are summarized. This will help researchers from the computational field understand the rationale for extracting biological features and will also help researchers with a biology background understand the computational work. Building on these features, a computational prediction method to identify large-scale PPIs was developed and applied to Arabidopsis, maize, and soybean on a whole-genome scale. Novel predicted PPIs are provided and grouped by prediction confidence level, and can be used as testable hypotheses to guide biologists’ experiments. Since affinity chromatography combined with mass spectrometry yields many false-positive PPIs, the computational method was combined with mass spectrometry data to aid the large-scale identification of high-confidence PPIs. Lastly, some remaining challenges of computational PPI prediction methods and future work are discussed.

    RNA-protein interactions: Analysis of binding interfaces and prediction of protein binding sites in RNA

    RNA-protein interactions are vital to many biological processes such as translation and splicing. Analysis of the binding interfaces in RNA-protein complexes obtained from the Protein Data Bank reveals molecular properties in RNA and protein that are statistically favored in binding regions as opposed to non-binding regions. For example, although the nucleotide guanine is preferred when RNA bases form hydrogen bonds with the proteins, it is disfavored when the RNA backbone interacts with the protein. Protein binding is favored in RNA loop regions over those that form Watson-Crick base-pairs. For proteins, positively charged amino acids such as arginine are frequently observed interacting with the negatively charged RNA backbone. Aromatic protein residues are also seen stacking with the nucleotides. Such insights into recognition principles governing RNA-protein interactions can be translated into computational prediction of binding sites in participating RNAs and proteins, thus aiding in their functional annotation. Because the statistical analysis revealed that RNA has distinctive sequence and structure at protein binding and non-binding sites, computational prediction of protein binding sites in RNA is possible. We developed an information theoretic model that predicts protein binding sites in RNA with 60% accuracy. By using a conditional random field model, we identified the sequence and structural characteristics that are indicative of protein binding in RNA. We find that RNA structure is much more informative than RNA sequence in distinguishing protein binding from non-binding sites. Since experimentally determined structural information is not available for several RNAs, we developed a heuristic approach to identify a comprehensive set of base-paired regions in RNA from suboptimal structure predictions.
Development of tools to predict RNA-protein interaction partners is a future research direction that will allow computational construction of RNA-protein interaction networks for a biological process or system.

    Scoring functions in predicting protein structure and protein-protein interaction

    Structural bioinformatics is essential to the study of the mechanisms of molecular machinery in biological processes. It applies statistical and mathematical modeling to solve problems in protein folding, protein structure prediction and protein-protein interactions. Amongst the various issues in structural bioinformatics, the scoring function is a very important one because it is the core of many algorithms. In this thesis, scoring function optimization and weight training problems are investigated in three related works: (1) Quality Assessment of Protein Structure Model: Knowing the resolution and accuracy of a structure model is crucial for biologists to determine its usage. Various quality assessment scores are combined using linear, logistic and LOESS regressions to predict the quality of the structure model in terms of RMSD and correct/incorrect categories. Local quality of the structure, in terms of Cα distance, is also modeled using simple regression and hierarchical approaches. Finally, the developed regression equations are applied to assess the quality of structure models of the whole E. coli proteome. (2) Optimizing Scoring Function for Ranking Protein Docking Conformations: Numerous metrics that measure the goodness of a docking scoring function are used to optimize our scoring function, which is a linear combination of 9 energetic terms whose weights are optimized by logistic regression and a genetic algorithm. By cross comparison, different metrics are shown to have different generalization ability. The resulting scoring functions are then compared to ZRANK and ZDOCK on a benchmark data set and show substantial improvement. Finally, ensemble approaches are employed and improvement is observed on several metrics. (3) Threading without Optimizing Weighting Factor for Scoring Function: A simple gapless threading system with two energy terms is used to test several novel methods which do not require training weights on a training set.
The basic idea of these methods is to sample different values of the weight and select an optimal template structure for a target sequence by examining the characteristics of the distribution of scores computed by varying the weight. An artificial neural network model is also built to predict a target-specific weight based on the features of the protein sequence. Finally, it is shown that the novel approaches combined with the traditional methods can increase the predictive power of the scoring function.
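
One way to picture the training-free idea in part (3) is to sweep the weight over a grid and see which template wins most often. The sketch below assumes a two-term score `e1 + w * e2` where lower is better, and uses a simple majority vote over the sampled weights as the selection rule. Both the vote and the toy energies are assumptions; the abstract only says that the distribution of scores under varying weight is examined.

```python
from collections import Counter


def select_template(energies, weights):
    """Pick the template that is top-ranked for the most sampled weights.

    energies maps a template name to its two energy terms (e1, e2).
    The combined score is e1 + w * e2, and lower values are better.
    """
    wins = Counter()
    for w in weights:
        best = min(energies, key=lambda t: energies[t][0] + w * energies[t][1])
        wins[best] += 1
    chosen = wins.most_common(1)[0][0]
    return chosen, wins


# Toy energy terms for three hypothetical templates.
toy_energies = {
    "template_A": (-10.0, -1.0),
    "template_B": (-6.0, -4.0),
    "template_C": (-3.0, -2.0),
}
sampled_weights = [0.0, 0.5, 1.0, 1.5, 2.0]
chosen, wins = select_template(toy_energies, sampled_weights)
print(chosen, dict(wins))
```

Other summaries of the score distribution, such as how stable a template's rank is across weights, would fit the same loop; the vote is just the simplest one to show.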

    Design, evaluation, and application of PFP: An automated system for protein function prediction

    The last decade of biological research has seen a tremendous push towards production of high-volume data describing DNA and protein sequence, structure, expression, interaction, and localization. This glut of new data is the impetus for the development and emergence of a slew of computational tools that can interpret it to provide new functional characterization of proteins. We have developed PFP, an automated function prediction system which provides high probability annotations for a query sequence in each of the three branches of the Gene Ontology: biological process, cellular component, and molecular function. Rather than using precise pattern matching to identify functional motifs in the sequences and structures of these proteins, we designed PFP to increase the coverage of function annotation by lowering the resolution of predictions when detailed functional information is not predictable. To do this, we extend a traditional PSI-BLAST homology search by extracting and scoring annotations (GO terms) individually, including annotations from distantly related sequences, and applying a novel data mining tool, the Function Association Matrix, to score strongly associated pairs of annotations. The scoring scheme also provides GO term-based statistical significance scores and confidence scores empirically derived from an extensive benchmark evaluation of annotated proteins from fifteen organisms. We have shown this system to be effective in providing accurate predictions for both specific and broad functional terms. This is consistent with the performance of PFP as the best overall predictor in two independent international assessments: AFP-SIG ’05 and CASP7 function (FN), where it outperformed even consensus predictions made by the organizers. Additionally, we have extensively applied blind predictions to protein interaction networks and to clusters of contiguous genes in E. coli, S. cerevisiae, and P. falciparum (Malaria plasmodium).
Through this style of applications, PFP is able to provide significant annotation gain for previously uncharacterized groups of proteins. The automated PFP system is publicly available as a web server at http://dragon.bio.purdue.edu/pfp/
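
To make the idea of scoring GO terms individually across many homology hits more concrete, here is a simplified sketch. It assumes each hit contributes `-log10(E-value)` plus a constant offset to every term it carries, so that weak, distantly related hits still add a small amount. The function name, the offset value, and the toy hits are assumptions, and the Function Association Matrix step is left out entirely, so this is not the PFP scoring scheme itself.

```python
import math
from collections import defaultdict


def score_go_terms(hits, offset=2.0):
    """Sum a per-hit weight into every GO term carried by that hit.

    hits is a list of (e_value, [terms]) pairs with e_value > 0.
    The weight -log10(e_value) + offset is an illustrative choice.
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        weight = -math.log10(e_value) + offset
        for term in set(terms):
            scores[term] += weight
    return dict(scores)


# Toy hits with placeholder term labels (not real GO identifiers).
toy_hits = [
    (1e-30, ["GO:X", "GO:Y"]),  # strong hit
    (1e-5, ["GO:X"]),           # moderate hit
    (10.0, ["GO:Z"]),           # weak hit that still contributes a little
]
go_scores = score_go_terms(toy_hits)
for term, value in sorted(go_scores.items(), key=lambda kv: -kv[1]):
    print(term, round(value, 2))
```

Terms supported by several hits accumulate score, which is how consensus across the neighborhood of a query can outweigh a single strong match.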

    Machine Learning Approaches Towards Protein Structure and Function Prediction

    Proteins are drivers of almost all biological processes in the cell. The functions of a protein depend on its three-dimensional structure, and elucidating the structure and function of proteins is key to understanding how a biological system operates. In this research, we developed computational methods using machine learning techniques to predict the structure and function of proteins. Protein 3D structure prediction has advanced significantly in recent years, largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). The performance of these models depends on the number of protein sequences similar to the query protein; in some cases similar sequences are few, but dissimilar sequences with local similarities are more numerous and can be helpful. We have developed a novel deep learning-based approach, AttentiveDist, which further improves over the previous state of the art. We added an attention mechanism in which dissimilar sequences are also used (increasing the number of sequences) and the model itself determines which information from such sequences it should attend to. We showed that the improvement in distance predictions was successfully transferred to achieve better protein tertiary structure modeling. We also show that structure prediction from a predicted distance map can be further enhanced by using predicted inter-residue sidechain center distances and main-chain hydrogen bonds. Protein function prediction is another avenue we explored, where we want to predict the function that a protein will perform. The crux of the approach is to predict the function of a protein based on the function of similar sequences. Here, we developed a method where we use dissimilar sequences to extract additional information and improve performance over the previous approaches.
We used phylogenetic analysis to determine if a dissimilar sequence can be close to the query sequence and thus can provide functional information. Our method was ranked highly in the worldwide protein function prediction competition CAFA3 (2016-2019). Further, we expanded the method with a neural network to predict protein toxicity that can be used as a safety check for human-designed protein sequences.

    Molecular Dynamics in Protein Structure Quality Assessment and Refinement

    Proteins are the active biomolecules of the cell. They perform metabolic action, give the cell structure, protect the cell from antigens, give the cell motility, and much more. The function of proteins is intrinsically linked to their structures, so it is necessary to characterize the structure of a protein to fully understand its function and operation. In this research, the application of computational methods, primarily molecular dynamics, towards protein structure determination, refinement, and quality assessment was studied. I applied molecular dynamics techniques to four major projects: the determination of relative error of atomic models deposited with electron microscopy maps in the EMDB, solving and refining atomic structure models for the PhageG major capsid proteins, the elucidation of the structure of the protein USP7 and the binding pose of a candidate therapeutic drug, and the determination of relative stability of candidate protein folds to distinguish near-native models from non-native ones. Each year an increasing number of protein structures have been solved using electron microscopy (EM). The influx of solved structures has proven to be a boon to the community, but it is necessary to note that the quality of EM maps varies substantially. To understand to what extent atomic structure models generated from EM matched their respective maps, two computational structure refinement methods were used to examine how much structures could be refined. The deviation from the starting structure by refinement, as well as the disagreement between refined models produced by the two computational methods, scaled inversely with both the global and local map resolutions. The results suggested that the observed discrepancy between the deposited maps and refined models is due to the lack of resolvable structural data present in EM maps at low to moderate resolutions, and therefore these annotations must be used with caution in further applications.
I also successfully implemented molecular dynamics as a method for protein structure quality assessment. Proteins tend towards shapes which minimize their energy. Experimentally, the stability of a protein can be measured through several techniques; one such technique is the controlled application of tension to proteins in an atomic force microscopy (AFM) framework. This kind of tension-based approach is of interest as it probes the force required to unfold individual domains of a protein rather than a bulk characteristic like melting point or activity. It has been shown that key features observed in an AFM experiment can be well reproduced with molecular dynamics simulation, which has been applied to characterize the mechanisms of unfolding of proteins as well as ligand-protein interactions. Steered molecular dynamics (SMD) was applied to pull and unfold proteins and determine the force required to unfold them. The relative force required to unfold different models with the same sequence was used to estimate relative model accuracy. This follows from the hypothesis that the structural stability of a given model’s conformation would positively correlate with its accuracy, i.e. how close that model is to its native fold. It was found that near-native models could be successfully selected by comparing the forces required to unfold models, indicating that high unfolding forces indeed indicated high model stability, which in turn correlated with model accuracy.
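
The model selection step described above, comparing the force needed to unfold competing models of the same sequence, reduces to a ranking once each pulling simulation has produced a force trace. The sketch below assumes the peak force of each trace is the summary statistic and uses made-up traces in arbitrary units; the abstract does not say which statistic of the SMD trajectories was used.

```python
def rank_by_peak_force(force_traces):
    """Rank models by the maximum force in their pulling trace, highest first.

    force_traces maps a model name to a list of force values sampled
    along one steered-MD pulling run. Using the peak as the summary
    is an assumption made for this illustration.
    """
    peaks = {model: max(trace) for model, trace in force_traces.items()}
    ranking = sorted(peaks, key=peaks.get, reverse=True)
    return ranking, peaks


# Synthetic force traces (arbitrary units) for three hypothetical models.
toy_traces = {
    "model_1": [50.0, 220.0, 410.0, 180.0],
    "model_2": [40.0, 150.0, 260.0, 120.0],
    "model_3": [60.0, 300.0, 520.0, 210.0],
}
ranking, peaks = rank_by_peak_force(toy_traces)
print(ranking)
```

Under the stated hypothesis, the model at the top of this ranking would be the candidate closest to the native fold.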

    Computational models of mutations for predicting and classifying protein-protein interaction sites

    Protein-protein interaction residues are largely responsible for mediating many critical functions in the cell, such as inhibitory effects through enzyme-inhibitor interaction, initiating immune response by an antibody-antigen interaction, and regulation of cell-signaling proteins. Currently, various methods are available for predicting protein-protein interaction sites; these methods allow a residue-level understanding of the protein-binding phenomena presented by the global construction of protein-protein interaction networks. In this thesis, protein-protein interaction sites are predicted using phylogenetic substitution models of amino acid mutations at protein interfaces: 1) Predicting Protein-Protein Interaction Sites using Phylogenetic Substitution Models: Protein-protein interactions are critical for maintaining many different biological functions in the cell. In particular, these processes involve functionally important amino acid residues that are traditionally accepted as conserved in sequence throughout evolutionary time. However, protein-protein interaction sites exhibit higher sequence variation than other functional regions, such as those that correspond to catalytic sites and ligand-binding sites. Consequently, the semi-conservation of protein-protein interaction sites poses significant challenges for current protein-protein interface prediction methods. To approach this problem, we developed a phylogenetic framework to capture the mutational behavior of essential protein-protein binding residues. Through the comprehensive analysis of functionally diverse protein families, we discover key amino acid substitution patterns that are characteristic of protein-protein interfaces. We demonstrate that the contrast between interface and non-interface substitution models shows mutational biases imposed on protein-protein binding residues.
Based on this analysis, we have developed a novel method, BindML, which utilizes these evolutionary models to predict protein-protein binding sites on protein structures even without knowledge of their interacting partners. When assessed on a large benchmark of protein complexes, our method performs better than alternative methods for protein binding interface prediction. The conceptual novelty of this method is that it detects semi-conserved mutations rather than conventional conservation in protein family sequences, and thus aims to open a new direction in protein sequence analysis. 2) Prediction and Classification of Permanent and Transient Protein-Protein Interfaces: Proteins interact with each other in different ways for specific functional consequences. Our current research direction involves the development of a new method to classify mutation patterns of protein-protein interaction sites into permanent and transient types. The permanent type of interaction requires tight binding between proteins to assemble strong complexes. For example, enzyme-inhibitor, antigen-antibody, and large homo-oligomeric enzyme structures are all composed of proteins that are required to be permanently bound in order to correctly carry out their functions. In contrast, transient type protein-protein interactions can readily dissociate after binding. Examples of transient interactions include proteins involved in signaling pathways, in which binding of transient proteins (such as protein kinases and G-proteins) induces conformational changes that allow protein function (and hence pathways) to switch on and off, allowing strict and precise control of cellular activity.
Although many studies have already explored the differences between these two types of interactions at the level of the protein structure, in this study we develop amino acid substitution models to differentiate between permanent and transient type interfaces primarily using sequence information. We built highly discriminative substitution models that can be used to classify protein interface predictions into permanent and transient interaction types. A detailed understanding of the mutational constraint differences between permanent and transient protein complexes should help elucidate critical amino acid substitution preferences that are useful for annotating protein binding interface predictions of structures and sequences of unknown function. 3) 3D-SURFER Software for high-throughput protein surface comparison and analysis: A web-based tool, 3D-SURFER, has been developed to facilitate high-throughput comparison and characterization of proteins based on their surface shape. As each protein is effectively represented by a vector of 3D Zernike descriptors, comparison times for a query protein against the entire PDB take, on average, only a couple of seconds. The web interface has been designed to be as interactive as possible, with displays showing animated protein rotations, CATH codes and structural alignments using the CE program. In addition, geometrically interesting local features of the protein surface, such as pockets that often correspond to ligand binding sites, as well as protrusions and flat regions, can also be identified and visualized.
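
Because each protein surface is reduced to a fixed-length descriptor vector, a database search becomes a nearest-neighbor lookup over vectors, which is why it can be fast. The sketch below assumes Euclidean distance as the dissimilarity measure and uses short made-up vectors and names; real 3D Zernike descriptors are longer, and the distance actually used by 3D-SURFER is not stated in the abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    if len(u) != len(v):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, database, k=2):
    """Return the names of the k database entries closest to the query."""
    ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
    return ranked[:k]


# Toy descriptor vectors for three hypothetical database entries.
toy_database = {
    "protein_1": [1.0, 0.0, 0.0],
    "protein_2": [0.9, 0.1, 0.0],
    "protein_3": [0.0, 1.0, 1.0],
}
toy_query = [1.0, 0.0, 0.1]
closest = nearest(toy_query, toy_database, k=2)
print(closest)
```

A linear scan like this is already cheap for short vectors; an index structure would only matter for much larger collections.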