Remote homology detection based on oligomer distances
Motivation: Remote homology detection is among the most intensively researched problems in bioinformatics. Currently discriminative approaches, especially kernel-based methods, provide the most accurate results. However, kernel methods also show several drawbacks: in many cases prediction of new sequences is computationally expensive, often kernels lack an interpretable model for analysis of characteristic sequence features, and finally most approaches make use of so-called hyperparameters which complicate the application of methods across different datasets. Results: We introduce a feature vector representation for protein sequences based on distances between short oligomers. The corresponding feature space arises from distance histograms for any possible pair of K-mers. Our distance-based approach shows important advantages in terms of computational speed while on common test data the prediction performance is highly competitive with state-of-the-art methods for protein remote homology detection. Furthermore the learnt model can easily be analyzed in terms of discriminative features and in contrast to other methods our representation does not require any tuning of kernel hyperparameters
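The distance-based representation described in this abstract can be illustrated with a minimal sketch. It assumes ordered pairs of overlapping K-mers, distances counted from 1 up to a cutoff, and raw counts as histogram entries. The function name `oligomer_distance_features` and the parameters `k` and `max_dist` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def oligomer_distance_features(seq, k=1, max_dist=3):
    """Count how often K-mer a is followed by K-mer b at distance d.

    Assumption: distances run from 1 to max_dist and K-mers may overlap.
    Returns a dict mapping (a, b, d) to the number of occurrences.
    """
    counts = defaultdict(int)
    n = len(seq) - k + 1  # number of K-mer start positions
    for i in range(n):
        a = seq[i:i + k]
        for d in range(1, max_dist + 1):
            j = i + d
            if j >= n:  # second K-mer would run past the sequence end
                break
            counts[(a, seq[j:j + k], d)] += 1
    return dict(counts)


# Toy sequence, chosen only for illustration.
features = oligomer_distance_features("ACAC", k=1, max_dist=2)
```

Each key `(a, b, d)` corresponds to one bin of the distance histogram for the K-mer pair `(a, b)`. A linear classifier trained on such counts would assign one weight per bin, which is one way the learnt model could be inspected for discriminative features.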
Word correlation matrices for protein sequence analysis and remote homology detection
Background: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences usually are computationally expensive. Results: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely-used benchmark setup for protein remote homology detection. Conclusion: Our word correlation approach provides highly competitive performance as compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.
Orphelia: predicting genes in metagenomic sequencing reads
Metagenomic sequencing projects yield numerous sequencing reads of a diverse range of uncultivated and mostly yet unknown microorganisms. In many cases, these sequencing reads cannot be assembled into longer contigs. Thus, gene prediction tools that were originally developed for whole-genome analysis are not suitable for processing metagenomes. Orphelia is a program for predicting genes in short DNA sequences that is available through a web server application (http://orphelia.gobics.de). Orphelia utilizes prediction models that were created with machine learning techniques on the basis of a wide range of annotated genomes. In contrast to other methods for metagenomic gene prediction, Orphelia has fragment length-specific prediction models for the two most popular sequencing techniques in metagenomics, chain termination sequencing and pyrosequencing. These models ensure highly specific gene predictions
Identification of New Fungal Peroxisomal Matrix Proteins and Revision of the PTS1 Consensus
The peroxisomal targeting signal type 1 (PTS1) is a seemingly simple peptide sequence at the C-terminal end of most peroxisomal matrix proteins. PTS1 can be described as a tripeptide with the consensus motif [S/A/C] [K/R/H] L. However, this description is neither necessary nor sufficient. It does not cover all cases of PTS1 proteins, and some proteins in accordance with this consensus do not target to the peroxisome. In order to find new PTS proteins in yeast and to arrive at a more complete description of the PTS1 consensus motif, we developed a machine learning approach that involves orthologue expansion of the set of known peroxisomal proteins. We performed a genome-wide in silico screen, characterised several PTS1-containing peptides in vivo and identified two new peroxisomal matrix proteins, which we named Pxp1 (Yel020c) and Pxp2 (Yjr111c). Based on these in silico and in vivo analyses, we revised the yeast PTS1 consensus which now includes all known PTS1 proteins
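As a small illustration of the classic consensus motif [S/A/C] [K/R/H] L quoted above, the following sketch checks whether a protein sequence ends in a matching tripeptide. The function name and the example sequences are hypothetical. As the abstract stresses, a match is neither necessary nor sufficient for peroxisomal targeting, so this is only a first-pass filter and not the machine learning approach the authors describe.

```python
import re

# Classic PTS1 consensus tripeptide: [S/A/C] [K/R/H] L
PTS1_CONSENSUS = re.compile(r"[SAC][KRH]L")


def matches_classic_pts1(protein_seq):
    """Return True if the last three residues fit the classic consensus."""
    tail = protein_seq.strip().upper()[-3:]
    return PTS1_CONSENSUS.fullmatch(tail) is not None


# Made-up sequences, used only to exercise the check.
ends_with_skl = matches_classic_pts1("MKTAYSKL")    # C-terminal SKL
ends_with_skv = matches_classic_pts1("MKTAYSKV")    # last residue is not L
internal_only = matches_classic_pts1("SKLMMTAYGG")  # SKL not at the C-terminus
```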
Experimental and statistical post-validation of positive example EST sequences carrying peroxisome targeting signals type 1 (PTS1)
Augustus at medigrid: Adaption of a bioinformatics application to grid computing for efficient genome analysis
In past years, researchers from many domains have discovered Grid technology which opens up new possibilities in solving problems that are difficult to handle with traditional cluster computing. With the rapidly increasing number of partially or completely sequenced genomes, computational genome annotation is a particularly challenging task in computational biology. In this paper, we describe how we adapted the gene-finding tool AUGUSTUS to Grid computing in the context of the German MediGRID project. The gridification process starts with providing security requirements and running the application manually using Grid middleware. Afterwards, the application is described as a workflow of successive program executions, which are automatically distributed to appropriate Grid resources by a workflow engine. Finally, we show how a convenient graphical user interface for end users is created by means of a portal framework. (C) 2008 Elsevier B.V. All rights reserved. German Federal Ministry of Education and Research (BMBF) [01AK803A-H
Protein signature-based estimation of metagenomic abundances including all domains of life and viruses
Motivation: Metagenome analysis requires tools that can estimate the taxonomic abundances in anonymous sequence data over the whole range of biological entities. Because there is usually no prior knowledge about the data composition, not only all domains of life but also viruses have to be included in taxonomic profiling. Such a full-range approach, however, is difficult to realize owing to the limited coverage of available reference data. In particular, archaea and viruses are generally not well represented by current genome databases. Results: We introduce a novel approach to taxonomic profiling of metagenomes that is based on mixture model analysis of protein signatures. Our results on simulated and real data reveal the difficulties of the existing methods when measuring archaeal or viral abundances and show the overall good profiling performance of the protein-based mixture model. As an application example, we provide a large-scale analysis of data from the Human Microbiome Project. This demonstrates the utility of our method as a first instance profiling tool for a fast estimate of the community structure. Deutsche Forschungsgemeinschaft [ME 3138, LI 2050
BCI competition 2003 - Data set IIb: Support vector machines for the P300 speller paradigm
We propose an approach to analyze data from the P300 speller paradigm using the machine-learning technique support vector machines. In a conservative classification scheme, we found the correct solution after five repetitions. While the classification within the competition is designed for offline analysis, our approach is also well-suited for a real-world online solution: It is fast, requires only 10 electrode positions and demands only a small amount of preprocessing
Mixture models for analysis of the taxonomic composition of metagenomes
Abstract
Motivation: Inferring the taxonomic profile of a microbial community from a large collection of anonymous DNA sequencing reads is a challenging task in metagenomics. Because existing methods for taxonomic profiling of metagenomes are all based on the assignment of fragmentary sequences to phylogenetic categories, the accuracy of results largely depends on fragment length. This dependence complicates comparative analysis of data originating from different sequencing platforms or resulting from different preprocessing pipelines.
Results: We here introduce a new method for taxonomic profiling based on mixture modeling of the overall oligonucleotide distribution of a sample. Our results indicate that the mixture-based profiles compare well with taxonomic profiles obtained with other methods. However, in contrast to the existing methods, our approach shows a nearly constant profiling accuracy across all kinds of read lengths and it operates at an unrivaled speed.
Availability: A platform-independent implementation of the mixture modeling approach is available in terms of a MATLAB/Octave toolbox at http://gobics.de/peter/taxy. In addition, a prototypical implementation within an easy-to-use interactive tool for Windows can be downloaded.
Supplementary Information: Supplementary data are available at Bioinformatics online.
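The idea of modeling a sample's overall oligonucleotide distribution as a mixture of reference distributions can be sketched as follows. The sketch assumes that each reference organism is given as a known oligonucleotide frequency profile and estimates the mixture weights with a generic expectation-maximization update. This is a swapped-in textbook technique, not necessarily the estimation procedure used in the toolbox mentioned above, and all names and numbers are hypothetical.

```python
def estimate_mixture_weights(sample_counts, reference_profiles, iterations=200):
    """Estimate mixture weights of known reference profiles by EM.

    sample_counts: dict mapping oligonucleotide -> observed count in the sample
    reference_profiles: dict mapping reference name -> dict of
        oligonucleotide -> relative frequency (assumed known and fixed)
    """
    names = list(reference_profiles)
    weights = {name: 1.0 / len(names) for name in names}  # uniform start
    for _ in range(iterations):
        expected = {name: 0.0 for name in names}
        for oligo, count in sample_counts.items():
            # Probability of this oligonucleotide under the current mixture.
            mix = sum(weights[m] * reference_profiles[m].get(oligo, 0.0)
                      for m in names)
            if mix == 0.0:
                continue  # no reference explains this oligonucleotide
            for name in names:
                share = weights[name] * reference_profiles[name].get(oligo, 0.0)
                expected[name] += count * share / mix
        norm = sum(expected.values())
        if norm == 0.0:
            break  # nothing in the sample is covered by the references
        weights = {name: expected[name] / norm for name in names}
    return weights


# Toy example with two references that share no oligonucleotides.
profiles = {"ref_a": {"AA": 1.0}, "ref_b": {"CC": 1.0}}
weights = estimate_mixture_weights({"AA": 30, "CC": 10}, profiles)
```

Because the estimate depends only on the pooled oligonucleotide counts of the sample and not on assigning individual reads, a formulation of this kind is insensitive to how the sample is split into fragments.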
PredPlantPTS1: a web server for the prediction of plant peroxisomal proteins
Prediction of subcellular protein localization is essential to correctly assign unknown proteins to cell organelle-specific protein networks and to ultimately determine protein function. For metazoa, several computational approaches have been developed in the past decade to predict peroxisomal proteins carrying the peroxisome targeting signal type 1 (PTS1). However, plant-specific PTS1 protein prediction methods have been lacking up to now, and pre-existing methods generally were incapable of correctly predicting low-abundance plant proteins possessing non-canonical PTS1 patterns. Recently, we presented a machine learning approach that is able to predict PTS1 proteins for higher plants (spermatophytes) with high accuracy and which can correctly identify unknown targeting patterns, i.e. novel PTS1 tripeptides and tripeptide residues. Here we describe the first plant-specific web server PredPlantPTS1 for the prediction of plant PTS1 proteins using the above-mentioned underlying models. The server allows the submission of protein sequences from diverse spermatophytes and also performs well for mosses and algae. The easy-to-use web interface provides detailed output in terms of (i) the peroxisomal targeting probability of the given sequence, (ii) information whether a particular non-canonical PTS1 tripeptide has already been experimentally verified, and (iii) the prediction scores for the single C-terminal 14 amino acid residues. The latter allows identification of predicted residues that inhibit peroxisome targeting and which can be optimized using site-directed mutagenesis to raise the peroxisome targeting efficiency. The prediction server will be instrumental in identifying low-abundance and stress-inducible peroxisomal proteins and defining the entire peroxisomal proteome of Arabidopsis and agronomically important crop plants. PredPlantPTS1 is freely accessible at ppp.gobics.de
