The estimation of relative site variability among aligned homologous protein sequences
Motivation: Maximum likelihood-based methods to estimate site-by-site substitution rate variability in aligned homologous protein sequences rely on the formulation of a phylogenetic tree and generally assume that the patterns of relative variability follow a pre-determined distribution. We present a phylogenetic tree-independent method to estimate the relative variability of individual sites within large datasets of homologous protein sequences. It is based upon two simple assumptions: first, that substitutions observed between two closely related sequences are likely, in general, to occur at the most variable sites; second, that non-conservative amino acid substitutions tend to occur at more variable sites. Our methodology makes no assumptions regarding the underlying pattern of relative variability between sites.
Results: We have compared, using data simulated under a non-gamma distributed model, the performance of this approach to that of a maximum likelihood method that assumes gamma distributed rates. At low mean rates of evolution our method inferred site-by-site relative substitution rates more accurately than the maximum likelihood approach in the absence of prior assumptions about the relationships between sequences. Our method does not directly account for the effects of mutational saturation. However, we have incorporated an 'ad-hoc' modification that allows the accurate estimation of relative site variability in fast evolving and saturated datasets.
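The first assumption above (substitutions between closely related sequences fall mostly at the most variable sites) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of uncorrected pairwise distance to decide which pairs count as "closely related", and the `max_distance` threshold are all assumptions made for illustration, and the weighting of non-conservative substitutions is left out.

```python
from itertools import combinations


def relative_site_variability(alignment, max_distance=0.25):
    """Score each alignment column by how often it differs between close pairs.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    max_distance: hypothetical cutoff on the fraction of differing columns
    below which two sequences are treated as closely related.
    Returns per-column scores that sum to 1 (or all zeros if nothing is counted).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    counts = [0] * length
    for a, b in combinations(alignment, 2):
        # Only columns where neither sequence has a gap are compared.
        compared = [i for i in range(length) if a[i] != "-" and b[i] != "-"]
        if not compared:
            continue
        diffs = [i for i in compared if a[i] != b[i]]
        # Substitutions contribute only when the pair is closely related.
        if len(diffs) / len(compared) <= max_distance:
            for i in diffs:
                counts[i] += 1

    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]
```

Because divergent pairs are simply ignored, a sketch like this needs no tree and no assumed rate distribution, but it also does nothing about saturation, which the abstract says requires a separate modification.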
Accurate detection of genomic structural variations using high throughput resequencing data
Motivation: Insertions and deletions contribute significantly to genomic diversity at both intra- and inter-species levels. The recent advent of NGS methods has opened many opportunities for structural variant discovery, but has also required the development of new computational methods. Several bioinformatics tools have been developed for the detection of indels using paired-end (PE) NGS read data.
Methods: Existing methods can broadly be grouped into two categories: those that identify genomic clusters of read pairs showing atypical insert sizes to identify insertions and deletions with respect to a reference genome, and those that consider the distribution of insert sizes for all read pairs covering a given genomic position. We present a variation on the latter approach which also includes information from reads where one member of the pair does not map to the reference genome (broken pairs) and uses machine learning approaches to differentiate between real indels and possible false positive predictions.
Results: We demonstrate that our approach significantly outperforms other available methods in terms of sensitivity, specificity and computational time/power requirements both in simulations and using publicly available human genome resequencing data. Our analyses demonstrate that use of data from "broken pairs" and careful integration of different statistics from mapping patterns can significantly improve the quality of indel predictions.
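As a rough illustration of the second category of methods, the sketch below computes per-position statistics of the kind that could be passed to a classifier: the deviation of the mean insert size of spanning pairs from the library expectation, and the number of nearby broken pairs. The function name, the input layout, the z-score formula and the `window` parameter are assumptions for illustration only; the abstract does not specify the features or the learning method actually used.

```python
from math import sqrt


def position_features(pairs, broken_reads, position, lib_mean, lib_sd, window=100):
    """Summarise paired-end mapping signals at one reference position.

    pairs: list of (start, end) reference coordinates of properly mapped pairs.
    broken_reads: mapped positions of reads whose mate did not map.
    lib_mean, lib_sd: expected insert size mean and standard deviation.
    window: hypothetical distance within which broken pairs are counted.
    """
    # Insert sizes of all pairs whose span covers the position.
    spanning = [end - start for start, end in pairs if start <= position < end]
    n = len(spanning)
    if n:
        mean_insert = sum(spanning) / n
        # z-score of the observed mean against the library distribution.
        z_score = (mean_insert - lib_mean) / (lib_sd / sqrt(n))
    else:
        z_score = 0.0
    broken = sum(1 for p in broken_reads if abs(p - position) <= window)
    return {"coverage": n, "z_score": z_score, "broken_pairs": broken}
```

A strongly positive z-score means pairs map further apart on the reference than expected, which is consistent with a deletion in the sequenced genome, while a strongly negative one is consistent with an insertion.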
Exalign: a new method for comparative analysis of exon–intron gene structures
The evolution of genes is usually studied and reconstructed at the sequence level, that is, by comparing and aligning their genomic, transcript or protein sequences. However, including the exon–intron structure of genes in the analysis can provide further useful information, for example to resolve phylogenetic relationships left unsolved by traditional sequence-based evolutionary studies, or to shed further light on patterns of intron gain and loss. Here we present Exalign, an algorithm designed to retrieve, compare and search for the exon–intron structure of existing gene annotations, which has been implemented in a software tool freely accessible through a web interface as well as available for download. We present different applications of our method, from the reconstruction of the evolutionary history of homologous gene families to the detection of previously unknown cases of intron loss in human and rodents, and, remarkably, two previously unreported intron gain events in human and mouse. The web interface for accessing Exalign is available at http://www.pesolelab.it/exalign/ or http://www.beacon.unimi.it/exalign
Motif discovery and transcription factor binding sites before and after the next-generation sequencing era
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences have been available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein–DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq) has opened new avenues in research, as well as posed new challenges to bioinformaticians developing algorithms and methods for motif discovery
A Support Vector Machine for the Discrimination of MicroRNA Precursors from Other Genomic Hairpin Structures
Motivation: MicroRNAs (miRNAs) are endogenous, small (~ 20 nt), single-stranded, non-coding RNAs (ncRNAs) that result from the nuclear and cytoplasmic processing of transcribed precursor hairpin structures. They are increasingly recognized as playing crucial roles as post-transcriptional antisense regulators of gene expression through regulation of mRNA stability or translational efficiency. miRNAs, first reported in Caenorhabditis elegans, have been identified in the genomes of most higher organisms, including worms, flies, plants, mammals and recently in viruses.
Functional studies have shown that miRNAs play important roles in processes such as cell proliferation, fat metabolism, apoptosis, neuronal cell fate, insulin secretion, haematopoietic differentiation and developmental regulation.
The detection of homologs of known miRNAs through comparative genomic approaches has proved relatively tractable. However, the ab-initio prediction of miRNA precursors through computational methods poses several additional difficulties, not least the fact that not all thermodynamically plausible transcribed hairpins are processed to yield mature miRNAs. It has not until now been possible to identify conserved sequence or structural elements that define consensus recognition elements for the enzymes that process miRNA precursors.
In the light of these observations we wished to develop and improve methods for the discrimination of true miRNA precursor hairpins from spurious hairpins.
Methods: We have developed an SVM (Support Vector Machine) that considers up to 74 features associated with the primary and secondary structures and thermodynamic characteristics of candidate hairpin structures. We use a standard heuristic approach to optimize combinations of features used and train the SVM with sets of characterized hairpin miRNA precursors and known non-miRNA hairpins.
Results: Our SVM shows highly promising results in the discrimination of true miRNA precursors from “spurious” hairpins (typically around 95% sensitivity) in various species. In particular, our levels of false positive predictions appear to be low relative to comparable methods
WebVar: a resource for the rapid estimation of relative site variability from multiple sequence alignments
WebVar is an online resource that provides estimates of relative site variability from multiple alignments of homologous protein or nucleic acid sequences. WebVar provides a variety of graphic and textual representations of estimates, designed to assist in phylogenetic analysis
uAUG and uORFs in human and rodent 5'untranslated mRNAs
The control of translation is a fundamental mechanism in the regulation of gene expression. Among the cis-acting elements that play a role in translation regulation are upstream open reading frames (uORFs) and upstream AUGs (uAUGs) located in the 5'UTR of mRNAs. We present here a genome-wide analysis of uAUGs and uORFs in a curated set of human and rodent mRNAs. Our study shows that the occurrence of uAUGs is suppressed more strongly than that of uORFs and that in-frame uAUGs are more strongly suppressed than out-of-frame uAUGs. A very similar pattern of uAUG/uORF frequency was also observed in mouse mRNAs. The analysis of orthologous 5'UTR sequences revealed a remarkable degree of evolutionary conservation only of those uORFs which acquired some functional activity. Our data suggest that besides leaky scanning and reinitiation, which likely occur with variable and gene-specific efficiency, the ribosome-shunt mechanism, possibly coupled to reinitiation after uORF translation, may be a widespread mode of translation regulation in eukaryotes.
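The distinctions drawn above (uAUG versus uORF, in-frame versus out-of-frame) can be made concrete with a short sketch. The definitions used here are assumptions, since the abstract does not spell them out: a start codon is called a uORF if an in-frame stop codon follows it within the 5'UTR and a uAUG otherwise, and "in frame" means in frame with the main coding sequence, whose start codon is taken to begin immediately after the supplied UTR. The function name and return layout are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_five_prime_utr(utr):
    """List upstream start codons in a 5'UTR (RNA or DNA alphabet).

    utr: sequence upstream of the main start codon, which is assumed to
    begin at index len(utr).
    """
    utr = utr.upper().replace("U", "T")
    results = []
    for i in range(len(utr) - 2):
        if utr[i:i + 3] != "ATG":
            continue
        # Walk codon by codon looking for a stop inside the UTR.
        stop = None
        for j in range(i + 3, len(utr) - 2, 3):
            if utr[j:j + 3] in STOP_CODONS:
                stop = j
                break
        results.append({
            "start": i,
            "kind": "uORF" if stop is not None else "uAUG",
            # In frame when the distance to the main start is a multiple of 3.
            "in_frame_with_cds": (len(utr) - i) % 3 == 0,
            "stop": stop,
        })
    return results
```

Counting the records by `kind` and `in_frame_with_cds` over a set of UTRs gives the kind of frequency table against which suppression could be assessed, given a suitable expectation for random sequence.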
SVM2 : an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data
Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra high-throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved through the combination of methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). The improvement of individual predictors is thus an important objective. In this study, we propose a new method that combines deviations from expected library insert sizes and additional information from local patterns of read mapping and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired end read mapping at no cost in specificity, and it makes reliable predictions of very short insertions and deletions in repetitive and low-complexity genomic contexts that can confound tools based on split mapping of reads
Improved detection of intra-specific genomic structural variation using paired end high throughput resequencing data and Support Vector Machine
Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra-high throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved through the combination of methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). The improvement of individual predictors is thus an important objective. Here we propose a new method that combines deviations from expected library insert sizes and additional information from local patterns of read mapping and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired end read mapping at no cost in specificity, and it makes reliable predictions of very short insertions and deletions in repetitive and low complexity genomic contexts that can confound tools based on split-mapping of reads.
PscanChIP : finding over-represented transcription factor-binding site motifs and their correlations in sequences from ChIP-Seq experiments
Chromatin immunoprecipitation followed by sequencing with next-generation technologies (ChIP-Seq) has become the de facto standard for building genome-wide maps of regions bound by a given transcription factor (TF). The regions identified, however, have to be further analyzed to determine the actual DNA-binding sites for the TF, as well as sites for other TFs belonging to the same TF complex or in general co-operating or interacting with it in transcription regulation. PscanChIP is a web server that, starting from a collection of genomic regions derived from a ChIP-Seq experiment, scans them using motif descriptors like JASPAR or TRANSFAC position-specific frequency matrices, or descriptors uploaded by users, and it evaluates both motif enrichment and positional bias within the regions according to different measures and criteria. PscanChIP can successfully identify not only the actual binding sites for the TF investigated by a ChIP-Seq experiment but also secondary motifs corresponding to other TFs that tend to bind the same regions, and, if present, precise positional correlations among their respective sites. The web interface is free for use, and there is no login requirement. It is available at http://www.beaconlab.it/pscan_chip_dev
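Scanning regions with a frequency matrix and recording where the best site falls can be sketched as follows. This is a generic log-odds scan, not PscanChIP's scoring scheme or its enrichment statistics; the function names, the pseudocount, the uniform background and the single-strand search are all simplifying assumptions.

```python
from math import log2


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log-odds scores.

    counts: one dict per motif position, e.g. {"A": 8, "C": 0, "G": 1, "T": 1}.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        matrix.append({
            base: log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def best_match(region, matrix):
    """Return (score, start, offset_from_centre) of the best site, or None.

    Only the given strand is scanned; windows containing non-ACGT
    characters are skipped.
    """
    region = region.upper()
    width = len(matrix)
    best = None
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(matrix[k][base] for k, base in enumerate(window))
        if best is None or score > best[0]:
            # Distance of the site midpoint from the region midpoint.
            offset = start + width / 2 - len(region) / 2
            best = (score, start, offset)
    return best
```

Collecting the best score and offset for every region gives the raw material for asking whether a motif scores higher than expected and whether its sites cluster at a preferred distance from the region centre.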
