An integrative approach for fine-mapping chromatin interactions
MOTIVATION: Chromatin interactions play an important role in genome architecture and gene regulation. The Hi-C assay generates such interaction maps genome-wide, but at relatively low resolutions (e.g. 5-25 kb), which is substantially coarser than the resolution of transcription factor binding sites or open chromatin sites that are potential sources of such interactions.
RESULTS: To predict the sources of Hi-C-identified interactions at a high resolution (e.g. 100 bp), we developed a computational method that integrates data from DNase-seq and ChIP-seq of TFs and histone marks. Our method, χ-CNN, uses this data to first train a convolutional neural network (CNN) to discriminate between called Hi-C interactions and non-interactions. χ-CNN then predicts the high-resolution source of each Hi-C interaction using a feature attribution method. We show these predictions recover original Hi-C peaks after extending them to be coarser. We also show χ-CNN predictions enrich for evolutionarily conserved bases, eQTLs and CTCF motifs, supporting their biological significance. χ-CNN provides an approach for analyzing important aspects of genome architecture and gene regulation at a higher resolution than previously possible.
AVAILABILITY AND IMPLEMENTATION: χ-CNN software is available on GitHub (https://github.com/ernstlab/X-CNN).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Machine learning for antimicrobial peptide discovery and design
The growing global health concern of antibiotic resistance is prompting researchers to seek substitutes for conventional antibiotics. Antimicrobial peptides (AMPs), a diverse class of short and often cationic biological molecules, are gaining attention as promising candidates. While direct, large-scale wet lab screening is time-consuming and costly, using high-throughput bioinformatics tools to discover and design novel AMPs is an attractive approach. In this thesis, I introduce in silico tools for AMP discovery and design with machine learning models, and present the novel AMPs revealed by those methods.
The in silico discovery of AMPs typically involves the investigation of huge genomics, transcriptomics, or protein datasets, and accurate methods to sift through such large volumes of candidate sequences are required. In this thesis, I introduce AMPlify, a deep learning-based tool for AMP prediction, improving upon the state-of-the-art methods by incorporating attention mechanisms. By integrating AMPlify into bioinformatics pipelines or workflows, four novel AMPs with proven antimicrobial activity have been identified from the Rana [Lithobates] catesbeiana (bullfrog) genome, as well as 13 other novel AMPs mined from the UniProtKB/Swiss-Prot database.
On the other hand, the potential sequence space of amino acids is combinatorially vast, allowing for the exploration of more AMPs that may not exist in nature to further expand the current arsenal of peptide-based therapeutics. However, manual design of novel synthetic AMPs requires prior field knowledge, restricting its throughput. In silico sequence generation methods for de novo AMP design stand out as a high-throughput way to unearth novel synthetic AMPs. In this thesis, I introduce a recurrent neural network-based tool, named AMPd-Up, for AMP sequence generation, and demonstrate its performance against existing methods. With AMPd-Up, 40 novel synthetic AMPs have been designed with proven antimicrobial activity against the bacterial strains tested in vitro.
I demonstrate the utility of AMPlify and AMPd-Up in the discovery and design of novel AMPs, and I expect these tools to play an important role in our fight against antibiotic resistance.
Science, Faculty ofGraduat
SpaRC: scalable sequence clustering using Apache Spark
MOTIVATION: Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.
RESULTS: Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.
AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/berkeleylab/jgi-sparc
Poisson hurdle model-based method for clustering microbiome features
MOTIVATION: High-throughput sequencing technologies have greatly facilitated microbiome research and have generated a large volume of microbiome data with the potential to answer key questions regarding microbiome assembly, structure and function. Cluster analysis aims to group features that behave similarly across treatments, and such grouping helps to highlight the functional relationships among features and may provide biological insights into microbiome networks. However, clustering microbiome data is challenging due to their sparsity and high dimensionality.
RESULTS: We propose a model-based clustering method based on Poisson hurdle models for sparse microbiome count data. We describe an expectation-maximization algorithm and a modified version using simulated annealing to conduct the cluster analysis. Moreover, we provide algorithms for initialization and choosing the number of clusters. Simulation results demonstrate that our proposed methods provide better clustering results than alternative methods under a variety of settings. We also apply the proposed method to a sorghum rhizosphere microbiome dataset that results in interesting biological findings.
AVAILABILITY AND IMPLEMENTATION: R package is freely available for download at https://cran.r-project.org/package=PHclust.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
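To make the Poisson hurdle model named in the abstract above more concrete, here is a minimal Python sketch of its log-likelihood for a single feature. It uses the standard textbook form of the model: a zero occurs with probability `p_zero`, and positive counts follow a zero-truncated Poisson with rate `lam`. The abstract does not give the authors' exact parameterization, so the function and parameter names here are assumptions for illustration and are not taken from the PHclust package.

```python
import math


def hurdle_log_pmf(y: int, p_zero: float, lam: float) -> float:
    """Log-probability of count y under a textbook Poisson hurdle model.

    p_zero: probability that the count is zero (assumed name).
    lam:    rate of the zero-truncated Poisson for positive counts (assumed name).
    """
    if y == 0:
        return math.log(p_zero)
    # Poisson log-pmf: y*log(lam) - lam - log(y!)
    log_poisson = y * math.log(lam) - lam - math.lgamma(y + 1)
    # Renormalise over y >= 1 by dividing by 1 - exp(-lam).
    log_truncation = math.log1p(-math.exp(-lam))
    return math.log1p(-p_zero) + log_poisson - log_truncation


def hurdle_log_likelihood(counts, p_zero: float, lam: float) -> float:
    """Sum of log-probabilities over independent counts."""
    return sum(hurdle_log_pmf(y, p_zero, lam) for y in counts)


# Toy sparse count vector (made-up values, not data from the paper).
counts = [0, 0, 3, 1, 0, 7]
print(hurdle_log_likelihood(counts, p_zero=0.5, lam=2.0))
```

In a model-based clustering setting, a per-cluster likelihood of this kind is what an expectation-maximization algorithm would evaluate when assigning features to clusters; the sketch covers only the likelihood, not the clustering itself.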
K-mer-based data structures and pipelines for sequence mapping and analysis
The exponential growth of genomic data demands progress and research on scalable bioinformatics algorithms. One paradigm for improving computational efficiency in bioinformatics is the use of k-mers. Here we present three works based on the k-mer paradigm that improved existing methods and opened new possibilities for major application domains in bioinformatics. LINKS 2.0 is an alignment-free scaffolding tool that brings a 3-fold run-time and 5-fold memory improvement over the previous version (LINKS v1.8.7). As well as enabling LINKS to process more data with lower computational requirements, this major update also outputs higher-quality scaffolds. The major memory optimization in LINKS 2.0 was obtained by storing k-mers as their 64-bit hash values instead of as ASCII characters. The multi-index Bloom filter (miBF) is a novel associative probabilistic data structure designed for efficiently storing k-mers and spaced seeds. MiBF-mapper explored the utility of miBF in the long-read mapping domain and demonstrated its competitive accuracy. Mapping with miBF will serve as a reference for future work, especially for miBF-based methods. The work on miBF-based global ancestry inference (GAI) proved the scalability of miBF by processing high-coverage data of 208 individuals, and promises to increase accuracy over the state of the art by capturing short insertion and deletion (indel) markers as well as SNPs. We demonstrated high accuracy in continent-level inference and present a promising foundation for developing more accurate, loci-aware ancestry inferences.
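The memory saving from storing k-mers as 64-bit hash values can be illustrated with a short Python sketch. A hashed k-mer occupies 8 bytes regardless of k, whereas an ASCII k-mer occupies k bytes. The abstract does not say which hash function LINKS 2.0 uses, so `hashlib.blake2b` with an 8-byte digest is a stand-in here, and the function names are hypothetical.

```python
import hashlib


def kmer_hash64(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hashed_kmers(sequence: str, k: int):
    """Yield the 64-bit hash of every k-mer in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield kmer_hash64(sequence[i:i + k])


# Toy sequence for illustration only.
seq = "ACGTACGTGACCTGA"
k = 5
table = set(hashed_kmers(seq, k))

# Membership queries hash the query k-mer and look up the integer.
print(kmer_hash64("ACGTA") in table)
```

The trade-off is that distinct k-mers can in principle collide on the same hash value, which a tool has to accept or account for in exchange for the fixed 8-byte footprint.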
Genome misassembly detection using Stash: a data structure based on stochastic tile hashing
Analyzing large amounts of data produced by high-throughput sequencing technologies presents challenges in terms of memory and computational requirements. Therefore, it is crucial to develop data structures and computational methods that handle this information effectively. These challenges impact bioinformatics studies, including de novo genome assembly, which serves as the foundation of genomics. Issues like errors in reads or limitations due to heuristic decisions in assembly algorithms can lead to genome misassemblies and inaccurate genomic representations, compromising the quality of downstream analyses. Hence, de novo assemblies can benefit from misassembly detection and correction, to maximize the information provided by reads and produce an optimal assembly.
Here, we present Stash, a novel hash-based data structure designed for storing and querying large repositories of sequencing data based on a k-mer representation of a large sequence dataset. Stash uses a two-dimensional data structure based on hash values generated by sliding windows of spaced seed patterns over sequences to compress data. The key-value pairs stored in Stash are k-mers and sequence ID hashes, respectively. The stored hashed identifiers are then used to check whether two queried k-mers are observed in the same set of sequences. This functionality provides utility for Stash across diverse domains of bioinformatics. For example, Stash can indicate whether two genomic regions are covered by the same set of reads by measuring the number of matches between them. This can be used to detect misassemblies within a genome assembly of interest. We demonstrate the effectiveness of Stash in detecting misassemblies in human genome assemblies generated by the Flye and Shasta algorithms, using PacBio HiFi reads from the human cell line NA24385. We observe that scaffolding Stash-cut assemblies reduces misassemblies by 7.6% and 3.4% in the Flye and Shasta assemblies, respectively. It accomplishes this using 8 GB of memory and a total processing time of 117 plus 18 minutes. Remarkably, it can outperform alternative methods for detecting misassemblies in long-read data while preserving contiguity.
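The idea of keying a table by spaced-seed hashes and storing sequence ID hashes as values can be sketched in Python as follows. This is a heavily simplified illustration and not the Stash implementation: the abstract does not describe the tile layout, the seed patterns, or the stochastic component, so the class `TileTable`, the patterns in `SEEDS`, the table size, and the overwrite-on-collision policy are all assumptions made for this example.

```python
import hashlib

# Hypothetical spaced seed patterns: "1" keeps the base, "0" ignores it.
SEEDS = ["1101011", "1011101", "1110111"]
K = len(SEEDS[0])


def _hash64(text: str, salt: str = "") -> int:
    """64-bit hash; BLAKE2b is a stand-in, not the hash used by Stash."""
    digest = hashlib.blake2b((salt + text).encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class TileTable:
    """Rows chosen by spaced-seed hashes, one column per seed pattern."""

    def __init__(self, rows: int = 1 << 12):
        self.rows = rows
        # 0 marks an empty tile; stored ID hashes are in 1..255.
        self.tiles = [[0] * len(SEEDS) for _ in range(rows)]

    def _row(self, kmer: str, seed: str) -> int:
        masked = "".join(b for b, s in zip(kmer, seed) if s == "1")
        return _hash64(masked, salt=seed) % self.rows

    def insert(self, seq_id: str, sequence: str) -> None:
        """Slide a window over the sequence and store its ID hash per seed."""
        id_hash = _hash64(seq_id) % 255 + 1
        for i in range(len(sequence) - K + 1):
            kmer = sequence[i:i + K]
            for col, seed in enumerate(SEEDS):
                # Assumed policy: a later sequence overwrites an earlier one.
                self.tiles[self._row(kmer, seed)][col] = id_hash

    def lookup(self, kmer: str):
        return [self.tiles[self._row(kmer, seed)][col]
                for col, seed in enumerate(SEEDS)]

    def matches(self, kmer_a: str, kmer_b: str) -> int:
        """Count columns where both k-mers hold the same non-empty ID hash."""
        pairs = zip(self.lookup(kmer_a), self.lookup(kmer_b))
        return sum(1 for a, b in pairs if a != 0 and a == b)


table = TileTable()
table.insert("read1", "ACGTACGTGACCTGATT")  # toy read
print(table.matches("ACGTACG", "GACCTGA"))  # both k-mers come from read1
```

A high match count between k-mers drawn from two regions suggests that the same sequences cover both, which is the kind of signal the abstract describes using to flag candidate misassemblies.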
Annotation of complex genomes for comparative genomics
Advancements in whole-genome sequencing technologies have opened the use of genomic approaches to study a variety of organisms and allowed studies at the whole-genome scale in non-model organisms. In these studies, genome annotation is a fundamental step to extract diverse biological information from sequences that are otherwise strings of characters incomprehensible to humans.
Here I assembled and annotated genomes of plant and insect species of applied interest. A common theme in my thesis is comparative and evolutionary genomics of the described organisms. The sequenced species I studied have complex genomic features, including large genome sizes and high repeat contents, which I described in detail.
In Chapter 2, I investigate the protein-coding genes of four spruces (Picea, Pinaceae) native to North America. Comparison to other annotated conifers highlights changes in selection in gene families. Several gene families have a significantly expanded number of genes. Some genes are under positive selection: previous studies in spruce highlighted the same proteins as genetic markers for local adaptation. In Chapter 3, I characterize the genome of Pissodes strobi, a naturally occurring pest of the spruces described in Chapter 2. The genome of P. strobi is larger and more repetitive than other sequenced species in the same family (Curculionidae). In Chapter 4, I assemble and annotate the genome of a proprietary Cannabis sativa strain, and study the flavonoid/anthocyanin metabolic pathway, uncovering the upregulation of key metabolic genes involved in the regulation of leaf pigmentation.
The presented genome annotations and comparative analyses provide insights into the biology and evolution of the described species. Comparative genome studies are important for generating hypotheses and opening avenues of inquiry for future studies in population genomics. In the case of the genus Picea and P. strobi, such studies will enable us to understand the local adaptation of species and the genetic basis of regulatory processes, such as biotic stress mitigation and pest resistance.
High throughput in silico discovery of antimicrobial peptides in amphibian and insect transcriptomes
Antimicrobial peptides (AMPs) are a family of short defence proteins produced naturally by all multicellular organisms, ranging from microorganisms to humans. Since resistance to AMPs is less frequent than resistance to antibiotics, they may serve as a potential alternative. Past research has shown that amphibians have the richest known AMP diversity; in particular, the North American bullfrog has demonstrated potential in aiding the discovery of novel putative AMPs. Antibiotic resistance is becoming more prevalent each day, requiring agricultural practices to reduce the use of antibiotics to protect human health, animal health, and food safety. To help reduce the use of antibiotics, the goal of my thesis is to develop and execute an AMP discovery pipeline to discover AMPs suitable for pharmaceutical development. In this thesis, I have developed rAMPage: Rapid Antimicrobial Peptide Annotation and Gene Estimation. rAMPage is a scalable, high-throughput bioinformatics-based discovery platform for mining AMP sequences in publicly available genomic resources. It takes as input amphibian and insect RNA-seq reads from the Sequence Read Archive (SRA). After trimming, reads are assembled with RNA-Bloom into transcripts, filtered, and translated in silico. Then, the translated protein sequences are compared to known AMP sequences from the NCBI protein database and the AMP-specific databases APD3 and DADP via homology search. These sequences are cleaved into their mature/bioactive form. Next, the machine learning tool AMPlify is employed to classify and prioritize the candidate AMPs based on their AMP probability score. Finally, these candidate AMPs are annotated and characterized. Across 84 datasets, rAMPage detected > 1,000 putative AMPs, of which 90 sequences have been selected for downstream validation.
Genomic and transcriptomic signatures of virulence and UV resistance in Beauveria bassiana
Beauveria bassiana is an entomopathogenic fungus used as a biological control agent against insect pests related to agriculture, forestry and human health. There is a large amount of phenotypic and genomic variation within the species complex, and characterizing this variation is required to identify the optimal strain for protection against a specific pest. This thesis outlines comparative genomic and transcriptomic analyses of eight B. bassiana isolates, including six wild-type isolates and two UV-resistant derivatives, to identify the genetic basis of virulence and UV resistance. The five strains demonstrating the highest virulence levels against mountain pine beetle produced high levels of the red pigment, oosporein. Phylogenetic analysis placed the eight strains in two distinct clusters that reflected their morphology, grouping red strains separately from the non-red strains. Genes unique to the red strains included several membrane transporters, transcription factors and toxins, and may confer virulence or other unique biological functions to these strains. Significant differential expression was identified between the red and non-red strains, and these differentially expressed genes likely contribute to increased virulence, transmembrane transport and stress response in the red strains. Several genes encoding toxins, lipases and chitinases were differentially expressed, all of which are crucial to the infection process. Variant calling and differential expression in the UVR derivatives identified several genes of interest involved in oxidoreductase activity, stress response, copper metabolism and DNA replication/repair. These are all important mechanisms for protecting cells from UV-induced damage such as free radicals. Finally, differential correlation analysis identified several transcription factors that may be involved in the regulation of the oosporein biosynthetic gene cluster.
The results of this work have narrowed the scope for selecting and/or engineering the most effective strain of Beauveria bassiana for the biological control of insect pests.
Phitest for analyzing the homogeneity of single-cell populations.
MOTIVATION: Single-cell RNA sequencing technologies facilitate the characterization of transcriptomic landscapes in diverse species, tissues and cell types with unprecedented molecular resolution. In order to better understand animal development, physiology, and pathology, unsupervised clustering analysis is often used to identify relevant cell populations. Although considerable progress has been made in terms of clustering algorithms in recent years, it remains challenging to evaluate the quality of the inferred single-cell clusters, which can greatly impact downstream analysis and interpretation.
RESULTS: We propose a bioinformatics tool named Phitest to analyze the homogeneity of single-cell populations. Phitest is able to distinguish between homogeneous and heterogeneous cell populations, providing an objective and automatic method to optimize the performance of single-cell clustering analysis.
AVAILABILITY AND IMPLEMENTATION: The PhitestR package is freely available on both GitHub (https://github.com/Vivianstats/PhitestR) and the Comprehensive R Archive Network (CRAN). There is no new genomic data associated with this article. Published data used in the analysis are described in detail in the Supplementary Data.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.