    SkewIT, Bracken, and Kraken: Methods for Analyzing a Complex, But Invisible World

    As the DNA of the invisible world provides insight into the countless microscopic organisms living amongst us, the integrity of these genomes and the methods by which we analyze them become increasingly important. In the following, I introduce methods for both evaluating genomic integrity and analyzing microbial communities. For the analysis of bacterial genomes, I developed SkewIT (Skew Index Test) based on GC skew, a bacterial genome phenomenon wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. SkewIT calculates a single metric representing the degree of GC skew for a single genome. Applied across 15,000+ complete bacterial genomes, SkewIT quickly detects assembly patterns and highlights potential bacterial mis-assemblies. Although eukaryotic microorganisms are abundant worldwide, including as human pathogens, eukaryotic pathogen genomes are underrepresented in genomic databases and contain significant contamination. I therefore developed a bioinformatics system for eliminating contamination, generating a “clean” eukaryotic pathogen database (EuPathDB-Clean) of nearly 400 genomes. With the final database, I identify eukaryotic pathogens in human samples, demonstrating that the final database increases sensitivity and reduces false positives compared with the originally contaminated genomes. As metagenomics captures the genomic data of all microbial organisms in any environment, I developed Bracken (Bayesian Reestimation of Abundance after Classification with KrakEN) for a quick and accurate characterization of the full microbial environment. Bracken uses the taxonomic assignments made by Kraken, a very fast read-level classifier, along with information about the genomes themselves to estimate abundance at the species level, the genus level, or above. I demonstrate that Bracken produces accurate abundance estimates even when a sample contains multiple near-identical species for both shotgun metagenomics projects and for 16S ribosomal RNA (rRNA) bacterial projects. SkewIT, Bracken, and EuPathDB-Clean are all publicly available for use in future metagenomics projects.
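As a rough illustration of the GC skew signal described in the abstract above, the sketch below computes the conventional windowed skew, (G - C) / (G + C), along a sequence. This is not the SkewIT metric itself, which the abstract does not spell out; the function name `gc_skew` and the window size are assumptions chosen for the example.

```python
def gc_skew(seq, window=4):
    """Return (G - C) / (G + C) for each non-overlapping window of seq.

    Hypothetical helper for illustration only; windows with no G or C
    are reported as 0.0 rather than raising a division error.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


if __name__ == "__main__":
    # A G-rich stretch followed by a C-rich stretch flips the sign of the skew.
    print(gc_skew("GGGGCCCC", window=4))
```

In a typical bacterial chromosome the sign of this value tends to switch near the replication origin and terminus, which is the kind of pattern a skew-based integrity check can look for.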

    Study of cell-cell communication using 3D living cell microarrays

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 135-152). Cellular behavior is not dictated solely from within; it is also guided by a myriad of external cues. If cells are removed from their natural environment, away from the microenvironment and social context they are accustomed to, it is difficult to study their behavior in any meaningful way. To that end, I describe a method for using optical trapping to position cells with submicron accuracy in three dimensions and then encapsulate them in hydrogel, in order to mimic the in vivo microenvironment. This process has been carefully optimized for cell viability, checking both prokaryotic and eukaryotic cells for membrane integrity and metabolic activity. To demonstrate the utility of this system, I have looked at a model "quorum sensing" system in Vibrio fischeri, which operates by the emission and detection of a small chemical signal, an acyl-homoserine lactone. Through synthetic biology, I have engineered plasmids that express "sending" and "receiving" genes. Bacteria containing these plasmids were formed into complex 3D patterns designed to assay signaling response. The gene expression of the bacteria was tracked over time using fluorescent proteins as reporters. A model for this system was composed using a finite element method to simulate signal transport through the hydrogel, and simple mass-action kinetic equations to simulate the resulting protein expression over time. by Winston Timp. Ph.D.

    Study of disposable microdevices for DNA electrophoresis

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, September 2005. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. [77]-[79]). A study was undertaken to determine whether a microfluidic chip made of economical plastic materials is feasible. The chip was designed to perform gel electrophoresis, specifically of DNA fragments for either sequencing or identification purposes. With a disposable version of such a chip, constraints on the gel type are relaxed and lifetime issues become nonexistent. Such a chip was created using polydimethylsiloxane (PDMS) as the plastic material, with a cast molding process. The chip was subsequently sealed against a piece of PDMS, mounted on a glass slide for structural support. Fluidic and electrical interconnects were added to the chip. A polyacrylamide solution was injected into the chip for use in DNA separations. The chip was then placed into an apparatus designed for laser-induced fluorescence (LIF) detection. Several different samples were run on the chip, including polystyrene beads, organic dye molecules, and short tandem repeat (STR) allelic ladders. The chip demonstrated its electrophoretic efficiency, evincing a low, almost negligible amount of electroosmotic flow. The separation of the dye and DNA was accomplished with good fidelity, allowing for identification of the various constituents of the loaded sample. The PDMS chip, though demonstrably efficient at DNA separation, needs work before it can move out of the prototype phase. Substantial work on the fluidic interconnection, as well as the basic plastic formulation, is needed to move this idea forward. However, the chip is sufficient for a clear proof of the principle of disposable chip use in electrophoretic separations. by Winston Timp. S.M.

    Computational Methods for Structural Variation Analysis in Populations

    Recent advances in long-read sequencing have given us an unprecedented view of structural variants (SVs). However, much of their role in disease and evolution remains unknown due to a number of technical and biological challenges, including the high error rate of most long-read sequencing data, the additional complexity of aligning around large variants, and biological differences in how the same SV can manifest in different individuals. In this thesis we introduce novel methods for structural variant analysis and demonstrate how they overcome many of these obstacles. First, we apply recent advances in data structures to the substring search problem and show how learned index structures can enable accelerated alignment of genomic reads. Next, we present an optimized SV calling pipeline that integrates improvements to existing software alongside two novel SV-processing methods, Iris and Jasmine, which improve the accuracy of SV breakpoints and sequences in individual samples and compare and integrate SV calls from multiple samples. Finally, we show how the introduction of CHM13, the first gap-free telomere-to-telomere human reference genome, enables for the first time variant calling in over 100 Mbp of newly resolved sequence and mitigates long-standing issues in variant calling that were attributed to gaps, errors, and minor alleles in the prior GRCh38 reference. We demonstrate the broad applicability of our advancements in SV inference by uncovering novel associations with gene expression in 444 human individuals from the 1000 Genomes Project, by detecting SVs in the tomato genome that affect fruit size and yield, and by comparing SVs between tumor and normal cells in organoids derived from the SKBR3 breast cancer cell line.

    Developing compressed linear pangenome indexes for rapid sequence classification

    A reference genome serves an important function for various genomic analyses; it acts as a template to be used to match sequencing reads to the genome and provides a coordinate system to help translate findings from one study to another. However, being overly reliant on a single reference genome leads to an issue called "reference bias", where one's findings can be biased due to the genetic differences between the reference and donor genome. In order to combat this bias, the community has worked on assembling a multitude of reference genomes spanning a wide array of genetic backgrounds in order to build a pangenome reference. My thesis work will focus on the problem of quickly mapping sequencing reads onto these large pangenome databases using a compressed linear pangenome index. The first half of the thesis will present novel computational methods for quickly classifying whether a sequencing read appears to have originated from a pangenome reference. We present an efficient string matching algorithm computing a quantity called pseudo-matching lengths and develop a hypothesis testing framework for classifying whether reads are present or not in the database. We show how to integrate the concepts of minimizer digestion and run-length encoding to build an efficient and scalable full-text index for querying. Utilizing these novel methods, we show that we can achieve comparable binary classification accuracy to state-of-the-art aligners while being substantially faster and more memory-efficient. The second half of the thesis will focus on specifically identifying where in a pangenome reference a read appears to originate, and we explore three different solutions for this problem. Firstly, we develop a novel data-structure that scales with pangenomes and allows users to identify a single genome that a substring match from a read occurs in, thereby giving users information to classify which genome the full read is from. Secondly, in contrast to the previous solution, we theorized and implemented a new document listing data-structure which provides the full scope of information by allowing users to identify all of the genomes that a substring occurs in. Lastly, we showed a novel compression scheme that can reduce the size of the document listing data-structure by over two orders of magnitude. We utilize these new data-structures in the application of taxonomic classification and show that we achieve a higher classification accuracy than state-of-the-art tools.
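The two ideas the abstract above combines, minimizer digestion and run-length encoding, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the thesis's index: the function names, the lexicographic minimizer ordering, and the small `k` and `w` values are all hypothetical choices made for the example.

```python
from itertools import groupby


def minimizer_digest(seq, k=3, w=2):
    """Reduce seq to the list of its window minimizers.

    Assumption: each window holds w consecutive k-mers, and the minimizer
    is the lexicographically smallest one (leftmost on ties). A minimizer
    is emitted once per position, even if several windows select it.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    digest = []
    last_pos = -1
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        if pos != last_pos:
            digest.append(kmers[pos])
            last_pos = pos
    return digest


def run_length_encode(text):
    """Collapse each run of equal characters into a (character, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


if __name__ == "__main__":
    print(minimizer_digest("ACGTACGT", k=3, w=2))
    print(run_length_encode("AAACCG"))
```

Digestion shrinks the text that has to be indexed, while run-length encoding exploits repetitiveness, which is plentiful when many similar genomes are concatenated into one pangenome.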

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
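The difference between the two counting schemes compared above can be made concrete with a small sketch. It assumes each citing paper is represented as a list of cited works, each cited work as an ordered author list, and that a pair of authors is counted once per citing paper; the function name `cocitation_counts` and the `max_authors` parameter are hypothetical, and the study itself may define the counts differently.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how often each pair of authors is cited together.

    max_authors=1 mimics first-author counting; max_authors=5 credits
    the first five authors of every cited work (illustrative assumption).
    """
    counts = Counter()
    for cited_works in citing_papers:
        # Collect every author credited by this citing paper's references.
        authors = set()
        for author_list in cited_works:
            authors.update(author_list[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    papers = [
        [["A", "B"], ["C"]],
        [["A"], ["C", "D"]],
    ]
    print(dict(cocitation_counts(papers, max_authors=1)))
    print(dict(cocitation_counts(papers, max_authors=5)))
```

With first-author counting only the pair (A, C) appears; crediting more authors adds pairs involving B and D, which is the extra data the non-traditional scheme feeds into the later mapping.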

    A Mechanism for Stochastic Cell Fate Specification

    Stochastic cell fate specification is a critical component of many developmental programs and is particularly important for generating diverse populations of sensory neurons. The mechanisms controlling stochastic cell fate specification, whereby cells randomly choose between two or more fates, remain poorly understood. In the fruit fly Drosophila melanogaster, stochastic cell fate specification is used during retinal development to specify two subtypes of R7 photoreceptors. Each subtype is defined by the expression of a unique Rhodopsin (Rh) receptor, Rh3 or Rh4. This fate decision is controlled by the transcription factor Spineless (Ss), which is expressed in a random subset of R7s. I used this simple, binary, random fate decision as a paradigm to study the greater principles underlying stochastic cell fate specification. In wild-type lab stocks, the percentage of R7s that express Ss is typically ~67%; in the wild, however, the proportion of SsON cells varies greatly amongst flies. We identified a naturally occurring genetic variant in the ss locus that lowers the proportion of SsON R7s by increasing the binding affinity for the transcriptional repressor Klumpfuss (Klu). Additionally, we found that lowering the %SsON R7s shifts the innate color preference of flies from green to blue. The Klu binding site lies upstream of the ss locus in the early enhancer. At the time SsON/SsOFF fate is determined in the developing larval eye, this enhancer drives expression in R7 precursors, while the downstream late enhancer drives expression in terminally differentiated R7s. We show that a two-step mechanism involving both enhancers regulates the SsON/SsOFF decision in R7s. In the first step, the early enhancer drives a pulse of ss expression in R7 precursors that opens the previously heterochromatic locus. In the second step, early transcription ceases and the ss locus recompacts to variable degrees. We find that variable compaction post-pulse determines whether the late enhancer is accessible and able to reactivate ss in mature R7s. This work has contributed significantly to our understanding of mechanisms driving stochastic cell fate specification. Components of the two-step mechanism we identified are similar to those of other stochastic systems, suggesting that they may be shared core principles.

    SVCFit: INFERRING STRUCTURAL VARIANT CELLULAR FRACTION IN TUMORS

    The dynamic nature of the cancer genome, characterized by intratumor heterogeneity and multiple cell subpopulations (clones), underscores the importance of reconstructing tumor phylogeny to understand the evolutionary trajectories of cancer. This study introduces a novel approach to estimating the structural variant cellular fraction (SVCF) in tissue samples where tumor and normal cells are mixed. By applying variant allele frequency (VAF) adjustments for different types of structural variants (SVs), including deletions, tandem duplications, and inversions, my method SVCFit aims to achieve more accurate SVCF estimation. Preliminary results demonstrate improved SVCF estimation compared to a comparable published method, SVclone, particularly for deletions and inversions across various levels of tumor purity (the proportion of tumor cells in the sample). The method, however, has limitations related to assumptions of constant read depth and heterozygous SVs, as well as challenges in detecting certain SV types due to the constraints of short-read sequencing. Future work will utilize long-read sequencing data to address these limitations and incorporate read-depth variation and SV haplotype information. This research represents a significant step forward in accurately reconstructing tumor phylogeny by considering the cellular fraction of somatic SVs, thereby enhancing our understanding of tumor evolution and its implications for therapeutic outcomes.