Patterns of methylation heritability in a genome-wide analysis of four brain regions
DNA methylation has been implicated in a number of diseases and other phenotypes. It is, therefore, of interest to identify and understand the genetic determinants of methylation and epigenomic variation. We investigated the extent to which genetic variation in cis-DNA sequence explains variation in CpG dinucleotide methylation in publicly available data for four brain regions from unrelated individuals, finding that 3–4% of CpG loci assayed were heritable, with a mean estimated narrow-sense heritability of 30% over the heritable loci. Over all loci, the mean estimated heritability was 3%, as compared with a recent twin-based study reporting 18%. Heritable loci were enriched for open chromatin regions and binding sites of CTCF, an influential regulator of transcription and chromatin architecture. Additionally, heritable loci were proximal to genes enriched in several known pathways, suggesting a possible functional role for these loci. Our estimates of heritability are conservative, and we suspect that the number of identified heritable loci will increase as the methylome is assayed across a broader range of cell types and the density of the tested loci is increased. Finally, we show that the number of heritable loci depends on the window size parameter commonly used to identify candidate cis-acting single-nucleotide polymorphism variants
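The dependence on the window size parameter can be illustrated with a minimal sketch of candidate cis-SNP selection. The function name `cis_snps`, the positions, and the window sizes are all hypothetical and are not taken from the study; the sketch only shows how widening the window admits more candidate variants for a single CpG site.

```python
from bisect import bisect_left, bisect_right


def cis_snps(cpg_pos, snp_positions, window):
    """Return the SNP positions lying within `window` bp on either side of a CpG site.

    `snp_positions` must be sorted and refer to the same chromosome as `cpg_pos`.
    """
    lo = bisect_left(snp_positions, cpg_pos - window)
    hi = bisect_right(snp_positions, cpg_pos + window)
    return snp_positions[lo:hi]


# Made-up coordinates, for illustration only.
snps = [1_000, 48_000, 52_500, 140_000, 610_000]
cpg = 50_000

for w in (5_000, 100_000, 1_000_000):
    print(w, cis_snps(cpg, snps, w))
```

A larger window gives each locus more candidate variants, which changes both the variance that can be explained and the multiple-testing burden, so the set of loci called heritable can shift with this parameter.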
PERT: A Method for Expression Deconvolution of Human Blood Samples from Varied Microenvironmental and Developmental Conditions
The cellular composition of heterogeneous samples can be predicted using an expression deconvolution algorithm to decompose their gene expression profiles based on pre-defined, reference gene expression profiles of the constituent populations in these samples. However, the expression profiles of the actual constituent populations are often perturbed from those of the reference profiles due to gene expression changes in cells associated with microenvironmental or developmental effects. Existing deconvolution algorithms do not account for these changes and give incorrect results when benchmarked against those measured by well-established flow cytometry, even after batch correction was applied. We introduce PERT, a new probabilistic expression deconvolution method that detects and accounts for a shared, multiplicative perturbation in the reference profiles when performing expression deconvolution. We applied PERT and three other state-of-the-art expression deconvolution methods to predict cell frequencies within heterogeneous human blood samples that were collected under several conditions (uncultured mono-nucleated and lineage-depleted cells, and culture-derived lineage-depleted cells). Only PERT's predicted proportions of the constituent populations matched those assigned by flow cytometry. Genes associated with cell cycle processes were highly enriched among those with the largest predicted expression changes between the cultured and uncultured conditions. We anticipate that PERT will be widely applicable to expression deconvolution strategies that use profiles from reference populations that vary from the corresponding constituent populations in cellular state but not cellular phenotypic identity
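As a point of reference, the baseline idea of reference-based deconvolution can be sketched as a non-negative least-squares fit. This is not PERT: it assumes the reference profiles are unperturbed, which is exactly the assumption the abstract says fails in practice. The function `deconvolve`, the toy profiles, and the step size are illustrative assumptions.

```python
def deconvolve(mixed, references, steps=2000, lr=0.1):
    """Estimate non-negative proportions p with sum_k p[k] * references[k] ~= mixed.

    Uses projected gradient descent on the squared error, then rescales
    the proportions so that they sum to one.
    """
    k, n_genes = len(references), len(mixed)
    p = [1.0 / k] * k
    for _ in range(steps):
        pred = [sum(p[j] * references[j][g] for j in range(k)) for g in range(n_genes)]
        resid = [pred[g] - mixed[g] for g in range(n_genes)]
        grad = [2.0 * sum(resid[g] * references[j][g] for g in range(n_genes))
                for j in range(k)]
        # Gradient step followed by projection onto the non-negative orthant.
        p = [max(0.0, p[j] - lr * grad[j]) for j in range(k)]
    total = sum(p)
    return [x / total for x in p]


# Toy reference profiles (3 cell types x 4 genes); values are made up.
references = [
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.5, 0.1],
    [0.2, 0.3, 0.0, 1.0],
]
true_p = [0.6, 0.3, 0.1]
mixed = [sum(true_p[j] * references[j][g] for j in range(3)) for g in range(4)]

estimate = deconvolve(mixed, references)
print([round(x, 3) for x in estimate])
```

When the constituent populations have drifted away from the reference profiles, a fit of this kind absorbs the mismatch into the proportions, which is the failure mode that motivates modelling the perturbation explicitly.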
Representation learning methods developed for single cell genomics analysis
Advances in high-throughput omics technologies allow an increasing compendium of molecular layers to be assayed, from genome and epigenome profiling to transcriptomics and proteomics. Such data provide detailed snapshots that characterize the molecular state of a given biological system at very fine resolution. Single cell genomics assays such as scRNA-seq and scATAC-seq capture the landscape of genomic features across large collections of cells and have become one of the most popular molecular profiling techniques for investigating diverse problems related to gene regulation, such as identification of novel cell types and their regulatory signatures, trajectory inference for the analysis of continuous processes such as differentiation, high-resolution analysis of transcriptional dynamics, and characterization of transcriptional heterogeneity within populations of cells. Despite rapidly evolving technologies that can scale up to millions of cells across multiple individuals, one of the most pressing challenges in single cell genomics analysis is the technical noise that can drive approximately 50% of the cell-cell variation in expression measurements. Such technical noise is often associated with high sparsity of the genomic feature measurements. In chapter 2, we focus on alleviating the effect of such technical variation in feature measurements of single cell genomics data, such as gene expression and locus accessibility. We show that this technical variation in both scRNA-seq and scATAC-seq datasets can be mitigated by analyzing feature detection patterns alone and ignoring feature quantification measurements. This result holds when datasets have low detection noise relative to quantification noise.
We demonstrate state-of-the-art performance of detection pattern models using our new framework, scBFA, for both cell type identification and trajectory inference. While single cell genomics assays are inherently high dimensional, the variation among individual cells is often summarized in a low-dimensional space reflecting changes in genes' mean expression. Gene co-expression networks, which are often inferred from RNA sequencing data, offer another perspective for studying cell type-specific functional modules and complex regulatory interactions from transcriptomic profiles. The increasing availability of large-scale scRNA-seq datasets is now making it possible to infer many gene networks from diverse cell populations. However, there are no mature tools currently available to visualize and compare large collections of networks across single cell populations, or to identify correlations between variation in gene network structure and cell population-level phenotypes. In chapter 3, we present an unsupervised framework, scMultiAE, that enables comparison and visualization of multiple gene networks in a low-dimensional space, with a focus on studying the heterogeneity of iPSCs during differentiation
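The idea of analyzing detection patterns alone can be sketched as a binarization of the count matrix. This shows only that preprocessing idea, not the scBFA model itself; the function names and the toy counts are hypothetical.

```python
def detection_pattern(counts):
    """Binarize a cells x features count matrix: 1 if the feature was detected, else 0."""
    return [[1 if c > 0 else 0 for c in cell] for cell in counts]


def hamming(a, b):
    """Number of features whose detection status differs between two cells."""
    return sum(x != y for x, y in zip(a, b))


# Toy counts: cells 0 and 1 detect the same genes, but cell 1 has 10x the depth.
counts = [
    [0, 3, 12, 0],
    [0, 30, 120, 0],
    [5, 0, 0, 2],
]
pattern = detection_pattern(counts)

print(hamming(pattern[0], pattern[1]))  # same detection pattern despite different depth
print(hamming(pattern[0], pattern[2]))  # different detection pattern
```

Binarization discards the magnitude of each measurement, so differences driven purely by quantification noise (here, a tenfold change in depth) no longer separate cells that detect the same features.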
Multiscale and Multimodal Representation Learning for Single-Cell Omics
Understanding how molecular diversity at the single-cell level gives rise to complex, emergent functions and phenotypes, such as developmental progression or disease states, requires computational frameworks that capture cell state specificity, integrate diverse data modalities, bridge resolution gaps, prioritize key cellular programs, and incorporate prior biological knowledge to uncover underlying gene signatures. This dissertation presents a suite of deep learning models designed to meet these challenges in a multiscale and multimodal fashion, enabling interpretable and scalable analysis of single-cell data across complex biological systems. At the foundation of single-cell profiling lies cell type specificity. Chapter 2 introduces scProjection, a method for resolving cell type-specific signals from mixed or partially observed transcriptomic profiles. By projecting bulk or low-resolution profiles onto high-quality single-cell atlases, scProjection provides cell state-specific gene expression projections and imputes missing genes using learned gene-gene covariation structures through a deep generative model. Expanding on this, Chapter 3 presents scPair, a framework for enhanced cell state identification using information from multiple molecular modalities. scPair addresses the limitations of shallow multimodal assays by aligning chromatin accessibility and transcriptomic features via dual encoder-decoder architectures with implicit feature selection. This improves cross-modal translation, enables augmentation with larger unimodal atlases, and enhances statistical power for discovering transient or rare cell states.
scPair reveals cross-modality relationships and uncovers gene regulatory programs, including key transcription factors active during transitional states. Chapter 4 transitions from cell-level resolution to the sample level with bioPointNet, a deep multiple instance learning (MIL) model that represents each biological sample as an unordered set of cell instances. By applying attention-based aggregation, bioPointNet predicts emergent phenotypes without relying on cell type annotations and identifies the most informative cell subpopulations predictive of phenotype. This enables interpretable phenotype associations and supports alignment of samples from different sources along developmental or disease trajectories. Finally, Chapter 5 introduces sciLaMA, a framework for integrating prior biological knowledge into single-cell analysis. By incorporating gene embeddings derived from large language models (LLMs) into a paired variational autoencoder (VAE) structure, sciLaMA learns joint representations of genes and cells, which facilitates the discovery of biologically meaningful gene modules and the identification of key markers driving specific cell states. Together, these methods establish a set of tools for multiscale and multimodal single-cell analysis, supporting integrative data modeling, interpretable inference, and mechanistic insight into the cellular basis of phenotypic variation and gene network discovery
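Attention-based aggregation over an unordered set of cells can be sketched in a few lines. The abstract does not describe bioPointNet's architecture, so this is a generic attention-pooling illustration under stated assumptions: the scoring vector `w` is fixed by hand here, whereas a real model would learn it, and the cell embeddings are made up.

```python
import math


def attention_pool(cells, w):
    """Pool a set of cell embeddings into one sample embedding.

    Each cell gets a score w . x, the scores are softmax-normalized into
    attention weights, and the sample embedding is the weighted mean.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, cell)) for cell in cells]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(cells[0])
    pooled = [sum(a * cell[d] for a, cell in zip(attn, cells)) for d in range(dim)]
    return pooled, attn


# One sample represented as a bag of three 2-dimensional cell embeddings.
sample = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
w = [1.5, 0.5]  # hypothetical scoring vector

pooled, attn = attention_pool(sample, w)
pooled_rev, _ = attention_pool(sample[::-1], w)
print([round(a, 3) for a in attn])
```

Because the pooled vector is a weighted sum, reordering the cells leaves it unchanged, and the attention weights indicate which cells contributed most to the sample-level representation.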
DEEP LEARNING MODELS FOR THE ANALYSIS OF SINGLE CELL GENOMICS
Single cell transcriptomic technologies, which capture high-dimensional measurements of gene expression in individual cells, have been scaling exponentially in the number of cells that can be sequenced and analyzed simultaneously. Capturing a snapshot of the landscape of possible gene expression measurements from a collection of cells enables researchers to observe the space of molecular variation inherent to specific biological systems, a practice termed atlasing. A challenge in building deeply characterized atlases of complex biological systems such as the human brain is the identification and correction of confounding factors that do not relate to the underlying biology but instead arise from technical sources. In this dissertation I present deep learning models applied to single cell genomics that remove unwanted technical variation and contamination and perform novel analyses not previously possible using standard methods. The construction of single cell genomics atlases leverages recent advances in single cell RNA sequencing technologies such as 10X and SmartSeq, which can capture thousands of cells in a single experiment. When individual cells are sequenced on different technologies, this introduces unwanted technical variation (bias) specific to each technology and confounds attempts to merge scRNA-seq experiments into more complete atlases. To address this challenge, we developed scAlign, an scRNA-seq alignment method based on advances in computer vision, to remove the effects of unwanted technical variation on gene expression. scAlign, an unsupervised deep learning method, performs data alignment that can incorporate partial, overlapping or complete sets of cell labels, and estimates per-cell differences in gene expression across datasets or conditions to characterize specific expression changes due to conditions such as age or disease.
With the recent surge of atlas efforts across complex tissues, conditions, and species, another challenge is how to integrate deep characterizations of cell state with lower-resolution single cell or bulk genomics assays. Specifically, spatial and multi-omics assays do not collect RNA from a single cell: the former collect it from a spot containing multiple cells, and the latter are subject to contamination from the unintended collection of additional cells. We developed scProjection to join deeply sequenced atlases with lower-resolution genomic assays, addressing the unwanted heterogeneity in mixed samples and projecting such samples in a way that recovers the underlying single-cell measurements. scProjection is demonstrated to accurately estimate the abundance of the cell types that compose a mixed RNA sample while simultaneously identifying the gene expression measurements consistent with each cell type in the sample, in order to identify cell type-specific changes due to the spatial location of cells or disease state
The benefits of selecting phenotype-specific variants for applications of mixed models in genomics
Applications of linear mixed models (LMMs) to problems in genomics include phenotype prediction, correction for confounding in genome-wide association studies, estimation of narrow sense heritability, and testing sets of variants (e.g., rare variants) for association. In each of these applications, the LMM uses a genetic similarity matrix, which encodes the pairwise similarity between every two individuals in a cohort. Although ideally these similarities would be estimated using strictly variants relevant to the given phenotype, the identity of such variants is typically unknown. Consequently, relevant variants are excluded and irrelevant variants are included, both having deleterious effects. For each application of the LMM, we review known effects and describe new effects showing how variable selection can be used to mitigate them. National Institute on Aging (Brain eQTL Study (dbGaP phs000249.v1.p1)
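A genetic similarity matrix of the kind described above is commonly computed from standardized genotypes. The sketch below uses that common construction, which is an assumption since the abstract does not specify one; the function name, the `variant_idx` argument, and the toy genotypes are hypothetical. It also shows how restricting the computation to a chosen subset of variants changes the matrix.

```python
from statistics import mean, pstdev


def similarity_matrix(genotypes, variant_idx=None):
    """Genetic similarity K = Z Z^T / m from an individuals x variants 0/1/2 matrix.

    Each variant is standardized to zero mean and unit variance across
    individuals. `variant_idx` optionally restricts K to selected variants.
    """
    n = len(genotypes)
    cols = range(len(genotypes[0])) if variant_idx is None else variant_idx
    z_cols = []
    for j in cols:
        col = [row[j] for row in genotypes]
        sd = pstdev(col)
        if sd == 0:
            continue  # a monomorphic variant carries no similarity information
        mu = mean(col)
        z_cols.append([(g - mu) / sd for g in col])
    if not z_cols:
        raise ValueError("no polymorphic variants selected")
    m = len(z_cols)
    return [[sum(z[a] * z[b] for z in z_cols) / m for b in range(n)] for a in range(n)]


# Toy genotypes: 4 individuals x 5 variants (minor-allele counts, made up).
G = [
    [0, 1, 2, 0, 1],
    [1, 1, 0, 0, 2],
    [2, 0, 1, 1, 1],
    [0, 2, 2, 1, 0],
]
K_all = similarity_matrix(G)                       # every variant
K_sel = similarity_matrix(G, variant_idx=[0, 2])   # a selected subset only
print(round(K_all[0][3], 3), round(K_sel[0][3], 3))
```

The two matrices generally differ, which is why the choice of variants entering the similarity matrix can affect prediction, confounding correction, and heritability estimates downstream.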
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
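The two counting schemes can be sketched as a single function with a cutoff on how many authors of each cited work are considered. The data layout, the function name, and the author labels are hypothetical. Treating co-authors of the same cited work as co-cited with one another is an assumption of this sketch, and conventions on that point may differ.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations across a collection of citing papers.

    `citing_papers` is a list of reference lists; each reference is the
    ordered author list of one cited work. Only the first `max_authors`
    authors of each cited work are considered, so max_authors=1 gives
    first-author counting and max_authors=5 gives first-five counting.
    """
    counts = Counter()
    for cited_works in citing_papers:
        authors = set()
        for author_list in cited_works:
            authors.update(author_list[:max_authors])
        # Each unordered pair is counted at most once per citing paper.
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts


# Two citing papers, each citing two works (author labels are placeholders).
papers = [
    [["A", "B"], ["C"]],
    [["D", "A"], ["C", "E"]],
]

first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_author[("A", "C")], first_five[("A", "C")])
```

In this toy example author A is co-cited with C once under first-author counting but twice under first-five counting, because A appears as second author of a work in the second citing paper.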
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
ISOpureR: an R implementation of a computational purification algorithm of mixed tumour profiles
Background
Tumour samples containing distinct sub-populations of cancer and normal cells present challenges in the development of reproducible biomarkers, as these biomarkers are based on bulk signals from mixed tumour profiles. ISOpure is the only mRNA computational purification method to date that does not require a paired tumour-normal sample, provides a personalized cancer profile for each patient, and has been tested on clinical data. Replacing mixed tumour profiles with ISOpure-preprocessed cancer profiles led to better prognostic gene signatures for lung and prostate cancer.
Results
To simplify the integration of ISOpure into standard R-based bioinformatics analysis pipelines, the algorithm has been implemented as an R package. The ISOpureR package performs analogously to the original code in estimating the fraction of cancer cells and the patient cancer mRNA abundance profile from tumour samples in four cancer datasets.
Conclusions
The ISOpureR package estimates the fraction of cancer cells and personalized patient cancer mRNA abundance profile from a mixed tumour profile. This open-source R implementation enables integration into existing computational pipelines, as well as easy testing, modification and extension of the model. Prostate Cancer Canada; Movember Foundation (Grant RS2014-01
