Simultaneous inference for RNA-Seq data
In the last few years, RNA-Seq has become a popular choice for high-throughput studies of gene expression, showing the potential to replace microarrays as the new standard for transcriptional profiling. At the gene level, RNA-Seq yields counts rather than continuous measures of expression, which calls for novel methods to handle count data in high-dimensional problems.
In this thesis, we aim to shed light on the problems related to the exploration and modeling of RNA-Seq data. In particular, we introduce simple and effective ways to summarize and visualize the data; we define a novel algorithm for the clustering of RNA-Seq data; and we implement simple normalization strategies to deal with technology-related biases. Finally, we present a hierarchical Bayesian approach to modeling RNA-Seq expression data and testing for differences across experimental conditions. The model accounts for differences in sequencing depth and for overdispersion, and automatically accommodates different types of normalization.
A novel approach to the clustering of microarray data via nonparametric density estimation
Abstract. Background: Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables can be much higher than the number of observations. Results: Here, we present a general framework to deal with the clustering of microarray data, based on a three-step procedure: (i) gene filtering; (ii) dimensionality reduction; (iii) clustering of observations in the reduced space. Via a nonparametric model-based clustering approach we obtain promising results both in simulated and real data. Conclusions: The proposed algorithm is a simple and effective tool for the clustering of microarray data, in an unsupervised setting.
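The three-step procedure described in this abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the variance filter, the PCA projection, and the use of mean shift (a simple mode-seeking method based on a Gaussian kernel density estimate) as the nonparametric clustering step are all assumptions, and the function names `filter_genes`, `reduce_dim`, and `mean_shift` are hypothetical.

```python
import numpy as np

def filter_genes(X, n_keep):
    """Step (i): keep the n_keep genes (columns) with the largest variance."""
    keep = np.argsort(X.var(axis=0))[::-1][:n_keep]
    return X[:, keep]

def reduce_dim(X, n_components):
    """Step (ii): project centered data onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def mean_shift(Z, bandwidth, n_iter=200, tol=1e-6):
    """Step (iii), assumed stand-in: move each point uphill on a Gaussian
    kernel density estimate; points reaching the same mode share a label."""
    modes = Z.copy()
    for _ in range(n_iter):
        # Squared distances between current positions and the data points.
        d2 = ((modes[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-0.5 * d2 / bandwidth**2)
        new_modes = (w @ Z) / w.sum(axis=1, keepdims=True)
        shift = np.abs(new_modes - modes).max()
        modes = new_modes
        if shift < tol:
            break
    # Merge converged positions that lie within half a bandwidth of each other.
    labels = -np.ones(len(Z), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels

# Synthetic example: 20 samples x 200 genes, two groups differing on 20 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))
X[10:, :20] += 6.0
Z = reduce_dim(filter_genes(X, n_keep=20), n_components=2)
labels = mean_shift(Z, bandwidth=5.0)
```

The bandwidth is a tuning parameter chosen by hand here; in practice it would need to be selected from the data.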
From Data-Driven to Expert-Guided: Combining Unsupervised and Semi-supervised Clustering in Spatial Transcriptomics
One of the challenges in spatial transcriptomic experiments is identifying clusters of genes that exhibit similar expression patterns within specific regions of a tissue sample. The SpaRTaCo model, proposed by A. Sottosanti and D. Risso in 2023, offers a fully data-driven approach for the spatial classification of a tissue based on gene expression levels. Pathologist annotations of tissue samples are also often available, although they can differ substantially from the results of the data-driven analysis. In this work, we present a pivotal study focusing on a prostate cancer tissue sample. We demonstrate the integration of SpaRTaCo with two semi-supervised variants of the model, which incorporate external biological knowledge. This integration aims to uncover meaningful biological insights and specific gene expression patterns that may not be apparent from either approach alone.
Co-clustering of Spatially Resolved Transcriptomic Data
Spatial transcriptomics is a modern sequencing technology that measures the
activity of thousands of genes in a tissue sample and maps where that activity
occurs. This technology has enabled the study of the
so-called spatially expressed genes, i.e., genes which exhibit spatial
variation across the tissue. Comprehending their functions and their
interactions in different areas of the tissue is of great scientific interest,
as it might lead to a deeper understanding of several key biological
mechanisms. However, adequate statistical tools that exploit the newly
available spatial mapping information to reach more specific conclusions are
still lacking.
In this work, we introduce SpaRTaCo, a new statistical model that clusters
the spatial expression profiles of the genes according to the areas of the
tissue. This is accomplished by performing a co-clustering, i.e., inferring the
latent block structure of the data and inducing two types of clustering: of the
genes, using their expression across the tissue, and of the image areas, using
the gene expression in the spots where the RNA is collected. Our proposed
methodology is validated with a series of simulation experiments and its
usefulness in responding to specific biological questions is illustrated with
an application to a human brain tissue sample processed with the 10X-Visium
protocol. Comment: Supplementary material attached
Clustering via nonparametric density estimation: an application to microarray data.
Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables is much higher than the number of observations. Here, we present a novel approach to the clustering of microarray data via nonparametric density estimation, based on the following steps: (i) selection of relevant variables; (ii) dimensionality reduction; (iii) clustering of observations in the reduced space. Applications to simulated and real data show promising results in comparison with those produced by two standard approaches, k-means and Mclust. In the simulation studies, our nonparametric approach performs comparably to models based on the normality assumption, even in Gaussian settings. On the other hand, on two real benchmark datasets, it outperforms the existing parametric approaches.
Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles
MOTIVATION: Data transformations are an important step in the analysis of RNA-seq data. Nonetheless, the impact of transformation on the outcome of unsupervised clustering procedures is still unclear. RESULTS: Here, we present an Asymmetric Winsorization per Sample Transformation (AWST), which is robust to data perturbations and removes the need for selecting the most informative genes prior to sample clustering. Our procedure leads to robust and biologically meaningful clusters both in bulk and in single-cell applications. AVAILABILITY: The AWST method is available at https://github.com/drisso/awst. The code to reproduce the analyses is available at https://github.com/drisso/awst_analysis. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Spatially Informed Nonnegative Matrix Trifactorization for Coclustering Mass Spectrometry Data
Mass spectrometry imaging techniques measure molecular abundance in a tissue sample at a cellular resolution, all while preserving the spatial structure of the tissue. This kind of technology offers a detailed understanding of the role of several molecular factors in biological systems. For this reason, the development of fast and efficient computational methods that can extract relevant signals from massive experiments has become necessary. A key goal in mass spectrometry data analysis is the identification of molecules with similar functions in the analyzed biological system. This result can be achieved by studying the spatial distribution of the molecules' abundance patterns. To do so, one can perform coclustering, that is, dividing the molecules into groups according to their expression patterns over the tissue and segmenting the tissue according to the molecules' abundance levels. We present TRIFASE, a semi-nonnegative matrix trifactorization technique that performs coclustering while accounting for the spatial correlation of the data. We propose an estimation algorithm that solves the proposed matrix trifactorization problem. Moreover, to improve scalability, we also propose two heuristic approximations of the most expensive steps, which help the algorithm converge while significantly reducing the computational cost. We validated our method on a series of simulation experiments, comparing the different estimation strategies discussed in the article. Last, we analyzed a mouse brain tissue sample processed with MALDI-MSI technology, showing how TRIFASE extracts specific expression patterns of molecule abundance in localized tissue areas and discovers blocks of proteins whose activation is directly linked to specific biological mechanisms.
ROC estimation and threshold selection criteria in three-class classification problems for clustered data
Statistical evaluation of diagnostic tests, and, more generally, of biomarkers, is a constantly developing field, in which complexity of the assessment increases with complexity of the design under which data are collected. One particularly prevalent type of data is clustered data, where individual units are naturally nested into clusters. In these cases, bias can arise from omission, in the evaluation process, of cluster-level effects and/or individual covariates. Focussing on the three-class case and for continuous-valued diagnostic tests, we investigate how to exploit the clustered structure of data within a linear-mixed model approach, both when the assumption of normality holds and when it does not. We provide a method for estimation of covariate-specific ROC surfaces and discuss methods for the choice of optimal thresholds, proposing three possible estimators. A proof of consistency and asymptotic normality of the proposed threshold estimators is given. All considered methods are evaluated by extensive simulation experiments. As an application, we study the use of the Lysosomal Associated Membrane Protein Family Member 5 (Lamp5) gene expression as biomarker to distinguish among three types of glutamatergic neurons
PsiNorm: a scalable normalization for single-cell RNA-seq data
Motivation: Single-cell RNA sequencing (scRNA-seq) enables transcriptome-wide gene expression measurements at single-cell resolution, providing a comprehensive view of the composition and dynamics of tissue and organism development. The evolution of scRNA-seq protocols has led to a dramatic increase in cell throughput, exacerbating many of the computational and statistical issues that previously arose for bulk sequencing. In particular, with scRNA-seq data all the analysis steps, including normalization, have become computationally intensive, both in terms of memory usage and computational time. In this context, new accurate methods that scale efficiently are desirable. Results: Here, we propose PsiNorm, a between-sample normalization method based on the power-law Pareto distribution parameter estimate. We show that the Pareto distribution closely resembles scRNA-seq data, especially those coming from platforms that use unique molecular identifiers. Motivated by this result, ..
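As a rough illustration of the "Pareto distribution parameter estimate" mentioned above, the sketch below computes the standard maximum-likelihood estimate of the Pareto shape parameter separately for each cell. It is not the PsiNorm implementation: the abstract is truncated before it explains how the estimate is used for normalization, so this stops at the estimate itself. The function name `pareto_shape_mle`, the shift of counts by one, and the fixed scale `x_min = 1` are assumptions made for the example.

```python
import numpy as np

def pareto_shape_mle(counts, x_min=1.0):
    """Per-cell MLE of the Pareto shape: n / sum(log(x_i / x_min)).

    counts: genes x cells array of non-negative counts.
    Assumption: counts are shifted by one so that every value is >= x_min = 1.
    A cell with all-zero counts has a zero denominator and is not handled here.
    """
    x = np.asarray(counts, dtype=float) + 1.0
    n_genes = x.shape[0]
    return n_genes / np.log(x / x_min).sum(axis=0)

# Toy matrix: 3 genes (rows) x 2 cells (columns).
counts = np.array([[0, 1],
                   [1, 3],
                   [3, 7]])
alpha = pareto_shape_mle(counts)
```

Each cell needs only one pass over its counts, so the cost grows linearly with the size of the matrix.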
