GOstat: find statistically overrepresented Gene Ontologies within a group of genes
Modern experimental techniques, such as DNA microarrays, usually produce a long list of genes that are potentially interesting in the process under analysis. To gain biological understanding from this type of data, it is necessary to analyze the functional annotations of all genes in this list. The Gene Ontology (GO) database provides a useful tool to annotate and analyze the functions of a large number of genes. Here, we introduce a tool that uses this information to show which annotations are typical for the analyzed list of genes. The program automatically obtains the GO annotations from a database and generates statistics on which annotations are overrepresented in the analyzed list of genes. This results in a list of GO terms sorted by their specificity
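The abstract does not say which test the tool applies. As a minimal sketch of how overrepresentation can be scored, the Python below uses the upper tail of the hypergeometric distribution, a common choice for this kind of analysis but an assumption here. The function names, the input layout (a dict from term to annotated genes) and the term labels are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def rank_terms(study_genes, background_genes, annotations):
    """Rank terms by p-value, smallest first.

    annotations: dict mapping a term label to the set of genes annotated
    with it (hypothetical input layout, not the tool's actual format).
    """
    background = set(background_genes)
    study = set(study_genes) & background
    results = []
    for term, genes in annotations.items():
        annotated = set(genes) & background
        k = len(annotated & study)
        if k == 0:
            continue  # term never seen in the study list
        p = hypergeom_upper_tail(k, len(study), len(annotated), len(background))
        results.append((term, k, len(annotated), p))
    return sorted(results, key=lambda r: r[3])


if __name__ == "__main__":
    background = [f"g{i}" for i in range(1, 11)]
    study = ["g1", "g2", "g3"]
    annotations = {
        "term_A": {"g1", "g2", "g3", "g4"},
        "term_B": {"g1", "g5", "g6", "g7", "g8", "g9"},
    }
    for term, k, K, p in rank_terms(study, background, annotations):
        print(term, k, K, round(p, 4))
```

In practice the p-values from many terms would also need a multiple-testing adjustment before being interpreted.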
Normalization of RNA-seq data using factor analysis of control genes or samples
Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms
A systematic approach for comprehensive T-cell epitope discovery using peptide libraries
The T-cell response to peptides bound to MHC Class I or Class II molecules is essential for immune recognition of pathogens. T-cells are activated by specific peptide epitopes that are determined within the antigen processing pathways and presented on the surface of other cells bound to MHC molecules. Determining which parts of allergenic or pathogenic proteins can stimulate T-cells is important for the treatment of diseases. We sought to take advantage of the falling cost of synthetic, screening-grade peptides and devise a comprehensive, non-hypothesis-driven screen for T-cell epitopes. We were interested in the study of celiac disease (CD) and used the ELISPOT technique to perform a large number of T-cell assays. We therefore needed to compensate for the lack of statistical data analysis methods for ELISPOT assays
Silencing of Odorant Receptor Genes by G Protein βγ Signaling Ensures the Expression of One Odorant Receptor per Olfactory Sensory Neuron
Summary: Olfactory sensory neurons express just one out of a possible ∼1,000 odorant receptor genes, reflecting an exquisite mode of gene regulation. In one model, once an odorant receptor is chosen for expression, other receptor genes are suppressed by a negative feedback mechanism, ensuring a stable functional identity of the sensory neuron for the lifetime of the cell. The signal transduction mechanism subserving odorant receptor gene silencing remains obscure, however. Here, we demonstrate in the zebrafish that odorant receptor gene silencing is dependent on receptor activity. Moreover, we show that signaling through G protein βγ subunits is both necessary and sufficient to suppress the expression of odorant receptor genes and likely acts through histone methylation to maintain the silenced odorant receptor genes in transcriptionally inactive heterochromatin. These results link receptor activity with the epigenetic mechanisms responsible for ensuring the expression of one odorant receptor per olfactory sensory neuron
Reproductive failure and the major histocompatibility complex
The association between HLA sharing and recurrent spontaneous abortion (RSA) was tested in 123 couples, and the association between HLA sharing and the outcome of treatment for unexplained infertility by in vitro fertilization (IVF) was tested in 76 couples, by using a new shared-allele test in order to identify more precisely the region of the major histocompatibility complex (MHC) influencing these reproductive defects. The shared-allele test circumvents the problem of rare alleles at HLA loci and at the same time provides a substantial gain in power over the simple χ² test. Two statistical methods, a corrected homogeneity test and a bootstrap approach, were developed to compare the allele frequencies at each of the HLA-A, HLA-B, HLA-DR, and HLA-DQ loci; they were not statistically different among three patient groups and the control group. There was a significant excess of HLA-DR sharing in couples with RSA and an excess of HLA-DQ sharing in couples with unexplained infertility who failed treatment by IVF. These findings indicate that genes located in different parts of the class II region of the MHC affect different aspects of reproduction and strongly suggest that the sharing of HLA antigens per se is not the mechanism involved in the reproductive defects. The segment of the MHC that has genes affecting reproduction also has genes associated with different autoimmune diseases, and this juxtaposition may explain the association between reproductive defects and autoimmune diseases.
Statistical problems in DNA microarray data analysis
DNA microarrays are powerful tools for functional genomics studies. Each array contains thousands of microscopic spots of DNA oligonucleotides with specific sequences, which can hybridize with their complementary DNA sequences. Thus each microarray experiment consists of parallel assays of thousands of genomic fragments. This thesis concerns some statistical issues in the analysis of DNA microarray data. One common use of DNA microarrays is to monitor the dynamic levels of gene expression in response to a stimulus. This is often achieved through a time course experiment, in which RNA samples are extracted at various time points after exposing the organism to the stimulus. A particularly interesting type of time course experiment involves replicated series of longitudinal samples. In 2006, Tai and Speed proposed a multivariate empirical Bayes model for analyzing this type of data. The MB-statistic derived from this model was shown to be useful for ranking the genes according to changes in their temporal expression profiles. In the first part of this thesis, we propose an empirical Bayes false discovery rate (FDR)-controlling procedure for multiple hypothesis testing using the MB-statistic. A null distribution is obtained using the parametric bootstrap. Critical values are determined according to the empirical Bayes FDR procedure. This method was compared, through simulations, to the frequentist FDR procedure, which requires a theoretical null distribution for calculating the nominal p-values. Although our method is slightly anti-conservative, it is more robust to the variability in the estimates of the hyperparameters when the degree of moderation is small. Another common use of DNA microarrays is to detect genomic locations that are associated with DNA-binding proteins. This is often achieved through ChIP-chip experiments that combine chromatin immunoprecipitation with the microarray technology.
Traditional DNA microarrays designed for gene expression studies contain only a few probes for each gene. A special type of DNA microarray, called a tiling array, is often used in ChIP-chip experiments. Tiling arrays typically contain probes that are placed densely along the chromosomes to cover either the entire genome or contigs of the genome. Two challenges in the analysis of ChIP-chip tiling array data have not been met satisfactorily in the literature. When large-scale genomic studies are carried out over a long period of time, tiling arrays with different probe designs are often used for practical reasons. The first challenge is the integration of replicate experiments performed using different tiling array designs. When the biological process of interest involves a large protein complex, the investigators often perform ChIP-chip experiments on each component DNA-binding protein individually. DNA targets that are shared by the individual proteins are thought to be the localization sites of the protein complex. The second challenge is the joint analysis of multiple DNA-binding proteins, aimed at identifying their shared targets. In the second part of this thesis, we propose a nonhomogeneous hidden Markov model (HMM) for addressing these two challenges. The nonhomogeneous time axis represents the genomic positions of the probes. The hidden states represent the binding statuses of the proteins. The state-conditional emission distributions of the tiling array data are protein-specific and design-specific. We derived a modified Baum-Welch algorithm for fitting the model parameters. We also developed a procedure that converts the probe-level summaries into peaks, which represent the putative binding sites, based on both signal strength and peak shape. To compare our method with existing methods, we curated a set of positive and negative genomic regions from a C. elegans dataset and performed receiver operating characteristic (ROC) analyses.
When applied to each experiment separately, our method performs similarly to the three best existing methods. When applied to the combined data set, which consists of tiling arrays with different probe designs, our method shows a drastic improvement in performance. A generalization of the nonhomogeneous HMM enables the joint analysis of the ChIP-chip data of multiple proteins. We present an application of this method to identify the shared localization sites of two DNA-binding proteins, under two different conditions
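To illustrate what a nonhomogeneous HMM over probe positions might look like, here is a minimal Python sketch of a two-state model (unbound, bound) whose transition probabilities depend on the genomic gap between neighbouring probes. The parameterization is an assumption, not the thesis's model: transitions come from a two-state continuous-time Markov chain with hypothetical rates `rate_on` and `rate_off`, emissions are Gaussian, and only the forward pass (log-likelihood) is shown rather than the modified Baum-Welch fit.

```python
from math import exp, log, pi, sqrt


def transition_matrix(gap, rate_on, rate_off):
    """Transition probabilities over a genomic gap for a two-state
    continuous-time chain (assumed parameterization).
    Row = current state, column = next state; 0 = unbound, 1 = bound."""
    total = rate_on + rate_off
    decay = exp(-total * gap)
    p_on = rate_on / total    # stationary probability of the bound state
    p_off = rate_off / total  # stationary probability of the unbound state
    return [
        [p_off + p_on * decay, p_on * (1.0 - decay)],
        [p_off * (1.0 - decay), p_on + p_off * decay],
    ]


def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))


def log_likelihood(positions, signals, rate_on, rate_off, means, sds):
    """Scaled forward algorithm over probes sorted by genomic position."""
    total = rate_on + rate_off
    alpha = [rate_off / total, rate_on / total]  # start from stationarity
    loglik = 0.0
    prev = None
    for pos, x in zip(positions, signals):
        if prev is not None:
            P = transition_matrix(pos - prev, rate_on, rate_off)
            alpha = [
                alpha[0] * P[0][0] + alpha[1] * P[1][0],
                alpha[0] * P[0][1] + alpha[1] * P[1][1],
            ]
        alpha = [alpha[s] * normal_pdf(x, means[s], sds[s]) for s in range(2)]
        norm = alpha[0] + alpha[1]
        loglik += log(norm)
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
        prev = pos
    return loglik
```

With this parameterization, closely spaced probes tend to share a state, while probes separated by a large gap are nearly independent. Making `means` and `sds` depend on the array design would be one way to reflect the design-specific emissions that the abstract mentions.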
Statistical models for longitudinal analysis of single and mixed species infections
Abstract. Statistical models for longitudinal analysis of single and mixed species infections, by Kathryn Louise Colborn. Doctor of Philosophy in Biostatistics, University of California, Berkeley. Professor Terence P. Speed, Chair. There are numerous examples of infectious diseases that are caused by various species of the same pathogen. Some examples include Lyme disease, malaria, Leishmaniasis, Dengue fever, and Ehrlichiosis. The advancement of laboratory methods has facilitated more sensitive detection of mixed species infections in humans, which has resulted in a surge of research focusing on the effects of mixed infections on clinical outcomes. Cross-sectional blood samples compared with clinical outcome measures provide a limited scope of the interactions between species. It is important to study these infections in humans longitudinally, and within their natural environments, in order to develop an understanding of the complex relationships between hosts, pathogens and vectors of transmission. Papua New Guinea is a country with high prevalence of both Plasmodium falciparum and P. vivax, two species of parasites that can cause malaria. It is well known that these two parasites can cause severe morbidity and mortality independently, but there has not been conclusive evidence of the effect of mixed P. falciparum and P. vivax infections on clinical symptoms. Children under age five are at highest risk of experiencing adverse outcomes from Plasmodium infections. In 2006, a cohort study was implemented to conduct an investigation of the effects of mixed P. falciparum and P. vivax infections on clinical episodes of malaria in children living in a rural area of Papua New Guinea. The data collected from this study are used throughout this dissertation to address both the epidemiological questions of the study investigators and to present statistical models for analyzing longitudinal malaria data and mixed species infections
Statistical Methods for Dose-Response Assays
Dose-response assays are a common and increasingly high throughput method of assessing the toxicity of potential drug targets on test populations of cells. Such assays typically involve serial dilutions of the compounds in question applied to cell samples to determine the level of cell activity across a broad range of concentrations. Another factor in such experiments may be the change in activity under different enzyme combinations and the use of controls to adjust for interassay variations. Typically, the decreasing number in the population of cells due to increasing concentrations of the drug can be modeled with a logistic curve. Since the appropriate range of concentrations of the drug to test in order to see these reactions cannot be predetermined fully, frequently the data available for a given experimental unit may not be enough to fit such curves on their own successfully. Instead, the assay data as a whole can be successfully analyzed with methods such as constrained fitting and mixed effects models, where each set can borrow strength from each other in order to be fitted while still taking into account individual significances of the specific experiment. This dissertation illustrates variations of such methods on three major datasets from the Joe Gray lab at Lawrence Berkeley National Laboratory, the Douglas Clark lab at the Department of Chemical Engineering at UC Berkeley, and Bionovo, Inc. of Emeryville, California. The first dataset from breast cancer cell line testing involves the estimation of the National Cancer Institute concentration parameter called GI50, the concentration of the drug at which it inhibits the growth of the population of cells by half. We develop a method utilizing replicate data to estimate this parameter. 
The second dataset involves the estimation of a more commonly used concentration parameter called the IC50, which doesn't take into account the initial cell population, on special assays that mimic the liver metabolism in the body. The methods involve mixed effects models that incorporate the specific enzyme conditions and types of cells important to the experiment. The third dataset, involving different effects of plant compounds on an osteosarcoma cell line, illustrates the usage of negative and positive controls to appropriately adjust the observations for interassay variation
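The difference between the two concentration parameters can be made concrete with a small Python sketch. It assumes a four-parameter logistic curve for the cell count and takes the IC50 target as the midpoint of the curve, while the GI50 target is halfway between the time-zero count and the untreated control count. The function names, the numbers and the bisection solver are illustrative assumptions, not the dissertation's constrained or mixed-effects fitting methods.

```python
def logistic_response(conc, bottom, top, ic50, hill):
    """Four-parameter logistic curve: the response falls from `top`
    (no drug) towards `bottom` as the concentration rises."""
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill)


def solve_concentration(target, curve, lo=1e-6, hi=1e6, iters=200):
    """Bisection on a log scale for the concentration at which a
    decreasing curve reaches `target`."""
    if not (curve(hi) <= target <= curve(lo)):
        raise ValueError("target response is outside the bracketed range")
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric midpoint of the bracket
        if curve(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def ic50_target(bottom, top):
    """Midpoint of the fitted curve; ignores the starting population."""
    return bottom + 0.5 * (top - bottom)


def gi50_target(time_zero, control):
    """Count at which net growth since time zero is half of the
    untreated control's growth (assumed definition)."""
    return time_zero + 0.5 * (control - time_zero)


if __name__ == "__main__":
    # Hypothetical curve: control count 100, full kill at high dose.
    curve = lambda c: logistic_response(c, bottom=0.0, top=100.0, ic50=1.0, hill=1.0)
    print(solve_concentration(ic50_target(0.0, 100.0), curve))
    print(solve_concentration(gi50_target(20.0, 100.0), curve))
```

Because the GI50 target depends on the time-zero count, the two parameters generally differ for the same curve, which is why the choice between them matters when comparing assays.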
Statistical Aspects of ChIP-Seq Data Analysis
ChIP-Seq experiments combine the recently developed next-generation sequencing technology with the established chromatin immunoprecipitation assays to study the interactions between various classes of proteins and DNA in the cell nucleus. The experiments consist of isolating the protein-DNA complexes from the nucleus, enriching the pool of DNA fragments for those bound to the protein of interest, and sequencing the resulting pool of fragments, producing millions of short reads that can be aligned to the genome. Despite the fact that the ChIP-Seq technology has been developed very recently, a great number of studies have been carried out on the DNA binding of a variety of transcription factors in different species and tissue types. ChIP-Seq approaches have also been used to study cellular epigenomic states such as histone modifications. As with any nascent technology, a number of methodological issues need to be addressed before a proper data analysis pipeline for ChIP-Seq can be established. Some of the issues that need to be addressed are image processing and analysis, alignment of the reads to a genome or a subset of it, and identifying the signal sites along the genome. This work focuses on the issue of signal identification, the problem known as peak-finding in the literature. We describe the data-generating process for ChIP-Seq experiments and review properties of the data and various sources of biases in Chapter 1. We then review various approaches to peak-finding in Chapter 2. We provide a detailed overview of some common strategies, their relative advantages and disadvantages, and describe the statistical models used by some popular peak-finding tools. We formalize the conceptual framework of peak-finding by introducing the notions of enrichment measures and enrichment statistics and categorize various peak-finders in terms of this framework.
We discuss in some detail the different kinds of control samples used in ChIP-Seq experiments, and how they are incorporated into the peak-finding procedure. We also address the important issue of validation in the context of ChIP-Seq experiments and the shortcomings of the currently available validation approaches. In Chapter 3 we propose a novel peak-finding strategy for experiments involving transcription factor binding that lack appropriate control samples (so-called one-sample experiments). Our approach accounts for genomic sequence biases in the data, namely the GC and mappability effects, and utilizes the knowledge of the shape of the read density profile in the vicinity of the true binding sites. We use deduced sets of true positive and true negative enriched regions to demonstrate that our approach is better at removing non-specifically enriched regions from the set of identified binding sites than other one-sample approaches and provides a superior spatial resolution to most examined peak-finders. Finally, in Chapter 4 we discuss the important issue of combining data from replicate samples. We discuss different kinds of replicates common in the ChIP-Seq literature and the standard approaches used to integrate data across replicates. We develop several diagnostic plots for assessing whether the standard assumption of Poisson variance holds and observe that the assumption can break down even for technical replicates due to flow cell-specific sequence composition effects
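One simple numeric check of the Poisson variance assumption is the variance-to-mean ratio of read counts across replicates, which should be close to 1 if the assumption holds. The Python sketch below is a stand-in for the diagnostic plots the abstract mentions, not a reproduction of them. The bin layout, function names and threshold are hypothetical, and the counts are assumed to come from replicates of comparable sequencing depth.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean across replicates.
    Values well above 1 suggest more variance than Poisson allows."""
    m = mean(counts)
    if m == 0:
        return float("nan")  # undefined when no reads were observed
    return variance(counts) / m


def flag_overdispersed(bin_counts, threshold=2.0):
    """Return the bins whose dispersion index exceeds `threshold`.

    bin_counts: dict mapping a genomic bin id to the list of read counts
    observed in each replicate (hypothetical layout).
    """
    flagged = {}
    for bin_id, counts in bin_counts.items():
        if len(counts) < 2:
            continue  # a variance needs at least two replicates
        index = dispersion_index(counts)
        if index > threshold:
            flagged[bin_id] = index
    return flagged


if __name__ == "__main__":
    example = {
        "bin1": [4, 6, 5, 5],
        "bin2": [1, 9, 2, 8],
    }
    print(flag_overdispersed(example))
```

With only a handful of replicates the ratio is noisy for any single bin, so it is more informative when examined across many bins at once.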
