1,721,090 research outputs found
Pseudobulk with proper offsets has the same statistical properties as generalized linear mixed models in single-cell case-control studies
Motivation Generalized linear mixed models (GLMMs), such as the negative-binomial or Poisson linear mixed model, are widely applied to single-cell RNA sequencing data to compare transcript expression between different conditions determined at the subject level. However, the model is computationally intensive, and its relative statistical performance to pseudobulk approaches is poorly understood. Results We propose offset-pseudobulk as a lightweight alternative to GLMMs. We prove that a count-based pseudobulk equipped with a proper offset variable has the same statistical properties as GLMMs in terms of both point estimates and standard errors. We confirm our findings using simulations based on real data. Offset-pseudobulk is substantially faster (>x10) and numerically more stable than GLMMs.Y
A theory-based practical solution to correct for sex-differential participation bias
Most genomic cohorts are retrospective where the exposures and outcomes are predetermined prior to sample collection. Therefore, a spurious association between an exposure and an outcome can arise if both variables affect study participation. Such concerns were raised in previous studies questioning the representativeness of the UK Biobank. Recently, a genome-wide association study (GWAS) on biological sex found many autosomal hits and non-negligible autosomal heritability which the authors attribute to selection bias. In this study, we propose a simple and a practical method that can overcome sex-driven selection bias based on theoretical analysis and simulations. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-022-02703-0
Exploration of errors in variance caused by using the first-order approximation in Mendelian randomization
Mendelian randomization (MR) uses genetic variation as a natural experiment to investigate the causal effects of modifiable risk factors (exposures) on outcomes. Two-sample Mendelian randomization (2SMR) is widely used to measure causal effects between exposures and outcomes via genome-wide association studies. 2SMR can increase statistical power by utilizing summary statistics from large consortia such as the UK Biobank. However, the first-order term approximation of standard error is commonly used when applying 2SMR. This approximation can underestimate the variance of causal effects in MR, which can lead to an increased false-positive rate. An alternative is to use the second-order approximation of the standard error, which can considerably correct for the deviation of the first-order approximation. In this study, we simulated MR to show the degree to which the first-order approximation underestimates the variance. We show that depending on the specific situation, the first-order approximation can underestimate the variance almost by half when compared to the true variance, whereas the second-order approximation is robust and accurate.Y
Hap-seq: An Optimal Algorithm for Haplotype Phasing with Imputation Using Sequencing Data
Inference of haplotypes, or the sequence of alleles along each chromosome, is a fundamental problem in genetics and is important for many analyses, including admixture mapping, identifying regions of identity by descent, and imputation. Traditionally, haplotypes are inferred from genotype data obtained from microarrays using information on population haplotype frequencies inferred from either a large sample of genotyped individuals or a reference dataset such as the HapMap. Since the availability of large reference datasets, modern approaches for haplotype phasing along these lines are closely related to imputation methods. When applied to data obtained from sequencing studies, a straightforward way to obtain haplotypes is to first infer genotypes from the sequence data and then apply an imputation method. However, this approach does not take into account that alleles on the same sequence read originate from the same chromosome. Haplotype assembly approaches take advantage of this insight and predict haplotypes by assigning the reads to chromosomes in such a way that minimizes the number of conflicts between the reads and the predicted haplotypes. Unfortunately, assembly approaches require very high sequencing coverage and are usually not able to fully reconstruct the haplotypes. In this work, we present a novel approach, Hap-seq, which is simultaneously an imputation and assembly method that combines information from a reference dataset with the information from the reads using a likelihood framework. Our method applies a dynamic programming algorithm to identify the predicted haplotype, which maximizes the joint likelihood of the haplotype with respect to the reference dataset and the haplotype with respect to the observed reads. We show that our method requires only low sequencing coverage and can reconstruct haplotypes containing both common and rare alleles with higher accuracy compared to the state-of-the-art imputation methods.Y
Analysis of differences in human leukocyte antigen between the two wellcome trust case control consortium control datasets
© 2019, Korea Genome Organization.The Wellcome Trust Case Control Consortium (WTCCC) study was a large genome-wide association study that aimed to identify common variants associated with seven diseases. That study combined two control datasets (58C and UK Blood Services) as shared controls. Prior to using the combined controls, the WTCCC performed analyses to show that the genomic content of the control datasets was not significantly different. Recently, the analysis of human leukocyte antigen (HLA) genes has become prevalent due to the development of HLA imputation technology. In this project, we extended the between-control homogeneity analysis of the WTCCC to HLA. We imputed HLA information in the WTCCC control dataset and showed that the HLA content was not significantly different between the two control datasets, suggesting that the combined controls can be used as controls for HLA fine-map-ping analysis based on HLA imputation.Y
Structural Alignment Of Pseudoknotted Rna
In this paper, we address the problem of discovering novel non-coding RNA (ncRNA) using primary sequence, and secondary structure conservation, focusing on ncRNA families with pseudoknotted structures. Our main technical result is an efficient algorithm for computing an optimum structural alignment of an RNA sequence against a genomic substring. This algorithm has two applications. First, by scanning a genome, we can identify novel (homologous) pseudoknotted ncRNA, and second, we can infer the secondary structure of the target aligned sequence. We test an implementation of our algorithm (PAL) and show that it has near-perfect behavior for predicting the structure of many known pseudoknots. Additionally, it can detect the true homologs with high sensitivity and specificity in controlled tests. We also use PAL to search entire viral genome and mouse genome for novel homologs of some viral and eukaryotic pseudoknots, respectively. In each case, we have found strong support for novel homologs. © Mary Ann Liebert, Inc. 2008
MicroPredict: predicting species-level taxonomic abundance of whole-shotgun metagenomic data using only 16S amplicon sequencing data
Background The importance of the human microbiome in the analysis of various diseases is emerging. The two main methods used to profile the human microbiome are 16S rRNA gene sequencing (16S sequencing) and whole-genome shotgun sequencing (WGS). Owing to the full coverage of the genome in sequencing, WGS has multiple advantages over 16S sequencing, including higher taxonomic profiling resolution at the species-level and functional profiling analysis. However, 16S sequencing remains widely used because of its relatively low cost. Although WGS is the standard method for obtaining accurate species-level data, we found that 16S sequencing data contained rich information to predict high-resolution species-level abundances with reasonable accuracy.Objective In this study, we proposed MicroPredict, a method for accurately predicting WGS-comparable species-level abundance data using 16S taxonomic profile data.Methods We employed a mixed model using two key strategies: (1) modeling both sample- and species-specific information for predicting WGS abundances, and (2) accounting for the possible correlations among different species.Results We found that MicroPredict outperformed the other machine learning methods.Conclusion We expect that our approach will help researchers accurately approximate the species-level abundances of microbiome profiles in datasets for which only cost-effective 16S sequencing has been applied.Y
IPED: Inheritance Path-based Pedigree Reconstruction Algorithm Using Genotype Data
The problem of inference of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. Current methods, unfortunately, are very time-consuming and inaccurate for complicated pedigrees, such as pedigrees with inbreeding. In this work, we propose an efficient algorithm that is able to reconstruct large pedigrees with reasonable accuracy. Our algorithm reconstructs the pedigrees generation by generation, backward in time from the extant generation. We predict the relationships between individuals in the same generation using an inheritance path-based approach implemented with an efficient dynamic programming algorithm. Experiments show that our algorithm runs in linear time with respect to the number of reconstructed generations, and therefore, it can reconstruct pedigrees that have a large number of generations. Indeed it is the first practical method for reconstruction of large pedigrees from genotype data
PASTRY: achieving balanced power for detecting risk and protective minor alleles in meta-analysis of association studies with overlapping subjects
Background
Meta-analysis is a statistical method that combines the results of multiple studies to increase statistical power. When multiple studies participating in a meta-analysis utilize the same public dataset as controls, the summary statistics from these studies become correlated. To solve this challenge, Lin and Sullivan proposed a method to provide an optimal test statistic adjusted for the correlation. This method quickly became the standard practice. However, we identified an unexpected power asymmetry phenomenon in this standard framework. This can lead to unbalanced power for detecting protective minor alleles and risk minor alleles.
Results
We found that the power asymmetry of the current framework is mainly due to the errors in approximating the correlation term. We then developed a meta-analysis method based on an accurate correlation estimator, called PASTRY (A method to avoid Power ASymmeTRY). PASTRY outperformed the standard method on both simulated and real datasets in terms of the power symmetry.
Conclusions
Our findings suggest that PASTRY can help to alleviate the power asymmetry problem. PASTRY is available at https://github.com/hanlab-SNU/PASTRY.This work was supported by the National Research Foundation of Korea (NRF) (Grant number 2022R1A2B5B02001897) funded by the Korean government, Ministry of Science, and ICT. This work was also supported by the Creative-Pioneering Researchers Program funded by Seoul National University (SNU). This study was supported by the BK21 FOUR Biomedical Science Program at Seoul National University (SNU)
- …
