Collection Of Biostatistics Research Archive
1589 research outputs found
Power Calculation for Cross-Sectional Stepped Wedge Cluster-Randomized Trials with Variable Cluster Sizes
Standard sample size calculation formulas for Stepped Wedge Cluster Randomized Trials (SW-CRTs) assume that cluster sizes are equal. When cluster sizes vary substantially, ignoring this variation may lead to an under-powered study. We investigate the efficiency of a SW-CRT with varying cluster sizes relative to one with equal cluster sizes, and derive variance estimators for the intervention effect that account for this variation under the assumption of a mixed effects model, a commonly used approach for analyzing data from cluster randomized trials. When cluster sizes vary, the power of a SW-CRT depends on the order in which clusters receive the intervention, which is determined through randomization. We first derive a variance formula that corresponds to any particular realization of the randomized sequence and propose efficient algorithms to identify upper and lower bounds of the power. We then obtain an "expected" power based on a first-order approximation to the variance formula, where the expectation is taken with respect to all possible randomization sequences. Finally, we provide a variance formula for more general settings where only the mean and coefficient of variation of cluster sizes, instead of exact cluster sizes, are known in the design stage. We evaluate our methods through simulations and illustrate that the power of a SW-CRT decreases as the variation in cluster sizes increases, and the impact is largest when the number of clusters is small.
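A minimal sketch of how the variance for one particular randomization sequence might be computed. This is not the authors' formula or algorithm: it assumes a random-intercept linear mixed model on cluster-period means with fixed period effects (a Hussey and Hughes style model), known variance components, and a constant number of individuals per period within each cluster. The function names, the argument names (`sizes`, `starts`, `sigma2`, `tau2`) and the numeric inputs are all hypothetical.

```python
import math
from statistics import NormalDist


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def treatment_variance(sizes, starts, n_periods, sigma2, tau2):
    """GLS variance of the intervention effect for one randomization sequence.

    sizes[i]  : individuals sampled per period in cluster i (assumed constant)
    starts[i] : first period (0-indexed) in which cluster i is treated
    sigma2    : individual-level residual variance (assumed known)
    tau2      : between-cluster random-intercept variance (assumed known)
    """
    p = n_periods + 1  # one fixed effect per period, plus the intervention effect
    info = [[0.0] * p for _ in range(p)]
    for n_i, start in zip(sizes, starts):
        a = sigma2 / n_i                   # variance of a cluster-period mean given the cluster
        c = tau2 / (a + n_periods * tau2)  # from the inverse of a*I + tau2*J
        rows = [[1.0 if k == j else 0.0 for k in range(n_periods)]
                + [1.0 if j >= start else 0.0]
                for j in range(n_periods)]
        col_sums = [sum(r[k] for r in rows) for k in range(p)]
        for r in range(p):
            for s in range(p):
                ztz = sum(row[r] * row[s] for row in rows)
                info[r][s] += (ztz - c * col_sums[r] * col_sums[s]) / a
    return invert(info)[p - 1][p - 1]


def power(theta, variance, alpha=0.05):
    """Approximate two-sided Wald-test power for a true effect theta."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(theta) / math.sqrt(variance) - z)
```

Calling `treatment_variance` with the same `sizes` but different `starts` orderings shows how the variance, and hence the power, can depend on which clusters cross over first when sizes are unequal; with equal sizes the ordering makes no difference in this sketch.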
The genomic landscape of molecular responses to natural drought stress in Panicum hallii.
Environmental stress is a major driver of ecological community dynamics and agricultural productivity. This is especially true for soil water availability, because drought is the greatest abiotic inhibitor of worldwide crop yields. Here, we test the genetic basis of drought responses in the genetic model for C4 perennial grasses, Panicum hallii, through population genomics, field-scale gene-expression (eQTL) analysis, and comparison of two complete genomes. While gene expression networks are dominated by local cis-regulatory elements, we observe three genomic hotspots of unlinked trans-regulatory loci. These regulatory hubs are four times more drought responsive than the genome-wide average. Additionally, cis- and trans-regulatory networks are more likely to have opposing effects than expected under neutral evolution, supporting a strong influence of compensatory evolution and stabilizing selection. These results implicate trans-regulatory evolution as a driver of drought responses and demonstrate the potential for crop improvement in drought-prone regions through modification of gene regulatory networks
Analysis of Covariance (ANCOVA) in Randomized Trials: More Precision, Less Conditional Bias, and Valid Confidence Intervals, Without Model Assumptions
Covariate adjustment in the randomized trial context refers to an estimator of the average treatment effect that adjusts for chance imbalances between study arms in baseline variables (called “covariates”). The baseline variables could include, e.g., age, sex, disease severity, and biomarkers. According to two surveys of clinical trial reports, there is confusion about the statistical properties of covariate adjustment. We focus on the ANCOVA estimator, which involves fitting a linear model for the outcome given the treatment arm and baseline variables, and trials with equal probability of assignment to treatment and control. We prove the following new (to the best of our knowledge) robustness property of ANCOVA to arbitrary model misspecification: Not only is the ANCOVA point estimate consistent (as proved by Yang and Tsiatis (2001)) but so is its standard error. This implies that confidence intervals and hypothesis tests conducted as if the linear model were correct are still valid even when the linear model is arbitrarily misspecified, e.g., when the baseline variables are nonlinearly related to the outcome or there is treatment effect heterogeneity. We also give a simple, robust formula for the variance reduction (equivalently, sample size reduction) from using ANCOVA. By re-analyzing completed randomized trials for mild cognitive impairment, schizophrenia, and depression, we demonstrate how ANCOVA can reduce variance, reduce bias conditional on chance imbalance, and increase power even when by chance there is perfect balance across arms in the baseline variables.
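As a small illustration of what "adjusting for chance imbalance" means, the sketch below computes the ANCOVA point estimate for a single baseline covariate. It uses the standard identity that the treatment coefficient in an ordinary least squares fit of outcome on arm and covariate equals the difference in arm means minus the pooled within-arm slope times the covariate imbalance. It does not reproduce the paper's standard error or variance-reduction results, and the function name and data are hypothetical.

```python
from statistics import mean


def ancova_estimate(y, a, x):
    """Return (unadjusted, adjusted) treatment effect estimates.

    y : outcomes, a : arm indicators (1 = treatment, 0 = control),
    x : a single baseline covariate.
    """
    ybar = {g: mean(yi for yi, ai in zip(y, a) if ai == g) for g in (0, 1)}
    xbar = {g: mean(xi for xi, ai in zip(x, a) if ai == g) for g in (0, 1)}
    # Pooled within-arm slope of outcome on covariate.
    sxy = sum((xi - xbar[ai]) * (yi - ybar[ai]) for yi, ai, xi in zip(y, a, x))
    sxx = sum((xi - xbar[ai]) ** 2 for ai, xi in zip(a, x))
    slope = sxy / sxx
    unadjusted = ybar[1] - ybar[0]
    adjusted = unadjusted - slope * (xbar[1] - xbar[0])
    return unadjusted, adjusted


# Hypothetical data generated as y = 2*a + 3*x with an imbalance in x.
a = [1, 1, 1, 0, 0, 0]
x = [1, 2, 6, 1, 2, 3]
y = [2 * ai + 3 * xi for ai, xi in zip(a, x)]
unadjusted, adjusted = ancova_estimate(y, a, x)
```

In this toy data set the treatment arm happens to have a higher mean covariate value, so the unadjusted difference in means overstates the effect, while the adjusted estimate recovers the value used to generate the data.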
Technical Considerations in the Use of the E-value
The E-value is defined as the minimum strength of association on the risk ratio scale that an unmeasured confounder would have to have with both the exposure and the outcome, conditional on the measured covariates, to explain away the observed exposure-outcome association. We have elsewhere proposed that the reporting of E-values for estimates and for the limit of the confidence interval closest to the null become routine whenever causal effects are of interest. A number of questions have arisen about the use of the E-value, including questions concerning the interpretation of the relevant confounding association parameters, the nature of the transformation from the risk ratio scale to the E-value scale, inference for and using E-values, and the relation to Rosenbaum’s notion of design sensitivity. Here we bring these various questions together and provide responses that we hope will assist in the interpretation of E-values and will further encourage their use.
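The abstract does not state the transformation itself. The sketch below uses the formula commonly given in the E-value literature for a risk ratio RR ≥ 1, namely RR + sqrt(RR × (RR − 1)), with protective estimates inverted first. The function names and the numbers in the usage lines are illustrative assumptions.

```python
import math


def e_value(rr):
    """E-value for a risk ratio estimate (assumed formula: RR + sqrt(RR*(RR-1)))."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # protective estimates are inverted before transforming
    return rr + math.sqrt(rr * (rr - 1))


def e_value_ci(rr, lower, upper):
    """E-value for the confidence limit closest to the null (RR = 1)."""
    if lower <= 1 <= upper:
        return 1.0  # the interval already includes the null
    limit = lower if rr > 1 else upper
    return e_value(limit)


# Hypothetical estimate: RR = 2.0 with 95% CI (1.2, 3.3).
point = e_value(2.0)                # 2 + sqrt(2), about 3.41
bound = e_value_ci(2.0, 1.2, 3.3)   # 1.2 + sqrt(0.24), about 1.69
```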
A SPLINE-ASSISTED SEMIPARAMETRIC APPROACH TO NONPARAMETRIC MEASUREMENT ERROR MODELS
Nonparametric estimation of the probability density function of a random variable measured with error is considered to be a difficult problem, in the sense that depending on the measurement error property, the estimation rate can be as slow as the logarithm of the sample size. Likewise, nonparametric estimation of the regression function with errors in the covariate suffers the same possibly slow rate. The traditional methods for both problems are based on deconvolution, where the slow convergence rate is caused by the quick convergence to zero of the Fourier transform of the measurement error density, which, unfortunately, appears in the denominators during the construction of these methods. Using a completely different approach of spline-assisted semiparametric methods, we are able to construct nonparametric estimators of both density functions and regression mean functions that achieve the same nonparametric convergence rate as in the error-free case. Other than requiring the error-prone variable distribution to be compactly supported, our assumptions are not stronger than in the classical deconvolution literature. The performance of these methods is demonstrated through some simulations and a data example.
Heterologous Expression of Secreted Bacterial BPP and HAP Phytases in Plants Stimulates Arabidopsis thaliana Growth on Phytate.
Phytases are specialized phosphatases capable of releasing inorganic phosphate from myo-inositol hexakisphosphate (phytate), which is highly abundant in many soils. As inorganic phosphorus reserves decrease over time in many agricultural soils, genetic manipulation of plants to enable secretion of potent phytases into the rhizosphere has been proposed as a promising approach to improve plant phosphorus nutrition. Several families of biotechnologically important phytases have been discovered and characterized, but little data are available on which phytase families can offer the most benefits toward improving plant phosphorus intake. We have developed transgenic Arabidopsis thaliana plants expressing bacterial phytases PaPhyC (HAP family of phytases) and 168phyA (BPP family) under the control of the root-specific inducible promoter Pht1;2. The effects of each phytase expression on growth, morphology and inorganic phosphorus accumulation in plants grown on phytate hydroponically or in perlite as the only source of phosphorus were investigated. The highest enzymatic activity for both phytases was detected in cell wall-bound fractions of roots, indicating that these enzymes were efficiently secreted. Expression of both bacterial phytases in roots improved plant growth on phytate and resulted in larger rosette leaf area and diameter, higher phosphorus content and increased shoot dry weight, implying that these plants were indeed capable of utilizing phytate as the source of phosphorus for growth and development. When grown on phytate, the HAP-type phytase outperformed its BPP-type counterpart for plant biomass production, though this effect was only observed in hydroponic conditions and not in perlite. Furthermore, we found no evidence of adverse side effects of microbial phytase expression in A. thaliana on plant physiology and seed germination.
Our data highlight important functional differences between these members of bacterial phytase families and indicate that future crop biotechnologies involving such enzymes will require a very careful evaluation of phytase source and activity. Overall, our data suggest feasibility of using bacterial phytases to improve plant growth in conditions of phosphorus deficiency and demonstrate that inducible expression of recombinant enzymes should be investigated further as a viable approach to plant biotechnology
Concentrations of criteria pollutants in the contiguous U.S., 1979 – 2015: Role of model parsimony in integrated empirical geographic regression
BACKGROUND: National- or regional-scale prediction models that estimate individual-level air pollution concentrations commonly include hundreds of geographic variables. However, so many variables may not be necessary, and a parsimonious approach that includes a small number of variables may achieve sufficient predictive ability. This parsimonious approach can also be applied to most criteria pollutants. This approach will be powerful when generating publicly available datasets of model predictions that support research in environmental health and other fields. OBJECTIVES: We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants, for all years with regulatory monitoring data during 1979 – 2015; (2) explore the impact of model parsimony on model performance by comparing model performance across the number of variables offered to a model; and (3) provide publicly available model predictions. METHODS: We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979 – 2015. We also compute ~900 geographic characteristics at each location including measures of traffic, land use, and satellite-based estimates of air pollution and landcover. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of independent variables. For all pollutants and years, we compare three approaches for choosing variables to include in the model: (1) no variables (kriging only), (2) a limited number of variables chosen by forward selection, and (3) all variables. We evaluate model performance using 10-fold cross-validation (CV) with both conventional randomly-selected and spatially-clustered test data.
RESULTS: Models using 3 to 30 variables generally have the best performance across all pollutants and years (median R2 conventional [clustered] CV: 0.66 [0.47]) compared to models with no (0.37 [0]) or all variables (0.64 [0.27]). Using the best models mostly including 3-30 variables, we predicted annual-average concentrations of six criteria pollutants for all Census Blocks in the contiguous U.S.
DISCUSSION: Our findings suggest that national prediction models can be built on only a small number (30 or fewer) of important variables and provide robust concentration estimates. Model estimates are freely available online
Incorporating Historical Models with Adaptive Bayesian Updates
This paper considers Bayesian approaches for incorporating information from a historical model into a current analysis when the historical model includes only a subset of covariates currently of interest. The statistical challenge is two-fold. First, the parameters in the nested historical model are not generally equal to their counterparts in the larger current model, neither in value nor interpretation. Second, because the historical information will not be equally informative for all parameters in the current analysis, additional regularization may be required beyond that provided by the historical information. We propose several novel extensions of the so-called power prior that adaptively combine a prior based upon the historical information with a variance-reducing prior that shrinks parameter values toward zero. The ideas are directly motivated by our work building mortality risk prediction models for pediatric patients receiving extracorporeal membrane oxygenation, or ECMO. We have developed a model on a registry-based cohort of ECMO patients and now seek to expand this model with additional biometric measurements, not available in the registry, collected on a small auxiliary cohort. Our adaptive priors are able to leverage the efficiency of the original model and identify novel mortality risk factors. We support this with a simulation study, which demonstrates the potential for efficiency gains in estimation under a variety of scenarios
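For readers unfamiliar with the basic power prior that the abstract extends, here is a deliberately simplified sketch. It is not the adaptive prior proposed in the paper: it assumes a single normal mean with known variance, a flat initial prior, and a fixed discounting weight `a0` between 0 and 1 applied to the historical likelihood. All names and numbers are hypothetical.

```python
def power_prior_posterior(ybar0, n0, ybar, n, sigma2, a0):
    """Posterior mean and variance of a normal mean under a fixed-weight power prior.

    ybar0, n0 : historical sample mean and size
    ybar, n   : current sample mean and size
    sigma2    : known sampling variance (assumed common to both data sets)
    a0        : weight in [0, 1]; 0 ignores the historical data, 1 pools it fully
    """
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must lie in [0, 1]")
    effective_n = a0 * n0 + n  # historical data count as a0 * n0 observations
    post_mean = (a0 * n0 * ybar0 + n * ybar) / effective_n
    post_var = sigma2 / effective_n
    return post_mean, post_var


# Hypothetical inputs: a large historical cohort and a small current one.
ignore_hist = power_prior_posterior(1.0, 400, 2.0, 50, 4.0, a0=0.0)
half_weight = power_prior_posterior(1.0, 400, 2.0, 50, 4.0, a0=0.5)
full_pool = power_prior_posterior(1.0, 400, 2.0, 50, 4.0, a0=1.0)
```

Raising `a0` pulls the posterior mean toward the historical estimate and shrinks the posterior variance, which is the efficiency-versus-bias tradeoff that motivates choosing the weight adaptively.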
ROBUST ESTIMATION OF THE AVERAGE TREATMENT EFFECT IN ALZHEIMER'S DISEASE CLINICAL TRIALS
The primary analysis of Alzheimer's disease clinical trials often involves a mixed-model repeated measures (MMRM) approach. We consider another estimator of the average treatment effect, called targeted minimum loss based estimation (TMLE). This estimator is more robust to violations of assumptions about missing data than MMRM.
We compare TMLE versus MMRM by analyzing data from a completed Alzheimer's disease trial and by simulation studies. The simulations involved different missing data distributions, where loss to follow-up at a given visit could depend on baseline variables, treatment assignment, and the outcome measured at previous visits. The TMLE generally has improved robustness in our simulated settings, i.e., less bias and mean squared error, and better confidence interval coverage probability. The robustness is due to the TMLE correctly modeling the dropout distribution. We illustrate the tradeoffs between these estimators and give recommendations for how to use these estimators in practice.
Robust Inference for the Stepped Wedge Design
Based on a permutation argument, we derive a closed form expression for an estimate of the treatment effect, along with its standard error, in a stepped wedge design. We show that these estimates are robust to misspecification of both the mean and covariance structure of the underlying data-generating mechanism, thereby providing a robust approach to inference for the treatment effect in stepped wedge designs. We use simulations to evaluate the type I error and power of the proposed estimate and to compare the performance of the proposed estimate to the optimal estimate when the correct model specification is known. The limitations, possible extensions, and open problems regarding the method are discussed