Collection Of Biostatistics Research Archive
Statistical Inference for Networks of High-Dimensional Point Processes
Fueled in part by recent applications in neuroscience, high-dimensional Hawkes processes have become a popular tool for modeling the network of interactions among multivariate point process data. While evaluating the uncertainty of the network estimates is critical in scientific applications, existing methodological and theoretical work has focused only on estimation. To bridge this gap, this paper proposes a high-dimensional statistical inference procedure with theoretical guarantees for multivariate Hawkes processes. Key to this inference procedure is a new concentration inequality on the first- and second-order statistics for integrated stochastic processes, which summarize the entire history of the process. We apply this concentration inequality, combined with a recent result on martingale central limit theory, to give an upper bound for the convergence rate of the test statistics. We verify our theoretical results with extensive simulation and an application to a neuron spike train data set.
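As a minimal illustration of the kind of model the abstract refers to, the sketch below evaluates the conditional intensity of one component of a linear multivariate Hawkes process. The exponential transfer kernel, the function name `hawkes_intensity`, and all parameter names are assumptions made for illustration; the abstract does not state which kernels or parameterization the paper uses.

```python
import math

def hawkes_intensity(t, i, mu, alpha, beta, events):
    """Conditional intensity of component i at time t.

    Assumed (hypothetical) parameterization:
      mu[i]       baseline rate of component i
      alpha[i][j] strength with which past events of j excite i
      beta        decay rate of the exponential kernel
      events[j]   list of event times observed for component j
    """
    lam = mu[i]
    for j, times in enumerate(events):
        for s in times:
            if s < t:  # only strictly earlier events contribute
                lam += alpha[i][j] * math.exp(-beta * (t - s))
    return lam

# Two components; component 1 excites component 0 and vice versa.
mu = [0.5, 0.2]
alpha = [[0.0, 0.8], [0.3, 0.0]]
events = [[1.0], [0.5, 2.0]]
rate = hawkes_intensity(2.0, 0, mu, alpha, 1.0, events)
```

In this parameterization a nonzero `alpha[i][j]` corresponds to a directed edge from component `j` to component `i`, which is the network structure whose uncertainty an inference procedure would need to quantify.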
Unified Methods for Feature Selection in Large-Scale Genomic Studies with Censored Survival Outcomes
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes, which provide insight into the disease's process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins, resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to under- or overestimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² indices for the PH and PO models that can be interpreted in terms of explained variation. Lastly, we propose a generalized pseudo-R² measure that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature expression. We evaluate the performance of our measures using extensive simulation studies and publicly available data sets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
The proposed information divergence, R² and pseudo-R² measures were implemented in R (www.R-project.org) and code is available upon request.
Inferring a consensus problem list using penalized multistage models for ordered data
A patient's medical problem list describes his or her current health status and aids in the coordination and transfer of care between providers, among other things. Because a problem list is generated once and then subsequently modified or updated, what is not usually observable is the provider effect. That is, to what extent does a patient's problem in the electronic medical record actually reflect a consensus communication of that patient's current health status? To that end, we report on and analyze a unique interview-based design in which multiple medical providers independently generate problem lists for each of three patient case abstracts of varying clinical difficulty. Due to the uniqueness of both our data and the scientific objectives of our analysis, we apply and extend so-called multistage models for ordered lists and equip the models with variable selection penalties to induce sparsity. Each problem has a corresponding non-negative parameter estimate, interpreted as a relative log-odds ratio, with larger values suggesting greater importance and zero values suggesting unimportant problems. We use these fitted penalized models to quantify and report the extent of consensus. For the three case abstracts, the proportions of problems with model-estimated non-zero log-odds ratios were 10/28, 16/47, and 13/30. Physicians exhibited consensus on the highest ranked problems in the first and last case abstracts, but agreement quickly deteriorated; in contrast, physicians broadly disagreed on the relevant problems for the middle and most difficult case abstract.
Burned Area Mapping of an Escaped Fire into Tropical Dry Forest in Western Madagascar Using Multi-Season Landsat OLI Data
A human-induced fire cleared a large area of tropical dry forest near the Ankoatsifaka Research Station at Kirindy Mitea National Park in western Madagascar over several weeks in 2013. Fire is a major factor in the disturbance and loss of global tropical dry forests, yet remotely sensed mapping studies of fire-impacted tropical dry forests lag behind fire research of other forest types. Methods used to map burns in temperate forests may not perform as well in tropical dry forests, where it can be difficult to distinguish between multiple-age burn scars and between bare soil and burns. In this study, the extent of forest lost to stand-replacing fire in Kirindy Mitea National Park was quantified using both spectral and textural information derived from multi-date satellite imagery. The total area of the burn was 18,034 ha. It is estimated that 6% (4761 ha) of the Park’s primary tropical dry forest burned over the period 23 June to 27 September. Half of the forest burned (2333 ha) was lost in the large conflagration adjacent to the Research Station. The best model for burn scar mapping in this highly seasonal tropical forest and pastoral landscape included the differenced Normalized Burn Ratio (dNBR) and both uni- and multi-temporal measures of greenness. Lessons for burn mapping of tropical dry forest are much the same as for tropical dry forest mapping: highly seasonal vegetation combined with a mixture of background spectral information makes for a complicated analysis and may require multi-temporal imagery and site-specific techniques.
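For readers unfamiliar with the dNBR index named in the abstract, the following sketch computes it from pre- and post-fire reflectance using the standard definition NBR = (NIR - SWIR2) / (NIR + SWIR2) and dNBR = NBR(pre) - NBR(post). On Landsat 8 OLI these correspond to bands 5 and 7. The function names and sample reflectance values are hypothetical, and the abstract does not describe the study's preprocessing or thresholds, so none are assumed here.

```python
import numpy as np

def nbr(nir, swir2):
    """Normalized Burn Ratio from NIR and SWIR2 reflectance arrays."""
    nir = np.asarray(nir, dtype=float)
    swir2 = np.asarray(swir2, dtype=float)
    denom = nir + swir2
    out = np.full(denom.shape, np.nan)  # NaN where the ratio is undefined
    np.divide(nir - swir2, denom, out=out, where=denom != 0)
    return out

def dnbr(nir_pre, swir2_pre, nir_post, swir2_post):
    """Differenced NBR: pre-fire NBR minus post-fire NBR."""
    return nbr(nir_pre, swir2_pre) - nbr(nir_post, swir2_post)

# Hypothetical reflectances for two pixels: one burned, one unchanged.
nir_pre, swir2_pre = [0.4, 0.4], [0.1, 0.1]
nir_post, swir2_post = [0.1, 0.4], [0.3, 0.1]
change = dnbr(nir_pre, swir2_pre, nir_post, swir2_post)
```

Higher dNBR values indicate a larger drop in NBR between the two dates, which is generally read as more severe burning; an unchanged pixel gives a value near zero.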
OPTIMIZED ADAPTIVE ENRICHMENT DESIGNS FOR MULTI-ARM TRIALS: LEARNING WHICH SUBPOPULATIONS BENEFIT FROM DIFFERENT TREATMENTS
We consider the problem of designing a randomized trial for comparing two treatments versus a common control in two disjoint subpopulations. The subpopulations could be defined in terms of a biomarker or disease severity measured at baseline. The goal is to determine which treatments benefit which subpopulations. We develop a new class of adaptive enrichment designs tailored to solving this problem. Adaptive enrichment designs involve a preplanned rule for modifying enrollment based on accruing data in an ongoing trial. The proposed designs have preplanned rules for stopping accrual of treatment-by-subpopulation combinations, either for efficacy or futility. The motivation for this adaptive feature is that interim data may indicate that a subpopulation, such as those with lower disease severity at baseline, is unlikely to benefit from a particular treatment while uncertainty remains for the other treatment and/or subpopulation. We optimize these adaptive designs to have the minimum expected sample size under power and Type I error constraints. We compare the performance of the optimized adaptive design versus an optimized non-adaptive (single stage) design. Our approach is demonstrated in simulation studies that mimic features of a completed trial of a medical device for treating heart failure. The optimized adaptive design has 25% smaller expected sample size compared to the optimized non-adaptive design; however, the cost is that the optimized adaptive design has 8% greater maximum sample size. Open-source software that implements the trial design optimization is provided, allowing users to investigate the tradeoffs in using the proposed adaptive versus standard designs.
Default Priors for the Intercept Parameter in Logistic Regressions
In logistic regression, separation refers to the situation in which a linear combination of predictors perfectly discriminates the binary outcome. Because finite-valued maximum likelihood parameter estimates do not exist under separation, Bayesian regressions with informative shrinkage of the regression coefficients offer a suitable alternative. Little attention has been given to whether and how to shrink the intercept parameter. Based upon classical studies of separation, we argue that efficiency in estimating regression coefficients may vary with the intercept prior. We adapt alternative prior distributions for the intercept that downweight implausibly extreme regions of the parameter space, making estimation less sensitive to separation. Through simulation and the analysis of exemplar datasets, we quantify differences across priors stratified by established statistics measuring the degree of separation. Relative to diffuse priors, our recommendations generally result in more efficient estimation of the regression coefficients themselves when the data are nearly separated. They are equally efficient in non-separated datasets, making them suitable for default use. Modest differences were observed with respect to out-of-sample discrimination. Our work also highlights the interplay between priors for the intercept and the regression coefficients: numerical results are more sensitive to the choice of intercept prior when using a weakly informative prior on the regression coefficients than an informative shrinkage prior.
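To make the notion of separation concrete, the sketch below checks complete separation in the simplest case of a single predictor and shows numerically why no finite maximum likelihood estimate exists: on separated data the log-likelihood keeps increasing as the slope grows. The helper names and the toy data are hypothetical, and this one-predictor check is only a special case of the general definition in the abstract.

```python
import numpy as np

def is_completely_separated(x, y):
    """True if a single threshold on x splits the y=0 and y=1 groups."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    x0, x1 = x[y == 0], x[y == 1]
    if x0.size == 0 or x1.size == 0:
        raise ValueError("both outcome classes must be present")
    return bool(x0.max() < x1.min() or x1.max() < x0.min())

def log_likelihood(intercept, slope, x, y):
    """Logistic log-likelihood, computed stably via logaddexp."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    signed = (2 * y - 1) * (intercept + slope * x)  # margin per observation
    return float(-np.sum(np.logaddexp(0.0, -signed)))

# Toy data: every y=0 observation lies below every y=1 observation.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]
separated = is_completely_separated(x, y)
logliks = [log_likelihood(0.0, s, x, y) for s in (1.0, 10.0, 100.0)]
```

Because `logliks` rises toward zero without ever reaching it, the likelihood has no finite maximizer, which is the situation that motivates placing priors on the coefficients and, as the abstract argues, on the intercept.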
Cross-sectional HIV Incidence Estimation Accounting for Heterogeneity Across Communities
PHASE II ADAPTIVE ENRICHMENT DESIGN TO DETERMINE THE POPULATION TO ENROLL IN PHASE III TRIALS, BY SELECTING THRESHOLDS FOR BASELINE DISEASE SEVERITY
We propose and evaluate a two-stage, phase 2, adaptive clinical trial design. Its goal is to determine whether future phase 3 (confirmatory) trials should be conducted, and if so, which population should be enrolled. The population selected for phase 3 enrollment is defined in terms of a disease severity score measured at baseline. We optimize the phase 2 trial design and analysis in a decision theory framework. Our utility function represents a combination of the cost of conducting phase 3 trials and, if the phase 3 trials are successful, the improved health of the future population minus the cost of treatment. Given such a utility function and a discrete prior distribution on the conditional treatment effect, we compute the Bayes optimal adaptive design. The resulting design is compared to simpler designs in simulation studies. We also apply the designs to resampled data from a completed phase 2 trial evaluating a new surgical intervention for stroke.
Nitrogen (N) Dynamics in the Mineral Soil of a Central Appalachian Hardwood Forest During a Quarter Century of Whole-Watershed N Additions. Ecosystems
The structure and function of terrestrial ecosystems are maintained by processes that vary with temporal and spatial scale. This study examined temporal and spatial patterns of net nitrogen (N) mineralization and nitrification in mineral soil of three watersheds at the Fernow Experimental Forest, WV: two untreated watersheds and one watershed receiving aerial applications of N over a 25-year period. Soil was sampled to 5 cm from each of seven plots per watershed and placed in two polyethylene bags: one bag was brought to the laboratory for extraction/analysis, and the other was incubated in situ at a 5 cm depth monthly during the growing seasons of 1993–1995, 2002, 2005, and 2007–2014. Spatial patterns of net N mineralization and nitrification changed in all watersheds, but were especially evident in the treated watershed, with spatial variability changing non-monotonically, increasing then decreasing markedly. These results support a prediction of the N homogeneity hypothesis that increasing N loads will increase spatial homogeneity in N processing. Temporal patterns for net N mineralization and nitrification were similar for all watersheds, with rates increasing about 25–30% from 1993 to 1995, decreasing by more than 50% by 2005, and then increasing significantly to 2014. The best predictor of these synchronous temporal patterns across all watersheds was the number of degree days below 19°C, a value similar to published temperature maxima for net rates of N mineralization and nitrification for these soils. The lack of persistent, detectable differences in net nitrification between watersheds is surprising because fertilization has maintained higher stream-water nitrate concentrations than in the reference watersheds. Lack of differences in net nitrification among watersheds suggests that N-enhanced stream-water nitrate following N fertilization may be the result of a reduced biotic demand for nitrate following fertilization with ammonium sulfate.
Challenges for assessing vertebrate diversity in turbid Saharan water-bodies using environmental DNA
The Sahara desert is the largest warm desert in the world and a poorly explored area. Small water-bodies occur across the desert and are crucial habitats for vertebrate biodiversity. Environmental DNA (eDNA) is a powerful tool for species detection and is being increasingly used to conduct biodiversity assessments. However, there are a number of difficulties with sampling eDNA from such turbid water-bodies, and it is often not feasible to rely on electrical tools in remote desert environments. We trialled a manually powered filtering method in Mauritania, using pre-filtration to circumvent problems posed by turbid water in remote arid areas. Of nine vertebrate species expected in the water-bodies, four were detected visually, two via metabarcoding, and one via both methods. Difficulties filtering turbid water led to severe constraints, limiting the sampling protocol to only one sampling point per study site, which alone may largely explain why many of the expected vertebrate species were not detected. The amplification of human DNA using general vertebrate primers is also likely to have contributed to the low number of taxa identified. Here we highlight a number of challenges that need to be overcome to successfully conduct metabarcoding eDNA studies for vertebrates in desert environments in Africa.