Recommended from our members

Semi-Parametric Methods for Missing Data and Causal Inference

Author: Sun Baoluo
Publication date: 2017

In this dissertation, we propose methodology to account for missing data as well as a strategy to account for outcome heterogeneity. Missing data occurs frequently in empirical studies in health and social sciences, often compromising our ability to make accurate inferences. An outcome is said to be missing not at random (MNAR) if, conditional on the observed variables, the missing data mechanism still depends on the unobserved outcome. In such settings, identification is generally not possible without imposing additional assumptions. Identification is sometimes possible, however, if an exogenous instrumental variable (IV) is observed for all subjects such that it satisfies the exclusion restriction that the IV affects the missingness process without directly influencing the outcome. In chapter 1, we provide necessary and sufficient conditions for nonparametric identification of the full data distribution under MNAR with the aid of an IV. In addition, we give sufficient identification conditions that are more straightforward to verify in practice. For inference, we focus on estimation of a population outcome mean, for which we develop a suite of semiparametric estimators that extend methods previously developed for data missing at random. Specifically, we propose inverse probability weighted estimation, outcome regression-based estimation and doubly robust estimation of the mean of an outcome subject to MNAR. For illustration, the methods are used to account for selection bias induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer characteristics such as gender, age and years of experience as IVs. The development of coherent missing data models to account for nonmonotone missing at random (MAR) data by inverse probability weighting (IPW) remains to date largely unresolved. As a consequence, IPW has essentially been restricted for use only in monotone MAR settings. 
In chapter 2, we propose a class of models for nonmonotone missing data mechanisms that spans the MAR model, while allowing the underlying full data law to remain unrestricted. For parametric specifications within the proposed class, we introduce an unconstrained maximum likelihood estimator for estimating the missing data probabilities which is easily implemented using existing software. To circumvent potential convergence issues with this procedure, we also introduce a constrained Bayesian approach to estimate the missing data process which is guaranteed to yield inferences that respect all model restrictions. The efficiency of standard IPW estimation is improved by incorporating information from incomplete cases through an augmented estimating equation which is optimal within a large class of estimating equations. We investigate the finite-sample properties of the proposed estimators in extensive simulations and illustrate the new methodology in an application evaluating key correlates of preterm delivery for infants born to HIV infected mothers in Botswana, Africa. When a risk factor affects certain categories of a multinomial outcome but not others, outcome heterogeneity is said to be present. A standard epidemiologic approach for modeling risk factors of a categorical outcome typically entails fitting a polytomous logistic regression via maximum likelihood estimation. In chapter 3, we show that standard polytomous regression is ill-equipped to detect outcome heterogeneity, and will generally understate the degree to which such heterogeneity may be present. Specifically, nonsaturated polytomous regression will often a priori rule out the possibility of outcome heterogeneity from its parameter space. As a remedy, we propose to model each category of the outcome as a separate binary regression. 
For full efficiency, we propose to estimate the collection of regression parameters jointly by a constrained Bayesian approach which ensures that one remains within the multinomial model. The approach is straightforward to implement in standard software for Bayesian estimation.

Biostatistics; missing data; causal inference; semi-parametric theory; statistics; biostatistics

Harvard University - DASH
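
As a point of reference for the three estimator families named in the abstract above (inverse probability weighting, outcome regression, and doubly robust estimation), the following minimal Python sketch computes each for a population mean in the simpler missing-at-random setting with a single binary covariate. It does not implement the dissertation's instrumental-variable approach for MNAR data; the toy records, the function name, and the saturated per-stratum working models are assumptions made purely for illustration.

```python
from collections import defaultdict

# Toy records: (x, r, y) with binary covariate x, response indicator r,
# and outcome y (None when missing). Values are made up for illustration.
records = [
    (0, 1, 1.0), (0, 1, 3.0), (0, 0, None), (0, 0, None),
    (1, 1, 10.0), (1, 1, 12.0),
]

def mean_estimates(data):
    """Return (complete-case, IPW, outcome-regression, doubly robust) means.

    Assumes every covariate stratum contains at least one observed outcome.
    """
    n_by_x = defaultdict(int)
    obs_by_x = defaultdict(list)
    for x, r, y in data:
        n_by_x[x] += 1
        if r == 1:
            obs_by_x[x].append(y)

    # Saturated working models: one parameter per covariate stratum.
    pi = {x: len(obs_by_x[x]) / n_by_x[x] for x in n_by_x}        # P(R=1 | X=x)
    m = {x: sum(obs_by_x[x]) / len(obs_by_x[x]) for x in n_by_x}  # E[Y | R=1, X=x]

    n = len(data)
    observed = [y for _, r, y in data if r == 1]
    complete_case = sum(observed) / len(observed)
    ipw = sum(y / pi[x] for x, r, y in data if r == 1) / n
    outcome_reg = sum(m[x] for x, _, _ in data) / n
    # Augmented IPW: the IPW term minus an augmentation term that has
    # mean zero when the response model is correct.
    aipw = sum(
        (y / pi[x] if r == 1 else 0.0) - (r - pi[x]) / pi[x] * m[x]
        for x, r, y in data
    ) / n
    return complete_case, ipw, outcome_reg, aipw

cc, ipw, oreg, dr = mean_estimates(records)
print(cc, ipw, oreg, dr)
```

With saturated working models the three adjusted estimates coincide on this toy data, while the complete-case mean is pulled toward the fully observed stratum.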

Semi-supervised and Representation Learning for Improved Classification and Stratification in EHR Data

Author: Wang Linshanshan
Publication date: 2026

The rapid digitization of healthcare has given rise to vast repositories of electronic health record (EHR) data, offering unprecedented opportunities for data-driven advancements in disease prediction, patient stratification, and clinical decision-making. However, the high dimensionality, sparsity, and heterogeneity of EHR data present unique statistical and computational challenges. Moreover, the scarcity of high-quality labels, due to the cost and complexity of manual annotation, further complicates supervised modeling efforts. This dissertation addresses these challenges through a unified framework of semi-supervised learning and representation learning for improved classification and stratification in EHR data, with applications to phenotyping, disability prediction, and patient subgroup discovery. The overarching goal of this work is to develop scalable, robust, and interpretable methods that leverage both labeled and unlabeled EHR data, improve generalizability across populations, and uncover clinically meaningful structure in complex disease settings. The dissertation is composed of three interrelated papers, each tackling a key methodological bottleneck in modern EHR-based machine learning: (1) evaluating model performance under distributional shift, (2) learning rich patient representations in the presence of limited labels, and (3) stratifying heterogeneous patient populations using outcome-informed embeddings. In Chapter 1, we consider the problem of evaluating the performance of binary classifiers when labeled data are unavailable in a target population. This setting is common in clinical phenotyping tasks, where models are trained using limited chart-reviewed labels in one cohort and then applied to other cohorts with potentially different covariate distributions. We propose STEAM (Semi-supervised Transfer lEarning of Accuracy Measures), a doubly robust estimation procedure for receiver operating characteristic (ROC) parameters under covariate shift. 
STEAM combines calibrated density ratio weighting with robust outcome imputation, using both unlabeled source and target data to improve efficiency while protecting against model misspecification. Through theoretical guarantees and empirical results, we demonstrate that STEAM enables accurate performance assessment in unlabeled target populations, with applications to phenotyping models in rheumatoid arthritis on a temporally evolving EHR cohort. Building on the challenge of label scarcity, Chapter 2 shifts focus to semi-supervised representation learning for predictive modeling. We propose SCORE (Semi-supervised Clustering thrOugh REpresentation learning), a generative embedding framework that models the joint distribution of high-dimensional EHR features using a multivariate Poisson-LogNormal distribution, with pretrained code embeddings capturing semantic relationships between clinical concepts. SCORE integrates limited labeled data via a hybrid Expectation-Maximization and Gaussian Variational Approximation algorithm, enabling efficient and theoretically sound inference in large-scale, partially labeled cohorts. We show that SCORE produces informative and transferable patient embeddings, improving prediction of disability status in multiple sclerosis (MS) and outperforming conventional supervised and unsupervised methods. Finally, Chapter 3 addresses the critical task of patient stratification in heterogeneous diseases. We focus on Alzheimer’s disease (AD), where progression and prognosis vary substantially with age. We propose SOLAR (age-Specific Outcome-guided representation Learning for pAtient clusteRing), a novel clustering framework that incorporates time-to-event outcomes and explicitly models age-group structure using a multitask learning paradigm. SOLAR jointly learns low-dimensional patient representations across age groups, encouraging shared structure while allowing age-specific flexibility. 
By integrating survival information and modeling age-related heterogeneity, SOLAR identifies clinically meaningful AD subtypes with distinct prognostic profiles, improving both interpretability and clinical utility over existing age-unaware or outcome-agnostic methods. Together, these three works present a cohesive framework for semi-supervised and representation learning in EHR analysis. The methods developed here contribute new strategies for evaluating, predicting, and stratifying patient outcomes in data-scarce, high-dimensional clinical settings. In doing so, they aim to advance the broader goals of personalized medicine and evidence-based healthcare by making machine learning more robust, scalable, and clinically relevant.

Biostatistics

Harvard University - DASH

Statistical Methods for Missing Data in Electronic Health Records-based Research

Author: Thaweethai Tanayott
Publication date: 2021

Because conducting large-scale, long-term randomized studies is prohibitively expensive and time-consuming, researchers have turned to observational studies using electronic health records (EHR) for answers. EHR include rich data on large populations over long periods of time and are available at relatively low cost. However, data are not collected for research purposes, and secondary analyses of EHR are subject to various challenges and biases. Specifically, the potential for selection bias is high when analyses are restricted to patients with complete data. Approaching selection bias as a missing data problem, one could apply standard methods, such as inverse probability weighting (IPW) and multiple imputation (MI), to adjust for selection. However, these methods fail to address the complex nature of EHR data, particularly the interplay of numerous decisions by patients, physicians, and insurers that collectively determine whether complete data is observed. One recently proposed method for addressing this issue involves breaking down the complex process that governs whether or not a patient has complete data into a series of more manageable sub-mechanisms. This method involves characterizing the data provenance, or the process by which data originates and appears in the EHR. If a clinician is interested in measuring BMI among patients 24 months after undergoing bariatric surgery, it might be the case that for a patient to have complete data in this context, they must: (1) be actively enrolled in their health plan at 24 months after surgery, (2) have a clinical encounter at 24 months, and (3) have their BMI measured at the encounter. Statistical models can then be built for 'selection' (i.e., being in the positive state) at each of the three sub-mechanisms. A framework for estimation and inference within this context has been developed in which IPW is used to adjust for selection at every sub-mechanism. 
This research proposal expands upon the existing framework by introducing ‘blended analysis’ strategies that give researchers the flexibility to apply MI and IPW simultaneously to control for selection bias. It has been previously demonstrated that there can be gains in efficiency when MI and IPW are used simultaneously. For a given missingness sub-mechanism in the modularized specification of the data provenance, rather than using IPW to adjust for selection of patients with complete data for a specific covariate, a researcher might consider imputing missing values of that covariate instead. In the first chapter, we introduce a robust variance estimation method when combining IPW with MI, and apply this strategy to an EHR-based study of bariatric surgery, weight loss, and chronic kidney disease. In the second chapter, we introduce the blended analysis framework, establishing estimation procedures under this framework. Throughout, we apply these methods to the DURABLE (DURAtion of Bariatric Long Term Effects) study, a large, ongoing, NIH-funded, multi-center retrospective cohort study investigating the health outcomes of patients who undergo bariatric surgery. While it is widely accepted that Roux-en-Y gastric bypass surgery (RYGB) leads to greater weight loss than vertical sleeve gastrectomy (VSG), there are concerns that the risks of RYGB are greater, especially among patients with chronic kidney disease at baseline. Using EHR, we examine whether the weight loss advantage of RYGB compared to VSG persists among subjects with chronic kidney disease. In general, IPW and MI-based methods fail to produce consistent estimates when data are MNAR; that is, when the probability that a given covariate is not measured depends on the value of the covariate itself, or on other factors that are only partially observed in EHR. Further, the assumption researchers must make as to whether data is or is not MNAR is statistically untestable. 
Rigorous sensitivity analyses are therefore needed to measure the extent to which estimators yielded by our methods are impacted by unobserved data. This is the focus of the third chapter.

Harvard University - DASH
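
The modular idea described in the abstract above can be sketched as follows: if each sub-mechanism (enrollment, encounter, BMI measurement) has its own estimated selection probability, one natural weight for a complete case is the inverse of their product. The snippet below is only an illustration under that assumption; the probabilities, the record layout, and the function names are hypothetical and are not taken from the dissertation or from the DURABLE study.

```python
from math import prod

# Hypothetical complete cases: (bmi, [p_enrolled, p_encounter, p_measured]),
# where each probability is conditional on passing the earlier sub-mechanisms.
complete_cases = [
    (30.0, [0.5, 0.5, 1.0]),
    (36.0, [1.0, 0.5, 1.0]),
]

def modular_weight(sub_probs):
    """Inverse of the product of sub-mechanism selection probabilities."""
    return 1.0 / prod(sub_probs)

def weighted_mean(cases):
    """Normalized (Hajek-type) IPW mean over complete cases."""
    weights = [modular_weight(p) for _, p in cases]
    total = sum(w * bmi for w, (bmi, _) in zip(weights, cases))
    return total / sum(weights)

print([modular_weight(p) for _, p in complete_cases])  # [4.0, 2.0]
print(weighted_mean(complete_cases))                   # 32.0
```

In practice each probability would come from a fitted model for its sub-mechanism; the fixed numbers here only show how the pieces combine.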

Generalizability Methods for Estimating Causal Population Effects

Author: Degtiar Irina
Publication date: 2021

Studies are often performed in samples that do not resemble the target populations relevant for policy, treatment, or other decisions. Much of the causal inference literature has focused on addressing internal validity bias; however, both internal and external validity are necessary for unbiased estimates in a target population. The generalizability methods presented in this thesis allow for inference on the population of interest rather than the one in the study. Chapter 1 presents a framework for addressing external validity bias, including a synthesis of approaches for generalizability and transportability, the assumptions they require, as well as tests for the heterogeneity of treatment effects and differences between study and target populations. The chapter concludes with practical guidance for researchers and practitioners. Chapter 2 presents an innovative class of estimators, conditional cross-design synthesis (CCDS), for combining randomized and observational data to eliminate their respective external and internal validity biases. CCDS uses the region of covariate overlap between data types to remove potential unmeasured confounding bias in the observational data in order to extend inference beyond the support of the randomized data to the full target population. We derive outcome regression, propensity weighting, and double robust approaches under the CCDS framework. We illustrate the methods to estimate the causal effect of health insurance plans on cost among New York City Medicaid enrollees. Chapter 3 introduces novel approaches for generalizing from an evaluation study of a voluntary intervention to estimate population average treatment effects for future treated individuals, which can accommodate nonparametric outcome regression approaches such as Bayesian Additive Regression Trees and Bayesian Causal Forests. 
The generalizability approach incorporates uncertainty regarding target population treated group membership into the posterior credible intervals to better reflect the uncertainty of scaling up a voluntary intervention. In a simulation based on real data, we estimate impacts of a national scale-up of a voluntary health policy model and highlight the benefit of using flexible regression approaches for generalizability.

Harvard University - DASH
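
One simple way to see why study and target-population estimates can differ, as the abstract above discusses, is to standardize stratum-specific effects to different covariate distributions. The sketch below is generic post-stratification, not the CCDS estimators proposed in the thesis; the strata, effect sizes, and proportions are invented for illustration, and the calculation assumes the effect within each stratum is the same in the study and in the target population.

```python
# Hypothetical stratum-specific treatment effects estimated in the study sample.
stratum_effects = {"young": 2.0, "old": 6.0}

# Hypothetical covariate distributions in the study and in the target population.
study_mix = {"young": 0.75, "old": 0.25}
target_mix = {"young": 0.25, "old": 0.75}

def standardized_effect(effects, mix):
    """Average stratum-specific effects over a covariate distribution."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    return sum(effects[s] * share for s, share in mix.items())

study_ate = standardized_effect(stratum_effects, study_mix)
target_ate = standardized_effect(stratum_effects, target_mix)
print(study_ate, target_ate)  # 3.0 5.0
```

The same stratum-level effects give different averages only because the two populations weight the strata differently, which is the external validity gap that generalizability methods aim to close.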

Bayesian Causal Inference With Intermediates

Author: Comment Leah Andrews
Publication date: 2019

Causal inference from observational data can be complicated for a number of reasons, including complex functional forms for covariates, partially missing or wholly unmeasured confounders, and truncating events which obscure effects on the outcome of interest. In these instances, it can be useful to look at intermediate variables to disentangle the causal effect of a treatment or exposure on the primary outcome of interest. Moreover, the intermediates can themselves become outcomes of interest as they become targets of public health intervention or metrics for quality of care. This dissertation explores the use of intermediate variables in two settings. In Chapter 1, we introduce a data-driven sensitivity analysis method. This Bayesian data fusion (BDF) procedure synthesizes information across multiple data sources to correct for confounding by a variable which is unmeasured in the main data set. We demonstrate this method for unmeasured exposure-induced mediator-outcome confounding in the context of Black-White racial disparities in colorectal cancer. In Chapters 2 and 3, we turn to the problem of understanding hospital readmissions among late-stage pancreatic cancer patients. Readmissions are a common proxy indicator for quality of care, but they can be truncated by death in a problem referred to as semicompeting risks. Chapter 2 lays out a general causal framework for semicompeting risks that is rooted in principal stratification. We motivate two new causal estimands: the time-varying survivor average causal effect (TV-SACE) and the restricted mean survivor average causal effect (RM-SACE). We also introduce a Bayesian estimation procedure which accommodates individual-level latent frailties, and we demonstrate its application in an evaluation of home support among newly diagnosed pancreatic cancer patients. 
Chapter 3 proposes a nonparametric estimation procedure for the TV-SACE and RM-SACE based on Bayesian Additive Regression Trees (BART), which allows for treatment effect heterogeneity with embedded interaction terms in the branches of the trees. With this newfound flexibility, we revisit the data analysis of Chapter 2 to understand how the changing composition of latent principal strata drives population-level effects and how heterogeneity informs individualized recommendations. Chapter 4 concludes with a discussion of unifying themes and future research directions.

Biostatistics

Harvard University - DASH

Correcting for Biases Arising in Epidemiologic Research

Author: Peskoe Sarah B.
Publication date: 2019

In chapter 1, we explore the performance of naive least squares estimators for latency parameters in linear models in the presence of measurement error. We prove that in many scenarios under a general measurement error setting, the least squares estimator for the latency parameter remains consistent, while the regression coefficient estimates are inconsistent, as has previously been found in standard measurement error models where the primary disease model does not involve a latency parameter. Conditions under which this result holds are generalized to a wide class of covariance structures and mean functions. The findings are illustrated in a study of body mass index in relation to physical activity in the Health Professionals Follow-up Study. In chapter 2, we extend the results obtained in chapter 1 to the survival setting when the exposure of interest is a time-varying recent-moving cumulative average. We show that when the disease outcome is rare, the latency parameter for a surrogate exposure is approximately the same as the latency parameter for the corresponding true exposure. We show these results in a series of simulations and illustrate the findings in a study of air pollution and incident lung cancer in the Nurses' Health Study. In chapter 3, we specify a statistical framework for estimation and inference based on inverse probability weighting (IPW) to adjust for selection bias in EHR-based research that allows for a hierarchy of missingness mechanisms to better align with the complex nature of electronic health record (EHR) data. We show that this estimator is consistent and asymptotically Normal, and we derive the form of the asymptotic variance. We use simulations to highlight the potential for bias in EHR studies when standard approaches are used to account for selection bias. We use this approach to adjust for selection in an ongoing, multi-site EHR-based study of bariatric surgery on BMI.

Biostatistics; Bias Correction; Epidemiologic Research

Harvard University - DASH

Bayesian Causal Inference for Estimating Impacts of Air Pollution Exposure

Author: Liao Shirley X.
Publication date: 2019

Estimation of the causal effect of air pollution exposure on population health measures poses unique challenges. One commonly used method for estimating causal effects on such data is propensity score analysis (PSA), which controls for confounding in a "design" stage where propensity scores (PS) are estimated and implemented. Our first paper addresses uncertainty in the design stage of PSA and formulates a probability distribution for the design-stage output in order to lend a degree of formality to Bayesian methods for PSA (BPSA) that have gained attention in recent literature. A procedure for obtaining the posterior distribution of causal effects after marginalizing over a distribution of design-stage outputs is then deployed in an investigation of the association between levels of fine particulate air pollution and elevated exposure to emissions from coal-fired power plants. In order to address seasonality in air pollution emissions, as well as time-varying confounding which occurs from weather and climate variables, our second paper extends two procedures for estimating the average treatment effect on the overlap population (ATO), which may be estimated with less bias and less variability over replications than the average treatment effect over the general population (ATE) via inverse probability weighting (IPW) or stabilized weighting (SW) when low covariate overlap exists in the data. An analysis using these methods is performed on Medicare beneficiaries residing across 18,480 zip codes in the U.S. to evaluate the effect of coal-fired power plant emissions exposure on ischemic heart disease hospitalization, accounting for seasonal patterns that lead to change in treatment over time. 
Our third paper addresses non-linear confounding and higher-order interactions which may exist in the relationship between ozone exposure and violent criminal activity by performing an analysis using Bayesian additive regression trees (BART), a powerful machine learning procedure able to model complex, non-linear relationships. This study employs time-series data from 6 cities in the US (Chicago, NYC, Atlanta, Philadelphia, Phoenix, LA) from 2009 to 2018 in order to estimate the causal effect of ozone exposure above NAAQS standards for air quality, as well as of a continuous causal effect of ozone exposure on violent crime rates.

Biostatistics

Harvard University - DASH
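
To make the contrast between ATE and ATO weighting in the abstract above concrete, the sketch below compares standard inverse probability weights with overlap weights, which weight treated units by 1 - e(x) and control units by e(x), where e(x) is the propensity score. Overlap weighting is one common route to the ATO and is assumed here for illustration; the abstract does not say which procedures the paper extends, and the units, propensity scores, and function names below are made up.

```python
# Hypothetical units: (treated, propensity score e, outcome y).
units = [
    (1, 0.75, 8.0), (1, 0.25, 4.0),
    (0, 0.75, 2.0), (0, 0.25, 6.0),
]

def ipw_weight(t, e):
    """Inverse probability weight targeting the ATE; unbounded as e nears 0 or 1."""
    return 1.0 / e if t == 1 else 1.0 / (1.0 - e)

def overlap_weight(t, e):
    """Overlap weight targeting the ATO; always between 0 and 1."""
    return 1.0 - e if t == 1 else e

def weighted_difference(data, weight_fn):
    """Difference of normalized weighted outcome means, treated minus control."""
    means = {}
    for arm in (1, 0):
        pairs = [(weight_fn(t, e), y) for t, e, y in data if t == arm]
        means[arm] = sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)
    return means[1] - means[0]

ato_estimate = weighted_difference(units, overlap_weight)
ate_estimate = weighted_difference(units, ipw_weight)
print(ato_estimate, ate_estimate)
```

Because overlap weights never exceed 1, units with propensity scores near 0 or 1 cannot dominate the estimate, which is the usual motivation for preferring the ATO when covariate overlap is poor.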

Robust and Efficient Machine Learning Methods for the Analysis of Electronic Medical Records Data

Author: Gronsbell Jessica Lynn
Publication date: 2019

In the last decade, electronic medical records (EMR) have emerged as a powerful tool to store and process health data worldwide. Though primarily implemented to improve the quality of patient care, EMR have simultaneously generated a promising data source for clinical and translational research, particularly when linked to specimen bio-repositories. However, much of the data stored in routine practice is difficult to make use of in secondary applications. The first step in recycling EMR data for research, identifying patients with specific diseases of interest or so-called phenotyping, has proven to be especially challenging due to the time intensiveness of obtaining validated disease status information. Typically, gold standard phenotype labels obtained from manual chart review are only available for a small training set nested in a large cohort. In contrast, information on a large number of clinical predictors of the phenotype is available for all subjects. To improve the robustness and efficiency of phenotyping, this thesis proposes semi-supervised learning (SSL) methods that fully leverage the auxiliary information contained in the predictors as well as an unsupervised feature selection method that does not rely on any gold standard labels. Chapter 1 proposes a semi-supervised approach for efficient evaluation of prediction performance measures for a binary classifier. In Chapters 2 and 3, I extend the SSL paradigm to settings where the gold standard labels are not randomly selected from the underlying pool of data as is typically assumed in the SSL literature in the context of estimating and evaluating prediction rules. I conclude with Chapter 4 where I introduce a feature selection procedure based entirely on unlabeled data.

Biostatistics

Harvard University - DASH

Topics in Cluster-Correlated Data: Design, Informativeness, and Misclassification

Author: McGee Glen William
Publication date: 2019

Cluster-correlated data are ubiquitous in biomedical research and introduce a number of methodological challenges. Motivated by applications in healthcare policy and epidemiology, this dissertation addresses three such problems. The first chapter considers the hospital-profiling setting, where quality-of-care is assessed on the basis of patient-level outcomes, clustered within hospitals. The latter two chapters are motivated by multigenerational studies, wherein interest lies in the effect of exposures on subsequent generations, with children clustered within families. In Chapter 1, we propose an outcome-dependent sampling solution to a health policy problem. Hospital readmission is a key marker of quality of care used by the Centers for Medicare and Medicaid Services to determine hospital reimbursement rates. Analyses of readmission are based on a generalized linear mixed model (GLMM) that permits estimation of hospital-specific measures while adjusting for case-mix differences. Recent moves to address health disparities call for expanding case-mix adjustment to include measures of socioeconomic status while minimizing burden to hospitals associated with data collection. We propose that detailed socioeconomic data be collected on a sub-sample of patients via a cluster-stratified case-control design paired with pseudo-maximum likelihood estimation. In simulations, the proposed approach proves highly efficient when interest lies in either fixed or random components of a GLMM and covariates are unobserved or expensive to collect. In the motivating study of Medicare beneficiaries, the proposed framework provides a means of mitigating disparities in terms of which hospitals are deemed underperformers relative to a naive analysis that fails to adjust for missing case-mix variables. 
We then shift our attention to multigenerational studies, which are susceptible to informative cluster size—occurring when the number of children to a mother (the cluster size) is related to their outcomes, given covariates. A natural question then emerges: what if some women bear no children at all? The impact of these potentially informative empty clusters is currently unknown, and Chapter 2 first evaluates the performance of standard methods for informative cluster size when cluster size is permitted to be zero. We find that if the informative cluster size mechanism induces empty clusters, standard methods lead to biased estimates of target parameters. Joint models of outcome and size are capable of valid conditional inference as long as empty clusters are explicitly included in the analysis, but in practice empty clusters regularly go unacknowledged. By contrast, estimating equation approaches necessarily omit empty clusters and therefore yield biased estimates of marginal effects. We thus propose a joint marginalized approach that readily incorporates empty clusters and, even in their absence, permits more intuitive interpretations of population-averaged effects than do current methods. Multigenerational studies require many years of follow-up, so exposures are often assessed retrospectively to maximize the number of observable generations—introducing recall bias and mis-measurement. Chapter 3 investigates exposure misclassification when cluster size is potentially informative, and in particular when misclassification is differential by cluster size. First, we show that misclassification in an exposure related to cluster size can induce informativeness even when cluster size would otherwise be non-informative. Second, we show that misclassification that is differential by informative cluster size can not only attenuate estimates of exposure effects but even inflate or reverse the sign of estimates. 
To correct for bias in estimating marginal parameters, we propose: (i) an approximate expected estimating equations framework, and (ii) an observed likelihood framework for joint marginalized models of cluster size and outcomes. Although the focus is on estimating marginal parameters, a corollary is that the observed likelihood approach permits valid inference for conditional parameters as well.

Biostatistics

Harvard University - DASH

Reclaiming island food systems for nutrition and planetary health

Author: Marrero Abrania Dinora
Publication date: 2022

Dietary colonialism in Caribbean small islands has driven declines in nutrition and cardiometabolic disease risk, resulting from the marginalization of traditional food cultures and subsequent food import dependence. Shifts away from traditional food production also undermine environmental sustainability, with resulting climatic changes threatening island food system stability. In Puerto Rico, smallholder agriculture holds promise in serving as a cornerstone to re-localized food supplies, providing fresh foods in local communities’ informal economies while managing varying ecological and climatic constraints. Informed by these findings, the following work sought to 1) characterize nutrient availability and environmental impacts of current diets in Latin America and the Caribbean, 2) identify island-food-system-specific dietary patterns and determine their associations with metabolic syndrome in Puerto Rico, and 3) narrate experiences in climate resilience among smallholder farmers to inform sustainable island food system re-localization. Findings pointed to the capacity of neo-traditional, plant-sourced diets in Puerto Rico and similar island settings to minimize environmental harm, preserve food cultures, and augment cardiometabolic health. Neo-traditional food systems leverage local landscapes and tight-knit social networks to make small-scale agriculture nutritious, sustainable, and climate resilient. Multisectoral approaches are needed to reduce highly processed, unsustainable food supplies in small islands. Ultimately, these efforts must leverage collective action and local food cultures to reclaim a more equitable, sustainable, and community-owned food system.

Harvard University - DASH