Applications of Semi-parametric Estimation Methods in Causal Inference and Prediction
In this thesis, we argue for the use of loss-based semi-parametric estimation methods as an alternative to traditional parametric models in causal inference and prediction. We present a brief discussion on "black box" epidemiology in the first chapter and argue that risk factor epidemiology can be improved by using semi-parametric estimation methods. We demonstrate the use of semi-parametric methods by applying them to two different problems: one in causal inference and another in prediction. In each case, we demonstrate the process one would follow to define the question of interest, parameterize this question, and estimate it using semi-parametric methods. In the second chapter we introduce a formal concept of a perception effect, and define unmasking and placebo effects in the context of randomized trials. We employ modern tools from causal inference to derive semi-parametric estimators of such effects. The methods are illustrated on a motivating example from a recent pain trial where the occurrence of treatment-related side effects acts as a proxy for unmasking. In the third chapter, we redefine perception and unmasking effects for a longitudinal setting, and explore various causal graphs for the gabapentin trial. We demonstrate application of the semi-parametric methods in this more general setting by assuming a more complicated causal graph. To estimate the parameters, we use Maximum Likelihood Estimation and two different versions of Targeted Maximum Likelihood Estimation. Finally, in chapter four, we approach coronary heart disease risk prediction modeling from a semi-parametric perspective using data from the Framingham study. The "super learner" is used with a library of machine learning algorithms to create an ensemble risk prediction model for coronary heart disease. We define relative risk importance parameters for various risk factors and estimate them with semi-parametric methods used in earlier chapters. 
The results are compared to the Framingham study and those obtained by fitting a parametric model to the Framingham dataset.
The Analysis of Cluster-Randomized Test-Negative Designs: Eliminating Dengue
According to the World Health Organization, dengue is the most critical and most rapidly spreading mosquito-borne viral disease in the world and is responsible for the infection of an estimated 380 million people across the globe annually. There is no cure for dengue, making prevention key to disrupting the rapid progression of this disease into the world's population. Recent scientific advances target the mosquito's ability to carry and transmit viral diseases. The method motivating this research injects a safe, naturally occurring bacterium called Wolbachia into the mosquito population responsible for the spread of dengue and other arboviruses including Zika, chikungunya, and yellow fever. When successfully introduced into the mosquito population, Wolbachia prevents these viruses from replicating, which reduces the potential of transmission to humans. This dissertation addresses the statistical evaluation of the impact of studies of such mosquito-based interventions. Collecting reliable evidence for mosquito-borne interventions is often expensive and logistically prohibitive. The Cluster Randomized Test-Negative Design discussed in this thesis addresses many of the barriers to such vital research. In this trial setting and several variations, I propose and evaluate estimators of intervention impact. These results can be used to better inform policies and protect vulnerable populations.
Estimating the size of unobserved populations in human rights: Problems in Syria and El Salvador
In this dissertation, I examine two human rights estimation problems. First, I assess data on child abductions from El Salvador's civil war. Between 1979 and 1992, El Salvador was wracked by conflict between leftist guerrilla groups and right-wing nationalist governments. One feature of the conflict was the abduction of children by government military forces, or the forced surrender of children to those same forces. Since 1994, La Asociación Pro-Búsqueda de Niñas y Niños Desaparecidos has investigated cases of these child abductions. To date, they have opened more than 950 cases and located nearly 400 abducted children (now, young adults). The organization remains active, and new cases come to light each year. In Chapter 2, I examine Pro-Búsqueda's data, assessing what can be said to date about the total as yet unknown number of abductions that occurred. I demonstrate that more abductions occurred than the number of currently known cases, and discuss capture-recapture estimates under a range of assumptions about the data available today. I then lay out a plan for updating estimates as new data becomes available. Then, I examine current data on deaths from the ongoing conflict in Syria. Early in the conflict, the United Nations Office of the High Commissioner for Refugees (UNOHCR) contracted with statisticians at the Human Rights Data Analysis Group (HRDAG) to analyze data from multiple human rights groups that were documenting deaths from the conflict there. HRDAG produced three reports for the United Nations and has maintained ongoing relationships with the local human rights groups that are collecting the raw data. HRDAG is now in the unusual position of possessing a series of multiple "snapshots" of each group's data, collected at a number of points between 2012 and 2016. 
Using those snapshots, I examine how each group's data is changing over time, and discuss how those changes can impact resulting estimates of unreported deaths, showing that the changes can result in estimates for a single governorate that vary by nearly 100,000. In addition, I take advantage of the large number of processed cases to assess the performance of a variety of classification algorithms in determining whether two records refer to the same individual.
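As a rough illustration of the capture-recapture idea mentioned in this abstract, the sketch below applies Chapman's bias-corrected two-list estimator to two overlapping case lists. It is not the dissertation's method, which considers a range of assumptions and multiple sources. The function name and the record identifiers are hypothetical, and the estimator assumes the two lists are independent samples of a closed population with perfectly matched records.

```python
def chapman_estimate(n_first: int, n_second: int, n_both: int) -> float:
    """Chapman's bias-corrected two-list estimate of total population size.

    n_first  -- number of cases documented on the first list
    n_second -- number of cases documented on the second list
    n_both   -- number of cases appearing on both lists
    """
    if n_both < 0 or n_both > min(n_first, n_second):
        raise ValueError("overlap must lie between 0 and the smaller list size")
    return (n_first + 1) * (n_second + 1) / (n_both + 1) - 1


# Hypothetical record identifiers, for illustration only.
list_a = {"r1", "r2", "r3", "r4", "r5", "r6"}
list_b = {"r5", "r6", "r7", "r8", "r9"}

overlap = len(list_a & list_b)        # cases seen by both sources
observed = len(list_a | list_b)       # distinct cases seen by at least one source
estimated_total = chapman_estimate(len(list_a), len(list_b), overlap)
estimated_unobserved = estimated_total - observed

print(estimated_total, estimated_unobserved)
```

With real data the overlap depends on record linkage, so errors in deciding whether two records refer to the same individual feed directly into the estimate.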
Topics in Current Status Data
This dissertation considers topics in current status data, a type of survival data where the only available information on the survival time is whether or not the event time has occurred before the examination time. We introduce the concept of current status data and give some motivating examples to highlight some of the many areas in which this type of data naturally occur in practice. We discuss some of the well known and widely used methods for analyzing current status data, along with some of the more recent developments in the area, and provide appropriate references to these previously examined methods. Within this dissertation, we add to the existing literature in the area by developing ideas not previously addressed from a current status data perspective. We describe a simple method for nonparametric estimation of a distribution function based on current status data where observations of current status information are subject to (known) misclassification. Nonparametric maximum likelihood techniques are obtained through the use of a straightforward set of adjustments to the familiar pool-adjacent violators algorithm, which is generally used when misclassification is assumed absent. The methods are extended to allow for misclassification rates that vary over time, particularly when misclassification is most likely to occur close to the time of the true failure event. Using the ideas of binary generalized linear models with outcomes subject to misclassification we consider regression models for the underlying survival time. The ideas are motivated by and applied to an example on human papillomavirus (HPV) infection status amongst women examined in San Francisco. Additional applications on breastfeeding behaviors and menopausal status are also presented. As an extension we consider group testing with current status data in the presence of misclassification. 
Group testing combines samples, such as blood or urine, from a number of individuals and tests the group sample for the presence of the disease of interest instead of testing each individual sample. We examine whether group testing can be used to not only reduce the costs incurred with testing a large number of individuals but also improve the efficiency in estimating the underlying distribution function. We also seek to determine the optimal group size for nonparametric estimation of a distribution function, under various group testing scenarios. Regression models for the group testing approach are briefly considered. We also describe current status data from the perspective of counting processes. We examine the relationship between current status data and simple counting processes. Specifically we consider the multistate model defined by two survival times of interest where one only observes whether or not each of the individual survival times exceed a common observed monitoring time. We are interested in estimation of the distribution function of time to the first event and whether current status information on the subsequent event can be used to improve this estimate. For both single and multiple monitoring time scenarios, in the fully nonparametric setting, one cannot improve the naive estimator, using information on the first event only, when estimating smooth functionals of the distribution of time to the first event (van der Laan and Jewell (2003)). We therefore examine improving this naive estimator when parametric assumptions about the waiting time between the two events are made. For situations where this waiting time is modifiable by design, we also determine the optimal length of the waiting time for estimation of the cumulative hazard of the distribution of time to the first event in the recent past. The ideas are motivated by and applied to an example on simultaneous accurate and diluted HIV test data
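The pool-adjacent-violators algorithm referred to in this abstract can be sketched as follows for the basic case without misclassification. This is a minimal illustration, not the dissertation's adjusted procedure. It assumes distinct monitoring times and binary indicators of whether the event has occurred by the monitoring time; the function names are hypothetical.

```python
def pava(values, weights=None):
    """Weighted non-decreasing (isotonic) least-squares fit by pooling adjacent violators."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total weight, number of observations]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_hi, w_hi, n_hi = blocks.pop()
            mean_lo, w_lo, n_lo = blocks.pop()
            w_all = w_lo + w_hi
            pooled = (mean_lo * w_lo + mean_hi * w_hi) / w_all
            blocks.append([pooled, w_all, n_lo + n_hi])
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted


def current_status_npmle(monitoring_times, indicators):
    """Estimate F at the sorted monitoring times (assumed distinct).

    indicators[i] is 1 if the event occurred by monitoring_times[i], else 0.
    """
    order = sorted(range(len(monitoring_times)), key=lambda i: monitoring_times[i])
    sorted_times = [monitoring_times[i] for i in order]
    sorted_indicators = [indicators[i] for i in order]
    return sorted_times, pava(sorted_indicators)


# Toy data, for illustration only.
times, f_hat = current_status_npmle([4, 1, 6, 2, 5, 3], [0, 0, 1, 1, 1, 0])
for t, f in zip(times, f_hat):
    print(t, round(f, 3))
```

Tied monitoring times would need to be aggregated into weighted proportions before calling `pava`, which is why the `weights` argument is included.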
Topics in Survival Analysis
This dissertation covers three distinct topics in survival analysis: 1) current status data in the context of group testing subject to misclassification; 2) marginal structural modeling of a safety outcome from clinical trial data; and 3) the relationship between preterm birth and weight gain in pregnancy. Abstracts for each chapter separately are presented below. Chapter 2. Group testing, introduced by Dorfman (1943), has been used to reduce costs when estimating the prevalence of a binary characteristic based on a screening test of k groups that include n independent individuals in total. If the unknown prevalence is low, and the screening test suffers from misclassification, it is also possible to obtain more precise prevalence estimates than those obtained from testing all n samples separately (Tu et al., 1994). In some applications, the individual binary response corresponds to whether an underlying time-to-event variable T is less than an observed screening time C, a data structure known as current status data. Given sufficient variation in the observed Cs, it is possible to estimate the distribution function, F, of T nonparametrically, at least at some points in its support, using the pool-adjacent-violators algorithm (Ayer et al., 1955). Here, we consider nonparametric estimation of F based on group tested current status data for groups of size k where the group tests positive if and only if any individual's unobserved T is less than its corresponding observed C. We investigate the performance of the group-based estimator as compared to the individual test nonparametric maximum likelihood estimator, and show that the former can be more precise in the presence of misclassification for low values of F(t). Potential applications include testing for the presence of various diseases from pooled samples where interest focuses on the age at incidence distribution rather than overall prevalence. 
We apply this estimator to the age-at-incidence curve for hepatitis C infection in a sample of U.S. women who gave birth to a child in 2014, where group assignment is done at random and based on maternal age. We discuss the relationship to other work in the literature, and potential extensions. Chapter 3. Marginal structural modeling was first developed to address time-dependent confounding in studies where the effect of a time-varying exposure on an outcome is of interest. This chapter begins by introducing the reader to the concept of time-dependent confounding, and describes inverse probability weighting estimators for parameters of marginal structural models. The second part of chapter 3 contains an application of marginal structural modeling in a drug safety study. Studies in pharmacoepidemiology are often conducted in rich data sources, such as clinical trials or administrative databases, where large quantities of information are collected repeatedly over time. These data sources can and should be exploited, but traditional methods often cannot incorporate all available data, and fail to take time-dependent confounding into account. Marginal structural modeling and weighted estimators, tools often used in observational studies, can help to alleviate these challenges. Our objective in this study was to estimate the relation between rheumatoid arthritis (RA) disease activity, cholesterol levels, and major adverse cardiovascular events (MACE) in patients with moderate to severe rheumatoid arthritis who are currently prescribed tocilizumab, accounting for the presence of time-dependent confounding, such as other inflammatory markers, lipid levels, and rheumatoid arthritis disease measures. We studied 3,986 patients enrolled in one of five clinical trials used to study tocilizumab, who then joined one of three long-term extension studies. 
We used a weighted logistic regression model to explore associations between pre-treatment levels of RA disease activity and cholesterol on the 5-year risk of MACE. We then used a logistic marginal structural model to explore causal relations between pre- and post-treatment RA disease activity and cholesterol levels, and 5-year risk of MACE, adjusting for time-dependent confounders. We did not find evidence that pre- or post-treatment levels of RA disease activity, HDL cholesterol, and LDL cholesterol were associated with increased risk of MACE in patients with moderate to severe rheumatoid arthritis taking tocilizumab, once time-dependent confounding from inflammatory markers and other lipid levels was taken into account. After adjustment for time dependent confounding, traditional markers of disease activity and cholesterol were not associated with an increased risk of cardiac events among RA patients treated with tocilizumab. Chapter 4. The relationship between weight gain in pregnancy and preterm birth is still contested due to their inherent dependence. In the first part of Chapter 4, we wanted to quantify the relationship between pregnancy weight gain with early and late preterm birth and evaluate whether associations differed between non-Hispanic (NH) black and NH white women. We analyzed a retrospective cohort of all live births to NH black and NH white women in the U.S. 2011-2015 (n = 10,714,983). We used weight gain z-scores in multiple logistic regression models, stratified by prepregnancy body mass index (BMI) and race, to calculate population attributable risks (PAR) and PAR percentages for early and late preterm birth. We found that both low and high pregnancy weight gain were related to preterm birth, but these associations varied by BMI and race, and differed from associations with late preterm birth. For high weight gain and early preterm birth, the PAR percentage ranged from 8-10% in NH black women and from 6-8% in NH white women. 
Racial differences were small or nonexistent for late preterm birth, with PAR percentages ranging from 2-7% in NH black women and from 3-7% in NH white women. We conclude that these findings add to evidence that moderate gestational weight gain could help prevent preterm birth, and suggest that the impact may be greatest for early preterm birth in NH black women. The second part of Chapter 4 is a preliminary analysis assessing the variety of measures of weight gain in pregnancy and their relationship with preterm birth. Serial GWG measurements provide ideal data, but are rarely available in population health datasets. The electronic medical records from 160,635 women in Sweden have been compiled to be the largest dataset in the world that contains repeated weight gain measures through pregnancy. Here, we describe the pattern of weight gain in pregnancy in 103,661 Swedish pregnancies, and assess whether the observed pattern before 37 weeks' gestation differs between preterm and term pregnancies
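For the group-testing setting described under Chapter 2 of this abstract, the relation between individual prevalence and the probability that a pooled test is positive can be sketched as below. This is a textbook-style illustration, not the dissertation's estimator. It assumes independent individuals, a group that is truly positive when any member is positive, and known sensitivity and specificity that do not depend on group size; the function names are hypothetical.

```python
def group_positive_prob(prevalence, group_size, sensitivity, specificity):
    """Probability that a pooled test of `group_size` individuals reads positive."""
    all_negative = (1.0 - prevalence) ** group_size
    return sensitivity * (1.0 - all_negative) + (1.0 - specificity) * all_negative


def prevalence_from_group_rate(positive_rate, group_size, sensitivity, specificity):
    """Invert the relation above to recover individual-level prevalence."""
    if sensitivity + specificity <= 1.0:
        raise ValueError("test must be better than chance: sensitivity + specificity > 1")
    # Keep the observed rate inside the range the model can produce.
    clipped = min(max(positive_rate, 1.0 - specificity), sensitivity)
    all_negative = (sensitivity - clipped) / (sensitivity + specificity - 1.0)
    return 1.0 - all_negative ** (1.0 / group_size)


# Illustrative values only.
rate = group_positive_prob(prevalence=0.02, group_size=5, sensitivity=0.95, specificity=0.98)
recovered = prevalence_from_group_rate(rate, group_size=5, sensitivity=0.95, specificity=0.98)
print(rate, recovered)
```

In the current status version of the problem, the same inversion would be applied at each screening time to a monotone estimate of the group-level positive probability.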
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
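The two counting schemes compared in this abstract can be sketched with a single function that varies how many authors of each cited work are considered. The data layout and names are hypothetical, and the convention that each citing paper contributes at most one count per author pair is an assumption made here, not something the abstract specifies.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations.

    citing_papers -- list of reference lists; each reference is an ordered
                     list of author names for one cited work
    max_authors   -- how many leading authors of each cited work to count
                     (1 = first-author counting, 5 = first five authors)
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Assumption: a citing paper adds one count per distinct author pair.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy data with placeholder author names.
papers = [
    [["A", "B"], ["C"]],
    [["C", "D"], ["B", "A"]],
]

first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_author)
print(first_five)
```

In this toy example, authors A and B are never co-cited under first-author counting but are co-cited twice when the first five authors are counted, which shows how the choice changes the input to later mapping steps.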
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
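One property often discussed when comparing these two measures, which may or may not be the argument made in this article, is how they react when authors who are co-cited with neither profile are added to the analysis. The sketch below computes both measures on toy co-citation profiles before and after appending zeros. The profiles are invented for illustration.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two co-citation profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Toy co-citation profiles of two authors against four other authors.
profile_x = [3, 1, 0, 2]
profile_y = [1, 3, 2, 0]

# Add six authors who are never co-cited with either of the two.
padded_x = profile_x + [0] * 6
padded_y = profile_y + [0] * 6

print("cosine :", cosine(profile_x, profile_y), cosine(padded_x, padded_y))
print("pearson:", pearson(profile_x, profile_y), pearson(padded_x, padded_y))
```

The cosine is unchanged by the added zeros, while the Pearson correlation moves from negative to positive, so the apparent similarity of two authors depends on which other authors happen to be included.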
Topics in Evidence Synthesis
This dissertation considers three different topics related to extracting and merging evidence from heterogeneous sources. This problem is addressed from different angles, from the field of design of experiments to machine learning. Within this dissertation, we add to the existing literature in each area by developing novel methodology and software. Adaptive trial designs can considerably improve upon traditional designs by modifying design aspects of the ongoing trial, such as early stopping, adding or dropping doses, or changing the sample size. We propose a two-stage Bayesian adaptive design for a Phase IIb study aimed at selecting the lowest effective dose for Phase III. In this setting, efficacy has been proved for a high dose in a Phase IIa proof-of-concept study, but the existence of a lower but still effective dose is investigated before the scheduled Phase III starts. In the first stage, patients are randomized to placebo, maximal tolerated dose, and one or more additional doses within the dose range. Based on an interim analysis, the study is either stopped for futility or success, or enters the second stage, where newly recruited patients are allocated to placebo, some fairly high dose, and one additional dose chosen based on interim data. At the interim analysis, criteria based on the predictive probability of success are used to decide on whether to stop or to continue the trial, and, in the latter case, which dose to select for the second stage. Finally, a dose will be selected as lowest effective dose for Phase III either at the end of the first or at the end of the second stage. The operating characteristics of the procedure are evaluated via simulations and results are presented for several scenarios comparing the performance of the proposed procedure to that of the non-adaptive design. The development of novel therapies in multiple sclerosis (MS) is one area where a range of surrogate outcomes are used in various stages of clinical research. 
While the aim of treatments in MS is to prevent disability, a clinical trial for evaluating a drug's effect on disability progression would require a large sample of patients with many years of follow-up. The early stage of MS is characterized by relapses. To reduce study size and duration, clinical relapses are accepted as primary endpoints in phase III trials. For phase II studies, the primary outcomes are typically lesion counts based on Magnetic Resonance Imaging (MRI), as these are considerably more sensitive than clinical measures for detecting MS activity. Recently, Sormani and colleagues (2010) provided a systematic review, and used weighted regression analyses to examine the role of either MRI lesions or relapses as trial-level surrogate outcomes for disability. We build on this work by developing a Bayesian three-level model, accommodating the two surrogates and the disability endpoint, and properly taking into account that treatment effects are estimated with errors. Specifically, a combination of treatment effects based on MRI lesion count outcomes and clinical relapse, both expressed on the log risk ratio scale, were used to develop a study-level surrogate outcome model for the corresponding treatment effects based on disability progression. While the primary aim for developing this model was to support decision making in drug development, the proposed model may also be considered for future validation. In Genomics and Epidemiology we deal with a high number of features for each observation. Many well-known approaches to drawing inferences in this kind of setting use the topology of the feature space, induced by an appropriate metric, to group observations and summarize their main characteristics to get rid of the noise and to predict an outcome of interest. In the present work we generalize this approach in the context of Loss-Based Estimation. 
We propose an alternative method for constructing a nonparametric multidimensional regression function. This approach is based on the simple idea of clustering data points in the feature space and then fitting a constant to the outcome within each cluster. HOPACH-PAM is used for partitioning. This approach results in the choice of a small number of distinct regions that are easy to interpret. This is illustrated by simulations, which show the superiority of this method over CART. Pre-screening and feature selection methods are also developed to improve performance and reduce noise. Software is also available in the R package HOPSLAM (HOpach-Pam Supervised Learning AlgorithM) to make this methodology easily accessible.
