
    Jointly modeling latent trajectories and a subsequent outcome variable: A Bayesian approach.

    The general class of models described in this dissertation was motivated by research questions that come from two linked longitudinal datasets. We are interested in whether the growth and decline in young adolescents' various problem behaviors from 5th to 10th grade explain subsequent involvement in serious motor vehicle offenses. However, trajectories of problem behaviors are not directly observed and require estimation from data at several discrete time points. Thus, in stage 1 of the model, latent trajectories are summarized by a few latent trajectory variables. For example, in subject-specific polynomial models, the coefficients for linear or quadratic terms are the latent trajectory variables. In longitudinal analyses, the objective is often to treat these latent trajectory variables as outcome variables in order to investigate both systematic and random variation. However, the research questions addressed here require using latent trajectory variables as predictors of a subsequent outcome in stage 2 of the model. In this dissertation, we propose a general class of models that treats latent trajectory variables summarized in stage 1 of the model as predictors of a subsequent outcome. This general class of models allows for a variety of trajectory shapes and can incorporate outcomes that come from the exponential family of distributions. Furthermore, the model can be extended to multivariate latent trajectories as predictors of a future outcome. Because currently used methods do not provide flexibility, we propose a fully Bayesian approach to fit this model. Details on the implementation of the Bayesian approach are provided in the dissertation. We apply the Bayesian approach and two competing approaches to study trajectories of adolescent alcohol use as predictors of motor vehicle offenses incurred during later adolescence. Results from a simulation study done to evaluate the performance of the Bayesian method are also reported. 
To illustrate the flexibility of the Bayesian approach, we apply it to a number of research questions that require the modeling of multivariate latent trajectories.
    PhD. Biological Sciences, Biostatistics. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/124156/2/3122020.pd

    Modeling nonignorable missing data for clustered longitudinal discrete outcomes: A Bayesian approach.

    This thesis develops Bayesian methods for analyzing clustered longitudinal data of discrete outcomes with nonignorable (NI) missing values. The research was motivated by an applied project that evaluates the effect of an intervention program to reduce the number of hospitalizations and to improve the quality of life (QOL) for asthma patients. Transition Markov models with random effects for Poisson and ordinal outcomes are used to model change in patients' status over time. Current methods of fitting such models require complete data or restrictive assumptions about the missing data. We propose Bayesian pattern-mixture models that have the flexibility to incorporate models for missing data in both the outcome and time-varying covariates. The underlying assumptions related to NI missing data are represented using easy-to-understand parameters, which are used to perform sensitivity analysis. Simulation results demonstrate that the proposed method is more efficient under certain conditions than standard methods, which perform poorly under nonignorable missing data mechanisms. The proposed method was applied to analyze the data from the asthma project. The results of the analyses show no evidence of an intervention effect on the number of hospitalizations during the first 12-month follow-up period. During the second follow-up period, the rate of hospitalization for patients in the intervention group was reduced by 77% (95% CI = (26%, 93%)) when compared to patients in the control group. Sensitivity analysis showed that these findings hold under several NI missing data mechanisms. With respect to QOL, the intervention effect was more likely to occur during the first follow-up period. There was some evidence that for parents with high QOL at baseline, the odds of having high QOL at the first follow-up were 1.65 (90% CI = (1.04, 2.60)) times higher for parents in the treatment group as compared to parents in the control group.
There was no such evidence in the second follow-up period. These findings were unchanged even when different assumptions about the dropouts at the second follow-up were used. The proposed methodology can be applied to other situations involving longitudinal data with discrete outcomes from studies with complex designs.
    PhD. Biological Sciences, Biostatistics. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/132244/2/3057983.pd

    A Bayesian method for finding interactions.

    In genomic studies, datasets with a small sample size and a large number of potential predictors are common. Recently, gene-gene interactions (epistasis) and gene-environment interactions have been drawing increasing attention because of their role in the etiology of complex diseases. If all possible pairwise interactions are to be explored, the model space becomes high dimensional. Very little work addresses this common problem. The emphasis of my research is on selecting interactions and controlling the number of falsely discovered predictors with a limited sample size. The method I propose simultaneously satisfies the two properties for inclusion of interactions: interpretability and discovery. In addition, I develop a novel equivalence between variable selection procedures and the false discovery rate. One application of my research is the development of a model to aid therapeutic decisions by identifying prognostic factors or interactions among abundant variables from the clinical and molecular profiles of patients. Given a patient's profile, an optimal treatment involves a trade-off between efficacy and toxicity. My research also proposes a novel way to compare treatments with multiple endpoints.
    PhD. Biological Sciences, Biostatistics. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/126120/2/3237929.pd

    Multiple imputation for continuous non-normal missing data.

    Missing data problems pose great challenges to both statisticians and data practitioners. Multiple imputation (1987), a popular method for dealing with missing data problems, fills in missing items with several sets of plausible values drawn from an imputation model. This approach is especially useful when public-use (shared) databases are analyzed by many ultimate users (researchers) with varying degrees of statistical expertise and computing power, and with different scientific questions and objectives. For continuous variables with missing data, most existing imputation approaches assume that the data are normally distributed. However, variables in real data sets often deviate from normality. The two major goals of this dissertation are to develop imputation approaches that account for non-normality arising from continuous data, and to assess the performance of the normal imputation approaches under non-normal data. Tukey's gh distribution (Tukey 1977) is a flexible family that can be used to model a wide variety of continuous distributions. We achieve the first goal of this dissertation by proposing a class of gh imputation approaches using the gh distribution and its extensions. In Chapter 2, we consider modeling and imputing univariate incomplete data using the gh distribution. In Chapter 3, we propose an imputation approach in which the error distribution of the regression model is modeled through the gh distribution. In Chapter 4, we propose an imputation approach for multivariate missing data based on a multivariate extension of the gh distribution. For all proposed gh imputation approaches, we conduct simulation studies to assess their performance under various non-normal distributions. To achieve the second goal of this dissertation, we also evaluate the performance of the normal imputation approaches in these simulation studies.
We find that if the mean structure is correctly specified, the normal imputation methods perform well in estimating the marginal means and regression coefficients regardless of whether the normality assumption holds. The gh imputation approaches have comparable performance. For estimating the shape of the marginal distributions, such as the proportions that are less than certain quantiles, the normal imputation approaches can perform poorly if the distribution deviates from normality, while the gh imputation approaches perform consistently well. For illustration, we apply the proposed methods to data from an adolescent driving study.
    PhD. Biological Sciences, Biostatistics; Pure Sciences, Statistics. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/125381/2/3192656.pd
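
    As a rough illustration of the gh family named in this abstract, the sketch below uses the standard textbook form of Tukey's g-and-h transformation of a standard normal variable Z, X = a + b * ((exp(g*Z) - 1) / g) * exp(h*Z^2 / 2), where g controls skewness and h controls tail heaviness. The function names `tukey_gh_transform` and `sample_tukey_gh` and all parameter defaults are hypothetical choices made for this example; the dissertation's imputation procedures themselves are not reproduced here.

```python
import math
import random


def tukey_gh_transform(z, a=0.0, b=1.0, g=0.0, h=0.0):
    """Map a standard normal value z to the Tukey g-and-h scale.

    a and b are location and scale; g controls skewness, h tail heaviness.
    (Hypothetical helper for illustration only.)
    """
    # (exp(g*z) - 1) / g tends to z as g -> 0, so g == 0 is handled separately.
    skew_part = z if g == 0.0 else math.expm1(g * z) / g
    return a + b * skew_part * math.exp(h * z * z / 2.0)


def sample_tukey_gh(n, a=0.0, b=1.0, g=0.0, h=0.0, seed=0):
    """Draw n values from a g-and-h distribution by transforming normal draws."""
    rng = random.Random(seed)
    return [tukey_gh_transform(rng.gauss(0.0, 1.0), a, b, g, h) for _ in range(n)]


if __name__ == "__main__":
    # g = h = 0 gives back the normal distribution; g > 0 skews to the right.
    print(sample_tukey_gh(3, g=0.0, h=0.0))
    print(sample_tukey_gh(3, g=0.5, h=0.1))
```

    With g = h = 0 the transformation is the identity, so normal-based imputation is a special case of this family, which is one way to read the comparison described in the abstract.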

    Combining information from multiple surveys for small area estimation: Bayesian approaches.

    Cancer surveillance research requires accurate estimates of cancer risk prevalence for small areas such as counties. Two popular data sources are the Behavioral Risk Factor Surveillance System (BRFSS), a telephone survey conducted by State agencies, and the National Health Interview Survey (NHIS), a personal survey. Both surveys have advantages and disadvantages. The BRFSS is a fairly large survey and almost every county is represented, but it has poor response rates and excludes the non-telephone households. The NHIS is a smaller survey and not all counties are represented, but includes households with or without telephones and has a higher response rate. After a brief small area estimation literature review, the dissertation examines non-response and non-coverage errors in the BRFSS and NHIS. The distributions of demographic variables from two survey samples in 2000 are compared to those in the 2000 Census at both national and large area levels. The socio-demographic variables include gender, age, race/ethnicity, education, marital status, employment status, and household or family income. The BRFSS sample is found to be further from the target population than the NHIS. Hence the BRFSS design-based estimates are potentially subject to higher non-coverage and non-response biases. Next, a hierarchical Bayesian approach is used to obtain county-level estimates by combining information from both surveys. The model incorporates potential non-coverage and non-response bias in the BRFSS and complex sample design features in both surveys. A Markov Chain Monte Carlo (MCMC) method simulates draws from the joint posterior distributions for the model based on the county-level design-based direct estimates. Due to confidentiality concerns, the application of the model in Chapter III is limited since the design-based county-level direct estimates are only available from the in-house NHIS and BRFSS data. 
Therefore, in Chapter IV we explore a large area level model for publicly available data employing the same county level model as in Chapter III. An MCMC method combining Gibbs sampling and the Metropolis-Hastings algorithm is used in model inference. The estimates are compared to those in Chapter III. In Chapters III and IV, simulations and model validations evaluate the inference and model estimates.
    PhD. Biological Sciences, Biostatistics; Health and Environmental Sciences, Public health. University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/124596/2/3150122.pd

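    Analyse unvollständiger Befragungsdaten - Multiple Imputation mittels Bayesian Bootstrap Predictive Mean Matching (Analysis of incomplete survey data: multiple imputation using Bayesian Bootstrap Predictive Mean Matching)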

    Bamberg, Univ., Diss., 2009. Multiple Imputation (MI) is a general purpose approach to impute partially incomplete data. The proposed method, Bayesian Bootstrap Predictive Mean Matching, is a variant that incorporates the robustifying properties of a nearest neighbour technique (Predictive Mean Matching) into MI.

    KriMI: A Multiple Imputation Approach for Preserving Spatial Dependencies : Imputation of Regional Price Indices using the Example of Bavaria

    Dissertation, Otto-Friedrich-Universität Bamberg, 2017. Multiple imputation is a method to handle the problem of missing values in a dataset. Because it accounts for the uncertainty introduced by the missing data, reliable statistical tests can be conducted after it has been applied. Kriging uses neighbourhood effects to predict values of unobserved regions. It can be seen as an imputation technique: the unobserved regions are missing data points, and the kriging predictions are the imputations. Because kriging is a single imputation technique, proper statistical inference is not possible after filling in the dataset. If spatially dependent data have missing values and proper statistical inference is needed, the spatial correlation must be modelled in the multiple imputation model. Here this is achieved by implementing kriging in the model used for multiple imputation. We call the resulting method KriMI. Exactly this problem arises when looking at regional price levels in Bavaria. The Bavarian State Office for Statistics surveys the prices needed to compute the price index in only a few regions. The prices of the unobserved regions are treated as missing data.

    Neues Analysepotential durch die Ergänzung zensierter Variablen (New analysis potential through the imputation of censored variables)

    Bamberg, Univ., Diss., 2010. Censoring of variables is a common problem with microdata. It arises especially often with wage and income variables, for a variety of reasons. The data may not be available due to difficulties during the data collection process, they may be artificially censored to ensure confidentiality, or they may simply be unreliable because high wage earners are disproportionately likely not to answer income questions. An important example is the German IAB Employment Sample (IABS), which is based on administrative data from the social security system. Here, right-censoring of wages occurs due to the contribution limit in the German social security system. If earnings are to be analyzed from right-censored or top-coded data, standard models cannot be applied. We treat this problem as a missing data problem and use multiple imputation approaches to impute the censored wages by draws of a random variable from a truncated distribution, based on Markov chain Monte Carlo techniques. In this dissertation, new single and multiple imputation methods allowing for heteroscedasticity are suggested. Whereas one goal of the thesis is to present new imputation approaches that are applicable to right-censored wages, a main objective is also to confirm the validity of multiple imputation approaches for right-censored wages in general and to show the superiority of the new multiple imputation approach considering heteroscedasticity over conventional approaches in a wide range of situations. To assess the validity of this approach, we also develop alternative approaches using uncensored wage information from a survey (German Structure of Earnings Survey, GSES; in German, Gehalts- und Lohnstrukturerhebung, GLS). Simulation studies are performed to compare the different imputation approaches under different situations and to show the superiority of the new approach, which works without external information. Additionally, analyses that are typically done with the IABS are replicated to demonstrate the validity of imputed wage data.
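
    The core idea in this abstract, replacing right-censored wages with draws from a distribution truncated at the censoring limit, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the dissertation's method: it assumes log wages are normal with a known mean `mu` and standard deviation `sigma` (a full multiple imputation would redraw these parameters, for example by MCMC, and could let `sigma` vary across observations to allow for heteroscedasticity). The function names `draw_above_limit` and `impute_right_censored` are hypothetical.

```python
import random
from statistics import NormalDist


def draw_above_limit(mu, sigma, limit, rng):
    """Draw one value from N(mu, sigma^2) truncated to [limit, infinity).

    Uses the inverse-CDF method: sample u uniformly between F(limit) and 1,
    then map it back through the normal quantile function.
    """
    dist = NormalDist(mu, sigma)
    p_low = dist.cdf(limit)
    u = rng.uniform(p_low, 1.0)
    # inv_cdf needs 0 < u < 1, so clamp away from the endpoints.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    # The final max() guards against rounding pushing a draw below the limit.
    return max(dist.inv_cdf(u), limit)


def impute_right_censored(log_wages, limit, mu, sigma, m=5, seed=0):
    """Return m completed copies of log_wages.

    Values recorded at or above `limit` are treated as censored and replaced
    by truncated-normal draws; all other values are kept unchanged.
    """
    rng = random.Random(seed)
    completed = []
    for _ in range(m):
        completed.append([
            draw_above_limit(mu, sigma, limit, rng) if w >= limit else w
            for w in log_wages
        ])
    return completed


if __name__ == "__main__":
    observed = [3.0, 4.5, 5.0, 5.0]  # the last two sit at the censoring limit
    for dataset in impute_right_censored(observed, limit=5.0, mu=4.2, sigma=0.6, m=3):
        print(dataset)
```

    Each completed dataset would then be analyzed separately and the results combined, which is what distinguishes multiple imputation from filling in a single value per censored wage.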

    Statistical Challenges in Combining Information from Big and Small Data Sources

    Social media, electronic health records, credit card transactional and administrative data, web scraping, and numerous other ways of collecting information have changed the landscape for those interested in addressing policy-relevant research questions. During the same time, the traditional sources of data, such as large-scale surveys, that have been a stable source for policy-relevant research have suffered setbacks due to large nonresponse and increasing data collection costs. The non-survey data usually contain detailed information on certain behaviors for a large number of individuals (such as all credit card transactions) but very little background information on them (such as important covariates to address the policy-relevant question). On the other hand, the survey data contain detailed information on covariates but less detailed information on the behaviors. Both data sources may be imperfect for the target population of interest. This paper develops and evaluates a framework for linking information from multiple imperfect data sources along with the Census data to draw statistical inference. An explicit modeling framework uses selection into the big data, the sampling and nonresponse mechanisms in the survey data, the distribution of the key variables of interest, and certain marginal distributions from the Census data as building blocks to draw inference about the population quantity of interest.
    http://deepblue.lib.umich.edu/bitstream/2027.42/120417/1/NAS-Paper.pdf

    Erzeugung Mehrfach Imputierter Synthetischer Datensätze: Theorie und Implementierung (Generating multiply imputed synthetic datasets: theory and implementation)

    Bamberg, Univ., Diss., 2009. The book describes different approaches to generating multiply imputed synthetic datasets, which can be released to the interested research community without violating confidentiality. Each chapter is dedicated to one approach, first describing the general concept and then giving a detailed application to a real dataset that provides useful guidelines on how to implement the theory in practice.