
    Multiple systems estimation of victims of human trafficking: Model assessment and selection

    Recently, multiple systems estimation (MSE) has been applied to estimate the number of victims of human trafficking in different countries. The estimation procedure consists of a log-linear analysis of a contingency table of population registers and covariates. As the number of potential models increases exponentially with the number of registers and covariates, it is practically impossible to fit and compare all models. Therefore, the model search needs to be restricted to a small subset of all potential models. This paper addresses principles and criteria for model assessment and selection for MSE of human trafficking, with special attention to sparsity, which is typical of human trafficking data. The concepts are illustrated on data from Slovakia and Romania.

    Statistical underpinning of mutational signature analyses of cancer sequencing data

    Cancer is a disease driven and characterised by mutations in the DNA. Thanks to massively parallel sequencing technologies, it is now possible to obtain the sequence of a cancer genome. The advent of modern sequencing technologies has allowed researchers to study the mutations involved in tumour development. More recently, attention has been drawn to the 'passenger' mutations that are not involved in tumour development but bear fingerprints of the mutational processes that have been operative over a patient's lifetime. Those fingerprints, termed mutational signatures, appear consistently across cancer genomes that have been exposed to the underlying mutational processes. Computational analyses have identified over a hundred such signatures, and it is now possible to estimate the relative prevalence of mutational signatures in a cancer genome. Both types of analyses are perhaps unique in the medical literature, in that no confidence intervals or other representations of uncertainty are demanded when reporting the results. In this thesis, we address the problem of quantifying uncertainty around the reported mutational signatures and their relative prevalence in individual tumours. First, in Chapter 2, we review the available computational methods for mutational signature analyses, assessing the potential of existing approaches to characterise uncertainty. Then, in Chapter 3, we annotate ten statistical challenges. The remainder of the thesis is built on the aim of addressing some of those challenges. To estimate the relative prevalence of mutational signatures in individual tumours, a method that quantifies the uncertainty around the estimated solution is lacking. Moreover, those analyses assume that the true values for the signatures are 'known' as they are propagated from previous analyses. In Chapter 4, we suggest a setting where the signatures are 'partially known'. We propose a novel approach for this problem, in a Bayesian setting, providing credible intervals around the estimated solution, propagating prior uncertainty regarding 'partially known' signatures, and updating prior beliefs about them. Estimation of mutational signatures is often performed in a matrix factorisation setting that is not fully probabilistic. While an alternative fully probabilistic approach is available, a post-processing method is needed to characterise the uncertainty around the reported solution. In Chapter 5, we introduce a novel post-processing approach to quantify uncertainty around the mutational signatures estimated in a cohort of cancer patients, along with software that allows investigators to use the proposed method and visualise results.

    On the correspondence of deviances and maximum-likelihood and interval estimates from log-linear to logistic regression modelling

    Funding: The first author would like to acknowledge the support of the School of Mathematics and Statistics, as well as CREEM, at the University of St Andrews, and the University of St Andrews St Leonard’s 7th Century Scholarship. Consider a set of categorical variables P where at least one, denoted by Y, is binary. The log-linear model that describes the contingency table counts implies a logistic regression model, with outcome Y. Extending results from Christensen (1997, Log-linear models and logistic regression, 2nd edn. New York, NY, Springer), we prove that the maximum-likelihood estimates (MLE) of the logistic regression parameters equal the MLE for the corresponding log-linear model parameters, also considering the case where contingency table factors are not present in the corresponding logistic regression and some of the contingency table cells are collapsed together. We prove that, asymptotically, standard errors are also equal. These results demonstrate the extent to which inferences from the log-linear framework translate to inferences within the logistic regression framework, on the magnitude of main effects and interactions. Finally, we prove that the deviance of the log-linear model is equal to the deviance of the corresponding logistic regression, provided that no cell observations are collapsed together when one or more factors in P∖{Y} become obsolete. We illustrate the derived results with the analysis of a real dataset. Peer reviewed.
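The sense in which a log-linear model "implies" a logistic regression can be seen in the smallest possible case. The derivation below is only a sketch under assumptions that are not taken from the paper: a single binary covariate X alongside the binary outcome Y, a saturated log-linear model for the expected cell counts m_xy, corner-point constraints, and notation (λ, β) chosen here for illustration.

```latex
\begin{align*}
\log m_{xy} &= \lambda + \lambda^{X}_{x} + \lambda^{Y}_{y} + \lambda^{XY}_{xy},
  \qquad x, y \in \{0, 1\}, \\
\operatorname{logit} P(Y = 1 \mid X = x)
  &= \log m_{x1} - \log m_{x0} \\
  &= \bigl(\lambda^{Y}_{1} - \lambda^{Y}_{0}\bigr)
   + \bigl(\lambda^{XY}_{x1} - \lambda^{XY}_{x0}\bigr) \\
  &= \lambda^{Y}_{1} + \lambda^{XY}_{x1}
  \qquad \text{(corner-point constraints: every term with a zero index is } 0\text{)}.
\end{align*}
```

Under these assumptions the logistic intercept corresponds to λ^Y_1 and the logistic slope for X corresponds to λ^XY_11, while the X main effect λ^X_x cancels and does not appear in the logistic model at all. The paper's results concern the general case, including estimates, standard errors and deviances, which this two-variable sketch does not establish.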

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
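The difference between the two counting schemes can be sketched in a few lines of Python. This is an illustration only, not the study's procedure: the function name `cocitation_counts`, the toy data, and two counting conventions (each citing paper contributes at most one count per author pair, and co-authors of the same cited work are treated as co-cited) are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author co-citations, keeping the first `max_authors` authors
    of every cited work (1 = traditional first-author counting).

    `citing_papers` is a list of reference lists; each reference is the
    ordered author list of one cited work. Hypothetical helper.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Assumption: one count per author pair per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy data: three citing papers, each citing two works.
citing_papers = [
    [["Smith", "Lee"], ["Jones"]],
    [["Smith", "Lee"], ["Jones", "Kim"]],
    [["Lee"], ["Kim", "Jones"]],
]

first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)

print(first_author[("Jones", "Lee")])  # Lee and Jones are never both first authors
print(first_five[("Jones", "Lee")])    # counted once they appear as co-authors
```

In this toy example the pair Jones and Lee is invisible under first-author counting but is co-cited in every citing paper once the first five authors are considered, which is the kind of additional signal the abstract attributes to the non-traditional scheme.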

    On synthetic interval data with predetermined subject partitioning and partial control of the variables’ marginal correlation structure

    (Publication fee waived.) A standard approach for assessing the performance of partition models is to create synthetic datasets with a prespecified clustering structure and assess how well the model reveals this structure. A common format involves subjects being assigned to different clusters, with observations simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we consider observations from interval variables. Interval data are commonly observed in cohort and Genome-Wide Association studies, and our focus is on Single-Nucleotide Polymorphisms. Theoretical and empirical results are utilized to explore the dependence structure between the variables in relation to the clustering structure for the subjects. A novel algorithm is proposed that allows control over the marginal stratified correlation structure of the variables, specifying exact correlation values within groups of variables. Practical examples are shown, and a synthetic dataset is compared to a real one, to demonstrate similarities and differences. Peer reviewed.

    Bayesian nonparametrics and mixture modelling

    This introductory chapter is aimed at post-graduate students, not necessarily with a strong mathematical background, but with knowledge of the fundamentals of probability and statistics. It is based on the author’s own research and other sources referenced within. We start with an introduction of Bayesian nonparametrics and the Dirichlet process. Parts of this introduction are based on lecture notes by Professor Tony O’Hagan (Lecture notes on Bayesian inference. University of Nottingham, 1996). We continue with an overview of Bayesian mixture modelling, considering mixture models with a finite number of components, where this number can be fixed or random. We then proceed to discuss the Dirichlet process mixture model where an infinite number of components is assumed. Relevant MCMC sampling ideas and principles are discussed in detail. Fitting selected models through MCMC sampling is illustrated using simple synthetic data sets, with example R code available in a GitHub repository.

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance. Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
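One concrete way in which the Pearson correlation and the cosine can disagree on cocitation profiles is sketched below. The abstract does not say which shortcomings the authors have in mind, so this is an assumption-laden illustration rather than their argument: it shows the mathematical fact that appending authors with whom neither profile is cocited (joint zeros) leaves the cosine unchanged but shifts the Pearson correlation. The profiles and function names are made up for the example.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two profile vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors against four other authors.
profile_a = [10, 2, 0, 5]
profile_b = [2, 10, 5, 0]

# The same profiles after adding 20 authors cocited with neither of them.
padded_a = profile_a + [0] * 20
padded_b = profile_b + [0] * 20

print(cosine(profile_a, profile_b), cosine(padded_a, padded_b))
print(pearson(profile_a, profile_b), pearson(padded_a, padded_b))
```

With these made-up numbers the cosine is identical before and after padding, whereas the Pearson correlation moves from negative to positive, so the apparent similarity of the two authors depends on how many unrelated authors happen to be included in the analysis.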

    Dispelling the Myths Behind First-author Citation Counts

    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how much author rankings resulting from different citation counting methods actually differ from each other, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.