
    Identifying Metropolitan Areas

    There are endless ways to group people together in space; the most common are cities, towns, villages, and metropolitan areas. Today we define cities and metropolitan areas using a mixture of techniques, such as historical boundaries and strict rules on commuting and population density, to determine the outlines of a metropolitan area[2]. The way in which these delineations are determined is important because they are the basis for various statistical calculations detailing anything from community diversity to median incomes. These statistics are then used to help determine economic and infrastructure initiatives, which could be beneficial to the community but could also be wasteful if the statistics do not represent the true community. Thus a robust method of community delineation that captures the true structure of connections would allow for more representative statistics, giving policy makers a better understanding of the issues and leaving them better able to solve these problems. Such a method can be found by turning to network theory, the study of connections between entities. This field is useful because metropolitan areas are supposed to represent closely connected areas. Specifically, it would be useful to look to network theory for an algorithm that identifies community structure from data on the workers who commute to and from various counties. Using these techniques we should be able to provide delineations that are judged to be both significant and proper.
    Bachelor of Science

    Modeling Epidemiological Spread on Contact Networks

    The rapid diffusion of a contagion amongst a population has detrimental impacts on both public health and economic stability. As individuals of a population come into contact, they can spread tangible materials, such as bacteria and viruses, through probabilistic diffusion across the resulting contact network. The probabilistic branching process of disease relies on the basic reproductive number, or the average contagiousness of the pathogen, a crude measure of the true impact of the virus. Multiple precautionary measures can be taken to reduce the reproductive number in the case of an epidemic. Stochastic agent-based modeling (ABM) was used in this study to emulate and analyze the impacts of various public health measures on the coronavirus epidemic. These ABM simulations are carried out using the EpiModel package within R, implemented by the ERGM (Exponential Random Graph Model) package. This algorithm allows the network model to vary stochastically (randomly) over time. The following protocols implemented during the coronavirus epidemic were assessed: the mask mandate, the social distancing protocol, the reduction of initial infected population size (e.g., travel restrictions), and national vaccination. The simulated examples within this study followed the susceptible-infectious-recovered/immune (SIR) compartmental model type. These simulations serve to model the spread of the coronavirus contagion through a connected network of 1,000 nodes and 4,000 edges over a span of 100 days. The findings of this study could point towards a tool for future researchers tasked with the difficult job of updating national safety standards and precautionary health measures to inhibit contagion spread.
    Bachelor of Science
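To make the SIR-on-a-network setup described above concrete, here is a minimal sketch in plain Python. It is not the study's EpiModel/ERGM workflow: the static random graph, the per-contact transmission probability `p_transmit`, the daily recovery probability `p_recover`, and the idea of modeling a mask mandate as a lower `p_transmit` are all illustrative assumptions. Only the network size (1,000 nodes, 4,000 edges) and the 100-day horizon come from the abstract, and the sketch does not enforce that the sampled graph is connected.

```python
import random

def random_graph(n_nodes, n_edges, rng):
    """Sample a simple undirected graph with exactly n_edges distinct edges."""
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)  # two distinct endpoints
        edges.add((min(u, v), max(u, v)))
    neighbors = {node: [] for node in range(n_nodes)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return neighbors

def count_states(state):
    """Return the (S, I, R) head counts for the current state map."""
    values = list(state.values())
    return values.count("S"), values.count("I"), values.count("R")

def simulate_sir(neighbors, n_initial, p_transmit, p_recover, n_days, rng):
    """Discrete-time SIR on a fixed graph; returns (S, I, R) for day 0..n_days."""
    state = {node: "S" for node in neighbors}
    for node in rng.sample(sorted(neighbors), n_initial):
        state[node] = "I"
    history = [count_states(state)]
    for _ in range(n_days):
        infectious = [node for node, s in state.items() if s == "I"]
        newly_infected = set()
        # Each infectious node gets one transmission attempt per susceptible neighbor.
        for node in infectious:
            for nb in neighbors[node]:
                if state[nb] == "S" and rng.random() < p_transmit:
                    newly_infected.add(nb)
        # Nodes that were infectious at the start of the day may recover.
        for node in infectious:
            if rng.random() < p_recover:
                state[node] = "R"
        for node in newly_infected:
            state[node] = "I"
        history.append(count_states(state))
    return history

rng = random.Random(0)
graph = random_graph(1000, 4000, rng)
# Hypothetical parameter values: a baseline and a reduced-transmission scenario.
baseline = simulate_sir(graph, 10, 0.05, 0.10, 100, rng)
reduced = simulate_sir(graph, 10, 0.02, 0.10, 100, rng)
print("final recovered (baseline):", baseline[-1][2])
print("final recovered (reduced transmission):", reduced[-1][2])
```

Under these assumptions, reducing the initial infected population corresponds to a smaller `n_initial`, and vaccination could be approximated by starting some nodes in the `R` state.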

    Comparison of Embedding Methods on SCOTUS Cases and Empirical Analysis of Phase Transitions in Node2Vec Hyperparamters

    The first goal of this paper is to analyze the information retained in different representations of the same data. In our case, the Supreme Court of the United States (SCOTUS) releases an opinion for each case it hears. Each case contains citations of previous cases as well as its opinion text. Thus, we have text data as well as citation data for each case, upon which we run different embedding models. Secondly, it is a comparison of the performance of embedding models, specifically Doc2Vec and Node2Vec, that are extensions of the same group of models, Word2Vec. Since its inception, Word2Vec has become the basis for several other embedding models, including Doc2Vec and Node2Vec. Lastly, it is an analysis of the performance of Node2Vec as its parameters change. We find a rapid transition as the parameter increases, going from near-useless to performing near-perfectly within a short window of values. Instead of running this analysis on a real-world data set where most information is unknown, such as the SCOTUS cases, we run many simulations of networks using the Stochastic Block Model.
    Bachelor of Science
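The Stochastic Block Model mentioned above generates networks with known community labels, which is what makes embedding quality measurable. The following is a minimal sketch of such a generator; the function name `sample_sbm`, the two-probability parameterization (`p_in` within a block, `p_out` between blocks), and the example values are assumptions for illustration, not the settings used in the paper. It covers only network generation, not the Node2Vec embedding step.

```python
import random
from itertools import combinations

def sample_sbm(block_sizes, p_in, p_out, rng):
    """Sample an undirected graph from a simple stochastic block model.

    Nodes in the same block are joined with probability p_in,
    nodes in different blocks with probability p_out.
    Returns (labels, edges), where labels[i] is the block of node i.
    """
    labels = [block for block, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for u, v in combinations(range(len(labels)), 2):
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return labels, edges

rng = random.Random(0)
labels, edges = sample_sbm([50, 50], p_in=0.20, p_out=0.02, rng=rng)
within = sum(1 for u, v in edges if labels[u] == labels[v])
print("edges:", len(edges), "within-block:", within)
```

Because the true labels are returned alongside the edges, an embedding can be scored by how well a clustering of the embedded nodes recovers `labels`.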

    Some asymptotic problems for dynamical random graphs

    This dissertation consists of two parts. In the first part we study the phase transition of a class of dynamical random graph processes that evolve via the addition of new edges in a manner that incorporates both randomness and limited choice. As the density of edges increases, the graphs display a phase transition from the subcritical regime, where all components are small, to the supercritical regime, where a giant component emerges. We are interested in the behavior at criticality. First, we consider the simplest model of this kind, namely the Bohman-Frieze process. We show that the stochastic process of component sizes, in the critical window for the Bohman-Frieze process after proper scaling, converges to the standard multiplicative coalescent. Next, we study a more general family of dynamical random graph models, namely, the bounded-size-rule processes. We prove a useful upper bound on the size of the largest component in the barely subcritical regime. We then use this upper bound to study both sizes and surplus of the components of the bounded-size-rule processes in the critical window. In order to describe the joint evolution of sizes and surplus, we introduce the augmented multiplicative coalescent. Our main result shows that the vector of suitably scaled component sizes and surplus converges in distribution to the augmented multiplicative coalescent. In the second part of this dissertation, we study a large deviation problem related to the configuration model with a given degree distribution. We define a random walk associated with the depth-first-exploration of the random graph constructed from the configuration model. The large deviation principle of this random walk is studied using weak convergence techniques. Some large deviation bounds on the probabilities related to the sizes of the largest component are proved.
    Doctor of Philosophy
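The "limited choice" in the Bohman-Frieze process can be illustrated with a short simulation. The sketch below uses the standard form of the rule: at each step two candidate edges are drawn uniformly at random, the first is added if both of its endpoints are isolated vertices, and otherwise the second is added. It tracks only the largest component size with a union-find structure. The graph size, the number of steps, and the simplification of allowing repeated edges are assumptions made for illustration; the scaling limits studied in the dissertation are not reproduced here.

```python
import random

class UnionFind:
    """Disjoint-set forest that also tracks component sizes."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.largest = 1

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return  # edge falls inside an existing component
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.largest = max(self.largest, self.size[ra])

    def component_size(self, x):
        return self.size[self.find(x)]

def bohman_frieze(n, n_steps, rng):
    """Run the Bohman-Frieze rule; return the largest component size after each step."""
    uf = UnionFind(n)
    largest = []
    for _ in range(n_steps):
        e1 = rng.sample(range(n), 2)  # first candidate edge
        e2 = rng.sample(range(n), 2)  # second candidate edge
        if uf.component_size(e1[0]) == 1 and uf.component_size(e1[1]) == 1:
            uf.union(e1[0], e1[1])  # both endpoints isolated: keep the first edge
        else:
            uf.union(e2[0], e2[1])  # otherwise fall back to the second edge
        largest.append(uf.largest)
    return largest

n = 20000
trace = bohman_frieze(n, n, random.Random(0))
for edges_added in (n // 4, n // 2, 3 * n // 4, n):
    print(edges_added, "edges -> largest component fraction",
          trace[edges_added - 1] / n)
```

A bounded-size rule generalizes this by letting the choice depend on component sizes, with all sizes above a fixed threshold treated alike; the Bohman-Frieze rule is the case where the threshold is 1.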

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
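The difference between the two counting schemes can be sketched on toy data. In the sketch below, each citing paper is a list of cited works and each cited work is an ordered author list; two authors are treated as co-cited once per citing paper if both are credited in its reference list. The data, the function name `cocitation_counts`, and the choice to let co-authors of the same cited work count as co-cited are illustrative assumptions, not necessarily the definitions used in the study.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors):
    """Count author co-citations, crediting the first max_authors of each cited work.

    max_authors=1 gives first-author counting; max_authors=5 credits
    up to the first five authors of every cited work.
    """
    counts = Counter()
    for references in citing_papers:
        credited = set()
        for authors in references:
            credited.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts

# Hypothetical reference lists of two citing papers.
citing_papers = [
    [["Smith", "Lee"], ["Garcia"]],
    [["Smith", "Lee"], ["Chen", "Garcia"]],
]

first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
print("first-author counting:", dict(first_author))
print("first-five counting:  ", dict(first_five))
```

In this toy example, Garcia and Smith are co-cited once under first-author counting but twice when second authors are credited, because Garcia's second-author work is otherwise invisible.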

    Asymptotics and Approximation of Quasi-Stationary Distributions

    Stochastic dynamical systems with absorbing states are used to model systems arising from ecology, biology, chemical kinetics, and other fields. Despite the fact that these systems are eventually absorbed, they often persist for long periods of time prior to absorption. Quasi-stationary distributions (QSD) are the fundamental mathematical objects used to characterize the stability and long-term behavior of such systems prior to absorption. In the first part of this dissertation, I consider a collection of Markov chains that model the evolution of multitype biological populations. The state space of the chains is the positive orthant, with absorption at the boundary of the orthant, which represents the extinction of different population types. The main results of this part of the dissertation show that, as the size of the system increases, the behavior of the associated QSD can be characterized in terms of an underlying continuous-time dynamical system. The proofs of these results rely on uniform large deviation results for small noise stochastic dynamical systems and methods from the theory of dynamical systems. In the second part of this dissertation, I introduce two new stochastic approximation schemes that can be used to estimate the QSD of a finite-state Markov chain with absorbing states. Both methods are described in terms of a collection of particles evolving via interacting chains in which the interaction is given in terms of the total time occupation measure of all particles in the system and has the impact of reinforcing certain types of transitions. I characterize the asymptotic behavior of these approximation methods as time and the number of particles in the system simultaneously become large. 
In particular, I prove that the approximations given by these two methods converge almost surely to the Markov chain’s unique QSD and I establish Central Limit Theorems for the approximations’ fluctuations around the QSD under the key assumption that the ratio between the number of particles in the system and time goes to zero.
Doctor of Philosophy
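For a finite-state chain, the QSD has a concrete characterization that a few lines of code can illustrate: it is the normalized left eigenvector, for the leading eigenvalue, of the sub-stochastic matrix of transitions among the non-absorbing states. The sketch below computes it by plain power iteration. This is a deterministic baseline, not either of the particle-based stochastic approximation schemes introduced in the dissertation, and the three-state transition matrix is a made-up example.

```python
def quasi_stationary_distribution(Q, n_iter=1000):
    """Approximate the QSD of an absorbing chain by power iteration.

    Q[i][j] is the one-step probability of moving from transient state i
    to transient state j; each row may sum to less than 1, the deficit
    being the absorption probability. Returns (nu, lam), where nu is the
    QSD and lam is the per-step survival probability under nu.
    """
    n = len(Q)
    nu = [1.0 / n] * n
    lam = 1.0
    for _ in range(n_iter):
        # One step of the chain, then condition on not being absorbed.
        stepped = [sum(nu[i] * Q[i][j] for i in range(n)) for j in range(n)]
        lam = sum(stepped)
        nu = [x / lam for x in stepped]
    return nu, lam

# Hypothetical chain on transient states {1, 2, 3}; state 1 is absorbed
# with probability 0.3 per step, the other states cannot be absorbed directly.
Q = [
    [0.2, 0.5, 0.0],
    [0.3, 0.2, 0.5],
    [0.0, 0.5, 0.5],
]
nu, lam = quasi_stationary_distribution(Q)
print("QSD:", [round(x, 4) for x in nu])
print("per-step survival probability:", round(lam, 4))
```

Starting the chain from `nu` and conditioning on survival leaves the distribution unchanged, which is the defining property of a QSD; power iteration converges here because the example matrix is irreducible and aperiodic.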

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
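One well-known difference between the two measures named above can be checked numerically: appending authors with whom neither profile has any cocitations (joint zeros) leaves the cosine unchanged but shifts the Pearson correlation. The sketch below demonstrates this on made-up cocitation profiles. It is offered as an illustration of how the measures can disagree and is not claimed to be the specific argument or example given in the paper.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cocitation profiles of two authors over four other authors.
x = [3, 1, 0, 2]
y = [1, 3, 2, 0]

# The same profiles after adding six authors cocited with neither.
zeros = [0] * 6
x_padded, y_padded = x + zeros, y + zeros

print("cosine :", cosine(x, y), "->", cosine(x_padded, y_padded))
print("pearson:", pearson(x, y), "->", pearson(x_padded, y_padded))
```

With these profiles the Pearson correlation changes sign once the zeros are added, while the cosine is identical in both cases, so the Pearson value depends on which unrelated authors happen to be included in the analysis.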

    Sparse Machine Learning Methods for Prediction and Personalized Medicine

    With growing interest in using black-box machine learning for complex data with many feature variables, it is critical to obtain a prediction model that depends only on a small set of features to maximize generalizability. Therefore, feature selection remains an important and challenging problem in modern applications. Most existing methods for feature selection are based on either parametric or semiparametric models, so the resulting performance can suffer severely from model misspecification when high-order nonlinear interactions among the features are present. A very limited number of approaches for nonparametric feature selection have been proposed, but they are computationally intensive and may not even converge. Thus, nonparametric feature selection for high-dimensional data is an important problem in the fields of statistics and machine learning. Furthermore, in the field of precision medicine, machine learning techniques are usually applied to a large health dataset containing patients' information to find the optimal individual treatment rule (ITR), which makes the learning process computationally demanding. Thus, identifying the truly important feature variables shortens the computation time and saves the cost of collecting redundant data. Therefore, we focus on developing machine learning techniques to perform variable selection for both prediction and personalized medicine in the dissertation. In the first project, we propose a novel and computationally efficient approach for nonparametric feature selection in regression based on a tensor-product kernel function over the feature space. The importance of each feature is governed by a parameter in the kernel function which can be efficiently computed iteratively from a modified alternating direction method of multipliers (ADMM) algorithm. We prove the oracle selection property of the proposed method. 
We demonstrate the superior performance of our approach compared to existing methods via simulation studies and application to the prediction of Alzheimer's disease. In the second project, we propose a new framework to perform nonparametric feature selection for both regression and classification problems. Under this framework, we learn prediction functions through empirical risk minimization over a reproducing kernel Hilbert space (RKHS). The space is generated by a novel tensor product kernel which depends on a set of parameters that determine the importance of the features. Computationally, we minimize the empirical risk with a penalty to estimate the prediction and kernel parameters simultaneously. The solution can be obtained by iteratively solving convex optimization problems. We study the theoretical properties of the kernel feature space and prove the oracle selection property and Fisher consistency of our proposed method. We demonstrate the superior performance of our approach compared to existing methods via extensive simulation studies and application to a microarray study of eye disease in animals. Lastly, we focus on applying the nonparametric feature selection framework to treatment decision making with high-dimensional data. We directly estimate the decision function in an RKHS generated by a newly constructed tensor product kernel with parameters capturing the importance of each variable. Computationally, we adopt two steps to separate the estimating and tuning processes, which makes the computation faster and more stable. We demonstrate the superior performance of our approach compared to existing methods via one simulation study and application to type 2 diabetes.
Doctor of Philosophy