1,721,006 research outputs found
Recommended from our members
Detection and Localization of a Submatrix: Theory, Methods and Algorithms
We consider the problem of detecting and localizing an submatrix with larger-than-usualentries inside a large, noisy matrix. This problem arises from analysis of data ingenetics, bioinformatics, and social sciences. We consider that entries of the data matrix areindependently following distributions from a natural exponential family, which generalizesthe common Gaussian assumptions in the literature. In Chapter 2 a permutation test fortesting the existence of the elevated submatrix is studied. The test's asymptotic power isillustrated, and its robust variation (rank method) is also studied. In The latter part ofChapter 2 and Chapter 3 we remove the prior knowledge of the submatrix size, aimingto develop adaptive methods for detection and localization. Latter part of Chapter 2proposes a Bonferroni testing framework based on the permutation scan test, to solvethe detection problem. An accelerating framework is also developed without sacricingasymptotic power. In Chapter 3, a new size-adaptive estimator is proposed to solve thelocalization problem. Its asymptotic performance is studied, and two fast algorithms toapproximate the estimator are developed
Recommended from our members
Clustering and Mixture Modeling: Some Methodology and Theory
In this dissertation, we propose several methodology in clustering and mixture modeling when the user has some prior knowledge regarding one or more of the clusters or mixture components. We also provide theory in the consistency of K-medoids, and prove its consistency under two versions of ordinal information.We begin by considering the problem of clustering and mixture modeling in situations where the user has relevant prior knowledge on the center separation between clusters or mixture components. We propose a dynamic programming approach to solving the K-means problem exactly with a separation constraint on the centers in the context of real-valued data, and propose an EM algorithm that incorporates such a constraint in the context of fitting a Gaussian mixture model.We then move onto two-component mixture models, where one component plays the role of the background component. We consider situations where the user has prior information on this background component’s distribution: symmetric, monotonic, or log-concave. In each setting, we derive estimators for the background component and provide relevant theory. While estimating the log-concave background distribution, we also propose a method to estimate the largest concave minorant.In addition, we consider the situation where the user has information on the number of modes of a mixture density. We provide a dynamic programming approach to fitting a density function by maximum likelihood when it is constrained to have a given number of modal intervals. When this value is not available, we provide several data-driven ways for selecting it.For the last project, we establish the consistency of K-medoids in the context of metric spaces. Our general approach applies also to non-metric settings where only an ordering of the dissimilarities is available. We consider two types of ordinal information: one where all quadruple comparisons are available; and one where only triple comparisons are available
Recommended from our members
Detection of Sparse Heterogeneous Mixtures: Theory, Methods and Algorithms
The detection of sparse heterogeneous mixtures becomes important in settings where a small proportion of a population may be affected by a given treatment, for example. The situation is typically formalized as a contamination model. We consider such models in asymptotic regimes where the contamination proportion tends to zero at various rates. We study the following three settings: the contamination manifests itself as a change in variance, the contamination manifests itself as a positive dependence between the variables in the bivariate setting, and the effect is a shift in mean without knowing the null distribution. In each setting, we study how large the effect needs to be in order to reliably distinguish the null hypothesis and the alternative hypothesis. We show that the corresponding higher criticism test is first-order comparable to the likelihood ratio test, while other classical tests are suboptimal. In particular, we make connections between the first two settings. We consider the dependence problem from both parametric and nonparametric perspectives. In the last chapter, we consider a different problem, that is to examine the extent to which the causal inference resulting estimate is sensitive to the unmeasured confounders for survival and competing risks data
Recommended from our members
Topics in Clustering: Feature Selection and Semiparametric Modeling
The first part of this thesis is concerned with Sparse Clustering, which assumes that a potentially large set of features are associated with clustering observations but the true underlying clusters differ only with respect to some of the features. We propose two approaches for this purpose, both of which allow us to group the observations using only a carefully-chosen subset of the features. The first approach assumes that the data are generated from Gaussian mixture models in high dimensions and the difference between mean vectors of the Gaussian components is sparse. Enlightened by the connection between sparse principal component analysis (SPCA) and sparse clustering, we adapted multiple estimation strategies from SPCA to perform sparse clustering. We provide theoretical guarantee of the aggregated estimator and develop an iterative algorithm to uncover the important feature set in sparse clustering. The second one is a hill-climbing approach, which alternates between selecting the s most important features (that correspond to the s smallest within-cluster dissimilarities) and clustering observations based on the selected feature subset. This approach has been shown to be competitive with existing methods in literature on simulated and real-world datasets.In the second part of the thesis, we consider a semiparametric approach to clustering and develop related theory. We first consider the problem of fitting a mixture model under the assumption that the mixture components are symmetric and log-concave. We study the nonparametric maximum likelihood estimation (NPMLE) of a monotone and log-concave probability density (which we do as part of our algorithm), and derive some results in terms of existence, uniqueness and uniform consistency of the MLE. To t the mixture model, we propose a semiparametric EM (SEM) algorithm, which can be adapted to other semiparametric mixture models. We then consider mixture modeling in high dimensions using radial (or elliptical) distributions. In the process of working on this problem, we uncovered a difficulty in estimating the densities. We found that the i.i.d. d-dimensional data points sampled from a rotationally invariant distribution with , are highly concentrated on the sphere of a -dimensional ball as . This extends the well-known behavior of the normal distribution (its concentration around the sphere of radius square-root of the dimension) to other radial densities. We establish a form of concentration of measure, and even a convergence in distribution, under additional assumptions. We draw some possible consequences for statistical modeling in high-dimensions, including a possible universality property of Gaussian Mixtures
Recommended from our members
Template Matching and Texture Segmentation: Theory, Methods and Algorithms
Template matching, as an important topic in image processing, is the process of matching a clean and noiseless template to an observed, typically noisy signal. This topic is closely related to the problem of matching two or more noisy signals, sometimes referred to as 'aligning' or 'registering' the signals, and to methodology in spatial statistics falling under the umbrella name of 'scan statistic'. When the template has a point of discontinuity, matching a template to the signal can be interpreted as detecting the location of the discontinuity, a more specialized task more broadly referred to as 'change-point detection' in statistics. In Chapter 2, we study a standard mathematical model for matching a template to a noisy signal by M-estimation. While the most popular method may still be based on maximizing the (Pearson) correlation, the estimators we study can be made much more robust to heavy-tailed noise or the presence of outlying observations. We draw on standard empirical process theory and decision theory, to derive limit distributions and minimax convergence rates of M-estimators in a wide array of situations. In Chapter 3, we suggest a rank-based method for matching a template to a noisy signal, and study its asymptotic properties using some well-established techniques in empirical process theory combined with Hajek's projection method. The resulting estimator of the shift is shown to achieve a parametric rate of convergence and to be asymptotically normal. Texture segmentation, as another crucial role in image processing, is the process of partitioning the image into differently textured regions. In Chapter 4, we present our related work on texture segmentation. We provide some theoretical guarantees for the prototypical approach which consists in extracting local features in the neighborhood of a pixel and then applying a clustering algorithm for grouping the pixel according to these features. The proposed algorithms work on both stationary and non-stationary random fields
Recommended from our members
Part I - Constrained Shortest-Path For Manifold Learning And Multiple Manifold Clustering Part II - Community Detection In Large Graphs; Analysis, Design And Implementation
In Part I of this thesis, we address the problem of manifold learning and clustering by introducing a novel constrained shortest-path algorithm. In the case of self-intersection, existing methods such as Isomap fail to capture the true shape of the surface near the intersection. We tackle this problem by imposing a curvature constraint to the shortest-path algorithm used in Isomap. We demonstrate theoretically that under a regularity assumption on manifolds, the proposed algorithm is able to capture the underling parameter space of the manifolds with self-intersection. We apply our method to simulated and real datasets and show it performs comparably to current state-of-the-art methods such as K-manifold and spectral multi-manifold clustering clustering.In Part II, we propose a new community detection algorithm for large networks. We introduce a new distributed implementation of message passing algorithm with a dynamic probability of update. We analyze our algorithm on a few random and real networks to provide insight about the choice of parameter. We discuss the implementation of the algorithm on the Spark platform. We also analyze the performance of the algorithm in terms of speed and accuracy using given ground truths and a few modularity measures like separability and density of detected communities. We run our algorithm on graphs with millions of nodes and billions of edges. Experiments show that the proposed algorithm is efficient and linearly adaptable to the scale of our data by adding extra machines as the size of our networks increase
Recommended from our members
Multiple Testing and False Discovery Rate Control: Theory, Methods and Algorithms
Multiple testing, a situation where multiple hypothesis tests are performed simultaneously, is a core research topic in statistics that arises in almost every scientific field. When more hypotheses are tested, more errors are bound to occur. Controlling the false discovery rate (FDR) [BH95], which is the expected proportion of falsely rejected null hypotheses among all rejections, is an important challenge for making meaningful inferences. Throughout the dissertation, we analyze the asymptotic performance of several FDR-controlling procedures under different multiple testing settings. In Chapter 1, we study the famous Benjamini-Hochberg (BH) method [BH95] which often serves as benchmark among FDR-controlling procedures, and show that it is asymptotic optimal in a stylized setting. We then prove that a distribution-free FDR control method of Barber and Cand`es [FBC15], which only requires the (unknown) null distribution to be symmetric, can achieve the same asymptotic performance as the BH method, thus is also optimal. Chapter 2 proposes an interval-type procedure which identifies the longest interval with the estimated FDR under a given level and rejects the corresponding hypotheses with P-values lying inside the interval. Unlike the threshold approaches, this procedure scans over all intervals with the left point not necessary being zero. We show that this scan procedure provides strong control of the asymptotic false discovery rate. In addition, we investigate its asymptotic false non-discovery rate (FNR), deriving conditions under which it outperforms the BH procedure. In Chapter 3, we consider an online multiple testing problem where the hypotheses arrive sequentially in a stream, and investigate two procedures proposed by Javanmard and Montanari [JM15] which control FDR in an online manner. We quantify their asymptotic performance in the same location models as in Chapter 1 and compare their power with the (static) BH method. In Chapter 4, we propose a new class of powerful online testing procedures which incorporates the available contextual information, and prove that any rule in this class controls the online FDR under some standard assumptions. We also derive a practical algorithm that can make more empirical discoveries in an online fashion, compared to the state-of-the-art procedures
Recommended from our members
Statistical Inference: Global Testing, Multiple Testing and Causal Inference in Survival Analysis
In Chapter 1, we consider the problem of detecting a sparse mixture as studied by Ingster (1997) and Donoho and Jin (2004). We consider a wide array of base distributions. In particular, we study the situation when the base distribution has polynomial tails, a situation that has not received much attention in the literature. Perhaps surprisingly, we find that in the context of such a power-law distribution, the higher criticism does not achieve the detection boundary. However, the scan statistic does. In Chapter 2, we derive the large-sample distribution of several variants of the scan statistic applied to a point process on an interval, which can be applied to detect the presence of an anomalous interval with any length. The main ingredients in the proof are Kolmogorov's theorem, a Poisson approximation, and recent technical results by \cite{kabluchko2014limiting}. In Chapter 3, we consider causal inference in survival analysis in the presence of unmeasured confounders. Instrumental variable is an essential tool for addressing unmeasured confounding in observational studies. Two stage predictor substitution (2SPS) estimator and two stage residual inclusion(2SRI) are two commonly used approaches in applying instrumental variables. Recently 2SPS was studied under the additive hazards model in the presence of competing risks of time-to-events data, where linearity was assumed for the relationship between the treatment and the instrument variable. This assumption may not be the most appropriate when we have binary treatments. We consider the 2SRI estimator under the additive hazards model for general survival data and in the presence of competing risks, which allows generalized linear models for the relation between the treatment and the instrumental variable. We derive the asymptotic properties including a closed-form asymptotic variance estimate for the 2SRI estimator. We carry out numerical studies in finite samples, and apply our methodology to the linked Surveillance, Epidemiology and End Results (SEER) - Medicare database comparing radical prostatectomy versus conservative treatment in early-stage prostate cancer patients. In Chapter 4, we investigate the causal effects of etanercept (trade name Enbrel) on birth defects, a pharmaceutical that treats autoimmune diseases and recently went through the US FDA revised labeling for use in pregnancy, as the proportion of liveborn infants with major birth defects was higher for women exposed to etanercept compared to diseased etanercept unexposed women. An outstanding problem, which was not addressed in the data analysis leading up to the FDA relabeling, is the missing birth defect outcomes due to spontaneous abortion since, in accepted standard practice an infant or a fetus is assumed not to be malformed unless a defect is found. This led to likely bias (and missing not at random) because, according to the theory of ``terathanasia'', a defected fetus is more likely to be spontaneously aborted. In addition, the previous analysis stratified on live birth against spontaneous abortion, which was itself a post-exposure variable showing higher rate of spontaneous abortion in the unexposed women, hence did not lead to causal interpretation of the stratified results. We aim to estimate and provide inference for the causal parameters of scientific interest, including the principal effects, making use of the missing data mechanism informed by terathanasia. During the process we also deal with complications in the data including left truncation, observational nature, and rare events. We report our findings which not only provide a more in-depth analysis than previously done on etanercept, but also shed light on how similar studies on causal effects of medication (or vaccine, other substances etc.) during pregnancy may be analyzed
Recommended from our members
Essays on robust statistical estimation and inference
Due to the advancements of modern technologies, large-scale and high-dimensional data have been widely collected in almost every scientific disciplines.This introduces a several challenges including that the data are often accompanied by outliers due to possible measurement error, or many variables follow heavy-tailed distributions.
To address these challenges, my thesis proposes methodologies in the setting of the mean estimation and matrix recovery when the data have asymmetric and heavy-tailed distributions.
Additionally, I explore the characterization of tail behavior in random outcomes, focusing on expected shortfall, which is widely recognized as a measure of risk.
I propose nonparametric approaches for estimating expected shortfall, aiming to enhance its accuracy and applicability.In Chapter 1, we propose a robust estimator to recover approximately low-rank matrices in the presence of heavy-tailed and asymmetric noises.Focusing on three archetypal applications including matrix compressed sensing, matrix completion and multitask learning, we provide sub-Gaussian-type deviation bounds when the noise variables only have bounded variances.
Computationally, we propose a matrix version of the local adaptive majorize-minimization algorithm, which is much faster than the alternating direction
method of multiplier used in previous work and is scalable to large datasets.Chapter 2 studies the problem of robust and differentially private mean estimation and inference.We first provide a comprehensive analysis of the Huber mean estimator with increasing dimensions, including non-asymptotic deviation bound, Bahadur representation, and (uniform) Gaussian approximations.
Then, we privatize the Huber mean estimator via noisy gradient descent, and construct private confidence intervals for the proposed estimator by incorporating a private and robust covariance estimator.In Chapter 3, we consider the problem of nonparametric estimation of conditional expected shortfall functions.To mitigate the curse of dimensionality, we propose a two-step nonparametric ES estimator based on fully connected neural nets with the ReLU activation function.
This approach (i) involves unobservable surrogate response variables that must be estimated from data in a preliminary step, and (ii) uses a properly chosen Huber loss to achieve exponential deviation bounds under heavy-tailed response distributions.
Using a plugged-in nonparametric conditional quantile estimate, also trained on deep neural nets, we establish non-asymptotic high probability bounds for the final robust ES estimator, which are optimal as if the true quantile function were known without resorting to any type of sample splitting.
We demonstrate the effectiveness of deep robust ES regression with both numerical experiments and an empirical study on the impact of El Niño on heavy precipitations, for which effective tail learning is imperative.In Chapter 4, I introduce a two-step nonparametric ES estimator that involves a plugged-in quantile function estimate without sample-splitting. We provide non-asymptotic estimation and Gaussian approximation error bounds, depending explicitly on the effective dimension, sample size, regularization parameters, and quantile estimation error.
To construct pointwise confidence bands, we propose a fast multiplier bootstrap procedure and establish its validity.
We demonstrate the finite-sample performance of the proposed methods through numerical experiments
and an empirical study aimed at examining the heterogeneous effects of features on average and large medical expenses
Finite size percolation in regular trees
In the context of percolation in a regular tree, we study the size of the largest cluster and the length of the longest run starting within the first d generations. As d tends to infinity, we prove almost sure and weak convergence results.Percolation on trees Largest open cluster Longest success run Galton-Watson processes Chen-Stein method for Poisson approximation
- …
