The geometry of synchronization problems and learning group actions
We develop a geometric framework, based on the classical theory of fibre bundles, to characterize the cohomological nature of a large class of synchronization-type problems in the context of graph inference and combinatorial optimization. We identify each synchronization problem in a topological group G on a connected graph Γ with a flat principal G-bundle over Γ, thus establishing a classification result for synchronization problems using the representation variety of the fundamental group of Γ into G. We then develop a twisted Hodge theory on flat vector bundles associated with these flat principal G-bundles, and provide a geometric realization of the graph connection Laplacian as the lowest-degree Hodge Laplacian in the twisted de Rham–Hodge cochain complex. Motivated by these geometric intuitions, we propose to study the problem of learning group actions (partitioning a collection of objects based on the local synchronizability of pairwise correspondence relations) and provide a heuristic synchronization-based algorithm for solving problems of this type. We demonstrate the efficacy of this algorithm on simulated and real datasets.
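As a minimal sketch of what synchronizability means in the simplest case, the example below takes G to be the circle group of planar rotations, written as angles modulo 2π. Edge data theta_ij on a connected graph is synchronizable when there are vertex values f_i with theta_ij = f_i - f_j (mod 2π), which holds exactly when the edge data composes to the identity around every cycle. The function name `synchronize`, the edge-dictionary layout, and the tolerance are assumptions made for this illustration; this is not the heuristic algorithm the abstract refers to.

```python
import math
from collections import deque

TWO_PI = 2.0 * math.pi


def wrap(angle):
    """Map an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % TWO_PI - math.pi


def synchronize(num_vertices, edges, tol=1e-9):
    """Try to solve theta_ij = f_i - f_j (mod 2*pi) on a connected graph.

    edges maps (i, j) to theta_ij; the reversed edge (j, i) implicitly
    carries the inverse group element -theta_ij.
    Returns (f, consistent): vertex potentials found by propagating along
    a breadth-first spanning tree from vertex 0, and whether every edge
    (including the non-tree edges) agrees with them.
    """
    adjacency = {v: [] for v in range(num_vertices)}
    for (i, j), theta in edges.items():
        adjacency[i].append((j, theta))
        adjacency[j].append((i, -theta))

    f = {0: 0.0}  # fix the gauge: the root vertex gets the identity
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j, theta in adjacency[i]:
            if j not in f:
                f[j] = wrap(f[i] - theta)  # from theta_ij = f_i - f_j
                queue.append(j)

    consistent = all(
        abs(wrap(f[i] - f[j] - theta)) <= tol
        for (i, j), theta in edges.items()
    )
    return f, consistent


# A triangle whose edge angles compose to zero around the cycle ...
_, ok = synchronize(3, {(0, 1): 0.3, (1, 2): 0.5, (0, 2): 0.8})
# ... and one whose cycle has nontrivial holonomy (0.3 + 0.5 != 1.0).
_, bad = synchronize(3, {(0, 1): 0.3, (1, 2): 0.5, (0, 2): 1.0})
```

The spanning-tree step always succeeds; the obstruction to synchronization lives entirely on the remaining edges, which is the discrete counterpart of the holonomy of a flat bundle around cycles of the graph.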
The Geometry of Cancer
Cancer is a complex, multifaceted disease that operates through dynamic changes in the genome. Cancer is best understood through the process that generates it (random mutations operated on by natural selection) and several global hallmarks that describe its broad mechanisms. While many genes, protein interactions, and pathways have been enumerated as a kind of "parts" list for cancer, researchers are attempting to synthesize broader models for inferring and predicting cancer behavior using high-throughput data and integrative analyses. The focus of this thesis is on the development of two novel methods that are optimized for the analysis of complex cancer phenotypes. The first method incorporates ideas from gradient learning with multitask learning to assess statistical dependencies across multiple related data sets. The second method integrates multiscale analysis on graphs and manifolds developed in applied harmonic analysis with sparse factor models, a mainstay of applied statistics. This method generates multiscale factors that are used for inferring hierarchical associations within complex biological networks. The primary biological focus is the inference of gene and pathway dependencies associated with cancer progression and metastatic disease in prostate cancer. Significant findings include evidence of Skp2 degradation of the cell-cycle regulator p27, and the upstream deregulation of the TGF-beta pathway, driving prostate cancer recurrence.
Modeling Cancer Progression on the Pathway Level
Over the past several decades, many genes have been discovered that govern important functions in the development of a variety of different cancers. However, biological insight from the list of genes is still limited and the underlying mechanisms that occur in the cell during tumorigenesis have not been well established. Studying cancer progression in terms of the oncogenic pathways that are responsible for specific actions that change normal cells into tumors is a means of bringing insight to these issues. The work presented here will uncover mechanisms occurring at the pathway level that first initiate tumor formation and then continue through cancer progression and finally metastasis. This knowledge will allow for drug treatment that is better targeted towards an individual. Microarray technology has allowed for the collection of gene expression datasets from clinical cancer and other studies. These datasets can be used to study how expression levels of individual genes or groups of related genes are altered in individuals from different phenotypic groups. Statistical methods exist which assay pathway enrichment by phenotypic class but do not describe individual variation. In order to study this individual variation, we developed a formal statistical method called ASSESS which measures the enrichment of a gene set in each sample in an expression dataset. As cancer advances through the stages of initiation, progression, and proliferation, multiple pathways experience disruptions at various times. However, much remains unknown about the particular pathways that show gene expression changes throughout tumorigenesis. Using gene expression datasets composed of individuals with tumors classified by location and stage, we applied ASSESS in order to study the data on the pathway level. We then utilized novel statistical methods to uncover the pathways that play a role in cancer progression and the order in which the pathways become perturbed. These analyses can give a basis for how genetic disruptions serve to alter actions in specific cell types. The results may provide insight that will lead to treatments of existing tumors and prevention of incipient cancers from forming. Treatments for existing tumors will use multiple drugs to target the pathways that show an altered state of activity.
Nonlinear Prediction in Credit Forecasting and Cloud Computing Deployment Optimization
This thesis presents data analysis and methodology for two prediction problems. The first problem is forecasting midlife credit ratings from personality information collected during early adulthood. The second problem is analysis of matrix multiplication in cloud computing. The goal of the credit forecasting problem is to determine if there is a link between personality assessments of young adults and their propensity to develop credit in middle age. The data we use is from a long-term longitudinal study of over 40 years. We do find an association between credit risk and personality in this cohort. Such a link has obvious implications for lenders but also can be used to improve social utility via more efficient resource allocation. We analyze matrix multiplication in the cloud and model I/O and local computation for individual tasks. We establish conditions under which the distribution of job completion times can be explicitly obtained. We further generalize these results to cases where analytic derivations are intractable. We develop models that emulate the multiplication procedure, allowing job times for different deployment parameter settings to be emulated after only witnessing a subset of tasks, or subsets of tasks for nearby deployment parameter settings. The modeling framework developed sheds new light on the problem of determining expected job completion time for sparse matrix multiplication.
Euler Integration with Applications to Statistical Shape Analysis and Imaging
This dissertation focuses on the use of Euler calculus in statistical shape analysis. The shapes have no metric structure, so to analyze them we need to define such a structure. Classical work on this has been done by Kendall, who imposed a metric structure by introducing landmarks on the shapes. These are points on the shape that have corresponding counterparts across the shapes. Such a representation has two drawbacks: it does not use all the information about the shape, and the choice of landmarks and correspondences is a challenging task that does not always have the right answer. One can create a digital version of Kendall’s shape spaces by looking at diffeomorphisms between the shapes. The obvious limitation of this method is that the shapes need to be diffeomorphic. The subject of this dissertation is a more general construction that makes use of the idea of integrating against the Euler characteristic. For shape analysis, this method was first proposed by Turner et al. In Chapter 2 we introduce an extension of the Euler calculus shape analysis framework for continuous type data. We show these lifted transforms retain the most important properties of the discrete transform, making them very well suited for statistical applications. We provide the necessary theoretical results as well as demonstrate the utility of this approach on real and simulated data. In Chapter 3 we present a first-ever subshape selection pipeline that relies on neither diffeomorphisms nor landmarks. This is achieved by first transforming the shapes with the aforementioned tools to a space where distances and inner products can be defined. With these tools we solve the statistical problem of feature selection. Finally, we pull back this evidence to the original shape space by an evidence reconstruction procedure. We provide a detailed study of the method on simulated data and apply it to a problem in geometric morphometrics.
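To make the idea of integrating against the Euler characteristic concrete, here is a minimal sketch of an Euler characteristic curve for a finite simplicial complex in one fixed direction: for each threshold t it sums (-1)^dim over the simplices whose vertices all have height at most t. Curves of this kind, collected over many directions, are the discrete building block of Euler-characteristic-based shape transforms. The function name and the input layout are assumptions made for this illustration, and the sketch does not implement the lifted transforms for continuous data described in the abstract.

```python
def euler_characteristic_curve(vertices, simplices, direction, thresholds):
    """Euler characteristic of sublevel sets of a height function.

    vertices:   list of coordinate tuples
    simplices:  list of tuples of vertex indices; must list every simplex
                of the complex, including the vertices as 1-tuples
    direction:  tuple defining the height function v -> <v, direction>
    thresholds: heights t at which to evaluate the curve
    """
    def height(index):
        return sum(c * d for c, d in zip(vertices[index], direction))

    entries = []
    for simplex in simplices:
        # A simplex enters the sublevel set once its highest vertex does.
        entry_height = max(height(v) for v in simplex)
        # (-1)^dim with dim = len(simplex) - 1.
        sign = 1 if len(simplex) % 2 == 1 else -1
        entries.append((entry_height, sign))

    return [
        sum(sign for entry_height, sign in entries if entry_height <= t)
        for t in thresholds
    ]


# A filled triangle swept in the x-direction.
triangle_vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
filled_triangle = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2), (0, 1, 2)]
curve = euler_characteristic_curve(
    triangle_vertices, filled_triangle, (1.0, 0.0), [-1.0, 0.0, 1.0]
)
# curve == [0, 1, 1]
```

Dropping the 2-simplex `(0, 1, 2)` leaves a hollow triangle, whose curve ends at 0 instead of 1; the curve therefore distinguishes the two shapes without any landmarks or correspondences.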
Two Applications of Summary Statistics: Integrating Information Across Genes and Confidence Intervals With Missing Data
Gene set enrichment methods are useful for the mapping of individual genes or proteins to pathways and signatures. We use this approach to study the expression levels of proteins encoded by different genes, and compare individuals that have Alzheimer’s disease (AD) to those that are cognitively normal (CN). Different gene sets might show differential enrichment in the two classes. A correlation statistic is computed for measuring the correlation of a sample to one class rather than to the other, with respect to a gene. This allows us to find the enrichment score for the sample with respect to an entire gene set, and to analyze the gene sets that are differentially expressed in the two classes. The linear model is a powerful tool that we use to estimate the correlation statistic, thus accounting for the class as well as other covariates such as the age and sex of the individual. We study the Jeffreys and Clopper-Pearson intervals for binomial proportions when we have missing data. We use multiple imputation (MI) to deal with missing data. Using simulation studies, we compare the MI Wilson, MI Clopper-Pearson, and MI Jeffreys intervals. We then show that the MI Wilson interval has the best repeated sampling properties of the three in the case of high missingness. In the case of low missingness, the MI Wilson and MI Clopper-Pearson intervals produce similar empirical coverage rates that are close to the nominal coverage. For a very low value of the binomial proportion, the Jeffreys interval has the largest coverage with the smallest average interval length.
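For orientation, the standard complete-data Wilson score interval for a binomial proportion can be sketched as follows. This is only the textbook interval with no missing data; the multiple-imputation variants compared in the abstract combine estimates across imputed datasets and are not reproduced here. The function name and its arguments are assumptions made for this illustration.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (complete data)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")

    # Two-sided standard normal quantile, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n

    denominator = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denominator
    half_width = (z / denominator) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    # Clip to guard against rounding slightly outside [0, 1].
    return max(0.0, center - half_width), min(1.0, center + half_width)


lower, upper = wilson_interval(5, 10)
```

Unlike the simple Wald interval, the Wilson interval is centered on a value shrunk towards 1/2 and stays inside [0, 1] even when the observed proportion is 0 or 1.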
Linear Dimension Reduction Approximately Preserving Level-Sets of the 1-Norm
We choose a family of matrices F : \R^D \to \R^k and a metric \rho on \R^k such that, with high probability, \rho(F(x), F(y)) is a strictly concave increasing function of ||x - y||_1 > 8\epsilon^2 for x, y \in \R^D, up to a multiplicative error of 1 \pm \epsilon. In particular, if X is a set of N points in \R^D, the target dimension k may be chosen as C \ln^2(N^{c+2}) / (\epsilon^2 (1 - \epsilon)^2), with C a constant and \epsilon > N^{-c}, to ensure all pairs of points of X of distance at least 8\epsilon^2 are treated this way, with failure probability at most N^{-c} for c > 1. In some cases, distances smaller than 8\epsilon^2 can also be addressed. For distances larger than \sqrt{1 + \epsilon}, the target dimension can be reduced to C \ln(N^{c+2}) / (\epsilon^2 (1 - \epsilon)^2).
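To get a feel for how the stated target dimension scales, the sketch below evaluates the two bounds from the abstract, using ln(N^{c+2}) = (c + 2) ln N. The abstract does not give the constant C, so `constant=1.0` is a placeholder assumption and the returned numbers only indicate growth rates, not usable dimensions. The function name and the `large_distances_only` flag are likewise assumptions made for this illustration.

```python
import math


def target_dimension(num_points, eps, c, constant=1.0, large_distances_only=False):
    """Evaluate k = C * ln^p(N^(c+2)) / (eps^2 * (1 - eps)^2).

    p = 2 for the general bound, p = 1 for the reduced bound that the
    abstract states for distances larger than sqrt(1 + eps).
    constant stands in for the unspecified constant C.
    """
    if num_points < 2:
        raise ValueError("need at least two points")
    if c <= 1:
        raise ValueError("the abstract requires c > 1")
    if not num_points ** (-c) < eps < 1:
        raise ValueError("need N^(-c) < eps < 1")

    log_term = (c + 2) * math.log(num_points)  # ln(N^(c+2))
    power = 1 if large_distances_only else 2
    k = constant * log_term ** power / (eps ** 2 * (1 - eps) ** 2)
    return math.ceil(k)


k_general = target_dimension(1000, 0.1, 2)
k_large = target_dimension(1000, 0.1, 2, large_distances_only=True)
```

With C held fixed, the general bound grows like the square of ln N while the large-distance bound grows only linearly in ln N; both blow up as \epsilon approaches 0 or 1.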
Linear Subspace and Manifold Learning via Extrinsic Geometry
In the last few decades, data analysis techniques have had to expand to handle large sets of data with complicated structure. This includes identifying low-dimensional structure in high-dimensional data, analyzing shape and image data, and learning from or classifying large corpora of text documents. Common Bayesian and machine learning techniques rely on using the unique geometry of these data types; however, departing from Euclidean geometry can result in both theoretical and practical complications. Bayesian nonparametric approaches can be particularly challenging in these areas. This dissertation proposes a novel approach to these challenges by working with convenient embeddings of the manifold-valued parameters of interest, commonly making use of an extrinsic distance or measure on the manifold. Carefully selected extrinsic distances are shown to reduce the computational cost and to increase the accuracy of inference. The embeddings are also used to yield straightforward derivations for nonparametric techniques. The methods developed are applied to subspace learning in dimension reduction problems, planar shapes, shape constrained regression, and text analysis.
Construction of Objective Bayesian Prior from Bertrand’s Paradox and the Principle of Indifference
The Principle of Indifference, which may be naïvely interpreted as the requirement to assign the same probability to different outcomes of a probabilistic event, is applied in Objective Bayesian analysis to form prior distributions (priors). However, though it may be desired that such priors are truly “objective”, they usually are not. This paper compares a number of common objective priors (uniform, invariant, reference, and maximum entropy priors) and examines them from an epistemological perspective to find what premises are implied if these objective priors are taken as an implementation of the Principle of Indifference that achieves complete objectivity in the resulting statistical analysis procedure. Then, given a conventional ignorance or lack-of-information interpretation of objectivity, it is found that these priors are indeed not completely “objective”. It may be possible to obtain a weaker, or more general, a priori analysis for ignorance such that there can be conceptually completely objective priors.
