Recommended from our members
Comparison of Different Hierarchical Dirichlet Process Implementations
The Hierarchical Dirichlet Process (HDP) is an important Bayesian nonparametric model for grouped data, such as a corpus or document collection. It can be very useful in an NLP setting where we are trying to classify documents in a corpus. A great advantage of HDP is its flexibility: we do not need to specify the number of components (or topics) in advance and can instead let the data decide. As with other Bayesian nonparametric models, exact posterior inference is intractable; instead, we can use Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution, and the choice of MCMC method can affect the performance of the HDP implementation. In this thesis, we compare four different HDP samplers by applying them to a set of simulated data and a set of real data, comparing the mixing time of their NMI (normalized mutual information, which can be considered as the "amount of information" obtained about one variable by observing the other variable) and perplexity
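The two evaluation metrics named in this abstract can be sketched in a few lines. The following is a minimal illustration, not the thesis's implementation: the function names are hypothetical, NMI is normalized by the geometric mean of the two entropies (one of several common conventions, chosen here as an assumption), and perplexity is taken as the exponential of the negative mean per-token log-likelihood.

```python
from collections import Counter
from math import exp, log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length labelings."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with geometric-mean normalization (an assumed convention)."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of held-out tokens."""
    return exp(-sum(token_log_probs) / len(token_log_probs))
```

NMI is invariant to relabeling, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` is 1 up to floating-point error, while two independent labelings such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give 0. Tracking these values across sampler iterations is one way to observe mixing behavior.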
Testing in Network Models with Community Structure
We consider the problem of testing in network models with community structures. In the first part, we propose a goodness-of-fit test for degree-corrected stochastic block models (DCSBM). The test is based on an adjusted chi-square statistic for measuring equality of means among groups of multinomial distributions. In the context of network models, the number of multinomials grows much faster than the number of observations per multinomial, which corresponds to the degree of a node; hence the setting deviates from classical asymptotics. We show that a simple adjustment allows the statistic to converge in distribution, under the null, as long as the harmonic mean of the node degrees grows to infinity. When applied sequentially, the test can also be used to determine the number of communities. Since the test statistic does not rely on a specific alternative, its utility goes beyond sequential testing, and it can be used to simultaneously test against a wide range of alternatives outside the DCSBM family. We show the effectiveness of the approach by extensive numerical experiments with simulated and real data. In the second part, we provide theoretical guarantees for label consistency in generalized k-means problems, with an emphasis on the overfitted case where the number of clusters used by the algorithm is more than the ground truth. We provide conditions under which the estimated labels are close to a refinement of the true cluster labels. We consider both exact and approximate recovery of the labels. Our results hold for any constant-factor approximation to the k-means problem. The results are also model-free and only based on bounds on the maximum or average distance of the data points to the true cluster centers. These centers themselves are loosely defined and can be taken to be any set of points for which the aforementioned distances can be controlled. We show the usefulness of the results with applications to some manifold clustering problems
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
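The difference between the two counting schemes can be illustrated with a short sketch. This is a hedged example under stated assumptions rather than the study's procedure: the function name `cocitation_counts`, the `max_authors` parameter, and the sample data are all hypothetical, each citing paper contributes at most one count per author pair, and co-authors of the same cited work are treated as co-cited (the study may handle that case differently).

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists, max_authors=1):
    """Count author co-citations over a set of citing papers.

    reference_lists: one entry per citing paper; each entry is a list of
        cited works, and each cited work is an ordered list of author names.
    max_authors: how many leading authors of each cited work to consider
        (1 = traditional first-author counting, 5 = first five authors).
    """
    counts = Counter()
    for cited_works in reference_lists:
        # Collect the authors this citing paper is considered to cite.
        authors = set()
        for work_authors in cited_works:
            authors.update(work_authors[:max_authors])
        # Assumption: every unordered pair is co-cited once per citing paper.
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: two citing papers, each citing two works.
papers = [
    [["Smith", "Lee"], ["Garcia", "Chen", "Lee"]],
    [["Smith"], ["Chen", "Garcia"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With this sample data, first-author counting yields only two author pairs, while counting the first five authors yields six, which shows how the inclusive scheme produces a denser co-citation matrix from the same citing papers.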
Unsupervised Methods on Structured Data
Classical unsupervised algorithms, such as k-means and PCA, utilize a simple generative model where the sampling distribution is determined by a collection of unobserved, latent features. While this paradigm is powerful, it has the following consequence for applied settings: any structured trend in the data must be explained by the latent features and the assumptions therein. This requirement complicates the analysis of structured data sources, such as images, videos, and networks, especially when the latent features of interest do not govern every structured aspect of the data.
In this work, we consider scenarios where the latent features may be partially decoupled from the structure of the data. Under this new setting we develop new algorithmic improvements and insights for the following problems:
• Tissue intensity recovery for contaminated MRIs, where each pixel intensity is determined by an underlying tissue type and a spatially varying gain field.
• Semi-supervised node classification with graph aggregated features, where nodes are assumed to follow a community-based structure
Problems in Epidemic Inference on Complex Networks
In this PhD dissertation, we study epidemics on contact networks through the lens of statistical inference. The current work is an attempt to infer the propagation parameters following the outset of an epidemic spread. My contributions build on progress in the mathematical modeling of infectious outbreaks, information diffusion, and viral habit formation. These achievements paved the way to forecasting and containing the spread of infectious diseases and to optimizing viral marketing campaigns. What distinguishes this work is the forensic view that aims to infer the network or the propagation parameters from the final stage of an epidemic. We study here multiple problems of this kind, including epidemic source identification and epidemic network reconstruction. Such problems are NP-hard by nature, and previous contributions are ad hoc and inconclusive for realistic networks, either in size or structure. This work proposes new methods that estimate the parameters of interest in polynomial time with arbitrary accuracy. We provide theoretical error bound guarantees for some of the solutions. We accompany the results with comparative simulations on popular networks from social media, urban infrastructure, and disease pandemics
Optimal bipartite network clustering
We consider the problem of bipartite community detection in networks, or more generally the network biclustering problem. We present a fast two-stage procedure based on spectral initialization followed by the application of a pseudo-likelihood classifier twice. Under mild regularity conditions, we establish the weak consistency of the procedure (i.e., the convergence of the misclassification rate to zero) under a general bipartite stochastic block model. We show that the procedure is optimal in the sense that it achieves the optimal convergence rate that is achievable by a biclustering oracle, adaptively over the whole class, up to constants. The optimal rate we obtain sharpens some of the existing results and generalizes others to a wide regime of average degree growth. As a special case, we recover the known exact recovery threshold in the regime of sparsity. To obtain the general consistency result, as part of the provable version of the algorithm, we introduce a block partitioning scheme that is also computationally attractive, allowing for distributed implementation of the algorithm without sacrificing optimality. The provable version of the algorithm is derived from a general blueprint for pseudo-likelihood biclustering algorithms that employ simple EM type updates. We show the effectiveness of this general class by numerical simulations
An Analysis of Community Detection Methods in Multi-layer Networks
Community structures commonly exist in real-world networks such as brain networks, social networks, or trade networks. Since information about a real-world network is often captured through different measurements or views, such networks often have a multi-layer structure in which the layers share the same community assignment. In this scenario, being able to recover the community assignment consistently helps us understand the properties and behaviors of the network so that we can exploit these networks more effectively. In this thesis, we apply multiple methods to the community detection task in different scenarios and discuss their pros and cons by comparing their results. We also propose and compare several rank-estimation methods, which are used to determine the number of communities in a network
Prediction Model Development of Seismic Building Responses
The ability to predict building responses to an earthquake could be used to identify building damage, which would greatly reduce human inspection effort and operational downtime. This thesis explores various machine learning methods for formulating prediction models of seismic building responses over the greater Los Angeles region using data from three actual earthquake scenarios (1994 Northridge, USA; 1999 Chi-Chi, Taiwan; and 2000 Tottori, Japan). The results show that the geospatial interpolation method kriging outperforms the other candidates across all earthquakes in both accuracy and model stability, using criteria such as cross-validation and median absolute residual difference. Some inconsistencies in accuracy levels between different earthquakes are caused by 1) earthquake characteristics and 2) the representativeness of the data samples of each event
Analysis of Nonlinear Control Systems: From Lifting Operators to Learning Interaction Laws in Networks
This dissertation explores a diverse set of problems in dynamical systems, control, estimation, and learning theory. Part I studies nonlinear systems using operator theory, specifically Carleman Linearization. Chapter one delves into the convergence of Carleman Linearization over a characterizable time horizon. The findings show that the Carleman Linearization converges to the original solution for general time-varying nonlinear systems with an analytic right-hand side over a finite time horizon. The third chapter introduces a new method to solve the Hamilton-Jacobi-Bellman Equation using the tools from Carleman Linearization. The analysis demonstrates the convergence of the method and proves that the control input obtained stabilizes the nonlinear dynamics after a certain truncation length. The second part of this dissertation focuses on the learning of dynamical systems in the presence of uncertainties. Chapter three explores techniques to enhance the robustness of recurrent neural networks (RNNs) by employing concepts from control and estimation theories. Initially, the chapter outlines how to measure the robustness of RNNs and then introduces a novel algorithm to estimate the output covariance and biases for RNNs. We then use the gradient descent algorithm to minimize the covariances along with the biases to obtain a robust RNN model. In Chapter four, we analyze the learning of nonlinear couplings in a network of interacting agents in a non-parametric set-up, where only a single sample trajectory is available. The study demonstrates that for geometrically ergodic networks, assuming the compactness of the hypothesis space, learning algorithms converge even when only a single sample trajectory is available. Additionally, we reveal that if the hypothesis space is convex and coercive, the empirical estimator converges uniquely.
Part III of this dissertation is dedicated to developing and analyzing a systematic framework to study the risk of undesired events in the network of interconnected agents. In Chapter six, we explore the inherent risk associated with non-minimum phase systems. Using the systematic risk framework, we then investigate the trade-offs between collision risk, network topology, control cost, and non-minimum phase zeros of the system. In the last Chapter, we propose a framework to evaluate the risk of misperception resulting from noisy environmental observations. We employ the Expected Shortfall (Average Value-at-Risk) measure to evaluate the risk of collision between pairs of vehicles and the risk of violating traffic laws for each vehicle under possible misperceptions. Obtaining an explicit expression for the risk measure allows us to investigate potential trade-offs between overall misperception-induced risks and network architecture
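For readers unfamiliar with the Expected Shortfall (Average Value-at-Risk) measure mentioned above, the following is a minimal empirical sketch. It is not the explicit expression derived in the dissertation: the function name `expected_shortfall`, the sample-based estimator, and the convention of averaging the worst `(1 - alpha)` fraction of observed losses are assumptions made for illustration.

```python
from math import ceil


def expected_shortfall(losses, alpha=0.95):
    """Empirical Expected Shortfall at confidence level alpha.

    Averages the worst (1 - alpha) fraction of the observed losses,
    where larger values mean worse outcomes. This is a simple
    sample-based estimator, assumed here for illustration.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # Round before ceil to avoid floating-point artifacts such as
    # (1 - 0.95) * 20 evaluating to slightly more than 1.
    tail_size = max(1, ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)
```

For ten losses `1, 2, ..., 10` and `alpha=0.8`, the tail consists of the two largest losses, so the estimate is 9.5. Unlike Value-at-Risk, which reports only the tail threshold, this measure reflects how severe the outcomes beyond the threshold are.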
