
    Bayesian principal curve clustering by species-sampling mixture models

    No full text
    In this work we are interested in clustering data whose support is “curved”. For this purpose, we follow a Bayesian nonparametric approach by considering a species sampling mixture model. Our first goal is to define a general, flexible class of distributions that can model data from clusters with nonstandard shapes. To this end, we extend the definition of principal curve given in [8] (Tibshirani 1992) to a Bayesian framework. We propose a new hierarchical model, where the data in each cluster are parametrically distributed around the Bayesian principal curve, and the prior cluster assignment is given on the latent variables at the second level of the hierarchy according to a species sampling model. As an application, we consider the detection of seismic faults using data from Italian earthquake catalogues.

    Capture-recapture models with heterogeneous temporary emigration

    Full text link
    We propose a novel approach for modeling capture-recapture (CR) data on open populations that exhibit temporary emigration, while also accounting for individual heterogeneity to allow for differences in visit patterns and capture probabilities between individuals. Our modeling approach combines changepoint processes (fitted using an adaptive approach) for inferring individual visits with Bayesian mixture modeling (fitted using a nonparametric approach) for identifying clusters of individuals with similar visit patterns or capture probabilities. The proposed method is extremely flexible, as it can be applied to any CR dataset and does not rely on specialized sampling schemes such as Pollock's robust design. We fit the new model to motivating data on salmon anglers collected annually at the Gaula river in Norway. Our results when analyzing data from the 2017, 2018, and 2019 seasons reveal two clusters of anglers, consistent across years, with substantially different visit patterns. Most anglers are allocated to the "occasional visitors" cluster, making infrequent and shorter visits with a mean total length of stay at the river of around seven days, whereas there is also a small cluster of "super visitors," with regular and longer visits and a mean total length of stay of around 30 days in a season. Our estimate of the probability of catching salmon whilst at the river is more than three times higher than that obtained when using a model that does not account for temporary emigration, giving us a better understanding of the impact of fishing at the river. Finally, we discuss the effect of the COVID-19 pandemic on the angling population by modeling data from the 2020 season. Supplementary materials for this article are available online.

    Bayesian nonparametric covariate driven clustering (Un modello bayesiano nonparametrico per clustering in presenza di covariate)

    Full text link
    In this paper we introduce a Bayesian model for clustering individuals with covariates. This model combines the joint distribution of data in the sample, given the parameter and covariates, with a prior for this parameter. Here, the partition of the sample subjects is the parameter, and the prior we assume encourages two subjects to co-cluster when they have similar covariates. Cluster estimates are based on the posterior distribution of the random partition, given the data. As an application, we fit our model to a dataset on gap times between recurrent blood donations from AVIS (Italian Volunteer Blood-donors Association), the largest provider of blood donations in Italy.

    Is infinity that far? A Bayesian nonparametric perspective of finite mixture models

    No full text
    Mixture models are one of the most widely used statistical tools when dealing with data from heterogeneous populations. Following a Bayesian nonparametric perspective, we introduce a new class of priors: the Normalized Independent Point Process. We investigate the probabilistic properties of this new class and present many special cases. In particular, we provide an explicit formula for the distribution of the implied partition, as well as the posterior characterization of the new process in terms of the superposition of two discrete measures. We also provide consistency results. Moreover, we design both a marginal and a conditional algorithm for finite mixture models with a random number of components. These schemes are based on an auxiliary variable MCMC, which allows handling the otherwise intractable posterior distribution and overcomes the challenges associated with the Reversible Jump algorithm. We illustrate the performance and the potential of our model in a simulation study and on real data applications.

    A conditional algorithm for Bayesian finite mixture models via normalized point process (Un algoritmo per la stima bayesiana di misture finito dimensionali costruite mediante normalizzazione di processi di punto)

    No full text
    Modelling via finite mixtures is one of the most fruitful Bayesian approaches, particularly useful when there is unobserved heterogeneity in the data. The most popular algorithm under this model is the reversible jump MCMC, which can be nontrivial to design, especially in high-dimensional spaces. In this work, we first introduce a class of finite discrete random probability measures obtained by normalization of finite point processes. Then, we use the new class as the mixing measure of a mixture model and derive its posterior characterization. The resulting new class encompasses the popular finite Dirichlet mixture model; here, to compute the posterior, we propose an alternative to the reversible jump. In particular, borrowing notation from the nonparametric Bayesian literature, we set up a conditional MCMC algorithm based on the posterior characterization of the unnormalized point process. To show the performance of our algorithm and the flexibility of the model, we illustrate some examples on the popular Galaxy dataset.

    Model-based clustering of categorical data based on the Hamming distance

    No full text
    A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, simplifying computation relative to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performance is assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

    Personalized treatment selection via product partition models with covariates

    No full text
    Precision medicine is an approach for disease treatment that defines treatment strategies based on the individual characteristics of the patients. Motivated by an open problem in cancer genomics, we develop a novel model that flexibly clusters patients with similar predictive characteristics and similar treatment responses; this approach identifies, via predictive inference, which one among a set of treatments is better suited for a new patient. The proposed method is fully model based, avoiding the underestimation of uncertainty that arises when treatment assignment is performed by adopting heuristic clustering procedures, and belongs to the class of product partition models with covariates, here extended to include the cohesion induced by the normalized generalized gamma process. In simulation studies, the method performs particularly well in scenarios characterized by considerable heterogeneity of the predictive covariates. A cancer genomics case study illustrates the potential benefits in terms of treatment response yielded by the proposed approach. Finally, being model based, the approach allows estimating cluster-specific response probabilities and then identifying patients more likely to benefit from personalized treatment.

    Cluster Analysis of Curved-Shaped Data with Species-Sampling Mixture Models

    No full text
    We are interested in clustering data whose support is “curved”. Recently we have addressed this problem, introducing a model which combines two ingredients: species sampling mixtures of parametric densities on one hand, and a deterministic clustering procedure (DBSCAN) on the other. In short, under this model two observations share the same cluster if the distance between the densities corresponding to their latent parameters is smaller than a threshold. However, in this case, the prior cluster assignment is based on the geometry of the space of kernel densities rather than on a direct elicitation of a random partition prior. Following the latter alternative, a new hierarchical model for clustering is proposed here, where the data in each cluster are parametrically distributed around a curve (principal curve), and the prior cluster assignment is given on the latent variables at the second level of the hierarchy according to a species sampling model. These two mixture models are compared here with respect to cluster estimates obtained for a simulated bivariate dataset from two clusters, one being banana-shaped.

    Forecasting short-term defaults of firms in a commercial network via Bayesian spatial and spatio-temporal methods

    No full text
    To protect financial institutions from unexpected credit losses, during the monitoring phase of granted loans it is of primary importance to foresee any evidence of a contagion of liquidity distress across a network of firms. This term denotes a lack of solvency in one firm (e.g., a customer) that propagates to other firms (e.g., its suppliers), which could consequently face challenges in repaying their own granted loans. In this paper, we look for evidence of contagion of liquidity distress in an Intesa Sanpaolo proprietary dataset by means of Bayesian spatial and spatio-temporal models. Our results indicate that such models can detect cases of distress not yet apparent from covariate information collected on the firms by instead borrowing information from the network, leading to improved forecasting of short-term default with respect to state-of-the-art methods.