Multivariate Latent Variable Transition Models of Longitudinal Mixed Data: an Analysis on Alcohol Use Disorder
Alcohol abuse is a dangerous habit in young people. The National Youth Survey is a longitudinal American study devoted in part to the investigation of alcohol disorder over time. The symptoms of alcohol disorder are measured by binary and ordinal items. It is well recognized in the literature that alcohol abuse can be measured by a latent construct; therefore, generalized latent variable models for mixed data represent the ideal framework for analyzing these data. However, it may be desirable to cluster individuals according to the severity of the alcohol use disorder and to investigate how the groups vary over time. We present a new methodological framework that includes two levels of latent variables: one continuous hidden variable for dimension reduction and clustering and a discrete random variable accounting for the dynamics of the alcohol disorder symptoms. The effect of covariates is also measured and a testing procedure for the temporal assumption is developed. This work addresses three important issues. First, it represents a unified framework for the analysis of longitudinal multivariate mixed data. Second, it captures and models the unobserved heterogeneity of the data. Finally, it describes the dynamics of the data through the definition of latent constructs.
Quantile-based classifiers
Classification with small samples of high-dimensional data is important in many application areas. Quantile classifiers are distance-based classifiers that require a single parameter, regardless of the dimension, and classify observations according to a sum of weighted componentwise distances of the components of an observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this choice is consistent for the classification rule with the asymptotically optimal quantile and that under some assumptions, as the number of variables goes to infinity, the probability of correct classification converges to unity. The effect of skewness of the distributions of the predictor variables is discussed. The optimal quantile classifier gives low misclassification rates in a comprehensive simulation study and in a real-data application.
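As a rough illustration of the rule described in this abstract, the Python sketch below implements a componentwise quantile classifier with NumPy. The function names, the grid of candidate percentages, and the synthetic data are assumptions made for illustration and are not the authors' implementation. The distance is the standard quantile (check) loss, and all variables are weighted equally.

```python
import numpy as np

def quantile_distance(x, q, theta):
    """Check-loss distance: theta*(x-q) if x > q, else (1-theta)*(q-x)."""
    diff = x - q
    return np.where(diff > 0, theta * diff, (theta - 1.0) * diff)

def fit_quantiles(X, y, theta):
    """Within-class theta-quantile of every variable (one row per class)."""
    classes = np.unique(y)
    quantiles = np.stack([np.quantile(X[y == c], theta, axis=0) for c in classes])
    return classes, quantiles

def predict(X, classes, quantiles, theta):
    """Assign each row to the class with the smallest summed quantile distance."""
    # dist[i, k] = sum over variables j of the distance from X[i, j]
    # to the theta-quantile of variable j in class k
    dist = quantile_distance(X[:, None, :], quantiles[None, :, :], theta).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

def choose_theta(X, y, grid=None):
    """Pick the percentage that minimizes the training misclassification rate."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # hypothetical candidate grid
    errors = []
    for theta in grid:
        classes, quantiles = fit_quantiles(X, y, theta)
        errors.append(np.mean(predict(X, classes, quantiles, theta) != y))
    best = int(np.argmin(errors))
    return float(grid[best]), float(errors[best])

# Synthetic two-class data (an assumption, used only to exercise the code).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
               rng.normal(3.0, 1.0, size=(50, 5))])
y = np.repeat([0, 1], 50)
best_theta, train_error = choose_theta(X, y)
print(best_theta, train_error)
```

Because the only tuning parameter is the percentage `theta`, the grid search stays cheap however many variables there are.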
A supervised classification strategy based on the novel directional distribution depth function
Deep mixtures of unigrams for uncovering topics in textual data
Mixtures of unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we developed a deep version of mixtures of unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with further deeper latent layers; the proposal is derived in a Bayesian framework. The behavior of the deep mixtures of unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to semantic analysis, partition around medoids, mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward’s method with cosine dissimilarity, latent Dirichlet allocation, mixtures of unigrams estimated via the EM algorithm, spectral clustering and affinity propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy
Dealing with overdispersion in multivariate count data
Overdispersion in multivariate count data is a challenging problem. It plays a central role mainly because of the relevance of modern technology-based data, such as Next Generation Sequencing and textual data from the web or digital collections. A comprehensive analysis of the likelihood-based models for extra-variation data is presented. Particular attention is paid to the models feasible for high-dimensional data. A new approach together with its parametric-estimation procedure is proposed. It can be viewed as a deeper version of the Dirichlet-Multinomial distribution and it leads to important results that allow a better approximation of the observed variability. The proposed model is compared with existing strategies through two different simulation studies and an empirical data set, which confirm its better capability to describe overdispersion.
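To make the notion of overdispersion concrete, the sketch below simulates the ordinary Dirichlet-Multinomial distribution (the baseline named in the abstract, not the deeper model the paper proposes) and compares its variance with that of a plain multinomial. The function names and parameter values are illustrative assumptions. The comparison relies on the standard result that each count has variance n p_j (1 - p_j) (n + a0) / (1 + a0), where a0 is the sum of the Dirichlet parameters.

```python
import numpy as np

def sample_dirichlet_multinomial(n_trials, alpha, n_draws, rng):
    """Draw counts by first sampling p ~ Dirichlet(alpha), then Multinomial(n, p)."""
    alpha = np.asarray(alpha, dtype=float)
    draws = np.empty((n_draws, alpha.size), dtype=int)
    for i in range(n_draws):
        p = rng.dirichlet(alpha)
        draws[i] = rng.multinomial(n_trials, p)
    return draws

def theoretical_inflation(n_trials, alpha):
    """Variance inflation factor of the Dirichlet-Multinomial over the multinomial."""
    alpha0 = float(np.sum(alpha))
    return (n_trials + alpha0) / (1.0 + alpha0)

# Illustrative settings (assumptions, not taken from the paper).
rng = np.random.default_rng(1)
n_trials = 50
alpha = np.array([2.0, 3.0, 5.0])
counts = sample_dirichlet_multinomial(n_trials, alpha, 5000, rng)

p_mean = alpha / alpha.sum()
multinomial_var = n_trials * p_mean * (1.0 - p_mean)
empirical_inflation = counts.var(axis=0) / multinomial_var

print("empirical inflation:", empirical_inflation)
print("theoretical inflation:", theoretical_inflation(n_trials, alpha))
```

An inflation factor well above 1 is the extra variation that a plain multinomial cannot capture.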
Does Tourism Consumption Behaviour Mirror Differences in Living Standards?
Based on the theoretical foundation of well-being measurement, the study explores differences in living standards by analysing the distribution of tourism expenditure. A mixture of regression models is used to explore the heterogeneity in tourism consumption by identifying groups of families with similar tourism consumption behaviour as a function of certain socio-demographic and economic factors. The empirical analysis, performed on Italian expenditure data, suggests that there are three different patterns of consumption behaviour conditional to the socio-demographic and economic covariates in the tourism market and that differences in tourism consumption between groups of households mirror inequalities in living standards
INEQUALITIES AND TOURISM CONSUMPTION BEHAVIOUR: A MIXTURE MODEL ANALYSIS
The criticism of income as a measure of well-being and of trends in living standards is well known, and scholars have recently been working to define measures that better assess material well-being and differences in living standards. Recent evidence shows that individuals improve their well-being significantly if they are able to spend on higher-order goods and services such as tourism and leisure activities. In light of this, our study explores differences in living standards in Italy by analysing the distribution of tourism expenditure. To this end, Mixtures of Regression Models were used to investigate whether there is unobserved heterogeneity in tourism consumption by identifying groups of families with similar tourism consumption behaviour as a function of some socio-demographic and economic factors. The analysis shows that tourism has not yet become part of the lifestyle of Italians.
Bayesian smooth-and-match inference for ordinary differential equations models linear in the parameters
Dynamic processes are crucial in many empirical fields, such as in oceanography, climate science, and engineering. Processes that evolve through time are often well described by systems of ordinary differential equations (ODEs). Fitting ODEs to data has long been a bottleneck because the analytical solution of general systems of ODEs is often not explicitly available. We focus on a class of inference techniques that uses smoothing to avoid direct integration. In particular, we develop a Bayesian smooth-and-match strategy that approximates the ODE solution while performing Bayesian inference on the model parameters. We incorporate in the strategy two main sources of uncertainty: the noise level of the measured observations and the model approximation error. We assess the performance of the proposed approach in an extensive simulation study and on a canonical data set of neuronal electrical activity
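The core smooth-and-match idea can be sketched without any Bayesian machinery: smooth the observations, differentiate the smoother, and regress the estimated derivative on known functions of the smoothed state. Because the ODE is linear in the parameters, that last step is ordinary least squares. The Python sketch below does exactly this, and it is a plain least-squares simplification under stated assumptions rather than the Bayesian procedure of the paper. The polynomial smoother, the function name `smooth_and_match`, and the exponential-decay example are all illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

def smooth_and_match(t, y, basis_fns, degree=4):
    """Estimate theta in dx/dt = sum_k theta_k * g_k(x) without integrating the ODE.

    t, y      : observation times and noisy scalar observations
    basis_fns : list of known functions g_k (hypothetical user-supplied callables)
    degree    : degree of the polynomial smoother (an illustrative choice)
    """
    smoother = Polynomial.fit(t, y, degree)   # smoothing step
    x_hat = smoother(t)                       # smoothed state
    dx_hat = smoother.deriv()(t)              # derivative of the smoother
    G = np.column_stack([g(x_hat) for g in basis_fns])
    theta, *_ = np.linalg.lstsq(G, dx_hat, rcond=None)  # matching step
    return theta

# Example: dx/dt = -theta * x with theta = 1 and x(0) = 1, so x(t) = exp(-t).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 50)
y = np.exp(-t) + rng.normal(scale=0.01, size=t.size)
theta_hat = smooth_and_match(t, y, [lambda x: -x])
print(theta_hat)
```

This simplified version returns only a point estimate; the two sources of uncertainty that the abstract mentions (observation noise and approximation error) are not propagated here.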
Infinite mixtures of infinite factor analysers
Factor-analytic Gaussian mixtures are often employed as a model-based approach to clustering high-dimensional data. Typically, the numbers of clusters and latent factors must be fixed in advance of model fitting. The pair which optimises some model selection criterion is then chosen. For computational reasons, having the number of factors differ across clusters is rarely considered. Here the infinite mixture of infinite factor analysers (IMIFA) model is introduced. IMIFA employs a Pitman-Yor process prior to facilitate automatic inference of the number of clusters using the stick-breaking construction and a slice sampler. Automatic inference of the cluster-specific numbers of factors is achieved using multiplicative gamma process shrinkage priors and an adaptive Gibbs sampler. IMIFA is presented as the flagship of a family of factor-analytic mixtures. Applications to benchmark data, metabolomic spectral data, and a handwritten digit example illustrate the IMIFA model's advantageous features. These include obviating the need for model selection criteria, reducing the computational burden associated with the search of the model space, improving clustering performance by allowing cluster-specific numbers of factors, and uncertainty quantification.
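The stick-breaking construction mentioned in this abstract can be illustrated in isolation. The sketch below draws truncated mixing weights from a Pitman-Yor process prior using the standard representation V_k ~ Beta(1 - d, alpha + k d), with weight k equal to V_k times the product of (1 - V_j) over j < k. It covers only the prior on the weights: the function name, the truncation level, and the parameter values are assumptions, and the slice sampler and factor-analytic parts of IMIFA are not shown.

```python
import numpy as np

def pitman_yor_stick_breaking(alpha, discount, n_sticks, rng):
    """Truncated stick-breaking weights for a Pitman-Yor process.

    alpha    : concentration parameter (must exceed -discount)
    discount : discount parameter d in [0, 1); d = 0 gives a Dirichlet process
    n_sticks : truncation level (hypothetical; the process itself is infinite)
    """
    if not 0.0 <= discount < 1.0 or alpha <= -discount:
        raise ValueError("need 0 <= discount < 1 and alpha > -discount")
    k = np.arange(1, n_sticks + 1)
    v = rng.beta(1.0 - discount, alpha + discount * k)   # stick proportions
    # Length of stick remaining before each break: prod_{j<k} (1 - v_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

rng = np.random.default_rng(2)
dp_weights = pitman_yor_stick_breaking(alpha=1.0, discount=0.0, n_sticks=200, rng=rng)
py_weights = pitman_yor_stick_breaking(alpha=1.0, discount=0.5, n_sticks=200, rng=rng)
print(dp_weights[:5], dp_weights.sum())
print(py_weights[:5], py_weights.sum())
```

A positive discount gives the weights a heavier tail, so more small clusters receive non-negligible mass than under a Dirichlet process with the same concentration.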
Regional Disparities in Consumption Behaviour: Italian Households in Time of Crisis
A salient feature of the severe downturn caused by the Great Recession is the decline in consumer spending. The extent of the cutbacks in consumption expenditure differs among regions, as does the distribution pattern across households.
This paper examines how the 2008-9 crisis affected regional disparities in household consumption behaviour by analysing the distribution of the real total expenditure. Specifically, a mixture of regression models is used to explore the heterogeneity in total expenditure across regions and households by identifying groups of families with similar consumption behaviour as a function of certain socio-demographic and economic factors. The empirical analysis, performed on Italian expenditure data, suggests that there are regional differences in the pattern of consumption behaviour in different macro-areas in Italy and that within macro-areas differences in consumption behaviour and in its determinants among the groups reflect differences in consumption reaction across regions and households
