Dimensionality reduction and simultaneous classification approaches for complex data: methods and applications
Statistical learning (SL) is the study of the generalizable extraction of knowledge from data (Friedman et al. 2001). Learning is used when human expertise does not exist, when humans are unable to explain their expertise, when the solution changes over time, or when the solution needs to be adapted to particular cases. The principal algorithms used in SL are classified as: (i) supervised learning (e.g., regression and classification), which is trained on labelled examples, i.e., inputs for which the desired output is known. In other words, a supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs, which can then be used speculatively to generate an output for previously unseen inputs; (ii) unsupervised learning (e.g., association and clustering), which operates on unlabelled examples, i.e., inputs for which the desired output is unknown; in this case the objective is to discover structure in the data (e.g., through a cluster analysis), not to generalize a mapping from inputs to outputs; (iii) semi-supervised learning, which combines both labelled and unlabelled examples to generate an appropriate function or classifier.
In a multidimensional context, when the number of variables is very large, or when it is believed that some of them do not contribute much to identifying the group structure in the data set, researchers apply a continuous model for dimensionality reduction, such as principal component analysis, factor analysis, or correspondence analysis, and then a discrete clustering model, such as K-means or mixture models, on the computed object scores. This sequential approach is called tandem analysis (TA) by Arabie & Hubert (1994).
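The two sequential steps of TA can be illustrated with a minimal sketch. It assumes PCA as the reduction step and a plain Lloyd-style K-means as the clustering step; the function names (`pca_scores`, `kmeans`, `tandem_analysis`) and the synthetic data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project the column-centered data onto its leading principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(Y, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every object to every centroid.
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster happens to become empty.
        new_centroids = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def tandem_analysis(X, n_components, k, seed=0):
    """Step 1: reduce dimensionality. Step 2: cluster the object scores."""
    scores = pca_scores(X, n_components)
    labels, _ = kmeans(scores, k, seed=seed)
    return scores, labels

# Synthetic illustration only: two well-separated groups in five variables.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 5)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 5)),
])
scores, labels = tandem_analysis(X, n_components=2, k=2)
```

Because the components are chosen without reference to the partition, nothing in this sketch guarantees that the retained dimensions are the ones that separate the groups.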
However, De Sarbo et al. (1990) and De Soete & Carroll (1994) warn against this approach, because dimension-reduction methods may identify dimensions that contribute little to revealing the group structure in the data and that may, on the contrary, obscure or mask a group structure that exists in the data. A solution to this problem is a methodology that detects factors and clusters simultaneously. For continuous data, many alternative methods combining cluster analysis with the search for a reduced set of factors have been proposed, focusing on factorial methods, multidimensional scaling or unfolding analysis, and clustering (e.g., Heiser 1993, De Soete & Heiser 1993). De Soete & Carroll (1994) proposed an alternative to the K-means procedure, named reduced K-means (RKM), which turned out to be equivalent to the earlier proposed projection pursuit clustering (PPC) (Bolton & Krzanowski 2012). RKM simultaneously searches for a clustering of objects, based on the K-means criterion (MacQueen 1967), and a dimensionality reduction of the variables, based on principal component analysis (PCA). However, this approach may fail to recover the clustering of objects when the data contain much variance in directions orthogonal to the subspace in which the clusters reside (Timmerman et al. 2010). To solve this problem, Vichi & Kiers (2001) proposed the factorial K-means (FKM) model. FKM combines K-means cluster analysis with PCA, finding the subspace that best represents the clustering structure in the data. In other terms, FKM works in the reduced space and simultaneously searches for the best partition of objects, based on the K-means criterion, and for the best reduced orthogonal space representing it, based on PCA.
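The difference between RKM and FKM can be stated through their least-squares criteria. The abstract itself gives no formulas, so the notation below is an assumption following the usual presentation of these models: X is the n × J centered data matrix, U the n × K binary membership matrix, \bar{Y} the K × Q matrix of centroids in the reduced space, and A the J × Q column-orthonormal loading matrix.

```latex
% Reduced K-means (RKM): approximate the data in the full space
\min_{U,\,\bar{Y},\,A}\; \lVert X - U \bar{Y} A^{\top} \rVert^{2}

% Factorial K-means (FKM): approximate the projected data in the reduced space
\min_{U,\,\bar{Y},\,A}\; \lVert X A - U \bar{Y} \rVert^{2}

% Common constraints: orthonormal loadings, one cluster per object
\text{subject to}\quad A^{\top} A = I_{Q}, \qquad
u_{ik} \in \{0, 1\}, \qquad \sum_{k=1}^{K} u_{ik} = 1 .
```

Under this notation, RKM penalizes all variance of X not reproduced by the centroids, including variance outside the subspace spanned by A, whereas FKM only measures within-cluster deviations of the projected scores XA.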
When categorical variables are observed, TA corresponds to applying first multiple correspondence analysis (MCA) and then K-means clustering on the resulting factors. Hwang et al. (2007) proposed an extension of MCA that takes into account cluster-level heterogeneity in respondents’ preferences/choices. The method combines MCA and K-means in a unified framework: the former uncovers a low-dimensional space of multivariate categorical variables, while the latter identifies relatively homogeneous clusters of respondents. In recent years, the dimensionality reduction problem has also become well known in other statistical contexts, such as structural equation modeling (SEM). In fact, in a wide range of SEM applications, the assumption that data are collected from a single homogeneous population is often unrealistic, and the identification of different groups (clusters) of observations constitutes a critical issue in many fields.
Following this research idea, in this doctoral thesis we review the more recent statistical models used to solve the dimensionality problem discussed above. In particular, in the first chapter we show an application to hyperspectral data classification using the most widely used discriminant functions for the high-dimensionality problem, e.g., partial least squares discriminant analysis (PLS-DA); in the second chapter we present the multiple correspondence K-means (MCKM) model proposed by Fordellone & Vichi (2017), which simultaneously identifies the best partition of the N objects described by the best orthogonal linear combination of categorical variables according to a single objective function; finally, in the third chapter we present the partial least squares structural equation modeling K-means (PLS-SEM-KM) proposed by Fordellone & Vichi (2018), which simultaneously identifies the best partition of the N objects described by the best causal relationship among the latent constructs.
Prototype definition through consensus analysis between fuzzy c-means and archetypal analysis
The general aim of cluster analysis is to build prototypes, or typologies of units that present similar characteristics. In this paper we propose an alternative approach based on consensus analysis of two different clustering methods to suitably obtain prototypes.
The clustering methods used are fuzzy c-means (centre approach) and archetypal analysis (extreme approach). Consensus clustering is used to assess the correspondence between the clustering solutions obtained.
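As a rough illustration of the "centre approach", the sketch below computes the standard fuzzy c-means membership degrees for fixed centres. The function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the toy points are assumptions for illustration; the paper's own settings are not given in the abstract.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Membership of each unit in each cluster for fixed centres.

    Uses the standard update u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1)),
    where d_ik is the Euclidean distance from unit i to centre k.
    """
    # eps avoids division by zero when a unit coincides with a centre.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Toy points (hypothetical): two sit on the centres, one lies halfway between.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
U = fcm_memberships(X, centers)
```

Each row of `U` sums to one, and the unit halfway between the two centres receives equal membership in both clusters.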
Finding groups in structural equation modeling through the partial least squares algorithm
The identification of different homogeneous groups of observations and their appropriate analysis in PLS-SEM has become a critical issue in many application fields. Usually, both SEM and PLS-SEM assume the homogeneity of the units to which the model is applied. The segmentation approaches proposed in the literature consist of estimating separate models for each segment of statistical units, assigning these units to segments defined a priori. These approaches are not fully acceptable because no causal structure is postulated among the variables. In other words, a model-based approach should be used, in which the clusters obtained are homogeneous with respect to both the structural causal relationships and the mean differences between clusters. Therefore, a new methodology is proposed in which non-hierarchical clustering and PLS-SEM are applied simultaneously. This methodology is motivated by the fact that the sequential approach (i.e., first applying SEM or PLS-SEM and then running a clustering algorithm on the latent scores obtained) may fail to find the correct clustering structure of the data. A simulation study and an application to real data are included to evaluate the performance of the proposed methodology.
Comments about the use of PLS path modeling in building a Job Quality Composite Indicator
A composite indicator is formed when elementary indicators are compiled into a single index, on the basis of an underlying model of the multidimensional concept that is being measured. PLS path modeling allows the estimation of composite indicators, and the measurement model can be expressed as either formative or reflective.
In this paper we construct a composite indicator of job quality using the PLS path modeling approach and compare the results obtained by the formative and the reflective measurement models of the general concept. We observe that the two approaches can give different results. Consequently, we give some suggestions for estimating stable and reliable models.
Building Well-Being Composite Indicator for Micro-Territorial Areas Through PLS-SEM and K-Means Approach
In the analysis of differences in the distribution and profiles of equitable and sustainable well-being, the territorial dimension is a fundamental key to interpretation for local policies, since it allows areas of advantage or relative deprivation to emerge more accurately. Specifically, in Italy the provincial level coincides with the administrative area of metropolitan cities, which are the subject of growing attention from European and national policies. The BES 2018 report by the Italian National Institute of Statistics (ISTAT) has confirmed that many areas of well-being have improved since 2015, even if territorial differences remain stable in both levels and dynamics. In some cases these differences appear as real structural differences between the North and South of Italy. Measures of equitable and sustainable well-being in the territories therefore make it possible, to various degrees, to examine and specify this situation through synthetic measures of well-being. In this work, we propose a statistical methodology based on simultaneous partial least squares structural equation modeling and K-means clustering to obtain a composite indicator of Italian well-being and, at the same time, a classification of Italian territorial micro-areas by means of the recently updated provincial data from BES 2018. In this way, the territorial differences in well-being can be defined more reliably and more precisely on the basis of the relationships among all elementary indicators and domains proposed in the analysis of well-being by ISTAT.
From Tandem To Simultaneous Dimensionality Reduction And Clustering Of Tourism Data
The study of tourist demand is a critical component of a successful destination management strategy. In defining tourist segments, many factors play an important role in the decision-making process. Tourism motivations are often used as segmentation bases of the tourism market since they can affect the choice of travel destination, type of holiday and consumer behaviour. A tourist destination offers many experiences and products, which appeal to different market segments. This paper aims to identify a posteriori segments of tourism demand by means of a multidimensional approach employing a simultaneous factorial dimensionality reduction and clustering method. On the basis of the results, tourists are classified into two clusters in order to understand the relationship between motivations and consumer behaviour. In particular, the two observed clusters represent the very satisfied tourists and the tourists who are dissatisfied to varying degrees, respectively. Moreover, in terms of the cost of the holiday, the first group has a higher per capita expenditure than the second group.
Unsupervised Hierarchical Classification Approach for Imprecise Data in the Breast Cancer Detection
(1) Background: in recent years, much research on statistical methods has focused on the classification problem in the presence of imprecise data. A particular case of imprecise data is interval-valued data. Following this research line, in this work a new hierarchical classification technique for multivariate interval-valued data is suggested for the diagnosis of breast cancer; (2) Methods: an unsupervised hierarchical classification method for imprecise multivariate data (called HC-ID) is applied to the diagnosis of breast cancer (i.e., to discriminate between benign and malignant masses) and the results are compared with the conventional (unsupervised) hierarchical classification approach (HC); (3) Results: the application to real data shows that the HC-ID procedure performs better than the HC procedure in terms of accuracy (HC-ID = 0.80, HC = 0.66) and sensitivity (HC-ID = 0.61, HC = 0.08). In the results obtained by the usual procedure, there is a high rate of false negatives (i.e., a benign diagnosis for a malignant mass), affected by the high degree of variability (i.e., uncertainty) characterizing the worst data.
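For reference, the two reported metrics can be computed from confusion-matrix counts as sketched below, taking "malignant" as the positive class, which is consistent with the abstract's description of a false negative. The counts used here are hypothetical and are not the study's data.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all cases that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Share of truly malignant cases that are detected as malignant."""
    return tp / (tp + fn)

# Hypothetical counts for illustration only (not taken from the paper).
tp, fn = 8, 2   # malignant masses: correctly detected / missed
tn, fp = 7, 3   # benign masses: correctly detected / flagged as malignant

acc = accuracy(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
```

A classifier can reach a moderate accuracy while missing most malignant cases, which is why the abstract reports sensitivity alongside accuracy.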
Simultaneous Supervised and Unsupervised Classification Modeling for Assessing Cluster Analysis and Improving Results Interpretability
In the unsupervised classification field, the unknown number of clusters and the lack of inferential tools for assessing and interpreting the final partition are important limitations that can negatively influence the reliability of the final results. In this work, we propose to combine unsupervised classification with supervised methods in order to enhance the assessment and interpretation of the obtained partition. In particular, the approach consists of combining the k-means (KM) clustering method with logistic regression (LR) modeling to obtain an algorithm that allows the partition identified through KM to be evaluated, the correct number of clusters to be assessed, and the selection of the most important variables to be verified. An application to real data is presented to clarify the utility of the proposed approach.
Multiple Correspondence K-Means: Simultaneous Versus Sequential Approach for Dimension Reduction and Clustering
In this work, a discrete model for clustering and a continuous factorial one for dimension reduction are simultaneously fitted to categorical data, with the aim of identifying the best partition of the objects, described by the best orthogonal linear combinations of the factors, according to the least-squares criterion. This new methodology, named multiple correspondence k-means, is a useful alternative to Tandem Analysis in the case of categorical data. The approach thus has a double objective: data reduction and synthesis, simultaneously in the direction of the rows and the columns of the data matrix.
Partial least squares discriminant analysis: A dimensionality reduction method to classify hyperspectral data
The recent development of more sophisticated spectroscopic methods allows acquisition of high dimensional datasets from which valuable information may be extracted using multivariate statistical analyses, such as dimensionality reduction and automatic classification (supervised and unsupervised). In this work, a supervised classification through a partial least squares discriminant analysis (PLS-DA) is performed on the hyperspectral data. The obtained results are compared with those obtained by the most commonly used classification approaches
- …
