Contributed Discussion of Paganin, S., Herring, A. F., Olshan, A. F., Dunson, D. B., and The National Birth Defects Prevention Study: "Centered Partition Processes: Informative Priors for Clustering"
Christian Hennig’s contribution to the Discussion of ‘Testing by betting: A strategy for statistical and scientific communication’ by Glenn Shafer
How many bee species? A case study in determining the number of clusters
It is argued that the determination of the best number of clusters k
is crucially dependent on the aim of clustering. Existing supposedly
“objective” methods of estimating k ignore this. k can be determined
by listing a number of requirements for a good clustering in the given
application and finding a k that fulfils them all. The approach is illustrated
by application to the problem of finding the number of species in a data
set of Australasian Tetragonula bees. Requirements here include two new
statistics formalising the largest within-cluster gap and cluster separation.
Due to the typical nature of expert knowledge, it is difficult to make
requirements precise, and a number of subjective decisions are involved.
Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
There are two notoriously hard problems in cluster analysis: estimating the number of clusters, and checking whether the population to be clustered is actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving several clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed-type data and temporal and spatial autocorrelation.
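The testing idea in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data are one-dimensional, the null model is a single Gaussian fitted to the data, the "clustering method" is the best split of the sorted values into two groups, and the validation index is the within-group sum of squares relative to the total (small values suggest clustering). The function names `two_split_index` and `bootstrap_homogeneity_pvalue` are hypothetical.

```python
import random
import statistics


def two_split_index(values):
    """Within-cluster sum of squares of the best 2-split, relative to the total.

    Small values suggest two well-separated groups (illustrative index only).
    """
    xs = sorted(values)

    def ss(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    total = ss(xs)
    if total == 0:
        return 1.0  # all values identical: no evidence of clustering
    best = min(ss(xs[:cut]) + ss(xs[cut:]) for cut in range(1, len(xs)))
    return best / total


def bootstrap_homogeneity_pvalue(values, n_boot=200, seed=0):
    """Parametric bootstrap p-value for 'homogeneous' against 'clustered'.

    Assumed null model: one Gaussian with mean and standard deviation
    estimated from the observed data.
    """
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    observed = two_split_index(values)
    # Recompute the index on artificial datasets sampled from the fitted null.
    null_stats = [
        two_split_index([rng.gauss(mu, sigma) for _ in values])
        for _ in range(n_boot)
    ]
    # Count null datasets that look at least as clustered as the observed one.
    extreme = sum(1 for s in null_stats if s <= observed)
    return (1 + extreme) / (n_boot + 1)


if __name__ == "__main__":
    clustered = [0.1 * i for i in range(20)] + [10 + 0.1 * i for i in range(20)]
    print(bootstrap_homogeneity_pvalue(clustered))
```

The same null distribution of the index could, in principle, be computed for each candidate number of clusters and used to standardise the observed index, which is the calibration use the abstract mentions; that step is not shown here.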
Clustering with the Average Silhouette Width
The Average Silhouette Width (ASW) is a popular cluster validation index to estimate the number of clusters. This paper addresses the question of whether it is also suitable as a general objective function to be optimized for finding a clustering. Two algorithms (the standard version OSil and a fast version FOSil) are proposed, and they are compared with existing clustering methods in an extensive simulation study covering known and unknown numbers of clusters. Real data sets are analysed, partly exploring the use of the new methods with non-Euclidean distances. The ASW is shown to satisfy some axioms that have been proposed for cluster quality functions. The new methods prove useful and sensible in many cases, but some weaknesses are also highlighted. These also concern the use of the ASW for estimating the number of clusters together with other methods, which is of general interest due to the popularity of the ASW for this task.
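For readers unfamiliar with the index, the following is a minimal sketch of how the ASW of a given clustering can be computed. It is not the OSil or FOSil algorithm from the paper; it only evaluates a fixed labelling. It assumes Euclidean distance and the common convention that a point in a singleton cluster gets silhouette 0; the function name `average_silhouette_width` is hypothetical.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette over all points (Euclidean distance, singletons score 0)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the ASW needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette is 0 by convention
        a = mean_dist(i, own)  # average distance within the own cluster
        b = min(  # average distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(average_silhouette_width(pts, [0, 0, 1, 1]))  # compact, separated clusters
    print(average_silhouette_width(pts, [0, 1, 0, 1]))  # clusters that mix the groups
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.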
Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering
The two main topics of this paper are the introduction of the “optimally tuned improper
maximum likelihood estimator” (OTRIMLE) for robust clustering based on the multivariate
Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE
to Maximum Likelihood in Gaussian mixtures with and without noise component, mixtures of
t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an
improper constant density for modelling outliers and noise. This can be chosen optimally so that
the non-noise part of the data looks as close to a Gaussian mixture as possible. Some deviation
from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance
matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all
methods are confronted with setups in which their model assumptions are not exactly fulfilled,
and in order to evaluate the experiments in a standardized way by misclassification rates, a new
model-based definition of “true clusters” is introduced that deviates from the usual
identification of mixture components with clusters. In the study, every method turns out to be superior
for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance.
The methods are also applied to two real datasets, one without and one with known “true”
clusters.
Quantile-based classifiers
Classification with small samples of high-dimensional data is important in many application areas. Quantile classifiers are distance-based classifiers that require a single parameter, regardless of the dimension, and classify observations according to a sum of weighted componentwise distances of the components of an observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this choice is consistent for the classification rule with the asymptotically optimal quantile and that under some assumptions, as the number of variables goes to infinity, the probability of correct classification converges to unity. The effect of skewness of the distributions of the predictor variables is discussed. The optimal quantile classifier gives low misclassification rates in a comprehensive simulation study and in a real-data application.
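The classification rule described in this abstract can be sketched roughly as follows. Several choices are assumptions made for illustration and are not stated in the abstract: the componentwise distance is taken to be the asymmetric "check" loss known from quantile regression, all variable weights are set to 1, and empirical quantiles use linear interpolation. The function names are hypothetical.

```python
def quantile(values, theta):
    """Empirical theta-quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = theta * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def check_loss(u, theta):
    """Asymmetric distance (assumed here): theta*u for u >= 0, (theta-1)*u otherwise."""
    return u * (theta - (1.0 if u < 0 else 0.0))


def fit_quantiles(X, y, theta):
    """Per-class, per-variable theta-quantiles of the training sample."""
    dims = range(len(X[0]))
    return {
        c: [quantile([x[j] for x, lab in zip(X, y) if lab == c], theta) for j in dims]
        for c in sorted(set(y))
    }


def predict(model, x, theta):
    """Assign x to the class whose quantile vector is closest in summed check loss."""
    def score(c):
        return sum(check_loss(xj - qj, theta) for xj, qj in zip(x, model[c]))
    return min(model, key=score)


def choose_theta(X, y, grid):
    """Pick the quantile level with the lowest training misclassification rate."""
    def train_error(theta):
        model = fit_quantiles(X, y, theta)
        return sum(predict(model, x, theta) != lab for x, lab in zip(X, y)) / len(y)
    return min(grid, key=train_error)


if __name__ == "__main__":
    X = [(0, 0), (1, 1), (2, 0), (10, 10), (11, 9), (12, 11)]
    y = [0, 0, 0, 1, 1, 1]
    theta = choose_theta(X, y, [0.1, 0.3, 0.5, 0.7, 0.9])
    model = fit_quantiles(X, y, theta)
    print(theta, predict(model, (1.5, 0.5), theta))
```

Note that only one tuning parameter, the quantile level, is selected, however many variables the data have.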
An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture-based clustering
We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto & Hennig, Journal of the American Statistical Association 111, 1648–1659) of a Gaussian mixture model allowing for observations to be classified as ‘noise’, but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic Q that measures how close the within-cluster distributions are to elliptical unimodal distributions that have their only mode at the mean. This non-parametric measure allows for non-Gaussian clusters as long as they have a good quality according to Q. The simplicity of a model is assessed by a measure S that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of Q is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian information criterion (BIC) and the integrated complete likelihood (ICL) in a simulation study and on two real data sets.
