Hennig, C.

Author: Hennig C.
Publication date: 17/03/2016

Universitas Maritim Raja Ali Haji Pusat Jurnal Ilmiah

Contributed Discussion of Paganin, S., Herring, A. F. , Olshan, A. F. , Dunson, D. B., and The National Birth Defects Prevention Study: "Centered Partition Processes: Informative Priors for Clustering"

Author: Hennig C
Publication date: 01/01/2021

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Ten Great Ideas about Chance

Author: Hennig C
Publication date: 01/01/2020

Book review - no abstract

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Christian Hennig’s contribution to the Discussion of ‘Testing by betting: A strategy for statistical and scientific communication’ by Glenn Shafer

Author: Hennig C.
Publication date: 01/01/2021

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

How many bee species? a case study in determining the number of clusters

Author: Hennig C
Publication date: 10/10/2013

It is argued that the determination of the best number of clusters k is crucially dependent on the aim of clustering. Existing supposedly “objective” methods of estimating k ignore this. k can be determined by listing a number of requirements for a good clustering in the given application and finding a k that fulfils them all. The approach is illustrated by application to the problem of finding the number of species in a data set of Australasian Tetragonula bees. Requirements here include two new statistics formalising the largest within-cluster gap and cluster separation. Due to the typical nature of expert knowledge, it is difficult to make requirements precise, and a number of subjective decisions are involved.

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

Author: Hennig C
Lin C-J
Publication date: 01/01/2015

There are two notoriously hard problems in cluster analysis: estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving various clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed-type data, temporal and spatial autocorrelation.
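The testing step described in this abstract can be illustrated with a minimal sketch. It is not the paper's procedure: the null model here is assumed to be a single Gaussian fitted to one-dimensional data, and the validation index (`best_two_split_index`) is a hypothetical stand-in, namely the share of variance explained by the best split into two clusters. All function names are illustrative assumptions.

```python
import random
from statistics import mean, pstdev


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)


def best_two_split_index(values):
    """Hypothetical validation index: share of variance explained by
    the best split of sorted 1-D data into two contiguous clusters."""
    xs = sorted(values)
    total = sum_of_squares(xs)
    if total == 0:
        return 0.0  # all values identical: no structure to explain
    best_within = min(
        sum_of_squares(xs[:cut]) + sum_of_squares(xs[cut:])
        for cut in range(1, len(xs))
    )
    return 1.0 - best_within / total


def bootstrap_homogeneity_test(values, index, n_boot=199, seed=0):
    """Compare the observed index with its distribution on datasets
    simulated from a fitted homogeneous (single Gaussian) null model."""
    rng = random.Random(seed)
    mu, sigma = mean(values), pstdev(values)  # null parameters from the data
    observed = index(values)
    null_values = [
        index([rng.gauss(mu, sigma) for _ in values]) for _ in range(n_boot)
    ]
    # Monte Carlo p-value: large index values speak against homogeneity.
    exceed = sum(v >= observed for v in null_values)
    return observed, (1 + exceed) / (n_boot + 1)


data = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
observed, p_value = bootstrap_homogeneity_test(data, best_two_split_index)
```

The same simulated index values could in principle be used to standardise the observed index separately for each candidate number of clusters, which is the calibration idea the abstract mentions; that part is not shown here.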

UCL Discovery

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Apollo (Cambridge)

Clustering with the Average Silhouette Width

Author: Batool F.
Hennig C.
Publication date: 01/01/2021

The Average Silhouette Width (ASW) is a popular cluster validation index used to estimate the number of clusters. The question of whether it is also suitable as a general objective function to be optimized for finding a clustering is addressed. Two algorithms (the standard version OSil and a fast version FOSil) are proposed, and they are compared with existing clustering methods in an extensive simulation study covering known and unknown numbers of clusters. Real data sets are analysed, partly exploring the use of the new methods with non-Euclidean distances. The ASW is shown to satisfy some axioms that have been proposed for cluster quality functions. The new methods prove useful and sensible in many cases, but some weaknesses are also highlighted. These also concern the use of the ASW for estimating the number of clusters together with other methods, which is of general interest due to the popularity of the ASW for this task.
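For readers unfamiliar with the index, the following minimal sketch computes the ASW of a given clustering. It uses the usual silhouette definition with Euclidean distance and assumes the common convention that a point in a singleton cluster has silhouette 0. It only evaluates a clustering; it is not an implementation of OSil or FOSil, and the function name is an illustrative assumption.

```python
from math import dist


def average_silhouette_width(points, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the ASW needs at least two clusters")

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # assumed convention for singletons
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scale = max(a, b)
        widths.append((b - a) / scale if scale > 0 else 0.0)
    return sum(widths) / len(widths)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
asw_good = average_silhouette_width(points, [0, 0, 1, 1])
asw_bad = average_silhouette_width(points, [0, 1, 0, 1])
```

Here the grouping that separates the two distant pairs obtains a clearly higher ASW than the grouping that mixes them, which is the property that makes the index usable as an objective function.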

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Robust Improper Maximum Likelihood: Tuning, Computation, and a Comparison With Other Methods for Robust Gaussian Clustering

Author: Hennig C
Coretto P
Publication date: 01/01/2016

The two main topics of this paper are the introduction of the “optimally tuned improper maximum likelihood estimator” (OTRIMLE) for robust clustering based on the multivariate Gaussian model for clusters, and a comprehensive simulation study comparing the OTRIMLE to Maximum Likelihood in Gaussian mixtures with and without noise component, mixtures of t-distributions, and the TCLUST approach for trimmed clustering. The OTRIMLE uses an improper constant density for modelling outliers and noise. This can be chosen optimally so that the non-noise part of the data looks as close to a Gaussian mixture as possible. Some deviation from Gaussianity can be traded in for lowering the estimated noise proportion. Covariance matrix constraints and computation of the OTRIMLE are also treated. In the simulation study, all methods are confronted with setups in which their model assumptions are not exactly fulfilled, and in order to evaluate the experiments in a standardized way by misclassification rates, a new model-based definition of “true clusters” is introduced that deviates from the usual identification of mixture components with clusters. In the study, every method turns out to be superior for one or more setups, but the OTRIMLE achieves the most satisfactory overall performance. The methods are also applied to two real datasets, one without and one with known “true” clusters.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Quantile-based classifiers

Author: Viroli C.
Hennig C.
Publication date: 01/01/2016

Classification with small samples of high-dimensional data is important in many application areas. Quantile classifiers are distance-based classifiers that require a single parameter, regardless of the dimension, and classify observations according to a sum of weighted componentwise distances of the components of an observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this choice is consistent for the classification rule with the asymptotically optimal quantile and that under some assumptions, as the number of variables goes to infinity, the probability of correct classification converges to unity. The effect of skewness of the distributions of the predictor variables is discussed. The optimal quantile classifier gives low misclassification rates in a comprehensive simulation study and in a real-data application.
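The classification rule summarised in this abstract can be sketched as follows. The abstract does not spell out the weighting, so this sketch assumes the asymmetric check loss from quantile regression as the componentwise distance, and a simple order-statistic estimate of the within-class quantiles. The function names and the grid of candidate percentages are illustrative assumptions.

```python
from math import ceil


def empirical_quantile(values, theta):
    """Order-statistic estimate of the theta-quantile (assumed estimator)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, ceil(theta * len(ordered)) - 1))
    return ordered[idx]


def check_distance(z, q, theta):
    """Assumed weighting: theta above the quantile, 1 - theta below it."""
    return theta * (z - q) if z >= q else (1 - theta) * (q - z)


def fit_quantiles(X, y, theta):
    """Theta-quantile of every variable within every class."""
    n_vars = len(X[0])
    return {
        c: [
            empirical_quantile([x[j] for x, lab in zip(X, y) if lab == c], theta)
            for j in range(n_vars)
        ]
        for c in sorted(set(y))
    }


def predict(x, class_quantiles, theta):
    """Assign x to the class with the smallest summed componentwise distance."""
    def total(c):
        return sum(
            check_distance(z, q, theta) for z, q in zip(x, class_quantiles[c])
        )
    return min(class_quantiles, key=total)


def choose_theta(X, y, grid):
    """Pick the percentage that minimises the training misclassification rate."""
    def train_error(theta):
        cq = fit_quantiles(X, y, theta)
        return sum(predict(x, cq, theta) != lab for x, lab in zip(X, y)) / len(y)
    return min(grid, key=train_error)


X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
y = ["a", "a", "a", "b", "b", "b"]
grid = [0.1, 0.25, 0.5, 0.75, 0.9]
theta = choose_theta(X, y, grid)
medians = fit_quantiles(X, y, 0.5)
```

Whatever the dimension of the data, the only tuning parameter is the single percentage `theta`, which matches the abstract's point that the method needs one parameter regardless of the number of variables.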

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture-based clustering

Author: Coretto P.
Hennig C.
Publication date: 01/01/2022

We introduce a new approach to deciding the number of clusters. The approach is applied to Optimally Tuned Robust Improper Maximum Likelihood Estimation (OTRIMLE; Coretto & Hennig, Journal of the American Statistical Association 111, 1648–1659) of a Gaussian mixture model allowing for observations to be classified as ‘noise’, but it can be applied to other clustering methods as well. The quality of a clustering is assessed by a statistic Q that measures how close the within-cluster distributions are to elliptical unimodal distributions that have their only mode at the mean. This non-parametric measure allows for non-Gaussian clusters as long as they have a good quality according to Q. The simplicity of a model is assessed by a measure S that prefers a smaller number of clusters unless additional clusters can reduce the estimated noise proportion substantially. The simplest model is then chosen that is adequate for the data in the sense that its observed value of Q is not significantly larger than what is expected for data truly generated from the fitted model, as can be assessed by parametric bootstrap. The approach is compared with model-based clustering using the Bayesian information criterion (BIC) and the integrated complete likelihood (ICL) in a simulation study and on two real data sets.

Archivio della Ricerca - Università di Salerno