Robust inference for parsimonious model-based clustering
We introduce a robust clustering procedure for parsimonious model-based clustering. The classical mclust framework is robustified through impartial trimming and eigenvalue-ratio constraints (the tclust framework, which is robust but not affine invariant). An advantage of our resulting mtclust approach is that eigenvalue-ratio constraints are not needed for certain model formulations, leading to affine invariant robust parsimonious clustering. We illustrate the approach via simulations and a benchmark real data example. R code for the proposed method is available at https://github.com/afarcome/mtclust
Advances in robust clustering methods with applications
Robust methods in statistics are mainly concerned with deviations from model assumptions.
As already pointed out in Huber (1981) and in Huber & Ronchetti
(2009), "these assumptions are not exactly true since they are just a mathematically
convenient rationalization of an often fuzzy knowledge or belief". For that reason "a
minor error in the mathematical model should cause only a small error in the final
conclusions". Nevertheless, it is well known that many classical statistical procedures
are "excessively sensitive to seemingly minor deviations from the assumptions".
All statistical methods based on minimizing the average squared loss may
suffer from a lack of robustness. Illustrative examples of how the influence of outliers may
completely alter the final results in regression analysis and linear models are
provided in Atkinson & Riani (2012). Robust counterparts of the classical multivariate
tools are presented in Farcomeni & Greco (2015).
The whole dissertation is focused on robust clustering models and the outline of the
thesis is as follows.
Chapter 1 is focused on robust methods. Robust methods are aimed at increasing
the efficiency when contamination appears in the sample. A definition
of this quite general concept is therefore required. To do so we give a brief account of
some kinds of contamination we can encounter in real data applications. Secondly
we introduce the "Spurious outliers model" (Gallegos & Ritter 2009a), which is the
cornerstone of robust model-based clustering. This model is aimed at
formalizing clustering problems in which one has to deal with contaminated samples.
The assumption behind the "Spurious outliers model" is that two different
random mechanisms generate the data: one generates the "clean"
part while the other generates the contamination. This idea is very
common within robust models such as the "Tukey-Huber model", which is introduced in
Subsection 1.2.2. Outlier recognition plays a key role, especially in the multivariate
case, and becomes less straightforward as the dimensionality of the data increases. An
overview of the most widely used (robust) methods for outlier detection is provided
in Section 1.3. Finally, in Section 1.4, we provide a non-technical review of the
classical tools introduced in the robust statistics literature for evaluating the robustness properties of a methodology.
Chapter 2 is focused on model-based clustering methods and their robustness properties.
Cluster analysis, "the art of finding groups in the data" (Kaufman & Rousseeuw
1990), is one of the most widely used tools in unsupervised learning.
A very popular method is the k-means algorithm (MacQueen et al. 1967), which is
based on minimizing the Euclidean distance of each observation from the estimated
cluster centroids and is therefore affected by a lack of robustness. Indeed, even a
single outlying observation may completely alter the estimated centroids and simultaneously
bias the estimated standard errors. Cluster contours may be
inflated and the "real" underlying clusterwise structure might be completely hidden.
A first attempt at robustifying the k-means algorithm appeared in Cuesta-Albertos
et al. (1997), where a trimming step is inserted in the algorithm in order to avoid
the excessive influence of outliers.
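The trimming idea described above can be illustrated with a minimal Python sketch. It is not the algorithm of Cuesta-Albertos et al. (1997) as published: the function names, the explicit `init_centroids` argument, the rounding of the number of trimmed points with `ceil`, and the toy data are all assumptions made for illustration. At each iteration the points farthest from their closest centroid are left out of the centroid update.

```python
import numpy as np

def assign(X, centroids, n_trim):
    """Label each point by its closest centroid and flag the n_trim farthest points."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    nearest = d2.min(axis=1)
    keep = np.ones(len(X), dtype=bool)
    # The n_trim points with the largest distance to their centroid are trimmed.
    keep[np.argsort(nearest)[len(X) - n_trim:]] = False
    return labels, keep

def trimmed_kmeans(X, init_centroids, alpha, n_iter=100):
    """k-means in which a proportion alpha of points is trimmed at every step."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    n_trim = int(np.ceil(alpha * len(X)))  # assumed rounding rule
    for _ in range(n_iter):
        labels, keep = assign(X, centroids, n_trim)
        updated = centroids.copy()
        for j in range(len(centroids)):
            members = keep & (labels == j)
            if members.any():
                # Only non-trimmed points contribute to the centroid.
                updated[j] = X[members].mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    labels, keep = assign(X, centroids, n_trim)
    return centroids, np.where(keep, labels, -1)  # -1 marks trimmed points

# Hypothetical toy data: two tight groups plus one gross outlier.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [50, 50]], dtype=float)
trimmed_centroids, trimmed_labels = trimmed_kmeans(X, X[[0, 4]], alpha=0.1)
plain_centroids, plain_labels = trimmed_kmeans(X, X[[0, 4]], alpha=0.0)
print(trimmed_centroids)  # second centroid stays at the centre of its group
print(plain_centroids)    # second centroid is dragged towards the outlier
```

With `alpha=0.0` the sketch reduces to ordinary k-means, so the two calls show the effect of a single outlier with and without trimming on the same starting centroids.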
It should be noted that the k-means algorithm is efficient for detecting spherical homoscedastic
clusters. Whenever more flexible shapes are desired the procedure becomes
inefficient. To overcome this problem, Gaussian model-based clustering
methods should be adopted instead of the k-means algorithm. An example, among
the other proposals described in Chapter 2, is the TCLUST methodology
(García-Escudero et al. 2008), which is the cornerstone of the thesis. This methodology is
based on two main characteristics: trimming a fixed proportion of observations and
imposing a constraint on the estimates of the scatter matrices. As will be explained
in Chapter 2, trimming is used to protect the results from the influence of outliers,
while the constraint is needed because spurious maximizers may completely spoil the
solution.
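The constraint on the scatter matrices can be illustrated as follows. The sketch assumes the constraint takes the eigenvalue-ratio form mentioned later in this abstract: the ratio between the largest and the smallest eigenvalue, taken over all cluster scatter matrices, must not exceed a constant `c`. The function names and the example matrices are hypothetical, and the sketch only checks the constraint; it does not enforce it during estimation.

```python
import numpy as np

def eigenvalue_ratio(scatter_matrices):
    """Largest over smallest eigenvalue, pooled over all cluster scatter matrices."""
    eigenvalues = np.concatenate(
        [np.linalg.eigvalsh(S) for S in scatter_matrices]
    )
    return eigenvalues.max() / eigenvalues.min()

def satisfies_constraint(scatter_matrices, c):
    """True when the pooled eigenvalue ratio does not exceed c."""
    return bool(eigenvalue_ratio(scatter_matrices) <= c)

# Hypothetical scatter matrices for two well-behaved clusters.
regular = [np.diag([1.0, 4.0]), np.diag([0.5, 2.0])]
# One cluster collapsing onto a line: the kind of near-degenerate fit
# that a bound on the ratio is meant to rule out.
degenerate = [np.diag([1.0, 1.0]), np.diag([1.0, 1e-8])]

print(eigenvalue_ratio(regular))
print(satisfies_constraint(regular, c=10.0))
print(satisfies_constraint(degenerate, c=10.0))
```

A very small eigenvalue means a cluster fitted to almost collinear points, which can make the likelihood arbitrarily large; bounding the ratio excludes such solutions at the cost of one extra tuning parameter.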
Chapters 3 and 4 are mainly focused on extending the TCLUST methodology.
In particular, in Chapter 3, we introduce a new contribution (compare Dotto et al.
2015 and Dotto et al. 2016b), based on the TCLUST approach, called reweighted
TCLUST, or RTCLUST for the sake of brevity. The idea behind this
method is to reweight the observations initially flagged as outlying. This
helps both to gain efficiency in parameter estimation and to provide
a reliable estimate of the true contamination level. Indeed, as TCLUST
is based on trimming a fixed proportion of observations, a proper choice of the
trimming level is required. Such a choice, especially in applications, can be cumbersome.
As will be clarified later on, the RTCLUST methodology allows the user to
overcome this problem. Indeed, in the RTCLUST approach the user is only required
to impose a high preventive trimming level. The procedure, by iterating through a
sequence of decreasing trimming levels, reinserts discarded observations
at each step and provides a more precise estimation of the parameters and a final estimate of the true contamination level.
The theoretical properties of the methodology are studied in Section 3.6 and proved
in Appendix A.1, while Section 3.7 contains a simulation study aimed at evaluating
the properties of the methodology and its advantages with respect to some other
robust procedures (reweighted and single-step).
Chapter 4 contains an extension of the TCLUST method to fuzzy linear clustering
(Dotto et al. 2016a). This contribution can be viewed as the extension of
Fritz et al. (2013a) to linear clustering problems or, equivalently, as the extension
of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010) to the fuzzy
clustering framework. Fuzzy clustering is also useful to deal with contamination.
Fuzziness is introduced to deal with overlap between clusters and the presence
of bridge points, to be defined in Section 1.1. Indeed, bridge points may arise when
clusters overlap and may completely alter the estimated cluster
parameters (i.e. the coefficients of a linear model in each cluster). By introducing
fuzziness such observations are suitably downweighted and the clusterwise structure
can be correctly detected. On the other hand, robustness against gross outliers,
as in the TCLUST methodology, is guaranteed by trimming a fixed proportion of
observations. A simulation study comparing the proposed
methodology with other proposals (both robust and non-robust) is provided in
Section 4.4.
Chapter 5 is entirely dedicated to real data applications of the proposed contributions.
In particular, the RTCLUST method is applied to two different datasets. The
first is the "Swiss Bank Note" dataset, a well-known benchmark dataset for clustering
models; the second is a dataset collected by the Gallup Organization, which is, to our
knowledge, an original dataset to which no other existing proposal has yet been
applied. Section 5.3 contains an application of our fuzzy linear clustering proposal
to allometry data. In our opinion this dataset, already considered in the robust
linear clustering proposal of García-Escudero, Gordaliza, Mayo-Iscar &
San Martín (2010), is particularly useful to show the advantages of our proposed
methodology. Indeed, allometric quantities are often linked by a linear relationship
but, at the same time, there may be overlap between different groups, and outliers
may often appear due to errors in data registration.
Finally, Chapter 6 contains the concluding remarks and further directions of
research. In particular we wish to mention an ongoing work (Dotto & Farcomeni,
In preparation) in which we consider the possibility of implementing robust parsimonious
Gaussian clustering models. Within the chapter, the algorithm is briefly
described and some illustrative examples are also provided. The potential advantages
of such proposals are the following. First of all, by considering the parsimonious
models introduced in Celeux & Govaert (1995), the user is able to impose the shape of the detected clusters, which often plays a key role in applications.
Secondly, by constraining the shape of the detected clusters, the constraint on the
eigenvalue ratio can be avoided. This removes a tuning parameter of
the procedure and, at the same time, allows the user to obtain affine equivariant estimators.
Finally, since trimming a fixed proportion of observations
is allowed, the procedure is also formally robust.
The power of (extended) monitoring in robust clustering
We complement the work of Cerioli, Riani, Atkinson and Corbellini
by discussing monitoring in the context of robust clustering. This implies extending the approach to clustering, and possibly monitoring more than one
parameter simultaneously. The cases of trimming and snipping are discussed
separately, and special attention is given to recently proposed methods like
double clustering, reweighting in robust clustering, and fuzzy regression clustering.
Statistical Analyses in the case of an Italian nurse accused of murdering patients
Suspicions about medical murder sometimes arise due to a surprising or
unexpected series of events, such as an apparently unusual number of deaths
among patients under the care of a particular nurse. But also a single
disturbing event might trigger suspicion about a particular nurse, and this
might then lead to investigation of events which happened when she was thought
to be present. In either case, there is a statistical challenge of
distinguishing event clusters that arise from criminal acts from those that
arise coincidentally from other causes. We show that an apparently striking
association between a nurse's presence and a high rate of deaths in a hospital
ward can easily be completely spurious. In short: in a medium-care hospital
ward where many patients are suffering terminal illnesses, and deaths are
frequent, most deaths occur in the morning. Most nurses are on duty in the
morning, too. There are fewer deaths in the afternoon, and even fewer at night;
correspondingly, fewer nurses are on duty in the afternoon, even fewer during the
night. Consequently, a full time nurse works the most hours when the most
deaths occur. The death rate is higher when she is present than when she is
absent. Comment: Published in "Law, Probability and Risk". 33 pages
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
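The difference between the two counting schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function name, the data layout (each citing paper is a list of cited works, each cited work a list of authors in byline order), the rule that a pair is counted once per citing paper, and the author names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors):
    """Count author pairs that appear together in a citing paper's references.

    Only the first max_authors authors of each cited work are considered:
    max_authors=1 gives first-author counting, max_authors=5 the
    first-five-authors variant.
    """
    counts = Counter()
    for cited_works in citing_papers:
        authors = set()
        for work_authors in cited_works:
            authors.update(work_authors[:max_authors])
        # Each unordered pair is counted once per citing paper (assumption).
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts

# Hypothetical reference lists of two citing papers.
citing_papers = [
    [["Smith", "Lee"], ["Garcia"]],
    [["Smith"], ["Garcia", "Lee"]],
]
first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
print(first_author)
print(first_five)
```

In this toy input the author "Lee" is never a first author, so first-author counting never links "Lee" to anyone, while the first-five variant does.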
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance. Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
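One way to see how the Pearson correlation and the cosine can disagree on cocitation profiles is the small Python sketch below. The abstract does not say which objection to the Pearson correlation the authors raise, so the property shown here (sensitivity to shared zero entries) is an assumption chosen for illustration, and the two profiles are invented numbers. The two further measures mentioned in the abstract are not named there and are not reproduced.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two cocitation profiles."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(u, v):
    """Pearson correlation between two cocitation profiles."""
    return float(np.corrcoef(u, v)[0, 1])

# Hypothetical cocitation counts of two authors with four other authors.
a = np.array([3.0, 1.0, 0.0, 2.0])
b = np.array([1.0, 3.0, 2.0, 0.0])

# The same two profiles after adding six authors cocited with neither.
a_padded = np.concatenate([a, np.zeros(6)])
b_padded = np.concatenate([b, np.zeros(6)])

print(cosine(a, b), cosine(a_padded, b_padded))
print(pearson(a, b), pearson(a_padded, b_padded))
```

In this toy case the cosine is unchanged by the added zeros, whereas the Pearson correlation moves from negative to positive even though no new information about the two authors' relationship was added.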
