
    Robust variable selection for model-based learning in presence of adulteration

    The problem of identifying the most discriminating features when performing supervised learning has been extensively investigated. In particular, several methods for variable selection have been proposed in model-based classification. The impact of outliers and wrongly labeled units on the determination of relevant predictors has instead received far less attention, with almost no dedicated methodologies available. Two robust variable selection approaches are introduced: one that embeds a robust classifier within a greedy-forward selection procedure and the other based on the theory of maximum likelihood estimation and irrelevance. The former recasts the feature identification as a model selection problem, while the latter regards the relevant subset as a model parameter to be estimated. The benefits of the proposed methods, in contrast with non-robust solutions, are assessed via an experiment on synthetic data. An application to a high-dimensional classification problem of contaminated spectroscopic data is presented
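The first approach, a classifier embedded in a greedy-forward search, can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the classifier or the selection criterion, so this example assumes a nearest-centroid rule built on per-class medians as a crude robust stand-in, and uses hold-out accuracy as the score. All function names and the toy data are hypothetical.

```python
from statistics import median


def median_centroids(X, y, features):
    """Per-class coordinate-wise medians on the chosen features (assumed robust stand-in)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [median(r[j] for r in rows) for j in features]
    return centroids


def predict(centroids, x, features):
    """Assign x to the class whose median centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def accuracy(X_train, y_train, X_val, y_val, features):
    """Hold-out accuracy of the median-centroid rule restricted to `features`."""
    centroids = median_centroids(X_train, y_train, features)
    hits = sum(predict(centroids, x, features) == lab for x, lab in zip(X_val, y_val))
    return hits / len(y_val)


def greedy_forward_selection(X_train, y_train, X_val, y_val):
    """Add one feature at a time; stop when no candidate improves the score."""
    n_features = len(X_train[0])
    selected, best_score = [], 0.0
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scored = [(accuracy(X_train, y_train, X_val, y_val, selected + [j]), j)
                  for j in candidates]
        # Highest score wins; ties go to the lowest feature index.
        score, j = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break
        selected.append(j)
        best_score = score
    return selected, best_score


# Toy data: feature 0 separates the classes, feature 1 is noise,
# and the last "a" unit is a gross outlier in feature 0.
X_train = [[0.0, 5.0], [0.2, 1.0], [0.1, 9.0], [50.0, 3.0],
           [2.0, 4.0], [2.1, 8.0], [1.9, 2.0], [2.2, 6.0]]
y_train = ["a", "a", "a", "a", "b", "b", "b", "b"]
X_val = [[0.1, 7.0], [0.3, 2.0], [2.0, 3.0], [1.8, 9.0]]
y_val = ["a", "a", "b", "b"]

selected, score = greedy_forward_selection(X_train, y_train, X_val, y_val)
```

On this toy data the search keeps only feature 0. Because the class centroids are medians, the outlying training unit does not pull the class "a" centroid away from the bulk of the data, which is the kind of behavior a robust classifier is meant to provide inside the selection loop.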

    A two-stage Bayesian semiparametric model for novelty detection with robust prior information

    Novelty detection methods aim at partitioning the test units into already observed and previously unseen patterns. However, two significant issues arise: there may be considerable interest in identifying specific structures within the novelty, and contamination in the known classes could completely blur the actual separation between manifest and new groups. Motivated by these problems, we propose a two-stage Bayesian semiparametric novelty detector, building upon prior information robustly extracted from a set of complete learning units. We devise a general-purpose multivariate methodology that we also extend to handle functional data objects. We provide insights on the model behavior by investigating the theoretical properties of the associated semiparametric prior. From the computational point of view, we propose a suitable ξ-sequence to construct an independent slice-efficient sampler that takes into account the difference between manifest and novelty components. We showcase our model performance through an extensive simulation study and applications on both multivariate and functional datasets, in which diverse and distinctive unknown patterns are discovered

    CLADAG 2019 Special Issue: Selected Papers on Classification and Data Analysis (editorial)

    This special issue of Statistical Analysis and Data Mining collects papers presented at the 12th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS), held in Cassino, Italy, September 11–13, 2019. The CLADAG group, founded in 1997, promotes advanced methodological research in multivariate statistics with a special vocation in Data Analysis and Classification. CLADAG is a member of the International Federation of Classification Societies (IFCS). It organizes a biennial international scientific meeting and schools related to classification and data analysis, publishes a newsletter, and cooperates with other member societies of the IFCS in organizing their conferences. Founded in 1985, the IFCS is a federation of national, regional, and linguistically-based classification societies aimed at promoting classification research. Previous CLADAG meetings were held in Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), and Milano (2017). Best papers from the conference have been submitted to this special issue, and five of them have been selected for publication, following a blind peer-review process. The manuscripts deal with different data analysis issues: mixtures of distributions, compositional data analysis, Markov chains for web usability, survival analysis, and applications to high-throughput, eye-tracking, and insurance transaction data. The paper by S.X. Lee et al. proposes a parallelization strategy for the Expectation-Maximization (EM) algorithm, with a special focus on the estimation of finite mixtures of flexible distributions such as the canonical fundamental skew t distribution (CFUST). The parallel implementation of the EM algorithm is suitable for single-threaded and multi-threaded processors as well as for single-machine and multiple-node systems.
    The EM algorithm is also discussed in the paper by L. Scrucca, which provides a fast and efficient Modal EM algorithm for identifying the modes of a density estimated through a finite mixture of Gaussian distributions with parsimonious component covariance structures. The proposed approach is based on an iterative procedure aimed at identifying the local maxima, exploiting features of the underlying Gaussian mixture model. Motivated by applications in high-throughput compositional data analysis, the paper by N. Štefelová et al. proposes a data-driven weighting strategy to enhance marker identification through PLS regression with compositional predictors. The weighting strategy draws on the correlation structure between the response variable and pairwise log-ratios. Its practical relevance is illustrated through an analysis of metabolite signals associated with the emission of greenhouse gases from cattle. The paper by G. Zammarchi et al. uses Markov chains to analyse the usability of a university website by means of eye-tracking methodology. With the aim of improving its usability, the paper compares the performance of high school and university students in terms of time to completion, number of fixations and difficulty ratio across ten different tasks. Data from a commercial insurance company in the Czech Republic are instead used by D. Zapletal to compare the efficacy of some survival analysis models within an insurance transaction framework. The ability to identify relevant explanatory variables through the Cox proportional hazards model and some competing risk models (i.e., the cause-specific and the sub-distribution hazard models) is assessed on a large data set consisting of more than 200 thousand individuals. In brief, this special issue is in line with the CLADAG goal of supporting the interchange of ideas in Classification and Data Analysis.
    We strongly believe it well represents the scientific characteristics of the CLADAG community, and we invite all readers to join the next CLADAG conference, which will be held in Florence, September 11 to 13, 2021
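The mode-finding idea mentioned for the Modal EM paper can be sketched in the simplest setting. The editorial gives no formulas, so what follows is the textbook fixed-point form of modal EM for a univariate Gaussian mixture with known parameters, not necessarily the algorithm of the paper itself. The function names and the example mixture are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def modal_em(x, weights, means, sds, tol=1e-10, max_iter=1000):
    """Climb from the starting point x to a nearby stationary point of the mixture density."""
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current point.
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(dens)
        post = [d / total for d in dens]
        # M-step: maximise sum_k post_k * log N(x | m_k, s_k) over x,
        # which gives a precision-weighted average of the component means.
        num = sum(p * m / s ** 2 for p, m, s in zip(post, means, sds))
        den = sum(p / s ** 2 for p, s in zip(post, sds))
        x_new = num / den
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical two-component mixture: 0.5 * N(-3, 1) + 0.5 * N(3, 1).
weights, means, sds = [0.5, 0.5], [-3.0, 3.0], [1.0, 1.0]
left_mode = modal_em(-2.0, weights, means, sds)
right_mode = modal_em(2.0, weights, means, sds)
```

Starting the iteration from each observed data point and collecting the distinct limits is one common way to enumerate the modes. Note that a start exactly at a stationary point that is not a maximum (here x = 0) stays there, so in practice the limits should be checked.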

    La muta del serpente bianco

    The Legend of the White Snake is one of the four Chinese legends about love, which spread among the common people in ancient China. Through oral tradition and various forms of conventional literature, this tale became a classical theme of different artistic forms (ballads, precious scrolls, novels, scripts for story-telling, dramas, Beijing opera), and from the 20th century it became a theme of films, TV dramas, cartoons, and comic strips. The analysis of five movies on the legend, filmed between 1927 and 2011, confirmed the important role that repetition plays in commercial movies and, through comparison with the old literary versions, identified a link between the narrative techniques of oral literature and those of cinema entertainment. This confirms that popular literature and films now perform functions once typical of oral literature

    Romanzo ed educazione alla storia. Scritti sul romanzo storico nel quinquennio 1902-1906

    In the early years of the twentieth century, the discourse on the nature and role of xiaoshuo in the Chinese literary system was partly characterized by the need to elaborate detailed narrative taxonomies comparable to those adopted in the West and in Japan. Among the categories that entered the new lexicon was that of «historical novel» (lishi xiaoshuo), a new critical idiom used to designate forms of fictional narrative based on historical facts. The analysis of different uses of this category label in commercial and critical writings from the period between 1902 and 1906 reveals how much the definition of this narrative subgenre was still interwoven with the traditional historiographical paradigm, both in terms of narrative theme and sociological function. As a consequence, historical fiction was mainly valued for its role in popularizing historical literacy. At the same time, however, the new cultural role assigned to fiction allowed the gradual emergence of the reciprocal functionality of history and fiction, particularly in the writings of Wu Jianren (1866-1910)

    Co-authorship Network in Statistics: methodological issues and empirical results

    The present contribution discusses some issues in the analysis of co-authorship networks and provides empirical results from a case study. Two main methodological issues are the heterogeneity of the bibliographic archives available for collecting collaboration data, and the disambiguation problem that must be solved to correctly identify the authors of each paper. Within this scenario, we are interested in applying community detection algorithms to discover groups, and in analyzing the changes in the groups' structure over time as a result of the first research assessment exercise attempted in Italy in the period 2004-2010. The results on Italian academic statisticians and their co-authorship relationships provide fertile ground for reflection
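To make the idea of community detection on a co-authorship network concrete, here is a small sketch. The abstract does not say which algorithm the authors use, so this example assumes a simple weighted label-propagation scheme with deterministic tie-breaking. The function names and the author labels A to F are hypothetical, and author-name disambiguation is assumed to have been done beforehand.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_graph(papers):
    """Weighted graph: edge weight = number of papers two authors share."""
    graph = defaultdict(lambda: defaultdict(int))
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def label_propagation(graph, max_rounds=100):
    """Each author repeatedly adopts the label carrying most weight among co-authors."""
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed visiting order keeps the result deterministic
            votes = defaultdict(int)
            for neighbour, weight in graph[node].items():
                votes[labels[neighbour]] += weight
            # Heaviest label wins; ties go to the alphabetically smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, lab in labels.items():
        communities[lab].add(node)
    return sorted(sorted(group) for group in communities.values())


# Hypothetical author lists, one per paper: two tight groups joined by one paper (C, D).
papers = [["A", "B", "C"], ["A", "B"], ["B", "C"],
          ["D", "E", "F"], ["D", "E"], ["E", "F"],
          ["C", "D"]]
groups = label_propagation(coauthorship_graph(papers))
```

On this toy input the procedure separates the authors into the groups A, B, C and D, E, F. Running it on graphs built from different time windows would be one simple way to compare group structure over time.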

    Forty-Nine Years of Aramaic and Semitic Philology at Layard's Home, Ca' Cappello

    Aramaic has been taught at Ca' Foscari for almost fifty years in the unique setting of Ca' Cappello, former Venetian residence of the archaeologist Austen Henry Layard. There, in a most inspiring environment for Semitists, from 1969 to the present, seven specialists have taught Semitic languages to generations of students. The broad scope of the subject represents the appeal of Semitic Philology to Ca' Foscari students: to those interested in the history and languages of the Ancient Near East and to students who concentrate on modern Semitic languages and contemporary issues

    Robust variable selection in the framework of classification with label noise and outliers: Applications to spectroscopic data in agri-food

    Classification of high-dimensional spectroscopic data is a common task in analytical chemistry. Well-established procedures like support vector machines (SVMs) and partial least squares discriminant analysis (PLS-DA) are the most common methods for tackling this supervised learning problem. Nonetheless, interpretation of these models sometimes remains difficult, and solutions based on feature selection are often adopted as they lead to the automatic identification of the most informative wavelengths. Unfortunately, for some delicate applications like food authenticity, mislabeled and adulterated spectra can occur in the calibration and/or validation sets, with dramatic effects on model development, prediction accuracy and robustness. Motivated by these issues, the present paper proposes a robust model-based method that simultaneously performs variable selection and detection of outliers and label noise. We demonstrate the effectiveness of our proposal on three agri-food spectroscopic studies, where several forms of perturbation are considered. Our approach succeeds in reducing problem complexity, identifying anomalous spectra and attaining competitive predictive accuracy with a very low number of selected wavelengths