A review and computer code for assessing the structural dimension of a regression model: Uncorrelated 2D views
The general goal of a regression analysis is to understand how the conditional cdf F(y|x) of a response variable (y) varies as a set of predictors varies. Graphical representations of the data can aid this understanding. Unfortunately, the so-called "curse of dimensionality" can make the use of graphics difficult. Nevertheless, many regression problems may have a relatively simple structural dimension, so it is possible to draw a plot in lower dimensions that contains all the essential information. Several graphical and non-graphical methodologies have been proposed in order to reduce the dimensionality of a regression problem. In this article we review a graphical method based on dynamic graphics, and present a computer implementation in the Xlisp-Stat programming language. Examples and a case study are given as an outline for performing a regression analysis. © 2001 Elsevier Science B.V. All rights reserved
A COVINDEX based on a GAM beta regression model with an application to the COVID-19 pandemic in Italy
Detecting changes in COVID-19 disease transmission over time is a key indicator of epidemic growth. Near real-time monitoring of the pandemic growth is crucial for policy makers and public health officials who need to make informed decisions about whether to enforce lockdowns or allow certain activities. The effective reproduction number Rt is the standard index used in many countries for this goal. However, it is known that due to the delays between infection and case registration, its use for decision making is somewhat limited. In this paper a near real-time COVINDEX is proposed for monitoring the evolution of the pandemic. The index is computed from predictions obtained from a GAM beta regression for modelling the test positive rate as a function of time. The proposal is illustrated using data on COVID-19 pandemic in Italy and compared with Rt. A simple chart is also proposed for monitoring local and national outbreaks by policy makers and public health officials
Regularized sliced inverse regression with applications in classification
Consider the problem of classifying a number of objects into one of several groups or classes based on a set of characteristics. This problem has been extensively studied under the general subject of discriminant analysis in the statistical literature, or supervised pattern recognition in the machine learning field. Recently, dimension reduction methods, such as SIR and SAVE, have been used for classification purposes. In this paper we propose a regularized version of the SIR method which is able to gain information from both the structure of class means and class variances. Furthermore, the introduction of a shrinkage parameter allows the method to be applied in under-resolution problems, such as those found in gene expression microarray data. The REGSIR method is illustrated on two different classification problems using real data sets
Graphics for studying logistic regression models
In this article we focus on logistic regression models for binary responses. An existing result shows that the log-odds can be modelled depending on the log of the ratio between the conditional densities of the predictors given the response variable. This suggests that relevant statistical information could be extracted by investigating the inverse problem. Thus, we present different methods for studying the log-density ratio through graphs, which allow us to select which predictors are needed, and how they should be included in a logistic regression model. We also discuss data analysis examples based on real datasets available in the literature in order to provide further insights into the methodology proposed. © Springer-Verlag 2003
On the Influence of Data Imbalance on Supervised Gaussian Mixture Models
Imbalanced data present a pervasive challenge in many real-world applications of statistical and machine learning, where the instances of one class significantly outnumber those of the other. This paper examines the impact of class imbalance on the performance of Gaussian mixture models in classification tasks and establishes the need for a strategy to reduce the adverse effects of imbalanced data on the accuracy and reliability of classification outcomes. We explore various strategies to address this problem, including cost-sensitive learning, threshold adjustments, and sampling-based techniques. Through extensive experiments on synthetic and real-world datasets, we evaluate the effectiveness of these methods. Our findings emphasize the need for effective mitigation strategies for class imbalance in supervised Gaussian mixtures, offering valuable insights for practitioners and researchers in improving classification outcomes
On some extensions to GA package: Hybrid optimisation, parallelisation and islands evolution
Genetic algorithms are stochastic iterative algorithms in which a population of individuals evolves by emulating the process of biological evolution and natural selection. The R package GA provides a collection of general purpose functions for optimisation using genetic algorithms. This paper describes some enhancements recently introduced in version 3 of the package. In particular, hybrid GAs have been implemented by including the option to perform local searches during the evolution. This makes it possible to combine the power of genetic algorithms with the speed of a local optimiser. Another major improvement is the provision of facilities for parallel computing. Parallelisation has been implemented using both the master-slave approach and the islands evolution model. Several examples of usage are presented, with both real-world data examples and benchmark functions, showing that high-quality solutions can often be obtained more efficiently
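The hybrid idea described in this abstract (a genetic algorithm that occasionally hands its best individual to a local optimiser) can be illustrated with a small, self-contained Python sketch. This is not the GA package's R interface: the function names, the parameters (`p_local`, `p_mutation`), the benchmark objective and the simple step-halving local search are all assumptions made for illustration, and the sketch minimises a one-dimensional function rather than maximising a fitness.

```python
import random


def sphere(x):
    """Hypothetical benchmark objective to minimise; minimum at x = 2."""
    return (x - 2.0) ** 2


def local_search(f, x, step=0.5, iters=60):
    """Derivative-free descent: try x - step and x + step, halve step on failure."""
    fx = f(x)
    for _ in range(iters):
        for cand in (x - step, x + step):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step /= 2.0  # neither neighbour improved: refine the step
    return x


def hybrid_ga(f, lower, upper, pop_size=20, generations=30,
              p_mutation=0.2, p_local=0.1, seed=1):
    """Toy real-valued GA with elitism and occasional local refinement."""
    rng = random.Random(seed)
    pop = [rng.uniform(lower, upper) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)                      # best (lowest f) first
        elite = pop[0]
        if rng.random() < p_local:           # hybrid step: refine the elite
            elite = local_search(f, elite)
        parents = pop[: pop_size // 2]       # truncation selection
        children = [elite]                   # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            child = w * a + (1.0 - w) * b    # arithmetic crossover
            if rng.random() < p_mutation:
                child += rng.gauss(0.0, 0.5) # Gaussian mutation
            children.append(min(max(child, lower), upper))
        pop = children
    # Final polish of the best individual with the local optimiser.
    return local_search(f, min(pop, key=f))


best = hybrid_ga(sphere, -10.0, 10.0)
print(best)
```

The GA explores the search interval globally, while the local search supplies fast final convergence once a good region has been found, which is the division of labour the abstract attributes to hybrid GAs.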
Class prediction and gene selection for DNA microarrays using regularized sliced inverse regression
The monitoring of the expression profiles of thousands of genes has proved to be particularly promising for biological classification. DNA microarray data have been recently used for the development of classification rules, particularly for cancer diagnosis. However, microarray data present major challenges due to the complex, multiclass nature and the overwhelming number of variables characterizing gene expression profiles. A regularized form of sliced inverse regression (REGSIR) is proposed. It allows the simultaneous development of classification rules and the selection of those genes that are most important in terms of classification accuracy. The method is illustrated on some publicly available microarray data sets. Furthermore, an extensive comparison with other classification methods is reported. The REGSIR performance is comparable with the best classification methods available, and when appropriate feature selection is made the performance can be considerably improved. © 2007 Elsevier B.V. All rights reserved
A fast and efficient Modal EM algorithm for Gaussian mixtures
In the modal approach to clustering, clusters are defined as the local maxima of the underlying probability density function, where the latter can be estimated either nonparametrically or using finite mixture models. Thus, clusters are closely related to certain regions around the density modes, and every cluster corresponds to a bump of the density. The Modal Expectation-Maximization (MEM) algorithm is an iterative procedure that can identify the local maxima of any density function. In this contribution, we propose a fast and efficient MEM algorithm to be used when the density function is estimated through a finite mixture of Gaussian distributions with parsimonious component-covariance structures. After describing the procedure, we apply the proposed MEM algorithm on both simulated and real data examples, showing its high flexibility in several contexts
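As a rough illustration of how a modal EM iteration climbs to a local maximum of a Gaussian mixture density, the following Python sketch implements the basic MEM fixed-point update in one dimension. It assumes the mixture parameters are already known, and it is the plain textbook iteration, not the faster algorithm this paper proposes; the function names and the example mixture are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def modal_em_1d(x, weights, means, sds, tol=1e-10, max_iter=500):
    """Move a starting point x uphill to a local mode of a 1D Gaussian mixture."""
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current point.
        dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
        total = sum(dens)
        post = [d / total for d in dens]
        # M-step: maximise sum_k post_k * log N(x | mean_k, sd_k^2) over x.
        # For Gaussians this is a precision-weighted average of the means.
        num = sum(p * m / s ** 2 for p, m, s in zip(post, means, sds))
        den = sum(p / s ** 2 for p, s in zip(post, sds))
        x_new = num / den
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Hypothetical two-component mixture with well-separated components.
weights, means, sds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
mode_left = modal_em_1d(0.5, weights, means, sds)
mode_right = modal_em_1d(4.5, weights, means, sds)
print(mode_left, mode_right)
```

Running the iteration from every observation and grouping the points that reach the same mode gives the modal clusters described in the abstract.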
Genetic Algorithms for Subset Selection in Model-Based Clustering
Model-based clustering assumes that the data observed can be represented by a finite mixture model, where each cluster is represented by a parametric distribution. The Gaussian distribution is often employed in the multivariate continuous case. The identification of the subset of relevant clustering variables enables a parsimonious number of unknown parameters to be achieved, thus yielding a more efficient estimate, a clearer interpretation and often improved clustering partitions. This paper discusses variable or feature selection for model-based clustering. Following the approach of Raftery and Dean (J Am Stat Assoc 101(473):168–178, 2006), the problem of subset selection is recast as a model comparison problem, and BIC is used to approximate Bayes factors. The criterion proposed is based on the BIC difference between a candidate clustering model for the given subset and a model which assumes no clustering for the same subset. Thus, the problem amounts to finding the feature subset which maximises such a criterion. A search over the potentially vast solution space is performed using genetic algorithms, which are stochastic search algorithms that use techniques and concepts inspired by evolutionary biology and natural selection. Numerical experiments using real data applications are presented and discussed
Visualization of Model-Based Clustering Structures
Model-based clustering based on a finite mixture of Gaussian components is an effective method for looking for groups of observations in a dataset. In this paper we propose a dimension reduction method, called MCLUSTSIR, which is able to show clustering structures depending on the selected Gaussian mixture model. The method aims at finding those directions which are able to display both variation in cluster means and variations in cluster covariances. The resulting MCLUSTSIR variables are defined through a linear mapping which projects the data onto a suitable subspace
