
    Learning from high-dimensional biomedical datasets: The issue of class imbalance

    As witnessed by a vast corpus of literature, dimensionality reduction is a fundamental step in biomedical data analysis. Indeed, in this domain, there is often a need to cope with a huge number of data attributes (or features). By removing irrelevant or redundant attributes, feature selection techniques can significantly reduce the complexity of the original problem, with important benefits in terms of domain understanding and knowledge discovery. When learning from biomedical data, however, the dimensionality issue is often addressed without a joint consideration of other critical aspects that may compromise the performance of the induced models. The adverse implications of an imbalanced class distribution, for example, are often neglected in this domain. The aim of this work is to investigate the effectiveness of hybrid learning strategies that incorporate both methods for dimensionality reduction and methods for alleviating the issue of class imbalance. Specifically, we combine different feature selection techniques, both univariate and multivariate, with sampling-based class balancing methods and cost-sensitive classification. The performance of the resulting learning schemes is experimentally evaluated on six high-dimensional genomic benchmarks, using different classification algorithms, yielding insight into the best strategies to use based on the characteristics of the data at hand.
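
As a rough illustration of the sampling-based class balancing mentioned in this abstract, the sketch below performs random undersampling of the larger classes. It is a minimal example under stated assumptions, not the authors' procedure: the function name `random_undersample`, the toy data, and the choice of undersampling (rather than oversampling or synthetic sampling) are all hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop instances so every class matches the smallest class size.

    Hypothetical helper for illustration; the abstract does not say which
    sampling method the authors actually used.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Size of the rarest class: every class is reduced to this many instances.
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original instance order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (assumed): 8 negative and 2 positive instances with 2 features each.
rows = [[float(i), float(i % 3)] for i in range(10)]
labels = [0] * 8 + [1] * 2
bal_rows, bal_labels = random_undersample(rows, labels)
```

In a hybrid scheme of the kind described, a balancing step like this would be combined with feature selection; the order in which the two are applied is a design choice the abstract does not specify.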

    Handling Class Imbalance in High-Dimensional Biomedical Datasets

    When dealing with biomedical data, the first and most challenging issue is often the huge dimensionality, i.e., the presence of a very high number of features for each of the problem instances at hand. A vast literature is available on dimensionality reduction techniques suitable for handling this kind of data, with a special focus on feature selection algorithms that discard uninformative features. In most cases, however, the dimensionality issue is addressed without a joint consideration of other potential problems in the data, including an imbalanced class distribution that may hinder the construction of effective classification models. Class imbalance, in turn, has mostly been treated in the literature as an independent problem, especially in application fields where the number of features is not so critical. But several biomedical datasets are both high-dimensional and class-imbalanced, so there is a strong need for designing and evaluating learning strategies that can properly deal with both issues simultaneously. In this work, we experiment with using feature selection techniques in conjunction with sampling-based class balancing methods and cost-sensitive classification, in order to gain insight into the most effective strategies to use when dealing with such complex data.

    Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests

    Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has thus far been conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional (i.e., with a large number of features) at the same time. This work attempts to contribute to this challenging research area by studying the effectiveness of hybrid learning strategies that integrate feature selection techniques, to reduce the data dimensionality, with methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, compared to using feature selection or imbalance learning methods alone.

    Special Issue on Emerging Trends and Challenges in Supervised Learning Tasks

    With the massive growth of data-intensive applications, the machine learning field has gained widespread popularity [...

    Feature Selection on Imbalanced Domains: A Stability-Based Analysis

    A large body of literature has shown the beneficial impact of feature selection on the efficiency, interpretability, and generalization ability of machine learning models. Most of the existing studies, however, focus on the effectiveness of feature selection algorithms in identifying small subsets of predictive features, often neglecting the stability of the selection process, i.e., its robustness with respect to sample variation, which can be crucial for the actual exploitation of the results. In particular, little research has so far investigated the stability of feature selection methods in class-imbalanced domains, where some classes are underrepresented and any perturbation in the set of training records can strongly affect the final selection outcome. This work aims to investigate this important issue by studying the stability of different selection algorithms across high-dimensional datasets that present different levels of class imbalance. To this end, a methodological pipeline is discussed that allows a joint evaluation of the selection outcome in terms of both stability and final predictive performance. Although not exhaustive, our experiments provide useful insight into which methods can be more stable on imbalanced data while still ensuring good generalization results.
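
One common way to quantify the stability described here, i.e., robustness of the selected subset to sample variation, is the average pairwise Jaccard similarity between the feature subsets obtained from different resamplings of the training data. The sketch below assumes that measure; the abstract does not state which stability index the authors used, and the feature names and subsets are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Average pairwise Jaccard similarity over all pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three perturbed versions of a training set.
subsets = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3"},
]
stability = selection_stability(subsets)  # values near 1.0 mean a stable selection
```

A value of 1.0 means every resampling produced the same subset, while values near 0 indicate that the selection changes almost entirely from one sample to the next.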

    Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study

    High dimensionality and class imbalance have been widely recognized as important issues in machine learning. A vast amount of literature has investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). Likewise, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impact the generalization ability of the induced models. Nevertheless, although both issues have been studied for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has so far been conducted to investigate which approaches might be best suited to deal with datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study of different learning strategies that leverage both feature selection, to cope with high dimensionality, and cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. In addition, different feature selection heuristics have been considered, both univariate and multivariate, to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, gaining insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
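
One standard way of incorporating misclassification costs, shown here only as an illustrative sketch, is to shift the decision threshold applied to predicted probabilities. Assuming correct decisions cost nothing, predicting the positive class minimizes expected cost whenever p >= C_FP / (C_FP + C_FN). The abstract does not say whether the authors used this scheme; the function names, cost values, and probabilities below are assumptions.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected misclassification cost.

    Assumes zero cost for correct decisions: predicting positive costs
    (1 - p) * cost_fp in expectation, predicting negative costs p * cost_fn.
    """
    return cost_fp / (cost_fp + cost_fn)


def predict_min_cost(probs, cost_fp, cost_fn):
    """Label each instance positive (1) if its probability reaches the threshold."""
    threshold = cost_threshold(cost_fp, cost_fn)
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical positive-class probabilities from any probabilistic classifier.
probs = [0.05, 0.20, 0.40, 0.90]
# Assumed costs: missing a minority (positive) instance is 9x worse than a false alarm.
preds = predict_min_cost(probs, cost_fp=1.0, cost_fn=9.0)
```

With these assumed costs the threshold drops from the usual 0.5 to 0.1, so instances with fairly low positive probability are still assigned to the minority class.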

    Exploiting Feature Selection in Human Activity Recognition: Methodological Insights and Empirical Results Using Mobile Sensor Data

    Human Activity Recognition (HAR) using mobile sensor data has gained increasing attention over the last few years, with a fast-growing number of reported applications. The central role of machine learning in this field has been discussed in a large number of research works, with several strategies proposed for processing raw data, extracting suitable features, and inducing predictive models capable of recognizing multiple types of daily activities. Since many HAR systems are implemented on resource-constrained mobile devices, the efficiency of the induced models is a crucial aspect to consider. This paper highlights the importance of exploiting dimensionality reduction techniques that can simplify the model and increase efficiency by identifying and retaining only the most informative and predictive features for activity recognition. In more detail, a large experimental study is presented that encompasses different feature selection algorithms as well as multiple HAR benchmarks containing mobile sensor data. This comparative evaluation relies on a methodological framework that is meant to assess not only the extent to which each selection method is effective in identifying the most predictive features but also the overall stability of the selection process, i.e., its robustness to changes in the input data. Although often neglected, the stability of the selected feature sets is important for a wider exploitability of the induced models. Our experimental results give insight into which selection algorithms may be best suited to the HAR domain, complementing and significantly extending the studies currently available in this field.

    Using Artificial Intelligence for COVID-19 Detection in Blood Exams: A Comparative Analysis

    COVID-19 is an infectious disease that was declared a pandemic by the World Health Organization (WHO) in early March 2020. Since its early development, it has challenged health systems around the world. Although more than 12 billion vaccine doses have been administered, at the time of writing more than 623 million confirmed cases and more than 6 million deaths have been reported to the WHO. These numbers continue to grow, calling for further research efforts to reduce the impact of the pandemic. In particular, artificial intelligence techniques have shown great potential in supporting the early diagnosis, detection, and monitoring of COVID-19 infections from disparate data sources. In this work, we aim to make a contribution to this field by analyzing a high-dimensional dataset containing blood sample data from over forty thousand individuals recognized as infected or not with COVID-19. Encompassing a wide range of methods, including traditional machine learning algorithms, dimensionality reduction techniques, and deep learning strategies, our analysis investigates the performance of different classification models, showing that accurate detection of blood infections can be obtained. In particular, an F-score of 84% was achieved by the artificial neural network model we designed for this task, with a rate of 87% correct predictions on the positive class. Furthermore, our study shows that the dimensionality of the original data, i.e., the number of features involved, can be significantly reduced to gain efficiency without compromising the final prediction performance. These results pave the way for further research in this field, confirming that artificial intelligence techniques may play an important role in supporting medical decision-making.
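
To make the reported metrics concrete, the sketch below computes precision, recall, and the F1 score (the harmonic mean of precision and recall) from confusion-matrix counts. It assumes that the "F-score" in the abstract is F1 and that the "rate of correct predictions on the positive class" is recall; the counts are hypothetical, chosen only to give values of a similar magnitude, and are not the paper's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts; ratios with a zero denominator are set to 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts (not from the paper): 100 actual positives, 87 detected,
# plus 20 negatives wrongly flagged as positive.
precision, recall, f1 = precision_recall_f1(tp=87, fp=20, fn=13)
```

On imbalanced data, F1 and positive-class recall are typically more informative than plain accuracy, since a model that always predicts the majority class can score high accuracy while detecting no positives at all.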

    DEW 2019: Data Exploration in the Web 3.0 Age

    Now in its third edition, the Data Exploration in the Web 3.0 Age (DEW) track of the IEEE International WETICE Conference continues to bring together researchers and practitioners from both academia and industry working in areas related to data exploration, in a very broad sense. Papers accepted for presentation at DEW 2019 are representative of emerging topics in the fields of data and text mining, machine learning, semantic web and Internet of Things.