Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains
Selecting a subset of relevant features is crucial to the analysis of high-dimensional datasets from a number of application domains, such as biomedical data, document and image analysis. Since no single selection algorithm seems capable of ensuring optimal results in terms of both predictive performance and stability (i.e. robustness to changes in the input data), researchers have increasingly explored the effectiveness of “ensemble” approaches that combine different selectors. While interesting proposals have been reported in the literature, most of them have so far been evaluated in a limited number of settings (e.g. with data from a single domain and in conjunction with specific selection approaches), leaving unanswered important questions about the large-scale applicability and utility of ensemble feature selection. To contribute to the field, this work presents an empirical study that encompasses different kinds of selection algorithms (filters and embedded methods, univariate and multivariate techniques) and different application domains. Specifically, we consider 18 classification tasks with heterogeneous characteristics (in terms of number of classes and instances-to-features ratio) and experimentally evaluate, for feature subsets of different cardinalities, the extent to which an ensemble approach turns out to be more robust than a single selector, thus providing useful insight for both researchers and practitioners.
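One common way to combine selectors is to aggregate their feature rankings. The abstract does not say which combination rule the study uses, so the sketch below assumes mean-rank aggregation. The function name `aggregate_rankings` and the toy feature names are hypothetical.

```python
from statistics import mean


def aggregate_rankings(rankings, k):
    """Combine several feature rankings by mean rank and return the top-k features.

    rankings: list of lists, each an ordering of the same feature names, best first.
    Mean-rank aggregation is an assumption made for illustration only.
    """
    features = rankings[0]
    # Position of each feature in each ranking (0 = best).
    positions = [{f: i for i, f in enumerate(r)} for r in rankings]
    mean_rank = {f: mean(p[f] for p in positions) for f in features}
    # Sort by mean rank; ties are broken by feature name so the result is deterministic.
    ordered = sorted(features, key=lambda f: (mean_rank[f], f))
    return ordered[:k]


# Three hypothetical selectors ranking the same four features.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
]
top2 = aggregate_rankings(rankings, 2)
```

The rankings could come from different selection algorithms, or from one algorithm applied to different samples of the data. The aggregation step is the same in both cases.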
Feature Selection for high-dimensional data: the issue of stability
Feature selection has become a necessary step in the analysis of high-dimensional datasets from several application domains (e.g., web data, document and image analysis, biological data). Though well-established methods exist to select highly discriminative features, discarding those that are either redundant or irrelevant to the problem at hand, little attention has so far been given to the stability of these methods when the composition of the original dataset is perturbed to some extent (e.g., by adding new records or by random sampling). In this work, we highlight the importance of jointly considering stability and predictive performance when the selection results are used for knowledge discovery and domain understanding. As a case study, we consider five popular feature selection algorithms, representative of different selection approaches, and experimentally investigate their behaviour across three domains: Internet advertisements, text categorization and biomedical data classification. Useful insight into the “intrinsic” stability of each algorithm seems to emerge, despite the peculiar characteristics of each dataset.
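Stability under perturbation can be quantified by comparing the feature subsets selected on different samples of the same dataset. The abstract does not name a stability measure, so the sketch below assumes the average pairwise Jaccard similarity. The subsets are made-up examples.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def stability(subsets):
    """Average pairwise Jaccard similarity over all pairs of subsets.

    Each subset is assumed to come from running the same selector
    on a different perturbed version of the data.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three resampled versions of a dataset.
subsets = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f2", "f3"}]
score = stability(subsets)
```

A score of 1 means the selector returned the same subset on every sample, and a score of 0 means no two subsets shared a feature.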
A Framework for the Modular Composition of Learning Objects
This paper describes a framework to support the modular composition of learning objects for achieving educational goals. The presented approach involves two modeling spaces: the semantic space of educational concepts and the digital space of learning objects. The framework abstracts the parallel features of both spaces for modeling conceptual skeletons, called views, that encapsulate learning objects as well as knowledge on how they can be sequenced. Each view is linked to a concept of the semantic space and is described by metadata. In order to achieve modularity, portability and extensibility, views are mapped into an XML Schema, based on which they can be systematically implemented and managed by intelligent agents.
Smart spaces for adaptive information integration in bioinformatics
Bioinformatics is reaping the benefits of advances in Semantic Web technology thanks to the growing number of available biomedical web resources and portals. Although positive in general, this abundance poses practical challenges to researchers, who must be skilled in techniques for retrieving and integrating sparse and complex contents, and therefore calls for more “intelligent” and user-friendly ways of interacting with these resources to obtain information easily. With the aim of making the intelligent functionality of smart systems available, this paper presents SSAIIB (Smart Spaces for Adaptive Information Integration in Bioinformatics), a reference framework for designing smart bioinformatics applications that support discovering, aggregating and delivering contents from web resources according to the user's goals, tasks and concerns. The “Smart Spaces” are software environments whose smartness lies in their ability to adaptively accomplish specific user activities such as exploiting content from biomedical resources, integrating data captured from different sources, supporting data analytics, etc. SSAIIB is structured around two main technologies: service-oriented architectures and software agents. In particular, it relies on mechanisms for dynamically assembling suitable services and on the use of agents as a natural metaphor for both modelling user activities and accessing web resources. A case study is presented that shows the application of SSAIIB to the design and implementation of a smart space for annotating biomedical texts.
Similarity of feature selection methods: An empirical study across data intensive classification tasks
In the past two decades, the dimensionality of datasets involved in machine learning and data mining applications has increased explosively. Therefore, feature selection has become a necessary step to make the analysis more manageable and to extract useful knowledge about a given domain. A large variety of feature selection techniques are available in the literature, and their comparative analysis is a very difficult task. So far, few studies have investigated, from a theoretical and/or experimental point of view, the degree of similarity/dissimilarity among the available techniques, namely the extent to which they tend to produce similar results within specific application contexts. This kind of similarity analysis is of crucial importance when two or more methods are combined in an ensemble fashion: indeed, the ensemble paradigm is beneficial only if the involved methods are capable of giving different and complementary representations of the considered domain. This paper contributes in this direction by proposing an empirical approach to evaluate the degree of consistency among the outputs of different selection algorithms in the context of high-dimensional classification tasks. Using a suitable similarity index, we systematically compared the feature subsets selected by eight popular selection methods, representative of different selection approaches, and derived a similarity trend for feature subsets of increasing size. Through extensive experimentation involving sixteen datasets from three challenging domains (Internet advertisements, text categorization and micro-array data classification), we obtained useful insight into the pattern of agreement of the considered methods. In particular, our results revealed that multivariate selection approaches systematically produce feature subsets that overlap only to a small extent with those selected by the other methods.
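A similarity trend over increasing subset sizes can be sketched as follows. The abstract does not name its similarity index, so this example assumes the Kuncheva consistency index, which corrects the overlap between two equal-sized subsets for chance. The rankings are made up.

```python
def kuncheva_index(a, b, n):
    """Kuncheva consistency index for two subsets of equal size k drawn from n features.

    Returns (r*n - k^2) / (k*(n - k)), where r is the size of the intersection.
    The value is 1 for identical subsets and close to 0 for chance-level overlap.
    """
    a, b = set(a), set(b)
    k = len(a)
    if k != len(b):
        raise ValueError("subsets must have the same size")
    if k == 0 or k >= n:
        raise ValueError("index is undefined for k = 0 or k >= n")
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))


def similarity_trend(ranking_a, ranking_b, sizes):
    """Similarity between the top-k features of two rankings, for each k in sizes."""
    n = len(ranking_a)
    return {k: kuncheva_index(ranking_a[:k], ranking_b[:k], n) for k in sizes}


# Hypothetical rankings of six features produced by two selection methods.
ranking_a = ["f1", "f2", "f3", "f4", "f5", "f6"]
ranking_b = ["f2", "f1", "f5", "f3", "f6", "f4"]
trend = similarity_trend(ranking_a, ranking_b, [2, 3, 4])
```

Computing such a trend for every pair of methods yields a similarity matrix per subset size. Methods that consistently score low against the others are candidates for adding diversity to an ensemble.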
An Evolutionary Method for Combining Different Feature Selection Criteria in Microarray Data Classification
The classification of cancers from gene expression profiles is a challenging research area in bioinformatics, since the high dimensionality of micro-array data results in irrelevant and redundant information that affects classification performance. This paper proposes using an evolutionary algorithm to select relevant gene subsets for use in the classification task. This is achieved by combining valuable results from different feature ranking methods into feature pools whose dimensionality is reduced by a wrapper approach involving a genetic algorithm (GA) and an SVM classifier. Specifically, the GA explores the space defined by each feature pool, looking for solutions that balance the size of the feature subsets and their classification accuracy. Experiments demonstrate that the proposed method provides good results in comparison to different state-of-the-art methods for the classification of micro-array data.
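The sketch below covers only the two ingredients the abstract names: building a feature pool from several rankings, and scoring a candidate subset by its size and accuracy. It is not the paper's implementation, and everything specific in it is an assumption: the pool rule (union of top-k features), the weighted-sum fitness, the `weight` parameter, and the `toy_accuracy` stand-in, which replaces a cross-validated SVM score. The genetic search itself is omitted.

```python
def build_pool(rankings, top_k):
    """Union of the top_k features of each ranking, in first-seen order (assumed rule)."""
    pool = []
    for ranking in rankings:
        for feature in ranking[:top_k]:
            if feature not in pool:
                pool.append(feature)
    return pool


def fitness(mask, accuracy_fn, pool, weight=0.8):
    """Score a candidate subset encoded as a 0/1 mask over the pool.

    Weighted sum of accuracy and a reward for small subsets; the weighting
    scheme is a hypothetical choice, not taken from the paper.
    """
    selected = [f for f, bit in zip(pool, mask) if bit]
    if not selected:
        return 0.0  # an empty subset cannot be used for classification
    size_term = 1.0 - len(selected) / len(pool)
    return weight * accuracy_fn(selected) + (1.0 - weight) * size_term


def toy_accuracy(selected):
    """Stand-in for a cross-validated classifier score (hypothetical)."""
    return 1.0 if "g1" in selected else 0.5


pool = build_pool([["g1", "g2", "g3"], ["g2", "g4", "g1"]], top_k=2)
small = fitness([1, 0, 0], toy_accuracy, pool)  # only g1 selected
full = fitness([1, 1, 1], toy_accuracy, pool)   # whole pool selected
```

In a full wrapper, a GA would evolve a population of such masks, using `fitness` to rank candidates, with `accuracy_fn` replaced by the classifier's estimated accuracy on the selected genes.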
Exploiting biomedical web resources: a case study
An increasing number of web resources continue to be extensively used by healthcare operators to obtain more accurate diagnostic results. In particular, health care is reaping the benefits of technological advances in genomics to meet the demand for genetic tests that allow a better comprehension of diagnostic results. Within this context, Gene Ontology (GO) is a popular and effective means of extracting knowledge from a list of genes and evaluating their semantic similarity. This paper investigates the potential and limits of GO as a support for capturing information about a set of genes that are supposed to play a significant role in a pathological condition. In particular, we present a case study that exploits some biomedical web resources to devise several groups of functionally coherent genes, together with experiments evaluating their semantic similarity over GO. Owing to the structure and content of GO, the results reveal limitations that do not affect the evaluation of semantic similarity when genes exhibit simple correlations, but that influence the estimation of the relatedness of genes belonging to complex organizations.
