
    Cancer recognition with bagged ensembles of Support Vector Machines

    Expression-based classification of tumors requires stable, reliable, variance-reducing methods, as DNA microarray data are characterized by small sample size, high dimensionality, noise and large biological variability. To address the variance and curse-of-dimensionality problems arising from this difficult task, we propose applying bagged ensembles of Support Vector Machines (SVMs) and feature selection algorithms to the recognition of malignant tissues. The results presented show that bagged ensembles of SVMs are more reliable than single SVMs and achieve equal or better classification accuracy, while feature selection methods can further enhance classification accuracy.

    Bagged ensembles of Support Vector Machines for gene expression data analysis

    Extracting information from gene expression data is a difficult task, as these data are characterized by very high dimensionality, small sample sizes and a large degree of biological variability. Feature selection algorithms offer a possible way of dealing with the curse of dimensionality, while variance problems arising from small samples and biological variability can be addressed through ensemble methods based on resampling techniques. These two approaches have been combined to improve the accuracy of Support Vector Machines (SVMs) in the classification of malignant tissues from DNA microarray data. Proper measures have been introduced to assess the accuracy and the confidence of the predictions. The results presented show that bagged ensembles of SVMs are more reliable than single SVMs and achieve equal or better classification accuracy, while feature selection methods can further enhance classification accuracy.
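The bagging idea described in the two abstracts above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a nearest-centroid classifier stands in for the SVM base learner so the example needs only the standard library, the vote fraction is an assumed stand-in for the confidence measures the abstract mentions, and all function names and data values are hypothetical.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Fit a nearest-centroid model (stand-in for an SVM base learner)."""
    sums, counts = {}, Counter()
    for features, label in samples:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(model[label], features)),
    )


def train_bagged(samples, n_models=25, seed=0):
    """Train one base learner per bootstrap replicate (sampling with replacement)."""
    rng = random.Random(seed)
    return [
        train_centroid([rng.choice(samples) for _ in samples])
        for _ in range(n_models)
    ]


def predict_bagged(models, features):
    """Majority vote; the vote fraction is an assumed, simple confidence proxy."""
    votes = Counter(predict_centroid(model, features) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


# Hypothetical two-gene expression profiles, purely for illustration.
samples = [
    ([0.0, 0.1], "normal"), ([0.2, 0.0], "normal"), ([0.1, 0.3], "normal"),
    ([5.0, 5.1], "tumor"), ([5.2, 4.9], "tumor"), ([4.8, 5.0], "tumor"),
]
models = train_bagged(samples)
label, confidence = predict_bagged(models, [4.9, 5.0])
```

Averaging votes over bootstrap replicates is what reduces the variance of an unstable base learner; swapping in a real SVM would only change `train_centroid` and `predict_centroid`.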

    Modeling gene expression data via positive Boolean functions

    In this work we propose an artificial model for the generation of biologically plausible gene expression data to be used in evaluating the performance of gene selection and clustering methods. The model allows the set of relevant genes and the functional classes involved in the problem to be fixed in advance; the input-output relationship is constructed by synthesizing a positive Boolean function. Despite its simplicity, the model is sufficiently rich to take account of the specific peculiarities of gene expression data, including biological variability. Java code has been developed to allow users to choose the model parameters according to the characteristics of the experiment they want to simulate. This makes it possible to insert the artificial model into a distributed system for microarray analysis, in particular one based on a Grid infrastructure.
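For readers unfamiliar with the term, a positive (monotone) Boolean function is one that can be written as an OR of ANDs with no negated inputs, so switching any input from 0 to 1 can never switch the output from 1 to 0. The sketch below illustrates only that definition; it is not the paper's generator, and the helper names and the example terms are hypothetical.

```python
from itertools import product


def positive_dnf(terms):
    """Build f(bits) = OR over terms of AND over the listed input indices.

    No input is negated, which makes the function positive (monotone).
    """
    def f(bits):
        return any(all(bits[i] for i in term) for term in terms)
    return f


def is_monotone(f, n_inputs):
    """Exhaustively check that raising any input from 0 to 1 never lowers f."""
    for bits in product((0, 1), repeat=n_inputs):
        for i in range(n_inputs):
            if bits[i] == 0:
                raised = bits[:i] + (1,) + bits[i + 1:]
                if f(bits) and not f(raised):
                    return False
    return True


# Hypothetical rule: the class is "on" if genes 0 and 1 are both expressed,
# or if gene 2 alone is expressed.
rule = positive_dnf([(0, 1), (2,)])
```

The exhaustive check is exponential in the number of inputs, so it is only practical for small examples like this one.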

    Assessment of clusters reliability for high dimensional genomic data

    Motivation: Discovering new subclasses of pathologies and expression signatures related to specific phenotypes are challenging problems in the context of gene expression data analysis. To pursue these objectives, we need to estimate the natural number and the stability of the discovered clusters. To this end, new approaches based on random subspaces and bootstrap methods have recently been proposed. Methods: We present a method based on randomized embeddings between Euclidean subspaces to assess the stability of clusters characterized by low cardinality and very high dimensionality. In particular, we propose a cluster stability measure based on the similarity between randomly projected data obeying the Johnson-Lindenstrauss lemma, in order to control the distortion induced by randomized maps. As a by-product of our approach we may also assess the stability of the overall clustering (thus estimating the number of "natural clusters" in a data set) and the confidence of the assignment of each example to each cluster. The proposed approach may be applied to any clustering algorithm, including classical hierarchical and fuzzy clustering. Results: We first evaluated the distortion induced by random mappings from very high to lower dimensional Euclidean spaces using high dimensional synthetic data, showing that we may obtain distortions lower than those predicted by the Johnson-Lindenstrauss lemma. We then applied the proposed stability indices, based on embeddings into lower dimensional spaces with limited distortion, to both synthetic and gene expression data. In particular, we computed the s-index (stability index) specific to each cluster, the overall validity index S that estimates the reliability of the overall clustering, and the AC (Assignment-Confidence) index that estimates the reliability of the membership of a specific example in a specific cluster. Results with synthetic and gene expression data clustered with classical hierarchical clustering algorithms show the effectiveness of the proposed approach.
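The distortion experiment mentioned in the abstract can be illustrated with a small sketch. It assumes a Gaussian random projection and one commonly quoted form of the Johnson-Lindenstrauss bound, k >= 4 ln(n) / (eps^2/2 - eps^3/3); the paper may use a different projection or bound, the stability indices themselves are not reproduced here, and all names and sizes are hypothetical.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Project points with a Gaussian random matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [
        [sum(row[j] * point[j] for j in range(source_dim)) for row in matrix]
        for point in points
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_distortion(points, projected):
    """Largest relative change in squared pairwise distance after projection."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            original = squared_distance(points[i], points[j])
            if original == 0.0:
                continue  # identical points carry no distortion information
            ratio = squared_distance(projected[i], projected[j]) / original
            worst = max(worst, abs(ratio - 1.0))
    return worst


def jl_min_dim(n_points, eps):
    """Target dimension from the assumed bound 4 ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))


# Hypothetical synthetic data: 10 points in 400 dimensions, projected to 200.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(400)] for _ in range(10)]
projected = random_projection(points, target_dim=200)
observed = max_distortion(points, projected)
```

Comparing `observed` with the `eps` that `jl_min_dim` would require for the same target dimension gives a rough sense of how conservative the bound is on a given data set.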

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
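The difference between the two counting schemes can be made concrete with a small sketch. It assumes that two authors are co-cited once per citing paper whenever both appear among the counted authors of that paper's references, and that co-authors of the same cited work count as co-cited with each other; the study's exact counting rules may differ, and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author pairs co-cited by the same citing paper.

    citing_papers: one entry per citing paper, each a list of cited works,
    each cited work an ordered list of author names.
    max_authors: how many leading authors of each cited work to count
    (1 = traditional first-author counting, 5 = first five authors).
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Hypothetical reference lists of three citing papers.
citing_papers = [
    [["A", "B"], ["C"]],
    [["A"], ["C", "D"]],
    [["B", "A"], ["D"]],
]
first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
```

In this toy data the first-author scheme sees only the pairs (A, C) and (B, D), while counting up to five authors also links A with B and D, which is the kind of denser author grouping the abstract describes.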

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
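As a rough illustration of how the two best-known measures behave, the sketch below computes the cosine and the Pearson correlation of two cocitation profiles and then appends zero entries to both. It shows one well-known mathematical property (the cosine is unchanged by shared zeros, the Pearson correlation is not); this is not necessarily the argument the authors make, and the profile values are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two profiles (undefined for all-zero input)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, i.e. the cosine of the mean-centered profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors against four other authors.
profile_a = [3, 0, 2, 1]
profile_b = [1, 2, 0, 3]

# The same profiles after adding two authors who are cocited with neither.
padded_a = profile_a + [0, 0]
padded_b = profile_b + [0, 0]

cosine_before, cosine_after = cosine(profile_a, profile_b), cosine(padded_a, padded_b)
pearson_before, pearson_after = pearson(profile_a, profile_b), pearson(padded_a, padded_b)
```

Here the cosine stays at 6/14 after padding, while the Pearson correlation moves from -0.6 to 0, so the choice of which authors to include in the analysis changes the Pearson-based similarity even though the two authors' relationship to each other is unchanged.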