
    Bag of Words approaches for Bioinformatics

    In recent years, several Pattern Recognition problems have been successfully addressed by approaches based on the "bag of words" representation. This representation is particularly appropriate when the pattern is characterized (or assumed to be characterized) by the repetition of basic, "constituting" elements called words. Assuming that all possible words are stored in a dictionary, the bag of words vector for a particular object is obtained by counting the number of times each element of the dictionary occurs in the object. Although widely applied in several scientific fields (with increasingly sophisticated approaches), techniques based on this representation have not been fully exploited in Bioinformatics, owing to the methodological and applicative challenges posed by this specific discipline. Nevertheless, the bag of words paradigm seems particularly well suited to this context: on the one hand, many biological mechanisms inherently involve a counting process; on the other hand, in many Bioinformatics scenarios the objects of the problem are either unstructured or have unknown structure, so that one of the main drawbacks of the bag of words representation (it destroys the object's structure) no longer applies. This makes it possible to derive highly effective and interpretable solutions, a pressing need in current Bioinformatics research. This thesis is set in the scenario described above and promotes the use of the bag of words paradigm to address problems in Bioinformatics and Computational Biology.
    We investigate the issues involved in creating bag of words models and representations for specific Bioinformatics problems, and we propose original solutions and approaches based on this representation. In particular, three scenarios are analyzed: gene expression analysis, the modeling of HIV infection, and protein remote homology detection. For each scenario, the motivations, advantages, and challenges of the bag of words representation are examined, and possible solutions are proposed. The merits of bag of words representations and models are demonstrated in extensive experimental evaluations, using widely adopted benchmarks as well as datasets obtained through direct interaction with biological and clinical laboratories and research groups. The thesis provides evidence that the bag of words representation can have a significant impact on the Bioinformatics and Computational Biology communities.
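    The counting step described in the abstract above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "words" are overlapping k-mers of a nucleotide sequence and that the dictionary lists every possible k-mer; the function names `build_dictionary` and `bag_of_words` are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import product


def build_dictionary(alphabet, k):
    """Return every possible word (k-mer) over the alphabet, in a fixed order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def bag_of_words(sequence, dictionary, k):
    """Count how many times each dictionary word occurs in the sequence."""
    # Slide a window of length k over the sequence and tally each word.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # A Counter returns 0 for words that never occur.
    return [counts[word] for word in dictionary]


# Illustrative input only: a toy DNA string and all 16 possible 2-mers.
dictionary = build_dictionary("ACGT", 2)
vector = bag_of_words("ACGTAC", dictionary, 2)
```

    Each entry of `vector` is the number of occurrences of the corresponding dictionary word; the order in which words appear in the sequence is discarded, which is the loss of structure the abstract mentions.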

    2D shapes classification using BLAST

    This paper presents a novel 2D shape classification approach that builds on the large body of work carried out by bioinformaticians in biological sequence analysis. In particular, we propose to encode shapes as biological sequences and to use the widely known sequence alignment tool BLAST (Basic Local Alignment Search Tool) to derive a similarity score, which is then used in a nearest neighbour scenario. Results obtained on standard datasets show the feasibility of the proposed approach.

    A bioinformatics approach to 2D shape classification

    In the past, the extensive and profitable interaction between Pattern Recognition (PR) and biology/bioinformatics was mainly unidirectional, namely targeted at applying PR tools and ideas to analyse biological data. In this paper we investigate an alternative approach, which exploits bioinformatics solutions to solve PR problems: in particular, we address the 2D shape classification problem using classical biological sequence analysis approaches, for which a vast number of tools and solutions have been developed and improved in more than 40 years of research. First, we highlight the similarities between 2D shapes and biological sequences; then we propose three methods to encode a shape as a biological sequence. Given the encoding, we can employ standard biological sequence analysis tools to derive a similarity, which can be exploited in a nearest neighbor framework. Classification results, obtained on 5 standard datasets, confirm the potential of the proposed unconventional interaction between PR and bioinformatics. Moreover, we provide some evidence of how other bioinformatics concepts and tools can be used to interpret data and results, confirming the flexibility of the proposed framework.
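    The classification step described above (a sequence similarity used in a nearest neighbor framework) can be sketched as follows. This is only an illustration under stated assumptions: the shape-to-sequence encodings are not given in the abstract, so the strings below are made-up placeholders, and a small Needleman-Wunsch global alignment score with arbitrary match, mismatch, and gap values stands in for the standard sequence analysis tools the paper refers to.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (higher means more similar)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def nearest_neighbor_label(query, training):
    """Return the label of the training sequence most similar to the query."""
    _, best_label = max(training, key=lambda item: alignment_score(query, item[0]))
    return best_label


# Hypothetical encoded shapes: (sequence, class label) pairs.
training = [("AAAACCCC", "square"), ("ACACACAC", "star")]
predicted = nearest_neighbor_label("AAACCCC", training)
```

    Any similarity derived from an alignment tool could replace `alignment_score` here without changing the nearest neighbor logic.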

    2D shape recognition using biological sequence alignment tools

    In this paper a novel 2D shape recognition approach is proposed. The main idea is to build on the large body of work carried out by bioinformaticians in biological sequence analysis. In the proposed approach, we encode shapes as biological sequences and use standard, well-established sequence alignment tools to derive a similarity score, which is then used in a nearest neighbour scenario. Despite the simplicity of the approach, the results obtained on standard datasets are encouraging.

    Soft Ngram representation and modeling for protein remote homology detection

    Remote homology detection represents a central problem in bioinformatics, where the challenge is to detect functionally related proteins when their sequence similarity is low. Recent solutions employ representations derived from the sequence profile, obtained by replacing each amino acid of the sequence with the corresponding most probable amino acid in the profile. However, the information contained in the profile could be exploited more deeply, provided that there is a representation able to capture and properly model such crucial evolutionary information. In this paper we propose a novel profile-based representation for sequences, called soft Ngram. This representation, which extends the traditional Ngram scheme (obtained by grouping N consecutive amino acids), makes it possible to consider all of the evolutionary information in the profile: this is achieved by extracting Ngrams from the whole profile, equipping them with a weight directly computed from the corresponding evolutionary frequencies. We illustrate two different approaches to model the proposed representation and to derive a feature vector, which can be effectively used for classification using a support vector machine (SVM). A thorough evaluation on three benchmarks demonstrates that the new approach outperforms other Ngram-based methods, and shows very promising results also in comparison with a broader spectrum of techniques.
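    The idea of extracting weighted Ngrams from the whole profile can be sketched as below. The abstract does not state how the weight is computed, so this sketch assumes one plausible choice: the weight of an Ngram in a window is the product of the per-position frequencies of its residues. The profile format (one dictionary of residue frequencies per position) and the name `soft_ngram_counts` are also assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def soft_ngram_counts(profile, n):
    """Accumulate weighted Ngram counts over every window of n profile columns.

    Assumption: an Ngram's weight in a window is the product of the
    frequencies of its residues at the corresponding positions.
    """
    counts = defaultdict(float)
    for start in range(len(profile) - n + 1):
        window = profile[start:start + n]
        # Enumerate every way of picking one residue per column.
        for combo in product(*(column.items() for column in window)):
            word = "".join(residue for residue, _ in combo)
            weight = 1.0
            for _, freq in combo:
                weight *= freq
            counts[word] += weight
    return dict(counts)


# Toy profile with three positions; each column's frequencies sum to 1.
profile = [{"A": 0.5, "G": 0.5}, {"L": 1.0}, {"A": 0.25, "V": 0.75}]
counts = soft_ngram_counts(profile, 2)
```

    Under this assumption, a traditional (hard) Ngram count is the special case in which every column puts all its frequency on a single residue.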

    A Multimodal Approach for Protein Remote Homology Detection

    Protein remote homology detection represents a crucial and challenging task in bioinformatics: even though effective methods have appeared in recent years, in several cases a proper characterization of remote evolutionary correlation cannot be derived. In such situations, information derived from other sources may help, provided that such (even partial) information can be properly integrated into existing models. In this paper, we provide some evidence that this route is feasible: inspired by the multimodal retrieval literature, we show how a simple multimodal approach can improve a model learned from a set of sequences by using knowledge derived from a partial set of corresponding 3D structures. We investigate (with the SCOP 1.53 benchmark) the suitability of the proposed multimodal scheme, showing that a beneficial effect can be obtained even when only a very small number of structures is available. A further detailed analysis on a member of the GPCR superfamily confirms that this multimodal approach can extract information that cannot be obtained from sequence-based techniques.

    Feature selection using Counting Grids: application to microarray data

    In this paper a novel feature selection scheme is proposed, which exploits the potential of a recent probabilistic generative model, the Counting Grid. This model is able to cluster together similar observations, highlighting the compactness of a class and its underlying structure. The proposed feature selection scheme is applied to the expression microarray scenario, a peculiar context with very few patterns and a huge number of features. Experiments on benchmark datasets show that the proposed approach is effective and stable, achieving state-of-the-art classification accuracy.

    From Web to Physical and Back: WP User Profiling with Deep Learning

    This position paper discusses the definition and implementation of Web-Physical (WP) user profiles, which enable personalized recommendations and novel behavioral predictions in specific scenarios such as fairs. A WP profile builds upon two different worlds, the Web (social networks and web applications) and the Physical one, each explored through a (big) data collection platform. The two platforms collect radically different information: on the one hand, expressions of appreciation for a particular product or service, together with other metadata (web domain); on the other, the (x, y) positions of users in the exhibition space (physical domain). In this scenario, our research idea is to identify how the information in the two domains can be merged into a single entity from a theoretical point of view. This would have tangible consequences for personalized recommendations and behavioral predictions, where a personalized recommendation is a suggestion to a user in physical terms (e.g., a pavilion to visit) and/or in web terms (e.g., a site to visit), and a behavioral prediction is a prediction of where a user may go in the future, possibly across both domains (physical + web).

    Biologically-aware Latent Dirichlet Allocation (BaLDA) for the Classification of Expression Microarray
