    Integration and querying of genomic and proteomic semantic annotations for biomedical knowledge extraction

    Understanding complex biological phenomena requires answering complex biomedical questions that involve multiple kinds of biomolecular information simultaneously. This information is expressed through genomic and proteomic semantic annotations scattered across many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper biologists' ability to ask global queries and perform global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome and OMIM). Our solution is general: it uses a flexible, modular and multilevel global data schema based on abstraction and generalization of integrated data features, and a set of automatic procedures that ease data integration and maintenance, even when the integrated data sources evolve in content, structure and number. These procedures also ensure consistency, quality and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows easy graphical composition of queries, even complex ones, on the knowledge base; it also supports semantic query expansion and comprehensive explorative search of the integrated data to better sustain biomedical knowledge extraction.
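
    The semantic closure of hierarchical ontology relationships mentioned in this abstract can be illustrated with a minimal sketch. It assumes the hierarchy is given as direct (child, parent) pairs and derives every implied (descendant, ancestor) pair. The function name and the toy terms are hypothetical and are not taken from the GPKB implementation.

```python
from collections import defaultdict

def semantic_closure(is_a_edges):
    """Return every (descendant, ancestor) pair implied by direct is_a edges."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)

    closure = set()
    for term in list(parents):
        stack = list(parents[term])
        seen = set()  # guards against revisiting terms (and against cycles)
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((term, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return closure

# Hypothetical toy hierarchy; the term names are illustrative only.
edges = [
    ("kinase activity", "transferase activity"),
    ("transferase activity", "catalytic activity"),
    ("catalytic activity", "molecular function"),
]
closure = semantic_closure(edges)
```

    With the closure stored, an annotation to a specific term can also be retrieved by a query on any of its ancestors, without walking the hierarchy at query time.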

    Visual composition of complex queries on an integrative genomic and proteomic data warehouse

    Biomedical questions are usually complex and concern several different life science aspects. Numerous valuable and heterogeneous data are increasingly available to answer such questions, yet they are stored in dispersed sources and are difficult to query comprehensively. We created a Genomic and Proteomic Data Warehouse (GPDW) that integrates data provided by some of the main bioinformatics databases. It adopts a modular integrated data schema and several metadata to describe the integrated data, their sources and their location in the GPDW. Here, we present the Web application that we developed to enable any user to easily compose queries, even complex ones, on all data integrated in the GPDW. It is publicly available at http://www.bioinformatics.dei.polimi.it/GPKB/. Through a visual interface, the user is only required to select the types of data to be included in the query and the conditions on their values to be retrieved. Then, the Web application leverages the metadata and modular schema of the GPDW to automatically compose an efficient SQL query, run it on the GPDW and show the extracted data, enriched with links to external data sources. Tests demonstrated the efficiency and usability of the developed Web application, and showed the relevance of both the application and the GPDW in supporting the answering of biomedical questions, including difficult ones.
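
    The idea of composing an SQL query from selected data types and value conditions can be sketched as follows. This is only an illustration under stated assumptions: the metadata dictionary, the table and column names, and the toy rows are all hypothetical and do not reflect the real GPDW schema or the application's code.

```python
import sqlite3

# Hypothetical metadata: each selectable data type maps to a table and to the
# condition that joins it to the "gene" table. Not the real GPDW schema.
METADATA = {
    "gene": {"table": "gene", "join": None},
    "protein": {"table": "protein", "join": "protein.gene_id = gene.gene_id"},
    "disorder": {"table": "disorder", "join": "disorder.gene_id = gene.gene_id"},
}
ALLOWED_COLUMNS = {"gene.symbol", "protein.name", "disorder.name"}

def compose_query(selected, conditions):
    """Build a parameterized SELECT from chosen data types and value filters."""
    # The gene table is always the starting point of the join chain.
    selected = ["gene"] + [s for s in selected if s != "gene"]
    columns = [c for c in sorted(ALLOWED_COLUMNS) if c.split(".")[0] in selected]
    sql = "SELECT " + ", ".join(columns) + " FROM gene"
    for name in selected:
        join = METADATA[name]["join"]
        if join:
            sql += f" JOIN {METADATA[name]['table']} ON {join}"
    clauses, params = [], []
    for column, value in conditions.items():
        # Only whitelisted columns of selected tables may appear in the SQL text.
        if column not in ALLOWED_COLUMNS or column.split(".")[0] not in selected:
            raise ValueError(f"column not available: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)  # values are bound, never concatenated
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# Toy in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE protein (protein_id INTEGER PRIMARY KEY, gene_id INTEGER, name TEXT);
    CREATE TABLE disorder (disorder_id INTEGER PRIMARY KEY, gene_id INTEGER, name TEXT);
    INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B');
    INSERT INTO protein VALUES (10, 1, 'Protein A'), (11, 2, 'Protein B');
    INSERT INTO disorder VALUES (100, 1, 'Disorder X');
""")
sql, params = compose_query(["gene", "disorder"], {"disorder.name": "Disorder X"})
rows = conn.execute(sql, params).fetchall()
```

    Keeping the join conditions in metadata rather than in code means a new data type can be made queryable by adding one metadata entry, which is the kind of benefit a modular schema is meant to provide.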

    Protein-protein interaction associated disorders revealed via data integration

    Numerous protein-protein interaction (PPI) data are produced by new high-throughput experimental and computational techniques and are being collected in different databases. The data generally do not contain phenotypic, or even functional or structural, information about the interactors, which in many cases is available in other databases. Thus, to achieve broad coverage, it is necessary to combine the data from different databases. For this purpose, we are developing a framework to create and maintain a data warehouse on the basis of a conceptual data model. We then applied an automatic association inference method based on the transitive closure concept. In particular, by leveraging IntAct and Mint PPI data, Entrez protein encoding gene data and OMIM genetic disorder data, we inferred associations between proteins and genetic disorders and their phenotypes. In our data warehouse, 46,154 human PPIs regarding 12,178 distinct human proteins were integrated. These human proteins are encoded by 11,232 different human genes. By applying the transitive closure concept, we identified 1,130 gene networks and found 1,136 human PPIs associated with 628 genetic disorders. The interactions between proteins that the transitive closure method associates with a specific disease will help researchers focus on the protein interactions relevant to that disease. This will help to reveal diseases caused by malfunctioning protein interactions; a treatment strategy such as synthetic protein engineering could then possibly be applied. This hypothesis shows the importance of integrating PPI data with genetic disorder data.
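
    The transitive inference described in this abstract (PPI to protein, protein to encoding gene, gene to disorder) can be sketched as a chain of lookups. This is a simplified illustration, not the authors' method as implemented: the function name and all identifiers (P1, G1, D1 and so on) are hypothetical placeholders.

```python
def infer_ppi_disorders(ppis, protein_gene, gene_disorders):
    """Associate each PPI with the disorders linked to its interactors' genes."""
    inferred = {}
    for a, b in ppis:
        disorders = set()
        for protein in (a, b):
            gene = protein_gene.get(protein)  # protein -> encoding gene
            disorders |= gene_disorders.get(gene, set())  # gene -> disorders
        if disorders:
            inferred[(a, b)] = disorders
    return inferred

# Hypothetical toy data; none of these identifiers come from real databases.
ppis = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
protein_gene = {"P1": "G1", "P2": "G2", "P3": "G3", "P4": "G4"}
gene_disorders = {"G1": {"D1"}, "G3": {"D2"}}

associations = infer_ppi_disorders(ppis, protein_gene, gene_disorders)
```

    In this toy input, the interaction ("P4", "P5") receives no association because neither interactor maps to a gene with a known disorder.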

    GenoSurf: metadata driven semantic search system for integrated genomic datasets

    Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes with very limited capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suited available ontologies. The user of GenoSurf provides as input the search terms, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing resulting files are updated in real time while the search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata of several major valuable data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified with their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata
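
    One possible reading of a user-set "level of ontological enrichment" is sketched below: a search term is expanded to its ontology descendants up to a chosen depth before matching metadata values. This interpretation is an assumption, as are the function names, the toy ontology and the file records; the abstract does not describe how GenoSurf implements enrichment.

```python
from collections import deque

def expand_term(term, children, max_depth):
    """Return the term plus its descendants at most max_depth levels below it."""
    expanded = {term}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend past the requested level
        for child in children.get(current, ()):
            if child not in expanded:
                expanded.add(child)
                queue.append((child, depth + 1))
    return expanded

def search(files, attribute, term, children, max_depth=0):
    """Return ids of files whose attribute value matches the expanded term set."""
    wanted = expand_term(term, children, max_depth)
    return [f["id"] for f in files if f.get(attribute) in wanted]

# Hypothetical toy ontology (parent -> direct children) and metadata records.
children = {"blood cell": ["leukocyte"], "leukocyte": ["lymphocyte"]}
files = [
    {"id": "f1", "cell": "blood cell"},
    {"id": "f2", "cell": "leukocyte"},
    {"id": "f3", "cell": "lymphocyte"},
]

exact = search(files, "cell", "blood cell", children, max_depth=0)
one_level = search(files, "cell", "blood cell", children, max_depth=1)
two_levels = search(files, "cell", "blood cell", children, max_depth=2)
```

    Raising the depth widens the match set, so the same search term retrieves more files as more specific terms are included.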

    Genomic data integration and user-defined sample-set extraction for population variant analysis

    Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics