Integration and querying of genomic and proteomic semantic annotations for biomedical knowledge extraction
Understanding complex biological phenomena requires answering complex biomedical questions that involve multiple kinds of
biomolecular information simultaneously. This information is expressed through multiple genomic and proteomic semantic annotations
scattered across many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper biologists’ ability
to ask global queries and perform global evaluations. To overcome this problem, we developed a software architecture to
create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources
of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome
and OMIM). Our solution is general, as it uses a flexible, modular and multilevel global data schema based on abstraction and
generalization of integrated data features, and a set of automatic procedures that ease data integration and maintenance, even
when the integrated data sources evolve in content, structure and number. These procedures also ensure consistency,
quality and provenance tracking of all integrated data, and compute the semantic closure of the hierarchical relationships of the
integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows easy graphical
composition of queries, including complex ones, on the knowledge base; it also supports semantic query expansion and
comprehensive exploratory search of the integrated data to better support biomedical knowledge extraction.
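The semantic closure and query expansion mentioned in this abstract can be illustrated with a minimal sketch. It assumes the ontology hierarchy is an acyclic set of (child, parent) "is a" edges; the function names and the toy term labels are hypothetical and are not taken from the GPKB itself.

```python
from collections import defaultdict

# Toy "is a" edges (child, parent); the labels are illustrative, not real ontology IDs.
EDGES = [
    ("protein_kinase", "kinase"),
    ("kinase", "transferase"),
    ("transferase", "catalytic"),
]

def semantic_closure(is_a_edges):
    """Return {term: set of all its ancestors} for an acyclic list of (child, parent) edges."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)

    closure = {}

    def ancestors(term):
        # Memoized depth-first walk up the hierarchy.
        if term not in closure:
            result = set()
            for p in parents.get(term, ()):
                result.add(p)
                result |= ancestors(p)
            closure[term] = result
        return closure[term]

    for term in {t for edge in is_a_edges for t in edge}:
        ancestors(term)
    return closure

def expand_query(term, closure):
    """Semantic query expansion: the term itself plus every term that has it as an ancestor."""
    return {term} | {t for t, anc in closure.items() if term in anc}

CLOSURE = semantic_closure(EDGES)
```

With the closure precomputed, a search for a general term can also match annotations made with any of its more specific descendants, which is one plausible reading of "semantic query expansion" here.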
Visual composition of complex queries on an integrative genomic and proteomic data warehouse
Biomedical questions are usually complex and span several different aspects of the life sciences. Numerous valuable and heterogeneous data are increasingly available to answer such questions, yet they are stored in dispersed locations and are difficult to query comprehensively. We created a Genomic and Proteomic Data Warehouse (GPDW) that integrates data provided by some of the main bioinformatics databases. It adopts a modular integrated data schema and several metadata to describe the integrated data, their sources and their location in the GPDW. Here, we present the Web application that we developed to enable any user to easily compose queries, including complex ones, on all data integrated in the GPDW. It is publicly available at http://www.bioinformatics.dei.polimi.it/GPKB/. Through a visual interface, the user is only required to select the types of data to be included in the query and the conditions on the values to be retrieved. Then, the Web application leverages the metadata and modular schema of the GPDW to automatically compose an efficient SQL query, run it on the GPDW and show the requested data, enriched with links to external data sources. Tests demonstrated the efficiency and usability of the Web application, and showed the relevance of both the application and the GPDW in helping answer biomedical questions, including difficult ones.
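The idea of composing SQL automatically from schema metadata can be sketched as follows. This is only an illustration under stated assumptions: the table names, column names, `METADATA` and `LINKS` dictionaries, and the `compose_query` function are all hypothetical and do not reflect the actual GPDW schema or application code.

```python
# Hypothetical metadata: the table behind each data type and its queryable columns.
METADATA = {
    "gene": {"table": "gene", "columns": {"gene_id", "symbol"}},
    "protein": {"table": "protein", "columns": {"protein_id", "name"}},
}

# Hypothetical association tables: (left type, right type) -> (table, left key, right key).
LINKS = {
    ("gene", "protein"): ("gene2protein", "gene_id", "protein_id"),
}

def compose_query(data_types, conditions):
    """Build a parameterized SELECT over the chosen data types.

    data_types: ordered list of type names; each consecutive pair must appear in LINKS.
    conditions: list of (data_type, column, value) equality filters.
    Returns (sql_text, parameters).
    """
    parts = [f"SELECT * FROM {METADATA[data_types[0]]['table']}"]
    for left, right in zip(data_types, data_types[1:]):
        link, left_key, right_key = LINKS[(left, right)]
        lt, rt = METADATA[left]["table"], METADATA[right]["table"]
        parts.append(f"JOIN {link} ON {link}.{left_key} = {lt}.{left_key}")
        parts.append(f"JOIN {rt} ON {rt}.{right_key} = {link}.{right_key}")

    where, params = [], []
    for data_type, column, value in conditions:
        # Only columns declared in the metadata are accepted; values go in as parameters.
        if column not in METADATA[data_type]["columns"]:
            raise ValueError(f"unknown column {column!r} for {data_type!r}")
        where.append(f"{METADATA[data_type]['table']}.{column} = ?")
        params.append(value)
    if where:
        parts.append("WHERE " + " AND ".join(where))
    return " ".join(parts), params
```

A call such as `compose_query(["gene", "protein"], [("gene", "symbol", "TP53")])` returns the SQL text together with its parameter list, so the user only picks data types and conditions while the joins come from the metadata.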
Protein-protein interaction associated disorders revealed via data integration
Numerous protein-protein interaction (PPI) data are produced by new high-throughput experimental and computational techniques and are being collected in different databases. These data generally do not contain phenotypic, or even functional or structural, information about the interactors, which in many cases is available in other databases. Thus, to obtain broad coverage, it is necessary to combine data from different databases. For this purpose, we are developing a framework to create and maintain a data warehouse on the basis of a conceptual data model. We then applied an automatic association inference method based on the transitive closure concept. In particular, by leveraging IntAct and Mint PPI data, Entrez protein-encoding gene data and OMIM genetic disorder data, we inferred associations between proteins and genetic disorders and their phenotypes. In our data warehouse, 46,154 human PPIs regarding 12,178 distinct human proteins were integrated. These human proteins are encoded by 11,232 different human genes. By applying the transitive closure concept, we identified 1,130 gene networks and found 1,136 human PPIs associated with 628 genetic disorders. The interactions between proteins that the transitive closure method associates with a specific disease will help researchers focus on the protein interactions relevant to that disease. This may help reveal diseases caused by malfunctioning protein interactions; disease treatment strategies such as synthetic protein engineering could then possibly be applied. This hypothesis shows the importance of integrating PPI data with genetic disorder data.
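The association inference described in this abstract can be sketched as a composition of known links: protein to encoding gene, then gene to disorder, then onto the interactions. The sketch below assumes that a PPI is associated with a disorder when either interactor is; that rule, the function names, and the toy identifiers (P1, G1, D1, ...) are assumptions for illustration, not the authors' implementation or real database entries.

```python
# Toy inputs with made-up identifiers.
PROTEIN_TO_GENES = {"P1": {"G1"}, "P2": {"G2"}, "P3": {"G3"}}
GENE_TO_DISORDERS = {"G1": {"D1"}, "G2": {"D1", "D2"}}
PPIS = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]

def infer_protein_disorders(protein_to_genes, gene_to_disorders):
    """Compose protein->gene and gene->disorder links into protein->disorder links."""
    inferred = {}
    for protein, genes in protein_to_genes.items():
        disorders = set()
        for gene in genes:
            disorders |= gene_to_disorders.get(gene, set())
        if disorders:
            inferred[protein] = disorders
    return inferred

def disorders_for_interactions(ppis, protein_to_disorders):
    """Attach to each PPI the disorders inferred for either of its interactors."""
    result = {}
    for a, b in ppis:
        disorders = protein_to_disorders.get(a, set()) | protein_to_disorders.get(b, set())
        if disorders:
            result[(a, b)] = disorders
    return result

PROTEIN_DISORDERS = infer_protein_disorders(PROTEIN_TO_GENES, GENE_TO_DISORDERS)
PPI_DISORDERS = disorders_for_interactions(PPIS, PROTEIN_DISORDERS)
```

In this toy data, the interaction between P3 and P4 receives no disorder because neither protein reaches one through its gene, while the other two interactions inherit disorders through P1 and P2.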
GenoSurf: metadata driven semantic search system for integrated genomic datasets
Many valuable resources developed by world-wide research institutions and consortia describe genomic datasets that are both open and available for secondary research, but their metadata search interfaces are heterogeneous, not interoperable and sometimes with very limited capabilities. We implemented GenoSurf, a multi-ontology semantic search system providing access to a consolidated collection of metadata attributes found in the most relevant genomic datasets; values of 10 attributes are semantically enriched by making use of the most suited available ontologies. The user of GenoSurf provides as input the search terms, sets the desired level of ontological enrichment and obtains as output the identity of matching data files at the various sources. Search is facilitated by drop-down lists of matching values; aggregate counts describing resulting files are updated in real time while the search terms are progressively added. In addition to the consolidated attributes, users can perform keyword-based searches on the original (raw) metadata, which are also imported; GenoSurf supports the interplay of attribute-based and keyword-based search through well-defined interfaces. Currently, GenoSurf integrates about 40 million metadata of several major valuable data sources, including three providers of clinical and experimental data (TCGA, ENCODE and Roadmap Epigenomics) and two sources of annotation data (GENCODE and RefSeq); it can be used as a standalone resource for targeting the genomic datasets at their original sources (identified with their accession IDs and URLs), or as part of an integrated query answering system for performing complex queries over genomic regions and metadata
Genomic data integration and user-defined sample-set extraction for population variant analysis
Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics
Identification of gene annotations and interactions and protein-protein interaction associated disorders through data integration
