Generalization of Probabilistic Latent Semantic Analysis to k-partite Graphs
Many datasets can be easily modelled as bipartite or k-partite graphs. Among the many computational analyses that can be run on such graphs, link prediction, i.e., the inference of novel links between nodes, is one of the most valuable and has many applications on real-world data. While many methods exist for this task on bipartite graphs, only a few algorithms are able to perform link prediction on k-partite graphs. Probabilistic Latent Semantic Analysis (PLSA) is an algorithm based on latent variables, named topics, designed to perform matrix factorisation. As such, it is straightforward to apply PLSA to the task of link prediction on bipartite graphs, simply by decomposing the association matrix. In this work we extend PLSA to k-partite graphs; in particular, we design an algorithm able to perform link prediction on k-partite graphs by exploiting the information in all the layers of the target graph. Our experiments confirm the capability of the proposed method to effectively perform link prediction on k-partite graphs.
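The bipartite case mentioned in the abstract above can be sketched as follows. This is a minimal illustration of standard symmetric PLSA fitted with EM on a small association matrix, not the k-partite extension proposed in the paper; the function names (plsa, link_scores), the toy matrix, and the smoothing constant are assumptions made for the example.

```python
import random


def plsa(counts, n_topics, n_iter=200, seed=0):
    """Fit P(u, v) = sum_z P(z) P(u|z) P(v|z) to a count matrix with EM."""
    rng = random.Random(seed)
    n_u, n_v = len(counts), len(counts[0])

    def normalized(values):
        total = sum(values)
        return [x / total for x in values]

    # Random positive initialisation breaks the symmetry between topics.
    p_z = [1.0 / n_topics] * n_topics
    p_u_z = [normalized([rng.random() + 0.1 for _ in range(n_u)])
             for _ in range(n_topics)]
    p_v_z = [normalized([rng.random() + 0.1 for _ in range(n_v)])
             for _ in range(n_topics)]

    eps = 1e-12  # hypothetical smoothing constant, avoids exact zeros
    for _ in range(n_iter):
        new_z = [0.0] * n_topics
        new_u = [[0.0] * n_u for _ in range(n_topics)]
        new_v = [[0.0] * n_v for _ in range(n_topics)]
        for u in range(n_u):
            for v in range(n_v):
                if counts[u][v] == 0:
                    continue
                # E-step: posterior over topics for the observed link (u, v).
                joint = [p_z[z] * p_u_z[z][u] * p_v_z[z][v]
                         for z in range(n_topics)]
                norm = sum(joint)
                for z in range(n_topics):
                    w = counts[u][v] * joint[z] / norm
                    new_z[z] += w
                    new_u[z][u] += w
                    new_v[z][v] += w
        # M-step: re-estimate the three distributions from expected counts.
        p_z = normalized(new_z)
        p_u_z = [normalized([x + eps for x in row]) for row in new_u]
        p_v_z = [normalized([x + eps for x in row]) for row in new_v]
    return p_z, p_u_z, p_v_z


def link_scores(p_z, p_u_z, p_v_z):
    """Reconstructed joint probability of every (u, v) pair."""
    topics = range(len(p_z))
    return [[sum(p_z[z] * p_u_z[z][u] * p_v_z[z][v] for z in topics)
             for v in range(len(p_v_z[0]))]
            for u in range(len(p_u_z[0]))]


if __name__ == "__main__":
    # Toy association matrix with two blocks; the link (1, 2) is missing.
    toy = [[1, 1, 1, 0, 0, 0],
           [1, 1, 0, 0, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [0, 0, 0, 1, 1, 1],
           [0, 0, 0, 1, 1, 1]]
    scores = link_scores(*plsa(toy, n_topics=2))
    print(scores[1][2], scores[1][3])
```

Unobserved pairs with a high reconstructed score are the candidate links; in the toy matrix the missing within-block pair (1, 2) should score well above cross-block pairs such as (1, 3).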
Data Science for Genomic Data Management: Challenges, Resources, Experiences
We highlight several challenges faced by data scientists who use public datasets for solving biological and clinical problems. In spite of the large efforts in building such public datasets, they are dispersed over many sources, heterogeneous in their formats and sequencing/calling techniques, and often meet highly variable quality standards. Moreover, for most research questions, scientists hardly find datasets with enough samples for building and training machine learning models. Data scarcity depends on the complexity of the genomic domain, with its many facets, as well as on the lack of organic initiatives to provide standardization across communities. In this paper, we discuss our approach to genomic data management, which can strongly mitigate the problems of data dispersion and format heterogeneity through high-level abstractions for genomics. We briefly present the computational resources that were recently developed by the GeCo project (ERC Advanced Grant); they include GDM, a Genomic Data Model providing interoperability across data formats; GMQL, a genometric query language for answering data science queries over genomic datasets; and an in-house integrated repository providing attribute-based and keyword-based search over normalized metadata from several open data repositories. We describe these resources at work on a specific research question, and we highlight how we managed to produce a model for addressing that question by overcoming the lack of sufficient samples and labelled datasets.
Advancing healthcare through data: the BETTER project's vision for distributed analytics
Introduction: Data-driven medicine is essential for enhancing the accessibility and quality of the healthcare system. The availability of data plays a crucial role in achieving this goal.
Methods: We propose implementing a robust data infrastructure of FAIRification and data fusion for clinical, genomic, and imaging data. This will be embedded within the framework of a distributed analytics platform for healthcare data analysis, utilizing the Personal Health Train paradigm.
Results: This infrastructure will ensure the findability, accessibility, interoperability, and reusability of data, metadata, and results among multiple medical centers participating in the BETTER Horizon Europe project. The project focuses on studying rare diseases, such as intellectual disability and inherited retinal dystrophies.
Conclusion: The anticipated impacts will benefit a wide range of healthcare practitioners and potentially influence health policymakers.
Systematic analysis of SARS-CoV-2 Omicron subvariants’ impact on B and T cell epitopes
Introduction: Epitopes are specific structures in antigens that are recognized by the immune system. They are widely used in immunology-related applications, such as vaccine development, drug design, and the diagnosis, treatment, and prevention of disease. The SARS-CoV-2 virus has been the main point of interest within the viral and genomic surveillance community in the last four years. Its ability to mutate and acquire new characteristics while it reorganizes into new variants has been analyzed from many perspectives. The importance of understanding how epitopes are impacted by mutations that accumulate at the protein level cannot be overstated.
Methods: With a focus on Omicron-named SARS-CoV-2 lineages, including the latest WHO-designated Variants of Interest, we propose a data retrieval, integration, and analysis workflow for conducting a database-wide study of the impact of lineages' characterizing mutations on all T cell and B cell linear epitopes collected in the Immune Epitope Database (IEDB) for SARS-CoV-2.
Results: Our workflow allows us to showcase novel qualitative and quantitative results on 1) coverage of viral proteins by deposited epitopes; 2) distribution of epitopes that are mutated across Omicron variants; 3) distribution of Omicron characterizing mutations across epitopes. Results are discussed based on the type of epitope, the response frequency of the assays, and the sample size. Our proposed workflow can be reproduced at any point in time, given updated variant characterizations and epitopes from IEDB, thereby providing an on-demand quantitative landscape of the mutations' impact.
Conclusion: A big data-driven analysis such as the one provided here can inform the next genomic surveillance policies in combatting SARS-CoV-2 and future epidemic viruses.
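The core step of such a study, finding which linear epitopes contain a lineage's characterizing mutations, might look roughly like the sketch below. It is an illustration only: the function name, the 1-based inclusive coordinates on a single protein, the restriction to amino-acid substitutions, and all identifiers and positions in the example are hypothetical and are not taken from IEDB or from the paper's pipeline.

```python
import re

# Amino-acid substitution written as <reference><position><alternative>.
SUBSTITUTION = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")


def mutated_epitopes(epitopes, mutations):
    """Map each epitope id to the substitutions falling inside its span.

    epitopes:  dict of epitope id -> (start, end), 1-based and inclusive,
               all on the same protein (an assumption of this sketch).
    mutations: list of substitution strings on that protein.
    """
    hits = {}
    for epitope_id, (start, end) in epitopes.items():
        inside = []
        for mutation in mutations:
            match = SUBSTITUTION.match(mutation)
            if match is None:
                continue  # deletions and insertions are ignored here
            if start <= int(match.group(2)) <= end:
                inside.append(mutation)
        if inside:
            hits[epitope_id] = inside
    return hits


if __name__ == "__main__":
    # Hypothetical epitope spans and substitutions, for illustration only.
    toy_epitopes = {"epi_A": (10, 20), "epi_B": (30, 40), "epi_C": (60, 70)}
    toy_mutations = ["K15N", "G35R", "T50I"]
    print(mutated_epitopes(toy_epitopes, toy_mutations))
```

Counting the keys of the result gives the number of mutated epitopes per lineage, and inverting the mapping gives the distribution of mutations across epitopes.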
Metadata management for scientific databases
Most scientific databases consist of datasets (or sources) which in turn include samples (or files) with an identical structure (or schema). In many cases, samples are associated with rich metadata describing the process that leads to building them (e.g., the experimental conditions used during sample generation). Metadata are typically used in scientific computations just for the initial data selection; at most, metadata about query results are recovered after executing the query and associated with its results by post-processing. In this way, a large body of information that could be relevant for interpreting query results goes unused during query processing. In this paper, we present ScQL, a new algebraic relational language whose operations apply to objects consisting of data–metadata pairs, preserving this one-to-one correspondence throughout the computation. We formally define each operation and we describe an optimization, called meta-first, that may significantly reduce the query processing overhead by anticipating the use of metadata, so that only those input samples that contribute to the result samples are loaded into the execution environment. In ScQL, metadata have the same relevance as data and contribute to building query results; in this way, the resulting samples are systematically associated with metadata about either the specific input samples involved or the query processing, thereby yielding a new form of metadata provenance. We present many examples of the use of ScQL, relative to several application domains, and we demonstrate the effectiveness of the meta-first optimization.
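The idea behind the meta-first optimization can be illustrated with a small sketch: metadata are evaluated before any data are read, so samples that cannot contribute to the result are never loaded. This is not ScQL syntax or its implementation; the function name, the representation of a sample as a (metadata, loader) pair, and the example attributes are assumptions made for illustration.

```python
def meta_first_select(samples, meta_predicate, region_predicate):
    """Select regions from the samples whose metadata satisfy a predicate.

    samples: iterable of (metadata, load_regions) pairs, where metadata is
             a dict and load_regions is a zero-argument callable that reads
             the (potentially large) data of the sample on demand.
    Returns a list of (metadata, regions) pairs, so every result sample
    stays paired with the metadata of the input sample it came from.
    """
    result = []
    for metadata, load_regions in samples:
        if not meta_predicate(metadata):
            continue  # the data of this sample are never loaded
        regions = [r for r in load_regions() if region_predicate(r)]
        result.append((metadata, regions))
    return result


if __name__ == "__main__":
    loaded = []  # records which samples were actually read

    def loader(name, regions):
        def load():
            loaded.append(name)
            return regions
        return load

    # Hypothetical samples: regions are (chromosome, start, stop) tuples.
    samples = [
        ({"id": "s1", "assay": "ChIP-seq"},
         loader("s1", [("chr1", 10, 50), ("chr2", 5, 9)])),
        ({"id": "s2", "assay": "RNA-seq"},
         loader("s2", [("chr1", 70, 90)])),
    ]
    out = meta_first_select(samples,
                            lambda m: m["assay"] == "ChIP-seq",
                            lambda r: r[0] == "chr1")
    print(out, loaded)
```

Deferring the load behind a callable is the design choice that matters here: the metadata predicate acts as a cheap filter, and only the surviving samples pay the cost of reading their data.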
Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying
While a huge amount of (epi)genomic data of multiple types is becoming available through Next Generation Sequencing (NGS) technologies, the most important emerging problem is the so-called tertiary analysis, concerned with sense making, e.g., discovering how different (epi)genomic regions and their products interact and cooperate with each other. We propose a paradigm shift in tertiary analysis, based on the use of the Genomic Data Model (GDM), a simple data model which links genomic feature data to their associated experimental, biological and clinical metadata. GDM encompasses all the data formats which have been produced for feature extraction from (epi)genomic datasets. We specifically describe the mapping to GDM of SAM (Sequence Alignment/Map), VCF (Variant Call Format), NARROWPEAK (for called peaks produced by NGS ChIP-seq or DNase-seq methods), and BED (Browser Extensible Data) formats, but GDM also supports all the formats describing experimental datasets (e.g., including copy number variations, DNA somatic mutations, or gene expressions) and annotations (e.g., regarding transcription start sites, genes, enhancers or CpG islands). We downloaded and integrated samples of all the above-mentioned data types and formats from multiple sources. The GDM is able to homogeneously describe semantically heterogeneous data and lays the groundwork for data interoperability, e.g., as achieved through the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data. The combined use of the data model and the query language allows comprehensive processing of multiple heterogeneous data, and supports the development of domain-specific data-driven computations and bio-molecular knowledge discovery.
Framework for Supporting Genomic Operations
Next Generation Sequencing (NGS) is a family of technologies for reading DNA or RNA, capable of producing whole genome sequences at an impressive speed, and causing a revolution in both biological research and medical practice. In this exciting scenario, while a huge number of specialized bioinformatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information, providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific operations of GMQL that act on genomic regions, and we describe their efficient implementation; the paper develops a theory of binning strategies as a generic approach to parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud.
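The general idea of binning for parallel region operations can be sketched with an overlap join. The example below is a simplified, single-process illustration and does not reproduce the binning strategies or the Flink and Spark implementations described in the paper; the function names, the fixed bin width, and the half-open (chromosome, start, stop) representation of regions are assumptions.

```python
from collections import defaultdict


def bins_of(start, stop, bin_size):
    """Indices of the fixed-width bins touched by the interval [start, stop)."""
    return range(start // bin_size, (stop - 1) // bin_size + 1)


def binned_overlap_join(left, right, bin_size):
    """Return all pairs (l, r) of overlapping regions on the same chromosome.

    Regions are (chromosome, start, stop) tuples with start < stop. Each
    region is replicated into every bin it touches; bins are then joined
    independently, which is what would let a framework process them in
    parallel.
    """
    buckets = defaultdict(lambda: ([], []))
    for side, regions in enumerate((left, right)):
        for region in regions:
            chrom, start, stop = region
            for b in bins_of(start, stop, bin_size):
                buckets[(chrom, b)][side].append(region)

    result = []
    for (chrom, b), (left_regions, right_regions) in buckets.items():
        for l in left_regions:
            for r in right_regions:
                overlap_start = max(l[1], r[1])
                overlaps = overlap_start < min(l[2], r[2])
                # Emit the pair only in the bin holding the start of the
                # overlap, so pairs spanning several bins appear once.
                if overlaps and overlap_start // bin_size == b:
                    result.append((l, r))
    return result


if __name__ == "__main__":
    left = [("chr1", 0, 250), ("chr1", 900, 1000), ("chr2", 10, 20)]
    right = [("chr1", 100, 400), ("chr1", 1000, 1100), ("chr2", 15, 30)]
    for pair in sorted(binned_overlap_join(left, right, bin_size=100)):
        print(pair)
```

The bin width trades replication against work per bin: smaller bins replicate long regions into more buckets, while larger bins leave more candidate pairs to compare inside each bucket.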
A review on viral data sources and search systems for perspective mitigation of COVID-19
With the outbreak of the COVID-19 disease, the research community is producing unprecedented efforts dedicated to better understanding and mitigating the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of the COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.
Cross-organism learning method to discover new gene functionalities
BACKGROUND:
Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In recent years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors, and only some of them represent highly reliable human-curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount.
METHODS:
Here, we propose a novel cross-organism learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict unknown gene annotations with good accuracy. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organism learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism.
RESULTS:
We tested and compared our method with the equivalent single-organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted.
