1,720,976 research outputs found

    PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets

    Full text link
    Background: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. Results: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability

    Analysis and Visualization of Mutation Enrichments for Selected Genomic Regions and Cancer Types

    No full text
    Several studies highlight the relevance of somatic mutations in non-coding regions of the genome which exhibit common interesting behaviors. MutViz is a tool for the identification of mutation enrichments on arbitrary sets of user-defined regions; for a variety of cancer types, it contains preloaded mutations from public datasets, well organized within an effective database organization. MutViz provides a user-friendly interface helping the user in providing sets of regions as input and in obtaining their fast exploration as output, together with simple statistical testing of novel hypotheses

    Exploring genomic datasets: From batch to interactive and back

    Full text link
    Genomic data management is focused on achieving high performance over big datasets using batch, cloud-based architectures; this enables the execution of massive pipelines, but hampers the capability of exploring the solution space when it is not well-defined, by choosing different experimental samples or query extraction parameters. We present PyGMQL, a Python-based interoperability software layer that enables testing of experimental pipelines; PyGMQL solves the impedance mismatch between a batch execution environment and the agile programming style of Python, and provides transparency of access when exploration requires integrating local and remote resources.Wrapping PyGMQLand Python primitives within Jupyter notebooks guarantees reproducibility of the pipeline when used in different contexts or by different scientists. The software is freely available at https://github.com/DEIB-GECO/PyGMQL

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis

    EpiSurf: Metadata-driven search server for analyzing amino acid changes within epitopes of SARS-CoV-2 and other viral species

    Full text link
    EpiSurf is a Web application for selecting viral populations of interest and then analyzing how their amino acid changes are distributed along epitopes. Viral sequences are searched within ViruSurf, which stores curated metadata and amino acid changes imported from the most widely used deposition sources for viral databases (GenBank, COVID-19 Genomics UK (COG-UK) and Global initiative on sharing all influenza data (GISAID)). Epitopes are searched within the open source Immune Epitope Database or directly proposed by users by indicating their start and stop positions in the context of a given viral protein. Amino acid changes of selected populations are joined with epitopes of interest; a result table summarizes, for each epitope, statistics about the overlapping amino acid changes and about the sequences carrying such alterations. The results may also be inspected by the VirusViz Web application; epitope regions are highlighted within the given viral protein, and changes can be comparatively inspected. For sequences mutated within the epitope, we also offer a complete view of the distribution of amino acid changes, optionally grouped by the location, collection date or lineage. Thanks to these functionalities, EpiSurf supports the user-friendly testing of epitope conservancy within selected populations of interest, which can be of utmost relevance for designing vaccines, drugs or serological assays. EpiSurf is available at two endpoints

    Investigating Deep Learning based Breast Cancer Subtyping using Pan-cancer and Multi-omic Data

    No full text
    Breast Cancer comprises multiple subtypes implicated in prognosis. Existing stratification methods rely on the expression quantification of small gene sets. Next Generation Sequencing promises large amounts of omic data in the next years. In this scenario, we explore the potential of machine learning and, particularly, deep learning for breast cancer subtyping. Due to the paucity of publicly available data, we leverage on pan-cancer and non-cancer data to design semi-supervised settings. We make use of multi-omic data, including microRNA expressions and copy number alterations, and we provide an in-depth investigation of several supervised and semi-supervised architectures Obtained accuracy results show simpler models to perform at least as well as the deep semi-supervised approaches on our task over gene expression data. When multi-omic data types are combined together, performance of deep models shows little (if any) improvement in accuracy, indicating the need for further analysis on larger datasets of multi-omic data as and when they become available. From a biological perspective, our linear model mostly confirms known gene-subtype annotations. Conversely, deep approaches model non-linear relationships, which is reflected in a more varied and still unexplored set of representative omic features that may prove useful for breast cancer subtyping
    corecore