1,721,019 research outputs found

    Exploring the Intersection of Multi-Omics and Machine Learning in Cancer Research

    Full text link
    Cancer biology and machine learning represent two seemingly disparate yet intrinsically linked fields of study. Cancer biology, with its complexities at the cellular and molecular levels, brings up a myriad of challenges. Of particular concern are the deviations in cell behaviour and rearrangements of genetic material that fuel transformation, growth, and spread of cancerous cells. Contemporary studies of cancer biology often utilise wide arrays of genomic data to pinpoint and exploit these abnormalities with an end-goal of translating them into functional therapies. Machine learning allows machines to make predictions based on the learnt data without explicit programming. It leverages patterns and inferences from large datasets, making it an invaluable tool in the modern era of large scale genomics. To this end, this doctoral thesis is underpinned by three themes: the application of machine learning, multi-omics, and cancer biology. It focuses on employment of machine learning algorithms to the tasks of cell annotation in single-cell RNA-seq datasets and drug response prediction in pre-clinical cancer models. In the first study, the author and colleagues developed a pipeline named Ikarus to differentiate between neoplastic and healthy cells within single-cell datasets, a task crucial for understanding the cellular landscape of tumours. Ikarus is designed to construct cancer cell-specific gene signatures from expert-annotated scRNA-seq datasets, score these genes, and distribute the scores to neighbouring cells via network propagation. This method successfully circumvents two common challenges in single-cell annotation: batch effects and unstable clustering. Furthermore, Ikarus utilises a multi-omic approach by incorporating CNVs inferred from scRNA-seq to enhance classification accuracy. The second study investigated how multi-omic analysis could enhance drug response prediction in pre-clinical cancer models. The research suggests that the typical practice of panel sequencing — a deep profiling of select, validated genomic features — is limited in its predictive power. However, incorporating transcriptomic features into the model significantly improves predictive ability across a variety of cancer models and is especially effective for drugs with collateral effects. This implies that the combined use of genomic and transcriptomic data has potential advantages in the pharmacogenomic arena. This dissertation recapitulates the findings of two aforementioned studies, which were published in Genome Biology and Cancers journals respectively. The two studies illustrate the application of machine learning techniques and multi-omic approaches to address conceptually distinct problems within the realm of cancer biology.Die Krebsbiologie und das maschinelle Lernen sind zwei scheinbar konträre, aber intrinsisch verbundene Forschungsbereiche. Insbesondere die Krebsbiologie ist auf zellul ̈arer und molekularer Ebene hoch komplex und stellt den Forschenden vor eine Vielzahl von Herausforderungen. Zu verstehen wie abweichendes Zellverhalten und die Umstrukturierung genetischer Komponente die Transformation, das Wachstum und die Ausbreitung von Krebszellen antreiben, ist hierbei eine besondere Herausforderung. Gleichzeitig bestrebt die Krebsbiologie diese Abnormalitäten zu nutzen zu machen, Wissen aus ihnen zu gewinnen und sie so in funktionale Therapien umzusetzen. Maschinelles Lernen ermöglicht es Vorhersagen auf der Grundlage von gelernten Daten ohne explizite Programmierung zu treffen. Es erkennt Muster in großen Datensätzen, erschließt sich so Erkenntnisse und ist deswegen ein unschätzbar wertvolles Werkzeug im modernen Zeitalter der Hochdurchsatz Genomforschung. Aus diesem Grund ist maschinelles Lernen eines der drei Haupthemen dieser Doktorarbeit, neben Multi-Omics und Krebsbiologie. Der Fokus liegt hierbei insbesondere auf dem Einsatz von maschinellen Lernalgorithmen zum Zweck der Zellannotation in Einzelzell RNA-Sequenzdatensätzen und der Vorhersage der Arzneimittelwirkung in präklinischen Krebsmodellen. In der ersten, hier präsentierten Studie, entwickelten der Autor und seine Kollegen eine Pipeline namens Ikarus. Diese kann zwischen neoplastischen und gesunden Zellen in Einzelzell-Datensätzen unterscheiden. Eine Aufgabe, die für das Verst ̈andnis der zellulären Landschaft von Tumoren entscheidend ist. Ikarus ist darauf ausgelegt, krebszellenspezifische Gensignaturen aus expertenanotierten scRNA-seq-Datensätzen zu konstruieren, diese Gene zu bewerten und die Bewertungen über Netzwerkverbreitung auf benachbarte Zellen zu verteilen. Diese Methode umgeht erfolgreich zwei häufige Herausforderungen bei der Einzelzellannotation: den Chargeneffekt und die instabile Clusterbildung. Darüber hinaus verwendet Ikarus, durch das Einbeziehen von scRNA-seq abgeleiteten CNVs, einen Multi-Omic-Ansatz der die Klassifikationsgenauigkeit verbessert. Die zweite Studie untersuchte, wie Multi-Omic-Analysen die Vorhersage der Arzneimittelwirkung in präklinischen Krebsmodellen optimieren können. Die Forschung legt nahe, dass die übliche Praxis des Panel Sequenzierens - die umfassende Profilierung ausgewählter, validierter genomischer Merkmale - in ihrer Vorhersagekraft begrenzt ist. Durch das Einbeziehen transkriptomischer Merkmale in das Modell konnte jedoch die Vorhersagefähigkeit bei verschiedenen Krebsmodellen signifikant verbessert werden, ins besondere für Arzneimittel mit Nebenwirkungen. Diese Dissertation fasst die Ergebnisse der beiden oben genannten Studien zusammen, die jeweils in Genome Biology und Cancers Journalen veröffentlicht wurden. Die beiden Studien veranschaulichen die Anwendung von maschinellem Lernen und Multi-Omic-Ansätzen zur Lösung konzeptionell unterschiedlicher Probleme im Bereich der Krebsbiologie

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed

    Computational mapping of regulatory domains of human genes

    Full text link
    Ljudski genom sadrži milijune regulatornih elemenata - enhancera - koji kvantitativno reguliraju ekspresiju gena. Unatoč ogromnom napretku u razumijevanju načina na koji enhanceri reguliraju ekspresiju gena, području još uvijek nedostaje pristup koji je sustavan, integrativan i dostupan za otkrivanje i dokumentiranje cis-regulatornih odnosa u cijelom genomu. Razvili smo novu računalnu metodu - reg2gene - koja modelira i integrira aktivnost enhancera~ekspresije gena. reg2gene sastoji se od tri glavna koraka: 1) kvantifikacija podataka, 2) modeliranje podataka i procjena značaja, i 3) integracija podataka prikupljenih u reg2gene R paketu. Kao rezultat toga, identificirali smo dva skupa enhancer-gen interakcija (EGA): fleksibilni skup od ~ 230K EGA (flexibleC) i strogi skup od ~ 60K EGA (stringentC). Utvrdili smo velike razlike u prethodno objavljenim računalnim modelima enhancer-gen interakcija; uglavnom u lokaciji, broju i svojstvima definiranih enhancera i EGA. Izveli smo detaljno mjerenje performansi sedam skupova računalno modeliranih EGA-a, ali smo pokazali da se niti jedan od trenutno dostupnih skupova referentnih podataka ne može koristiti kao referentni skup podataka "zlatnI standard". Definirali smo dodatni referentni skup pozitivnih i negativnih EGA -a pomoću kojih smo pokazali da stringentC ima najveću pozitivnu prediktivnu vrijednost (PPV). Pokazali smo potencijal EGA-a za identifikaciju genskih meta nekodirajucih SNP-ova. Proveli smo funkcionalnu analizu kako bismo otkrili nove genske mete, pleiotropiju enhancera i mehanizme aktivnosti enhancera. Ovaj rad poboljšava naše razumijevanje regulacije ekspresije gena posredovane enhancerima.Das menschliche Genom enthält Millionen von regulatorischen Elementen - Enhancern -, die die Genexpression quantitativ regulieren. Trotz des enormen Fortschritts beim Verständnis, wie Enhancer die Genexpression steuern, fehlt es in diesem Bereich immer noch an einem systematischen, integrativen und zugänglichen Ansatz zur Entdeckung und Dokumentation von cis-regulatorischen Beziehungen im gesamten Genom. Wir haben eine neuartige Methode - reg2gene - entwickelt, die Genexpression~Enhancer-Aktivität modelliert und integriert. reg2gene besteht aus drei Hauptschritten: 1) Datenquantifizierung, 2) Datenmodellierung und Signifikanzbewertung und 3) Datenintegration, die in dem R-Paket reg2gene zusammengefasst sind. Als Ergebnis haben wir zwei Sätze von Enhancer-Gen-Assoziationen (EGAs) identifiziert: den flexiblen Satz von ~230K EGAs (flexibleC) und den stringenten Satz von ~60K EGAs (stringentC). Wir haben große Unterschiede zwischen den bisher veröffentlichten Berechnungsmodellen für Enhancer-Gene-Assoziationen festgestellt, vor allem in Bezug auf die Lage, die Anzahl und die Eigenschaften der definierten Enhancer-Regionen und EGAs. Wir führten ein detailliertes Benchmarking von sieben Sets von rechnerisch modellierten EGAs durch, zeigten jedoch, dass keiner der derzeit verfügbaren Benchmark-Datensätze als "goldener Standard" verwendet werden kann. Wir definierten einen zusätzlichen Benchmark-Datensatz mit positiven und negativen EGAs, mit dem wir zeigten, dass das stringentC-Modell den höchsten positiven Vorhersagewert (PPV) hatte. Wir haben das Potenzial von EGAs zur Identifizierung von Genzielen von nicht-kodierenden SNP-Gene-Assoziationen nachgewiesen. Schließlich führten wir eine funktionelle Analyse durch, um neue Genziele, Enhancer-Pleiotropie und Mechanismen der Enhancer-Aktivität zu ermitteln. Insgesamt bringt diese Arbeit unser Verständnis der durch Enhancer vermittelten Regulierung der Genexpression in Gesundheit und Krankheit voran.Human genome contains millions of regulatory elements - enhancers - that quantitatively regulate gene expression. Multiple experimental and computational approaches were developed to associate enhancers with their gene targets. Despite the tremendous progress in understanding how enhancers tune gene expression, the field still lacks an approach that is systematic, integrative and accessible for discovering and documenting cis-regulatory relationships across the genome. We developed a novel computational approach - reg2gene- that models and integrates gene expression ~ enhancer activity. reg2gene consists of three main steps: 1) data quantification, 2) data modelling and significance assessment, and 3) data integration gathered in the reg2gene R package. As a result we identified two sets of enhancer-gene associations (EGAs): the flexible set of ~230K EGAs (flexibleC), and the stringent set of ~60K EGAs (stringentC). We identified major differences across previously published computational models of enhancer-gene associations; mostly in the location, number and properties of defined enhancer regions and EGAs. We performed detailed benchmarking of seven sets of computationally modelled EGAs, but showed that none of the currently available benchmark datasets could be used as a “golden-standard” benchmark dataset. To account for that observation, we defined an additional benchmark set of positive and negative EGAs with which we showed that the stringentC model had the highest positive predictive value (PPV) across all analyzed computational models. We reviewed the influence of EGA sets on the functional analysis of risk SNPs and demonstrated the potential of EGAs to identify gene targets of non-coding SNP-gene associations. Lastly, we performed a functional analysis to detect novel gene targets, enhancer pleiotropy, and mechanisms of enhancer activity. Altogether, this work advances our understanding of enhancer-mediated gene expression regulation in health and disease

    Integrative analysis of data from multiple experiments

    Full text link
    Auf die Entwicklung der Hochdurchsatz-Sequenzierung (HTS) folgte eine Reihe von speziellen Erweiterungen, die erlauben verschiedene zellbiologischer Aspekte wie Genexpression, DNA-Methylierung, etc. zu messen. Die Analyse dieser Daten erfordert die Entwicklung von Algorithmen, die einzelne Experimenteberücksichtigen oder mehrere Datenquellen gleichzeitig in betracht nehmen. Der letztere Ansatz bietet besondere Vorteile bei Analyse von einzelligen RNA-Sequenzierung (scRNA-seq) Experimenten welche von besonders hohem technischen Rauschen, etwa durch den Verlust an Molekülen durch die Behandlung geringer Ausgangsmengen, gekennzeichnet sind. Um diese experimentellen Defizite auszugleichen, habe ich eine Methode namens netSmooth entwickelt, welche die scRNA-seq-Daten entrascht und fehlende Werte mittels Netzwerkdiffusion über ein Gennetzwerk imputiert. Das Gennetzwerk reflektiert dabei erwartete Koexpressionsmuster von Genen. Unter Verwendung eines Gennetzwerks, das aus Protein-Protein-Interaktionen aufgebaut ist, zeige ich, dass netSmooth anderen hochmodernen scRNA-Seq-Imputationsmethoden bei der Identifizierung von Blutzelltypen in der Hämatopoese, zur Aufklärung von Zeitreihendaten unter Verwendung eines embryonalen Entwicklungsdatensatzes und für die Identifizierung von Tumoren der Herkunft für scRNA-Seq von Glioblastomen überlegen ist. netSmooth hat einen freien Parameter, die Diffusionsdistanz, welche durch datengesteuerte Metriken optimiert werden kann. So kann netSmooth auch dann eingesetzt werden, wenn der optimale Diffusionsabstand nicht explizit mit Hilfe von externen Referenzdaten optimiert werden kann. Eine integrierte Analyse ist auch relevant wenn multi-omics Daten von mehrerer Omics-Protokolle auf den gleichen biologischen Proben erhoben wurden. Hierbei erklärt jeder einzelne dieser Datensätze nur einen Teil des zellulären Systems, während die gemeinsame Analyse ein vollständigeres Bild ergibt. Ich entwickelte eine Methode namens maui, um eine latente Faktordarstellungen von multiomics Daten zu finden.The development of high throughput sequencing (HTS) was followed by a swarm of protocols utilizing HTS to measure different molecular aspects such as gene expression (transcriptome), DNA methylation (methylome) and more. This opened opportunities for developments of data analysis algorithms and procedures that consider data produced by different experiments. Considering data from seemingly unrelated experiments is particularly beneficial for Single cell RNA sequencing (scRNA-seq). scRNA-seq produces particularly noisy data, due to loss of nucleic acids when handling the small amounts in single cells, and various technical biases. To address these challenges, I developed a method called netSmooth, which de-noises and imputes scRNA-seq data by applying network diffusion over a gene network which encodes expectations of co-expression patterns. The gene network is constructed from other experimental data. Using a gene network constructed from protein-protein interactions, I show that netSmooth outperforms other state-of-the-art scRNA-seq imputation methods at the identification of blood cell types in hematopoiesis, as well as elucidation of time series data in an embryonic development dataset, and identification of tumor of origin for scRNA-seq of glioblastomas. netSmooth has a free parameter, the diffusion distance, which I show can be selected using data-driven metrics. Thus, netSmooth may be used even in cases when the diffusion distance cannot be optimized explicitly using ground-truth labels. Another task which requires in-tandem analysis of data from different experiments arises when different omics protocols are applied to the same biological samples. Analyzing such multiomics data in an integrated fashion, rather than each data type (RNA-seq, DNA-seq, etc.) on its own, is benefitial, as each omics experiment only elucidates part of an integrated cellular system. The simultaneous analysis may reveal a comprehensive view

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer that rather than express opinions propels discourses; an interviewer that is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent form the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.information science;Pearson correlation;cosine;similarity measure;author cocitation analysis

    Dispelling the Myths Behind First-author Citation Counts

    Full text link
    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods

    MIMIC-IV-Ext clinical decision support for referral, triage and diagnosis

    No full text
    Accurate medical decision-making is critical for both patients and clinicians. Patients often find it difficult to interpret their symptoms, determine their severity, and select the right specialist to see. At the same time, clinicians face challenges in integrating complex patient data to make timely, accurate diagnoses. Recent advancements in large language models (LLMs) offer the potential to bridge this gap by supporting decision-making for both patients and healthcare providers. To support the development of LLMs in healthcare, we have curated and extended the MIMIC-IV dataset. We extracted, processed, and extended the MIMIC-IV dataset, which resulted in 9,150 real-world medical cases with various pathologies. From this larger set 2,200 were further refined to meet the specific needs of our task. This extended dataset is designed to evaluate and improve LLMs' ability to assist with triage, specialist referral, and diagnosis, using critical patient information such as history of present illness, vitals signs and other relevant data. This allows for the development of models that can help both clinicians and patients in making better-informed decisions, ultimately improving healthcare delivery and patient outcomes

    Author Index

    No full text
    Nao informado
    corecore