1,721,127 research outputs found

Predicting phenotypic traits of prokaryotes from protein domain frequencies

Author: Notredame Cedric
Mühlhausen Stefanie
Lingner Thomas
Gabaldón Toni
Meinicke Peter
Publication date: 01/01/2010

Background: Establishing the relationship between an organism's genome sequence and its phenotype is a fundamental challenge that remains largely unsolved. Accurately predicting microbial phenotypes solely based on genomic features will allow us to infer relevant phenotypic characteristics when the availability of a genome sequence precedes experimental characterization, a scenario that is favored by the advent of novel high-throughput and single-cell sequencing techniques. Results: We present a novel approach to predict the phenotype of prokaryotes directly from their protein domain frequencies. Our discriminative machine learning approach provides high prediction accuracy of relevant phenotypes such as motility, oxygen requirement or spore formation. Moreover, the set of discriminative domains provides biological insight into the underlying phenotype-genotype relationship and enables deriving hypotheses on the possible functions of uncharacterized domains. Conclusions: Fast and accurate prediction of microbial phenotypes based on genomic protein domain content is feasible and has the potential to provide novel biological insights. First results of a systematic check for annotation errors indicate that our approach may also be applied to semi-automatic correction and completion of the existing phenotype annotation.

German Academic Exchange Service (DAAD)

Bath Research Portal

Crossref

Springer - Publisher Connector

GRO.publications

GRO.publications (Univ. Göttingen)

NGS applications in genome evolution and adaptation : A reproducible approach to NGS data analysis and integration

Author: Prieto Barja Pablo
Publication date: 12/01/2017

In this PhD I have used NGS technologies in different organisms and scenarios, such as in ENCODE, comparing the conservation and evolution of long non-coding RNA sequences between human and mouse using experimental evidence from genome, transcriptome and chromatin. A similar approach was followed in other organisms, such as the Mesoamerican common bean and chicken. Other analyses carried out with NGS data involved the well-known parasite Leishmania donovani, the causative agent of leishmaniasis. I used NGS data obtained from genome and transcriptome to study the fate of its genome in survival strategies for adaptation and long-term evolution. All this work was approached while working on tools and strategies to efficiently design and implement bioinformatics analyses, also known as pipelines or workflows, in order to make them easy to use, easily deployable, accessible and highly performant. This work has provided several strategies to avoid lack of reproducibility and inconsistency in scientific research, with real biological applications in sequence analysis and genome evolution.

Doctoral Programme in Biomedicine

UPF Digital Repository

Detecting and comparing non-coding RNAs

Author: Bussotti Giovanni
Publication date: 23/01/2013

In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes, and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences becomes extremely important. Aligning nucleotide sequences is one of the main requisites when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and, more generally, any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of proteins. Moreover, for many non-coding RNAs, evolution is likely to be constrained mostly at the structure level and not at the sequence level. This results in very poor sequence conservation, which impedes the comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. These are the issues I have tried to address in my PhD. I started by developing a novel algorithm able to reveal the homology relationship of distantly related ncRNA genes, and then applied the approach thus defined in combination with other sophisticated data mining tools to discover novel non-coding genes and generate genome-wide ncRNA predictions.

Doctoral Programme in Biomedicine

UPF Digital Repository

Alignment uncertainty, regressive alignment and large scale deployment

Author: Floden Evan
Publication date: 30/11/2018

A multiple sequence alignment (MSA) provides a description of the relationship between biological sequences where columns represent a shared ancestry through an implied set of evolutionary events. The majority of research in the field has focused on improving the accuracy of alignments within the progressive alignment framework and has allowed for powerful inferences including phylogenetic reconstruction, homology modelling and disease prediction. Notwithstanding this, when applied to modern genomics datasets - often comprising tens of thousands of sequences - new challenges arise in the construction of accurate MSAs. These issues can be generalised to form three basic problems. Foremost, as the number of sequences increases, progressive alignment methodologies exhibit a dramatic decrease in alignment accuracy. Additionally, for any given dataset many possible MSA solutions exist, a problem which is exacerbated with an increasing number of sequences due to alignment uncertainty. Finally, technical difficulties hamper the deployment of such genomic analysis workflows - especially in a reproducible manner - often presenting a high barrier for even skilled practitioners. This work aims to address this trifecta of problems through a web server for fast homology-extension-based MSA, two new methods for improved phylogenetic bootstrap supports incorporating alignment uncertainty, a novel alignment procedure that improves large-scale alignments, termed regressive MSA, and finally a workflow framework that enables the deployment of large-scale reproducible analyses across clusters and clouds, titled Nextflow. Together, this work can be seen to provide both conceptual and technical advances which deliver substantial improvements to existing MSA methods and the resulting inferences.

Doctoral Programme in Biomedicine

UPF Digital Repository

Large-scale comparative bioinformatics analyses

Author: Chatzou Maria
Publication date: 07/11/2016

One of the main and most recent challenges of modern biology is to keep up with the growing amount of biological data coming from next-generation sequencing technologies. Keeping up with the growing volumes of experiments will be the only way to make sense of the data and extract actionable biological insights. Large-scale comparative bioinformatics analyses are an integral part of this procedure. When doing comparative bioinformatics, multiple sequence alignments (MSAs) are by far the most widely used models, as they provide a unique insight into the accurate measure of sequence similarities and are therefore instrumental in revealing genetic and/or functional relationships among evolutionarily related species. Unfortunately, the well-established limitation of MSA methods when dealing with very large datasets potentially compromises all downstream analysis. In this thesis I expose the current relevance of multiple sequence aligners, and I show how their current scaling up is leading to serious numerical stability issues and how these impact phylogenetic tree reconstruction. For this purpose, I have developed two new methods: MEGA-Coffee, a large-scale aligner, and Shootstrap, a novel bootstrapping measure incorporating MSA instability with branch support estimates when computing trees. The large amount of computation required by these two projects was carried out using Nextflow, a new computational framework that I have developed to improve the computational efficiency and reproducibility of large-scale analyses like the ones carried out in the context of these studies.

Doctoral Programme in Biomedicine

UPF Digital Repository

Novel methods for multiple sequence alignment and evolutionary modeling

Author: Mansouri Leila
Publication date: 27/03/2023

The massive ongoing scale-up of genomics data production projects, such as the Earth BioGenome Project (Lewin et al., 2018), puts data analysis methods under unprecedented pressure. New approaches are needed to analyse all these sequences. The most commonly used modelling methods in biology are multiple sequence alignments (MSAs) and phylogenetic tree reconstruction. In this thesis, I have addressed these two topics from the angle of protein sequence analysis, with a specific interest in the relationship between structure-based and sequence-based analyses. The problem of scaling up is not only computational. Indeed, the scale-up of key methods such as MSA modelling does not merely need more computational resources; it also requires conceptual algorithmic improvements, since MSA accuracy decreases when dealing with more than 1000 sequences (Sievers et al., 2011). To address this issue, I helped in the development of a new MSA algorithm, named regressive (Garriga et al., 2021), featuring improved scaling-up capacities over its progressive counterparts in terms of computation and accuracy. Accurately aligning distantly related sequences will, however, remain a challenge, but this problem could be alleviated using protein structures, as it is well established that three-dimensional information is much more resilient than its sequence counterpart. The scarcity of experimental structural data has, so far, limited the practical relevance of this observation; however, the situation is rapidly changing. Thanks to the newly achieved improvement in protein structure prediction (Jumper et al., 2021), a massive amount of experimental-grade structural data is being generated. Over 200 million models are now available, and they may be used for the kind of analysis currently carried out on crystallographic data. Anticipating this, I have explored the possibility of using AlphaFold2 (AF2)-predicted structures to estimate structure-based MSAs (Baltzis, Mansouri et al., 2022). I have found that MSAs based on AF2 structural models display a highly significant improvement in accuracy over their sequence-based counterparts. Next, I addressed the problem of sequence analysis from a phylogenetic angle, initially with a focus on paralogous evolutionary scenarios, and subsequently evaluated the potential of structural data for the reconstruction of sequence evolution on arbitrarily related protein sequences. These analyses coincide in supporting the suitability of protein structure information for evolutionary analysis purposes.

Doctoral Programme in Biomedicine

UPF Digital Repository

Influence of alignment uncertainty on homology and phylogenetic modeling

Author: Chang Jia-Ming
Publication date: 25/07/2013

Most evolutionary analyses are based upon pre-estimated multiple sequence alignment models. From a computational point of view, it is as complex to estimate a correct alignment as it is to derive a correct tree from that alignment. Several works have recently reported on the influence of alignment on downstream analysis, and on the uncertainty inherent to its estimation. Chapter 1 develops the notion of alignment uncertainty as either inherent to the data (internal) or resulting from methodological biases (external). Chapter 2 presents two contributions of mine for the improvement of MSA methods through the use of homology extension (TM-Coffee) and thanks to an improved word-matching algorithm (SymAlign). In Chapter 3, I show how alignment uncertainty can be used to improve the trustworthiness of phylogenetic analysis. Chapter 4 shows how a similar improvement can be obtained through a simple adaptation of the T-Coffee transitive score, thus allowing downstream analysis to take into account internal alignment uncertainty. The final chapter contains a discussion of our current results and possible future work.

Doctoral Programme in Biomedicine

UPF Digital Repository

Impact of recent protein structure prediction methods on homology, evolutionary and functional inference

Author: Baltzis Athanasios
Publication date: 20/03/2023

Recent advances in deep learning techniques have revolutionised protein structure modelling. Since AlphaFold2's release, a set of tools has become available to predict native-like structures at near-experimental accuracy for a large fraction of the proteome. This massive amount of structural data is now powering every kind of biological inference requiring structural information. The work presented here features an exploration of the impact of experimental and predicted protein structural information on homology, evolutionary and functional inference. The first part addresses the issue of accurate multiple sequence alignment (MSA) computation through a novel large-scale algorithmic approach and the systematic use of predicted structural information. In the second part, I explored the contribution of MSAs and structural information to refining phylogenetic and functional inference. On top of developing generic structure-based phylogeny reconstruction methods, I used RBM10, a well-characterised splicing factor, as a showcase for the use of predicted structural information to support the inference of functional and phenotypic predictions, especially in the case of pathogenic mutations. The last part of this thesis presents a best-practice bioinformatics pipeline, nf-core/proteinfold, implemented using the Nextflow workflow management system and following nf-core guidelines. This pipeline was developed as a support for the rest of the projects in order to provide a solution to the need for high-throughput structure predictions.

Doctoral Programme in Biomedicine

UPF Digital Repository

Big behavioral data analysis : computational methods for the study of continuous recordings behavior

Author: Espinosa-Carrasco José
Publication date: 08/11/2016

New high-throughput behavioral systems enable the recording of continuous behavioral sequences with an unprecedented richness of signals and a deep temporal resolution. Automated systems offer neuroscience the opportunity to tackle in a new way the old question of how the brain orchestrates behavior and ultimately understand brain function itself; however, they accumulate large amounts of data, leading to what is being termed Big Behavioral Data. The manipulation, analysis and contextualization of these data to obtain useful biological insights is not a trivial problem. This thesis presents Pergola, a computational framework to comprehensively analyze spontaneous longitudinal behaviors. Pergola provides access to a large set of mature genomic bioinformatics tools for the analysis and visualization of continuous behavioral recordings. I also explored multidimensional analysis techniques to help reduce the huge spatio-temporal dimensionality derived from behavioral recordings, and the high variability associated with all behavioral paradigms. This problem is addressed by adapting Principal Component Analysis (PCA) for statistical inference on complex behaviors such as the recognition of learning strategies.

Doctoral Programme in Biomedicine

UPF Digital Repository

Workflow management applications for comparative omics

Author: Vignoli Alessio
Publication date: 03/12/2025

The explosive growth of biological data demands computational solutions that are scalable, reproducible, and robust. Workflow management systems, especially when combined with containerization, address these challenges by automating, parallelizing, and standardizing complex bioinformatics analyses. In this thesis I explore the applications of such systems in comparative omics, with a focus on the development and implementation of reusable pipelines within the nf-core community. Through the TANGO1 pilot study, which investigated the transmembrane region of the TANGO1 protein, several critical computational needs were identified and addressed via custom workflow solutions. These include REPORTHO and MULTIPLESEQUENCEALIGN for ortholog retrieval and alignment, PARALOGS for phylogenetic analysis of gene families, and STIMULUS for model selection in machine learning. Together, these projects illustrate how workflow managers empower biological research by enhancing reproducibility, efficiency, and data integration across diverse omics applications.

Universitat Pompeu Fabra. Doctorate in Biomedicine

UPF Digital Repository