Predictive power of epigenetic age – opportunities and cautions
The advent of epigenetic age estimation through DNA methylation analysis has transformed our understanding of biological aging, offering a more refined perspective than traditional chronological measures. Current research in DNA methylation primarily focuses on developing epigenetic
clocks, which estimate biological age based on DNA methylation
patterns (a.k.a. methylage). Discrepancies between chronological and biological age, known as age acceleration, have been identified as early indicators of diseases such as cancer and neurodegenerative disorders [1]. Once properly estimated, age acceleration has the potential to serve as
a biomarker for risk factors in many common diseases. However, the precise determination of biological age and age acceleration remains a significant challenge in this field due to both technical limitations and variability in methylation patterns across populations.
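As a purely illustrative sketch of the notion of age acceleration introduced above, one common way to quantify it is as the residual left when epigenetic age is regressed on chronological age. The snippet below assumes this residual-based definition. The function names and the toy ages are hypothetical and do not come from any specific clock.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def age_acceleration(chronological, epigenetic):
    """Residuals of epigenetic age regressed on chronological age.

    Positive values suggest acceleration, negative values deceleration.
    """
    intercept, slope = fit_line(chronological, epigenetic)
    return [e - (intercept + slope * c) for c, e in zip(chronological, epigenetic)]


# Toy data (hypothetical): ages in years for five individuals.
chronological = [30.0, 40.0, 50.0, 60.0, 70.0]
epigenetic = [31.0, 39.0, 55.0, 59.0, 71.0]
residuals = age_acceleration(chronological, epigenetic)
```

With these toy values the third individual shows a residual of about +4 years, while the residuals sum to zero by construction, which is why acceleration defined in this way is always relative to the cohort used for the fit.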
To date, simple models of epigenetic age provide valuable insights but often lack the reliability needed for clinical applications. Indeed, the literature reports that highly accurate epigenetic clocks (i.e., those able to properly recapitulate chronological age) typically fail to detect significant
age accelerations [2]. This suggests that traditional epigenetic clocks tend to capture broad trends while missing critical details essential for translational medicine. More recently, these limitations have been addressed by shifting the focus from accurate biological/chronological age prediction
to the ambitious goal of predicting mortality risk and healthspan. This second generation of DNA methylation-based epigenetic clocks incorporates additional lifestyle-associated indicators, such as smoking pack-years, and proves to be more sensitive to age acceleration than traditional biological age predictions [3,4]. Although these newer models are still often based on linear frameworks, the additional covariates help them identify factors impacting age acceleration that may be obscured by linear models relying solely on DNA methylation data. However, this increased sensitivity to age acceleration
comes at the cost of reduced accuracy in predicting chronological age, as well as diminished reliability, since lifestyle metadata is often incomplete or unavailable.
Upon closer inspection, these performances are not entirely surprising, as all such clocks are based on linear models, which are limited in their ability to capture complex relationships within the data.
Interestingly, recent work [5] has shown how the non-linear epigenetic pacemaker (EPM) clock is able to identify significant associations between polybrominated biphenyl (PBB) exposure and accelerated epigenetic aging, a validated association completely overlooked by linear clocks. This suggests that non-linear models may be more appropriate to capture the complexity of the relation between aging and methylation. Building further on these observations, we believe that exploiting novel non-linear regression approaches, such as generative artificial intelligence (AI) models, along with the integration of lifestyle-related features, holds significant promise for enhancing the predictive power of DNA methylation-based models. These models can capture complex, non-linear relationships within the data that traditional linear models may overlook and have the potential to identify subtle patterns and interactions between DNA methylation and various lifestyle factors, leading to more accurate predictions of biological age and age acceleration. Incorporating lifestyle-related features, such as diet, physical activity, smoking habits and environmental exposures, can provide a more holistic view of the factors influencing epigenetic aging. By integrating
these variables into AI models, researchers can develop more sophisticated clocks that not only predict biological age but also offer insights into specific lifestyle modifications that could mitigate age acceleration and reduce disease.
Moreover, AI-driven models can continuously learn and improve as more data become available. This adaptive learning capability is crucial for personalized medicine, as it allows the models to stay up-to-date with the latest scientific findings and population-specific trends. For instance, AI models can be trained on diverse datasets from different demographic
groups, ensuring that the predictions are relevant and accurate
across various populations.
To date, the main limitations in this direction are twofold: the opacity of deep-learning algorithms (requiring explainable AI to be developed, as in a recent immunological clock [6]) and the lack of sufficiently large and diverse datasets with associated lifestyle metadata to robustly train AI-based models.
High-quality, longitudinal datasets that track individuals’
DNA methylation patterns and lifestyle factors over time are
essential for developing and validating these advanced models
[7]. Further novel approaches like generative models and
transfer learning [8] can be leveraged in this research area.
Generative models, by learning features from existing populations,
can simulate new data that mimic the characteristics of
the original population: should this reference population be
small, generative models offer a means to enlarge the representative
dataset. Transfer learning, on the other hand, allows AI
models to leverage data from populations or systems that are
similar to the one of interest (proxy), but much more abundant,
thus, again, increasing the available dataset. Generally,
however, data from proxy systems are used during the initial
training phase. In this way, the limited amount of data from
the real system can be used and is generally sufficient to fine-
tune the algorithm, thereby improving performance despite
constraints on data diversity. In any case, at the moment, collaborative
efforts between research institutions, healthcare providers,
and public health organizations are needed to gather
and share multi-longitudinal data to build proper training
datasets. Addressing these challenges will be critical for translating
cutting-edge research into practical tools for personalized
medicine, with the ambition to improve health outcomes
and extend healthspan.
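The transfer-learning workflow described above (initial training on abundant proxy data, followed by fine-tuning on scarce data from the system of interest) can be sketched in a deliberately minimal form. The example below assumes a one-feature linear model and plain gradient descent rather than a deep or generative model. All data, names and hyperparameters are hypothetical.

```python
from statistics import mean


def fit_ols(x, y):
    """Closed-form least squares for y = w0 + w1 * x (pre-training step)."""
    mx, my = mean(x), mean(y)
    w1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - w1 * mx, w1


def mse(w, x, y):
    """Mean squared error of the model w = (w0, w1) on the data (x, y)."""
    return mean((w[0] + w[1] * a - b) ** 2 for a, b in zip(x, y))


def fine_tune(w, x, y, lr=0.1, steps=200):
    """Gradient-descent steps on the small target set, starting from w."""
    w0, w1 = w
    for _ in range(steps):
        errors = [w0 + w1 * a - b for a, b in zip(x, y)]
        g0 = 2 * mean(errors)
        g1 = 2 * mean(e * a for e, a in zip(errors, x))
        w0, w1 = w0 - lr * g0, w1 - lr * g1
    return w0, w1


# Abundant proxy data (hypothetical): age = 100 * methylation level.
proxy_x = [i / 100 for i in range(10, 91)]
proxy_y = [100 * b for b in proxy_x]
# Scarce target data (hypothetical): same trend, shifted by 5 years.
target_x = [0.2, 0.4, 0.6]
target_y = [25.0, 45.0, 65.0]

pretrained = fit_ols(proxy_x, proxy_y)
tuned = fine_tune(pretrained, target_x, target_y)
error_before = mse(pretrained, target_x, target_y)
error_after = mse(tuned, target_x, target_y)
```

In this toy setting the pre-trained weights already capture the shared trend, so the three target samples are only needed to correct the offset. This is the sense in which a small dataset can be sufficient for fine-tuning.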
Beyond the limitations observed above and discussed at
large in the literature [9], other applications deserve
discussion.
In addition to prevention, in fact, methylage could be
tested to assess the effectiveness of therapies. In particular,
physical therapies exploit the ability of our cells to transduce
(mechanic, optic, magnetic, electric) signals (or combinations
thereof) into biochemical signals and effectors. For
reasons that are not fully elucidated, the effort devoted to
obtaining evidence on the effectiveness of such therapies is
quite different from that in the vast majority of pharmacological
research [10], despite the recent birth of disciplines like
bioelectronic medicine [11] or mechanopharmacology [12].
Physical therapies are known to interfere with wound healing
[13] and epithelial mesenchymal transition, where
numerous modifications occur, including systemic methylation
[14]. Yet, to the best of our knowledge, although
exploration of the effect of exercise has been pursued
[15], the usage of this approach to assess reproducibility
and dosage of physical therapies (beyond lifestyle habits,
hence including physical exercise) is only in its infancy.
Physical therapies lack systematic exploration, and
exploiting cutting-edge approaches like the assessment of
methylage could help document both evidence and
limitations, and provide a means to quantify stimuli and
dose–response relationships.
Finally, it is also worth noting that the choices on the
computational aspects of epigenetic age are likely to have
an impact on other research and application areas, beyond
the clinic. In particular, given the interest in the correlation of
methylomic changes with a variety of factors, the relevance of
methylage should be promoted (and warned against)
accordingly.
Indeed, blooming correlations are being drawn in a great
variety of research areas. For instance, the importance of
methylation changes under psychological [16], socioeconomic
[17] and environmental stress [18] is well recognized in the literature,
but not always accompanied by information on chronological
age as a confounding factor [19].
Indeed, multi-modal analyses are likely to be an area
that deserves close attention in upcoming
research on methylomics. In fact, given the political potential
behind these types of analyses, robustness of the biological
datum first, and strong interdisciplinary interpretation then, are
crucial, to ensure evidence-based policies are designed [20].
Methylage is therefore a research topic with ample room to
continue challenging scientists across disciplines in search of
optimal solutions, a worthwhile quest given the socioeconomic
and specifically clinical potential behind its application.
Editorial: Computational Methods for Analysis of DNA Methylation Data
DNA methylation is among the most studied epigenetic modifications in eukaryotes. The interest in DNA methylation stems from its role in development, as well as its well-established association with phenotypic changes. In particular, there is strong evidence that methylation pattern alterations in mammals are linked to developmental disorders and cancer (Kulis and Esteller, 2010). Owing to its potential as a prognostic marker for preventive medicine, in recent years, the analysis of DNA methylation data has garnered interest in many different contexts of computational biology (Bock, 2012). As typically happens with omic data, processing, analyzing and interpreting large-scale DNA methylation datasets requires computational methods and software tools that address multiple challenges. In the present Research Topic, we collected papers that tackle different aspects of computational approaches for the analysis of DNA methylation data. Some of these manuscripts address novel computational solutions for copy number variation detection, cell-type deconvolution and methylation pattern imputation, while others discuss interpretations of well-established computational techniques.
Over the last 10 years, DNA methylation profiles have been successfully exploited to develop biomarkers of age, also referred to as epigenetic clocks (Bell et al., 2019). Epigenetic clocks accurately estimate both chronological and biological age from methylation levels. DNA methylation age and, most importantly, its deviation from chronological age have been shown to be associated with a variety of health issues. More recently, a second generation of epigenetic clocks has emerged. The new generation of clocks incorporates not only methylation profiles but also environmental variants, such as smoking and alcohol consumption, and they outperform the first generation in mortality prediction and prognosis of certain diseases. In our collection, the review by Chen et al. compares the first and second generation of epigenetic clocks that predict cancer risk and discusses pathways known to exhibit altered methylation in aging tissues and cancer.
Differentially methylated regions (DMRs), that is genomic regions that show significant differences in methylation levels across distinct biological and/or medical conditions (e.g., normal vs. disease), have been reported to be implicated in a variety of disorders (Rakyan et al., 2011). As a result, identifying DMRs is one of the most critical and fundamental challenges in deciphering disease mechanisms at the molecular level. Although DNA methylation patterns remain stable during normal somatic cell growth, alterations in genomic methylation may be caused by genetic alterations, or vice versa. However, standard DMR analysis often ignores whether methylation alterations should be viewed as a cause or an effect. Rhamani et al. discuss the effect of model directionality, i.e. whether the condition of interest (phenotype) may be affected by methylation or whether it may affect methylation, in differential methylation analyses at the cell-type level. They show that correctly accounting for model directionality has a significant impact on the ability to identify cell type specific differential methylation.
Different cell types exhibit DMRs at many genomic regions and such rich information can be exploited to infer underlying cell type proportions using deconvolution techniques. DNA methylation-based cell mixture deconvolution approaches can be classified into two main categories: reference-based and reference-free. While the latter are more broadly applicable, as they do not rely on the availability of methylation profiles from each of the purified cell types that compose a tissue of interest, they are also less precise. Reference-based approaches use DMRs specific to cell types (reference library) to determine the underlying cellular composition within a DNA methylation sample. The quality of the reference library has a big impact on the accuracy of reference-based approaches. Bell-Glenn et al. present RESET, a framework for reference library selection for deconvolution algorithms exploiting a modified version of the Dispersion Separability Criteria score, for the inference of the best DMRs composing the library, contributing to de facto standards (Koestler et al., 2016). In short, RESET does not require researchers to identify a priori the size of the reference library (number of DMRs), nor to rely on costly associated purified cells’ mDNA profiles.
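To make the idea of reference-based deconvolution concrete, the sketch below estimates the share of one cell type in a mixture of exactly two cell types by least squares, given reference methylation profiles at a few marker CpGs. This is a simplified illustration under stated assumptions (two cell types, noise-free toy values, hypothetical function and variable names). It is not the RESET framework, nor any published deconvolution algorithm.

```python
def estimate_proportion(mixture, ref_a, ref_b):
    """Least-squares share of cell type A in a two-cell-type mixture.

    Models each CpG as mixture = p * ref_a + (1 - p) * ref_b and
    clips the solution to the valid range [0, 1].
    """
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    numerator = sum((m - b) * d for m, b, d in zip(mixture, ref_b, diff))
    denominator = sum(d * d for d in diff)
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical reference library: methylation levels at four marker CpGs.
ref_a = [0.9, 0.1, 0.8, 0.2]
ref_b = [0.1, 0.9, 0.3, 0.7]

# Simulated bulk sample made of 70% cell type A and 30% cell type B.
true_share = 0.7
mixture = [true_share * a + (1 - true_share) * b for a, b in zip(ref_a, ref_b)]
share_a = estimate_proportion(mixture, ref_a, ref_b)
```

The denominator grows with the difference between the two reference profiles, which illustrates why the choice of well-separated marker regions in the reference library affects the accuracy of the estimate.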
Within a cellular population, the methylation patterns of different cell types and at specific genomic locations are indicative of cellular heterogeneity. Alterations of such heterogeneity are predictive of development as well as prognostic markers of diseases. Computational methods that exploit heterogeneity in methylation patterns are typically constrained by partially observed patterns due to the nature of shotgun sequencing, which frequently generates limited coverage for downstream analysis. One possible solution to overcome such limitations is offered by Chang et al. presenting BSImp, a probabilistic based imputation method that uses local information to impute partially observed methylation patterns. They show that using this approach they are able to recover heterogeneity estimates at 15% more regions with moderate sequencing depths. This should therefore improve our ability to study how methylation heterogeneity is associated with disease.
Finally, recent studies have shown how the associations between Copy Number Variations (CNVs) and methylation alterations offer a richer and hence more informative picture of the samples under study, in particular for tumor data characterized by large scale genomic rearrangements (Sun et al., 2018). Consequently, recent technological and methodological developments have made it possible to measure CNVs from DNA methylation data. The main advantage of DNA methylation based CNV approaches is that they offer the possibility to integrate both genomic (copy number) and epigenomic (methylation) information. Mariani et al. propose MethylMasteR, an R software package that integrates DNA methylation-based CNV calling routines, facilitating standardization, comparison and customization of CNV analyses. This package, built into the Docker architecture to seamlessly manage dependencies, includes four of the most commonly used routines for this integrated analysis: ChAMP (Morris et al., 2014), SeSAMe (Zhou et al., 2018), Epicopy (Cho et al., 2019), plus a custom version of cnAnalysis450k (Knoll et al., 2017), overall enabling analysis of comparative results.
All the topics in this issue, although limited to specific aspects of DNA methylation analysis, highlight the importance of research in this field and the associated computational challenges, and illustrate the significant impact that this type of data will likely have on preventive medicine.
Editorial: Computational methods for analysis of DNA methylation data, volume II
DNA methylation stands out as one of the most extensively investigated epigenetic modifications within the realm of eukaryotic biology. DNA methylation-based predictive models for chronological age, referred to as epigenetic clocks, serve as widely used tools for investigating age-related pathologies and physiological alterations. Discrepancies between predicted and actual chronological ages are frequently interpreted as manifestations of biological age acceleration, a phenomenon linked to the onset of various disorders. A plethora of epigenetic clocks have been developed in the literature (Di Lena et al., 2021), and several studies have demonstrated associations between epigenetic age acceleration and pathological conditions (Horvath and Raj, 2018). This active area of research is currently engaged in endeavors to enhance the predictive capabilities of epigenetic clocks and facilitate the translation of their applications into the realm of predictive medicine. Following the significant interest garnered by the first volume of this Research Topic, we are pleased to introduce the second volume. This edition encompasses five contributions dedicated to exploring advancements and challenges in the development of DNA methylation-based epigenetic clocks, as well as examining the applications of epigenetic clocks and DNA methylation analysis in studying disease biology.
In the context of epigenetic clock development, Sala et al. delved into the impact of covariates, such as sex and tissue specificities, as well as training parameters, including the size of the training set and the linear regression model utilized, on the performance of epigenetic clocks. The authors showed that the size of the training set significantly influences prediction performance, as expected. Sex specificity does not substantially affect clock performance, as evidenced by the lack of statistically significant differences between sex-specific and sex-generic linear regression clocks. Conversely, tissue-specific clocks demonstrate superior performance compared to multi-tissue clocks, typically trained on a majority of blood samples. Moreover, the widely utilized elastic-net regression model exhibits comparable or superior prediction performance relative to ridge and lasso penalization models. These findings offer valuable insights for the development of linear regression epigenetic clocks with enhanced performance capabilities.
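For reference, the three penalized linear regression models compared above differ only in their penalty term. One common parametrization of the elastic-net objective is given below. The symbols are assumptions introduced here for illustration and are not taken from the paper: y_i is the age of sample i, x_i its vector of CpG methylation levels, lambda the overall penalty strength and alpha the mixing parameter.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \lambda\Bigl(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr),
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1
```

Setting alpha = 1 recovers the lasso penalty and alpha = 0 the ridge penalty, so elastic net can be read as an interpolation between the two.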
A complementary analysis of regression models was provided by Farrell et al. who compared the performance of epigenetic clocks based on penalized linear regression models with that of the non-linear epigenetic pacemaker (EPM) model (Farrell et al., 2020). The EPM model considers DNA methylation as a function of a time-dependent epigenetic state. Differently from linear regression models, the epigenetic state is influenced not only by age but also by other factors, such as sex and cell composition. The authors applied both models to a study on polybrominated biphenyl (PBB) exposure to predict epigenetic age dependent on PBB exposure. They found that both models perform well, with the EPM model showing superior performance. Importantly, only the EPM model identifies significant associations with PBB exposure, highlighting its robustness in investigating factors impacting age acceleration that may be obscured by linear regression models.
A novel regression model for epigenetic clocks, BayesAge, was introduced by Mboning et al. BayesAge was tailored for bisulfite sequencing data. BayesAge utilizes maximum-likelihood estimation (MLE) to address missing data issues and can estimate error bounds, enhancing age inference reliability. Furthermore, BayesAge incorporates LOWESS (LOcally WEighted Scatterplot Smoothing) to capture non-linear associations between DNA methylation data and age. Performance comparisons on down-sampled data indicate superior performance of BayesAge over other linear and non-linear regression models, representing a promising advancement in epigenetic age prediction.
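The general idea of inferring age by maximum likelihood from bisulfite sequencing read counts can be sketched as follows. This is not the BayesAge implementation. It assumes a binomial read-count model and replaces the LOWESS-fitted trends with hypothetical linear per-CpG trends, and every name and number below is illustrative.

```python
import math

# Hypothetical per-CpG trends: (methylation level at age 0, change per year).
# A real model would estimate such trends from training data.
SITE_TRENDS = [(0.20, 0.005), (0.80, -0.004), (0.10, 0.006)]


def expected_methylation(trend, age):
    """Expected methylation level at a given age, kept away from 0 and 1."""
    base, slope = trend
    return min(0.99, max(0.01, base + slope * age))


def log_likelihood(age, trends, counts):
    """Binomial log-likelihood of the read counts (constant terms dropped)."""
    total = 0.0
    for trend, observed in zip(trends, counts):
        if observed is None:  # CpG not covered: it simply drops out of the sum
            continue
        methylated, depth = observed
        p = expected_methylation(trend, age)
        total += methylated * math.log(p) + (depth - methylated) * math.log(1 - p)
    return total


def most_likely_age(trends, counts, ages=range(0, 101)):
    """Grid search for the age that maximizes the log-likelihood."""
    return max(ages, key=lambda age: log_likelihood(age, trends, counts))


# (methylated reads, total reads) per CpG for one hypothetical sample.
counts = [(20, 50), (32, 50), (17, 50)]
estimated_age = most_likely_age(SITE_TRENDS, counts)

# The same sample with the second CpG missing.
counts_with_gap = [(20, 50), None, (17, 50)]
estimated_with_gap = most_likely_age(SITE_TRENDS, counts_with_gap)
```

Because each covered CpG contributes an independent term, missing sites need no imputation in this formulation, and the curvature of the log-likelihood around its maximum could in principle be used to derive error bounds.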
It is well established that several diseases, including cancer (Dugué et al., 2017) and infection with human immunodeficiency virus type 1 (HIV-1) (Horvath and Levine, 2015), are associated with accelerated aging. Two papers within this Research Topic scrutinize the effect of Highly Active Anti-Retroviral Therapy (HAART) on DNA methylation and biological aging. Sehl et al. employed different epigenetic clocks to analyze age acceleration in people living with HIV before and after the initiation of HAART. They discovered that epigenetic aging decreases after HAART initiation but remains persistently greater than that of age-matched seronegative controls. The authors further demonstrated that the magnitude of acceleration is associated with cumulative viral load and changes in T-cell subsets. In parallel, Zhang et al. analyzed the epigenome-wide changes associated with the initiation of HAART in people living with HIV. They identified CpGs, unrelated to HIV viral load, significantly associated with HAART initiation by comparing DNA methylation profiles of people living with HIV shortly before HAART and post-HAART against seronegative controls. Epigenome-wide association study (EWAS) analysis of such CpGs elucidates that HAART initiation alters DNA methylation in genes associated with immune response and HIV infection. Moreover, enrichment analysis detects Gene Ontologies related to transplant rejection, transplant-related diseases, and other immunologic signatures. Collectively, these findings provide insights into potential biological functions associated with DNA methylation changes induced by HAART.
The most significant conclusion drawn from the first and second volumes of this Research Topic is that although DNA methylation analysis shows great potential, there is a pressing need for further investigation and refinement of methodologies to fully harness its predictive power and translate its findings into actionable insights for clinical practice.
Evaluation of different computational methods for DNA methylation-based biological age
In recent years there has been widespread interest in researching biomarkers of aging that could predict physiological vulnerability better than chronological age. Aging, in fact, is one of the most relevant risk factors for a wide range of maladies, and molecular surrogates of this phenotype could enable better patient stratification. Among the most promising of such biomarkers is DNA methylation-based biological age. Given the potential and variety of computational implementations (epigenetic clocks), we here present a systematic review of such clocks. Furthermore, we provide a large-scale performance comparison across different tissues and diseases in terms of age prediction accuracy and age acceleration, a measure of deviance from physiology. Our analysis offers both a state-of-the-art overview of the computational techniques developed so far and a heterogeneous picture of performances, which can be helpful in orienting future research.
Fold recognition by scoring protein maps using the congruence coefficient
Motivation
Protein fold recognition is a key step for template-based modeling approaches to protein structure prediction. Although closely related folds can be easily identified by sequence homology search in sequence databases, fold recognition is notoriously more difficult when it involves the identification of distantly related homologs. Recent progress in residue–residue contact and distance prediction opens up the possibility of improving fold recognition by using structural information contained in predicted distance and contact maps.
Results
Here we propose to use the congruence coefficient as a metric of similarity between maps. We prove that this metric has several interesting mathematical properties which allow one to compute in polynomial time its exact mean and variance over all possible (exponentially many) alignments between two symmetric matrices, and assess the statistical significance of similarity between aligned maps. We perform fold recognition tests by recovering predicted target contact/distance maps from the two most recent Critical Assessment of Structure Prediction editions and over 27 000 non-homologous structural templates from the ECOD database. On this large benchmark, we compare fold recognition performances of different alignment tools with their own similarity scores against those obtained using the congruence coefficient. We show that the congruence coefficient overall improves fold recognition over other methods, proving its effectiveness as a general similarity metric for protein map comparison.
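For readers unfamiliar with the metric, the congruence coefficient between two equally sized matrices is the sum of their entry-wise products divided by the square root of the product of their sums of squares. The sketch below computes it for two small toy maps. It assumes the maps are already aligned and of identical size, so it covers neither the alignment step nor the mean and variance computations described above, and it is not the CCpro implementation. All names and values are hypothetical.

```python
import math


def congruence_coefficient(map_a, map_b):
    """Congruence coefficient between two equally sized matrices.

    Computed as sum(a * b) / sqrt(sum(a^2) * sum(b^2)) over all entries.
    """
    flat_a = [value for row in map_a for value in row]
    flat_b = [value for row in map_b for value in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("maps must have the same number of entries")
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm = math.sqrt(sum(a * a for a in flat_a) * sum(b * b for b in flat_b))
    return dot / norm


# Toy symmetric contact maps (hypothetical): 1 marks a residue-residue contact.
contact_map = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
scaled_map = [[2 * v for v in row] for row in contact_map]
disjoint_map = [[0, 0, 1],
                [0, 0, 0],
                [1, 0, 0]]

same_pattern = congruence_coefficient(contact_map, scaled_map)
no_overlap = congruence_coefficient(contact_map, disjoint_map)
```

The coefficient is insensitive to a positive rescaling of either map, so the scaled copy scores 1, while two maps that share no contacts score 0.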
Availability and implementation
The congruence coefficient software CCpro is available as part of the SCRATCH suite at: http://scratch.proteomics.ics.uci.edu/
GOTA
Background: Functional annotation of genes and gene products is a major challenge in the post-genomic era. Nowadays, gene function curation is largely based on manual assignment of Gene Ontology (GO) annotations to genes by using published literature. The annotation task is extremely time-consuming, therefore there is an increasing interest in automated tools that can assist human experts.
Results: Here we introduce GOTA, a GO term annotator for biomedical literature. The proposed approach makes use only of information that is readily available from public repositories and it is easily expandable to handle novel sources of information. We assess the classification capabilities of GOTA on a large benchmark set of publications. The overall performances are encouraging in comparison to the state of the art in multi-label classification over large taxonomies. Furthermore, the experimental tests provide some interesting insights into the potential improvement of automated annotation tools.
Conclusions: GOTA implements a flexible and expandable model for GO annotation of biomedical literature. The current version of the GOTA tool is freely available at http://gota.apice.unibo.it
methyLImp2: faster missing value estimation for DNA methylation data
Motivation
methyLImp, a method we recently introduced for the missing value estimation of DNA methylation data, has demonstrated competitive performance in data imputation compared to the existing, general-purpose, approaches. However, its running time was considerably long, making imputation infeasible for large datasets with numerous missing values.
Results
methyLImp2 made possible computations that were previously unfeasible. We achieved this by introducing two important modifications that have significantly reduced the original running time without sacrificing prediction performance. First, we implemented a chromosome-wise parallel version of methyLImp. This parallelization reduced the runtime by several 10-fold in our experiments. Then, to handle large datasets, we also introduced a mini-batch approach that uses only a subset of the samples for the imputation. Thus, it further reduces the running time from days to hours or even minutes in large datasets.
Availability and implementation
The R package methyLImp2 is under review for Bioconductor. It is currently freely available on GitHub: https://github.com/annaplaksienko/methyLImp2
NET-GE: a novel NETwork-based Gene Enrichment for detecting biological processes associated to Mendelian diseases
Enrichment analysis is a widely applied procedure for shedding light on the molecular mechanisms and functions at the basis of phenotypes, for enlarging the dataset of possibly related genes/proteins and for helping interpretation and prioritization of newly determined variations. Several standard and network-based enrichment methods are available. Both approaches rely on the annotations that characterize the genes/proteins included in the input set; network-based ones also include, in different ways, physical and functional relationships among different genes or proteins that can be extracted from the available biological networks of interactions.
Equicontinuity and sensitivity of nondeterministic cellular automata
Nondeterministic Cellular Automata (NCA) are the class of multivalued functions characterized by nondeterministic block maps. We extend the notions of equicontinuity and sensitivity to multivalued functions and investigate the characteristics of equicontinuous, almost equicontinuous and sensitive NCA. The dynamical behavior of nondeterministic CA in these classes is much less constrained than in the deterministic setting. In particular, we show that there are transitive NCA with equicontinuous points and equicontinuous NCA that are not reversible.
Estimage: a webserver hub for the computation of methylation age
Methylage is an epigenetic marker of biological age that exploits the correlation between the methylation state of specific CG dinucleotides (CpGs) and chronological age (in years), gestational age (in weeks), or cellular age (in cell cycles or as telomere length, in kilobases). Using DNA methylation data, methylage is measurable via the so-called epigenetic clocks. Importantly, alterations of the correlation between methylage and age (age acceleration or deceleration) have been stably associated with pathological states and occur long before clinical signs of diseases become overt, making epigenetic clocks a potentially disruptive tool in preventive, diagnostic and also in forensic applications. Nevertheless, the dependence of methylage on CpG selection, mathematical modelling, tissue specificity and age range still limits the potential of this biomarker. In order to enhance model comparisons, interchange, availability, robustness and standardization, we organized a selected set of clocks within a hub webservice, EstimAge (Estimate of methylation Age, http://estimage.iac.rm.cnr.it), which intuitively and informatively enables quick identification, computation and comparison of available clocks, with the support of standard statistics.
