
    A comparison of topologically associating domain callers over mammals at high resolution

    Background: Topologically associating domains (TADs) are locally highly interacting genome regions that play a critical role in regulating gene expression in the cell. TADs were first identified while investigating the 3D genome structure in High-throughput Chromosome Conformation Capture (Hi-C) interaction datasets. Substantial effort has been devoted to developing techniques for inferring TADs from Hi-C interaction datasets. Many TAD-calling methods have been developed, which differ in their criteria and assumptions for TAD inference. Correspondingly, TADs inferred by these callers vary both in their similarity to one another and in the biological features they are enriched in. Results: We have carried out a systematic comparison of 27 TAD-calling methods over mammals. We use Micro-C, a recent high-resolution variant of Hi-C, to compare TADs at a very high resolution, and classify the methods into 3 categories: feature-based methods, clustering methods, and graph-partitioning methods. We have evaluated TAD boundaries, gaps between adjacent TADs, and the quality of TADs across various criteria. We also found CTCF and cohesin proteins in particular to be effective in the formation of TADs with corner dots. We have also assessed the callers' performance on simulated datasets, since a gold standard for TADs is missing. TAD sizes and numbers change remarkably between TAD callers and dataset resolutions, indicating that TADs are hierarchically organized domains rather than disjoint regions. A core subset of feature-based TAD callers regularly performs best at inferring reproducible domains, which are also enriched for TAD-related biological properties. Conclusion: We have analyzed the fundamental principles of TAD-calling methods and characterized the current state of TAD inference across high-resolution Micro-C interaction datasets over mammals. We present a systematic, comprehensive, and concise framework to evaluate the performance of TAD-calling methods across Micro-C datasets. Our research will be useful in selecting appropriate methods for TAD inference and evaluation based on available data, experimental design, and biological question of interest. We also provide our analysis as a benchmarking tool with publicly available source code.

    DRGAT: Predicting drug responses via diffusion-based graph attention network

    Accurately predicting drug response from a patient's genomic profile is critical for advancing personalized medicine. The rise of deep learning approaches, especially graph neural networks that leverage large-scale omics datasets, has been a key driver of research in this area. However, these biological datasets, which are typically high dimensional but have small sample sizes, present challenges such as overfitting and poor generalization in predictive models. To complicate matters, gene expression (GE) data must capture complex inter-gene relationships, exacerbating these issues. In this article, we tackle these challenges by introducing a drug response prediction method, called drug response graph attention network (DRGAT), which combines a denoising diffusion implicit model for data augmentation with a prediction module based on a recently introduced graph attention network (GAT) with high-order neighbor propagation (HO-GATs). Our proposed approach achieved almost 5% improvement in the area under the receiver operating characteristic curve compared with state-of-the-art models across the many drugs studied, indicating our method's reasonable generalization capabilities. Moreover, our experiments confirm the potential of diffusion-based generative models, a core component of our method, to mitigate the inherent limitations of omics datasets by effectively augmenting GE data. TÜBİTA

    BioCode: A Data-Driven Procedure to Learn the Growth of Biological Networks

    Probabilistic biological network growth models have been used for many tasks, including but not limited to capturing the mechanisms and dynamics of biological growth, representing null models, and detecting anomalies. Well-known examples of these probabilistic models are the Kronecker model, the preferential attachment model, and the duplication-based model. However, new models must be developed frequently to better fit and explain the observed network features as new networks are observed, and it is difficult to develop a growth model each time a new network is studied. In this paper, we propose BioCode, a framework to automatically discover novel biological growth models matching user-specified graph attributes in directed and undirected biological graphs. BioCode designs a basic set of instructions that are general enough to model a number of well-known biological graph growth models. We combine this instruction-wise representation with a genetic-algorithm-based optimization procedure to encode models for various biological networks. We mainly evaluate the performance of BioCode in discovering models for biological collaboration networks, gene regulatory networks, metabolic networks, and protein interaction networks whose features, such as assortativity, clustering coefficient, and degree distribution, closely match the true ones in the corresponding real biological networks. As shown by the tests on the simulated graphs, the variance of the distributions of biological networks generated by BioCode is similar to the known models' variance for these biological network types.
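As a point of reference for the preferential attachment model named in this abstract, here is a minimal sketch of one common formulation (each new node attaches to a fixed number of distinct existing nodes with probability proportional to degree). It is not BioCode and not taken from the paper; the function names `preferential_attachment` and `degrees`, the complete-graph seed, and the parameters are illustrative assumptions.

```python
import random


def preferential_attachment(n_nodes, m_edges, seed=0):
    """Grow an undirected graph by degree-proportional attachment.

    Assumption: the seed graph is a complete graph on m_edges + 1 nodes.
    Each later node links to m_edges distinct existing nodes.
    """
    if m_edges < 1 or n_nodes <= m_edges:
        raise ValueError("need m_edges >= 1 and n_nodes > m_edges")
    rng = random.Random(seed)
    edges = []
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = []
    for u in range(m_edges + 1):
        for v in range(u + 1, m_edges + 1):
            edges.append((u, v))
            endpoints.extend((u, v))
    for new in range(m_edges + 1, n_nodes):
        chosen = set()
        while len(chosen) < m_edges:
            chosen.add(rng.choice(endpoints))
        for target in sorted(chosen):
            edges.append((target, new))
            endpoints.extend((target, new))
    return edges


def degrees(edges):
    """Return a dict mapping each node to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


if __name__ == "__main__":
    graph = preferential_attachment(50, 2, seed=1)
    print(len(graph), "edges; max degree", max(degrees(graph).values()))
```

A hand-written generator like this fixes one growth rule in advance, which is the kind of per-network modeling effort the abstract describes as difficult to repeat for every new network.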

    Pagerank-based unsupervised deep vertex representations for anti-money laundering detection

    Anti-money laundering is an international web of laws, regulations, and procedures aimed at uncovering money that has been disguised as legitimate income. Strict anti-money laundering (AML) laws and procedures require extensive and continuous transaction monitoring to infer possible illegal events. Nevertheless, traditional rule-based approaches in banks frequently generate a significant number of false positives, which impose a major burden. Deep learning approaches, especially graph neural network (GNN) based methods, could therefore be explored to produce better anti-money laundering results. Here, we propose a diffusion-based method, AMLPD, which is novel in generating unsupervised node embeddings by learning graph embeddings inductively while detecting money laundering. AMLPD treats edges as directed, and it incorporates vertex and edge feature knowledge while encoding the graph's structural knowledge. AMLPD infers a vertex's local state by combining diffusion with PageRank, which is important knowledge for AML when embedded into a low-dimensional space. Our approach can then detect money laundering with a classifier that uses this low-dimensional representation. Our approach scales to larger data, and it can help with explainable AI by facilitating analysis of the embeddings. According to experiments, our approach outperforms the baseline approaches. Therefore, AMLPD is favourable for enhancing the quality of GNN-based AML identification.
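To illustrate the kind of PageRank-based local state the abstract refers to, the sketch below computes personalized PageRank on a small directed graph by power iteration. It is a generic textbook formulation, not the AMLPD method; the function name `personalized_pagerank`, the toy account graph, and the choice to return dangling mass to the source are assumptions made for this example.

```python
def personalized_pagerank(out_edges, source, damping=0.85, iters=100):
    """Power iteration for PageRank with all restart mass on `source`.

    out_edges maps each node to a list of nodes it points to (directed).
    Assumption: mass at nodes without outgoing edges is returned to `source`.
    """
    nodes = set(out_edges) | {source}
    for targets in out_edges.values():
        nodes.update(targets)
    rank = {n: 0.0 for n in nodes}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            targets = out_edges.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                nxt[source] += damping * rank[u]
        # Restart step: the remaining probability mass jumps back to source.
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank


if __name__ == "__main__":
    # Hypothetical transfers between accounts: a pays b and c, and so on.
    transfers = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for node, score in sorted(personalized_pagerank(transfers, "a").items()):
        print(node, round(score, 4))
```

The scores describe how strongly each vertex is tied to the source through directed paths, which is one way a vertex's neighbourhood can be summarized before it is passed to an embedding or a classifier.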

    Hi–C interaction graph analysis reveals the impact of histone modifications in chromatin shape

    Chromosome conformation capture experiments such as Hi–C map the three-dimensional spatial organization of genomes at a genome-wide scale. Even though Hi–C interactions are not biased towards any of the histone modifications, previous analysis has revealed denser interactions around many histone modifications. Nevertheless, the simultaneous effects of these modifications on the Hi–C interaction graph have not been fully characterized yet, limiting our understanding of genome shape. Here, we propose ChromatinCoverage and its extension TemporalPrizeCoverage to decompose the Hi–C interaction graph in terms of known histone modifications. Both methods are based on set multicover with pairs, where the methods attempt to cover each Hi–C interaction with histone modification pairs. We find 4 histone modifications, H3K4me1, H3K4me3, H3K9me3, and H3K27ac, to be significantly predictive of most Hi–C interactions across species, cell types, and cell cycles. The proposed methods are quite effective in predicting Hi–C interactions and topologically-associated domains in one species when trained on another species or cell type. Overall, our findings reveal the impact of a subset of histone modifications on chromatin shape via the Hi–C interaction graph.

    Joint Modeling of Histone Modifications in 3D Genome Shape Through Hi-C Interaction Graph

    Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. Even though Hi-C interactions are not biased towards any of the histone modifications, previous analysis has revealed denser interactions around many histone modifications. Nevertheless, the simultaneous effects of these modifications on the Hi-C interaction graph have not been fully characterized yet, limiting our understanding of genome shape. Here, we propose Coverage Hi-C to decompose the Hi-C interaction graph in terms of known histone modifications. Coverage Hi-C is based on set multicover with pairs, where each Hi-C interaction is covered by histone modification pairs. We find 4 histone modifications, H3K4me1, H3K4me3, H3K9me3, and H3K27ac, to be significantly predictive of most Hi-C interactions across species and cell types. Coverage Hi-C is quite effective in predicting Hi-C interactions and topologically-associated domains (TADs) in one species when trained on another species or cell type.

    Analysis of chromatin structure reveals the connection between sQTLs and the splicing of distant genes

    Gene expression and regulation, with or without alternative splicing, are key factors for cells to function properly. Distant splicing quantitative trait loci (distant sQTLs) are genomic mutations that impact the alternative splicing patterns of far-away genes. Nevertheless, the mechanisms by which a distant sQTL regulates the alternative splicing of genes are not well defined. Higher-resolution chromosome conformation capture experiments like Micro-C or Hi-C, together with an expanding number of sQTL datasets on humans, help us understand the spatial processes governing distant sQTL relationships at a genome-wide scale. In this study, we focus on analyzing whether spatial closeness helps in regulating sQTL-gene interactions over the high-order chromatin topological domain structure, which is inferred from chromosome conformation experiments. We discover larger-scale chromatin shape to be in line with sQTL associations. In detail, sQTLs are generally spatially near their splicing genes in 3D, they frequently appear near topologically associating domain (TAD) and frequently interacting region (FIRE) boundaries, and they are favorably related to genes over TADs and FIREs. Additionally, we discover that inside-domain sQTLs accompanied by functional regulatory elements, including enhancers and promoters, are spatially closer than all inside-domain sQTLs. This result suggests that spatial closeness between sQTLs and their distant splicing genes, obtained from chromatin's TAD structure, has major importance in regulating alternative splicing and thus in gene regulation. Our results are robust across different experiments such as Hi-C and Micro-C, different TAD inference methods, different Hi-C binning resolutions, and different alternative splicing events, and they hold after controlling for eQTLs, which are shown to be spatially close to their genes. © The Author(s) 2025. TÜBİTAK

    Financial statement fraud detection with a categorical-to-numerical data representation

    Identifying fraudulent financial reports and elucidating the mechanisms of fraud are critical for safeguarding investors from substantial losses. Financial statements present detailed accounting entries in tabular form; they inherently combine categorical and numerical variables governed by accounting dependencies, yet most existing methods fail to model interpretable interactions between these feature types. Handling categorical variables together with numerical variables is therefore important for enhancing financial statement fraud detection performance. Here, we compare methods for transforming categorical attributes into numerical ones, which are then used for financial statement fraud detection. We perform comprehensive experiments on two real-world datasets: FiGraph and USFSD. We compare 4 state-of-the-art specialized categorical-to-numerical transformation techniques with several simpler statistical encoding mechanisms, such as target, label, Helmert, and GLMM encodings, as well as methods that can work directly on categorical data, such as CatBoost. These specialized transformation techniques are Hierarchical Coupling Learning-based CURE, Graph-based Categorical Embedding GCE, and Transitive Distance Learning-based embedding. The results reveal that the performance of CURE and XGBoost together surpasses all state-of-the-art techniques, achieving significant relative gains in macro-level recall over the second-best performing approaches, CatBoost and FTTransformer, while also providing clear and interpretable insights into the discovered fraud pathways. TÜBİTAK
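Of the simpler statistical encodings listed in this abstract, target encoding is easy to show in a few lines. The sketch below is a generic smoothed-mean variant, not the paper's pipeline; the function names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, and the toy auditor labels are assumptions for illustration.

```python
def fit_target_encoding(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of a binary or numeric target.

    encoded = (sum of targets in category + smoothing * global mean)
              / (count in category + smoothing)
    Returns the mapping and the global mean (the fallback for unseen levels).
    """
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty, equal length")
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for cat, y in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + y
        counts[cat] = counts.get(cat, 0) + 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in sums
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, fallback):
    """Replace each category with its encoded value, or fallback if unseen."""
    return [mapping.get(cat, fallback) for cat in categories]


if __name__ == "__main__":
    # Hypothetical data: auditor type per statement and a fraud flag (1 = fraud).
    auditor = ["big4", "big4", "local", "local", "local", "none"]
    fraud = [1, 0, 1, 1, 1, 0]
    mapping, prior = fit_target_encoding(auditor, fraud, smoothing=1.0)
    print(apply_target_encoding(["big4", "none", "unseen"], mapping, prior))
```

Because the encoding is computed from the label, it should be fitted on training data only and then applied to held-out data; fitting on the full dataset leaks the target into the features.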

    Metric labeling and semimetric embedding for protein annotation prediction

    Computational techniques have been successful at predicting protein function from relational data (functional or physical interactions). These techniques have been used to generate hypotheses and to direct experimental validation. With few exceptions, the task is modeled as a multilabel classification problem where the labels (functions) are treated independently or semi-independently. However, databases such as the Gene Ontology provide information about the similarities between functions. We explore the use of the Metric Labeling combinatorial optimization problem to exploit heuristically computed distances between functions and make more accurate predictions of protein function in networks derived from both physical interactions and a combination of other data types. To do this, we give a new technique (based on convex optimization) for converting heuristic semimetric distances into a metric with minimum least-squared distortion (LSD). The Metric Labeling approach is shown to outperform five existing techniques for inferring function from networks. These results suggest that Metric Labeling is useful for protein function prediction, and that LSD minimization can help solve the problem of converting heuristic distances to a metric.
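To make the semimetric-to-metric problem concrete, the sketch below repairs triangle-inequality violations with a shortest-path closure (Floyd-Warshall). This is a deliberately simpler and different technique from the convex least-squared-distortion method described in the abstract: it only ever shrinks distances and does not minimize squared distortion. The function names and the 3x3 example are assumptions for illustration.

```python
def violates_triangle(dist, tol=1e-12):
    """Return True if some triple has dist[i][j] > dist[i][k] + dist[k][j]."""
    n = len(dist)
    return any(
        dist[i][j] > dist[i][k] + dist[k][j] + tol
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


def shortest_path_closure(dist):
    """Replace each distance with the shortest-path distance (Floyd-Warshall).

    Input is assumed to be a symmetric, non-negative matrix with a zero
    diagonal. The output satisfies the triangle inequality, and no entry
    is larger than it was in the input.
    """
    n = len(dist)
    closed = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = closed[i][k] + closed[k][j]
                if via_k < closed[i][j]:
                    closed[i][j] = via_k
    return closed


if __name__ == "__main__":
    # Hypothetical heuristic distances between three function labels.
    semimetric = [
        [0.0, 1.0, 5.0],
        [1.0, 0.0, 1.0],
        [5.0, 1.0, 0.0],
    ]
    print(violates_triangle(semimetric))
    print(shortest_path_closure(semimetric))
```

In the example, the direct distance of 5 between the first and third labels is longer than the detour through the second label, so the closure lowers it to 2. A least-squares approach could instead spread the correction over several entries.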