    Datasets in support of the Southampton doctoral thesis 'Applying large scale metanalysis of transcriptomic data to uncover hyper-responsive genes and prediction via machine learning'

    The SQLite databases contain the outputs from the large-scale analysis of pre-existing RNA-seq and microarray datasets performed in chapter 2. Both SQLite databases contain the outputs of limma, a package used to perform differentially expressed gene (DEG) analysis, on the datasets from Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo/. The schema for both databases is as follows: the data table contains the outputs and statistics from limma, and the meta table contains metadata about the number of treated and control samples, the type of experiment conducted and the tissue used. These datasets were used to derive the priors used in chapters 3 to 5, based on the proportion of datasets wherein a given gene is identified as differentially expressed, i.e. with a p-value below 0.05. Due to the size of the file, this is only available on request via https://library.soton.ac.uk/datarequest. The machine_learning_input.csv file is a comma-delimited file containing the genomic and transcript-based features used to predict a gene's prior in the machine learning models. For more information see the readme file. The .RNK files are tab-delimited files. Their first column is the gene whilst the second is the rank from 1 to 0. These files were used to assess the enrichment of desired DEGs across 22 perturbation studies in chapter 2 using GSEA: https://www.gsea-msigdb.org/gsea/index.jsp. A value of 1 represents the top-ranked, highest-priority gene, whilst 0 represents the lowest priority for a given gene. The .RDS images are the R images used for the novel GEOreflect approach for ranking DEGs in bulk transcriptomic data developed in chapter 3. They are also needed to run the RShiny application used to showcase the method, the code for which can be found on GitHub (https://github.com/brandoncoke/GEOreflect) as well as in the GEOreflect_bulk_DEG_analysis.tar file.
The .RDS files require R and the readRDS() function to load into the environment and contain the percentile matrices used to calculate a platform p-value rank. Within the GEOreflect_bulk_DEG_analysis.tar file is an R script, GEOreflect_functions.R, which, when sourced after loading one of the .RDS images into the R environment, enables the user to perform the GEOreflect method. For bulk RNA-seq transcriptomic datasets, the percentile_matrix_p_value_RNAseq.RDS image is the one to load. Alternatively, when analysing GPL570 microarray datasets, the percentile_matrix.RDS file needs to be loaded into the R environment, and the appropriate R function then needs to be applied to the DEG list. To run the RShiny application, ensure both .RDS files are in the directory with the app.R file, i.e. after using git clone https://github.com/brandoncoke/GEOreflect, move both .RDS files into the GEOreflect directory of the cloned repository. The csv files with scRNA-seq appended contain the normalised mutual index, adjusted Rand index and Silhouette coefficient obtained when using 6 single cell RNA-sequencing techniques: GEOreflect, Seurat's vst method, CellBRF, genebasis and CellBRF with the 3 sigma rule imposed. This analysis was carried out in chapter 3. These .csv files use their GEO identifier in the file name or, for Zheng et al.'s data from 10X Genomics, the name assigned to it via the DuoClustering2018 R package. The machine_learning_input.csv file is a comma-delimited file containing the genomic and transcript-based features used to predict a gene's prior in the machine learning models. The inputs from this file were used to develop the machine learning models used in chapter 5. In the first row, gene is the HGNC identifier for the genes, whilst the min_to_be_sig column represents a gene's CDF value at 0.05 for its p-value distribution obtained from the RNA-seq datasets, i.e. the target y for the regressor model.
The sd column is unused and was only relevant when calculating the priors using GPL570 microarray data, where there can be redundant probes resulting in multiple priors for the same gene. This column would represent the standard deviation.
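
As a rough illustration of how a prior of this kind could be derived, the sketch below computes, for each gene, the proportion of datasets in which its p-value falls below 0.05. It is not the thesis code: the column names gene, dataset_id and p_value are assumptions (only the table name data comes from the description above), and a small in-memory SQLite database with made-up gene and dataset labels stands in for the real files, which are only available on request.

```python
import sqlite3


def gene_priors(conn, threshold=0.05):
    """Return {gene: proportion of datasets with a p-value below threshold}."""
    # Assumed column names (gene, p_value); the real schema may differ.
    query = """
        SELECT gene,
               AVG(CASE WHEN p_value < ? THEN 1.0 ELSE 0.0 END) AS prior
        FROM data
        GROUP BY gene
    """
    return {gene: prior for gene, prior in conn.execute(query, (threshold,))}


# Toy in-memory database standing in for the real SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (gene TEXT, dataset_id TEXT, p_value REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        ("GENE_A", "DATASET_1", 0.01),
        ("GENE_A", "DATASET_2", 0.20),
        ("GENE_A", "DATASET_3", 0.03),
        ("GENE_A", "DATASET_4", 0.04),
        ("GENE_B", "DATASET_1", 0.50),
        ("GENE_B", "DATASET_2", 0.04),
    ],
)

priors = gene_priors(conn)
print(priors)
```

In this toy data GENE_A is significant in three of four datasets and GENE_B in one of two, so the sketch yields priors of 0.75 and 0.5. This per-gene proportion is the same quantity as the empirical CDF of the gene's p-values evaluated at 0.05, which is how the min_to_be_sig column is described.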

    RNA-Seq of ZIKV infection series in paediatric brain tumour and NPC cell lines

    RNA-Seq data of six cell lines (three brain tumour cell and three neural precursor cell lines) cultured under monolayer conditions and infected for 12-24 hours with Zika virus (Strain KU365779.1).

    TransformerGO: predicting protein–protein interactions by modelling the attention between sets of gene ontology terms

    MOTIVATION: Protein–protein interactions (PPIs) play a key role in diverse biological processes but only a small subset of the interactions has been experimentally identified. Additionally, high-throughput experimental techniques that detect PPIs are known to suffer various limitations, such as exaggerated false positive and false negative rates. The semantic similarity derived from the Gene Ontology (GO) annotation is regarded as one of the most powerful indicators for protein interactions. However, while computational approaches for prediction of PPIs have gained popularity in recent years, most methods fail to capture the specificity of GO terms. RESULTS: We propose TransformerGO, a model that is capable of capturing the semantic similarity between GO sets dynamically using an attention mechanism. We generate dense graph embeddings for GO terms using an algorithmic framework for learning continuous representations of nodes in networks called node2vec. TransformerGO learns deep semantic relations between annotated terms and can distinguish between negative and positive interactions with high accuracy. TransformerGO outperforms classic semantic similarity measures on gold standard PPI datasets and state-of-the-art machine-learning-based approaches on large datasets from Saccharomyces cerevisiae and Homo sapiens. We show how the neural attention mechanism embedded in the transformer architecture detects relevant functional terms when predicting interactions. AVAILABILITY AND IMPLEMENTATION: https://github.com/Ieremie/TransformerGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    Single-cell pluripotency regulatory networks

    Pluripotent stem cells (PSCs) are a popular model system for investigating development, tissue regeneration, and repair. Although much is known about the molecular mechanisms that regulate the balance between self-renewal and lineage commitment in PSCs, the spatiotemporal integration of responsive signaling pathways with core transcriptional regulatory networks is complex and only partially understood. Moreover, measurements made on populations of cells reveal only average properties of the underlying regulatory networks, obscuring their fine detail. Here, we discuss the reconstruction of regulatory networks in individual cells using novel single-cell transcriptomics and proteomics, in order to expand our understanding of the molecular basis of pluripotency, including the role of cell–cell variability within PSC populations, and ways in which networks may be controlled in order to reliably manipulate cell behavior.

    Integrated analysis of the Wnt responsive proteome in human cells reveals diverse and cell-type specific networks

    Wnt signalling is a fundamentally important signalling pathway that regulates many aspects of metazoan development and is frequently dysregulated in cancer. Although many of the core components of the Wnt signalling pathway, such as β-catenin, have been extensively studied, the broad systems level responses of the mammalian cell to Wnt signalling are less well understood. In addition, the cell- or tissue-specific protein networks that modulate Wnt signalling in the diverse tissues or developmental stages in which it functions remain to be defined. To address these questions, we undertook a broad survey of the Wnt response in different human cell lines using both interaction and expression proteomics approaches. Our data reveal both similar and divergent responses of pathways and processes in the three cell-lines analysed as well as a marked attenuation of the response to exogenous Wnt treatment in cells harbouring a stabilizing (activating) mutation of β-catenin. We also identify cell-type specific components of the Wnt signalling network and find that by integrating expression and interaction proteomics data a more complete description of the Wnt interaction network can be achieved. Finally, our results attest to the power of LC-MS/MS to reveal novel cellular responses in even relatively well studied biological pathways such as Wnt signalling.

    How do oncoprotein mutations rewire protein–protein interaction networks?

    The acquisition of mutations that activate oncogenes or inactivate tumor suppressors is a primary feature of most cancers. Mutations that directly alter protein sequence and structure drive the development of tumors through aberrant expression and modification of proteins, in many cases directly impacting components of signal transduction pathways and cellular architecture. Cancer-associated mutations may have direct or indirect effects on proteins and their interactions, and while the effects of mutations on signaling pathways have been widely studied, how mutations alter underlying protein–protein interaction networks is much less well understood. Systematic mapping of oncoprotein protein interactions using proteomics techniques as well as computational network analyses is revealing how oncoprotein mutations perturb protein–protein interaction networks and drive the cancer phenotype.

    Comparative analysis of the Arabidopsis and rice expressed sequence tag (EST) sets

    Large numbers of expressed sequence tags (ESTs) have now been generated from a variety of model organisms. In plants, substantial collections of ESTs are available for Arabidopsis and rice, in each case representing significant proportions of the estimated total numbers of genes. Large-scale comparisons of Arabidopsis and rice sequences are especially interesting due to the fact that these two species are representatives of the two subclasses of the flowering plants (Dicotyledonae and Monocotyledonae, respectively). Here we present the results of systematic analysis of the Arabidopsis and rice EST sets. Non-redundant sets of sequences from Arabidopsis and rice were first separately derived and then combined so that gene families in common between the two species could be identified. Our results show that 58% of non-singleton ESTs are derived from genes in gene families common to the two species. These gene families constitute the basis of a core set of higher plant genes

    Protein language models meet reduced amino acid alphabets

    Motivation: protein language models (PLMs), which borrowed ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvement in several downstream tasks. Clustering amino acids based on their physical-chemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored. Results: here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 proteins out of the 50 targets, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%.
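
To make "translating a sequence using a reduced alphabet" concrete, the sketch below maps each of the 20 standard amino acids onto one representative letter per group. The six-group clustering and the example sequence are purely illustrative assumptions; the abstract does not say which reduced alphabets were evaluated, and this is not the paper's code.

```python
# Hypothetical six-letter grouping by broad physicochemical class.
# The alphabets actually studied in the paper are not given in the abstract.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine and proline
}

# Invert the grouping: residue -> representative letter of its group.
RESIDUE_TO_GROUP = {
    residue: representative
    for representative, members in GROUPS.items()
    for residue in members
}


def reduce_sequence(sequence, unknown="X"):
    """Translate a protein sequence into the reduced alphabet."""
    return "".join(RESIDUE_TO_GROUP.get(res, unknown) for res in sequence.upper())


reduced = reduce_sequence("MKTAYIAKQR")
print(reduced)
```

Residues outside the 20 standard amino acids are mapped to a placeholder character so that the translated sequence keeps the length of the original.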

    The USP7 protein interaction network and its roles in tumorigenesis

    Ubiquitin-specific protease (USP7), also known as Herpesvirus-associated ubiquitin-specific protease (HAUSP), is a deubiquitinase. There has been significant recent attention on USP7 following the discovery that USP7 is a key regulator of the p53-MDM2 pathway. The USP7 protein is 130 kDa in size and has multiple domains which bind to a diverse set of proteins. These interactions mediate key developmental and homeostatic processes including the cell cycle, immune response, and modulation of transcription factor and epigenetic regulator activity and localization. USP7 also promotes carcinogenesis through aberrant activation of the Wnt signalling pathway and stabilization of HIF-1α. These findings have shown that USP7 may induce tumour progression and be a therapeutic target. Together with interest in developing USP7 as a target, several studies have defined new protein interactions and the regulatory networks within which USP7 functions. In this review, we focus on the protein interactions of USP7 that are most important for its cancer-associated roles.

    Computational framework for analysis of prey–prey associations in interaction proteomics identifies novel human protein–protein interactions and networks

    Large-scale protein-protein interaction data sets have been generated for several species including yeast and human and have enabled the identification, quantification, and prediction of cellular molecular networks. Affinity purification-mass spectrometry (AP-MS) is the preeminent methodology for large-scale analysis of protein complexes, performed by immunopurifying a specific “bait” protein and its associated “prey” proteins. The analysis and interpretation of AP-MS data sets is, however, not straightforward. In addition, although yeast AP-MS data sets are relatively comprehensive, current human AP-MS data sets only sparsely cover the human interactome. Here we develop a framework for analysis of AP-MS data sets that addresses the issues of noise, missing data, and sparsity of coverage in the context of a current, real world human AP-MS data set. Our goal is to extend and increase the density of the known human interactome by integrating bait-prey and cocomplexed preys (prey-prey associations) into networks. Our framework incorporates a score for each identified protein, as well as elements of signal processing to improve the confidence of identified protein-protein interactions. We identify many protein networks enriched in known biological processes and functions. In addition, we show that integrated bait-prey and prey-prey interactions can be used to refine network topology and extend known protein networks.