1,721,008 research outputs found
Computational analysis of multilevel omics data for the elucidation of molecular mechanisms of cancer
Philosophiae Doctor - PhD
Cancer is a group of diseases that arise from irreversible genomic and epigenomic alterations that result in unrestrained proliferation of abnormal cells. Detailed understanding of the molecular mechanisms underlying a cancer would aid the identification of most, if not all, genes responsible for its progression and the development of molecularly targeted chemotherapy. The challenge of recurrence after treatment shows that our understanding of cancer mechanisms is still poor. As a contribution to overcoming this challenge, we provide an integrative multi-omic analysis of glioblastoma multiforme (GBM), for which large data sets on different classes of genomic and epigenomic alterations have been made available in the Cancer Genome Atlas data portal. The first part of this study involves protein network analysis for the elucidation of GBM tumourigenic molecular mechanisms, identification of driver genes, prioritization of genes in chromosomal regions with copy number alteration, and co-expression and transcriptional analysis. Functional modules were obtained by edge-betweenness clustering of a protein network constructed from genes with predicted functional impact mutations and differentially expressed genes. Pathway enrichment analysis was performed on each module to identify statistically overrepresented signaling pathways. Known and novel candidate cancer driver genes were identified in the modules, and functionally relevant genes in chromosomal regions altered by homozygous deletion or high-level amplification were prioritized with the protein network. Co-expressed modules enriched in cancer biological processes and transcription factor targets were identified using network genes that demonstrated high expression variance. Our findings show that GBM's molecular mechanisms are much more complex than those reported in previous studies.
We next identified differentially expressed miRNAs for which target genes associated with the protein network were also differentially expressed. MiRNAs and target genes were prioritized based on the number of targeted genes and targeting miRNAs, respectively. MiRNAs that correlated with time to progression were selected by an elastic net-penalized Cox regression model for survival analysis. These miRNAs were combined into a signature that independently predicted adjuvant therapy-linked progression-free survival in GBM and its subtypes and overall survival in GBM. The results show that miRNAs play significant roles in GBM progression and patients' survival. Finally, a prognostic mRNA signature that independently predicted progression-free and overall survival was identified. Pathway enrichment analysis was carried out on genes with high expression variance across a cohort to identify those in chemoradioresistance-associated pathways. A support vector machine-based method was then used to identify a set of genes that discriminated between rapidly and slowly progressing GBM patients, with a minimal cross-validation error rate of 5%. The prognostic value of the gene set was demonstrated by its ability to predict adjuvant therapy-linked progression-free and overall survival in GBM and its subtypes, and was validated in an independent data set. We have identified a set of genes involved in tumourigenic mechanisms that could potentially be exploited as targets in drug development for the treatment of primary and recurrent GBM. Furthermore, given their demonstrated accuracy in this study, the identified miRNA and mRNA signatures have strong potential to be combined and developed into a robust clinical test for predicting prognosis and treatment response.
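The prioritization step described in this abstract (ranking miRNAs by how many genes they target, and genes by how many miRNAs target them) can be illustrated with a minimal sketch. The function name `prioritize`, the pair format, and the tie-breaking by name are assumptions made for illustration; the thesis does not specify its implementation.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def prioritize(pairs: Iterable[Tuple[str, str]]) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Rank miRNAs by number of target genes and genes by number of targeting miRNAs.

    `pairs` holds hypothetical (miRNA, target gene) interactions; duplicates are ignored.
    """
    unique_pairs = set(pairs)
    mirna_counts = Counter(mirna for mirna, _ in unique_pairs)
    gene_counts = Counter(gene for _, gene in unique_pairs)

    def rank(counts: Counter) -> List[Tuple[str, int]]:
        # Highest count first; ties broken alphabetically (an assumption).
        return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    return rank(mirna_counts), rank(gene_counts)


if __name__ == "__main__":
    example = [("miR-A", "G1"), ("miR-A", "G2"), ("miR-B", "G1"), ("miR-A", "G1")]
    mirnas, genes = prioritize(example)
    print(mirnas)
    print(genes)
```

Counting unique pairs avoids inflating a rank when the same interaction is reported by more than one target-prediction source.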
Massively-Parallel Computational Identification of Novel Broad Spectrum Antivirals to Combat Coronavirus Infection
Philosophiae Doctor - PhD
Given the significant disease burden caused by human coronaviruses, the discovery of an effective antiviral strategy is paramount; however, there is still no effective therapy to combat infection. This thesis details the in silico exploration of ligand libraries to identify candidate lead compounds that, based on multiple criteria, have a high probability of inhibiting the 3-chymotrypsin-like protease (3CLpro) of human coronaviruses. Atomistic models of the 3CLpro were obtained from the Protein Data Bank, or theoretical models were generated by homology modelling. These structures served as the basis of both structure- and ligand-based drug design studies. Consensus molecular docking and pharmacophore modelling protocols were adapted to explore the ZINC Drugs-Now dataset in a high-throughput virtual screening strategy to identify ligands that computationally bound to the active site of the 3CLpro. Molecular dynamics was further used to confirm that the binding modes and interactions observed with the static structure- and ligand-based techniques were correct, via analysis of various parameters in a 10 ns simulation. Molecular docking and pharmacophore models identified a total of 19 ligands that displayed the potential to computationally bind to all 3CLpro structures included in the study. The strategies employed to identify these lead compounds also indicated that a known inhibitor of the SARS-CoV 3CLpro has potential as a broad-spectrum lead compound. Further analysis by molecular dynamics simulations largely confirmed the binding modes and ligand orientations identified by the former techniques. The comprehensive approach used in this study improves the probability of identifying experimental actives and represents a cost-effective pipeline for the often expensive and time-consuming process of lead discovery. The identified lead compounds represent an ideal starting point for assays to confirm in vitro activity, after which experimentally confirmed actives will proceed to subsequent studies on lead optimization.
Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes
Magister Scientiae - MSc
INTRODUCTION:
Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence, but the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies using transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have improved researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level.
The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1s and 0s. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature selection would contribute to a robust, machine-learning-based classifier of breast cancer and its subtypes.
METHODOLOGY:
For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples. A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes.
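The two-class feature-selection rule described above can be sketched as follows, assuming each gene's barcode is a list of 0/1 calls per sample. The function names, the input layout, and the reading of "differentially stable" as "stably present in one tissue and stably absent in the other" are assumptions for illustration, not the thesis's actual code.

```python
from typing import Dict, List


def present_fraction(bits: List[int]) -> float:
    """Fraction of samples in which the gene is barcoded as present (1)."""
    return sum(bits) / len(bits)


def select_two_class_features(
    normal: Dict[str, List[int]],
    tumour: Dict[str, List[int]],
    threshold: float = 0.85,
) -> List[str]:
    """Return genes stably present in one tissue and stably absent in the other."""
    selected = []
    for gene in sorted(normal.keys() & tumour.keys()):
        p_normal = present_fraction(normal[gene])
        p_tumour = present_fraction(tumour[gene])
        # Absent fraction is the complement of the present fraction.
        on_in_normal = p_normal >= threshold and (1 - p_tumour) >= threshold
        on_in_tumour = p_tumour >= threshold and (1 - p_normal) >= threshold
        if on_in_normal or on_in_tumour:
            selected.append(gene)
    return selected


if __name__ == "__main__":
    normal = {"G1": [1] * 10, "G2": [1, 0] * 5, "G3": [0] * 10}
    tumour = {"G1": [0] * 9 + [1], "G2": [1] * 10, "G3": [0] * 10}
    print(select_two_class_features(normal, tumour))
```

A gene that is stable in both tissues but in the same state (such as the hypothetical `G3`) is excluded, since it carries no discriminating information.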
For RNA-Seq data, a barcoding method needed to be developed that could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, and differential gene expression analysis was used to select the top 100 genes as informative features for classification.
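One way a z-score-to-barcode conversion could work is sketched below. The abstract does not give the details, so the per-gene standardization across the supplied samples, the use of the sample standard deviation, and the `cutoff` parameter are all assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List


def zscore_barcode(expression: Dict[str, List[float]], cutoff: float = 1.0) -> Dict[str, List[int]]:
    """Convert per-gene expression values into 0/1 barcodes via z-scores.

    Each gene is standardized across its own samples; a sample is called
    1 when its z-score exceeds `cutoff` (a hypothetical threshold).
    """
    barcodes = {}
    for gene, values in expression.items():
        mu = mean(values)
        sigma = stdev(values)
        if sigma == 0:
            # No variation across samples: nothing can exceed the cutoff.
            barcodes[gene] = [0] * len(values)
        else:
            barcodes[gene] = [1 if (v - mu) / sigma > cutoff else 0 for v in values]
    return barcodes


if __name__ == "__main__":
    data = {"G1": [1.0, 2.0, 3.0, 10.0], "G2": [5.0, 5.0, 5.0, 5.0]}
    print(zscore_barcode(data))
```

Expression values would normally be log-transformed before standardization; that step is omitted here to keep the sketch short.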
The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations.
RESULTS:
The GExB-FS method for microarray data yielded 85-probe and 346-probe informative sets for the two-class and multi-class classifiers, respectively. With SVM, the two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy.
Combining RNA-Seq DE analysis for feature selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples, with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged that differentiated this tissue from normal breast tissue in healthy individuals.
CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established. The universal application of the microarray signatures and the validity of the z-score-to-barcode method were demonstrated by 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qPCR-based tests.
Genome assembly of next-generation sequencing data for the Oryx bacillus : species of the Mycobacterium tuberculosis complex
Magister Scientiae - MSc
Next generation sequencing (NGS) technology platforms have accelerated the ability to produce completed genome assemblies. Recently, collaborators at Tygerberg Medical School outsourced the sequencing of Oryx bacillus, a member of the Mycobacterium tuberculosis complex (MTC). A total of 31,271,059 short reads were generated and required filtering, assembly and annotation using bioinformatics algorithms. In this project, an NGS assembly pipeline was implemented, tailored specifically for SOLiD sequence data. The raw reads were aligned to seven fully sequenced and annotated MTC members, namely Mycobacterium tuberculosis H37Rv, H37Ra, CDC1551, F11, KZN 1435, Mycobacterium bovis AF2122/97 and Mycobacterium bovis BCG str. Pasteur 1173P2, using NovoalignCS. Depth and breadth of sequence coverage across each base of the reference genome were calculated using BEDTools. Structural variation at the nucleotide level, including deletions, insertions and single nucleotide polymorphisms (SNPs), was called using three tools: GATK, SAMtools and Nesoni. These variations were further filtered using in-house Perl scripts. Putative functional roles for the alterations at the DNA level were extrapolated from the overlap with essential genes present in annotated MTC members. Approximately 20,730,631 short reads (59.78%) out of a total of 31,271,059 reads aligned to the seven reference genomes. The per-base sequence coverage calculations revealed an average of 1,243 unaligned regions. These unaligned regions overlapped with mycobacterial regions of difference (RD) and genetic phage elements acquired by the MTC through horizontal gene transfer, which include genes prevalent in clinical isolates of M. tuberculosis. A total of 2,680 genetic variations were identified and categorised into 845 synonymous and 1,724 non-synonymous SNPs together with 44 insertions and 67 deletions. Some of the variant alleles overlapped genes known to be involved in TB drug resistance. While the biological significance of our findings remains to be elucidated, they nonetheless deserve further attention, because SNPs have the potential to impact strain phenotype by gene disruption. Therefore, any hypotheses generated from these large-scale analyses will be tested by our collaborators at Tygerberg Medical School.
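The coverage quantities mentioned in this abstract (depth, breadth, and unaligned regions) can be illustrated with a small sketch that works on a list of per-base depths, such as one might parse from per-base coverage output. This is not the thesis pipeline; the function names and the input format are assumptions.

```python
from typing import List, Tuple


def coverage_summary(depths: List[int]) -> Tuple[float, float]:
    """Return (breadth, mean depth) for a list of per-base read depths.

    Breadth is the fraction of positions covered by at least one read.
    """
    covered = sum(1 for d in depths if d > 0)
    breadth = covered / len(depths)
    mean_depth = sum(depths) / len(depths)
    return breadth, mean_depth


def uncovered_regions(depths: List[int]) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals with zero depth."""
    regions = []
    start = None
    for position, depth in enumerate(depths):
        if depth == 0 and start is None:
            start = position  # a zero-depth run begins here
        elif depth != 0 and start is not None:
            regions.append((start, position))  # the run ended just before this base
            start = None
    if start is not None:
        regions.append((start, len(depths)))  # run extends to the end of the sequence
    return regions


if __name__ == "__main__":
    example = [0, 0, 3, 5, 0, 2, 0]
    print(coverage_summary(example))
    print(uncovered_regions(example))
```

Intervals are reported as 0-based and half-open, matching the BED convention, so they could be intersected with annotated features such as regions of difference.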
Integrating regulatory and methylome data for the discovery of clear cell Renal Cell Carcinoma (ccRCC) variants
Magister Scientiae - MSc
Kidney cancers, of which clear cell renal cell carcinoma comprises an estimated 70%, are among the ten most common cancers in both males and females. With a mortality rate that exceeds 40%, kidney cancer is considered the most lethal cancer of the genitourinary system. Despite advances in its treatment, the mortality and incidence rates across all stages of the disease have continued to climb. Since the release of the Human Genome Project in the early 2000s, most genetics studies have focused on the protein-coding region of the human genome, which accounts for a mere 2% of the entire genome. It has been suggested that diverting our focus to the other 98% of the genome, which was previously dismissed as non-functional "junk DNA", could contribute significantly to our understanding of the underlying mechanisms of complex diseases. In this study, a whole genome sequencing somatic mutation data set from the International Cancer Genome Consortium was used. The non-coding somatic mutations within the promoter, intronic, 5-prime untranslated and 3-prime untranslated regions of clear cell renal cell carcinoma-implicated genes were extracted and submitted to RegulomeDB for functional annotation. As expected, most of the variants were located within the intronic regions, and only a small subset of identified variants was predicted to be deleterious. Although the variants all belonged to a selected subset of kidney cancer-associated genes, the genes frequently mutated in the non-coding regions were not the same genes that were frequently mutated in whole exome studies (where the focus is on the coding sequences). This indicates that whole genome sequencing studies could identify a new set of genes and variants previously unassociated with clear cell renal cell carcinoma. In addition, most of the non-coding somatic variants fell within multiple transcription factor binding sites. Since many of these variants were also deleterious (as predicted by RegulomeDB), this suggests that mutations in the non-coding regions could contribute to disease through transcription factor binding site disruptions and their subsequent impact on transcriptional regulation. The substantial overlap between the genes with the most aberrantly methylated variants and the genes with the most transcription factor binding site disruptions signifies a potential link between differential methylation and transcription factor binding site affinities. In contrast to the hypermethylation generally seen in promoter methylation studies, all of the significant hits in this study were hypomethylated, with subsequent up-regulation of the genes of interest, suggesting that in clear cell renal cell carcinoma, aberrant methylation may play a role in activating proto-oncogenes rather than silencing genes. When a cross-analysis was carried out between the gene expression patterns, the transcription factor binding site disruptions, the non-coding somatic variants and the differential methylation profiles, the genes affected again showed a clear overlap. Interestingly, most of the variants were not present in the 1000 Genomes data and thus represent novel mutations, which possibly occurred as a result of genomic instability. Identifying novel variants is always promising, since they offer the possibility of developing pioneering ways to target diseases.
The numerous detrimental effects that a single non-coding mutation can have on other genomic processes have been demonstrated in this study and therefore validate the inclusion of non-coding regions of the genome in genetic studies of complex multifactorial diseases.
National Research Foundation (NRF) and DAA
Identification of new respiratory viruses in the new millennium
The rapid advancement of molecular tools in the past 15 years has allowed for the retrospective discovery of several new respiratory viruses as well as the characterization of novel emergent strains. The inability to characterize the etiological origins of respiratory conditions, particularly in children, led several researchers to pursue the discovery of the underlying etiology of disease. In 2001, this led to the discovery of human metapneumovirus (hMPV). Soon afterwards, the outbreak of Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) promoted an increased interest in coronavirology and the later discovery of human coronavirus (HCoV) NL63 and HCoV-HKU1. Human bocavirus, with its four separate lineages, discovered in 2005, has been linked to acute respiratory tract infections and gastrointestinal complications. Middle East Respiratory Syndrome coronavirus (MERS-CoV) represents the most recent outbreak of a completely novel respiratory virus, which occurred in Saudi Arabia in 2012 and presents a significant threat to human health. This review will detail the most current clinical and epidemiological findings for all respiratory viruses discovered since 2001.
