1,721,019 research outputs found

    An Enhanced Random Forests Approach to Predict Heart Failure from Small Imbalanced Gene Expression Data

    No full text
    Myocardial infarctions and heart failure are the cause of more than 17 million deaths annually worldwide. ST-segment elevation myocardial infarctions (STEMI) require timely treatment, because delays of minutes have serious clinical impacts. Machine learning can provide alternative ways to predict heart failure and identify genes involved in heart failure. For these scopes, we applied a Random Forests classifier enhanced with feature elimination to microarray gene expression of 111 patients diagnosed with STEMI, and measured the classification performance through standard metrics such as the Matthews correlation coefficient (MCC) and area under the receiver operating characteristic curve (ROC AUC). Afterwards, we used the same approach to rank all genes by importance, and to detect the genes more strongly associated with heart failure. We validated this ranking by literature review and gene set enrichment analysis. Our classifier employed to predict heart failure achieved MCC = +0.87 and ROC AUC = 0.918, and our analysis identified KLHL22, WDR11, OR4Q3, GPATCH3, and FAH as top five protein-coding genes related to heart failure. Our results confirm the effectiveness of machine learning feature elimination in predicting heart failure from gene expression, and the top genes found by our approach will be able to help biologists and cardiologists further our understanding of heart failure

    Computational intelligence identifies alkaline phosphatase (Alp), alpha-fetoprotein (afp), and hemoglobin levels as most predictive survival factors for hepatocellular carcinoma

    No full text
    Liver cancer kills approximately 800 thousand people annually worldwide, and its most common subtype is hepatocellular carcinoma (HCC), which usually affects people with cirrhosis. Predicting survival of patients with HCC remains an important challenge, especially because technologies needed for this scope are not available in all hospitals. In this context, machine learning applied to medical records can be a fast, low-cost tool to predict survival and detect the most predictive features from health records. In this study, we analyzed medical data of 165 patients with HCC: we employed computational intelligence to predict their survival, and to detect the most relevant clinical factors able to discriminate survived from deceased cases. Afterwards, we compared our data mining results with those obtained through statistical tests and scientific literature findings. Our analysis revealed that blood levels of alkaline-phosphatase (ALP), alpha-fetoprotein (AFP), and hemoglobin are the most effective prognostic factors in this dataset. We found literature supporting association of these three factors with hepatoma, even though only AFP has been used in a prognostic index. Our results suggest that ALP and hemoglobin can be candidates for future HCC prognostic indexes, and that physicians could focus on ALP, AFP, and hemoglobin when studying HCC records

    Data analytics and clinical feature ranking of medical records of patients with sepsis

    No full text
    Background: Sepsis is a life-threatening clinical condition that happens when the patient’s body has an excessive reaction to an infection, and should be treated in one hour. Due to the urgency of sepsis, doctors and physicians often do not have enough time to perform laboratory tests and analyses to help them forecast the consequences of the sepsis episode. In this context, machine learning can provide a fast computational prediction of sepsis severity, patient survival, and sequential organ failure by just analyzing the electronic health records of the patients. Also, machine learning can be employed to understand which features in the medical records are more predictive of sepsis severity, of patient survival, and of sequential organ failure in a fast and non-invasive way. Dataset and methods: In this study, we analyzed a dataset of electronic health records of 364 patients collected between 2014 and 2016. The medical record of each patient has 29 clinical features, and includes a binary value for survival, a binary value for septic shock, and a numerical value for the sequential organ failure assessment (SOFA) score. We disjointly utilized each of these three factors as an independent target, and employed several machine learning methods to predict it (binary classifiers for survival and septic shock, and regression analysis for the SOFA score). Afterwards, we used a data mining approach to identify the most important dataset features in relation to each of the three targets separately, and compared these results with the results achieved through a standard biostatistics approach. Results and conclusions: Our results showed that machine learning can be employed efficiently to predict septic shock, SOFA score, and survival of patients diagnoses with sepsis, from their electronic health records data. And regarding clinical feature ranking, our results showed that Random Forests feature selection identified several unexpected symptoms and clinical components as relevant for septic shock, SOFA score, and survival. These discoveries can help doctors and physicians in understanding and predicting septic shock. We made the analyzed dataset and our developed software code publicly available online

    Clinical Feature Ranking Based on Ensemble Machine Learning Reveals Top Survival Factors for Glioblastoma Multiforme

    Full text link
    glioblastoma multiforme (GM) is a malignant tumor of the central nervous system considered to be highly aggressive and often carrying a terrible survival prognosis. an accurate prognosis is therefore pivotal for deciding a good treatment plan for patients. In this context, computational intelligence applied to data of electronic health records (EHRs) of patients diagnosed with this disease can be useful to predict the patients' survival time. In this study, we evaluated different machine learning models to predict survival time in patients suffering from glioblastoma and further investigated which features were the most predictive for survival time. we applied our computational methods to three different independent open datasets of EHRs of patients with glioblastoma: the shieh dataset of 84 patients, the berendsen dataset of 647 patients, and the lammer dataset of 60 patients. our survival time prediction techniques obtained concordance index (C-index) = 0.583 in the shieh dataset, C-index = 0.776 in the berendsen dataset, and C-index = 0.64 in the lammer dataset, as best results in each dataset. since the original studies regarding the three datasets analyzed here did not provide insights about the most predictive clinical features for survival time, we investigated the feature importance among these datasets. To this end, we then utilized random survival forests, which is a decision tree-based algorithm able to model non-linear interaction between different features and might be able to better capture the highly complex clinical and genetic status of these patients. our discoveries can impact clinical practice, aiding clinicians and patients alike to decide which therapy plan is best suited for their unique clinical status

    A Machine Learning Analysis of Health Records of Patients with Chronic Kidney Disease at Risk of Cardiovascular Disease

    No full text
    Chronic kidney disease (CKD) describes a long-term decline in kidney function and has many causes. It affects hundreds of millions of people worldwide every year. It can have a strong negative impact on patients, especially when combined with cardiovascular disease (CVD): patients with both conditions have lower survival chances. In this context, computational intelligence applied to electronic health records can provide insights to physicians that can help them make better decisions about prognoses or therapies. In this study we applied machine learning to medical records of patients with CKD and CVD. First, we predicted if patients develop severe CKD, both including and excluding information about the year it occurred or date of the last visit. Our methods achieved top mean Matthews correlation coefficient (MCC) of +0.499 in the former case and a mean MCC of +0.469 in the latter case. Then, we performed a feature ranking analysis to understand which clinical factors are most important: age, eGFR, and creatinine when the temporal component is absent; hypertension, smoking, and diabetes when the year is present. We then compared our results with the current scientific literature, and discussed the different results obtained when the time feature is excluded or included. Our results show that our computational intelligence approach can provide insights about diagnosis and relative important of different clinical variables that otherwise would be impossible to observe

    Eleven quick tips for data cleaning and feature engineering

    Full text link
    Applying computational statistics or machine learning methods to data is a key component of many scientific studies, in any field, but alone might not be sufficient to generate robust and reliable outcomes and results. Before applying any discovery method, preprocessing steps are necessary to prepare the data to the computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis and that should be adequately designed and performed since the first phases of the project. We call “feature” a variable describing a particular trait of a person or an observation, recorded usually as a column in a dataset. Even if pivotal, these data cleaning and feature engineering steps sometimes are done poorly or inefficiently, especially by beginners and unexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering on how to carry out these important preprocessing steps correctly avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can more in general be applied to any scientific area. We therefore target these guidelines to any researcher or practitioners wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries

    The Venus score for the assessment of the quality and trustworthiness of biomedical datasets

    Full text link
    Biomedical datasets are the mainstays of computational biology and health informatics projects, and can be found on multiple data platforms online or obtained from wet-lab biologists and physicians. The quality and the trustworthiness of these datasets, however, can sometimes be poor, producing bad results in turn, which can harm patients and data subjects. To address this problem, policy-makers, researchers, and consortia have proposed diverse regulations, guidelines, and scores to assess the quality and increase the reliability of datasets. Although generally useful, however, they are often incomplete and impractical. The guidelines of Datasheets for Datasets, in particular, are too numerous; the requirements of the Kaggle Dataset Usability Score focus on non-scientific requisites (for example, including a cover image); and the European Union Artificial Intelligence Act (EU AI Act) sets forth sparse and general data governance requirements, which we tailored to datasets for biomedical AI. Against this backdrop, we introduce our new Venus score to assess the data quality and trustworthiness of biomedical datasets. Our score ranges from 0 to 10 and consists of ten questions that anyone developing a bioinformatics, medical informatics, or cheminformatics dataset should answer before the release. In this study, we first describe the EU AI Act, Datasheets for Datasets, and the Kaggle Dataset Usability Score, presenting their requirements and their drawbacks. To do so, we reverse-engineer the weights of the influential Kaggle Score for the first time and report them in this study. We distill the most important data governance requirements into ten questions tailored to the biomedical domain, comprising the Venus score. We apply the Venus score to twelve datasets from multiple subdomains, including electronic health records, medical imaging, microarray and bulk RNA-seq gene expression, cheminformatics, physiologic electrogram signals, and medical text. Analyzing the results, we surface fine-grained strengths and weaknesses of popular datasets, as well as aggregate trends. Most notably, we find a widespread tendency to gloss over sources of data inaccuracy and noise, which may hinder the reliable exploitation of data and, consequently, research results. Overall, our results confirm the applicability and utility of the Venus score to assess the trustworthiness of biomedical data
    corecore