Implementation of polygenic risk scores from sequencing data towards practice by utilizing large publicly available datasets
Methods: We developed a protocol for long-read targeted sequencing using capture probes from Twist Bioscience and applied this workflow to sequence 21 pharmacogenes from 41 samples with PacBio HiFi technology. Results: Across the 41 samples, the average on-target phasing was 62% (47%-73%) and the average haploblock size was 7,509 bp, demonstrating the large number of nucleotides in the target region that were phased. In the CYP3A locus, 1,088 unique variants were detected, of which 570 variants were located in the core regions of CYP3A4, CYP3A5 and CYP3A7. Only 27 of these variants (2%) are included in the clinically used *-allele nomenclature. Notably, 1 frameshift, 5 missense, and 8 splice-site variants that are not included in clinical nomenclature were detected. Per individual, an average of 155 unique variants were detected and 34% (5%-86%) of nucleotides were phased in the CYP3A locus. Conclusions: Our results indicate that a panel-based long-read sequencing approach can phase the majority of variants in complex genomic regions, revealing a high abundance of unknown but potentially impactful variants in the CYP3A locus. Flemish Special Research Fund (BOF) [BOF21DOC23
Support vector machines
A support vector machine (SVM) is a supervised machine learning (ML) method capable of learning from data and making decisions. The fundamental principles of the SVM were introduced in the 1960s by Vapnik and Chervonenkis [1] in a theory that was further developed over the following decades. However, it was only in the 1990s that SVMs attracted greater attention from the scientific community, which was attributed to 2 significant improvements. The first is the kernel trick, which allows the SVM to classify highly nonlinear problems [2]. The second extended the SVM to regression problems [3], an approach called the support vector regression machine. These improvements have resulted in a decisive general approximator that nowadays finds its use in many applications. Typically, the mathematics and theory behind SVMs are complex and require a deep understanding of optimization theory, algebra, and learning theory. Nonetheless, the main idea can be explained intuitively, and this article will consider a classification problem to illustrate the concepts. In what follows, it can be noticed that SVMs differ from previously presented methods as they exploit geometries in the data and are not directly rooted in statistics (eg, generalized linear models). Instead, they originate from mathematics and engineering and are often compared with logistic regression, explained in the previous article. The starting point of an SVM is straightforward: it tries to solve a binary classification problem with the simplest model possible, separating the subjects that belong to the 2 different classes by a classification boundary. In 2 dimensions, this classification boundary is a straight line. In 3 dimensions, it becomes a plane, the generalization of a line. In higher dimensions, the boundary is called a hyperplane, which can be considered a plane in >3 dimensions and is beyond our imagination.
Again, the question is how well such a simple model classifies and how well the learned concepts generalize to previously unseen data. Figure 1, A shows a separable classification problem. It is perfectly possible to separate the blue from the red by using a straight line as a classification boundary. However, as illustrated in the plot, multiple options are possible. Which line should we select as our boundary to minimize the risk of misclassifying a previously unseen subject? The solution to this question presents itself in Figure 1, B and is called a maximum-margin classifier. The basic idea is simple. To minimize misclassification risk, we want our classification boundary positioned as far as possible from the neighboring subjects belonging to the different classes. The margin between the classification boundary and the training data is maximized, allowing for a tolerance region when predicting a class label for new subjects. An important observation can be made from the figure. Data points far from the classification line do not influence its position. The only data points that determine the decision boundary are the 3 points in black in Figure 1, B. These points are called support points or support vectors. In other words, if we removed all the subjects from our training dataset apart from these 3 support vectors, the location of the decision boundary would remain unaltered. This example indicates that support vectors strongly influence the decision boundary, so changes in the training data that affect them can dramatically shift it. Figure 1, C shows an additional subject, indicated by an arrow, added to the training dataset. Coincidentally, this subject lies close to the decision boundary and is an influential support vector that modifies the maximum-margin problem, resulting in a different classification boundary, as indicated in green.
However, when eyeballing the classification problem, we can be pretty satisfied with the previous classifier, indicated by dashes, which yields wider overall margins with respect to the neighboring training subjects. Flemish Governmen
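The maximum-margin idea described above can be illustrated with a small sketch. It is not the article's code: the six training points and the three candidate lines are hypothetical, chosen by hand so that the arithmetic is easy to follow. Each line w·x + b = 0 is scored by its geometric margin, the smallest signed distance from any training point to the line, and the points lying exactly at that distance play the role of support vectors.

```python
import math

# Hypothetical 2D training data: (x1, x2, label) with labels -1 and +1.
points = [
    (1.0, 1.0, -1), (2.0, 1.0, -1), (1.0, 2.0, -1),
    (4.0, 4.0, +1), (5.0, 4.0, +1), (4.0, 5.0, +1),
]

def margin(w, b, data):
    """Geometric margin of the line w.x + b = 0 (negative if a point is misclassified)."""
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x1 + w[1] * x2 + b) / norm for x1, x2, y in data)

# Hand-picked candidate boundaries (assumptions, not fitted by an optimizer).
candidates = {
    "vertical x1 = 3": ((1.0, 0.0), -3.0),
    "diagonal x1 + x2 = 6": ((1.0, 1.0), -6.0),
    "diagonal x1 + x2 = 5.5": ((1.0, 1.0), -5.5),
}

# All three lines separate the classes; keep the one with the widest margin.
margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
best_w, best_b = candidates[best]

# Support vectors: the points lying exactly on the margin of the best line.
norm = math.hypot(best_w[0], best_w[1])
support = [
    (x1, x2, y) for x1, x2, y in points
    if abs(y * (best_w[0] * x1 + best_w[1] * x2 + best_b) / norm - margins[best]) < 1e-9
]

# Dropping every other point leaves the margin of that line unchanged.
margin_support_only = margin(best_w, best_b, support)
```

In this toy setup the widest candidate is the line x1 + x2 = 5.5, and exactly 3 points sit on its margin, mirroring the 3 support vectors discussed in the text. A real SVM finds the boundary by solving an optimization problem rather than by comparing a handful of hand-picked lines.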
Experimental design in quantitative proteomics
Metabolites and proteins are potential biomarkers. They can be identified with the help of mass spectrometry (MS). However, measurements obtained by using MS are prone to various random and systematic errors. The sensitivity of the technology to these errors poses practical challenges, including concerns about the reproducibility of MS-based assays and the possibility of false findings. Given this sensitivity, the proper design of MS-based experiments becomes of utmost importance. In this chapter, we review the basic experimental-design tools that can be used to prevent the occurrence of errors that might cause misleading findings in MS-based experiments. We also present results of an experiment aimed at investigating the variability of the intensity measurements produced by a MALDI-TOF mass spectrometer. Knowledge of the potential sources of systematic and random errors is fundamental to properly designing an MS experiment.
A “Refined Hydrogen Rule” and a “Refined Hydrogen and Halogen Rule” for Organic Molecules
Deriving chemical formulas of organic molecules, based on spectral information, with heuristic rules is a commonly recurring task. The computational effort and the potentially extensive list of candidate formulas put a strain on the downstream analysis. In this paper, we introduce a set of redefined heuristics based on the hydrogen and halogen rules that reduce the computational burden and the number of candidate formulas for organic molecules, such as peptides and lipids. Claesen, J (reprint author), CEN SCK, Microbiol Unit, Boeretang 200, B-2400 Mol, Belgium; Hasselt Univ, Data Sci Inst, I BioStat, Hasselt, Belgium.
IsoSpec2: Ultrafast Fine Structure Calculator
High-resolution mass spectrometry is becoming increasingly available, with its ability to resolve the fine isotopic structure of measured analytes. It allows for high-sensitivity spectral deconvolution, leading to fewer false-positive identifications. Analytes can be identified by comparing their theoretical isotopic signal with the observed peaks. The necessary calculations are, however, computationally demanding and lead to long processing times. For wheat (Triticum aestivum) alone, UniProt holds more than 142 000 candidate protein sequences. This number is doubled upon sequence reversal for identification FDR estimation and further multiplied by in silico digestion into peptides. The same peptide might originate from more than one protein, which reduces the overall number of sequences to be calculated. However, it is still huge. IsoSpec2 can perform these calculations fast. Compared to IsoSpec1, the algorithm is simpler, orders of magnitude faster, and offers more flexibility for the developers of algorithms for raw data analysis. It is freely available under a 2-clause BSD license, with bindings for the C++, C, R, and Python programming languages. We thank Dr. Błażej Miasojedow. This work was supported by Deutsche Forschungsgemeinschaft DFG (SFB1292, Z01), Bundesministerium für Bildung und Forschung BMBF (DIASyM, FKZ: 031L0217A), Polish NCN Grants 2017/26/D/ST6/00304, 2018/29/B/ST6/00681, and partially by Flemish SBO Grant InSPECtor, 120025, IWT. Plots were made with ggplot2 [43], Keynote, and Inkscape. Lacki, MK (corresponding author), Johannes Gutenberg Univ Mainz, Inst Immunol, Univ Med Ctr, D-55131 Mainz, Germany.
Startek, MP (corresponding author), Univ Warsaw, Dept Math Informat & Mech, PL-02097 Warsaw, Poland.
Identifying Process Differences with ToF-SIMS: An MVA Decomposition Strategy
In time-of-flight secondary ion mass spectrometry (ToF-SIMS), multivariate analysis (MVA) methods such as principal component analysis (PCA) are routinely employed to differentiate spectra. However, additional insights can often be gained by comparing processes, where each process is characterized by its own start and end spectra, such as when identical samples undergo slightly different treatments or when slightly different samples receive the same treatment. This study proposes a strategy to compare such processes by decomposing the loading vectors associated with them, which highlights differences in the relative behavior of the peaks. This strategy identifies key information beyond what is captured by the loading vectors or the end spectra alone. While PCA is widely used, partial least-squares discriminant analysis (PLS-DA) serves as a supervised alternative and is the preferred method for deriving process-related loading vectors when classes are narrowly separated. The effectiveness of the decomposition strategy is demonstrated using artificial spectra and applied to a ToF-SIMS materials science case study on the photodegradation of N719 dye, a common dye in photovoltaics, on a mesoporous TiO2 anode. The study revealed that the photodegradation process varies over time, and the resulting fragments have been identified accordingly. The proposed methodology, applicable to both labeled (supervised) and unlabeled (unsupervised) spectral data, can be seamlessly integrated into most modern mass spectrometry data analysis workflows to automatically generate a list of peaks whose relative behavior varies between two processes, and is particularly effective in identifying subtle differences between highly similar physicochemical processes. Research Foundation – Flanders FWO PhD Fellowship grant 11K4322 (N.F.
Defining Spectral Quality in Mass Spectrometry-Based Proteomics: A Retrospective Review
Mass spectrometry-based proteomics is essential for advancing preventive and personalised medicine. Technological advancements have greatly increased both the number and sensitivity of spectra generated in a single experiment. Traditionally, spectra are identified using database search engines that depend on large and continuously expanding databases. This expansion enlarges the search space, posing challenges for controlling the false discovery rate in peptide identification. While many bioinformatic workflows employ rescoring algorithms as a post-processing step to manage false discoveries, preprocessing spectra offers a promising alternative. One such method, spectral quality assessment, classifies spectra as "high" quality (likely containing a peptide) or "low" quality (predominantly consisting of noise). This review provides a comprehensive perspective on spectral quality assessment, examining existing tools and their underlying principles. We discuss key considerations such as the definition of spectral quality, normalisation, the use of experimental training data, and future research in the field. By highlighting the potential of spectral quality assessment to improve peptide identification and reduce false discoveries, we aim to elaborate on its potential for the proteomics community. This study was funded by the Research Foundation – Flanders (FWO) under the “Beyond the Genome: Ethical Aspects of Large Cohort Studies” project (Case number G070722N) and the Flemish Institute for Technological Research (VITO)
CPred: Charge State Prediction for Modified and Unmodified Peptides in Electrospray Ionization
The mass-to-charge ratio serves as a critical parameter in peptide identification via mass spectrometry, enabling the precise determination of peptide masses and facilitating their differentiation based on unique charge characteristics, especially when peptides are ionized by techniques such as electrospray ionization, which produces multiply charged ions. We developed a neural network called CPred, which can accurately predict the charge state distribution from +1 to +7 for modified and unmodified peptides. CPred was trained on large-scale synthetic training data, consisting of tryptic and non-tryptic peptides and various fragmentation methods. The model was further evaluated on independent, external test data sets. Results were evaluated through the Pearson correlation coefficient and showed high correlations of up to 0.9997117 between the predicted and acquired charge state distributions. The effect of specifying modifications in the neural network and feature importance was further investigated, revealing the value of modifications and vital peptide properties in holding on to protons. CPred's accurate predictions of the charge state distribution can play an essential role in boosting confidence in peptide identifications during rescoring as a novel feature. This research was funded by the Research Foundation Flanders (FWO) under the “Beyond the Genome: Ethical Aspects of Large Cohort Studies” project (Case number G070722N). The resources and services used in this work were provided by the VSC (Flemish Supercomputer Centre), funded by the Research Foundation Flanders (FWO) and the Flemish Government
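As a rough illustration of the evaluation metric named in this abstract, the sketch below computes a Pearson correlation coefficient between an observed and a predicted charge state distribution over +1 to +7. The two distributions are made-up numbers for a single hypothetical peptide, not CPred output, and the `pearson` helper is written out by hand rather than taken from the authors' code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical fractions of ions observed in charge states +1 .. +7.
observed = [0.02, 0.55, 0.35, 0.06, 0.02, 0.00, 0.00]
# Hypothetical model prediction for the same peptide.
predicted = [0.03, 0.52, 0.36, 0.07, 0.02, 0.00, 0.00]

r = pearson(observed, predicted)
print(f"Pearson r = {r:.4f}")
```

A value close to 1 means the predicted distribution rises and falls across the charge states in the same way as the observed one. The coefficient is undefined when either distribution is constant, which this sketch does not guard against.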
Unsupervised learning
As mentioned in the previous article [1], unsupervised learning involves using datasets without a clearly designated dependent (response) variable. Unsupervised means that the machine or computer should learn patterns from the data without referring to any specific response. Unsupervised learning aims to explore the data structure and generate a hypothesis rather than to test any hypothesis by statistical methods or to construct prediction or classification models on the basis of a set of conditions and a specified response. Algorithms for unsupervised learning can be subdivided into 2 categories: (1) clustering algorithms and (2) informative data transformations. To better illustrate the concepts, we will use the dataset of Konstantonis et al [2] to investigate decisions about extraction and identification of treatment predictors in Class I malocclusions. The dataset comprises 542 randomly selected records of patients with a Class I relationship observed in a university graduate program and 5 private orthodontic offices. For each participant, several variables are observed: 26 cephalometric variables, 6 model measurements, 2 demographic variables (gender, age), and the type of treatment: nonextraction (397) or extraction of the 4 first premolars (145). More details about the dataset can be found in Konstantonis et al [2]. The scope of this study is evident as the authors want to predict the optimal treatment (response) given the set of explanatory (predictor) variables. Furthermore, they wanted to identify essential variables in predicting the treatment. The data can be presented in a tabular format that organizes all the information, as depicted in Table I. Clustering: A clustering task can be best defined by an example. Consider the image in Figure 1. A task related to this image could be determining how many herds of animals of different genera are visible in this picture.
On the basis of the physical characteristics of each animal, you could try to lump them into homogeneous clusters (groups). In this example, you could cluster (group) the animals with black and white stripe patterns and place the horned animals with brownish fur in another cluster. To execute this task, it is not necessary to be an expert in wildebeest or zebra, nor is it required to have these animals tagged by a label that explains the genus of the animal. Clustering algorithms can discover this structure in a dataset without any prior knowledge. Toward this aim, a clustering algorithm will compute a distance measure to quantify similarity or dissimilarity between different subjects in the dataset. On the basis of this measure, subjects will be clustered (grouped) or split from each other to yield clusters (groups) that have the highest similarity within the cluster and the largest differences between the clusters. Typically, a clustering method has 3 key elements: (1) a distance measure to quantify the similarity or dissimilarity between subjects; (2) an additional distance measure to quantify the difference between clusters or between a cluster and a subject (ie, linkage); and (3) a computer algorithm that maximizes the similarity within a cluster and the dissimilarity between the clusters. The variance is often used to measure the heterogeneity in a dataset. In this case, clustering will minimize the variance within the clusters and maximize the variance between the clusters. The distance is a number that tells us how far 2 subjects are separated by considering the difference for each observed variable. In the next example, the Frankfort mandibular incisor angle (FMIA) and the incisor mandibular plane angle (IMPA) are examined for 3 patients in the dataset of Konstantonis et al [2]. Two patients exhibit an FMIA and IMPA combination of (41.8/113.0) and (52.0/114.5).
These patients seem very alike when considering these 2 covariates, especially when contrasting these observations with our third patient, who has an FMIA and IMPA of (89.1/76.0). Intuitively, the third … This work was supported by the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” progra
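The 3-patient comparison above can be made concrete with a short sketch. The excerpt does not say which distance measure the article uses, so the Euclidean distance here is an assumption, and the patient labels A, B and C are hypothetical names for the three (FMIA, IMPA) pairs quoted in the text.

```python
import math

# (FMIA, IMPA) in degrees, as quoted in the text; labels A-C are hypothetical.
patients = {
    "A": (41.8, 113.0),
    "B": (52.0, 114.5),
    "C": (89.1, 76.0),
}

def euclidean(p, q):
    """Euclidean distance: square root of the summed squared per-variable differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Pairwise distances between the three patients.
d_ab = euclidean(patients["A"], patients["B"])
d_ac = euclidean(patients["A"], patients["C"])
d_bc = euclidean(patients["B"], patients["C"])

for label, d in (("A-B", d_ab), ("A-C", d_ac), ("B-C", d_bc)):
    print(f"{label}: {d:.2f}")
```

Under this assumption the first two patients are much closer to each other than either is to the third, which matches the intuition in the text. Both variables are angles on a similar scale here; when variables have very different units or ranges, they are commonly standardized before distances are computed.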
