Riviste UNIMI
    21278 research outputs found

    Assessing Methods for Predictive Cut-Point Estimation: A Simulation-Based Comparison

    Introduction: The identification of an optimal cut-point for continuous biomarkers plays a crucial role in defining patient subgroups likely to benefit from specific treatments. While the literature has extensively covered prognostic biomarkers, which predict outcome regardless of treatment, the methodological framework for identifying predictive effects, which inform treatment effect heterogeneity, is less developed. This is primarily due to the added complexity of modelling treatment-biomarker interactions, which poses challenges related to statistical power, overfitting, and bias. Objectives: This study aimed to compare three statistical methods for the identification of predictive cut-points in time-to-event data. Our goal was to assess their performance in estimating the correct interaction effect and identifying a responder subgroup, under simulation settings that account for variability in treatment efficacy, biomarker predictive effect, and subgroup prevalence. Methods: We implemented three approaches: Procedure B of the Biomarker-Adaptive Threshold Design (M1), which combines test statistics across possible cut-points using a permutation test based on likelihood-ratio statistics; the Differential Hazard Ratio method (M2), which selects the cut-point with the largest difference in HRs across adjacent thresholds; and a Minimum P-value method (M3) adapted for interaction terms in the Cox model [1,2]. We conducted a simulation study with 1000 replications, generating survival times from an exponential distribution with an expected censoring rate of approximately 40%. Eight main scenarios were defined by all possible combinations of two sample sizes (n = 300 and n = 500), two treatment effect sizes (HR = 1 or 0.5), and two interaction effect sizes (HR = 1 or 0.5), with the biomarker prognostic effect set to HR = 0.6. 
In addition, we included two extra scenarios calibrated to achieve 80% power: one based on the interaction effect test (β for treatment-biomarker interaction) and one on the subgroup effect test (β within responders). In each replication, the true cut-point was randomly drawn from the biomarker distribution between the 20th and 80th percentiles. For each method, we evaluated statistical power, cut-point estimation bias, subgroup and predictive coefficient estimation bias, and type I error. A significance level of 0.05 was used for all three methods. The procedures were also applied to real data from a prostate cancer clinical trial conducted by the Second Veterans Administration Cooperative Urologic Research Group [3]. Results: M1 consistently demonstrated robust performance, with type I error close to the nominal level (5.6%) and minimal bias in cut-point estimation (0.005 ± 0.06). It maintained good power even when the subgroup size was small. M2 showed unstable cut-point estimates (0.055 ± 0.42) and high variability in interaction estimates (0.463 ± 1.46), yielding very low power (16.2%). While M3 achieved the highest power in some scenarios (82.1%), it exhibited marked type I error inflation (50.1%) and substantial bias due to multiple testing without correction (-0.401 ± 1.730). In small subgroups, all methods experienced reduced performance, but M1 remained the most stable. On the prostate cancer dataset, M1 identified a plausible treatment-responsive subgroup, while the other two methods produced conflicting or less reliable results. Conclusions: Our results highlight the need for robust methods in predictive cut-point estimation. M1 showed the best balance between error control and accuracy. In contrast, M2 and M3 may lead to overfitting, unstable estimates, and inflated type I error rates. Future research should extend these comparisons to more complex models, including multivariate biomarkers.
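The simulation design described above can be sketched roughly as follows. This is only an illustrative data generator, not the authors' code: the uniform biomarker, the dichotomized prognostic effect, the exponential censoring mechanism, and the names `simulate_trial`, `base_hazard`, and `cens_hazard` are all assumptions. The censoring hazard would need tuning to reach the roughly 40% censoring mentioned in the abstract.

```python
import math
import random


def simulate_trial(n=300, hr_treat=0.5, hr_inter=0.5, hr_prog=0.6,
                   base_hazard=1.0, cens_hazard=0.5, seed=0):
    """Simulate one trial replicate with a hidden predictive cut-point.

    Assumptions (not taken from the abstract): Uniform(0, 1) biomarker,
    1:1 random treatment assignment, exponential censoring times.
    """
    rng = random.Random(seed)
    biomarker = [rng.random() for _ in range(n)]

    # True cut-point: an empirical quantile drawn between the
    # 20th and 80th percentiles of the biomarker distribution.
    q = rng.uniform(0.2, 0.8)
    cut = sorted(biomarker)[int(q * (n - 1))]

    rows = []
    for x in biomarker:
        treat = rng.randint(0, 1)
        high = 1 if x > cut else 0
        # Log hazard ratio: treatment, prognostic, and interaction terms.
        log_hr = (math.log(hr_treat) * treat
                  + math.log(hr_prog) * high
                  + math.log(hr_inter) * treat * high)
        hazard = base_hazard * math.exp(log_hr)
        t_event = rng.expovariate(hazard)      # exponential event time
        t_cens = rng.expovariate(cens_hazard)  # exponential censoring time
        rows.append({
            "biomarker": x,
            "treat": treat,
            "time": min(t_event, t_cens),
            "event": 1 if t_event <= t_cens else 0,
        })
    return cut, rows


cut, rows = simulate_trial(n=300, seed=1)
censoring_rate = 1 - sum(r["event"] for r in rows) / len(rows)
```

Each candidate method would then be fitted to `rows`, and its estimated cut-point compared with `cut` across replications.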

    Impact of Distance from Healthcare Facilities and Quality of Hospital Care on Patient Healthcare Travel: A Study of Oncological Surgery for Colon Cancer in Italy

    Introduction: Colon cancer surgery is a complex and essential procedure in the treatment of this disease, requiring advanced medical infrastructure and highly specialized personnel to ensure optimal patient outcomes [1,2]. The geographic distribution of healthcare resources shapes accessibility to critical interventions such as colon cancer surgery, and greater distance from treatment centers has been associated with more advanced stage at diagnosis and higher mortality among patients with this carcinoma [3,4,5]. Furthermore, hospital- and provider-related factors—such as high procedure volumes and greater specialization—also influence patient outcomes [1,2,6,7]. All of these factors can affect patients’ decisions to travel for care. In Italy, the uneven distribution and variable quality of centers performing colon cancer surgery may impact equity in service delivery. Analyzing disparities in access to care is crucial for understanding how regional variations in infrastructure and service quality influence patient mobility [8]. Objectives: To assess the impact of hospital care quality and distance from healthcare facilities (both hospitals and specialist outpatient oncology centers) on patient healthcare travel among those undergoing colon cancer surgery, and to identify any territorial inequalities in access to services. Methods: This study examines the interaction between geographic accessibility and hospital quality in shaping patient healthcare travel for colon cancer surgeries across Italy, using maps to visually represent spatial dynamics of access to care and quality [9]. Two primary distance metrics were calculated: the actual travel time from each patient’s municipality of residence to the hospital where surgery was performed, and the potential travel time from each municipality to the nearest capable facility. These metrics quantify the geographic impedance patients face when seeking specialized oncological surgery. 
Geographic coordinates of all Italian hospitals and municipal centers were used, and car travel times were computed via the OpenStreetMap Routing Machine [10] and the R statistical software. To gauge the phenomenon at the health district level, we computed both a “healthcare escape index” (indicating the propensity to travel outside one’s area for care) and an “outpatient oncology service supply index” (for chemo- and radiotherapy services) [8,11]. The cohort was identified through the National Repository of Hospital Discharge Records (SDO), linked to the Tax Registry Information System for vital status and follow-up data, and includes all patients aged 15–100 years, resident in Italy, diagnosed with colon cancer and undergoing elective partial or total colectomy in any public or accredited private hospital from 1 January 2015 to 30 November 2023 [12]. Facility-level quality indicators were integrated into the analysis according to the National Outcomes Program (PNE) classification framework, with particular focus on the Treemap tool’s classification for colon cancer surgery quality, which employs the 30-day postoperative mortality indicator under a predefined protocol [13]. Results: To capture geographic disparities at a finer granularity than prior Italian healthcare travel studies (which were limited to regional or ASL levels), we performed detailed mapping at the level of individual ASL health districts. In addition to care quality, we examined the healthcare escape index—measuring the tendency to seek care outside one’s area relative to local health needs. We also developed a local outpatient care network indicator, based on the distribution of chemo- and radiotherapy centers and their distance from patient residences, to assess the effectiveness of the territorial oncology outpatient network, given that these treatments form an integral part of the oncological care pathway alongside surgery. 
Both indicators provided a more granular understanding of how the oncology care network and patient care travel dynamics operate across the territory. Conclusions: This study explores the complex relationship between geographic accessibility to healthcare services, the quality of those services, and patient healthcare travel, focusing on colon cancer surgery across Italy. We map the distribution of surgical centers and the broader network of linked outpatient oncology services, offering a detailed visual representation of the national geographic landscape of care provision and its association with patient healthcare travel. The findings can serve as a key tool to identify determinants leading patients to forego local healthcare services.

    A Novel One-Class Classification Framework for Highly Imbalanced Binary Outcomes: the OC-Cat Approach

    Introduction Extremely rare events can challenge traditional classification models, which may exhibit reduced power in highly unbalanced datasets (i.e., when two or more target groups are unevenly represented). Moreover, this effect seems to be accentuated as the sample size decreases. Some of the simplest and most intuitive methods proposed to handle unbalanced datasets while still using classical statistical models are random under- or oversampling and hybrid methods [1]. Alternatively, other approaches have been proposed with different strategies, such as ensemble models (e.g., AdaBoost, XGBoost) or novelty detection models [2]. In medicine, this kind of scenario can occur when analysing catheter-related or catheter-associated bloodstream infections (CRBSI/CABSI), whose incidence usually remains <1/1000 catheter days [3] but could be higher in very frail patients [4]. Catheter insertion carries a potential risk of complications and longer hospitalization: decision-making algorithms are therefore of great importance in avoiding complications for these patients [5]. Objectives The main purpose of our study is to adopt a novel anomaly detection model focused on binary/categorical covariates to predict the risk of CRBSI/CABSI occurrence at baseline. To this end, we use a combined approach: feature reduction, a novelty detection algorithm, and an importance grid for model explainability. Methods Data from hospital patients who received a vascular access device (VAD) placement at the University Hospital Luigi Sacco in Milan between January 2021 and January 2025 were analysed. All patients underwent central or peripheral catheterization in a non-ICU department. Parameters were collected at catheter insertion: age, sex, any major comorbidities, active intravenous drug usage, parenteral nutrition, regimen of hospitalization, transfer from the ICU, type of catheter, number of lumens, tunnel, exit site and number of placement attempts. 
All continuous variables were discretized into categorical format, yielding 29 Boolean and 2 categorical features. The designed framework (OC-Cat) combines: 1) a graph-search-based feature selection method; 2) a one-class soft classifier (based on the characterization of patients who did not incur a catheter infection); 3) a feature ranking that clarifies the classifier's decisions by ordering features based on their unique role in identifying uninfected patients. In detail: first, we assess the redundancy of each pair of features using the excess over independence metric [6]. Then, we design an undirected connected graph where each node represents a feature and the edge weights reflect the excess over independence between feature pairs. From each node, we apply the Bellman-Ford algorithm [7] to find the shortest closed path. Among all paths, we select the one that best represents the original data based on the Bayesian Information Criterion (BIC). The features included in this optimal path constitute the final selected feature set. Second, to design the soft classifier, we rely on the assumption that a higher occurrence of a specific feature combination in majority-class records (uninfected) implies that each new instance with those values is less likely to be infected. The learning phase consists of estimating the probability of a majority-class record occurring, given the distribution of uninfected patients. The prediction phase, instead, consists of estimating the majority-class probability for a new record (based on its i-th attribute combination) using a weighted inverse Hamming distance [8]. The weight increases with the record's frequency among uninfected patients. Third, the method ranks features based on a tailored definition of importance, stating that a feature, or a feature set, is more important if it consistently exhibits the same value in majority-class data. 
To achieve this, we build a tree where nodes represent subsets of features, and each step measures the contribution of each new feature in reducing the majority-class data entropy. Finally, after exploring all feature combinations and identifying the path with minimal entropy, the algorithm reports the feature ranking as the order in which features appear along the path: from the root (most important) to the leaf (least important). To evaluate the framework's performance in terms of one-class classification, we compared the OC-Cat probability distribution with those obtained from Isolation Forest (iForest) and One-Class Support Vector Machine (OCSVM). For the analysis, the dataset was split into training and test sets (August 2023 as threshold: ~75% vs 25%). Results Data from 2836 hospitalized patients with VADs were retrieved. After keeping only the first VAD placement for each patient, we considered 2275 subjects (1222 women and 1053 men aged 18 to 101 years). Among them, 148 became infected: 62 patients developed a CRBSI, 80 a CABSI and 3 both. In the first step, our approach retained 16 out of 29 variables, which were then entered into the novel model in the second step. Figure 1 displays the risk factor index distributions for the training and test sets of our model, iForest, and OCSVM, along with their respective ROC curves. Lastly, catheter insertion site (upper vs lower limb vs neck), biological sex, hypertension, Charlson Comorbidity Index, neurological disease and diabetes emerged as the most characterizing features. Conclusion Our model introduces a novel, integrated approach for both characterizing and forecasting outcomes under severe imbalance in the target variable. It outperformed the iForest and OCSVM models applied to categorical and Boolean variables in a specific clinical context. We are currently conducting further analyses and refinements to optimize performance on both our internal and external datasets, enhancing the model's generalization.
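The idea of scoring a new record by a frequency-weighted inverse Hamming distance to majority-class patterns can be illustrated with a small sketch. The exact OC-Cat formula is not given in the abstract, so the scoring rule below (weight equal to pattern frequency, similarity equal to 1 / (1 + Hamming distance)) and the function names are assumptions made purely for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def fit_majority_patterns(records):
    """Learning phase (assumed form): count each distinct feature
    combination observed among majority-class (uninfected) records."""
    return Counter(tuple(r) for r in records)


def majority_score(record, patterns):
    """Prediction phase (assumed form): frequency-weighted average of
    inverse Hamming similarity to every majority-class pattern.
    Returns a value in (0, 1]; higher means 'more like the majority'."""
    total = sum(patterns.values())
    weighted = sum(freq / (1 + hamming(record, p))
                   for p, freq in patterns.items())
    return weighted / total


# Toy majority-class data with three Boolean features.
uninfected = [(1, 0, 1), (1, 0, 1), (1, 0, 1), (0, 0, 1)]
patterns = fit_majority_patterns(uninfected)

typical = majority_score((1, 0, 1), patterns)   # matches the commonest pattern
atypical = majority_score((0, 1, 0), patterns)  # differs from every pattern
```

A record that matches a frequent majority-class combination receives a high score, while a record far from all observed combinations receives a low one, which is the behaviour the abstract attributes to the soft classifier.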

    Beyond the Nutrition5k Project: Data Curation and Deep Learning Algorithms to Predict the Nutritional Composition of Dishes from Food Images

    Introduction In recent years, artificial intelligence (AI) has emerged as a powerful tool to overcome limitations of traditional dietary assessment methods such as 24-hour recalls, food frequency questionnaires, and dietary records [1, 2]. Nevertheless, the success of AI models heavily depends on high-quality, well-curated data. Pre-processing (handling missing values, outliers, and inconsistencies) is essential to ensure reliable model performance [3, 4]. The Nutrition5k project [5] is the first to adopt Deep Convolutional Neural Networks for the direct prediction of mass and nutritional composition of dishes from 2D images. Aims We used the US-based Nutrition5k project to evaluate the performance of various deep learning (DL) algorithms, and to compare them in predicting mass, energy, and macronutrient content from food images. We explored different ground truth configurations (by combining data curation with two country-specific food composition databases, FCDBs) and checked whether specific dishes were consistently mispredicted by most algorithms, and what common features they shared. Methods Within the Nutrition5k project, mass (grams), energy (kcal), and protein, fat, and carbohydrate (grams) contents were provided for each of the 5006 dishes as the sum of the nutritional values of single ingredients derived from the US-FCDB. In a previous publication [6], we matched the US dishes with their Italian nutritional composition. This yielded four versions of the Nutrition5k dataset, obtained as ground truths by crossing country-specific FCDBs with ingredient-mass correction of outlier dishes. We chose Inception_V3_IMAGENET1K_V1 (IncV3, the updated version of the IncV2 proposed in [5]), ResNet101_IMAGENET1K_V2, ResNet50_IMAGENET1K_V2, and ViT_B_16_IMAGENET1K_SWAG_E2E_V1 (ViT-B-16), each built in two variants (2+1 and 2+2) and pretrained on the open-source ImageNet. IncV3_2+2 was our benchmark algorithm, as in [5]. 
To ensure reproducibility, we adopted the same pipeline as in the Nutrition5k project for the train/test split of dishes, loss function, frame preprocessing, and performance metrics (root mean squared error, mean absolute error [MAE], and its percentage [MAPE]). Dish-specific (raw and absolute) differences between predicted and observed values of the target variables on the test set (n=676) were evaluated across datasets and algorithms (160 predictions per dish), by considering: (1) percentages of perfect, adjacent, and opposite agreement among quartile-based categories, and unweighted Cohen’s kappa statistics, and (2) Bland-Altman plots. We defined “incorrectly predicted dishes” as those that, for 7 or 8 DL algorithms, (1) exceeded the 95% limits of agreement in the Bland-Altman plots and (2) had the highest 5% of absolute differences across target variables and datasets. Their dish frames were manually inspected and removed when needed. The “incorrectly predicted dishes” were then grouped based on similarity in content. A sensitivity analysis was carried out to study whether energy content should be directly predicted by DL algorithms or deterministically calculated by summing the predicted macronutrients multiplied by the corresponding conversion factors. This led to three scenarios: the 5-task predicted energy content (main analysis), the 5-task computed energy content (energy calculated from macronutrients predicted together with energy), and the 4-task computed energy content (no energy prediction, potentially improving macronutrient prediction). Results The median dish to be predicted on the test set had a mass of 142 g, energy content of 164.5 kcal, 8.3 g of protein, 6.9 g of fat, and 11.3 g of carbohydrates. When dishes showed ingredients with extreme weight or composition, algorithms tended to pull their predictions toward the center of the distribution. 
For the same dataset, the IncV3 models consistently showed the lowest percentages of perfect agreement across all target variables. For a given algorithm, perfect agreement was generally higher in the corrected datasets, with the exception of protein. Similarly, Cohen’s kappa values were lower for the IncV3 models and higher for the corrected datasets. Overall, mass and energy content had similar and lower error metrics, followed by protein, carbohydrates, and fat (Figure 1). By dataset, the IncV3 models generally exhibited the worst performance. Ingredient-mass correction strongly improved performance metrics. There were 80 incorrectly predicted dishes, of which 12 were discarded (7 for discrepancies between ingredient names and images and 5 for image-related issues affecting all images). Beyond the corrected-portion-size group (5%), the Salad-based (44%), Chicken-based (25%), Eggs-based (13%), and Western-inspired breakfast foods (13%) groups were identified. From this list we removed a median of 60% of the original frames, which led to a slight reduction in MAPE values. Comparing our three scenarios, we observed a gradient: performance was highest in the 5-task predicted scenario, followed by the 5-task computed and finally the 4-task computed energy content scenario, suggesting that energy prediction may partially compensate for macronutrient prediction errors, particularly those arising from image grounding issues. The ViT-B-16 models showed minimal differences (less than approximately 7%) across scenarios. Conclusions We investigated the use of the Nutrition5k dataset for directly predicting the nutritional composition of dishes (including mass) using 2D images. All six selected algorithms outperformed the benchmark IncV3_2+2, as well as the lighter IncV3_2+1. Data curation, especially ingredient-mass correction, is critical in influencing algorithm performance.
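The "computed energy" scenarios described above derive energy deterministically from predicted macronutrients. A minimal sketch follows, assuming the general Atwater factors (4, 9, and 4 kcal/g for protein, fat, and carbohydrate); the abstract does not state which conversion factors were used, and the percentage metric shown here (MAE divided by the mean observed value) is likewise an assumed definition.

```python
# Assumed general Atwater conversion factors (kcal per gram).
ATWATER_KCAL_PER_G = {"protein": 4.0, "fat": 9.0, "carbohydrate": 4.0}


def computed_energy(protein_g, fat_g, carbohydrate_g):
    """Energy (kcal) computed from macronutrient masses in grams."""
    return (ATWATER_KCAL_PER_G["protein"] * protein_g
            + ATWATER_KCAL_PER_G["fat"] * fat_g
            + ATWATER_KCAL_PER_G["carbohydrate"] * carbohydrate_g)


def mae(predicted, observed):
    """Mean absolute error between paired predictions and observations."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def mae_percent_of_mean(predicted, observed):
    """MAE expressed as a percentage of the mean observed value
    (an assumed reading of the percentage metric)."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * mae(predicted, observed) / mean_observed


# Macronutrient values of the median test dish reported in the abstract.
energy = computed_energy(protein_g=8.3, fat_g=6.9, carbohydrate_g=11.3)
```

Comparing `computed_energy` applied to predicted macronutrients against the directly predicted energy is, in essence, what distinguishes the 5-task predicted scenario from the two computed scenarios.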

    A Modular Pipeline for the Construction and Validation of Polygenic Risk Scores in Oncology

    INTRODUCTION Polygenic Risk Scores (PRS) are statistical tools designed to estimate individual predisposition to complex diseases by aggregating the effects of numerous genetic variants (single nucleotide polymorphisms, SNPs). In oncology, PRS hold promise for enhancing cancer risk stratification and personalizing screening strategies. However, their effectiveness depends on a well-defined computational framework that guarantees high-quality data processing and consistent predictive performance. OBJECTIVE This study aims to describe a modular and reproducible pipeline for the construction and validation of PRS in cancer research, detailing each analytical step from genotype preprocessing to risk score validation. Given that cancer is characterized by a highly polygenic architecture, involving thousands of loci with small effect sizes, such efforts require analytical workflows capable of handling complex and large-scale genomic data in a reliable and scalable manner. METHODS The pipeline begins with raw Variant Call Format (VCF) files obtained through genotyping. To ensure statistical power for detecting associations and constructing robust and accurate PRS, large sample sizes, typically comprising several thousand cases and controls, are essential. Maintaining a balanced case-control ratio of 1:1 is crucial to minimize bias and maximize model stability. When relevant covariates are available, propensity score matching [1] can be applied to further balance cases and controls on clinical or demographic characteristics. Quality control (QC) is implemented using PLINK, a tool for handling SNP data, to remove variants and individuals based on call rate (<98%), minor allele frequency (MAF<1%), Hardy-Weinberg equilibrium deviations (p<1×10⁻⁴), excess heterozygosity or relatedness (PI_HAT>0.2), and sex discrepancies. 
Population stratification is assessed using Principal Component Analysis (PCA) or Multidimensional Scaling (MDS) to control for confounding due to population structure, and outliers are optionally detected via unsupervised clustering methods. The resulting components are included as covariates in downstream models. Imputation is performed via the Michigan [2] or Helmholtz imputation [3] servers using ancestry-matched reference panels to enhance the density of genotype data. Post-imputation filtering excludes SNPs with low imputation quality (R² < 0.3) and extreme allele frequencies to preserve dataset integrity. To identify genetic variants associated with cancer susceptibility, genome-wide association studies (GWAS) are conducted using logistic regression models, adjusting for age, sex, and leading principal components to mitigate confounding due to population substructure. When multiple cohorts are available, GWAS are initially conducted independently within each dataset. Subsequently, meta-analysis is performed to combine effect size estimates across studies, using either a fixed-effects or random-effects model. The choice of model depends on the extent of between-cohort heterogeneity, which may arise from differences in environmental exposures or other context-specific factors influencing cancer risk. In cases where such heterogeneity is minimal, fixed-effects meta-analysis via inverse-variance weighting is applied; otherwise, a random-effects model is employed to account for variability in genetic effect estimates across cohorts. For PRS construction, we employ a Bayesian regression framework with continuous shrinkage (PRS-CS) as proposed by Ge et al. [4], which integrates GWAS summary statistics with an external linkage disequilibrium (LD) reference panel to infer posterior SNP effect sizes. This approach eliminates the need to specify p-value thresholds or perform LD clumping and produces a single, optimized polygenic model. 
The PRS is finally calculated by summing allele dosages weighted by GWAS-derived effect sizes. Score performance is internally validated on a held-out portion of the original dataset and externally tested on independent cohorts. Evaluation metrics include the area under the receiver operating characteristic curve (AUC), R², and calibration plots. RESULTS We are currently applying this pipeline to the development and validation of a PRS for gastric cancer (GC) risk in individuals of European ancestry. Despite the growing use of PRS in various malignancies, only a few studies have focused on GC, mostly in Asian individuals, and no validated PRS currently exists for GC in European populations. To address this gap, we are leveraging individual-level genotype data from over 8,000 GC cases and more than 350,000 controls across multiple European cohorts, including the Helsinki Biobank, the Rotterdam Study, the dataset from Hess et al. [5], and the Spanish sample from the Stomach cancer Pooling (StoP) Consortium. These cohorts form the discovery dataset used to conduct GWAS and meta-analysis, followed by PRS construction using a Bayesian framework (PRS-CS). The resulting scores are being externally validated in independent datasets from the UK Biobank and three cohorts from the StoP Consortium (Rome, Latvia, and Lithuania). CONCLUSIONS This pipeline provides a comprehensive and adaptable framework for constructing PRS in oncology, supporting methodological transparency and interoperability. Its modular design ensures flexibility across various datasets and facilitates implementation in clinical research. Future directions include increasing cross-ancestry portability and integrating PRS within clinical decision-making tools.
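Two of the arithmetic steps in the pipeline, inverse-variance fixed-effects meta-analysis and the final weighted sum of allele dosages, are simple enough to sketch directly. The snippet below is a toy illustration with hypothetical function names and made-up numbers; it is not the pipeline's code, which relies on PLINK and PRS-CS.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effects estimate for one SNP.

    Each cohort's effect estimate is weighted by 1 / SE^2.
    Returns the pooled effect and its standard error.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se


def polygenic_score(dosages, effect_sizes):
    """PRS for one individual: sum over SNPs of dosage x effect size.

    Dosages are effect-allele counts (0 to 2, possibly fractional
    after imputation), ordered consistently with effect_sizes.
    """
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Toy example: one SNP observed in two cohorts, then a three-SNP score.
pooled_beta, pooled_se = fixed_effects_meta([0.2, 0.4], [0.1, 0.2])
effects = [0.1, -0.2, 0.3]  # made-up per-SNP effect sizes
scores = [polygenic_score(person, effects)
          for person in ([0, 1, 2], [2, 2, 0])]
```

In the toy meta-analysis the more precise cohort dominates the pooled estimate, which is the intended behaviour of inverse-variance weighting when between-cohort heterogeneity is minimal.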

    Unsupervised Clustering of Optical Coherence Tomography Data in Patients with Leber Hereditary Optic Neuropathy using Non-Negative Matrix Factorization and K-Means: A Comparison

    INTRODUCTION Leber Hereditary Optic Neuropathy (LHON) is a rare genetic neurodegenerative disorder of the optic nerve, caused by mitochondrial DNA (mtDNA) pathogenic variants. It leads to sudden and severe central vision loss, mostly bilateral, typically in young adult males (onset age 18–35), though cases with onset between 2 and 87 years of age have been reported [1]. LHON has incomplete penetrance: all family members may carry the causative mtDNA pathogenic variant, but only some develop the disease phenotype. No definite predictors of disease conversion exist. However, subclinical signs can be detected through Optical Coherence Tomography (OCT), and these differ between asymptomatic LHON carriers and symptomatic patients [2,3,4]. OCT is a non-invasive imaging technique that measures the thickness of retinal layers and optic nerve fibers. We used the DRI OCT Triton (Topcon), a swept-source multimodal imaging OCT device. Asymptomatic LHON carriers may show early retinal alterations, while symptomatic individuals in the acute phase (within 6 months from onset) present distinct OCT phenotypes. Identifying putative OCT parameters that predict clinical conversion is an urgent unmet clinical need. OBJECTIVES To apply unsupervised clustering techniques to OCT data to identify latent subgroups of eyes with similar structural patterns, and to assess their coherence with known clinical classes. METHODS We analyzed 173 eyes from symptomatic LHON patients (acute phase), asymptomatic LHON carriers, and healthy controls, based on 41 OCT parameters related to Ganglion Cell Layer (GCL), Retinal Nerve Fiber Layer (RNFL), and choroidal thickness. Data were normalized and clustered using: (1) Non-negative Matrix Factorization (NMF) via the Brunet and Lee methods, running 50 iterations with cluster number (k) optimization based on internal quality indices [5]; (2) K-means clustering with optimal k selected using Elbow and Gap statistics. 
We also constructed a complete ExpressionSet object including phenotypic and clinical metadata to facilitate integration and visualization [5]. RESULTS All methods identified an optimal partition into 3 clusters, broadly consistent with the clinical classification. Brunet-based NMF outperformed Lee-NMF in capturing the clinical structure (purity 0.601 vs 0.572; entropy 0.744 vs 0.784), likely due to its ability to model sparse data, such as OCT matrices where a few variables dominate the individual profiles. The extracted metagenes (partitions) showed localized structural patterns in RNFL and GCL sectors. K-means also separated groups meaningfully, although with more overlap, especially among symptomatic eyes. CONCLUSIONS Among the clustering methods tested, Brunet-based NMF emerged as the most suitable for unsupervised stratification of LHON patients, carriers, and controls based on OCT data. Its advantage lies in the ability to highlight sparse but informative features (i.e., those OCT parameters that best discriminate between clinical groups), allowing for more distinct phenotypic clustering. These findings support the use of data-driven approaches for structural profiling and the future development of predictive tools for LHON conversion.
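Purity and entropy, the two indices used above to compare cluster assignments with clinical classes, can be computed from a cluster-by-class contingency table. The sketch below assumes the common definitions (purity as the fraction of eyes in the majority class of their cluster; entropy normalized by the log of the number of classes, so that 0 is best and 1 is worst); the abstract does not spell out the exact formulas, and the toy labels are invented.

```python
import math
from collections import Counter, defaultdict


def contingency(clusters, classes):
    """Count, for each cluster, how many members fall in each class."""
    table = defaultdict(Counter)
    for cluster, label in zip(clusters, classes):
        table[cluster][label] += 1
    return table


def purity(clusters, classes):
    """Fraction of samples belonging to the majority class of their cluster."""
    table = contingency(clusters, classes)
    return sum(max(row.values()) for row in table.values()) / len(clusters)


def entropy(clusters, classes):
    """Class entropy within clusters, normalized to [0, 1].

    Requires at least two distinct classes (the normalizer is
    log2 of the number of classes).
    """
    table = contingency(clusters, classes)
    n_classes = len(set(classes))
    total = 0.0
    for row in table.values():
        size = sum(row.values())
        total += sum(n * math.log2(n / size) for n in row.values())
    return -total / (len(clusters) * math.log2(n_classes))


# Invented toy labels: two clusters versus two clinical classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["patient", "patient", "carrier", "carrier", "carrier", "carrier"]
toy_purity = purity(clusters, classes)
toy_entropy = entropy(clusters, classes)
```

Higher purity and lower entropy both indicate closer agreement with the clinical classes, which is the direction of the Brunet versus Lee comparison reported in the results.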

    Improving Calibration Assessment near Clinical Thresholds: The Bayesian Calibration Error

    INTRODUCTION Calibration of predictive models is essential to ensure the clinical reliability of risk estimates, particularly when decisions are based on well-defined probability thresholds. However, especially in machine learning (ML) applications, calibration is often overlooked, and model performance is typically evaluated using discrimination metrics alone [1,2]. Several calibration metrics have been proposed, including the Brier Score, Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and Integrated Calibration Index (ICI). Each of these has limitations: for example, the Brier Score reflects a global average and may mask local errors; ECE and MCE are highly sensitive to binning strategies and become unstable with limited data; the ICI, while more robust, does not focus specifically on clinically relevant thresholds [3–5]. As a result, these metrics may fail to detect or emphasize calibration errors in the areas most critical for clinical decision-making. OBJECTIVES To introduce the Bayesian Calibration Error (BCE), a metric that quantifies both the magnitude and concentration of miscalibration around a clinically relevant threshold, and to evaluate its use alongside the Absolute Calibration Error (ACE). METHODS BCE integrates three components: (i) quantile-based adaptive binning, (ii) a Bayesian formulation to estimate local calibration error (LCE), which accounts for the number of events in each bin rather than relying solely on observed proportions, and (iii) a Gaussian weighting function centered around the decision threshold t. For each bin i, the mean predicted probability $\bar{p}_i$ is compared with the expected value of the observed frequency, modeled using a non-informative Beta(1,1) prior. The posterior distribution becomes $\mathrm{Beta}(1 + k,\ 1 + n - k)$, where k is the number of events and n is the number of observations in the bin. 
The local calibration error (LCE) is then defined as $\mathrm{LCE}_i = \left| \bar{p}_i - \frac{k_i + 1}{n_i + 2} \right|$, where $\bar{p}_i$ is the mean predicted probability in bin i and $(k_i + 1)/(n_i + 2)$ is the posterior mean of the event frequency given $k_i$ events among $n_i$ observations. After defining a decision threshold t (i.e., a predicted probability associated with a clinical "action"), derived through decision curve analysis and/or clinician input, a Gaussian weight is assigned to each bin: $w_i = \exp\left( -\frac{(\bar{p}_i - t)^2}{2\sigma^2} \right)$, where σ (e.g., 0.1) controls the concentration around the threshold. Weights are normalized to have unit mean. BCE is then computed as the weighted average of the LCEs. A high BCE indicates that miscalibration is particularly concentrated around the threshold. We applied this approach to a dataset of 3,672 pregnant women carrying small-for-gestational-age (SGA) fetuses, enrolled in the TRUFFLE 2 multicenter study. Three predictive models (Logistic Regression [LR], Random Forest [RF], and XGBoost) were developed using 11 routine clinical variables to predict adverse perinatal outcomes. The decision threshold was set at t = 0.3 based on prior decision analyses. RESULTS The incidence of adverse outcomes was 13%. ACE confirmed the same performance ranking across models (LR: 0.0198, RF: 0.1126, XGBoost: 0.2290). However, BCE imposed a stricter penalty on RF (BCE = 0.1916) and an even higher one on XGBoost (BCE = 0.2633), indicating that miscalibration was concentrated around the decision threshold. Although the RF model showed a more pronounced local peak of miscalibration, XGBoost had a broader spread of error in bins adjacent to the threshold, resulting in a higher overall BCE. Conversely, the LR model maintained a low BCE (0.0216), suggesting good local calibration. CONCLUSIONS BCE complements global calibration metrics by quantifying whether miscalibration is concentrated around the clinical decision threshold. While ACE reflects the average accuracy of risk estimates across the entire prediction range, BCE captures local consistency near the threshold, offering a more nuanced evaluation. This distinction is particularly important in clinical contexts, where decisions hinge on specific risk cut-offs. 
When a clinical "action" threshold is defined, we recommend reporting both ACE and BCE to support informed model assessment. Moreover, BCE enables identification of models that, despite satisfactory global calibration, underperform near the decision threshold, and conversely, models with less favorable global performance that maintain adequate reliability in clinically critical regions.
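The three components described in the abstract (quantile bins, a Beta(1,1) posterior per bin, Gaussian weights around t) can be sketched as below. This is a minimal illustration, not the authors' implementation: the abstract's formulas are not preserved in this listing, so the absolute difference used for the LCE, the weight form exp(−(p̄ − t)² / (2σ²)), the function name `bayesian_calibration_error`, and the `n_bins` parameter are all assumptions.

```python
import numpy as np

def bayesian_calibration_error(y_true, p_pred, t=0.3, sigma=0.1, n_bins=10):
    """Hypothetical BCE sketch: weighted mean of per-bin local calibration errors."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)

    # (i) Quantile-based binning: sort predictions, split into equal-count bins.
    order = np.argsort(p_pred, kind="stable")
    bins = [idx for idx in np.array_split(order, n_bins) if idx.size > 0]

    p_bar = np.empty(len(bins))
    lce = np.empty(len(bins))
    for b, idx in enumerate(bins):
        n = idx.size                # observations in the bin
        k = y_true[idx].sum()       # events in the bin
        # (ii) Beta(1,1) prior -> posterior Beta(1 + k, 1 + n - k),
        # whose mean is (k + 1) / (n + 2).
        posterior_mean = (k + 1.0) / (n + 2.0)
        p_bar[b] = p_pred[idx].mean()
        # Assumption: LCE taken as an absolute difference.
        lce[b] = abs(p_bar[b] - posterior_mean)

    # (iii) Gaussian weights centered at the threshold t (assumed form),
    # normalized to unit mean as stated in the abstract.
    w = np.exp(-((p_bar - t) ** 2) / (2.0 * sigma ** 2))
    w = w / w.mean()

    # BCE: weighted average of the local calibration errors.
    return float(np.mean(w * lce))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.uniform(0.0, 1.0, size=2000)
    y = rng.binomial(1, p)          # outcomes drawn from the predicted risks
    print(bayesian_calibration_error(y, p, t=0.3, sigma=0.1))
```

With unit-mean weights, a bin far from t contributes little, so two models with the same average error can receive different scores depending on where the error sits relative to the threshold.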

    A Non-Invasive Diagnostic Tool to Rule Out Left Main Stem Stenosis: the MASTER Study

    No full text
    INTRODUCTION In patients with stable coronary artery disease (CAD), medical therapy alone does not increase the risk of ischemic cardiovascular events or death compared with an initial invasive strategy of percutaneous coronary intervention [1]. However, patients with left main coronary artery disease (LMCAD) have a poorer prognosis, and current guidelines recommend revascularization [2]. Therefore, a non-invasive diagnostic method that is less expensive than coronary angiography (CAG) and can reliably identify LMCAD would allow safe and more sustainable treatment of the vast majority of stable CAD patients. OBJECTIVES The MAin stem Stenosis prediction Through Exercise Response (MASTER) multicenter case-control study was designed to develop a diagnostic model for excluding LMCAD among subjects referred for CAG because of documented or suspected myocardial ischemia. METHODS Eligible subjects were patients with suspected CAD and an interpretable exercise stress test (EST) performed before CAG. The training set included patients with a CAG performed between 2010 and 2021 in 5 Italian hospitals; the validation set included patients with a CAG performed between 2022 and 2024 in 3 of the centers used for model training and in two additional hospitals (one in Italy and one in the USA). Cases were patients with either ≥50% left main (LM) stenosis or ≥70% stenoses of both the proximal left anterior descending and proximal circumflex arteries identified through CAG. In all patients, we collected demographic, clinical, laboratory and EST variables. To deal with missing values, we performed a single imputation using predictive mean matching for numerical variables, logistic regression for binary variables and polytomous regression for categorical variables with more than two levels [3]. The diagnostic model was identified by applying logistic regression with Akaike Information Criterion (AIC)-based backward stepwise selection. 
The performance of the selected model in terms of discrimination was quantified by the Area Under the Curve (AUC) with 95% confidence intervals (95% CI). The optimal threshold for the linear predictor corresponded to the point on the ROC curve closest to the top-left corner, assuming a ratio of the cost of misclassifying a case versus a control equal to 100 and a 5% prevalence of LMCAD among patients undergoing CAG for suspected CAD [4, 5]. Based on the optimal threshold, we estimated sensitivity, specificity, and negative and positive predictive values (NPV and PPV). We performed an internal validation, estimating the optimism-adjusted AUC based on 500 samples [6]. We performed external validation in the complete validation set and, as a sensitivity analysis, in the subset of patients from the centers not included in the training set. RESULTS The training set included 219 cases and 554 controls. The selected model showed an AUC of 0.80 (95% CI, 0.76-0.83), which after adjusting for optimism became 0.77 (see Figure 1). The model had a sensitivity of 86.3% and a specificity of 56.2%, with an NPV of 98.7% and a PPV of 9.4%. The validation set included 137 cases and 274 controls, of whom 53 and 91, respectively, came from the two additional centers. The discrimination of the model on the complete validation set decreased, with an AUC of 0.70 (95% CI, 0.66-0.74). At the best threshold identified from the training set, sensitivity was 81.0% and specificity 45.3%. The NPV and PPV were 97.8% and 7.2%, respectively. When we limited the external validation to the two centers not included in the training set, we obtained similar results (AUC 0.72, 95% CI 0.63-0.81, sensitivity 79.2%, specificity 47.2%, NPV 97.7% and PPV 7.3%). 
CONCLUSIONS This large multicenter study showed that, based on demographic, clinical and EST variables, it is possible to rule out the presence of LMCAD in patients able to perform a maximal EST, with a negative predictive value of about 98% and only a small difference between internal and external validation. Such results might influence the clinical management of stable CAD patients by sparing many patients without LMCAD a CAG.
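In a case-control design the proportion of cases is fixed by sampling, so predictive values have to be derived from sensitivity, specificity and an assumed prevalence. The sketch below applies Bayes' rule with the 5% prevalence stated in the methods; the function name `predictive_values` is hypothetical, and whether the authors computed NPV and PPV exactly this way is an assumption, although the reported figures are consistent with it.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

if __name__ == "__main__":
    # Training-set operating point reported in the abstract, 5% prevalence.
    ppv, npv = predictive_values(0.863, 0.562, 0.05)
    print(f"training:   PPV = {ppv:.1%}, NPV = {npv:.1%}")

    # Complete validation set at the same threshold.
    ppv, npv = predictive_values(0.810, 0.453, 0.05)
    print(f"validation: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

At a 5% prevalence the NPV stays high even with modest specificity, which is why the model is framed as a rule-out tool rather than a rule-in tool.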

    Preoperative CT Radiomics for Prognosis Prediction in Resected Early-Stage Non-Small Cell Lung Cancer

    No full text
    BACKGROUND Approximately 20% of non-small cell lung cancer (NSCLC) cases are diagnosed at an early stage (ES), allowing for potentially curative surgical resection. However, a significant proportion of these patients still experience disease recurrence. Although the TNM staging system remains the cornerstone for prognostic assessment and clinical decision-making, it does not fully account for outcome variability among patients within the same stage [1]. This highlights the need for novel biomarkers to complement TNM staging and support more personalized treatment strategies. Despite extensive efforts to identify such biomarkers, stage remains the sole factor currently guiding treatment and follow-up in ES-NSCLC. In this context, radiomics has recently gained attention as a promising, non-invasive tool to enhance prognostic evaluation [2]. OBJECTIVE This study aims to develop and preliminarily validate models that use preoperative CT radiomic features, alone and in combination with clinically relevant factors, to predict post-surgical outcomes for ES-NSCLC. METHODS Imaging and clinical data were obtained from the MIRACLE study, a multicenter, retrospective and prospective investigation aimed at developing a prognostic algorithm by integrating biological, radiological, and clinical information. This project was supported by the Italian Ministry of Health, under the frame of ERA PerMed (project code: ERP-2021-23680708). The current analysis focuses exclusively on retrospective data and preoperative CT images from patients enrolled at IRST-IRCCS between 2018 and 2021. The primary endpoint was disease-free survival (DFS), defined as the time from surgery to disease recurrence or death from any cause, whichever occurred first. The last follow-up update was in January 2024. Tumors were manually segmented by two independent expert radiologists. 
Radiomic features were extracted from preoperative CT scans, acquired with or without contrast medium, using the open-source package PyRadiomics [3]. In some cases, both contrast-enhanced and non-contrast scans were available for the same patient.  Two analytical approaches were employed: one based on an extension of the Cox model, and the other using random survival forests (RSF). For the Cox-based models, radiomic feature selection involved bootstrap resampling, feature inclusion frequency analysis, and consensus clustering. In each bootstrap replicate, an elastic net Cox model was fitted, accounting for within-patient scan correlations. Features most frequently selected were then clustered via consensus clustering using Kendall’s tau distance and complete linkage, and one representative feature per cluster was chosen. For RSF, two modeling strategies were considered: one using all radiomic features, and one incorporating feature selection via hierarchical clustering on Kendall’s tau distances, with one representative feature retained per cluster.   All models underwent hyperparameter tuning using stratified 3-fold cross-validation (CV), and final models were trained with the optimal parameter set. Evaluation was performed using the same 5 repeats of 5-fold CV across models, with concordance index, integrated Brier score, and 3-year time-dependent AUC as performance metrics. Results are reported as mean ± standard deviation across the repeated CV runs.    RESULTS   A total of 78 patients were included, accounting for 115 CT scans. The majority were male (60.3 %), with a median age at surgery of 71 years [IQR: 65–75]. Adenocarcinoma was the most common histotype, observed in 83 % of cases. Most patients (87.2 %) underwent lobectomy, and 68.0 % presented with a stage I tumor. The median follow-up time was 42.5 months (95 % CI: 37.9-45.43) and the median DFS was not reached. Overall, 25 failures were observed.  
From the Cox-based pipeline, two radiomic features (GLCM Cluster Shade and Shape Maximum 2D Diameter Column) were ultimately selected and included in a standard Cox model. This model achieved a C-index of 0.767 ± 0.103, IBS of 0.153 ± 0.032, and 3-year AUC of 0.804 ± 0.136. Adding pathological stage improved performance to a C-index of 0.777 ± 0.098, IBS of 0.152 ± 0.034, and AUC of 0.815 ± 0.134. The stage-only model performed worse across all metrics (C-index: 0.729 ± 0.119; IBS: 0.155 ± 0.042; AUC: 0.739 ± 0.155). Similar patterns were observed with RSF models. The stage-only RSF model yielded a C-index of 0.720 ± 0.120, IBS of 0.163 ± 0.041, and AUC of 0.739 ± 0.160. Incorporating all radiomic features improved performance (C-index: 0.776 ± 0.095; IBS: 0.145 ± 0.041; AUC: 0.828 ± 0.135), but the best results were obtained using selected radiomic features (C-index: 0.788 ± 0.096; IBS: 0.147 ± 0.032; AUC: 0.837 ± 0.115). These included morphology-based (Shape Elongation and Shape Least Axis Length), intensity-based (Firstorder 10th Percentile, Firstorder Entropy, and Firstorder Interquartile Range), and texture-based (GLCM Difference Variance and GLCM ID) features. Adding stage to the selected radiomics model did not yield further improvement. Additional analyses incorporating patient characteristics (e.g., age and sex) did not improve predictive performance and are not reported. CONCLUSIONS Our study shows that CT-derived radiomic features improve prognostic performance compared to stage alone. Although these results are promising, external validation on an independent dataset is essential to confirm their generalizability. Future work will also focus on investigating the explainability of the models to better understand the biological relevance of selected radiomic features.
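The RSF feature-selection step (hierarchical clustering on Kendall's tau distances, one representative per cluster) can be illustrated with SciPy as below. This is a sketch under stated assumptions, not the study's pipeline: the distance 1 − |tau|, the cut height `cut=0.5`, the medoid rule for picking the representative, and the function name `select_representatives` are all hypothetical choices. Complete linkage is used here for simplicity; the abstract names it only for the Cox-based consensus clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import kendalltau

def select_representatives(X, names, cut=0.5):
    """Cluster features on 1 - |Kendall tau| and keep one medoid per cluster."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]

    # Pairwise distance matrix between features (assumed form: 1 - |tau|).
    dist = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            dist[i, j] = dist[j, i] = 1.0 - abs(tau)

    # Complete-linkage hierarchical clustering on the condensed distances.
    Z = linkage(squareform(dist, checks=False), method="complete")
    labels = fcluster(Z, t=cut, criterion="distance")

    selected = []
    for lab in np.unique(labels):
        members = np.flatnonzero(labels == lab)
        # Representative: the member with the smallest mean distance to the
        # other members of its cluster (first one wins on ties).
        within = dist[np.ix_(members, members)].mean(axis=1)
        selected.append(names[members[np.argmin(within)]])
    return selected

if __name__ == "__main__":
    a = np.arange(10.0)
    b = 2.0 * a + 1.0                       # monotone in a, so tau = 1
    c = np.array([0, 9, 1, 8, 2, 7, 3, 6, 4, 5], dtype=float)
    X = np.column_stack([a, b, c])
    print(select_representatives(X, ["a", "b", "c"]))
```

A rank-based distance such as Kendall's tau groups features that are monotonically related even when the relation is non-linear, which is common among radiomic features computed from the same region.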

    The Importance of Hierarchical Regression in Public Health Data Modeling

    No full text
