An Interpretable Machine Learning Framework for Detecting Phishing URLs Based on Lexical Features
Phishing attacks represent one of the most significant and persistent threats in the cybersecurity landscape, with attackers increasingly using sophisticated URL manipulation techniques to deceive users and steal sensitive information. Traditional detection methods, which rely primarily on blacklists and heuristic rules, struggle to identify zero-day phishing URLs that have not yet been catalogued in security databases. This research addresses this critical gap by developing an interpretable machine learning framework for detecting phishing URLs using lexical, structural, content-based, and domain metadata features. The study employs a comprehensive dataset of 11,430 labeled URLs (5,715 legitimate and 5,715 phishing) with 87 extracted features, categorized into structural (49 features), content-based (15 features), and metadata (23 features) attributes. Through extensive exploratory data analysis, the research identifies key discriminative patterns between legitimate and phishing URLs, including URL length, domain age, page rank, and Google indexing status. Four machine learning algorithms were trained, optimized, and evaluated: Decision Tree, Random Forest, Support Vector Machine (SVM), and LightGBM. The best-performing model, LightGBM with hyperparameter tuning, achieved an accuracy of 96.94%, F1-score of 0.9694, and ROC-AUC of 0.9942. Principal Component Analysis (PCA) was applied to reduce dimensionality from 87 features to 29 components while retaining 95% of variance, addressing multicollinearity and improving computational efficiency. To enhance model transparency and user trust, SHAP (SHapley Additive Explanations) values were computed to provide interpretable explanations for model predictions. The analysis revealed that domain reputation metrics, particularly Google index status and page rank, are the most influential features in distinguishing phishing from legitimate URLs.
A web-based interface using Gradio was developed to enable real-time URL classification with explainable predictions. The research demonstrates that ensemble methods, particularly gradient boosting algorithms like LightGBM, outperform traditional classifiers for phishing detection. The integration of explainable AI techniques provides actionable insights into model decision-making, addressing the “black box” problem common in machine learning security applications. This work contributes to the development of practical, scalable, and interpretable phishing detection systems suitable for real-time deployment in resource-constrained environments.
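As a rough illustration of what lexical URL features can look like, the sketch below derives a handful of them with the Python standard library only. The feature names and the selection are assumptions made for illustration; they are not the 87 features used in the study, and the function `lexical_features` is hypothetical.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features of a URL.

    Feature names are hypothetical and are not taken from the study's dataset.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Rough count: dots in the hostname beyond the one before the TLD.
        "num_subdomains": max(host.count(".") - 1, 0),
    }


if __name__ == "__main__":
    print(lexical_features("https://example.com/login"))
```

A feature dictionary of this kind could be fed to any tabular classifier; content-based and domain metadata features (such as domain age or page rank) would need external lookups and are not covered here.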
Predicting inmate overcrowding to improve facility management
Overcrowding is a major challenge for correctional systems because conventional forecasting techniques are often inaccurate and inadequate. This paper addresses the problem by constructing and testing a machine learning-based model to predict facility-level overcrowding. Using an XGBoost Regressor on a dataset of U.S. correctional facilities, the research identified key structural drivers but showed that static facility attributes alone have limited predictive power (R² ≈ 0.18). The discussion shows that overcrowding is a non-linear, complex problem driven not only by facility characteristics but also by larger, unmeasured regional influences, for which geographical characteristics are a strong proxy. The results demonstrate a scalable, interpretable forecasting system that supports a transition to proactive, data-driven strategic planning and helps ensure safety and efficiency in facilities.
Predictive Modelling of Long-term Outcomes in Myalgic Encephalomyelitis/Chronic Fatigue Syndrome using Machine Learning
Myalgic Encephalomyelitis/Chronic Fatigue Syndrome is a complex chronic illness characterized by debilitating, heterogeneous symptoms. The condition exhibits highly variable long-term outcomes, posing significant challenges for patient management, research, and clinical practice. Currently, there is a lack of tools for predicting individual patient trajectories, creating a substantial prognostic gap that is compounded by the prevailing research focus on diagnostics over prognosis. This study addresses that gap by investigating whether multidimensional baseline patient-reported outcome measures (PROMs), analysed using advanced and explainable machine learning methods, can meaningfully predict heterogeneous 12-month outcomes in ME/CFS, and by examining what patterns of predictability themselves reveal about the structure of outcome. For the field of applied health data science and machine learning, this represents a fundamental theoretical and practical challenge, as predictive models must operate under conditions of multidimensional symptom burden, outcome discordance, class imbalance, missingness, and non-linear disease behaviour. A secondary analysis was conducted on a longitudinal UK specialist-service cohort, restricted to adults with complete baseline and 12-month follow-up data (n = 438). Baseline measures comprised multidimensional PROMs capturing fatigue, functional impairment, mood, pain, sleep, activity limitation, and cognition. Follow-up PROMs and Clinical Global Impression scales were used to derive continuous fatigue change scores, binary improved versus worsened trajectories, and multi-pattern outcome classifications. Supervised machine learning models, including linear and logistic regression, regularised regression, random forest, gradient boosting, and support vector machines, were developed within a reproducible train-test and cross-validation framework.
Model performance was evaluated using error metrics, discrimination, calibration, and decision curve analysis. Unsupervised methods and explainability techniques, including principal component analysis, clustering, partial dependence, and SHAP values, were applied to characterize predictor structure and model behaviour. The models explained a meaningful proportion of variance in continuous fatigue change and achieved modest but clinically relevant discrimination for improved versus worsened trajectories. Across all modelling approaches, baseline functional disability, daytime sleepiness, pain intensity, and fatigue severity consistently emerged as the strongest predictors of subsequent deterioration. In contrast, improvement remained intrinsically difficult to forecast, with weak and unstable predictor profiles across models. Clustering analyses identified interpretable baseline severity subgroups but failed to reliably distinguish long-term prognostic classes. Among all approaches, random forest models demonstrated the most favourable balance of discrimination, probability calibration, and net clinical benefit, particularly for early identification of patients at elevated risk of deterioration. Using explainable machine learning applied to longitudinal PROMs, the findings show that deterioration in ME/CFS can be identified with clinically meaningful reliability, whereas recovery consistently resists prediction. This asymmetry suggests that outcome fragmentation is a structural property of the illness rather than a modelling limitation. Collectively, the findings support a conceptual reframing of ME/CFS outcome as a fragmented, multisystem construct in which deterioration behaves as a coherent, machine detectable state, whereas recovery is plural, weakly structured, and poorly predictable from PROMs alone. This asymmetry demonstrates that recovery is not the inverse of worsening and that outcome domains decouple across functional systems over time. 
Practically, the results indicate that PROM-based machine learning holds promise as an early-warning tool for deterioration but should not be interpreted as a confirmatory prognostic instrument for recovery. Future research should prioritize external validation, multimodal data integration, and the development of prognostic frameworks explicitly designed to accommodate outcome fragmentation and heterogeneous illness trajectories.
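The abstract evaluates models with decision curve analysis, whose central quantity is net benefit at a chosen risk threshold. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) · p_t/(1 − p_t), where p_t is the threshold probability. The function name, the toy labels, and the toy probabilities are illustrative assumptions and are not taken from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    y_true: iterable of 0/1 outcomes (1 = event, e.g. deterioration).
    y_prob: iterable of predicted probabilities, same length as y_true.
    threshold: risk threshold p_t, strictly between 0 and 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    pairs = list(zip(y_true, y_prob))
    n = len(pairs)
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


if __name__ == "__main__":
    labels = [1, 0, 1, 0]          # toy data, for illustration only
    risks = [0.9, 0.8, 0.3, 0.1]
    for pt in (0.2, 0.5):
        print(pt, net_benefit(labels, risks, pt))
```

Plotting this value over a range of thresholds, alongside the "treat all" and "treat none" strategies, gives a decision curve.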
Labor Transportation Simulation and Optimization - A Case Study
Transportation systems play a critical role in supporting economic activity, workforce mobility, and service delivery in urban and industrial environments. In many organizations, fixed-time transportation systems are essential for ensuring that employees are transferred from accommodation facilities to work locations within strict operational time windows. The case study company in this research operates a large-scale labour transport service between multiple accommodation camps and client sites in Dubai. Historically, buses were dispatched in an ad hoc manner, relying on supervisors’ experience rather than a systematic planning approach. This often led to low seat utilisation, unnecessarily long routes, higher fuel costs, and avoidable CO₂ emissions. The objective of this study is to develop and evaluate an optimization and simulation framework that assigns and routes a heterogeneous bus fleet more efficiently, while satisfying all worker demand within operational time windows. The methodology combines three main components. First, operational data were cleaned and aggregated into morning demand groups by camp and time window in Excel. Second, an assignment framework was implemented using Excel Solver, a discrete-event model in Arena, and OptQuest on top of the Arena model, both to assign buses of different capacities to each demand group and to determine which approach makes the best use of the available buses. Finally, a routing and distance-calculation tool was built in Excel using the haversine formula together with a custom VBA macro implementing a nearest-neighbour heuristic. The results show that several bus combinations can satisfy the required demand, but the Solver-based configuration, combined with the macro-driven routing model, delivers the best overall performance.
Compared with TransCo’s informal dispatching practice, the final solution significantly reduces total operating cost (by 25%) and substantially decreases CO₂ emissions (by 81%), while still serving all workers within the available shift time.
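The routing step described above pairs the haversine formula with a nearest-neighbour heuristic. The study implemented this in Excel with a VBA macro; the sketch below is a Python illustration of the same general idea, not the author's macro. The Earth radius value, the function names, and the sample coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_neighbour_route(start, stops):
    """Greedy route: from the current point, always visit the closest unvisited stop.

    Returns the visiting order (including the start) and the total distance in km.
    The heuristic is fast but does not guarantee the shortest possible route.
    """
    route = [start]
    remaining = list(stops)
    total = 0.0
    current = start
    while remaining:
        nxt = min(remaining, key=lambda p: haversine_km(current, p))
        total += haversine_km(current, nxt)
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route, total


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    camp = (25.00, 55.10)
    sites = [(25.20, 55.30), (25.05, 55.15), (25.10, 55.20)]
    order, km = nearest_neighbour_route(camp, sites)
    print(order, round(km, 2))
```

Haversine distances are straight-line approximations over the Earth's surface, so actual road distances will generally be longer.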
Towards a Foundational Framework for Real-World Active Learning: Theory, Algorithms, and Applications
While supervised learning has seen great success in the modern machine learning era, the challenge of obtaining high-quality labeled training data still exists. In many knowledge-rich domains, we still face the problem of letting machine learning models learn well using a limited number of labels. Active learning (AL) has been a prominent learning paradigm that deals with such problems. This thesis reviews the classical challenges of AL, summarizes how our prior work has advanced the field, and charts a course for adapting AL to realistic and challenging scenarios. We begin by discussing past work on standard scenarios for AL, including multi-class and multi-label problems. In these traditional settings, we focus on uncertainty quantification, small data modeling, and balancing exploration and exploitation. We then explore emerging topics that address challenges related to data scalability, label noise, and limited evaluation budgets. Building on this foundation, we propose thesis projects aimed at enhancing current research and expanding its scope. Within the expanded scope of real-world AL challenges, we have proposed and studied the following important projects towards building a foundational framework: evidential AL for multi-label models, adaptive AL principles for noisy label learning, and principled active testing-while-learning frameworks. Later, we further apply AL methods to physics-informed machine learning models. Through these efforts, we aim to advance the methodology and application of AL, laying the groundwork for data-efficient, trustworthy, and scalable machine-learning systems.
On the Adaptation of Latent Dynamics Models
Predicting future states of high-dimensional, partially observed dynamical systems - such as cardiac electrical propagation - is crucial for advancing fields like healthcare and physics. While univariate time-series forecasting is well-explored, high-dimensional time-series forecasting presents unresolved challenges. These challenges include the computational burden of processing high-dimensional data and the difficulty of accessing the system’s underlying dynamics directly. Classical optimization and analytical approaches become impractical as the dimensionality increases, leading to a growing interest in data-driven deep learning models, particularly those based on latent dynamics functions. Latent dynamics models provide an efficient way to map high-dimensional observations into lower-dimensional latent spaces, where a dynamics function learns to predict future states. The latent space acts as a coordinate system that simplifies the system’s underlying dynamics, optimizing for representations that expose simple, interpretable patterns. This transformation offers computational advantages and aligns with the theoretical assumption that many complex systems exhibit simple underlying dynamics. However, existing latent dynamics approaches typically focus on modeling individual systems under fixed training distributions, limiting their ability to adapt to related-but-different conditions at test-time. This dissertation proposes novel methods to extend latent dynamics functions to more complex forecasting scenarios, addressing the need for adaptability across heterogeneous environments. Five key research questions guide this work: (1) How can unsupervised latent dynamics functions be learned while integrating known domain knowledge? (2) How can a core latent dynamics function be adapted to downstream systems using control parameters? (3) How can models learn to generalize across environments with limited training samples? 
(4) How can a latent dynamics function continually adapt to new systems without forgetting previously learned dynamics? (5) How can adaptive latent dynamics functions be effectively deployed in complex clinical settings? The core contributions include: developing a unifying framework for latent dynamics, creating an unsupervised learning model with physics-informed supervision, introducing controllable latent dynamics for parameterized adaptation, extending models with meta-learning to generalize across tasks, and designing continual learning strategies to prevent catastrophic forgetting in dynamic environments. Finally, we propose a continual meta-learning framework for learning personalized neural surrogates of cardiac simulations in non-stationary environments. Together, these contributions advance the state of latent dynamics learning, equipping models to perform efficiently under complex, real-world forecasting conditions.
Eccentricity in Binary Black Hole Mergers
Recent studies have shown that orbital eccentricity may serve as an important indicator of dynamical assembly as a formation channel for binary black holes. In contrast to binaries that evolve in isolation and circularize through long-term gravitational radiation, dynamically assembled binaries, such as those formed through gravitational encounters in dense stellar environments, may retain measurable eccentricity up to the point of merger. Because eccentricity leaves a distinct imprint on gravitational-wave signals, it provides a powerful observational handle on the astrophysical origins of compact binaries. Detecting this signature requires sensitive and accurate modeling of gravitational-wave signals, particularly as binaries enter the LIGO frequency band where most circular systems dominate. Although no confident detection of eccentricity has yet been made, the increasing sensitivity of the current detector network (LIGO, Virgo, and KAGRA) and the growing number of observed events make the prospect of identifying eccentric mergers increasingly promising. In this work, I investigate multiple approaches to quantifying and characterizing orbital eccentricity in binary black hole mergers. I begin by evaluating the effectiveness of the RIFT (Rapid Inference via Iterative FiTting) parameter estimation pipeline in recovering eccentric signals. I describe several improvements made to the RIFT framework to better handle the complex waveform morphologies introduced by eccentricity, including modifications to the likelihood evaluation and sampling strategies. In parallel, I develop and explore direct, waveform-based methods for estimating eccentricity from the asymptotic gravitational-wave signal itself, without requiring a full parameter estimation analysis. These methods provide a complementary and computationally efficient path toward identifying eccentric systems in large datasets. 
Beyond these core investigations, I also describe contributions to several collaborative and community-driven efforts within the field. These include integrating eccentric numerical relativity (NR) waveforms directly into RIFT analyses to validate parameter recovery, performing eccentric parameter estimation for both individual event studies and population-level analyses, and contributing to the development of an effective-one-body (EOB) waveform model that incorporates both eccentricity and spin precession. Collectively, these efforts aim to advance our ability to detect, interpret, and model eccentric binary black hole mergers, paving the way for a more complete understanding of the diverse dynamical processes that shape compact binary populations in the Universe.
Military Academic-Scientific Production in Times Of Generative AI: Ethical Boundaries and Challenges for Strategic Stability
The adoption of emerging and disruptive technologies, including Artificial Intelligence (AI), has driven profound transformations across various sectors and in human behavior in recent years, permeating nearly all fields of knowledge. In this context, the use of AI as a tool to support academic-scientific production cannot be considered an exception. Additionally, there is a growing need for AI literacy among its users, for regulation, and for the establishment of criteria to ensure these technologies are employed ethically and responsibly. In light of this global trend, the Brazilian Army has been seeking ways to regulate, manage, and optimize the use of AI in accordance with the ethical principles and values of the Institution. The objective of this paper is to analyze the use of Generative AI technologies in military academic-scientific production with regard to ethical boundaries and to identify potential implications for strategic stability. We carried out applied, exploratory research with a qualitative approach, based on a narrative literature review encompassing scientific studies indexed in the SCOPUS, WoS, Science Direct, SciELO, and DOAJ databases. The main findings reveal the irreversible use of generative AI in academia, which has driven scientific advancements and significantly contributed to the spread of open science. However, the study also found potential biases and distortions in AI-generated content, which may pose risks to the academic development and education of future military leaders as well as to the country's strategic stability.
Hiding in Plain Sight
This thesis considers how perceptual, emotional, and behavioral patterns (often inherited, unnoticed, or obscured) can be revealed through the optical phenomena of glass. It connects visual experience with themes of grief, memory, and identity, using glass to explore the boundary between what is seen and what is felt. Through material investigations in refraction, reflection, and spatial arrangement, the studio research examines how glass can both obscure and reveal, echoing the instability of perception itself. These investigations, paired with a focus on viewer interaction, suggest that patterns hiding in plain sight become visible through intentional acts of perception and reflection.