Statistical methods in financial market dynamics and portfolio strategies

Author: Jin Qi
Publication date: 28/07/2025

This thesis uses statistical methods to explore topics in financial economics. In particular, we focus on topics related to financial market dynamics and portfolio strategies. This thesis makes four contributions to the literature. First, we introduce a method to detect linear and nonlinear lead-lag relationships in stock returns that uses pairwise Lévy-area and cross-correlation to rank assets from leaders to followers. We construct portfolios by trading followers based on leaders’ prior returns, hedged with an SPY ETF. With data from 1963 to 2022 for over 500 stocks, our portfolios achieve annualized returns over 20% and Sharpe ratios over 2. The relationships we discover are only partially explained by traditional factors like size or sector. Our results support the slow information diffusion hypothesis as daily rebalanced portfolios outperform less frequently rebalanced ones. Second, we study the effect of intraday volume shocks on stock returns during overnight and intraday periods. We discover a significant positive relationship between volume shocks and subsequent overnight returns, while no such effect exists during the next intra-day session. Well-known asset pricing risk factors and common explanations that associate abnormal trading volume with investor attention and cost of capital cannot account for the distinct intraday and overnight patterns we observe. We employ linear and machine learning models to forecast volume shocks and to construct portfolios that monetize the positive correlation between volume shocks and overnight stock returns. Our approach addresses the issue that volume shock is only known after the close auction; we show that this issue of non-tradability does not explain the observed relationship between volume shock and overnight stock returns. Third, we propose a framework to construct statistical arbitrage portfolios with graph clustering algorithms. 
First, we use five clustering methods to partition the correlation matrix of market residual returns of stocks into clusters. Next, we construct and evaluate the performance of mean-reverting statistical arbitrage portfolios within each cluster. We show that our proposed framework generates profitable trading strategies with over 10% annualized returns and statistically significant Sharpe ratios above one. The performance of our statistical arbitrage portfolios is neutral to the market and cannot be fully explained by intra-industry mean-reversion effects. In the last part, we examine the investment value in sell-side analyst price targets. We treat each analyst as a portfolio manager and use their price targets to construct 12-month implied return forecasts and self-financing long-short portfolios for each analyst. Our empirical analysis shows that while the average analyst does not generate statistically significant alpha relative to the returns of a long-only portfolio benchmark, a subset of analysts exhibits persistent alpha. Motivated by this heterogeneity, we introduce a “fund-of-analysts” framework that first predicts analyst performance and then dynamically allocates weights across analysts based on predicted analyst performances. Our results show that this meta-portfolio strategy can yield significant alpha over long-only benchmarks
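The lead-lag ranking described in the first contribution can be illustrated with a minimal sketch. It assumes each asset is given as a path (for example, cumulative returns) and uses a discrete signed Lévy area as the pairwise score. The function names `levy_area` and `rank_leaders`, the sign convention, and the row-sum ranking rule are assumptions made for this illustration, not the thesis's actual procedure.

```python
import numpy as np


def levy_area(x, y):
    """Discrete signed Levy area between two equal-length paths x and y.

    Approximates 0.5 * integral of (x dy - y dx), with both paths shifted
    to start at zero. Under this sign convention, a positive value
    suggests that x tends to move before y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = np.diff(x), np.diff(y)
    x0, y0 = x[:-1] - x[0], y[:-1] - y[0]
    return 0.5 * float(np.sum(x0 * dy - y0 * dx))


def rank_leaders(paths):
    """Rank the columns of `paths` (time x assets) from leader to follower.

    Builds the antisymmetric matrix of pairwise Levy areas and scores each
    asset by its row sum (an assumed, simple aggregation rule).
    """
    paths = np.asarray(paths, dtype=float)
    n = paths.shape[1]
    area = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            area[i, j] = levy_area(paths[:, i], paths[:, j])
            area[j, i] = -area[i, j]
    scores = area.sum(axis=1)
    order = np.argsort(-scores)  # highest score first: strongest leader
    return order, area
```

On noisy return data a single full-sample Lévy area is a noisy statistic, so in practice one would likely compute it over rolling windows and normalise it before ranking.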

Oxford University Research Archive

Network analysis and data science for finance: from traditional markets to decentralised exchanges

Author: Miori Deborah
Publication date: 13/01/2025

Research on financial markets often confines itself to in-depth analyses of time series of asset prices, even though we are now in an era of unprecedented wealth of data that offers boundless opportunities for wider investigations. This thesis aims at broadening our understanding of traditional and decentralised market ecosystems by taking advantage of “unconventional data”. The latter are labelled as such either for their origin (i.e. being alternative data), or for their extensiveness (e.g. spanning multiple asset classes). Given the inherent higher complexity of our data, we leverage data science advancements to analyse them thoroughly. Recurrent techniques employed in this work include network science for capturing relationships among entities of interest, and clustering methods for dimensionality reduction and aggregation of information. Within traditional finance ecosystems, we investigate three sources of possible novel market insights, which lead to alternative risk-monitoring tools. The first source lies in institutional investors’ holdings, which are found to signal crowding in trades, after aggregating the bipartite network of funds and their assets. Second, we consider a corpus of economic news with available timestamps. By modelling and clustering the interlinkage of concepts discussed within such news, we discover the major narratives of interest over time and map entropy in their state to market dislocations. Third, we study returns of a heterogeneous set of indices belonging to multiple asset classes, and characterise their network of evolving correlations to identify market regimes that are found to have distinguishable macroeconomic features. Within decentralised finance ecosystems, we instead take direct advantage of the extensive and meticulous data-recording of blockchains.
The trading activity of agents on multiple tokens is used to construct a network of transactions for each one of them, and clustering the set of such graphs allows us to identify interpretable “species” of traders. Lastly, we analyse data on liquidity provision, consumption, and price formation on competing decentralised exchange venues, to find a model for the prediction of incoming trading volume at block level.

Oxford University Research Archive

Multi-asset financial markets: mathematical modelling and data-driven approaches

Author: Vuletic Milena
Publication date: 18/11/2025

This thesis develops statistical models and data-driven algorithms for modelling, simulation, and forecasting asset price dynamics in financial markets with many instruments and risk factors, focusing on equity and option markets. In such multi-asset settings, modelling of co-movements is crucial in order to implement correct hedging strategies, and generate realistic portfolio dynamics and loss distributions. Models also need to respect arbitrage relations linking prices of various instruments, which often imposes nonlinear constraints on state variables. Chapter 1 introduces notation, outlines the thesis structure, and compares various types of generative models, establishing shared themes and contributions. Chapter 2 presents a computationally tractable method for simulating arbitrage-free implied volatility surfaces. Our approach reconciles static arbitrage constraints with a realistic representation of statistical properties of implied volatility co-movements. Chapter 3 introduces VolGAN, a generative model for arbitrage-free implied volatility surfaces. The model is trained on time series of implied volatility surfaces and underlying prices, and is capable of generating realistic scenarios for the joint dynamics of the implied volatility surface and the underlying asset. Chapter 4 proposes a non-parametric data-driven methodology for hedging using generative models. In contrast with conventional model-based hedging approaches relying on sensitivity analysis of model pricing functions, our approach uses (conditional) generative models to simulate realistic market scenarios given current market conditions, and computes hedging strategies which minimise risk across these scenarios. The approach incorporates trading costs, leads to an optimal selection of hedging instruments, and adapts to market conditions.
We illustrate the effectiveness of this methodology for hedging option portfolios using VolGAN, and compare its performance with delta and delta-vega hedging. Chapter 5 investigates the use of Generative Adversarial Networks (GANs) for probabilistic forecasting of financial time series. To this end, we introduce a novel economics-driven loss function for the generator, rendering GANs more suitable for a classification task. Our approach, named Fin-GAN, moves beyond pointwise forecasts and allows for uncertainty estimates. Numerical experiments on equity data showcase the effectiveness of our proposed methodology, which achieves higher Sharpe Ratios compared to commonly used supervised learning models, such as LSTM and ARIMA. Chapter 6 explores the construction of conditional generative models in a multi-asset setting by leveraging cross-asset relationships through a graph-based probabilistic ensemble framework. Rather than combining point forecasts, our method ensembles full conditional return distributions. The graph captures the transferability of predictive information across assets, with edge weights learned via a profit-maximisation objective reformulated as a LASSO regression. This computationally efficient approach induces sparse and interpretable weights. We apply our method to Fin-GAN, and demonstrate that the LASSO-induced graph outperforms benchmarks, including asset-specific models, return correlation-based graphs, and graph structures based on historical PnL or Sharpe Ratio attained by single-asset generators

Oxford University Research Archive

Statistical modeling and simulation of limit order markets

Author: Prenzel Felix
Publication date: 04/07/2024

This thesis focuses on the statistical modeling of order flow in limit order markets and the development of data-driven approaches for the simulation of limit order book dynamics. In the first part, after introducing various mathematical representations of limit order books (LOB) reflecting different degrees of granularity and information, we investigate the heterogeneity of order flow submitted through brokers using proprietary execution data and unsupervised learning techniques. This results in a statistical description of client order flow as a superposition of four components representing four heterogeneous types of agents – Quantitative, Day VWAP, Signal and Res – which differ through their trade frequency, intraday activity patterns and order sizes. The second part of the thesis develops data-driven simulation methods for limit order book dynamics. We first present a generative model for transitions of limit order book snapshots using generative adversarial networks (GANs). The model allows efficient simulation of snapshot time series reproducing desired properties and furthermore automatically reflects market impact when interacting with the order book state. Lastly, we propose a hierarchical approach to improve existing LOB simulation methods. In particular, we present a probabilistic model to generate calibrations of order flow models. This preserves theoretical properties of the underlying base model and allows realistic features of intraday dynamics, such as U-shaped intraday seasonality and trends, volatility dependency and market disruptions, to be incorporated into LOB simulations.

Oxford University Research Archive

Company relationship modeling and graph neural networks for financial market forecasting

Author: Luo Chang
Publication date: 26/03/2025

This thesis presents a series of studies for financial market forecasting, leveraging various graph neural networks based on a novel company relationship modeling scheme. Departing from the conventional view of treating companies as standalone entities, this thesis models them as interconnected nodes within a Semantic Company Relationship Graph (SCRG). To achieve this, statistics on the co-occurrence of company names are compiled from a comprehensive financial news corpus, reflecting patterns of frequent business interactions collectively. These statistics are then used to create vector embeddings for each company, thus positioning all companies within the same semantic relationship space. The cosine similarities between these vectors are employed to define the numerical interrelationships among companies, thereby constructing the SCRG. This innovative relationship modeling scheme is grounded in the principles of statistical semantics and the distributional hypothesis in linguistics, which posit that patterns of word co-occurrence in a large corpus can effectively delineate semantic interconnections. Building on the SCRG’s relationship foundation, this thesis explores the adaptation of spatial-temporal graph neural networks for predicting stock movements. A key innovation is the introduction of the Non-Independent and Identically Distributed Spatial-Temporal Graph Neural Network (NIST-GNN). This model is uniquely designed to integrate features from neighboring companies and domestic historical timeseries data. It effectively addresses the temporal non-IID characteristics of stock data, enabling a more nuanced analysis of each stock’s temporal dynamics. Empirical results demonstrate that this methodology significantly outperforms existing benchmarks in profitability with better risk management. 
The experimental findings reveal insights into the dynamics of information spread within the US market, uncovering a typical one-day delay in the diffusion of public information among interrelated companies, thus challenging traditional views on market efficiency. Secondly, this thesis investigates the inference of absent news sentiment during periods with no media coverage, extending the use of the SCRG. News sentiment is a crucial proxy for investor sentiment and is widely used in asset pricing. However, consistent media coverage is not guaranteed for all companies, many of which experience “media silent” periods, especially as media attention shifts towards more sensational business news. An analysis of 14 years of news data reveals that even well-known companies lack daily news coverage on almost half of the trading days. Traditional missing value imputation (MVI) methods are abundant but generally insufficient for the finance context, characterized by complex spatial and temporal interconnections among companies. To address these challenges, this thesis proposes a Non-IID Spatial-Temporal Chebyshev Network (NIST-Cheb) to leverage these relationships for inferring nonexistent news sentiment. A masked semi-supervised training approach is introduced to enhance the utility of the available sentiment data. The efficacy of this method is systematically validated through error-based metrics and empirical trading results. Experimental findings indicate that asset pricing models incorporating NIST-Cheb’s estimated sentiment scores significantly outperform traditional baselines. Theoretical contributions also discuss the spillover effects of news sentiment, emphasizing the importance and feasibility of using spatial and temporal sentiment information to infer absent news sentiment. The concluding part of this thesis focuses on the prediction of intraday market index movements, utilizing the SCRG as a foundational relationship prior for industry hierarchical analysis.
Previous studies on market index prediction have leaned heavily on machine learning strategies that predominantly targeted the temporal dynamics of market indices, often overlooking the valuable insights from the market microstructures of the underlying industrial clusters. The emergence of hierarchical graph pooling techniques marks a new direction in this field. This thesis pioneers the use of these techniques by framing market index prediction as a graph classification task and introduces a FinPool graph pooling operator, designed for the hierarchical feature representation of industrial clusters in the financial market. To optimally apply FinPool operators for index prediction, two innovative prediction frameworks, Stacked FinPool and Multi-tier Attention FinPool, are proposed, based on the insights of the Global Industry Classification Standard (GICS). Empirical trading evaluations indicate a notable improvement in profits and risk-adjusted returns, significantly outperforming conventional benchmarks. These findings not only challenge the Efficient Market Hypothesis but also demonstrate the untapped predictive power inherent in the microstructural details of market constituents.
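The SCRG construction step, in which cosine similarities between company embeddings define the edge weights, can be sketched as follows. This is a minimal illustration that assumes the embeddings are already available as rows of a matrix. The function name `cosine_similarity_graph` and the `threshold` parameter are hypothetical and do not come from the thesis.

```python
import numpy as np


def cosine_similarity_graph(embeddings, threshold=0.0):
    """Weighted adjacency matrix from pairwise cosine similarities.

    embeddings: array of shape (n_companies, dim), one row per company.
    threshold:  similarities below this value are set to zero (an assumed
                sparsification step for illustration).
    """
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # keep all-zero rows as zero vectors
    unit = emb / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # no self-loops
    sim[sim < threshold] = 0.0
    return sim
```

The resulting symmetric matrix can be passed to a graph neural network as a weighted adjacency matrix. Whether and how the thesis sparsifies the graph is not stated in the abstract.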

Edinburgh Research Archive

Data driven quantitative methods for financial forecasting problems

Author: Michael Nikolas
Publication date: 20/06/2025

With increasing computational resources, data science and machine learning methods are becoming increasingly relevant in quantitative analysis. This thesis focuses on forecasting financial time series using data-driven methods. The contributions are divided into four parts, each based on different data sets, forecasting tasks, and methodologies. In the first part, we develop a feature based on option volumes decomposed by distinct market participant classes to predict directional price movements in the spot market. We conduct a detailed analysis demonstrating the robustness of our methodology and providing insights into the option contracts and orders with the highest predictability. In the second part, we introduce OFTER, a time series forecasting pipeline tailored for financial problems centered on mid-sized multivariate time series. OFTER is designed for online tasks, utilizing non-parametric models and avoiding the curse of dimensionality. In the third part, we augment the Heterogeneous Autoregressive Regression model for forecasting realized volatility. We employ various estimators for daily, weekly, and monthly volatilities, focusing on the utilization of option price data and implied volatilities based on the Black-Scholes and Heston models. In the final part, we formulate a network framework for forecasting problems related to E-mini S&P 500 and CBOE Volatility Index futures. By combining a multi-channel Graph Convolutional Network with a Long Short-Term Memory network, we enhance predictive performance across different forecasting problems
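For readers unfamiliar with the baseline that the third part augments, the standard heterogeneous autoregressive regression for realized volatility can be sketched as an ordinary least squares fit on daily, weekly, and monthly averages. The 5-day and 22-day windows are the conventional choices and are assumed here, as are the function names `har_design` and `fit_har`. The option-based estimators used in the thesis are not reproduced.

```python
import numpy as np


def har_design(rv, weekly=5, monthly=22):
    """Build the HAR design matrix and one-step-ahead targets.

    Each row is [1, RV_t, mean of last `weekly` RVs, mean of last `monthly` RVs],
    and the matching target is RV_{t+1}.
    """
    rv = np.asarray(rv, dtype=float)
    rows, targets = [], []
    for t in range(monthly - 1, len(rv) - 1):
        daily = rv[t]
        week = rv[t - weekly + 1 : t + 1].mean()
        month = rv[t - monthly + 1 : t + 1].mean()
        rows.append([1.0, daily, week, month])
        targets.append(rv[t + 1])
    return np.array(rows), np.array(targets)


def fit_har(rv, weekly=5, monthly=22):
    """Least-squares estimate of (intercept, beta_daily, beta_weekly, beta_monthly)."""
    X, y = har_design(rv, weekly, monthly)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```

Replacing the daily, weekly, or monthly columns with alternative volatility estimators (for instance, implied volatilities) would only require changing how the rows of the design matrix are built.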

Oxford University Research Archive

Data-driven methods for simulation and forecasting of financial time series

Author: Zhang Chao
Publication date: 26/07/2023

This thesis develops data-driven methods for the simulation and forecasting of financial time series. The contributions are structured into four main components. In the first part, we propose Tail-GAN, a novel nonparametric approach that combines a Generative Adversarial Network (GAN) with the joint elicitability property of Value-at-Risk (VaR) and Expected Shortfall (ES) for learning to simulate price scenarios that preserve tail risk features for a set of benchmark trading strategies. In the second part, we investigate the impact of order flow imbalance (OFI) on price movements in equity markets in a multi-asset setting. Our results show that, once the information from multiple levels is integrated into the OFI, multi-asset models with cross-impact do not provide additional explanatory power for contemporaneous impact compared to a sparse model without the cross-impact terms. We show however that cross-asset OFIs do improve the forecasting of future returns. In the third part, we apply machine learning models to forecast intraday realized volatility (RV), by exploiting commonality in intraday volatility by pooling stocks together, and by incorporating a proxy for market volatility. Neural networks dominate linear regression and tree-based models in terms of performance, and remain robust and competitive on unseen stocks not included in the training set, thus providing new empirical evidence for a universal volatility mechanism among stocks. We also propose a new approach to forecasting one-day-ahead RVs using past intraday RVs as predictors, and expose interesting time-of-day effects that aid the forecasting mechanism. In the last part, we develop a method for forecasting the realized covariance matrix of asset returns in the U.S. equity market by exploiting the predictive information of graphs in volatility and correlation. Specifically, we augment the Heterogeneous Autoregressive (HAR) model via neighborhood aggregation on these graphs. 
The results generally suggest that the augmented model incorporating graph information yields both statistically and economically significant improvements for out-of-sample performance over the traditional models.
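As a rough illustration of what neighborhood aggregation on a graph can mean for a HAR-type model, one could add graph-weighted sums of neighbours' lagged values to the usual daily, weekly, and monthly terms. The notation below (the target series $y_{i,t}$, adjacency weights $W_{ij}$, and coefficients $\gamma$) is an assumption made for this sketch. The thesis's exact specification may differ.

```latex
\[
y_{i,t+1} = \alpha
  + \beta_d\, y_{i,t} + \beta_w\, y^{(w)}_{i,t} + \beta_m\, y^{(m)}_{i,t}
  + \gamma_d \sum_{j} W_{ij}\, y_{j,t}
  + \gamma_w \sum_{j} W_{ij}\, y^{(w)}_{j,t}
  + \gamma_m \sum_{j} W_{ij}\, y^{(m)}_{j,t}
  + \varepsilon_{i,t+1},
\]
\[
y^{(w)}_{i,t} = \frac{1}{5} \sum_{k=0}^{4} y_{i,t-k},
\qquad
y^{(m)}_{i,t} = \frac{1}{22} \sum_{k=0}^{21} y_{i,t-k}.
\]
```

Setting all $\gamma$ coefficients to zero recovers the plain HAR regression, so the graph terms can be read as an additive correction driven by connected series.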

Oxford University Research Archive

Flexible estimation of temporal point processes and graphs

Author: Sulem Déborah
Publication date: 19/06/2023

Handling complex data types with spatial structures, temporal dependencies, or discrete values is generally a challenge in statistics and machine learning. In recent years, there has been an increasing need for methodological and theoretical work to analyse non-standard data types, for instance, data collected on protein structures, gene interactions, social networks or physical sensors. In this thesis, I will propose a methodology and provide theoretical guarantees for analysing two general types of discrete data emerging from interactive phenomena, namely temporal point processes and graphs. On the one hand, temporal point processes are stochastic processes used to model event data, i.e., data that comes as discrete points in time or space where some phenomenon occurs. Some of the most successful applications of these discrete processes include online messages, financial transactions, earthquake strikes, and neuronal spikes. The popularity of these processes notably comes from their ability to model unobserved interactions and dependencies between temporally and spatially distant events. However, statistical methods for point processes generally rely on estimating a latent, unobserved, stochastic intensity process. In this context, designing flexible models and consistent estimation methods is often a challenging task. On the other hand, graphs are structures made of nodes (or agents) and edges (or links), where an edge represents an interaction or relationship between two nodes. Graphs are ubiquitous in modelling real-world social, transport, and mobility networks, where edges can correspond to virtual exchanges, physical connections between places, or migrations across geographical areas. Besides, graphs are used to represent correlations and lead-lag relationships between time series, and local dependence between random objects.
Graphs are typical examples of non-Euclidean data, where adequate distance measures, similarity functions, and generative models need to be formalised. In the deep learning community, graphs have become particularly popular within the field of geometric deep learning. Structure and dependence can both be modelled by temporal point processes and graphs, although predominantly, the former act on the temporal domain while the latter conceptualise spatial interactions. Nonetheless, some statistical models combine graphs and point processes in order to account for both spatial and temporal dependencies. For instance, temporal point processes have been used to model the birth times of edges and nodes in temporal graphs. Moreover, some multivariate point process models have a latent graph parameter governing the pairwise causal relationships between the components of the process. In this thesis, I will notably study such a model, called the Hawkes model, as well as graphs evolving in time. This thesis aims at designing inference methods that provide flexibility in the contexts of temporal point processes and graphs. This manuscript is presented in an integrated format, with four main chapters and two appendices. Chapters 2 and 3 are dedicated to the study of Bayesian nonparametric inference methods in the generalised Hawkes point process model. While Chapter 2 provides theoretical guarantees for existing methods, Chapter 3 also proposes, analyses, and evaluates a novel variational Bayes methodology. The other main chapters introduce and study model-free inference approaches for two estimation problems on graphs, namely spectral methods for the signed graph clustering problem in Chapter 4, and a deep learning algorithm for the network change point detection task on temporal graphs in Chapter 5. Additionally, Chapter 1 provides an introduction and background preliminaries on point processes and graphs.
Chapter 6 concludes this thesis with a summary and critical reflection on the work in this manuscript, and proposals for future research. Finally, the appendices contain two supplementary papers. The first one, in Appendix A, initiated after the COVID-19 outbreak in March 2020, is an application of a discrete-time Hawkes model to COVID-related death counts during the first wave of the pandemic. The second work, in Appendix B, was conducted during an internship at Amazon Research in 2021, and proposes an explainability method for anomaly detection models acting on multivariate time series.
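The stochastic intensity at the heart of the Hawkes model mentioned above can be made concrete with a small univariate sketch. It assumes an exponential excitation kernel and an optional link function applied to the linear term. The parameter names `mu`, `alpha`, `beta`, and `link` are illustrative choices, not the thesis's notation, and the multivariate, nonparametric setting studied in the thesis is not reproduced.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta, link=None):
    """Conditional intensity of a univariate Hawkes process at time t.

    Linear term: mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)).
    If `link` is given (e.g. a ReLU), it is applied to the linear term, which
    allows inhibition (alpha < 0) while keeping the intensity non-negative.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )
    linear = mu + excitation
    return link(linear) if link is not None else linear
```

For example, `hawkes_intensity(2.0, [0.0, 1.0], mu=0.5, alpha=0.8, beta=1.0)` adds two decaying contributions to the baseline rate, one for each past event.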

Oxford University Research Archive

Hermitian matrices for clustering directed graphs: insights and applications

Author: Cucuringu Mihai
Publication date: 30/10/2019

Graph clustering is a basic technique in data mining, and has widespread applications in different domains. While spectral techniques have been successfully applied for clustering undirected graphs, the performance of spectral clustering algorithms for directed graphs (digraphs) is not in general satisfactory, as these algorithms usually require symmetrising the matrix representing the digraph, and typical objective functions for undirected graph clustering (e.g., graph conductance and the normalised cut value) do not capture cluster-structures in which the information given by the direction of the edges is crucial. To overcome these downsides faced by most existing spectral algorithms, we study complex-valued Hermitian matrix representations of digraphs and present a clustering algorithm based on such Hermitian matrix representations. Through extensive experimental results and a theoretical analysis on a directed version of the stochastic block model, we show that our algorithm is able to uncover cluster-structures which are not simply based on edge-density, but on imbalances in the direction of the edges between the clusters. We highlight the significance of our work on a data set pertaining to internal migration in the United States: while previous spectral clustering algorithms for digraphs can only reveal that people are more likely to move between counties that are geographically close (a property independent of the direction of the edges in the underlying graph), our approach is able to cluster together counties with a similar socio-economical profile even when they are geogra

Oxford University Research Archive

Towards trustworthy machine learning with kernels

Author: Chau Siu Lun
Publication date: 29/08/2023

Machine Learning has become an indispensable aspect of various safety-critical industries like healthcare, law, and automotive. Hence, it is crucial to ensure that our machine learning models function appropriately and instil trust among their users. This thesis focuses on improving the safety and transparency of Machine Learning by advocating for more principled uncertainty quantification and more effective explainability tools. Specifically, the use of Kernel Mean Embeddings (KME) and Gaussian Processes (GP) is prevalent in this work since they can represent probability distributions with minimal distributional assumptions and capture uncertainty well, respectively. I dedicate Chapter 2 to introducing these two methodologies. Chapter 3 demonstrates an effective use of these methods in conjunction with each other to tackle a statistical downscaling problem, in which a Deconditional Gaussian process is proposed. Chapter 4 considers a causal data fusion problem, where multiple causal graphs are combined for inference. I introduce BayesIMP, an algorithm built using KME and GPs, to draw causal conclusions while accounting for the uncertainty in the data and model. In Chapter 5, I present RKHS-SHAP to model explainability for kernel methods that utilizes Shapley values. Specifically, I propose to estimate the value function in the cooperative game using KMEs, circumventing the need for any parametric density estimations. A Shapley regulariser is also proposed to regulate the amount of contributions certain features can have to the model. Chapter 6 presents a generalised preferential Gaussian process for modelling preference with non-rankable structure, which sets the scene for Chapter 7, where I build upon my research and propose Pref-SHAP to explain preference models.
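To make the idea of a kernel mean embedding concrete, the sketch below evaluates the empirical embedding of a sample under a Gaussian (RBF) kernel and uses it to compute a biased estimate of the squared maximum mean discrepancy between two samples. The kernel choice, the `lengthscale` parameter, and the function names are assumptions for illustration. None of the thesis's algorithms (BayesIMP, RKHS-SHAP, Pref-SHAP) are reproduced here.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * lengthscale**2))


def mean_embedding(sample, query, lengthscale=1.0):
    """Empirical KME evaluated at each query point: (1/n) * sum_i k(x_i, q)."""
    return rbf_kernel(sample, query, lengthscale).mean(axis=0)


def mmd_squared(x, y, lengthscale=1.0):
    """Biased estimate of the squared RKHS distance between two embeddings."""
    kxx = rbf_kernel(x, x, lengthscale).mean()
    kyy = rbf_kernel(y, y, lengthscale).mean()
    kxy = rbf_kernel(x, y, lengthscale).mean()
    return float(kxx + kyy - 2.0 * kxy)
```

No parametric density is fitted at any point: the distribution is represented only through kernel evaluations on the sample, which is the sense in which the embedding needs minimal distributional assumptions.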

Oxford University Research Archive