Data Analysis and Modelling of Users’ Behaviour on the Web
New technologies and services are profoundly changing how we relate to the world. The Internet, with its pervasive use, has certainly been the most dramatic leap of the last 30 years. My research was driven by the need to understand how people interact with the web, to capture its characteristics and changes, and to model people's habits and interactions.
Traces and logs of users' behaviour collected on the Internet (i.e., passive measurements) offer invaluable information for reaching this goal.
Thanks to these passive traces, my work studies the behaviour of users on the Internet, focusing on two complementary aspects: (i) data analytics, and (ii) user modelling.
There are many key challenges to face: (big) data requires scalable software and hardware. It also demands innovative methodologies and meaningful metrics to obtain trustworthy, filtered, clean and useful information.
Data analytics is performed by means of a variety of statistical, machine learning and data mining approaches. Moreover, it is a prerequisite for creating analytical models of the studied phenomena, which should adhere to reality as closely as possible.
Lastly, understanding the applicability of the derived models is a fundamental step for optimizing performance and understanding possible scenarios.
In more detail, during my PhD I analyzed 3 years of data from about 30\,000 households and reconstructed the users' online activity. Thanks to this, I was able to highlight the evolution of device usage, the intrinsic structure of navigation, and the interactions with social networks and search engines.
I introduced a new machine learning approach to identify the web pages and websites that users visited intentionally. Then, I built user-specific profiles by fingerprinting the domains each user visited, and showed how these profiles can be used to re-identify users at a later time.
I modelled the sequence of visited web services, representing them in a succinct and interpretable manner. I showed how to automatically extract groups of similar or likely connected websites, and how to monitor the interests and browsing patterns of single users or communities.
I also modelled user interaction with online recommendation systems, introducing a user behavioural model that captures the impact of the temporal dynamics of the advertisements shown. Lastly, I demonstrated how to improve the revenue of an advertisement platform by optimizing the times at which ads are shown to users.
My findings have several direct implications for the different Internet actors and for the research community.
Following the scientific approach, I made the anonymized datasets available to the community, in order to guarantee the reproducibility of my results.
Moreover, I addressed the problem of online privacy in today's changing world, with the objective of finding a trade-off between the desire to obtain knowledge for shaping new technologies and the need not to violate the privacy of individuals.
Finally, the current digital transformation implies that everyone and everything produces data that can be exploited to create new disruptive capabilities.
Data analytics allows us to realize incredible transformations not only on the web, but also in our cities, in energy production, and in manufacturing. Exploiting the knowledge of users' behaviour obtained from these data, and modelling and optimizing system performance as I did in my work, will be a key factor in designing the smart cities of the near future.
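The re-identification idea mentioned in the abstract above (profiling users by the domains they visit and matching a later trace against those profiles) can be illustrated with a minimal sketch. This is not the method used in the thesis: the Jaccard similarity, the function names `jaccard` and `reidentify`, and the toy domain names are all assumptions chosen only to make the idea concrete.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of domains."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reidentify(known_profiles, unknown_domains):
    """Return (user, score) for the known profile most similar to the trace.

    known_profiles maps a user label to the set of domains seen for that
    user in an earlier period; unknown_domains is the set of domains seen
    in a later, unlabelled trace.
    """
    best_user, best_score = None, -1.0
    for user, domains in known_profiles.items():
        score = jaccard(domains, unknown_domains)
        if score > best_score:
            best_user, best_score = user, score
    return best_user, best_score


# Hypothetical toy data: two earlier profiles and one later trace.
profiles = {
    "user_a": {"news.example", "mail.example", "video.example"},
    "user_b": {"shop.example", "mail.example", "games.example"},
}
later_trace = {"news.example", "video.example", "maps.example"}

match, score = reidentify(profiles, later_trace)
print(match, score)  # user_a 0.5
```

Even this naive matching shows why browsing histories raise privacy concerns: a small set of distinctive domains can be enough to single out one profile among many.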
A hybrid swarm-based algorithm for single-objective optimization problems involving high-cost analyses
Human Behaviour on the Web: Evolution, Interactions and Exploitation
The Web has a fundamental impact on our life, and its usage is quite dynamic and heterogeneous.
Moreover, the Web, and in particular Online Social Networks, allow people to communicate directly with the public, bypassing the filters of traditional media. Among others, politicians and companies are exploiting these technologies to widen their influence.
In the talk I will show techniques to capture this evolution in usage and to analyze people's interactions on the Internet. This information allows us to understand how users and web services change over time, and how someone can take advantage of these behaviours.
A multi-faceted characterization of free-floating car sharing service usage
During the last decade, car sharing systems appeared in many cities and gained popularity. The research community has analyzed their current utilization trends in different contexts, their growth perspectives, and their gradual shift towards more sustainable technologies. Through the large and heterogeneous amount of car sharing usage data that is now available, researchers have been able to gain new insights into these services. In this paper, we provide an extensive characterization of the Free-Floating Car Sharing (FFCS) service usage in 23 cities in Europe and North America over a 14-month period. From our data about FFCS services, we detail fleet size, operating area, and characteristics of the car bookings and rentals. We also identify temporal patterns that are peculiar to specific cities and countries. We further highlight urban zones with high attractiveness or with a high rental generation rate. Finally, we compare the systems relying on internal combustion engine cars with those based on electric vehicles in terms of various indicators, including the influence on car refueling. The results show that car utilization patterns are rather variable across cities with the highest per-car utilization rate in Madrid. The majority of the cities show negative or stable usage trends due to either the reduced appeal of the service or the presence of inefficiencies in the service provision. These data-driven insights may help system managers assess the provided services’ profitability and sustainability from multiple perspectives.
A hybrid ABC for expensive optimizations: CEC 2016 competition benchmark
An evolution of the Artificial Bee Colony (ABC) optimization algorithm, called the Artificial super-Bee enhanced Colony (AsBeC), is presented, aimed at obtaining the best improvement with a low number of analyses. AsBeC is designed to provide fast convergence speed, high solution accuracy and robust performance over a wide range of problems. It implements enhancements of the ABC structure and original hybridizations with interpolation strategies. These techniques are tested on the expensive benchmark of the Special Session on Real-Parameter Single Objective Optimization at CEC 2016. In this specific case, the hybridization with a quadratic trust region approach plays a major role. Moreover, the AsBeC results are compared to the algorithms tested on the same benchmark at CEC 2015, showing remarkable competitiveness and robustness.
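For readers unfamiliar with the baseline that AsBeC builds on, the following is a minimal sketch of a plain Artificial Bee Colony loop under a fixed evaluation budget. It is not AsBeC: it contains none of the enhancements or interpolation and trust-region hybridizations mentioned in the abstract. The function name `abc_minimize`, its parameters, and their default values are assumptions made for illustration.

```python
import random


def abc_minimize(f, bounds, n_sources=10, limit=20, max_evals=2000, seed=0):
    """Plain ABC sketch: minimize f over box bounds within max_evals calls."""
    rng = random.Random(seed)
    dim = len(bounds)

    def random_source():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    sources = [random_source() for _ in range(n_sources)]
    costs = [f(x) for x in sources]
    evals = n_sources
    trials = [0] * n_sources  # consecutive failed improvements per source
    best_i = min(range(n_sources), key=costs.__getitem__)
    best_x, best_cost = list(sources[best_i]), costs[best_i]

    def try_neighbour(i):
        """Perturb one coordinate of source i; keep the move if it improves."""
        nonlocal evals, best_x, best_cost
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        phi = rng.uniform(-1.0, 1.0)
        lo, hi = bounds[d]
        cand[d] = min(hi, max(lo, cand[d] + phi * (cand[d] - sources[k][d])))
        c = f(cand)
        evals += 1
        if c < costs[i]:
            sources[i], costs[i], trials[i] = cand, c, 0
            if c < best_cost:
                best_x, best_cost = list(cand), c
        else:
            trials[i] += 1

    while evals < max_evals:
        # Employed bees: one local move per food source.
        for i in range(n_sources):
            if evals >= max_evals:
                break
            try_neighbour(i)
        # Onlooker bees: favour sources with lower cost.
        fitness = [1.0 / (1.0 + c) if c >= 0 else 1.0 + abs(c) for c in costs]
        for _ in range(n_sources):
            if evals >= max_evals:
                break
            i = rng.choices(range(n_sources), weights=fitness)[0]
            try_neighbour(i)
        # Scout bees: replace sources that stopped improving.
        for i in range(n_sources):
            if trials[i] > limit and evals < max_evals:
                sources[i] = random_source()
                costs[i] = f(sources[i])
                evals += 1
                trials[i] = 0
                if costs[i] < best_cost:
                    best_x, best_cost = list(sources[i]), costs[i]
    return best_x, best_cost


def sphere(x):
    """Toy objective with its minimum, 0, at the origin."""
    return sum(v * v for v in x)


box = [(-5.0, 5.0), (-5.0, 5.0)]
best_x, best_cost = abc_minimize(sphere, box)
```

The explicit `max_evals` budget reflects the expensive-optimization setting, where each objective evaluation is costly and the number of analyses is the limiting resource.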
On Cost-Effectiveness of Language Models for Time Series Anomaly Detection
Detecting anomalies in time series data is crucial across several domains, including healthcare, finance, and automotive. Large Language Models (LLMs) have recently shown promising results by leveraging robust model pretraining. However, fine-tuning LLMs with several billion parameters requires a large number of training samples and significant training costs. Conversely, LLMs under a zero-shot learning setting require lower overall computational costs, but can fall short in handling complex anomalies. In this paper, we explore the use of lightweight language models for Time Series Anomaly Detection, either zero-shot or via fine-tuning them. Specifically, we leverage lightweight models that were originally designed for time series forecasting, benchmarking them for anomaly detection against both open-source and proprietary LLMs across different datasets. Our experiments demonstrate that lightweight models (70 Billions)
Debate on online social networks at the time of COVID-19: An Italian case study
The COVID-19 pandemic is not only having a heavy impact on healthcare but also changing people’s habits and the society we live in. Countries such as Italy have enforced a total lockdown lasting several months, with most of the population forced to remain at home. During this time, online social networks, more than ever, have represented an alternative solution for social life, allowing users to interact and debate with each other. Hence, it is of paramount importance to understand the changing use of social networks brought about by the pandemic. In this paper, we analyze how the interaction patterns around popular influencers in Italy changed during the first six months of 2020 on the Instagram and Facebook social networks. We collected a large dataset for this group of public figures, including more than 54 million comments on over 140 thousand posts for these months. We analyze and compare engagement on the posts of these influencers and provide quantitative figures for aggregated user activity. We further show the changes in the patterns of usage before and during the lockdown, which reveal a growth of activity and sizable daily and weekly variations. We also analyze user sentiment through the psycholinguistic properties of comments, and the results show the rapid boom and disappearance of topics related to the pandemic. To support further analyses, we release the anonymized dataset.
Disentangling the Information Flood on OSNs: Finding Notable Posts and Topics
Online Social Networks (OSNs) are an integral part of modern life for sharing thoughts, stories, and news. An ecosystem of influencers generates a flood of content in the form of posts, some of which have an unusually high level of engagement with the influencer’s fan base. These posts relate to blossoming topics of discussion that generate particular interest among users: the COVID-19 pandemic is a prominent example. Studying these phenomena provides an understanding of the OSN landscape and requires appropriate methods. This paper presents a methodology to discover notable posts and group them according to their related topic. By combining anomaly detection, graph modelling and community detection techniques, we pinpoint salient events automatically, with the ability to tune how many of them are reported. We showcase our approach using a large Instagram dataset and extract some notable weekly topics that gained momentum from 1.4 million posts. We then illustrate some use cases ranging from the COVID-19 outbreak to sporting events.
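As a rough illustration of the first stage described above (flagging posts with unusually high engagement), here is a minimal sketch based on a robust z-score. The choice of median and median absolute deviation (MAD), the function name `notable_posts`, the default threshold, and the toy numbers are assumptions; the paper's actual anomaly detector, graph model, and community detection step are not reproduced here.

```python
from statistics import median


def notable_posts(engagement, threshold=3.5):
    """Return indices of posts whose engagement is unusually high.

    engagement holds one number per post (for example, comment counts).
    Raising the threshold reports fewer posts; lowering it reports more.
    """
    med = median(engagement)
    mad = median([abs(x - med) for x in engagement])
    if mad == 0:
        return []  # no spread in the data, so nothing stands out
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, x in enumerate(engagement)
            if 0.6745 * (x - med) / mad > threshold]


# Hypothetical comment counts for seven posts by one influencer.
comments = [120, 135, 110, 128, 2400, 140, 115]
flagged = notable_posts(comments)
print(flagged)  # [4]
```

The threshold plays the role of the tuning knob mentioned in the abstract: it controls how many posts are reported as notable.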
The Sweet Danger of Sugar: Debunking Representation Learning for Encrypted Traffic Classification
Recently we have witnessed an explosion of proposals that, inspired by Language Models like BERT, exploit Representation Learning models to create traffic representations. All of them promise astonishing performance in encrypted traffic classification (up to 98% accuracy). In this paper, with a networking expert mindset, we critically reassess their performance. Through extensive analysis, we demonstrate that the reported successes are heavily influenced by data preparation problems, which allow these models to find easy shortcuts (spurious correlations between features and labels) during fine-tuning that unrealistically boost their performance. When such shortcuts are not present, as in real scenarios, these models perform poorly. We also introduce Pcap-Encoder, an LM-based representation learning model that we specifically design to extract features from protocol headers. Pcap-Encoder appears to be the only model that provides an instrumental representation for traffic classification. Yet, its complexity questions its applicability in practical settings. Our findings reveal flaws in dataset preparation and model training, calling for a better and more conscious test design. We propose a correct evaluation methodology and stress the need for rigorous benchmarking.
Modeling communication asymmetry and content personalization in online social networks
The increasing popularity of online social networks (OSNs) has attracted growing interest in modeling social interactions. On online social platforms, a few individuals, commonly referred to as influencers, produce the majority of content consumed by users and hegemonize the landscape of the social debate. However, classical opinion models do not capture this communication asymmetry. We develop an opinion model inspired by observations on social media platforms with two main objectives: first, to describe this inherent communication asymmetry in OSNs, and second, to model the effects of content personalization.
We derive a Fokker-Planck equation for the temporal evolution of users' opinion distribution and analytically characterize the stationary system behavior. Analytical results, confirmed by Monte-Carlo simulations, show how strict forms of content personalization tend to radicalize user opinion, leading to the emergence of echo chambers, and favor structurally advantaged influencers.
As an example application, we apply our model to Facebook data during the Italian government crisis in the summer of 2019. Our work provides a flexible framework to evaluate the impact of content personalization on the opinion formation process, focusing on the interaction between influential individuals and regular users. This framework is interesting in the context of marketing and advertising, misinformation spreading, politics and activism.
