Generative Approaches to Sound-Squatting: AI Tools and Validation
The abstract is in the attachment.
Tracking Knowledge Propagation Across Wikipedia Languages
In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering the entire 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts, and allow follow-up research on building predictive models of them. For this purpose, we align all the Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore the full inter-language propagation at a large scale. Together with the dataset, a holistic overview of the propagation and key insights about the underlying structural factors are provided to aid future research. For example, we find that although long cascades are unusual, the propagation tends to continue further once it reaches more than four language editions. We also find that the size of language editions is associated with the speed of propagation. We believe the dataset not only contributes to the prior literature on Wikipedia growth but also enables new use cases such as edit recommendation for addressing knowledge gaps, detection of disinformation, and cultural relationship analysis
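The observation that propagation tends to continue once it passes four language editions can be read as a conditional probability: of the cascades that reached at least k editions, what fraction reached at least k+1? The following is a minimal sketch of that calculation, not the authors' code. The function name and the toy cascade sizes are hypothetical and chosen only for illustration.

```python
def continuation_probability(cascade_sizes, k):
    """Fraction of cascades with at least k editions that reached k + 1.

    cascade_sizes: number of language editions covering each concept
    (hypothetical input format, not the dataset's actual schema).
    Returns None when no cascade reached k editions.
    """
    reached_k = [size for size in cascade_sizes if size >= k]
    if not reached_k:
        return None
    continued = sum(1 for size in reached_k if size >= k + 1)
    return continued / len(reached_k)


# Toy data: most concepts stay in one edition, a few spread widely.
sizes = [1, 1, 1, 1, 2, 2, 5, 6, 6, 8]
p_after_1 = continuation_probability(sizes, 1)
p_after_5 = continuation_probability(sizes, 5)
```

Plotting this value against k on real data would show whether the continuation probability rises after a given threshold.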
Can Public IP Blocklists Explain Internet Radiation?
Network telescopes (IP addresses hosting no services) are valuable for observing unsolicited Internet traffic from scanners, crawlers, botnets, and misconfigured hosts. This traffic is known as Internet radiation, and its monitoring with telescopes helps in identifying malicious activities. Yet, the deployment of telescopes is expensive. Meanwhile, numerous public blocklists aggregate data from various sources to track IP addresses involved in malicious activity. This raises the question of whether public blocklists already provide sufficient coverage of these actors, thus rendering new network telescopes unnecessary. We address this question by analyzing traffic from four geographically distributed telescopes and dozens of public blocklists over a two-month period. Our findings show that public blocklists include approximately 71% of IP addresses observed in the telescopes. Moreover, telescopes typically observe scanning activities days before they appear in blocklists. We also find that only 4 out of 50 lists contribute the majority of the coverage, while the addresses evading blocklists present more sporadic activity. Our results demonstrate that distributed telescopes remain valuable assets for network security, providing early detection of threats and complementary coverage to public blocklists. These results call for more coordination among telescope operators and blocklist providers to enhance the defense against emerging threats
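The coverage figure reported above is, at its core, a set intersection between the source addresses seen by the telescopes and the addresses on the blocklists. The sketch below illustrates that idea under simplifying assumptions: blocklists hold individual addresses rather than prefixes, and time windows are ignored. The function name, list names, and addresses (taken from documentation ranges) are hypothetical.

```python
from ipaddress import ip_address


def blocklist_coverage(telescope_ips, blocklists):
    """Return (overall coverage, per-list coverage) as fractions.

    telescope_ips: iterable of address strings seen by the telescopes.
    blocklists: dict mapping a list name to an iterable of address strings.
    """
    seen = {ip_address(ip) for ip in telescope_ips}
    covered = set()
    per_list = {}
    for name, entries in blocklists.items():
        hits = seen & {ip_address(ip) for ip in entries}
        per_list[name] = len(hits) / len(seen) if seen else 0.0
        covered |= hits
    overall = len(covered) / len(seen) if seen else 0.0
    return overall, per_list


# Toy data only; these are documentation addresses, not real observations.
telescope = ["192.0.2.1", "192.0.2.2", "198.51.100.7", "203.0.113.9"]
lists = {
    "list_a": ["192.0.2.1", "192.0.2.2"],
    "list_b": ["192.0.2.2", "198.51.100.7", "203.0.113.200"],
}
overall, per_list = blocklist_coverage(telescope, lists)
```

Sorting `per_list` by value is one simple way to see whether a handful of lists account for most of the coverage.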
Can Blocklists Explain Darknet Traffic?
Darknets are IP addresses that function as passive probes, recording all received packets without hosting services. The traffic they capture, being unsolicited, makes darknets akin to “network telescopes”. Traces collected on darknets aggregate multiple events useful for cybersecurity, like network scans and exploit attempts. Yet, the mix of heterogeneous events observed from darknets poses significant challenges to those who must understand darknet traffic. Here we face the question of whether new darknet deployments provide novel and useful information when compared to public blocklists. Multiple Cyber Threat Intelligence (CTI) sources publish lists of IP addresses that perform malicious activities, from simple automated scans to SPAM and phishing campaigns. They represent a valuable resource for network administrators, helping to block cyberattacks. Built with a combination of multiple sensors — including darknets and honeypots — these lists could explain the traffic seen on other darknets, thus simplifying the search for relevant events in independent darknet deployments. We thus investigate to what extent open blocklists explain darknet traffic. By crawling hundreds of CTI sources providing blocklists, we first notice how these lists are often incomplete or slowly updated. Traffic seen in our darknet deployment is hardly explained by the blocklists, even when considering only the most prominent scan attempts, and ignoring events such as backscattering. Our preliminary results suggest that blocklists can be of great use for seeding the explanation of darknet traffic, by giving context for the activity of a few IP addresses. Yet, more addresses with similar behaviour are observed in the darknet and could be used to enrich and complement the blocklists
URLGEN – Towards Automatic URL Generation Using GANs
URLs play an essential role on the Internet, allowing access to Web resources. Automatically generating URLs is helpful in various tasks, such as application debugging, API testing, and blocklist creation for security applications. Current testing suites deeply embed experts’ domain knowledge to generate suitable URLs, resulting in an ad-hoc solution for each given application. These tools thus require heavy manual intervention, with the expensive coding of rules that are hard to maintain. We here introduce URLGEN, a system that uses Generative Adversarial Networks (GANs) to tackle the automatic URL generation problem. URLGEN is designed for web API testing and generates URL samples for an application without any system expertise, complementing the existing tools. It leverages Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) architectures, augmented by an embedding layer that simplifies the URL learning and generation process. We show that URLGEN learns to generate new valid URLs from samples of real URLs without requiring any domain knowledge and following a purely data-driven approach. We compare the GAN architecture of URLGEN against other design options and show that the LSTM architecture can better capture the correlation among URL characters, outperforming previously proposed solutions. Finally, we show that the URLGEN approach can be extended to other scenarios, which we illustrate with two use cases, i.e., cybersquatting domain prediction and URL classification
LogPrécis: Unleashing Language Models for Automated Malicious Log Analysis
The collection of security-related logs holds the key to understanding attack
behaviors and diagnosing vulnerabilities. Still, their analysis remains a
daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched
potential in understanding natural and programming languages. The question
arises whether and how LMs could be also useful for security experts since
their logs contain intrinsically confused and obfuscated information. In this
paper, we systematically study how to benefit from the state-of-the-art in LM
to automatically analyze text-like Unix shell attack logs. We present a
thorough design methodology that leads to LogPrécis. It receives as input raw
shell sessions and automatically identifies and assigns the attacker tactic to
each portion of the session, i.e., unveiling the sequence of the attacker's
goals. We demonstrate the capability of LogPrécis to support the analysis of two
large datasets containing about 400,000 unique Unix shell attacks. LogPrécis
reduces them into about 3,000 fingerprints, each grouping sessions with the
same sequence of tactics. The abstraction it provides lets the analyst better
understand attacks, identify fingerprints, detect novelty, link similar
attacks, and track families and mutations. Overall, LogPrécis, released as
open source, paves the way for better and more responsive defense against
cyberattacks.
Comment: 18 pages, Computers & Security
(https://www.sciencedirect.com/science/article/pii/S0167404824001068), code
available at https://github.com/SmartData-Polito/logprecis, models available
at https://huggingface.co/SmartDataPolit
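The fingerprinting step described in this abstract, grouping sessions that share the same sequence of tactics, can be sketched with a dictionary keyed by that sequence. This is an illustration only and not the LogPrécis code: the function name, session identifiers, and tactic labels are hypothetical placeholders, and the per-portion tactic labels are assumed to have already been predicted by the model.

```python
from collections import defaultdict


def group_by_fingerprint(sessions):
    """Group session ids by their ordered sequence of tactic labels.

    sessions: dict mapping a session id to a list of tactic labels,
    one per portion of the session (hypothetical input format).
    """
    groups = defaultdict(list)
    for session_id, tactics in sessions.items():
        # The tuple of tactics acts as the fingerprint of the session.
        groups[tuple(tactics)].append(session_id)
    return dict(groups)


# Toy sessions with placeholder tactic labels.
sessions = {
    "s1": ["Discovery", "Execution"],
    "s2": ["Discovery", "Execution"],
    "s3": ["Persistence", "Execution"],
}
fingerprints = group_by_fingerprint(sessions)
```

A fingerprint that appears for the first time in such a mapping is a natural candidate for novelty inspection by an analyst.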
Augmenting phishing squatting detection with GANs
Current solutions to tackle phishing employ blocklists that are built from user reports or automatic approaches. They, however, fall short in detecting zero-day phishing attacks. We propose the use of Generative Adversarial Networks (GANs) to automate the generation of new squatting candidates starting from a list of benign URLs. The candidates can be either manually verified or become part of a training set for existing machine learning models. Our results show that GANs can produce squatting candidates, some of which are previously unknown existing phishing domains
CFA-Bench: Cybersecurity Forensic LLM Agent Benchmark and Testing
This paper investigates the capabilities and limitations of Large Language Model (LLM) agents in performing cybersecurity forensic tasks, including incident response, digital evidence correlation, and threat attribution. To enable a fair comparison of agents and LLMs, we introduce CFA-bench, a novel benchmark designed to evaluate their forensic reasoning abilities. We leverage a controlled testbed where vulnerable services are instantiated, attacked, and monitored, generating forensic evidence in the form of packet captures and log traces. Using this setup, we generate 20 curated incidents targeting 13 distinct services, focusing on recent vulnerabilities. Each incident presents progressively complex checkpoints, culminating in the identification of the specific Common Vulnerabilities and Exposures (CVE) entry. We evaluate different LLM-powered agent architectures, equipping them with essential forensic tools such as a PCAP Reader and an Information Retriever. Each agent is asked to analyse the incidents, and we systematically track its performance across the different forensic checkpoints. While preliminary, our findings demonstrate the potential of LLM agents in cybersecurity forensics, revealing their strengths and critical areas for improvement. This study underscores the need for standardized benchmarks to rigorously assess LLM agents in cyber threat analysis. For this, we make CFA-bench open to the research community. Our results provide a foundation for future research aimed at refining agent architectures and enhancing their forensic reasoning capabilities
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
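The two counting schemes compared in this abstract differ only in how many authors of each cited work are considered. The sketch below is a simplified illustration, not the study's procedure: it counts each author pair at most once per citing paper and does not exclude pairs who are co-authors of the same cited work. The function name, the `max_authors` parameter, and the author names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a set of citing papers.

    citing_papers: list of reference lists; each reference is the ordered
    author list of one cited work (hypothetical input format).
    max_authors=1 gives first-author counting; max_authors=5 considers
    the first five authors of every cited work.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy data: two citing papers, each citing two works.
papers = [
    [["Smith", "Lee"], ["Chen", "Park"]],
    [["Smith"], ["Park", "Chen"]],
]
first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In the toy data, Chen is never counted in the second paper under first-author counting because Park is listed first, which is the kind of information the non-traditional scheme recovers.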
