1,720,972 research outputs found

    Generative Approaches to Sound-Squatting: AI Tools and Validation

    Full text link
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Tracking Knowledge Propagation Across Wikipedia Languages

    Full text link
    In this paper, we present a dataset of inter-language knowledge propagation in Wikipedia. Covering the entire 309 language editions and 33M articles, the dataset aims to track the full propagation history of Wikipedia concepts, and allow follow-up research on building predictive models of them. For this purpose, we align all the Wikipedia articles in a language-agnostic manner according to the concept they cover, which results in 13M propagation instances. To the best of our knowledge, this dataset is the first to explore the full inter-language propagation at a large scale. Together with the dataset, a holistic overview of the propagation and key insights about the underlying structural factors are provided to aid future research. For example, we find that although long cascades are unusual, the propagation tends to continue further once it reaches more than four language editions. We also find that the size of language editions is associated with the speed of propagation. We believe the dataset not only contributes to the prior literature on Wikipedia growth but also enables new use cases such as edit recommendation for addressing knowledge gaps, detection of disinformation, and cultural relationship analysis

    Can Public IP Blocklists Explain Internet Radiation?

    No full text
    Network telescopes (IP addresses hosting no services) are valuable for observing unsolicited Internet traffic from scanners, crawlers, botnets, and misconfigured hosts. This traffic is known as Internet radiation, and its monitoring with telescopes helps in identifying malicious activities. Yet, the deployment of telescopes is expensive. Meanwhile, numerous public blocklists aggregate data from various sources to track IP addresses involved in malicious activity. This raises the question of whether public blocklists already provide sufficient coverage of these actors, thus rendering new network telescopes unnecessary. We address this question by analyzing traffic from four geographically distributed telescopes and dozens of public blocklists over a two-month period. Our findings show that public blocklists include approximately 71% of IP addresses observed in the telescopes. Moreover, telescopes typically observe scanning activities days before they appear in blocklists. We also find that only 4 out of 50 lists contribute the majority of the coverage, while the addresses evading blocklists present more sporadic activity. Our results demonstrate that distributed telescopes remain valuable assets for network security, providing early detection of threats and complementary coverage to public blocklists. These results call for more coordination among telescope operators and blocklist providers to enhance the defense against emerging threats

    Can Blocklists Explain Darknet Traffic?

    No full text
    Darknets are IP addresses that function as passive probes, recording all received packets without hosting services. The traffic they capture, being unsolicited, makes darknets akin to “network telescopes”. Traces collected on darknets aggregate multiple events useful for cybersecurity, like network scans and exploit attempts. Yet, the mix of heterogeneous events observed from darknets poses significant challenges to those who must understand darknet traffic. Here we face the question of whether new darknet deployments provide novel and useful information when compared to public blocklists. Multiple Cyber Threat Intelligence (CTI) sources publish lists of IP addresses that perform malicious activities, from simple automated scans to SPAM and phishing campaigns. They represent a valuable resource for network administrators, helping to block cyberattacks. Built with a combination of multiple sensors — including darknets and honeypots — these lists could explain the traffic seen on other darknets, thus simplifying the search for relevant events in independent darknet deployments. We thus investigate to what extent open blocklists explain darknet traffic. By crawling hundreds of CTI sources providing blocklists, we first notice how these lists are often incomplete or slowly updated. Traffic seen in our darknet deployment is hardly explained by the blocklists, even when considering only the most prominent scan attempts, and ignoring events such as backscattering. Our preliminary results suggest that blocklists can be of great use for seeding the explanation of darknet traffic, by giving context for the activity of a few IP addresses. Yet, more addresses with similar behaviour are observed in the darknet and could be used to enrich and complement the blocklists

    URLGEN – Towards Automatic URL Generation Using GANs

    Full text link
    URLs play an essential role on the Internet, allowing access to Web resources. Automatically generating URLs is helpful in various tasks, such as application debugging, API testing, and blocklist creation for security applications. Current testing suites deeply embed experts’ domain knowledge to generate suitable URLs, resulting in an ad-hoc solution for each given application. These tools thus require heavy manual intervention, with the expensive coding of rules that are hard to maintain. We here introduce URLGEN, a system that uses Generative Adversarial Networks (GANs) to tackle the automatic URL generation problem. URLGEN is designed for web API testing and generates URL samples for an application without any system expertise, complementing the existing tools. It leverages Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) architectures, augmented by an embedding layer that simplifies the URL learning and generation process. We show that URLGEN learns to generate new valid URLs from samples of real URLs without requiring any domain knowledge and following a purely data-driven approach. We compare the GAN architecture of URLGEN against other design options and show that the LSTM architecture can better capture the correlation among URL characters, outperforming previously proposed solutions. Finally, we show that the URLGEN approach can be extended to other scenarios, which we illustrate with two use cases, i.e., cybersquatting domain prediction and URL classification

    LogPr\'ecis: Unleashing Language Models for Automated Malicious Log Analysis

    Full text link
    The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from the state-of-the-art in LM to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPr\'ecis. It receives as input raw shell sessions and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., unveiling the sequence of the attacker's goals. We demonstrate LogPr\'ecis capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPr\'ecis reduces them into about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPr\'ecis, released as open source, paves the way for better and more responsive defense against cyberattacks.Comment: 18 pages, Computer&Security (https://www.sciencedirect.com/science/article/pii/S0167404824001068), code available at https://github.com/SmartData-Polito/logprecis, models available at https://huggingface.co/SmartDataPolit

    Augmenting phishing squatting detection with GANs

    Full text link
    Current solutions to tackle phishing employ blocklists that are built from user reports or automatic approaches. They, however, fall short in detecting zero-day phishing attacks. We propose the use of Generative Adversarial Networks (GANs) to automate the generation of new squatting candidates starting from a list of benign URLs. The candidates can be either manually verified or become part of a training set for existing machine learning models. Our results show that GANs can produce squatting candidates, some of which are previously unknown existing phishing domains

    CFA-Bench: Cybersecurity Forensic Llm Agent Benchmark and Testing

    Full text link
    This paper investigates the capabilities and limitations of Large Language Model (LLM) agents in performing cybersecurity forensic tasks, including incident response, digital evidence correlation, and threat attribution. To enable a fair comparison of agents and LLMs, we introduce CFAbench, a novel benchmark designed to evaluate their forensic reasoning abilities. We leverage a controlled testbed where vulnerable services are instantiated, attacked, and monitored, generating forensic evidence in the form of packet captures and log traces. Using this setup, we generate 20 curated incidents targeting 13 distinct services, focusing on recent vulnerabilities. Each incident presents progressively complex checkpoints, culminating in the identification of the specific Common Vulnerabilities and Exposure (CVE). We evaluate different LLM-powered agent architectures, equipping them with essential forensic tools such as a PCAP Reader and an Information Retriever. Each agent is asked to analyse the incidents to systematically track their performance across different forensic checkpoints. While preliminary, our findings demonstrate the potential of LLM agents in cybersecurity forensics, revealing their strengths and critical areas for improvement. This study underscores the need for standardized benchmarks to assess LLM agents in cyber threat analysis rigorously. For this, we make CFA-bench open to the research community. Our results provide a foundation for future research aimed at refining agent architectures and enhancing their forensic reasoning capabilities

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
    corecore