
    Data set for anomaly detection on a HPC system

    This data set contains the data collected on the DAVIDE HPC system (CINECA & E4 & University of Bologna, Bologna, Italy) in the period March-May 2018. The data set has been used to train an autoencoder-based model to automatically detect anomalies in a semi-supervised fashion on a real HPC system. This work is described in: 1) "Anomaly Detection using Autoencoders in High Performance Computing Systems", Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini, IAAI19 (proceedings in process), https://arxiv.org/abs/1902.08447; 2) "Online Anomaly Detection in HPC Systems", Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini, AICAS19 (proceedings in process), https://arxiv.org/abs/1811.05269. See the git repository for usage examples and details: https://github.com/AndreaBorghesi/anomaly_detection_HPC
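
The detection step of a semi-supervised, reconstruction-based detector like the one mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the autoencoder itself is omitted, the reconstruction errors are taken as given inputs, and the percentile rule and all function names (`percentile`, `fit_threshold`, `is_anomalous`) are hypothetical rather than taken from the cited papers or repository.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of `values` for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fit_threshold(normal_errors, q=95):
    """Choose a threshold from reconstruction errors measured on normal data only.

    Assumption: a model trained only on normal behaviour reconstructs normal
    samples well, so a high percentile of their errors separates normal from
    anomalous samples.
    """
    return percentile(normal_errors, q)


def is_anomalous(error, threshold):
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return error > threshold


if __name__ == "__main__":
    # Hypothetical reconstruction errors from normal operation.
    normal_errors = [0.8, 1.1, 0.9, 1.3, 1.0, 1.2, 0.7, 1.4, 1.05, 0.95]
    threshold = fit_threshold(normal_errors, q=90)
    for err in (1.0, 3.5):
        print(err, "anomalous" if is_anomalous(err, threshold) else "normal")
```

The threshold is fitted on normal data alone, which is what makes the setting semi-supervised: no labelled anomalies are needed at training time.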

    Making the Most of Scarce Input Data in Deep Learning-based Source Code Classification for Heterogeneous Device Mapping

    Despite its relatively recent history, Deep Learning (DL) based source code analysis is already a cornerstone of machine learning for compiler optimization. When applied to classifying pieces of code to identify the best computation unit in a heterogeneous System-on-Chip, it can effectively support decisions that a programmer would otherwise have to make manually. Several techniques have been proposed that exploit different networks and input information, prominently sequence-based and graph-based representations, complemented by auxiliary information typically related to payload and device configuration. While the accuracy of DL methods strongly depends on the training and test datasets, so far no exhaustive and statistically meaningful analysis has been done of their impact on the results or of how to effectively extract the available information. This is also relevant given the scarce availability of source code datasets that can be labelled by profiling on heterogeneous compute units. In this paper, we first present such a study, which lets us separate the contributions of code sequences and auxiliary inputs. Starting from this analysis, we then demonstrate that normalizing the auxiliary information improves on state-of-the-art accuracy. Finally, we propose a novel approach exploiting Siamese networks that further improves mapping accuracy by increasing the cardinality of the dataset, thus compensating for its relatively small size.
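
Two ideas from this abstract can be sketched in a few lines: normalizing auxiliary inputs and enlarging a small dataset by forming sample pairs for a Siamese-style model. The abstract does not say which normalization or pairing scheme the authors use, so min-max scaling, all-pairs construction, and every name below are assumptions made for illustration.

```python
from itertools import combinations


def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def make_pairs(samples):
    """Build all unordered pairs from (features, label) samples.

    Each pair carries a binary target: 1 if both samples share a label,
    0 otherwise. n samples yield n * (n - 1) / 2 pairs, which is how
    pairing increases the number of training examples.
    """
    return [
        (xa, xb, int(ya == yb))
        for (xa, ya), (xb, yb) in combinations(samples, 2)
    ]


if __name__ == "__main__":
    # Hypothetical auxiliary input (e.g. a payload size per kernel).
    print(min_max_normalize([64, 128, 256]))
    # Hypothetical labelled kernels: feature vector plus best device.
    kernels = [([0.1], "CPU"), ([0.9], "GPU"), ([0.2], "CPU"), ([0.8], "GPU")]
    print(len(kernels), "samples ->", len(make_pairs(kernels)), "pairs")
```

With four labelled samples the pairing yields six training pairs; the growth is quadratic, which matters most when the labelled dataset is small.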

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of each cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
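
The difference between the two counting schemes can be made concrete with a small sketch. The data layout, the function name `cocitation_counts`, and the choice to count two authors of the same cited work as co-cited are assumptions of this example, not details taken from the study.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations.

    citing_papers: one reference list per citing paper; each reference is
        an ordered list of author names.
    max_authors: how many leading authors of each cited work are counted
        (1 = traditional first-author counting, 5 = first-five counting).
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Every unordered pair of cited authors counts once per citing paper.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical citing papers with placeholder author names.
    papers = [
        [["A", "B"], ["C"]],
        [["A", "B"], ["D", "C"]],
    ]
    print("first author only:", dict(cocitation_counts(papers, max_authors=1)))
    print("first five authors:", dict(cocitation_counts(papers, max_authors=5)))
```

In this toy input, authors B and C never appear under first-author counting for the second paper, so pairs involving them are undercounted or missing; first-five counting recovers them.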

    A Method for Accelerated Simulations of Reinforcement Learning Tasks of UAVs in AirSim

    Reinforcement Learning (RL) is widely used for training Unmanned Aerial Vehicles (UAVs) that rely on complex perception information (e.g., camera, lidar). The achievable RL performance is affected by the time needed to learn from the direct interaction of the agent with the environment. AirSim is a widely used simulator for autonomous UAV research, and its photorealism makes it suitable for algorithms that use cameras to make or assist flight control decisions. This work aims to reduce the RL training time by reducing the simulation time step. This impairs simulation accuracy, so the impact on RL training must be quantitatively assessed. We characterise the impact of AirSim acceleration on RL training time and accuracy while performing an obstacle avoidance task in a UAV application. We observed that with an AirSim acceleration factor of 5x, the RL task performance degrades by 95%. The observed performance degradation is due to the latencies present in the AirSim command chain. We overcome this limitation by proposing a new command approach that allows acceleration up to 10x without performance degradation. When pushing the acceleration factor to the extreme (100x), the RL task performance degrades by 38% with a measured speed-up of 15x.

    ExaQuery: Proving Data Structure to Unstructured Telemetry Data in Large-Scale HPC

    High-performance computing (HPC) is the cornerstone of technological advancements in our digital age, but its management is becoming increasingly challenging, particularly as systems approach exascale. Operational data analytics (ODA) and holistic monitoring frameworks aim to alleviate this burden by collecting live telemetry from HPC systems. ODA frameworks rely on NoSQL databases for scalability, with implicit data structures embedded in metric names, so navigating the relations in telemetry data requires domain knowledge. To address the need for an explicit representation of relations in telemetry data, we propose a novel ontology for ODA, which we apply to a real HPC installation. The proposed ontology captures relationships between topological components and links hardware components (compute nodes, racks, systems) with job execution, allocations, and collected telemetry. This ontology forms the basis for constructing a knowledge graph, enabling graph queries for ODA. Moreover, we present a comparative analysis of the complexity (expressed in lines of code) and the domain knowledge requirement (qualitatively assessed by informed end-users) of implementing complex queries with the proposed method and with the NoSQL methods commonly employed in today's ODA frameworks. We focused on six queries informed by facility managers' daily operations, aiming to benefit not only facility managers but also system administrators and user support. Our comparative analysis demonstrates that the proposed ontology facilitates the implementation of complex queries with significantly fewer lines of code and less domain knowledge than NoSQL methods.
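
To illustrate why explicit relations simplify queries, the sketch below stores a few facts as subject-predicate-object triples and answers "which racks does a job touch?" by following edges instead of parsing metric names. The predicates (`contains`, `runsOn`), the entity names, and the helper functions are all hypothetical; they are not the ontology proposed in the paper.

```python
# Hypothetical triples (subject, predicate, object); names are illustrative only.
TRIPLES = [
    ("rack1", "contains", "node01"),
    ("rack1", "contains", "node02"),
    ("rack2", "contains", "node03"),
    ("job42", "runsOn", "node01"),
    ("job42", "runsOn", "node03"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def racks_used_by(triples, job):
    """Follow job -> node -> rack edges to find the racks a job runs in."""
    racks = set()
    for node in objects(triples, job, "runsOn"):
        racks |= subjects(triples, "contains", node)
    return racks


if __name__ == "__main__":
    print(sorted(racks_used_by(TRIPLES, "job42")))
```

The query needs no knowledge of how node or rack identifiers are encoded in metric names, only of the relation names, which is the kind of reduction in domain knowledge the abstract describes.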

    PM100: A Job Power Consumption Dataset of a Large-Scale HPC System

    The dataset is a collection of jobs extracted from the job_table data of M100 (https://doi.org/10.5281/zenodo.7588815), a collection of data extracted from a Tier-0 supercomputer hosted at CINECA (Marconi100, https://www.hpc.cineca.it/hardware/marconi100). The original job data in M100 are filtered to keep only the jobs that ran with exclusive access to their resources. Each job entry included in PM100 contains the power consumption of the job recorded at node level, CPU level and memory level. The final dataset contains 231,116 jobs, executed on Marconi100 between May and October 2020. The dataset is stored as a parquet file, where each entry contains the information on one job execution. The structure of the data, as well as the code to generate them, is contained in the official GitHub repository of the project: https://github.com/francescoantici/PM100-data/