Predicting system-level power for a hybrid supercomputer
For current High Performance Computing systems to scale towards the holy grail of ExaFLOP performance, their power consumption has to be reduced by at least one order of magnitude. This goal can be achieved only through a combination of hardware and software advances. Being able to model and accurately predict the power consumption of large computational systems is necessary for software-level innovations such as proactive and power-aware scheduling, resource allocation and fault tolerance techniques. In this paper we present a 2-layer model of power consumption for a hybrid supercomputer (which held the top spot of the Green500 list in July 2013) that combines CPU, GPU and MIC technologies to achieve higher energy efficiency. Our model takes as input workload information - the number and location of resources that are used by each job at a certain time - and calculates the resulting system-level power consumption. When jobs are submitted to the system, the workload configuration can be foreseen based on the scheduler policies, and our model can then be applied to predict the ensuing system-level power consumption. Additionally, alternative workload configurations can be evaluated from a power perspective and more efficient ones can be selected. Applications of the model include not only power-aware scheduling but also prediction of anomalous behavior.
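As an illustration of the kind of calculation the abstract describes, the following is a minimal sketch of a two-layer power estimate: the first layer computes per-node power from the resources in use, and the second aggregates over the system. The abstract does not say what the two layers are, so this split, the function names and all wattage values are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical per-resource power draw in watts; illustrative values only,
# not measurements from the system described in the abstract.
IDLE_NODE_W = 100.0
ACTIVE_W = {"cpu": 15.0, "gpu": 200.0, "mic": 180.0}


def node_power(used):
    """Layer 1 (assumed): power of one node from the resources in use on it."""
    return IDLE_NODE_W + sum(ACTIVE_W[kind] * n for kind, n in used.items())


def system_power(jobs, node_count):
    """Layer 2 (assumed): aggregate node-level power over the whole system.

    jobs is a list of (node_id, resource_kind, count) allocations.
    """
    per_node = defaultdict(lambda: defaultdict(int))
    for node_id, kind, count in jobs:
        per_node[node_id][kind] += count
    busy = sum(node_power(used) for used in per_node.values())
    idle = (node_count - len(per_node)) * IDLE_NODE_W
    return busy + idle


# Two candidate placements of the same work on a 4-node system.
workload = [(0, "cpu", 8), (0, "gpu", 2), (1, "mic", 1)]
packed = [(0, "cpu", 8), (0, "gpu", 2), (0, "mic", 1)]
print(system_power(workload, node_count=4))
print(system_power(packed, node_count=4))
```

With a purely additive model like this one, both placements yield the same total; a realistic model would need location-dependent terms (for example cooling or shared power supplies) for placement to matter.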
Cognified distributed computing
Cognification - the act of transforming ordinary objects or processes into their intelligent counterparts through Data Science and Artificial Intelligence - is a disruptive technology that has been revolutionizing disparate fields ranging from corporate law to medical diagnosis. Easy access to massive data sets, data analytics tools and High-Performance Computing (HPC) has been fueling this revolution. In many ways, cognification is similar to the electrification revolution that took place more than a century ago when electricity became a ubiquitous commodity that could be accessed with ease from anywhere in order to transform mechanical processes into their electrical counterparts. In this paper, we consider two particular forms of distributed computing - Data Centers and HPC systems - and argue that they are ripe for cognification of their entire ecosystem, ranging from top-level applications down to low-level resource and power management services. We present our vision for what 'Cognified Distributed Computing' might look like and outline some of the challenges that need to be addressed and new technologies that need to be developed in order to make it a reality. In particular, we examine the role cognification can play in tackling power consumption, resiliency and management problems in these systems. We describe intelligent software-based solutions to these problems powered by on-line predictive models built from streamed real-time data. While we cast the problem and our solutions in the context of large Data Centers and HPC systems, we believe our approach to be applicable to distributed computing in general. We believe that the traditional systems research agenda has much to gain by crossing discipline boundaries to include ideas and techniques from Data Science, Machine Learning and Artificial Intelligence.
BiDAl (Big Data Analyzer)
Modern data centers that provide Internet-scale services are stadium-size structures housing tens of thousands of heterogeneous devices (server clusters, networking equipment, power and cooling infrastructures) that must operate continuously and reliably. As part of their operation, these devices produce large amounts of data in the form of event and error logs that are essential not only for identifying problems but also for improving data center efficiency and management. These activities employ data analytics and often exploit hidden statistical patterns and correlations among different factors present in the data. Uncovering these patterns and correlations is challenging due to the sheer volume of data to be analyzed. BiDAl is a prototype “log-data analysis framework” that incorporates various Big Data technologies to simplify the analysis of data traces from large clusters. BiDAl is written in Java with a modular and extensible architecture so that different storage backends (currently, HDFS and SQLite are supported), as well as different analysis languages (the current implementation supports SQL, R and Hadoop MapReduce), can be easily selected as appropriate.
A machine learning approach to online fault classification in HPC systems
As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates at both the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults into an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.
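To make the idea of operating "on streamed data in an online manner" concrete, here is a minimal sketch that computes features over sliding windows of a metric stream and passes each window to a classifier as it completes. The window size, the chosen features and the threshold rule are all assumptions; in particular, `classify` is a placeholder standing in for a trained model, not the method used in the paper.

```python
from collections import deque
from statistics import mean, pstdev


def window_features(stream, size, step):
    """Yield (mean, std, last - first) for each sliding window of a metric stream."""
    window = deque(maxlen=size)
    for i, value in enumerate(stream, start=1):
        window.append(value)
        # Emit a feature vector once the window is full, then every `step` samples.
        if len(window) == size and (i - size) % step == 0:
            yield (mean(window), pstdev(window), window[-1] - window[0])


def classify(features, threshold=5.0):
    """Placeholder rule standing in for a trained classifier (assumption)."""
    _, std, _ = features
    return "fault" if std > threshold else "healthy"


# Hypothetical samples of one metric; the spike imitates an injected fault.
samples = [10, 11, 10, 12, 11, 40, 70, 12, 11, 10]
labels = [classify(f) for f in window_features(samples, size=4, step=2)]
print(labels)
```

Because each window depends only on the most recent samples, the same loop can consume an unbounded stream without storing its history.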
Antarex HPC Fault Dataset
The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. Acquiring our own dataset was made necessary by the fact that commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems.
In order to acquire data, we executed benchmark applications and at the same time injected faults into the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. A wide range of faults is covered in our dataset, from hardware faults, to misconfiguration faults, and finally to performance anomalies caused by interference from other processes. This was achieved through the FINJ fault injection tool, developed by the authors.
The dataset contains two types of data. The first is a series of CSV files, each containing a set of system performance metrics sampled through the LDMS HPC monitoring framework. The second is a set of log files detailing the status of the system (i.e., currently running benchmark applications or injected fault programs) at each time point in the dataset. Such a structure enables researchers to perform a wide range of studies on the dataset. Moreover, since we collected the dataset by streaming continuous data, any study based on it will easily be reproducible on a real HPC system, in an online way. The dataset is divided into two parts: the first includes only the CPU and memory-related benchmark applications and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks for 32 days of data acquisition, and 20 GB of uncompressed data.
For a detailed analysis of the structure and features of the Antarex dataset, please refer to the research paper "Online Fault Classification in HPC System through Machine Learning", by Netti et al. Additional details can be found in the research paper "FINJ: a Fault Injection Tool for HPC System" by Netti et al., whereas all source code can be found on the GitHub repository of the FINJ tool.
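The two data types described above (metric CSV files and system-status logs) are typically combined by timestamp before any learning step. The sketch below shows one way to do that. The column names, the log layout and the fault name `cpuhog` are hypothetical and chosen for illustration; the real files in the dataset may be organised differently.

```python
import csv
import io

# Hypothetical file contents; the real column names and log layout
# of the dataset may differ.
METRICS_CSV = "timestamp,cpu_user,mem_free\n100,0.20,900\n110,0.95,400\n120,0.25,880\n"
FAULT_LOG = "start,end,fault\n105,115,cpuhog\n"


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def label_samples(metrics, faults):
    """Attach to each metric sample the fault active at its timestamp, if any."""
    labeled = []
    for row in metrics:
        t = int(row["timestamp"])
        label = "healthy"
        for f in faults:
            # A fault covers the half-open interval [start, end).
            if int(f["start"]) <= t < int(f["end"]):
                label = f["fault"]
                break
        labeled.append((t, label))
    return labeled


labeled = label_samples(load_rows(METRICS_CSV), load_rows(FAULT_LOG))
print(labeled)
```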
Self managing monitoring for highly elastic large scale Cloud deployments
Infrastructure as a Service computing exhibits a number of properties that are not found in conventional server deployments. Elasticity is among the most significant of these properties and has wide-reaching implications for applications deployed in cloud-hosted VMs. Among the applications affected by elasticity is monitoring. In this paper we investigate the challenges of monitoring large cloud deployments and how these challenges differ from previous monitoring problems. In order to meet these unique challenges we propose Varanus, a highly scalable monitoring tool resistant to the effects of rapid elasticity. This tool breaks with many of the conventions of previous monitoring systems and leverages a multi-tier P2P architecture in order to achieve in situ monitoring without the need for dedicated monitoring infrastructure. We then evaluate Varanus against current monitoring architectures. We find that conventional monitoring tools perform acceptably for small, non-changing cloud deployments. However, in the case of large or highly elastic deployments, current tools perform unacceptably, incurring increased latencies, high load and slowed operation, necessitating that a new, alternative tool be used. Further, we demonstrate that Varanus maintains low-latency and low-resource monitoring state propagation at scale and during periods of high elasticity.
Coordination models and languages: semantics and expressiveness
PhD in Computer Science, 11th cycle. Coordinator: Ozalp Babaoglu. Advisor: Roberto Gorrieri. Consiglio Nazionale delle Ricerche - Biblioteca Centrale, P.le Aldo Moro 7, Rome; Biblioteca Nazionale Centrale, Piazza Cavalleggeri 1, Florence / CNR - Consiglio Nazionale delle Ricerche. SIGLE, IT, Ital
Performance evaluation of data locality exploitation
PhD in Computer Science, 11th cycle. Coordinator: Ozalp Babaoglu. Reviewer: Gilberto Filee. Consiglio Nazionale delle Ricerche - Biblioteca Centrale, P.le Aldo Moro 7, Rome; Biblioteca Nazionale Centrale, P.za Cavalleggeri 1, Florence / CNR - Consiglio Nazionale delle Ricerche. SIGLE, IT, Ital
The people's cloud
Peer-to-peer cloud computing could free us from the tyranny of data centers. Not long ago, any start-up hoping to create the next big thing on the Internet had to invest sizable amounts of money in computing hardware, network connectivity, real estate to house the equipment, and technical personnel to keep everything working 24/7. The inevitable delays in getting all this funded, designed, and set up could easily erase any competitive edge the company might have had at the outset. Today, the same start-up could have its product up and running in the cloud in a matter of days, if not hours, with zero up-front investment in servers and similar gear. And the company wouldn't have to pay for any more computing oomph than it needs at any given time, because most cloud-service providers allot computing resources dynamically according to actual demand.
