    Towards operator-less data centers through data-driven, predictive, proactive autonomics

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of jobs’ executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise have been wasted due to failures. We discuss the feasibility of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs).
    All of the scripts used for BigQuery and classification analyses are publicly available on GitHub.
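    The evaluation described in this abstract (capping the false positive rate, then reporting true positive rate and precision) can be illustrated with a minimal sketch. It is not the authors' code: the averaging rule in `ensemble_score`, the function names, and the toy data are all assumptions made for illustration, and the abstract does not say how the individual Random Forest outputs are combined.

```python
from statistics import mean


def ensemble_score(per_model_scores):
    """Combine the failure scores of several classifiers by simple averaging.

    Averaging is an assumption made for this sketch; other combination
    rules (voting, weighting) would fit the same interface.
    """
    return mean(per_model_scores)


def rates_at_threshold(scores, labels, threshold):
    """Return (fpr, tpr, precision) when predicting 'fail' for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    fpr = fp / negatives if negatives else 0.0
    tpr = tp / positives if positives else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return fpr, tpr, precision


def pick_threshold(scores, labels, max_fpr=0.05):
    """Return (threshold, fpr, tpr, precision) for the lowest threshold whose
    false positive rate stays within max_fpr, or None if no threshold does."""
    best = None
    # Lowering the threshold can only add predictions, so FPR never decreases.
    for t in sorted(set(scores), reverse=True):
        fpr, tpr, precision = rates_at_threshold(scores, labels, t)
        if fpr > max_fpr:
            break
        best = (t, fpr, tpr, precision)
    return best


# Hypothetical toy data: scores from three classifiers per node, 1 = node failed.
per_node_scores = [
    [0.9, 0.8, 0.7], [0.6, 0.7, 0.8], [0.2, 0.4, 0.3],
    [0.1, 0.1, 0.4], [0.5, 0.6, 0.4], [0.1, 0.2, 0.0],
]
labels = [1, 1, 0, 0, 1, 0]
scores = [ensemble_score(s) for s in per_node_scores]
print(pick_threshold(scores, labels, max_fpr=0.05))
```

    Fixing the false positive budget first and reading off the other two rates mirrors how the abstract reports its results; on real data the threshold would be chosen on a held-out set.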

    A data-driven approach to modeling power consumption for a hybrid supercomputer

    Power consumption of current High Performance Computing systems has to be reduced by at least one order of magnitude before they can be scaled up towards ExaFLOP performance. While we can expect novel hardware technologies and architectures to contribute towards this goal, significant advances have to come also from software technologies such as proactive and power-aware scheduling, resource allocation, and fault-tolerant computing. Development of these software technologies in turn relies heavily on our ability to model and accurately predict power consumption in large computing systems. In this paper, we present a data-driven model of power consumption for a hybrid supercomputer (which held the top spot in the Green500 ranking in June 2013) that combines CPU, GPU, and MIC technologies to achieve high levels of energy efficiency. Our model takes as input workload characteristics (the number and location of resources that are used by each job at a certain time) and calculates a predicted power consumption at the system level. The model is application-code-agnostic and is based solely on a data-driven predictive approach, where log data describing the past jobs in the system are employed to estimate future power consumption. For this, three different model components are developed and integrated. The first employs support vector regression to predict power usage for jobs before these are started. The second uses a simple heuristic to predict the length of jobs, again before they start. The two predictions are then combined to estimate power consumption due to the job at all computational elements in the system. The third component is a linear model that takes as input the power consumption at the computing units and predicts system-wide power consumption. Our method achieves highly accurate predictions starting solely from workload information and user histories.
    The model can be applied to power-aware scheduling and power capping: alternative workload dispatching configurations can be evaluated from a power perspective and more efficient ones can be selected. The methodology outlined here can be easily adapted to other HPC systems where the same types of log data are available.
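    How the three components in this abstract fit together can be sketched as follows. This is a rough illustration under stated assumptions, not the published model: the per-job power and length predictions are taken as given inputs (standing in for the support vector regression and the length heuristic, neither of which is reproduced here), the linear layer is a one-variable least-squares fit, and every name and number is hypothetical.

```python
def compute_unit_power(t, jobs):
    """Sum the predicted power of all jobs expected to be running at time t.

    jobs is a list of (start, predicted_length, predicted_power) tuples,
    i.e. the combined output of the first two components.
    """
    return sum(power for start, length, power in jobs
               if start <= t < start + length)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_system_power(t, jobs, slope, intercept):
    """Third component: map power at the computing units to system-wide power."""
    return slope * compute_unit_power(t, jobs) + intercept


# Hypothetical history: compute-unit power vs. measured system power (watts).
slope, intercept = fit_line([1000.0, 2000.0, 3000.0], [1900.0, 3100.0, 4300.0])

# Hypothetical queued jobs: (start time, predicted length, predicted power).
jobs = [(0, 10, 100.0), (4, 2, 50.0), (6, 3, 70.0)]
print(predict_system_power(5, jobs, slope, intercept))
```

    Because the prediction depends only on where and when jobs are placed, the same functions can score alternative dispatching plans before any job is started.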

    A Big Data analyzer for large trace logs

    The current generation of Internet-based services is typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors including computing hardware, multiple layers of intricate software, networking and storage devices, electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques to exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents Big Data analyzer (BiDAl), a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center.

    Predicting system-level power for a hybrid supercomputer

    For current High Performance Computing systems to scale towards the holy grail of ExaFLOP performance, their power consumption has to be reduced by at least one order of magnitude. This goal can be achieved only through a combination of hardware and software advances. Being able to model and accurately predict the power consumption of large computational systems is necessary for software-level innovations such as proactive and power-aware scheduling, resource allocation and fault tolerance techniques. In this paper we present a 2-layer model of power consumption for a hybrid supercomputer (which held the top spot on the Green500 list in June 2013) that combines CPU, GPU and MIC technologies to achieve higher energy efficiency. Our model takes as input workload information (the number and location of resources that are used by each job at a certain time) and calculates the resulting system-level power consumption. When jobs are submitted to the system, the workload configuration can be foreseen based on the scheduler policies, and our model can then be applied to predict the ensuing system-level power consumption. Additionally, alternative workload configurations can be evaluated from a power perspective and more efficient ones can be selected. Applications of the model include not only power-aware scheduling but also prediction of anomalous behavior.

    Power consumption modeling and prediction in a hybrid CPU-GPU-MIC supercomputer

    Power consumption is a major obstacle for High Performance Computing (HPC) systems in their quest towards the holy grail of ExaFLOP performance. Significant advances in power efficiency have to be made before this goal can be attained and accurate modeling is an essential step towards power efficiency by optimizing system operating parameters to match dynamic energy needs. In this paper we present a study of power consumption by jobs in Eurora, a hybrid CPU-GPU-MIC system installed at the largest Italian data center. Using data from a dedicated monitoring framework, we build a data-driven model of power consumption for each user in the system and use it to predict the power requirements of future jobs. We are able to achieve good prediction results for over 80% of the users in the system. For the remaining users, we identify possible reasons why prediction performance is not as good. Possible applications for our predictive modeling results include scheduling optimization, power-aware billing and system-scale power modeling. All the scripts used for the study have been made available on GitHub.

    Cognified distributed computing

    Cognification - the act of transforming ordinary objects or processes into their intelligent counterparts through Data Science and Artificial Intelligence - is a disruptive technology that has been revolutionizing disparate fields ranging from corporate law to medical diagnosis. Easy access to massive data sets, data analytics tools and High-Performance Computing (HPC) have been fueling this revolution. In many ways, cognification is similar to the electrification revolution that took place more than a century ago when electricity became a ubiquitous commodity that could be accessed with ease from anywhere in order to transform mechanical processes into their electrical counterparts. In this paper, we consider two particular forms of distributed computing - Data Centers and HPC systems - and argue that they are ripe for cognification of their entire ecosystem, ranging from top-level applications down to low-level resource and power management services. We present our vision for what 'Cognified Distributed Computing' might look like and outline some of the challenges that need to be addressed and new technologies that need to be developed in order to make it a reality. In particular, we examine the role cognification can play in tackling power consumption, resiliency and management problems in these systems. We describe intelligent software-based solutions to these problems powered by on-line predictive models built from streamed real-time data. While we cast the problem and our solutions in the context of large Data Centers and HPC systems, we believe our approach to be applicable to distributed computing in general. We believe that the traditional systems research agenda has much to gain by crossing discipline boundaries to include ideas and techniques from Data Science, Machine Learning and Artificial Intelligence

    A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q

    The complexity and cost of managing high-performance computing infrastructures are on the rise. Automating management and repair through predictive models to minimize human interventions is an attempt to increase system availability and contain these costs. Building predictive models that are accurate enough to be useful in automatic management cannot be based on restricted log data from subsystems but requires a holistic approach to data analysis from disparate sources. Here we provide a detailed multi-scale characterization study based on four datasets reporting power consumption, temperature, workload, and hardware/software events for an IBM Blue Gene/Q installation. We show that the system runs a rich parallel workload, with low correlation among its components in terms of temperature and power, but higher correlation in terms of events. As expected, power and temperature correlate strongly, while events display negative correlations with load and power. Power and workload show moderate correlations, and only at the scale of components. The aim of the study is a systematic, integrated characterization of the computing infrastructure and discovery of correlation sources and levels to serve as basis for future predictive modeling efforts.
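    The kind of pairwise comparison this abstract reports (strong, moderate, or negative correlation between two measured series) can be illustrated with a short sketch. The abstract does not state which correlation measure the study uses; the Pearson coefficient below is an assumption, and the sample series are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly samples for one component.
power_kw = [41.0, 44.5, 47.0, 52.0, 49.5]
temperature_c = [30.1, 31.0, 31.8, 33.2, 32.5]
event_count = [9, 7, 6, 2, 4]

print(pearson(power_kw, temperature_c))  # positive for this toy data
print(pearson(power_kw, event_count))    # negative for this toy data
```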

    A Big Data Analyzer for Large Trace Logs

    The current generation of Internet-based services is typically hosted in large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors including computing hardware, multiple layers of intricate software, networking and storage devices, electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques to exploit hidden statistical patterns and correlations that may be present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents BiDAl, a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture so that it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center.

    Self-Organizing Mechanisms for Task Allocation in a Knowledge-Based Economy

    A prevalent claim is that we are in a knowledge economy. When we talk about the knowledge economy, we generally mean the concept of a “knowledge-based economy”, indicating the use of knowledge and technologies to produce economic benefits. Hence knowledge is both tool and raw material (people’s skill) for producing some kind of product or service. In this kind of environment, economic organization is undergoing several changes. For example, authority relations are less important, legal and ownership-based definitions of the boundaries of the firm are becoming irrelevant, and there are only a few constraints on the set of coordination mechanisms. Hence what characterises a knowledge economy is the growing importance of human capital in productive processes (Foss, 2005) and the increasing knowledge intensity of jobs (Hodgson, 1999). Economic processes are also highly intertwined with social processes: they are likely to be informal and reciprocal rather than formal and negotiated. Another important point is the problem of the division of labor: as economic activity becomes mainly intellectual and requires the integration of specific and idiosyncratic skills, the task of dividing the job and assigning it to the most appropriate individuals becomes arduous, a “supervisory problem” (Hodgson, 1999) emerges, and traditional hierarchical control may prove increasingly ineffective. Not only does the specificity of know-how make it awkward to monitor the execution of tasks; more importantly, top-down integration of skills may be difficult because ‘the nominal supervisors will not know the best way of doing the job – or even the precise purpose of the specialist job itself – and the worker will know better’ (Hodgson, 1999). We therefore expect that the organization of the economic activity of specialists should be, at least partially, self-organized.
    The aim of this thesis is to bridge studies from computer science, and in particular from Peer-to-Peer (P2P) Networks, to organization theories. We think that the P2P paradigm fits well with organization problems related to all those situations in which a central authority is not possible. We believe that P2P Networks show a number of characteristics similar to firms working in a knowledge-based economy and hence that the methodology used for studying P2P Networks can be applied to organization studies. There are three main characteristics we think P2P Networks have in common with firms involved in the knowledge economy: (1) Decentralization: in a pure P2P system every peer is an equal participant, and there is no central authority governing the actions of the single peers; (2) Cost of ownership: P2P computing implies shared ownership, reducing the cost of owning the systems and the content, and the cost of maintaining them; (3) Self-Organization: the process in a system leading to the emergence of global order within the system without the presence of another system dictating this order. These characteristics are also present in the kind of firm that we try to address, and that is why we have carried over the techniques we adopted for studies in computer science (Marcozzi et al., 2005; Hales et al., 2007 [39]) to management science.

    Client side exploitation: il metodo BeEF (Client-side exploitation: the BeEF method)
