    Enriched View Synchrony: A Programming Paradigm for Partitionable Asynchronous Distributed Systems

    Distributed systems constructed using off-the-shelf communication infrastructures are becoming common vehicles for doing business in many important application domains. Large geographic extent due to increased globalization, increased probability of failures, and highly dynamic loads all contribute toward a partitionable and asynchronous characterization for these systems. In this paper, we consider the problem of developing reliable applications to be deployed in partitionable asynchronous distributed systems. What makes this task difficult is guaranteeing the consistency of shared state despite asynchrony, failures, and recoveries, including the formation and merging of partitions. While view synchrony within process groups is a powerful paradigm that can significantly simplify reasoning about asynchrony and failures, it is insufficient for coping with recoveries and merging of partitions after repairs. We first give an abstract characterization for shared state management in partitionable asynchronous distributed systems and then show how views can be enriched to convey structural and historical information relevant to the group's activity. The resulting paradigm, called enriched view synchrony, can be implemented efficiently and leads to a simple programming methodology for solving shared state management in the presence of partitions.

    Towards Data-Driven Autonomics in Data Centers

    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict if machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88% with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors’ website.

    Programming Partition-Aware Network Applications

    We consider the problem of developing reliable applications to be deployed in partitionable asynchronous distributed systems. What makes this task difficult is guaranteeing the consistency of shared state despite asynchrony, failures and recoveries, including the formation and merging of partitions. While view synchrony within process groups is a powerful paradigm that can significantly simplify reasoning about asynchrony and failures, it is insufficient for coping with recoveries and merging of partitions after repairs. We first give an abstract characterization for shared state management in partitionable asynchronous distributed systems and then show how views can be enriched to convey structural and historical information relevant to the group's activity. The resulting paradigm, called enriched view synchrony, can be implemented efficiently and leads to a simple programming methodology for solving shared state management in the presence of partitions.

    Constraint Programming-Based Job Dispatching for Modern HPC Applications

    HPC systems are increasingly being used for big data analytics and predictive model building that employ many short jobs. In these application scenarios, HPC job dispatchers need to process large numbers of short jobs quickly and make decisions on-line while ensuring high Quality-of-Service (QoS) levels and meeting demanding timing requirements. Constraint Programming (CP) is an effective approach for tackling job dispatching problems. Yet, the state-of-the-art CP-based job dispatchers are unable to meet the challenges of on-line dispatching or to take advantage of job duration predictions. These limitations jeopardize achieving high QoS levels, and consequently impede the adoption of CP-based dispatchers in HPC systems. We propose a class of CP-based dispatchers that are more suitable for HPC systems running modern applications. The new dispatchers are able to reduce the time required for generating on-line dispatching decisions significantly, and are able to make effective use of job duration predictions to decrease waiting times and job slowdowns, especially for workloads dominated by short jobs.

    Integrating Agent Communication Languages in

    2002-6 Towards a Semantic Web for Formal Mathematics (Ph.D. Thesis), Schena, I., March 2002.
    2002-7 Revisiting Interactive Markov Chains, Bravetti, M., June 2002.
    2002-8 User Untraceability in the Next-Generation Internet: a Proposal, Tortonesi, M., Davoli, R., August 2002.
    2002-9 Towards Adaptive, Resilient and Self-Organizing Peer-to-Peer Systems, Montresor, A., Meling, H., Babaoglu, O., September 2002.
    2002-10 Towards Self-Organizing, Self-Repairing and Resilient Distributed Systems, Montresor, A., Babaoglu, O., Meling, H., September 2002 (Revised November 2002).
    2002-11 Messor: Load-Balancing through a Swarm of Autonomous Agents, Montresor, A., Meling, H., Babaoglu, O., September 2002.
    2002-12 Johanna: Open Collaborative Technologies for Teleorganizations, Gaspari, M., Picci, L., Petrucci, A., Faglioni, G., December 2002.

    A machine learning approach to online fault classification in HPC systems

    As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults into an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

    BiDAl: Big Data Analyzer for Cluster Traces

    Modern data centers that provide Internet-scale services are stadium-size structures housing tens of thousands of heterogeneous devices (server clusters, networking equipment, power and cooling infrastructures) that must operate continuously and reliably. As part of their operation, these devices produce large amounts of data in the form of event and error logs that are essential not only for identifying problems but also for improving data center efficiency and management. These activities employ data analytics and often exploit hidden statistical patterns and correlations among different factors present in the data. Uncovering these patterns and correlations is challenging due to the sheer volume of data to be analyzed. This paper presents BiDAl, a prototype “log-data analysis framework” that incorporates various Big Data technologies to simplify the analysis of data traces from large clusters. BiDAl is written in Java with a modular and extensible architecture so that different storage backends (currently, HDFS and SQLite are supported), as well as different analysis languages (the current implementation supports SQL, R and Hadoop MapReduce), can be easily selected as appropriate. We present the design of BiDAl and describe our experience using it to analyze several public traces of Google data clusters for building a simulation model capable of reproducing observed behavior.

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
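
    The two counting rules contrasted in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual procedure: the function name `cocitation_counts`, the `max_authors` parameter, the toy author names, and the choice to count each author pair at most once per citing paper (including co-authors of the same cited work) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count co-cited author pairs across citing papers (hypothetical helper).

    citing_papers: list of reference lists; each reference is a list of
    author names in byline order.
    max_authors: how many leading authors of each cited work are credited
    (1 = first-author counting, 5 = first-five-authors counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Authors credited by this citing paper under the chosen rule.
        credited = set()
        for authors in references:
            credited.update(authors[:max_authors])
        # Assumption: each unordered pair counts once per citing paper.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts


# Toy data, invented for illustration: two citing papers.
papers = [
    [["Adams", "Baker"], ["Chen"]],
    [["Adams", "Diaz"], ["Baker", "Chen"]],
]

first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)

print(len(first_only), len(first_five))  # prints "2 6"
```

    On this toy input, first-author counting links only two author pairs, while crediting up to five authors per cited work links six, which shows why the two rules can produce different author groupings from the same citation data.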