Cognified distributed computing
Cognification - the act of transforming ordinary objects or processes into their intelligent counterparts through Data Science and Artificial Intelligence - is a disruptive technology that has been revolutionizing disparate fields ranging from corporate law to medical diagnosis. Easy access to massive data sets, data analytics tools and High-Performance Computing (HPC) has been fueling this revolution. In many ways, cognification is similar to the electrification revolution that took place more than a century ago, when electricity became a ubiquitous commodity that could be accessed with ease from anywhere in order to transform mechanical processes into their electrical counterparts. In this paper, we consider two particular forms of distributed computing - Data Centers and HPC systems - and argue that they are ripe for cognification of their entire ecosystem, ranging from top-level applications down to low-level resource and power management services. We present our vision for what 'Cognified Distributed Computing' might look like and outline some of the challenges that need to be addressed and new technologies that need to be developed in order to make it a reality. In particular, we examine the role cognification can play in tackling power consumption, resiliency and management problems in these systems. We describe intelligent software-based solutions to these problems powered by on-line predictive models built from streamed real-time data. While we cast the problem and our solutions in the context of large Data Centers and HPC systems, we believe our approach to be applicable to distributed computing in general. We believe that the traditional systems research agenda has much to gain by crossing discipline boundaries to include ideas and techniques from Data Science, Machine Learning and Artificial Intelligence.
Comparison of Machine Learning Classifiers on Integrated Transcriptomic Data
Omics data are being generated for different conditions, and can be a valuable resource for building novel predictive models for medical diagnosis. Given the small number of samples in each dataset, the application of Machine Learning (ML) models requires data integration. At the same time, multiple ML models are available, and the best option for data integration is not known. These challenges have typically been addressed in restricted settings, i.e., for one single disease at a time. However, a thorough comparison of models on integrated data, for different conditions, is still missing. In this paper we compare 7 classifiers on integrated data for 6 diseases, over 14 datasets. We compared the models on single and integrated datasets, employing different pre-processing techniques. We also evaluated the effect of feature selection, analyzing the robustness and relevance of the features extracted. We observed that, even if integration slightly reduces predictive power, the models are still able to produce good classifications. When testing generalization abilities on new datasets, the performance sometimes decreases drastically, depending on the disease studied.
Antarex HPC Fault Dataset
The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. Acquiring our own dataset was made necessary by the fact that commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems.
In order to acquire data, we executed benchmark applications and at the same time injected faults in the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. A wide range of faults is covered in our dataset, from hardware faults, to misconfiguration faults, and finally to performance anomalies caused by interference from other processes. This was achieved through the FINJ fault injection tool, developed by the authors.
The dataset contains two types of data. The first is a series of CSV files, each containing a set of system performance metrics sampled through the LDMS HPC monitoring framework. The second is a set of log files detailing the status of the system (i.e., currently running benchmark applications or injected fault programs) at each time point in the dataset. Such a structure enables researchers to perform a wide range of studies on the dataset. Moreover, since we collected the dataset by streaming continuous data, any study based on it will easily be reproducible on a real HPC system, in an online way. The dataset is divided into two parts: the first includes only the CPU and memory-related benchmark applications and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks for 32 days of data acquisition, and 20 GB of uncompressed data.
For a detailed analysis on the structure and features of the Antarex dataset, please refer to the research paper "Online Fault Classification in HPC System through Machine Learning", by Netti et al. Additional details can be found in the research paper "FINJ: a Fault Injection Tool for HPC System" by Netti et al., whereas all source code can be found on the GitHub repository of the FINJ tool.
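As a rough illustration of how the two data types described above could be combined, the sketch below labels metric samples with the fault that was active when each sample was taken. It is not the authors' tooling: the function name, the (start, end, label) interval layout, the label strings and the assumption that fault intervals do not overlap are all hypothetical, and the real CSV and log formats should be taken from the dataset documentation.

```python
import bisect

def label_samples(sample_times, fault_intervals, default="healthy"):
    """Return one label per sample timestamp.

    sample_times    -- timestamps of the metric samples (hypothetical layout)
    fault_intervals -- (start, end, label) tuples taken from the status logs;
                       assumed non-overlapping, with `end` exclusive
    """
    intervals = sorted(fault_intervals)
    starts = [start for start, _, _ in intervals]
    labels = []
    for t in sample_times:
        # Index of the last interval that starts at or before t, or -1.
        i = bisect.bisect_right(starts, t) - 1
        if i >= 0 and t < intervals[i][1]:
            labels.append(intervals[i][2])
        else:
            labels.append(default)
    return labels

# Hypothetical timestamps and fault names, for illustration only.
samples = [0, 5, 10, 15, 20]
faults = [(4, 11, "fault_A"), (18, 25, "fault_B")]
labels = label_samples(samples, faults)
```

The labelled samples could then serve as training data for a fault classifier; because each label depends only on the sample's own timestamp, the same lookup would also work on streamed data.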
Cohesion, consensus and extreme information in opinion dynamics
Opinion formation is an important element of social dynamics. It has been widely studied in recent years with tools from physics, mathematics and computer science. Here, a continuous model of opinion dynamics for multiple possible choices is analyzed. Its main features are the inclusion of disagreement and the possibility of modulating external information/media effects, both from one and multiple sources. The interest is in identifying the effect of the initial cohesion of the population, the interplay between cohesion and media extremism, and the effect of using multiple external sources of information that can influence the system. Final consensus, especially with the external message, depends highly on these factors, as numerical simulations show. When no external input is present, consensus or segregation is determined by the initial cohesion of the population. Interestingly, when only one external source of information is present, consensus can be obtained, in general, only when this is extremely neutral, i.e., there is not a single opinion strongly promoted, or in the special case of a large initial cohesion and low exposure to the external message. On the contrary, when multiple external sources are allowed, consensus can emerge with one of them even when this is not extremely neutral, i.e., it carries a strong message, for a large range of initial conditions.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
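The difference between the two counting schemes can be made concrete with a minimal sketch. It assumes one common definition of co-citation, not necessarily the exact one used in the study: two authors are co-cited once for every citing paper whose reference list contains works by both. The function name, the `max_authors` parameter and the sample author names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a collection of citing papers.

    citing_papers -- one reference list per citing paper; each cited work
                     is a list of author names in byline order
    max_authors   -- how many leading authors of each cited work count
                     (1 = first-author counting, 5 = first five authors)
    """
    counts = Counter()
    for references in citing_papers:
        # Authors this paper cites, under the chosen counting scheme.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered author pair is counted once per citing paper.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

# Hypothetical data: two citing papers, each citing two works.
papers = [
    [["Smith", "Jones"], ["Lee"]],
    [["Garcia", "Jones"], ["Lee", "Chen"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, Jones is never a first author and so does not appear under first-author counting, whereas first-five counting picks up the pair (Jones, Lee) in both citing papers. Note that this simplified version also counts co-authors of the same cited work as co-cited, which a real analysis may prefer to exclude.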
