Dependability evaluation of middleware technology for large-scale distributed caching
Distributed caching systems (e.g., Memcached) are widely used by service providers to satisfy accesses by millions of concurrent clients. Given their large scale, modern distributed systems rely on a middleware layer to manage caching nodes, to make applications easier to develop, and to apply load balancing and replication strategies. In this work, we performed a dependability evaluation of three popular middleware platforms, namely Twemproxy by Twitter, Mcrouter by Facebook, and Dynomite by Netflix, to assess availability and performance under faults, including failures of Memcached nodes and congestion due to unbalanced workloads and network link bandwidth bottlenecks. We point out the different availability and performance trade-offs achieved by the three platforms, and scenarios in which a few faulty components cause cascading failures of the whole distributed system.
Towards Cognitive Security Defense from Data
IT organizations rely on a variety of independent security monitors and data sources to develop situational awareness for detecting and responding to security incidents. In spite of the advances in Security Information and Event Management (SIEM) for handling monitoring data in production environments, computer defense still depends on many cognitive human processes. In this context, having machines do part of the cognitive work in lieu of humans is by now a real necessity. We present our framework towards the vision of cognitive SIEM, its building components, and ongoing work on the topic.
DRACO: Distributed Resource-aware Admission Control for large-scale, multi-tier systems
Modern distributed systems are designed to manage overload conditions by using overload control techniques to throttle the excess traffic that cannot be served. However, the adoption of large-scale NoSQL datastores makes systems vulnerable to unbalanced overloads, where specific datastore nodes are overloaded because of hot-spot resources and hogs. In this paper, we propose DRACO, a novel overload control solution that is aware of data dependencies between the application and the datastore tiers. DRACO performs selective admission control of application requests, by only dropping the ones that map to resources on overloaded datastore nodes, while achieving high resource utilization on non-overloaded datastore nodes. We evaluate DRACO on two case studies with high availability and performance requirements, a virtualized IP Multimedia Subsystem and a distributed fileserver. Results show that the solution can achieve high performance and resource utilization even under extreme overload conditions, up to 100x the engineered capacity.
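The selective admission idea in this abstract can be illustrated with a minimal Python sketch. This is not DRACO's implementation: the hash-based key placement, the `node_for` and `admit` helpers, the node names, and the 0.9 utilization threshold are all assumptions made for illustration. The sketch only shows the principle of rejecting requests whose target datastore node is overloaded while admitting the rest.

```python
import hashlib

def node_for(key: str, nodes: list[str]) -> str:
    """Map a resource key to a datastore node.

    Hash-modulo placement is an assumption; a real datastore would use
    its own partitioning scheme (e.g. consistent hashing).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def admit(key: str, nodes: list[str], load: dict[str, float],
          threshold: float = 0.9) -> bool:
    """Admit a request unless the node holding its resource is overloaded.

    `load` maps node name to utilization in [0, 1]; `threshold` is a
    hypothetical overload cutoff, not a value taken from the paper.
    """
    return load[node_for(key, nodes)] < threshold

if __name__ == "__main__":
    nodes = ["db-0", "db-1", "db-2"]                    # hypothetical node names
    load = {"db-0": 0.40, "db-1": 0.97, "db-2": 0.55}   # db-1 is a hot spot
    for user in ("alice", "bob", "carol", "dave"):
        verdict = "admit" if admit(user, nodes, load) else "drop"
        print(user, "->", node_for(user, nodes), verdict)
```

Only requests that map to the overloaded node are dropped, so the remaining nodes keep serving traffic at their normal utilization.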
ThorFI: a Novel Approach for Network Fault Injection as a Service
In this work, we present a novel fault injection solution (ThorFI) for virtual networks in cloud computing infrastructures. ThorFI is designed to provide non-intrusive fault injection capabilities for a cloud tenant, and to isolate injections from interfering with other tenants on the infrastructure. We present the solution in the context of the OpenStack cloud management platform, and release this implementation as open-source software. Finally, we present two relevant case studies of ThorFI, on an NFV IMS and on a high-availability cloud application, respectively. The case studies show that ThorFI can enhance functional tests with fault injection, as in 4%–34% of the test cases the IMS is unable to handle faults; and that despite redundancy in virtual networks, faults in one virtual network segment can propagate to other segments, and can affect the throughput and response time of the cloud application as a whole, by about 3 times in the worst case.
Towards Lightweight Temporal and Fault Isolation in Mixed-Criticality Systems with Real-Time Containers
A Comparative Analysis of Software Aging in Image Classifiers on Cloud and Edge
Image classifiers for recognizing real-world objects are widely used in the Internet of Things (IoT) and Cyber-Physical Systems (CPSs). A classifier is trained offline by machine learning algorithms with training data sets, and then it is deployed on a cloud or an edge computing system for online label predictions. As the classifier's performance depends on the underlying software infrastructure, it may degrade over time due to software faults causing software aging. In this paper, we address this issue and experimentally investigate software aging observed in an image classification system that continuously runs on cloud and edge computing environments. We apply several statistical techniques to analyze degradation trends in the systems under stress tests. Our statistical trend analysis confirms the degradation trends in the throughput as well as the available memory resources both in the cloud and the edge environments. Contrary to our expectation, the edge computing environment under test showed much less performance degradation than our cloud environment when the workload is high, although the latter has four times more allocated memory. We also show that the observed performance degradation trends are associated with the memory usage of specific processes by performing correlation analysis.
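The abstract does not say which statistical techniques were applied. As an illustration only, the sketch below uses the Mann-Kendall test, a nonparametric trend test often applied to resource-usage time series; choosing it here is an assumption, and the `mann_kendall` function name and the sample readings are hypothetical. The implementation uses the normal approximation and omits the tie correction for brevity.

```python
import math

def mann_kendall(series):
    """Mann-Kendall trend test (normal approximation, no tie correction).

    Returns (S, Z, p) where S is the sum of pairwise signs, Z the
    continuity-corrected standard score, and p the two-sided p-value.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    # S counts increasing pairs minus decreasing pairs.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s += (series[j] > series[i]) - (series[j] < series[i])
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p

if __name__ == "__main__":
    # Hypothetical free-memory readings (MB) sampled during a stress test.
    free_memory = [512, 508, 509, 501, 497, 498, 490, 486, 481, 479]
    s, z, p = mann_kendall(free_memory)
    print(f"S={s}, Z={z:.2f}, p={p:.4f}")
```

A negative Z with a small p-value would suggest a decreasing trend in the sampled metric, which is the kind of evidence usually read as software aging.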
Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems
Cloud computing systems fail in complex and unexpected ways, due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are as yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.
A comprehensive study on software aging across Android versions and vendors
This paper analyzes the phenomenon of software aging – namely, the gradual performance degradation and resource exhaustion in the long run – in the Android OS. The study intends to highlight whether, and to what extent, devices from different vendors, under various usage conditions and configurations, are affected by software aging and which parts of the system are the main contributors. The results demonstrate that software aging systematically causes a gradual loss of responsiveness perceived by the user, and an unjustified depletion of physical memory. The analysis reveals differences in the aging trends due to the workload factors and to the type of running applications, as well as differences due to vendors’ customization. Moreover, we analyze several system-level metrics to trace back the software aging effects to their main causes. We show that bloated Java containers are a significant contributor to software aging, and that it is feasible to mitigate aging through a micro-rejuvenation solution at the container level.
Automating the correctness assessment of AI-generated code for security contexts
Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves as a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results of the evaluation with different baseline solutions, including output similarity metrics, widely used in the field, and the well-known ChatGPT, the AI-powered language model developed by OpenAI. Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to the human-based evaluation, which is considered the ground truth for the assessment in the field. Moreover, ACCA has a very strong correlation with the human evaluation (Pearson's correlation coefficient r=0.84 on average). Finally, since it is a fully automated solution that does not require any human intervention, the proposed method performs the assessment of every code snippet in ∼0.17 s on average, which is well below the average time required by human analysts to manually inspect the code, based on our experience.
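For readers unfamiliar with the reported statistic, the following minimal Python sketch shows how a Pearson correlation coefficient between automated and human correctness verdicts could be computed. The `pearson_r` helper and the 0/1 verdict lists are hypothetical illustrations, not data or code from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (norm_x * norm_y)

if __name__ == "__main__":
    # Hypothetical verdicts: 1 = snippet judged correct, 0 = judged incorrect.
    automated = [1, 1, 0, 1, 0, 0]
    human = [1, 1, 0, 0, 0, 0]
    print(f"r = {pearson_r(automated, human):.2f}")
```

Values of r close to 1 indicate that the automated verdicts rise and fall with the human ones; the coefficient alone does not show which individual snippets the two evaluations disagree on.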
- …
