1,721,123 research outputs found
Mélange: Multi-tenant scheduling with adaptive eviction for graph processing clusters
Multi-tenancy is an important approach to resource consolidation in cluster management. In this thesis we design and evaluate Mélange, an efficient multi-tenant scheduler targeted towards graph processing jobs. Mélange supports job priorities and eviction, while attempting to avoid starvation. We propose novel ways of exploiting domain-specific knowledge to achieve better scheduling decisions for graph processing jobs. We evaluate static eviction policies and design Mélange to adapt to the cluster and job state at run time to reduce overhead costs during eviction. We have developed Mélange as a cross-layer scheduler built over Apache Giraph and YARN, and show experimental results with synthetic as well as production workloads.Submission published under a 24 month embargo labeled 'U of I Access', the embargo will last until 2020-05-01The student, Jayasi Mehar, accepted the attached license on 2018-04-23 at 16:28.The student, Jayasi Mehar, submitted this Thesis for approval on 2018-04-23 at 17:20.This Thesis was approved for publication on 2018-04-24 at 15:17.DSpace SAF Submission Ingestion Package generated from Vireo submission #12434 on 2018-08-31 at 17:21:18Made available in DSpace on 2018-09-04T20:36:52Z (GMT). No. of bitstreams: 2
MEHAR-THESIS-2018.pdf: 2131888 bytes, checksum: 3d6639d87f3c3efbb1bee31e21a20dda (MD5)
LICENSE.txt: 4209 bytes, checksum: 53175c3bd8e036182ec8cdbe3a27034e (MD5)
Previous issue date: 2018-04-24Embargo set by: Seth Robbins for item 107296
Lift date: 2020-09-04T20:37:00Z
Reason: Author requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemEmbargo set by: Seth Robbins for item 107296
Lift date: 2020-09-04T20:42:08Z
Reason: Author requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemU of I Only Restriction Lifted for Item 107296 on 2020-09-05T09:15:09Z
Adaptive control for availability and consistency in distributed key-values stores
The CAP theorem says that distributed key-value stores can only provide bounded consistency (C) and availability (A) under the presence of partition (P). Recent work has proposed the ability for applications of such stores to specify either an availability SLA or a consistency SLA. In this paper, we propose an adaptive algorithm that automatically controls the underlying storage system in real-time to meet such an SLA while optimizing the other C/A metric. We also present an implementation of the algorithm based on the popular key-value store Riak. Our experiments with the modified system, under realistic workloads, show that the control technique is able to change the system’s configurations to quickly and stably satisfy the SLAs.Item withdrawn by Laura Spradlin ([email protected]) on 2014-12-09T16:37:00Z
Item was in collections:
University of Illinois Theses & Dissertations (ID: 1)
No. of bitstreams: 1
CanhSon_NguyenBa.pdf: 950628 bytes, checksum: 6fa85ced91c072b5a6e905a8e6ee3594 (MD5)Made available in DSpace on 2015-01-21T19:59:27Z (GMT). No. of bitstreams: 1
Canh Son_Nguyen Ba.pdf: 950424 bytes, checksum: 2f2180bc253219320a79cc8806fa6dfd (MD5)Embargo set by: Seth Robbins for item 73282
Lift date: 2017-01-21T19:59:39Z
Reason: Author requested closed access (OA after 2yrs) in Vireo ETD systemLimited Restriction Lifted for Item 73282 on 2017-01-22T10:15:43Z
Topology-aware distributed graph processing for tightly-coupled clusters
Cloud applications have burgeoned over the last few years, but they are typically written for loosely-coupled clusters such as datacenters. In this thesis we investigate how one can run cloud applications in tightly-coupled clusters and network topologies, namely super-computers. Specifically, we look at a class of distributed machine learning systems called distributed graph processing systems, and run them on NCSA Blue Waters. Partitioning the graph is key to achieving performance in distributed graph processing systems. We present new topology-aware partitioning techniques that better exploit the structure of the network topologies in supercomputers. Compared to existing work, our new Restricted Oblivious and Grid Centroid partitioning approaches produce 25-33% improvement in makespan, along with a sizable reduction in network traffic. We also discuss optimizations such as smart network buffers that further amplify the improvement. To help operators select the best graph partitioning technique, we culminate our experimental results into a decision tree.Submission published under a 24 month embargo labeled 'U of I Access', the embargo will last until 2020-05-01The student, Mayank Bhatt, accepted the attached license on 2018-04-23 at 17:13.The student, Mayank Bhatt, submitted this Thesis for approval on 2018-04-23 at 17:20.This Thesis was approved for publication on 2018-04-24 at 15:21.DSpace SAF Submission Ingestion Package generated from Vireo submission #12435 on 2018-08-31 at 17:21:19Made available in DSpace on 2018-09-04T20:36:52Z (GMT). No. of bitstreams: 2
BHATT-THESIS-2018.pdf: 1415794 bytes, checksum: e08311d8168967b2e47baf1ef67f7fdc (MD5)
LICENSE.txt: 4209 bytes, checksum: b810a770b0873fc45062dd7e9ce83fde (MD5)
Previous issue date: 2018-04-24Embargo set by: Seth Robbins for item 107297
Lift date: 2020-09-04T20:37:00Z
Reason: Author requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemEmbargo set by: Seth Robbins for item 107297
Lift date: 2020-09-04T20:42:08Z
Reason: Author requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemU of I Only Restriction Lifted for Item 107297 on 2020-09-05T09:15:32Z
An experimental comparison of partitioning strategies in distributed graph processing
In this thesis, we study the problem of choosing among partitioning strategies in distributed graph processing systems.
To this end, we evaluate and characterize both the performance and resource usage of different partitioning strategies under various popular distributed graph processing systems, applications, input graphs, and execution environments.
Through our experiments, we found that no single partitioning strategy is the best fit for all situations, and that the choice of partitioning strategy has a significant effect on resource usage and application run-time.
Our experiments demonstrate that the choice of partitioning strategy depends on (1) the degree distribution of input graph, (2) the type and duration of the application, and (3) the cluster size.
Based on our results, we present rules of thumb to help users pick the best partitioning strategy for their particular use cases. We present results from each system, as well as from all partitioning strategies implemented in two common systems (PowerLyra and GraphX).Submission published under a 24 month embargo labeled 'U of I Access', the embargo will last until 2019-05-01The student, Shiv Verma, accepted the attached license on 2017-04-17 at 19:28.The student, Shiv Verma, submitted this Thesis for approval on 2017-04-17 at 19:40.This Thesis was approved for publication on 2017-04-24 at 09:06.DSpace SAF Submission Ingestion Package generated from Vireo submission #10830 on 2017-08-10 at 15:05:59Made available in DSpace on 2017-08-10T20:33:03Z (GMT). No. of bitstreams: 2
VERMA-THESIS-2017.pdf: 1176883 bytes, checksum: e49f2de22c65fd67d96121626f710849 (MD5)
LICENSE.txt: 4207 bytes, checksum: eb422c7d45cb7c49bb2212e387d9fcaf (MD5)
Previous issue date: 2017-04-24Embargo set by: Colleen Fallaw for item 102777
Lift date: 2019-08-10T21:27:21Z
Reason: Author requested U of Illinois access only (OA after 2yrs) in Vireo ETD systemU of I Only Restriction Lifted for Item 102777 on 2019-08-11T09:15:39Z
Henge: An intent-driven scheduler for multi-tenant stream processing
This thesis presents Henge, a system that supports intent-based multi-tenancy in modern stream processing applications. Henge supports multi-tenancy as a first-class citizen: everyone inside an organization can now submit their stream processing jobs to a single, shared, consolidated cluster. Additionally, Henge allows each tenant (job) to specify its own intents (i.e., requirements) as a Service Level Objective (SLO) that captures latency and/or throughput. In a multi-tenant cluster, the Henge scheduler adapts continually to meet jobs’ SLOs in spite of limited cluster resources, and under dynamic input workloads. SLOs are soft and are based on utility functions. Henge continually tracks SLO satisfaction, and when jobs miss their SLOs, it wisely navigates the state space to perform resource allocations in real time, maximizing total system utility achieved by all jobs in the system. Henge is integrated in Apache Storm and the thesis presents experimental results, using both production topologies and real datasets.Submission published under a 24 month embargo labeled 'Closed Access', the embargo will last until 2019-12-01The student, Faria Kalim, accepted the attached license on 2017-12-04 at 13:41.The student, Faria Kalim, submitted this Thesis for approval on 2017-12-04 at 13:51.This Thesis was approved for publication on 2017-12-04 at 16:54.DSpace SAF Submission Ingestion Package generated from Vireo submission #11818 on 2018-03-13 at 10:37:19Made available in DSpace on 2018-03-13T17:35:43Z (GMT). No. of bitstreams: 2
KALIM-THESIS-2017.pdf: 3149824 bytes, checksum: 1a1d9956c624f4aa3d379806d5627319 (MD5)
LICENSE.txt: 4208 bytes, checksum: 12a2f1944760a6e0d6e4e380714f485b (MD5)
Previous issue date: 2017-12-04Embargo set by: Seth Robbins for item 105472
Lift date: 2020-03-13T17:36:05Z
Reason: Author requested closed access (OA after 2yrs) in Vireo ETD systemLimited Restriction Lifted for Item 105472 on 2020-03-14T09:15:12Z
Efficient on -Demand Operations in Large-Scale Infrastructures
This dissertation discusses several on-demand operations, challenges associated with them, and system designs that meet these challenges. Specifically, we design and implement techniques for (1) on-demand group monitoring that allows users and administrators of an infrastructure to query and aggregate the up-to-date state of the machines (e.g., CPU utilization) in one or multiple groups, (2) on-demand storage for intermediate data generated by dataflow programming paradigms running in clouds, (3) on-demand Grid scheduling that makes worker-centric scheduling decisions based on the current availability of compute nodes, and (4) on-demand key/value pair lookup that is overlay-independent and perturbation-resistant. We evaluate these on-demand operations using large-scale simulations with traces gathered from real systems, as well as via deployments over real testbeds such as Emulab and PlanetLab.Made available in DSpace on 2015-09-25T20:20:43Z (GMT). No. of bitstreams: 2
license.txt: 4848 bytes, checksum: 96035ab3f5e1c23cc7138a224ce498bd (MD5)
3392096.pdf: 2147686 bytes, checksum: d99febba2daedbf68bc67a735798a1b2 (MD5)
Previous issue date: 2009Embargo set by: Seth Robbins for item 83141
Lift date: Forever
Reason: Restricted to the U of I community idenfinitely during batch ingest of legacy ETDsRestricted to the U of I community idenfinitely during batch ingest of legacy ETDsU of I Only120 p.Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2009
Exploration of fault tolerance in Apache Spark
This thesis provides an exploration of two techniques for solving fault tolerance for batch processing in Apache Spark. We evaluate the benefits and challenges of these approaches.
Apache Spark is a cluster computing system comprised of three main components: the driver program, the cluster manager, and the worker nodes. Spark already tolerates the loss of worker nodes, and other external tools already provide fault tolerance solutions for the cluster manager. For example, the cluster manager deployed using Apache Mesos provides fault tolerance to the cluster manager. Spark does not support driver fault tolerance for batch processing. The driver program stores critical state of the running job by maintaining oversight of the workers; failure of the driver program always results in loss of all oversight over the worker nodes and is equivalent to catastrophic failure of the entire Spark application.
In this thesis, we explore two approaches to achieve fault tolerance in Apache Spark for batch processing, enabling promised execution of long-running critical jobs and consistent performance while still supporting high uptime. The first approach serializes critical state of the driver program and relay that state to passive processors. Upon failure, this state is loaded by a secondary processor and computation is resumed. The second approach narrows the scope of the problem and synchronizes block information between primary and secondary drivers so that locations of cached aggregated data is not lost after primary driver failure. Loss of these locations leads to a state from which computation cannot be resumed. Both approaches propose considerable changes to the Apache Spark architecture in order to support high availability of batch processing jobs.Submission original under an indefinite embargo labeled 'Open Access'. The submission was exported from vireo on 2018-03-13 without embargo termsThe student, Akshun Gupta, accepted the attached license on 2017-12-05 at 17:53.The student, Akshun Gupta, submitted this Thesis for approval on 2017-12-05 at 18:00.This Thesis was approved for publication on 2017-12-06 at 15:19.DSpace SAF Submission Ingestion Package generated from Vireo submission #11876 on 2018-03-13 at 10:11:18Made available in DSpace on 2018-03-13T15:49:02Z (GMT). No. of bitstreams: 2
GUPTA-THESIS-2017.pdf: 1301887 bytes, checksum: a78b5bda4d2d5007bb36ac04a8dbceed (MD5)
LICENSE.txt: 4209 bytes, checksum: 16e4ae78760f2aca14fc3dd48af24c86 (MD5)
Previous issue date: 2017-12-0
GNuggies: A proposal for hosting resilient stateless services using untrusted nodes
This thesis outlines a proposal for a serverless cloud compute system hosted on untrusted nodes. We call this proposed system “GNuggies”. It is designed to feel instantly familiar to existing serverless offerings such as AWS Lambda or Azure Cloud Functions. The key difference between GNuggies and existing offerings is that GNuggies proposes leveraging spare compute resources by allowing anyone to contribute nodes into the system. These contributed nodes must be treated as untrusted and this is where the bulk this thesis’s contributions arise:
1. A proposed architecture that adapts well understood Distributed Systems concepts to situations involving untrusted nodes and to run in the absence of central authorities.
2. A proposed system wherein actors choose to contribute spare compute and are effectively incentivized to do so.
3. An incentive structure that makes actors less willing to behave in a malicious manner.
This thesis discusses the methods to be used and evaluates their strengths and weaknesses. It also argues that decentralized serverless is a direction that the internet will potentially benefit from moving towards.Submission original under an indefinite embargo labeled 'Open Access'. The submission was exported from vireo on 2018-08-31 without embargo termsThe student, Harshit Agarwal, accepted the attached license on 2018-04-24 at 16:38.The student, Harshit Agarwal, submitted this Thesis for approval on 2018-04-24 at 16:49.This Thesis was approved for publication on 2018-04-26 at 15:27.DSpace SAF Submission Ingestion Package generated from Vireo submission #12459 on 2018-08-31 at 17:14:34Made available in DSpace on 2018-09-04T20:27:27Z (GMT). No. of bitstreams: 2
AGARWAL-THESIS-2018.pdf: 678848 bytes, checksum: f8b796b8fe43dc2e4b3ea0ab0496db11 (MD5)
LICENSE.txt: 4212 bytes, checksum: dd4f8382e7920d59a530536e67052e70 (MD5)
Previous issue date: 2018-04-2
Stela: on-demand elasticity in distributed data stream processing systems
"Big data is characterized by volume and velocity [24], and recently several real- time stream processing systems have emerged to combat this challenge. These systems process streams of data in real time and computational results. However, current popular data stream processing systems lack the ability to scale out and scale in (i.e., increase or decrease the number of machines or VMs allocated to the application) efficiently and unintrusively when requested by the user on demand. In order to scale out/in, a critical problem that needs to be solved is to determine which operator(s) of the stream processing application need to be given more resources or taken resources away from, in order to maximize the application throughput. We do so by presenting a novel metric called ""Expected Throughput Percentage"" (ETP). ETP takes into account not only congested elements of the stream processing application but also their effect on downstream elements and on the overall application throughput.
Next, we show how our new system, called Stela (STream processing ELAsticity), incorporates ETP in its scheduling strategy. Stela enables scale out and scale in operations on demand, and achieves the twin goals of optimizing post-scaling throughput and minimizing interference to throughput during the scaling out/in. We have integrated the implementation of Stela into Apache Storm [27], a popular data stream processing system.
We conducted experiments on Stela using a set of micro benchmark topologies as well as two topologies from Yahoo! Inc. Our experiment results shows Stela achieves 5% to 120% higher post scale throughput comparing to default Storm scheduler performing scale out operations, and 40% to 500% of throughput improvement comparing to the default scheduler during scale in stage. This work is a joint project with Master student Boyang Peng [1]."Submission original under an indefinite embargo labeled 'Open Access'. The submission was exported from vireo on 2015-09-29 without embargo termsThe student, Le Xu, accepted the attached license on 2015-07-20 at 02:40.The student, Le Xu, submitted this Thesis for approval on 2015-07-20 at 02:46.This Thesis was approved for publication on 2015-07-22 at 11:57.DSpace SAF Submission Ingestion Package generated from Vireo submission #8552 on 2015-09-29 at 13:23:11Made available in DSpace on 2015-09-29T20:38:39Z (GMT). No. of bitstreams: 2
XU-THESIS-2015.pdf: 3302651 bytes, checksum: d2dcb41ad657c24f9038f3d1d57a37d3 (MD5)
LICENSE.txt: 4202 bytes, checksum: 10d3a9b69c0a6092bbb17115d5e2ff09 (MD5)
Previous issue date: 2015-07-2
- …
