Systems for analyzing routing policies and localizing faults in the Internet
Thesis (Ph.D.)--University of Washington, 2021. Our reliance on the Internet continues to grow; however, Internet communication has seen little progress over the years because it typically spans multiple Autonomous Systems (ASes) that are operated by individual Internet Service Providers (ISPs) and organizations. This inherent autonomy of the Internet limits the visibility into other networks and the velocity of change. As a result, public Internet communication has become the weak link for Internet-based services. In this thesis, I design, build, and evaluate practical algorithms and systems that ISPs and cloud providers can use to analyze Internet routing policies and localize faults in the Internet. Knowledge of the business relationships between ASes is essential to understanding the behavior of the Internet routing system. I develop ProbLink, a probabilistic algorithm to infer business relationships between ASes in the Internet. By integrating noisy but useful features, it overcomes the challenges in inferring hard links such as routing violating the valley-free assumption, limited visibility, and non-conventional peering practices. I build three real-world applications on top of ProbLink and show that ProbLink has a significant impact when applied to practical applications compared to the state-of-the-art inference algorithms based on empirical rules. For Internet-based services such as video calls and online games, providing low latency is important. I design and build a system, BlameIt, that automatically localizes the faulty AS when there is latency degradation between clients and clouds. BlameIt employs a hybrid two-phased blame assignment, combining the best parts of passive analysis (low measurement overhead) and active probing (fine-grained fault localization). BlameIt has been in production deployment for 3 years at Microsoft Azure and produces results with high accuracy at low overheads.
Building Efficient Network Protocols for Data Centers using Programmable Switches
Thesis (Ph.D.)--University of Washington, 2019. Historically, computer networks have been designed to have most of the complexity at the end-hosts, while the switches connecting them are simple forwarding pipes that understand a fixed, well-specified set of protocols. This simplifies switching chip design, enabling the chips to operate at high speeds, albeit at the cost of little to no flexibility. On the other hand, recent advances in hardware switch architectures have made it feasible to perform limited flexible packet processing without sacrificing performance. Network operators can configure switches to process custom packet headers to exercise greater control over how packets are processed and routed. However, to be able to operate at line rate, these switches have limited state, limited per-packet computation, and a restricted class of operations, and they support only a fixed set of scheduling primitives. This thesis explores various mechanisms and techniques to overcome these switch limitations and implement efficient network protocols that rely on both flexible computation and packet scheduling inside the network. First, we use approximation techniques to mask limitations on computation and network state, letting us implement rich protocols that perform complex computations inside the network. Next, we propose an approximate scheduling mechanism based on Calendar Queues that lets us implement a wide range of scheduling algorithms to achieve various end-to-end performance objectives. Finally, we implement some of these protocols on real hardware and within a packet-level simulator to demonstrate significant performance improvement over state-of-the-art techniques.
Building Distributed Systems Using Programmable Networks
Thesis (Ph.D.)--University of Washington, 2020. The continuing increase of data center network bandwidth, coupled with a slower improvement in CPU performance, has challenged our conventional wisdom regarding data center networks: how can we build distributed systems that keep up with network speeds and are both high-performing and energy-efficient? The recent emergence of a programmable network fabric (PNF) suggests a potential solution. By offloading suitable computations to a PNF device (i.e., SmartNIC, reconfigurable switch, or network accelerator), one can reduce request serving latency, save end-host CPU cores, and enable efficient traffic control. In this dissertation, we present three frameworks for building PNF-enabled distributed systems: (1) IncBricks, an in-network caching fabric built with network accelerators and programmable switches; (2) iPipe, an actor-based framework for offloading distributed applications on SmartNICs; (3) E3, an energy-efficient microservice execution platform for SmartNIC-accelerated servers. This dissertation presents how to make efficient use of in-network heterogeneous computing resources by employing new programming abstractions, applying approximation techniques, co-designing with end-host software layers, and designing efficient control-/data-planes. Our prototyped systems using commodity PNF hardware not only show the feasibility of such an approach but also demonstrate that it is an indispensable technique for efficient data center computing.
Multi-tenant Machine Learning Model Serving Systems on GPU Clusters
Thesis (Ph.D.)--University of Washington, 2024. In an era where GPUs are both costly and scarce, efficiently serving machine learning models has become a critical challenge. Assuming that serving one model requires some fixed number of GPUs, serving n models would seemingly require n times as many. In the multi-tenant setting, we can pool the whole cluster's GPUs to serve the models collectively, thus requiring far fewer GPUs. This work addresses how to optimize cluster-wide GPU utilization in a multi-tenant setting. Key challenges addressed include: (1) batching efficiency under latency constraints, (2) bursty requests and GPU consolidation, and (3) GPU cluster auto-scaling. This dissertation discusses two projects that address the above research problems. The first project, Symphony, focuses on serving DNN models. With a novel Deferred Batch Scheduling algorithm and a system design supporting it, Symphony makes high-quality batching decisions and enables robust auto-scaling. Symphony achieves 6x goodput given the same number of GPUs, saves 60% of GPUs when serving the same request rate, and is capable of handling 15 million requests per second. The second project, Punica, creates a new paradigm of serving multiple LoRA fine-tuned large language models at the cost of one. Punica improves throughput by 12x without sacrificing latency.
Enhancing Multi-Tenant Disaggregated Storage Systems with H/W Innovations
Thesis (Ph.D.)--University of Washington, 2025. Disaggregated storage architectures have become a foundational element in modern datacenter design, enabling independent scaling of compute and storage resources. However, supporting multi-tenant workloads on shared flash-based storage devices remains challenging due to interference, limited hardware isolation, and host-centric software bottlenecks. This dissertation explores system and interface designs that leverage hardware innovations, specifically SmartNICs and NVMe SSDs, to improve performance, fairness, and adaptability in disaggregated storage systems. The first part introduces Gimbal, a software storage switch that enables multi-tenant-aware scheduling and congestion control on SmartNIC-based storage nodes. Gimbal employs write cost estimation, credit-based flow control, and hierarchical I/O scheduling to isolate tenants and maintain throughput fairness under constrained compute budgets. We then present eZNS, an elastic Zoned Namespace abstraction that enables dynamic resource allocation for ZNS SSDs. eZNS decouples rigid namespace boundaries and allows flexible sharing of zones through global and local overdrive policies. Coordinated I/O planning and proactive space management improve utilization and write efficiency while preserving predictable performance. Lastly, we propose the Interposable Transport Protocol (ITP), a transport-layer abstraction for SmartNIC-based disaggregated storage. ITP enables in-network request redirection, replication, and remote memory access by treating SmartNICs as protocol-aware dataplane processors. The prototype demonstrates that forwarding and RMA operations can complete within single-digit microseconds using ARM cores, achieving near-RDMA performance without dedicated hardware support.
Collectively, these contributions show that co-designing software abstractions with emerging hardware enables disaggregated storage systems to achieve high performance, adaptability, and strong multi-tenant isolation. The proposed designs lay a foundation for scalable, composable storage infrastructure in modern cloud environments.
Structural Insights for LLM Serving Efficiency
Thesis (Ph.D.)--University of Washington, 2025. The widespread adoption of Large Language Models (LLMs) has reshaped the datacenter computing landscape. As these models continue to grow in size and complexity, they require increasingly expensive and power-intensive infrastructure. Hence, serving LLMs efficiently has become critical for managing costs and resource constraints in modern datacenters. In this dissertation, I argue that serving efficiency can be significantly improved by designing systems that are aware of the distinct phases of generative LLM inference: a compute-intensive prefill phase and a memory-intensive decode phase. Exploiting the unique properties of these phases unlocks significant performance gains at scale. My research validates this thesis through three studies. First, I address power constraints, a key bottleneck to datacenter growth. By analyzing how the distinct power demands of prefill and decode phases aggregate, I show that inference cluster power is underutilized. Based on this observation, I develop a power oversubscription framework that safely adds more servers under existing power budgets, increasing inference cluster capacity with minimal performance impact. Second, I show that running the compute-bound prefill and memory-bound decode phases on the same hardware leads to poor performance and resource stranding. To address these overheads, I introduce a new inference cluster architecture that disaggregates the phases onto hardware fleets specialized to better manage resources for each phase. This phase-separated cluster design yields substantial efficiency improvements over traditional approaches. Third, I extensively analyze the unique inefficiencies caused by conditional computation in Mixture-of-Experts (MoE) models, which I formalize as the MoE tax. This tax manifests differently across the two phases; for instance, it creates load imbalance in prefill and increases memory transfers in decode.
Based on this analysis, I propose phase-specific optimizations to address these bottlenecks and improve the efficiency of serving MoE models at scale. Collectively, these studies demonstrate that phase awareness is a key principle for designing efficient generative LLM serving systems.
Characterizing and Improving Web Page Load Times
Thesis (Ph.D.)--University of Washington, 2015. Web page load time (PLT) is a key performance metric that many techniques aim to improve. PLT is much slower than lower-level latencies, but the reason was not well understood. This dissertation first characterizes the Web page load time by abstracting a dependency model between network and computation activities. We have built a tool, WProf, based on this model that identifies the bottlenecks of PLTs of hundreds of Web pages and provides a basis for evaluating PLT-reducing techniques. Next, we evaluate SPDY’s contributions to PLTs and find that SPDY’s impact on PLTs is largely limited by the dependencies and browser computation. This suggests that the page load process should be restructured to remove the dependencies so as to improve PLTs. Thus, we propose SplitBrowser, which preprocesses Web pages on a proxy server and migrates carefully crafted state to the client so as to simplify the client-side page load process. We have shown that SplitBrowser reduces PLTs by more than half under a variety of mobile settings that span less compute power and slower networks.
Optimizing Distributed Systems using Machine Learning
Thesis (Ph.D.)--University of Washington, 2019. Distributed systems consist of many components that interact with each other to perform certain task(s). Traditionally, many of these systems base their decisions on sets of rules or configurations defined by operators as well as handcrafted analytical models. However, creating those rules or engineering such models is a challenging task. First, the same system should be able to work under a combinatorial number of conditions on top of heterogeneous hardware. Second, it should support different types of workloads and run in potentially widely different settings. Third, it should be able to handle time-varying resource needs. These factors render reasoning about distributed systems' performance in general far from trivial. In this thesis, we propose optimizing distributed systems using machine learning (ML). Our main contribution is the design, implementation, augmentation, and evaluation of three distributed systems that illustrate the impact of these ML-based optimizations: 1) Curator, a framework that safeguards distributed storage systems' health and performance by scheduling and executing background maintenance tasks, 2) AdaRes, an adaptive system that dynamically adjusts virtual machine resources in virtual execution environments, and 3) Pulpo, a federated system that efficiently trains machine learning models across different data centers. Each system instantiates appropriate ML models for the task at hand, relieving system designers of manually tuning rules and handcrafting complex analytical models. Our evaluations on real clusters show how our ML formulations result in improved distributed systems' efficiency and performance.
Practical Improvements to User Privacy in Cloud Applications
Thesis (Ph.D.)--University of Washington, 2017-08. As the cloud handles more user data, users need better techniques to protect their privacy from adversaries looking to gain unauthorized access to sensitive data. Today’s cloud services offer weak assurances with respect to user privacy, as most data is processed unencrypted in a centralized location by systems with a large trusted computing base. While current architectures enable application development speed, this comes at the cost of susceptibility to large-scale data breaches. In this thesis, I argue that we can make significant improvements to user privacy from both external attackers and insider threats. In the first part of the thesis, I develop the Radiatus architecture for securing fully-featured cloud applications from external attacks. Radiatus secures private data stored by web applications by isolating server-side code execution into per-user sandboxes, limiting the scope of successful attacks. In the second part of the thesis, I focus on a simpler messaging application, Talek, securing it from both external and insider threats. Talek is a group private messaging system that hides both message contents and communication patterns from an adversary in partial control of the cloud. Both of these systems are designed to provide better security and privacy guarantees for users under realistic threat models, while offering practical performance and development costs. This thesis presents an implementation and evaluation of both systems, showing that improved user privacy can come at acceptable costs.
