Organizational Unit:
School of Computer Science

Publication Search Results

Now showing 1 - 10 of 12
  • Item
    Frame, rods and beads of the edge computing abacus
    (Georgia Institute of Technology, 2016-11-15) Bhardwaj, Ketan
    Emerging applications enabled by powerful end-user devices and 5G technologies pose demands for reduced access latencies to web services and for a dramatic increase in back-haul network capacity. In response, edge computing---the use of computational resources closer to end devices, at the edge of the network---is becoming an attractive approach to addressing these demands. Going beyond point solutions, the vision of edge computing is to enable web services to deploy their edge functions (EFs) in a multi-tenant infrastructure present at the edge of mobile networks. However, critical technical challenges must be addressed to make that vision possible. This dissertation addresses three of them:
    1. Demonstrating the benefits of edge functions for the real-world, highly dynamic, and large-scale Android app ecosystem: (i) AppFlux and AppSachets, which relieve the bandwidth pressure of existing app delivery mechanisms, and (ii) ephemeral apps and app slices, which rethink app delivery for emerging usage models. Together these show that edge computing can enable transformational changes in the computing landscape beyond latency and bandwidth optimizations.
    2. The design and implementation of AirBox, a secure, lightweight, and flexible edge function platform that web services need in order to deploy and manage their EFs on edge computing nodes on demand. AirBox is based on a detailed experimental design-space exploration of system-level mechanisms suitable for an edge function platform, addressing the technical challenges associated with provisioning, management, and EF security. AirBox leverages state-of-the-art hardware-assisted, OS-agnostic security features, such as Intel SGX, to prescribe a reference design for a secure EF.
    3. Finally, a solution to the most critical issue: enabling edge functions while preserving end-to-end security guarantees. Today, when most web services are delivered over encrypted traffic, edge functions cannot provide meaningful functionality without compromising security or forgoing their performance benefits. Secure protocol extensions (SPX) efficiently maintain the proposed End-to-Edge-to-End (E3) security semantics. Using SPX, we accomplish the seemingly impossible task of allowing edge functions to operate, with modest overheads, on encrypted traffic transmitted over secure protocols, while preserving their security semantics and continuing to provide the benefits of edge computing.
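
    To make the EF deployment-and-dispatch idea concrete, here is a minimal Python sketch under assumptions of our own: a web service provisions an edge function on edge nodes on demand, and a request is routed to the lowest-latency node hosting that EF, falling back to the origin service otherwise. The class names, latency model, and handlers are illustrative, not AirBox's actual interfaces.

    # Hypothetical sketch of on-demand edge-function (EF) dispatch; all
    # names and the latency model are invented for illustration.

    class EdgeNode:
        def __init__(self, name, latency_ms):
            self.name = name
            self.latency_ms = latency_ms   # measured RTT from the client
            self.functions = {}            # EF name -> callable

        def deploy(self, ef_name, handler):
            """Provision an edge function on this node on demand."""
            self.functions[ef_name] = handler

    def dispatch(nodes, ef_name, request, origin_handler):
        """Route to the nearest node hosting the EF, else the origin."""
        candidates = [n for n in nodes if ef_name in n.functions]
        if not candidates:
            return origin_handler(request)     # fall back to the web service
        best = min(candidates, key=lambda n: n.latency_ms)
        return best.functions[ef_name](request)

    edge = EdgeNode("edge-1", latency_ms=5)
    edge.deploy("thumbnail", lambda req: f"thumb({req})")
    print(dispatch([edge], "thumbnail", "cat.png",
                   origin_handler=lambda req: f"origin({req})"))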
  • Item
    Elf: Efficient lightweight fast stream processing at scale
    (Georgia Institute of Technology, 2016-05-10) Hu, Liting
    Large Internet companies like Facebook, Amazon, and Twitter increasingly recognize the value of stream data processing, using tools like Flume, Muppet, or Storm to continuously collect and process incoming data in real time to help govern company activities. Applications include monitoring marketing streams for business-critical decisions, identifying spam campaigns in social network streams, datacenter intrusion detection and troubleshooting, and others. Technical challenges for stream processing include the following: how to scale to numerous, concurrently running streaming jobs; how to coordinate across those jobs to share insights; how to make online changes to job functions to adapt to new requirements or data characteristics; and, for each job, how to operate efficiently over different time windows. This dissertation presents a new stream processing model, termed ELF, which addresses these challenges. ELF proposes a novel decentralized "many masters, many workers" architecture implemented over a set of agents enriching the web tier of datacenter systems. ELF uses a DHT protocol to assign each job its respective set of masters and workers, mapped onto the agents of the web tier. For each job, the live data streams generated by web servers are first divided into mini-batches, then inserted and aggregated as space-efficient compressed buffer trees (CBTs) in local agents' memories. Second, per-batch results are 'flushed' from CBTs, to be rolled up and aggregated via shared reducer trees (SRTs), in ways that naturally balance SRT-induced load, reduce processing latencies, and allow online job changes along with cross-job coordination. An ELF prototype implemented and evaluated in a large-scale configuration demonstrates scalability, high per-node throughput, sub-second job latency, and sub-second ability to adjust the actions of running jobs.
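
    A minimal sketch of the DHT-style placement step, assuming a plain consistent-hash ring (ELF's actual protocol and master/worker assignment are more involved): each job id hashes to a point on a ring of web-tier agents, and the first agent clockwise from that point becomes responsible for the job.

    # Simplified consistent-hash ring for job-to-agent placement;
    # the vnode count and hash choice are assumptions, not ELF's.
    import bisect, hashlib

    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, agents, vnodes=64):
            self._ring = sorted((_h(f"{a}#{i}"), a)
                                for a in agents for i in range(vnodes))
            self._keys = [k for k, _ in self._ring]

        def agent_for(self, job_id):
            """Map a job to the first agent clockwise from its hash."""
            i = bisect.bisect(self._keys, _h(job_id)) % len(self._ring)
            return self._ring[i][1]

    ring = Ring([f"agent-{n}" for n in range(8)])
    for job in ("spam-detect", "trend-topics", "intrusion"):
        print(job, "->", ring.agent_for(job))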
  • Item
    Runtime specialization for heterogeneous CPU-GPU platforms
    (Georgia Institute of Technology, 2015-12-03) Farooqui, Naila
    Heterogeneous parallel architectures like those composed of CPUs and GPUs are a tantalizing compute fabric for performance-hungry developers. While these platforms enable order-of-magnitude performance increases for many data-parallel application domains, several challenges remain open: (i) the distinct execution models inherent in the heterogeneous devices present on such platforms drive the need to dynamically match workload characteristics to the underlying resources, (ii) the complex architecture and programming models of such systems require substantial application knowledge and effort-intensive program tuning to achieve high performance, and (iii) as such platforms become prevalent, there is a need to extend their utility from running known, regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such hybrid CPU-GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Toward this end, this research: (a) enables dynamic instrumentation for GPU-based parallel architectures, specifically targeting the complex Single-Instruction Multiple-Data (SIMD) execution model, to gain real-time introspection into application behavior; (b) leverages such dynamic performance data to support novel online resource management methods that improve application performance and system throughput, particularly for irregular, input-dependent applications; (c) automates some of the programmer effort required to exercise specialized architectural features of such platforms via instrumentation-driven dynamic code optimizations; and (d) proposes a specialized, affinity-aware work-stealing scheduling runtime for integrated CPU-GPU processors that efficiently distributes work across all CPU and GPU cores for improved load balance, taking into account both application characteristics and architectural differences of the underlying devices.
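
    The affinity-aware work-stealing idea in (d) can be pictured with a toy Python model of our own devising: each device drains its own affine queue first and steals from the other device's tail only when idle. A real scheduler would weigh architectural differences and profiling data; this sketch shows only the queue discipline.

    # Toy affinity-aware work stealing between a CPU and a GPU worker.
    from collections import deque

    queues = {"cpu": deque(), "gpu": deque()}

    def submit(task, affinity):
        queues[affinity].append(task)

    def next_task(device):
        """Prefer local (affine) work; otherwise steal from the other."""
        if queues[device]:
            return queues[device].popleft()      # affine task
        other = "gpu" if device == "cpu" else "cpu"
        if queues[other]:
            return queues[other].pop()           # steal from the tail
        return None

    submit("regular-kernel", "gpu")
    submit("branchy-irregular", "cpu")
    print(next_task("gpu"), next_task("gpu"))    # own task, then a steal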
  • Item
    Scalable and robust compute capacity multiplexing in virtualized datacenters
    (Georgia Institute of Technology, 2014-05-16) Kesavan, Mukil
    Multi-tenant cloud computing datacenters run diverse workloads, inside virtual machines (VMs), with time-varying resource demands. Compute capacity multiplexing systems dynamically manage the placement of VMs on physical machines to ensure that their resource demands are always met while simultaneously optimizing the total datacenter compute capacity in use. In essence, they give the cloud its fundamental property of being able to dynamically expand and contract the resources required, on demand. At large scales, though, designers of compute capacity multiplexing systems must deal with two practical realities: (a) maintaining low operational overhead given the variable cost of the management operations necessary to allocate and multiplex resources, and (b) the prevalence of a large number and wide variety of faults, in hardware, in software, and due to human error, that impair multiplexing efficiency. In this thesis we propound the notion that explicitly designing the methods and abstractions used in capacity multiplexing systems for this reality is critical to better achieving administrator and customer goals at large scales. To this end, the thesis makes the following contributions: (i) CCM, a hierarchically organized compute capacity multiplexer demonstrating that simple designs can be highly effective at multiplexing capacity with low overheads at large scales, compared to complex alternatives; (ii) Xerxes, a distributed load generation framework for flexibly and reliably benchmarking compute capacity allocation and multiplexing systems; and (iii) a speculative virtualized infrastructure management stack that dynamically replicates management operations on virtualized entities, together with a compute capacity multiplexer for this environment, which jointly provide fault-scalable management performance for a broad class of commonly occurring faults in large-scale datacenters. Our systems have been implemented in an industry-strength cloud infrastructure built on top of the VMware vSphere virtualization platform and the popular open-source OpenStack cloud computing platform, running the ESXi and Xen hypervisors, respectively. Our experiments were conducted in a 700-server datacenter using the Xerxes benchmark, replaying trace data from production clusters, simulating parameterized scenarios like flash crowds, and also using a suite of representative cloud applications. Results from these scenarios demonstrate the effectiveness of our design techniques in real-life large-scale environments.
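
    As a rough illustration of what a capacity multiplexer optimizes, the following first-fit-decreasing sketch packs VMs onto as few hosts as their demands allow, so idle hosts can be reclaimed; CCM's hierarchical design is considerably more sophisticated, so treat this purely as a baseline intuition.

    # First-fit-decreasing VM packing; demands and capacity are invented.

    def place(vms, host_capacity):
        """vms: {name: demand}; returns hosts as lists of (vm, demand)."""
        hosts = []
        for vm, demand in sorted(vms.items(), key=lambda kv: -kv[1]):
            for host in hosts:
                if sum(d for _, d in host) + demand <= host_capacity:
                    host.append((vm, demand))
                    break
            else:
                hosts.append([(vm, demand)])     # open a new host
        return hosts

    demands = {"web": 30, "db": 55, "cache": 20, "batch": 45}
    for i, host in enumerate(place(demands, host_capacity=100)):
        print(f"host-{i}:", host)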
  • Item
    Middleware for online scientific data analytics at extreme scale
    (Georgia Institute of Technology, 2014-03-25) Zheng, Fang
    Scientific simulations running on High End Computing machines in domains like Fusion, Astrophysics, and Combustion now routinely generate terabytes of data in a single run, and these data volumes are only expected to increase. Since such massive simulation outputs are key to scientific discovery, the ability to rapidly store, move, analyze, and visualize data is critical to scientists' productivity. Yet there are already serious I/O bottlenecks on current supercomputers, and the movement toward the Exascale is further accelerating this trend. This dissertation is concerned with the design, implementation, and evaluation of middleware-level solutions that enable high-performance, resource-efficient online data analytics for processing massive simulation output data at large scales. Online data analytics can effectively overcome the I/O bottleneck for scientific applications at large scales by processing data as it moves through the I/O path. Online analytics can extract valuable insights from live simulation output in a timely manner, better prepare data for subsequent deep analysis and visualization, and achieve improved performance and reduced data movement cost (both in time and in power) compared to the conventional post-processing paradigm. The thesis identifies the key challenges for online data analytics based on the needs of a variety of large-scale scientific applications, and proposes a set of novel and effective approaches to efficiently program, distribute, and schedule online data analytics along the critical I/O path. In particular, its solution approach i) provides a high-performance data movement substrate to support parallel and complex data exchanges between simulations and online data analytics, ii) enables placement flexibility so analytics can exploit distributed resources, iii) for co-placement of analytics with simulation codes on the same nodes, uses fine-grained scheduling to harvest idle resources for running online analytics with minimal interference to the simulation, and finally, iv) supports scalable, efficient online spatial indices to accelerate data analytics and visualization on the deep memory hierarchies of high-end machines. Our middleware approach is evaluated with leadership scientific applications in domains like Fusion, Combustion, and Molecular Dynamics, and on different High End Computing platforms. Substantial improvements are demonstrated in end-to-end application performance and in resource efficiency at scales of up to 16384 cores, for a broad range of analytics and visualization codes. The outcome is a useful and effective software platform for online scientific data analytics that facilitates large-scale scientific data exploration.
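
    The in-transit processing idea can be pictured as a pipeline in which each simulation output chunk passes through analysis stages on its way to storage. The stages below (a min/max reduction and a toy index step) are invented for illustration and stand in for real analytics and visualization codes.

    # Toy online-analytics pipeline: only reduced data reaches storage.

    def simulation_steps():
        """Stand-in for a simulation emitting one output chunk per step."""
        for step in range(3):
            yield [step * 1.5 + i for i in range(4)]

    def reduce_stage(chunks):
        for chunk in chunks:
            yield {"min": min(chunk), "max": max(chunk), "n": len(chunk)}

    def index_stage(summaries):
        for i, s in enumerate(summaries):
            s["chunk_id"] = i        # e.g., feed a spatial/temporal index
            yield s

    for record in index_stage(reduce_stage(simulation_steps())):
        print("stored:", record)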
  • Item
    Virtual platforms: achieving performance and isolation properties on shared multicore servers
    (Georgia Institute of Technology, 2013-11-19) Tembey, Priyanka
    Multicore servers in datacenter systems routinely run multiple disparate application workload mixes. Analysis performed in Google's datacenters shows, for instance, that components (i.e., processes) of up to 19 distinct applications can be co-deployed on a single multicore node. Virtualization technology further encourages this trend, increasing platform utilization via higher levels of workload consolidation. Systems software on these shared server nodes must meet challenges that include (a) providing end-to-end performance guarantees for possibly multiple applications while delivering global platform-level properties such as platform-level power or utilization caps; (b) mediating the use of shared resources efficiently while offering isolation guarantees that let the multiple applications running on consolidated platforms maintain their performance properties predictably; and (c) meeting multiple dynamic, competing application performance levels and platform-level properties efficiently, especially in oversubscribed systems. This thesis addresses (a)-(c) as follows: (1) by developing system-level mechanisms for each of these challenges, (2) by demonstrating their ability to deliver improved application performance with less variability and improved platform efficiency, and (3) by creating principles and representative methods for realizing the isolation properties sought by applications and the efficiency sought for platforms. The concrete realization of these goals is a Virtual Platforms (VP)-enabled hypervisor, in which per-application or platform-level policy objectives are expressed at the system level via elastic resource abstractions that may also change dynamically at runtime. For multiple consolidated applications (and their virtual platforms), methods monitor and mediate their use of shared platform resources to deliver improved isolation for predictable performance, while Merlin, a resource allocator for shared multicore servers, makes it easier to implement higher-level arbitration policies while meeting multiple performance and platform properties. As single-node multicore platforms evolve from small numbers of homogeneous cores toward multiple sets, or islands, of potentially heterogeneous cores residing on a single chip, such platforms will have multiple resource managers, each managing its respective island of resources. Though this organization is geared toward improved scalability and functionality, for applications spanning multiple diverse resource islands to realize such opportunities, systems software must make it easier for them to interact with the island managers, and must help island-based systems achieve end-to-end performance properties via joint coordination among those managers. To meet the challenge of maintaining performance objectives on future 'scale-out' platforms, this thesis contributes inTune, a framework for inter-island operation offering APIs and mechanisms that permit applications (and their virtual platforms) to interface with resource islands and their resource managers to jointly achieve application performance guarantees and global platform-level properties. This thesis focuses on the management of the compute, physical memory, and memory bandwidth resources of single-node server platforms; however, the methods presented in this work can be extended to other resource types, including network and storage resources.
    inTune and Virtual Platforms are implemented in the Xen hypervisor for x86 multicore platforms with multiple NUMA memory nodes. Evaluation with representative parallel, web-based, and real-time applications and application mixes demonstrates the benefits of using our methods to achieve application performance and platform policy objectives.
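
    As one way to picture how a virtual platform's resource shares might be arbitrated, here is a water-filling proportional-share sketch of our own; it is not Merlin's actual algorithm. Each VP declares a weight and a cap on a single resource, and share left unused by capped VPs spills over to the others.

    # Proportional-share allocation with per-VP caps (water filling).

    def allocate(total, vps):
        """vps: {name: (weight, cap)} -> {name: grant}."""
        grants = {name: 0.0 for name in vps}
        active = set(vps)
        while active and total > 1e-9:
            wsum = sum(vps[n][0] for n in active)
            spare = 0.0
            for n in list(active):
                share = total * vps[n][0] / wsum
                room = vps[n][1] - grants[n]
                take = min(share, room)
                grants[n] += take
                spare += share - take
                if room - take < 1e-9:
                    active.discard(n)    # capped VPs stop competing
            total = spare
        return grants

    print(allocate(100, {"vp-web": (2, 40), "vp-db": (1, 80),
                         "vp-batch": (1, 80)}))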
  • Item
    System abstractions for resource scaling on heterogeneous platforms
    (Georgia Institute of Technology, 2013-11-18) Gupta, Vishal
    The increasingly diverse nature of modern applications makes it critical for future systems to have dynamic resource scaling capabilities that enable them to adapt their resource usage to meet user requirements. Such mechanisms should be fine-grained in nature, for resource-efficient operation, and should also provide a wide scaling range to support a variety of applications with diverse needs. To this end, heterogeneous platforms, consisting of components with varying characteristics, have been proposed to provide improved performance and efficiency over homogeneous configurations by making it possible to execute applications on the most suitable component. However, the introduction of such heterogeneous architectural components requires system software to embrace the complexity associated with heterogeneity in order to manage them efficiently. Diversity across vendors and rapidly changing hardware make it difficult to incorporate heterogeneity-aware resource management mechanisms into mainstream systems, hindering the widespread adoption of these platforms. Addressing these issues, this dissertation presents novel abstractions and mechanisms for heterogeneous platforms that decouple heterogeneity from management operations by masking the differences due to heterogeneity from applications. Exporting a homogeneous interface over heterogeneous components, it proposes the scalable 'resource state' abstraction, which allows applications to express their resource requirements, which are then dynamically and transparently mapped to the heterogeneous resources underneath. The proposed approach is explored both for modern mobile devices, where power is a key resource, and for cloud computing environments, where platform resource usage has monetary implications, resulting in the HeteroMates and HeteroVisor solutions. In addition, the dissertation highlights the need for hardware and system software to consider multiple resources together to obtain desirable gains from such scaling mechanisms. The solutions presented here open ways to utilize future heterogeneous platforms for on-demand performance as well as resource-efficient operation, without disrupting the existing software stack.
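
    A small sketch of the 'resource state' mapping, under invented numbers: the application requests an abstract scaling level, and a runtime picks the cheapest heterogeneous configuration (core type and frequency here) whose capacity meets it. Real mappings in HeteroMates or HeteroVisor would be driven by measured characteristics, not this table.

    # (capacity score, power W) per configuration; values are made up,
    # loosely modeled on a big.LITTLE-style system.
    CONFIGS = {
        ("little", 0.8): (1.0, 0.3),
        ("little", 1.4): (1.6, 0.7),
        ("big",    1.2): (2.5, 1.5),
        ("big",    2.0): (4.0, 3.2),
    }

    def map_state(level, max_level=10):
        """Pick the cheapest configuration meeting the requested level."""
        top = max(cap for cap, _ in CONFIGS.values())
        need = top * level / max_level
        ok = [(cfg, pw) for cfg, (cap, pw) in CONFIGS.items() if cap >= need]
        return min(ok, key=lambda item: item[1])[0]   # minimize power

    print(map_state(3))   # light load -> a LITTLE core suffices
    print(map_state(9))   # heavy load -> a fast big core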
  • Item
    Memory region: a system abstraction for managing the complex memory structures of multicore platforms
    (Georgia Institute of Technology, 2013-11-18) Lee, Min
    The performance of modern many-core systems depends on the effective use of their complex cache and memory structures, and this will likely become more pronounced with the impending arrival of on-chip 3D-stacked memory and non-volatile, byte-addressable off-chip memory. Yet to date, operating systems have not treated memory as a first-class schedulable resource, nor have they embraced memory heterogeneity. This dissertation presents a new software abstraction, called the 'memory region', which denotes the current set of physical memory pages actively used by a workload. Using this abstraction, memory resources can be scheduled for applications to fully exploit a platform's underlying cache and memory system, thereby gaining improved performance and predictability in execution, particularly for the consolidated workloads seen in virtualized and cloud computing infrastructures. The abstraction's implementation in the Xen hypervisor involves the run-time detection of memory regions, the scheduled mapping of these regions to caches to match performance goals, and the maintenance of region-to-cache mappings using per-cache page tables. This dissertation makes the following specific contributions. First, it proposes a new scheduling method, region scheduling, in which the location of memory blocks, rather than CPU utilization, is the principal determinant of where workloads are run. Second, treating memory blocks as first-class resources, new methods for efficient cache management are shown to improve application performance as well as the performance of certain operating system functions. Third, explicit memory scheduling makes it possible to disaggregate operating systems without changing OS sources, requiring only small markups of target guest OS functionality. With this method, OS functions can be mapped to specific desired platform components, such as a file system confined to running on specific cores and using only certain memory resources designated for its use. This can improve performance for applications heavily dependent on certain OS functions, by dynamically providing those functions with the resources needed for their current use, and it can prevent performance-critical application functionality from being needlessly perturbed by OS functions used for other purposes or by other jobs. Fourth, extensions of region scheduling can also help applications deal with the heterogeneous memory resources present in future systems, including on-chip stacked DRAM, NUMA, and even NVRAM memory modules. More generally, region scheduling is shown to apply to memory structures with well-defined differences in memory access latencies.
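
    One concrete mechanism behind region-to-cache mapping is page coloring, which the sketch below illustrates with made-up sizes: a physical page's color identifies the group of cache sets its address falls into, so restricting a region's pages to certain colors confines the region to a cache partition. The dissertation's per-cache page tables and run-time region detection are beyond this toy.

    # Page-coloring toy; real platforms derive the color count from
    # cache size, associativity, and page size.
    NUM_COLORS = 8   # assume cache_size / (ways * page_size) = 8 colors

    def color_of(pfn):
        """Page frame number -> cache color."""
        return pfn % NUM_COLORS

    def build_region(hot_pfns, allowed_colors):
        """Keep only pages whose color lies in the region's partition."""
        return {pfn for pfn in hot_pfns if color_of(pfn) in allowed_colors}

    hot = {1024, 1025, 1030, 1031, 2048, 4097}
    region = build_region(hot, allowed_colors={0, 1})
    print(sorted(region))   # pages mapping into the partition's cache sets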
  • Item
    Storage and aggregation for fast analytics systems
    (Georgia Institute of Technology, 2013-11-18) Amur, Hrishikesh
    Computing in the last decade has been characterized by the rise of data-intensive scalable computing (DISC) systems. In particular, recent years have witnessed a rapid growth in the popularity of fast analytics systems. These systems exemplify a trend where queries that previously involved batch processing (e.g., running a MapReduce job) on a massive amount of data are increasingly expected to be answered in near real time with low latency. This dissertation addresses the problem that existing designs for various components used in the software stack for DISC systems do not meet the requirements demanded by fast analytics applications. In this work, we focus specifically on two components:
    1. Key-value storage: Recent work has focused primarily on supporting reads with high throughput and low latency. However, fast analytics applications require that new data entering the system (e.g., newly crawled web pages, currently trending topics) be quickly made available to queries and analysis codes. This means that along with supporting reads efficiently, these systems must also support writes with high throughput, which current systems fail to do. In the first part of this work, we solve this problem by proposing a new key-value storage system, called the WriteBuffer (WB) Tree, that provides up to 30× higher write performance and similar read performance compared to current high-performance systems.
    2. GroupBy-Aggregate: Fast analytics systems require support for fast, incremental aggregation of data with low-latency access to results. Existing techniques are memory-inefficient and do not support incremental aggregation efficiently when aggregate data overflows to disk. In the second part of this dissertation, we propose a new data structure called the Compressed Buffer Tree (CBT) to implement memory-efficient in-memory aggregation. We also show how the WB Tree can be modified to support efficient disk-based aggregation.
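
    A single-node toy version of the CBT's buffering idea, with assumed parameters: updates are compressed into an in-memory buffer and only decompressed and merged into the running aggregate when the buffer fills, trading CPU for memory efficiency. The real CBT maintains such buffers at every node of a tree.

    # Toy compressed buffer for incremental aggregation (key -> sum).
    import json, zlib

    class CompressedBuffer:
        def __init__(self, flush_at=4):
            self.chunks, self.count, self.flush_at = [], 0, flush_at
            self.aggregate = {}                    # key -> running sum

        def insert(self, key, value):
            payload = json.dumps([key, value]).encode()
            self.chunks.append(zlib.compress(payload))
            self.count += 1
            if self.count >= self.flush_at:
                self.flush()

        def flush(self):
            for c in self.chunks:
                k, v = json.loads(zlib.decompress(c))
                self.aggregate[k] = self.aggregate.get(k, 0) + v
            self.chunks, self.count = [], 0

    cbt = CompressedBuffer()
    for k, v in [("a", 1), ("b", 2), ("a", 3), ("b", 4)]:
        cbt.insert(k, v)
    print(cbt.aggregate)    # {'a': 4, 'b': 6}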
  • Item
    Monitoring and analysis system for performance troubleshooting in data centers
    (Georgia Institute of Technology, 2013-11-18) Wang, Chengwei
    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM, with an mistaken deletion of the state data of Amazon Elastic Load Balancing Service (ELB for short), which was not realized at that time. The mistake first led to a local issue that a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one that EC2 customers were significantly affected. One example was that Netflix, which was using hundreds of Amazon ELB services, was experiencing an extensive streaming service outage when many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought the performance troubleshooting in data centers to world’s attention. As shown in this Amazon ELB case.Troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application to system/platform level metrics, and a variety of representative analytics functions. When supporting algorithms with high computation complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average.