School of Computer Science Technical Report Series

Series Type: Publication Series

Publication Search Results

Now showing 1 - 7 of 7
  • Item
    AxBench: A Benchmark Suite for Approximate Computing Across the System Stack
    (Georgia Institute of Technology, 2016) Yazdanbakhsh, Amir ; Mahajan, Divya ; Lotfi-Kamran, Pejman ; Esmaeilzadeh, Hadi
    As the end of Dennard scaling looms, both the semiconductor industry and the research community are exploring innovative solutions that allow energy efficiency and performance to continue to scale. Approximate computing has become one of the viable techniques for perpetuating the historical improvements in the computing landscape. As approximate computing attracts more attention in the community, a general, diverse, and representative set of benchmarks for evaluating different approximation techniques becomes necessary. In this paper, we develop and introduce AxBench, a general, diverse, and representative multi-framework set of benchmarks for CPUs, GPUs, and hardware design, comprising 29 benchmarks in total. We judiciously select and develop each benchmark to cover a diverse set of domains such as machine learning, scientific computation, signal processing, image processing, robotics, and compression. AxBench comes with the necessary annotations to mark the approximable region of code and an application-specific quality metric to assess the output quality of each application. With this set of annotations, AxBench facilitates the evaluation of different approximation techniques. To demonstrate its effectiveness, we evaluate three previously proposed approximation techniques using AxBench benchmarks: loop perforation [1] and neural processing units (NPUs) [2–4] on CPUs and GPUs, and Axilog [5] on dedicated hardware. We find that (1) NPUs offer higher performance and energy efficiency than loop perforation on both CPUs and GPUs, (2) while NPUs provide considerable efficiency gains on CPUs, significant opportunity remains to be explored by other approximation techniques, (3) unlike on CPUs, NPUs offer the full benefits of approximate computation on GPUs, and (4) considerable opportunity remains for innovative approximate computation techniques at the hardware level even after applying Axilog.
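Loop perforation, one of the techniques evaluated above, simply skips a fraction of loop iterations and extrapolates from the work that was done. A minimal sketch (the function, data, and quality metric here are illustrative, not taken from AxBench):

```python
import math

def sum_distances(points):
    # Precise version: visit every element.
    return sum(math.hypot(x, y) for x, y in points)

def sum_distances_perforated(points, skip_factor=2):
    # Perforated version: execute only every skip_factor-th iteration,
    # then rescale the partial sum to estimate the full result.
    partial = sum(math.hypot(x, y) for x, y in points[::skip_factor])
    return partial * skip_factor

points = [(0.1 * i, 0.2 * i) for i in range(1000)]
exact = sum_distances(points)
approx = sum_distances_perforated(points, skip_factor=4)
# An application-specific quality metric, as AxBench provides per benchmark:
error = abs(exact - approx) / exact
```

The skip factor is the knob: larger values do less work but raise the relative error, which the per-application quality metric then quantifies.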
  • Item
    Neural Acceleration for GPU Throughput Processors
    (Georgia Institute of Technology, 2015) Yazdanbakhsh, Amir ; Park, Jongse ; Sharma, Hardik ; Lotfi-Kamran, Pejman ; Esmaeilzadeh, Hadi
    General-purpose computing on graphics processing units (GPGPU) accelerates the execution of diverse classes of applications, such as recognition, gaming, data analytics, weather prediction, and multimedia. Many of these applications are amenable to approximate execution. This application characteristic provides an opportunity to improve the performance and efficiency of GPGPU. Recent work has shown significant gains with neural approximate acceleration for CPU workloads. This work studies the effectiveness of neural approximate acceleration for GPU workloads. As applying CPU neural accelerators to GPUs leads to high area overhead, we define a low-overhead neurally accelerated architecture for GPGPUs that enables scalable integration of neural acceleration on the large number of GPU cores. We also devise a mechanism that controls the tradeoff between the quality of results and the benefits from neural acceleration. We evaluate this design on a modern GPU architecture using a diverse set of benchmarks. Compared to the baseline GPGPU architecture, the cycle-accurate simulation results show 2.4× average speedup and 2.8× average energy reduction with 10% quality loss across all benchmarks. The quality control mechanism retains 1.9× average speedup and 2.1× energy reduction while reducing the quality degradation to 2.5%. These benefits are achieved with approximately 1.2% area overhead.
  • Item
    RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads
    (Georgia Institute of Technology, 2015) Yazdanbakhsh, Amir ; Pekhimenko, Gennady ; Thwaites, Bradley ; Esmaeilzadeh, Hadi ; Kim, Taesoo ; Mutlu, Onur ; Mowry, Todd C.
    This paper aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique, called Rollback-Free Value Prediction (RFVP). When certain safe-to-approximate load operations miss in the cache, RFVP predicts the requested values. However, RFVP never checks for or recovers from load value mispredictions, hence avoiding the high cost of pipeline flushes and re-executions. RFVP mitigates the memory wall by enabling the execution to continue without stalling for long-latency memory accesses. To mitigate the bandwidth wall, RFVP drops some fraction of load requests that miss in the cache after predicting their values. Dropping requests reduces memory bandwidth contention by removing them from the system. The drop rate then becomes a knob to control the tradeoff between performance/energy efficiency and output quality. For a diverse set of applications from Rodinia, Mars, and NVIDIA SDK, employing RFVP with a 14KB predictor per streaming multiprocessor (SM) in a modern GPU delivers, on average, 40% speedup and 31% energy reduction, with average 8.8% quality loss. With 10% loss in quality, the benefits reach a maximum of 2.4x speedup and 2.0x energy reduction. As an extension, we also evaluate RFVP’s latency benefits for a single-core CPU. For a subset of the SPEC CFP 2000/2006 benchmarks that are amenable to safe approximation, RFVP achieves, on average, 8% speedup and 6% energy reduction, with 0.9% average quality loss.
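The core RFVP idea (predict a missed load's value, never check or recover, and drop a fraction of the missed requests) can be sketched in simplified form. This class, its last-value prediction table, and the dictionary-based cache and memory are illustrative simplifications, not the paper's actual GPU predictor design:

```python
import random

class RollbackFreePredictor:
    """Simplified sketch of rollback-free value prediction: on a cache
    miss for a safe-to-approximate load, return a predicted value
    immediately instead of stalling, and drop a fraction of the missed
    requests so they never consume memory bandwidth."""

    def __init__(self, drop_rate=0.5, seed=0):
        self.last_value = {}        # toy last-value predictor table
        self.drop_rate = drop_rate  # knob: quality vs. bandwidth/energy
        self.dropped = 0
        self.rng = random.Random(seed)

    def load(self, addr, cache, memory):
        if addr in cache:                      # hit: precise value
            value = cache[addr]
        else:                                  # miss: predict, never recover
            value = self.last_value.get(addr, 0)
            if self.rng.random() < self.drop_rate:
                self.dropped += 1              # drop request entirely
            else:
                cache[addr] = memory[addr]     # fetch fills the cache later
        self.last_value[addr] = value
        return value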
  • Item
    TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning
    (Georgia Institute of Technology, 2015) Mahajan, Divya ; Park, Jongse ; Amaro, Emmanuel ; Sharma, Hardik ; Yazdanbakhsh, Amir ; Kim, Joon ; Esmaeilzadeh, Hadi
    A growing number of commercial and enterprise systems increasingly rely on compute-intensive machine learning algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. To accommodate the needs of machine learning algorithms, Field Programmable Gate Arrays (FPGAs) provide a promising path forward and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long design cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for machine learning algorithms, we develop TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as stochastic optimization problems. Therefore, a learning task becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function. The gradient solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework for accelerating this class of learning algorithms. With TABLA, the developer uses a high-level language to specify only the learning model as the gradient of the objective function. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization. We use TABLA to generate accelerators for ten different learning tasks that are implemented on a Xilinx Zynq FPGA platform. We rigorously compare the benefits of the FPGA acceleration to both multicore CPUs (ARM Cortex A15 and Xeon E3) and to many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 15.0x and 2.9x average speedup over the ARM and the Xeon processors, respectively. These accelerators provide 22.7x, 53.7x, and 30.6x higher performance-per-Watt compared to Tegra, GTX 650, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
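TABLA's division of labor, a fixed stochastic-gradient solver plus a developer-specified gradient of the objective, can be sketched in software. All names below are hypothetical, and TABLA itself generates FPGA hardware from such a specification rather than running it as Python:

```python
def sgd_solver(gradient, w0, data, lr=0.1, epochs=50):
    # The fixed part: the generic stochastic gradient descent solver,
    # which TABLA hardens into its accelerator template.
    w = list(w0)
    for _ in range(epochs):
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# The only part the developer specifies: the gradient of the objective
# function. Here, least-squares linear regression as a toy example.
def linreg_gradient(w, x, y):
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [(pred - y) * xi for xi in x]

# Usage: fit a one-weight model to data following y = 2x.
data = [([0.5 * i], float(i)) for i in range(4)]
w = sgd_solver(linreg_gradient, [0.0], data)   # learned w[0] converges to 2
```

Swapping in a different gradient function (logistic regression, SVM, and so on) changes the learning task without touching the solver, which is the commonality the template exploits.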
  • Item
    ExpAX: A Framework for Automating Approximate Programming
    (Georgia Institute of Technology, 2014) Park, Jongse ; Zhang, Xin ; Ni, Kangqi ; Esmaeilzadeh, Hadi ; Naik, Mayur
    We present ExpAX, a framework for automating approximate programming. ExpAX consists of three components: (1) a programming model based on a new kind of program specification, which we refer to as error expectations. Our programming model enables programmers to implicitly relax the accuracy constraints without explicitly marking operations as approximate; (2) an approximation safety analysis that automatically infers a safe-to-approximate set of program operations; and (3) an optimization that automatically marks a subset of the safe-to-approximate operations as approximate while statistically adhering to the error expectations. We evaluate ExpAX on a diverse set of Java applications. The results show that ExpAX provides significant energy savings (up to 35%) with a large reduction in programmer effort (between 3× and 113×) while providing formal safety and statistical quality-of-result guarantees.
  • Item
    Methodical Approximate Hardware Design and Reuse
    (Georgia Institute of Technology, 2014) Yazdanbakhsh, Amir ; Thwaites, Bradley ; Park, Jongse ; Esmaeilzadeh, Hadi
    Design and reuse of approximate hardware components (digital circuits that may produce inaccurate results) can potentially lead to significant performance and energy improvements. Many emerging error-resilient applications can exploit such designs provided approximation is applied in a controlled manner. This paper provides the design abstractions and semantics for methodical, modular, and controlled approximate hardware design and reuse. With these abstractions, critical parts of the circuit still carry the strict semantics of traditional hardware design, while the approximable parts retain the flexibility needed for approximation. We discuss these abstractions in the context of synthesizable register transfer level (RTL) design with Verilog. Our framework governs the application of approximation during the synthesis process without involving the designers in the details of approximate synthesis and optimization. Through high-level annotations, our design paradigm provides high-level control over where and to what degree approximation is applied. We believe that our work forms a foundation for practical approximate hardware design and reuse.
  • Item
    Expectation-Oriented Framework for Automating Approximate Programming
    (Georgia Institute of Technology, 2013) Esmaeilzadeh, Hadi ; Ni, Kangqi ; Naik, Mayur
    This paper describes ExpAX, a framework for automating approximate programming based on programmer-specified error expectations. Three components constitute ExpAX: (1) a programming model based on a new kind of program specification, which we refer to as expectations. Our programming model enables programmers to implicitly relax the accuracy constraints without explicitly marking operations as approximate; (2) a novel approximation safety analysis that automatically identifies a safe-to-approximate subset of the program operations; and (3) an optimization that automatically marks a subset of the safe-to-approximate operations as approximate while considering the error expectation. Further, we formulate the process of automatically marking operations as approximate as an optimization problem and provide a genetic algorithm to solve it. We evaluate ExpAX using a diverse set of applications and show that it can provide significant energy savings while keeping quality-of-result degradation low. ExpAX automatically excludes the safe-to-approximate operations that, if approximated, would lead to significant quality degradation.
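The optimization step described above, using a genetic algorithm to choose which safe-to-approximate operations to actually mark approximate, might look like this in miniature. The per-operation savings and error numbers, the fitness function, and all GA parameters are invented for illustration and are not ExpAX's actual formulation:

```python
import random

OPS = 8
SAVINGS = [3, 1, 4, 1, 5, 2, 6, 2]        # energy saved per approximated op
ERROR = [0.02, 0.01, 0.05, 0.01, 0.08, 0.02, 0.10, 0.03]
EXPECTATION = 0.10                         # programmer's error expectation

def fitness(mask):
    # Reward energy savings; reject markings that violate the expectation.
    err = sum(e for e, m in zip(ERROR, mask) if m)
    if err > EXPECTATION:
        return -1.0
    return sum(s for s, m in zip(SAVINGS, mask) if m)

def evolve(pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    # Seed with the do-nothing marking so a feasible candidate always exists.
    pop = [[0] * OPS] + [
        [rng.randint(0, 1) for _ in range(OPS)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]   # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, OPS)
            child = a[:cut] + b[cut:]      # one-point crossover
            if rng.random() < 0.2:         # occasional bit-flip mutation
                child[rng.randrange(OPS)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

Each individual is a bitmask over the safe-to-approximate operations; infeasible markings (those exceeding the error expectation) are penalized so the search settles on high-savings subsets that still respect the specification.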