Series
School of Computer Science Technical Report Series

Series Type
Publication Series

Publication Search Results

Now showing 1 - 4 of 4
  • Item
    Neural Acceleration for GPU Throughput Processors
    (Georgia Institute of Technology, 2015) Yazdanbakhsh, Amir ; Park, Jongse ; Sharma, Hardik ; Lotfi-Kamran, Pejman ; Esmaeilzadeh, Hadi
    General-purpose computing on graphics processing units (GPGPU) accelerates the execution of diverse classes of applications, such as recognition, gaming, data analytics, weather prediction, and multimedia. Many of these applications are amenable to approximate execution. This application characteristic provides an opportunity to improve the performance and efficiency of GPGPU. Recent work has shown significant gains with neural approximate acceleration for CPU workloads. This work studies the effectiveness of neural approximate acceleration for GPU workloads. Because applying CPU neural accelerators to GPUs leads to high area overhead, we define a low-overhead neurally accelerated architecture for GPGPUs that enables scalable integration of neural acceleration on the large number of GPU cores. We also devise a mechanism that controls the tradeoff between the quality of results and the benefits from neural acceleration. We evaluate this design on a modern GPU architecture using a diverse set of benchmarks. Compared to the baseline GPGPU architecture, the cycle-accurate simulation results show 2.4x average speedup and 2.8x average energy reduction with 10% quality loss across all benchmarks. The quality control mechanism retains 1.9x average speedup and 2.1x energy reduction while reducing the quality degradation to 2.5%. These benefits are achieved with approximately 1.2% area overhead.
    (An illustrative sketch of the neurally accelerated approach appears after this listing.)
  • Item
    TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning
    (Georgia Institute of Technology, 2015) Mahajan, Divya ; Park, Jongse ; Amaro, Emmanuel ; Sharma, Hardik ; Yazdanbakhsh, Amir ; Kim, Joon ; Esmaeilzadeh, Hadi
    A growing number of commercial and enterprise systems increasingly rely on compute-intensive machine learning algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. To accommodate the needs of machine learning algorithms, Field Programmable Gate Arrays (FPGAs) provide a promising path forward and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long design cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for each machine learning algorithm, we develop TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as stochastic optimization problems. Therefore, a learning task becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function. The gradient solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework for accelerating this class of learning algorithms. With TABLA, the developer uses a high-level language to specify only the learning model as the gradient of the objective function. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization. We use TABLA to generate accelerators for ten different learning tasks that are implemented on a Xilinx Zynq FPGA platform. We rigorously compare the benefits of the FPGA acceleration to both multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 15.0x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 22.7x, 53.7x, and 30.6x higher performance-per-Watt compared to the Tegra K1, GTX 650 Ti, and Tesla K40, respectively. These benefits are achieved while the programmer writes fewer than 50 lines of code.
    (A sketch of the fixed-solver, task-specific-gradient abstraction appears after this listing.)
  • Item
    ExpAX: A Framework for Automating Approximate Programming
    (Georgia Institute of Technology, 2014) Park, Jongse ; Zhang, Xin ; Ni, Kangqi ; Esmaeilzadeh, Hadi ; Naik, Mayur
    We present ExpAX, a framework for automating approximate programming. ExpAX consists of three components: (1) a programming model based on a new kind of program specification, which we refer to as error expectations; this programming model enables programmers to implicitly relax accuracy constraints without explicitly marking operations as approximate; (2) an approximation safety analysis that automatically infers a safe-to-approximate set of program operations; and (3) an optimization that automatically marks a subset of the safe-to-approximate operations as approximate while statistically adhering to the error expectations. We evaluate ExpAX on a diverse set of Java applications. The results show that ExpAX provides significant energy savings (up to 35%) with a large reduction in programmer effort (between 3x and 113x) while providing formal safety and statistical quality-of-result guarantees.
    (A sketch of selecting approximate operations under an error expectation appears after this listing.)
  • Item
    Methodical Approximate Hardware Design and Reuse
    (Georgia Institute of Technology, 2014) Yazdanbakhsh, Amir ; Thwaites, Bradley ; Park, Jongse ; Esmaeilzadeh, Hadi
    Design and reuse of approximate hardware components (digital circuits that may produce inaccurate results) can potentially lead to significant performance and energy improvements. Many emerging error-resilient applications can exploit such designs provided approximation is applied in a controlled manner. This paper provides the design abstractions and semantics for methodical, modular, and controlled approximate hardware design and reuse. With these abstractions, critical parts of the circuit still carry the strict semantics of traditional hardware design, while flexibility is provided where approximation is acceptable. We discuss these abstractions in the context of synthesizable register transfer level (RTL) design with Verilog. Our framework governs the application of approximation during the synthesis process without involving the designers in the details of approximate synthesis and optimization. Through high-level annotations, our design paradigm provides high-level control over where and to what degree approximation is applied. We believe that our work forms a foundation for practical approximate hardware design and reuse.
    (A conceptual sketch of degree-controlled approximation appears after this listing.)
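
Illustrative sketches

The sketch below is a minimal software illustration of the idea behind "Neural Acceleration for GPU Throughput Processors": an approximable code region is replaced by a small multilayer perceptron trained on input/output samples of that region, and a quality knob routes a sampled fraction of invocations back to the precise path. The region function, network topology, learning rate, and sampling rate are all hypothetical; the paper integrates the neural acceleration into GPU hardware rather than running it in software.

```python
# Minimal sketch (hypothetical): replace an approximable region with a small MLP
# and expose a sampling-based quality knob. All functions and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def precise_region(xy):
    """The 'approximable' code region: a toy 2-input -> 1-output function."""
    x, y = xy[:, 0], xy[:, 1]
    return np.sqrt(x * x + y * y) + 0.1 * np.sin(5.0 * x)

# Train a tiny 2-8-1 MLP on input/output samples of the region.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
t = precise_region(X).reshape(-1, 1)
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(3000):
    h = np.tanh(X @ W1 + b1)            # hidden layer
    y = h @ W2 + b2                     # linear output
    err = y - t
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)     # backprop for mean-squared error
    dh = (err @ W2.T) * (1 - h * h)
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

def neural_region(xy):
    return np.tanh(xy @ W1 + b1) @ W2 + b2

def accelerated_call(xy, invoke_precise_rate=0.1):
    """Quality knob: route a sampled fraction of invocations to the precise path."""
    if rng.random() < invoke_precise_rate:
        return precise_region(xy).reshape(-1, 1)
    return neural_region(xy)

test = rng.uniform(-1.0, 1.0, size=(500, 2))
exact = precise_region(test).reshape(-1, 1)
print("mean abs error of neural region:", float(np.abs(neural_region(test) - exact).mean()))
print("one accelerated invocation:", float(accelerated_call(test[:1])[0, 0]))
```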
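The next sketch mirrors the TABLA abstraction in plain Python rather than on an FPGA: the stochastic gradient descent solver is fixed, and only the gradient of the objective function changes per learning task. The solver signature, the data, and the two example gradients are illustrative, not TABLA's actual template language.

```python
# Minimal sketch (hypothetical): a fixed SGD solver ("template") where only the
# gradient of the objective differs across learning tasks.
import numpy as np

def sgd(gradient, X, y, dim, lr=0.1, epochs=20, seed=0):
    """Fixed solver: only `gradient` changes from one learning task to another."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            w -= lr * gradient(w, X[i], y[i])
    return w

def linreg_grad(w, x, y):
    """Task 1: linear regression, gradient of 0.5 * (w.x - y)^2."""
    return (w @ x - y) * x

def logreg_grad(w, x, y):
    """Task 2: logistic regression, gradient of the logistic loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return (p - y) * x

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])   # bias column
y_lin = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=200)
y_log = (X[:, 0] - X[:, 1] > 0).astype(float)

print("linear regression weights:", sgd(linreg_grad, X, y_lin, dim=3))
print("logistic regression weights:", sgd(logreg_grad, X, y_log, dim=3))
```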
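The following sketch gives a flavor of ExpAX's third component under simplifying assumptions: given operations already shown safe to approximate, it greedily marks the most profitable ones as approximate while a hypothetical, additive error estimate stays within the programmer's error expectation. The error model, energy numbers, and selection heuristic are illustrative; ExpAX's actual analysis and optimization are not reproduced here.

```python
# Minimal sketch (hypothetical): pick a subset of safe-to-approximate operations
# under an additive error budget, preferring the best saving-per-error ratio.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    est_error: float      # estimated contribution to output error if approximated
    energy_saving: float  # estimated energy saved if approximated

def select_approximate(safe_ops, error_expectation):
    """Mark ops approximate while the (assumed additive) error stays in budget."""
    chosen, budget = [], error_expectation
    for op in sorted(safe_ops,
                     key=lambda o: o.energy_saving / max(o.est_error, 1e-9),
                     reverse=True):
        if op.est_error <= budget:
            chosen.append(op.name)
            budget -= op.est_error
    return chosen

safe_ops = [
    Op("mul@line12", 0.004, 3.0),
    Op("add@line13", 0.001, 1.0),
    Op("sqrt@line20", 0.020, 8.0),
    Op("mul@line27", 0.015, 2.0),
]
print(select_approximate(safe_ops, error_expectation=0.01))
```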
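Finally, a conceptual sketch of the "where and to what degree" idea from "Methodical Approximate Hardware Design and Reuse", written in Python for consistency with the sketches above rather than in Verilog: annotated operators are evaluated with a controllable number of low-order fixed-point bits dropped, while unannotated (critical) operators keep exact semantics. The fixed-point format, annotation form, and toy datapath are hypothetical and stand in for the paper's RTL abstractions.

```python
# Conceptual sketch (hypothetical, Python rather than Verilog): per-operator
# annotations control where approximation is applied and to what degree, while
# unannotated "critical" operators keep exact semantics. The "degree" here is the
# number of low-order fractional bits dropped from a fixed-point value.
FRAC_BITS = 16  # Q16 fixed point (illustrative)

def to_fixed(x):
    return int(round(x * (1 << FRAC_BITS)))

def to_float(q):
    return q / (1 << FRAC_BITS)

def approx(q, degree):
    """Drop `degree` low-order bits: the annotation's 'to what degree' knob."""
    if degree == 0:                    # unannotated / critical: exact
        return q
    return (q >> degree) << degree

def datapath(a, b, degrees):
    """A toy two-stage datapath; each stage carries its own approximation degree."""
    s1 = approx(to_fixed(a) + to_fixed(b), degrees["adder"])
    s2 = approx((s1 * to_fixed(b)) >> FRAC_BITS, degrees["multiplier"])
    return to_float(s2)

exact = datapath(1.2345, 0.789, {"adder": 0, "multiplier": 0})
loose = datapath(1.2345, 0.789, {"adder": 6, "multiplier": 10})
print(f"exact={exact:.6f}  approximate={loose:.6f}")
```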