Person:
Prvulovic, Milos

Publication Search Results

  • Item
    Performance debugging support for many-core processors project
    (Georgia Institute of Technology, 2012-09) Prvulovic, Milos ; Oh, Jungju ; Park, Sunjae
    In recent years, the number of cores available on a processor has increased rapidly, while the performance of an individual core has increased much more slowly. As a result, achieving a large performance improvement for applications now requires programmers to leverage the increased core count. This is often a very challenging problem, and many parallel applications end up suffering from performance bugs caused by scalability limiters, which prevent performance from improving as much as it should with more cores. Since we expect core counts to continue increasing for the foreseeable future, addressing scalability limiters is important for developing software that will obtain better performance on future hardware. This project, jointly funded by SRC and NSF, investigated software and hardware mechanisms that automate significant parts of this performance/scalability debugging effort in order to give programmers accurate and actionable feedback about the scalability limiters present in their code. Scalability limiters are mostly caused by resource-related bottlenecks and by insufficient exposed parallelism in the application. The main resource-related bottlenecks stem from excessive cache misses, while insufficient parallelism mostly manifests as threads waiting to complete a synchronization operation such as a lock (lock contention) or a barrier (load imbalance).
  • Item
    KIMA: Hybrid Checkpointing for Recovery from a Wide Range of Errors and Detection Latencies
    (Georgia Institute of Technology, 2010) Doudalis, Ioannis ; Prvulovic, Milos
    Full system reliability is a problem that spans multiple levels of the software/hardware stack. The normal execution of a program in a system can be disrupted by multiple factors, ranging from transient processor errors and software bugs to permanent hardware failures and human mistakes. A common method for recovering from such errors is the creation of checkpoints during the execution of the program, allowing the system to restore the program to a previous error-free state and resume execution. Different causes of errors, though, have different occurrence frequencies and detection latencies, requiring the creation of multiple checkpoints at different frequencies in order to maximize the availability of the system. In this paper we present KIMA, a novel checkpoint creation and management technique that efficiently combines the existing undo-log and redo-log checkpointing approaches, reducing the overall bandwidth requirements for both the memory and the hard disk. KIMA establishes DRAM-based undo-log checkpoints every 10 ms, then leverages the undo-log metadata and checkpointed data to establish redo-log checkpoints every 1 second in non-volatile memory (such as PCM). Our results show that KIMA incurs average overheads of less than 1% while enabling efficient recovery from both transient and hard errors that have a variety of detection latencies.
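
    The two checkpoint levels described in this abstract can be illustrated with a small software model. The following is only a sketch under stated assumptions, not KIMA's actual design: the class `HybridCheckpointMemory` and all of its method names are hypothetical, memory is modeled as a Python list, and non-volatile memory is modeled as a second list. The default ratio of 100 undo-log checkpoints per redo-log checkpoint follows from the 10 ms and 1 second intervals quoted above. Discarding undo logs after each redo-log checkpoint is a simplification made here for brevity.

```python
class HybridCheckpointMemory:
    """Toy word-addressed memory with undo-log and redo-log checkpoints.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, size, undo_per_redo=100):
        self.mem = [0] * size            # working memory (stands in for DRAM)
        self.undo_per_redo = undo_per_redo
        self.current_undo = {}           # addr -> value at start of open epoch
        self.undo_logs = []              # sealed undo logs, oldest first
        self.nvm_image = [0] * size      # base image in "non-volatile memory"
        self.redo_logs = []              # each entry: {addr: value at checkpoint}

    def write(self, addr, value):
        # Undo logging: save the old value on the first write per epoch.
        if addr not in self.current_undo:
            self.current_undo[addr] = self.mem[addr]
        self.mem[addr] = value

    def undo_checkpoint(self):
        # Seal the open epoch (the frequent, fine-grained checkpoint).
        self.undo_logs.append(self.current_undo)
        self.current_undo = {}
        if len(self.undo_logs) == self.undo_per_redo:
            self.redo_checkpoint()

    def redo_checkpoint(self):
        # Reuse undo-log metadata (the logged addresses) to find every
        # location modified since the last redo-log checkpoint.
        dirty = set()
        for log in self.undo_logs:
            dirty.update(log)
        self.redo_logs.append({addr: self.mem[addr] for addr in dirty})
        self.undo_logs = []              # simplification: drop old undo logs

    def rollback(self, epochs=0):
        # Short detection latency: undo the open epoch, then `epochs`
        # sealed epochs, newest first.
        for addr, old in self.current_undo.items():
            self.mem[addr] = old
        self.current_undo = {}
        for _ in range(min(epochs, len(self.undo_logs))):
            for addr, old in self.undo_logs.pop().items():
                self.mem[addr] = old

    def recover_from_nvm(self):
        # Hard error: rebuild memory by replaying redo logs over the base image.
        image = list(self.nvm_image)
        for redo in self.redo_logs:
            for addr, value in redo.items():
                image[addr] = value
        self.mem = image
        self.current_undo = {}
        self.undo_logs = []


m = HybridCheckpointMemory(size=4, undo_per_redo=2)
m.write(0, 1)
m.undo_checkpoint()
m.write(0, 2)
m.write(1, 5)
m.undo_checkpoint()      # second undo checkpoint triggers a redo checkpoint
m.write(2, 9)
m.rollback()             # discards only the uncommitted write to address 2
after_rollback = list(m.mem)
m.write(1, 6)
m.recover_from_nvm()     # falls back to the last redo-log checkpoint
after_recovery = list(m.mem)
```

    In this model, `rollback` corresponds to recovery from errors detected quickly, while `recover_from_nvm` corresponds to recovery when the working memory itself can no longer be trusted.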