Tools

STONNE Simulator

  • The design of specialized architectures for accelerating the inference of Deep Neural Networks (DNNs) is a booming area of research nowadays. While first-generation accelerator proposals used simple fixed dataflows tailored for dense DNNs, more recent architectures have argued for flexibility to efficiently support a wide variety of layer types, dimensions, and sparsity. As the complexity of these accelerators grows, it becomes increasingly valuable for researchers to have cycle-level simulation tools at their disposal that allow for fast and accurate design-space exploration and rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, we present STONNE (Simulation TOol of Neural Network Engines), a cycle-level, highly modular and highly extensible simulation framework that can plug into any high-level DNN framework as an accelerator device and perform end-to-end evaluation of flexible accelerator microarchitectures with sparsity support, running complete DNN models. A toy illustration of the cycle-by-cycle simulation idea is sketched below the link.

    Reference: F. Muñoz-Martínez, J. L. Abellán, M. E. Acacio, T. Krishna, "STONNE: Enabling Cycle-Level Microarchitectural Simulation for DNN Inference Accelerators". IEEE International Symposium on Workload Characterization (IISWC), Storrs, CT, USA, 2021. [PDF]

    Link to STONNE
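
    STONNE itself is a C++ code base and its real interfaces are described in the repository; the C fragment below is only a minimal, hypothetical sketch of what "cycle-level" simulation means: each hardware module exposes a per-cycle update function and a central loop advances every module one clock tick at a time while counting elapsed cycles. Module names and latencies are invented for illustration.

      /* Minimal, hypothetical sketch of a cycle-level simulation loop.
       * This is NOT STONNE code; module names and latencies are invented
       * purely for illustration. */
      #include <stdbool.h>
      #include <stdio.h>

      typedef struct {
          const char *name;
          int busy_cycles;          /* cycles of work left in this unit */
      } module_t;

      /* Advance one module by a single clock cycle. */
      static void tick(module_t *m)
      {
          if (m->busy_cycles > 0)
              m->busy_cycles--;
      }

      static bool all_idle(const module_t *mods, int n)
      {
          for (int i = 0; i < n; i++)
              if (mods[i].busy_cycles > 0)
                  return false;
          return true;
      }

      int main(void)
      {
          /* Toy "accelerator": a distribution network, a multiplier array
           * and a reduction network, each with some pending work. */
          module_t mods[] = {
              { "distribution-network", 120 },
              { "multiplier-array",     200 },
              { "reduction-network",    150 },
          };
          const int n = sizeof(mods) / sizeof(mods[0]);

          unsigned long cycle = 0;
          while (!all_idle(mods, n)) {    /* one iteration == one clock cycle */
              for (int i = 0; i < n; i++)
                  tick(&mods[i]);
              cycle++;
          }
          printf("simulated execution took %lu cycles\n", cycle);
          return 0;
      }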

Splash-4 benchmark suite

  • Over the past three decades, the parallel applications of the Splash-2 benchmark suite have been instrumental in advancing multiprocessor research. Recently, the Splash-3 benchmarks eliminated performance bugs, data races, and improper synchronization that plagued the Splash-2 benchmarks after the definition of the C memory model. In Splash-4, we revisit the Splash-3 benchmarks and adapt them for contemporary architectures with atomic operations and lock-free constructs. With our changes, we improve the scalability of most benchmarks at 32 and 64 cores, showing improvements of up to 9x on actual machines and up to 5x in simulation over the unmodified Splash-3 benchmarks. To denote the substantive nature of these improvements over the Splash-3 benchmarks, we name the new collection Splash-4. An illustration of the kind of lock-to-atomic transformation involved is sketched below the link.

    Reference: Eduardo José Gómez-Hernández, Ruixiang Shao, Christos Sakalis, Stefanos Kaxiras, Alberto Ros, "Splash-4: Improving Scalability with Lock-Free Constructs". International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 235--236, Worldwide event, March 2021. [PDF]

    Link to Splash-4
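
    The actual Splash-4 modifications are in the suite's sources; the C11 fragment below is only a hedged illustration of the general pattern: a lock-protected shared counter update (common in the Splash-3 codes) replaced by a single lock-free atomic fetch-and-add. Identifiers and iteration counts are hypothetical.

      /* Hedged illustration of the kind of lock-to-atomic rewrite described
       * for Splash-4; identifiers are hypothetical, not taken from the suite.
       * Build: cc -std=c11 -pthread example.c */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      #define NTHREADS 4
      #define NITERS   100000

      /* Splash-3 style: shared counter protected by a lock. */
      static long counter_locked;
      static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

      /* Splash-4 style: the same counter as a lock-free atomic. */
      static atomic_long counter_atomic;

      static void *worker(void *arg)
      {
          (void)arg;
          for (int i = 0; i < NITERS; i++) {
              /* Lock-based update: acquire/release on every increment. */
              pthread_mutex_lock(&counter_mutex);
              counter_locked++;
              pthread_mutex_unlock(&counter_mutex);

              /* Lock-free update: a single atomic read-modify-write. */
              atomic_fetch_add(&counter_atomic, 1);
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t tid[NTHREADS];
          for (int i = 0; i < NTHREADS; i++)
              pthread_create(&tid[i], NULL, worker, NULL);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(tid[i], NULL);

          printf("locked counter:    %ld\n", counter_locked);
          printf("lock-free counter: %ld\n", atomic_load(&counter_atomic));
          return 0;
      }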

Splash-3 benchmark suite

  • A well-known benchmark suite of parallel applications is the Splash-2 suite. Since its creation in the context of the DASH project, the Splash-2 benchmarks have been widely used in research. However, Splash-2 was released over two decades ago and does not adhere to the modern C memory consistency model. This leads to unexpected and often incorrect behavior when some Splash-2 benchmarks are used in conjunction with contemporary compilers and hardware (simulated or real). Most importantly, we discovered critical performance bugs. With the Splash-3 benchmark suite we rectify the problematic benchmarks and contribute to the community a new, sanitized version of the Splash-2 benchmarks. A sketch of the class of data race that Splash-3 removes is shown below the link.

    Reference: Christos Sakalis, Carl Leonardsson, Stefanos Kaxiras, Alberto Ros, "Splash-3: A Properly Synchronized Benchmark Suite for Contemporary Research". International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 101--111, April 2016. [PDF]

    Link to Splash-3
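
    The concrete fixes are documented in the Splash-3 paper and sources; the C11 sketch below (with invented names) illustrates the class of bug involved: publishing data through a plain shared flag is a data race under the C memory model, so the compiler and hardware are free to break it, whereas an atomic flag with release/acquire ordering makes the hand-off well-defined.

      /* Illustrative sketch (not Splash code) of the class of bug Splash-3
       * removes.  Under the C11 memory model the plain-variable version is
       * a data race: the compiler may assume `ready_plain` never changes
       * and turn the wait loop into an infinite loop. */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static int data;

      /* Racy version: plain shared flag, undefined behavior under C11. */
      static int ready_plain;

      /* Fixed version: atomic flag with release/acquire ordering. */
      static atomic_int ready_atomic;

      static void *producer(void *arg)
      {
          (void)arg;
          data = 42;

          ready_plain = 1;                              /* racy publish    */
          atomic_store_explicit(&ready_atomic, 1,
                                memory_order_release);  /* correct publish */
          return NULL;
      }

      static void *consumer(void *arg)
      {
          (void)arg;
          /* Racy wait (undefined behavior): while (!ready_plain) ;       */

          /* Correct wait: the acquire load synchronizes with the release
           * store, so reading `data` afterwards is well-defined. */
          while (!atomic_load_explicit(&ready_atomic, memory_order_acquire))
              ;   /* spin */
          printf("data = %d\n", data);
          return NULL;
      }

      int main(void)
      {
          pthread_t p, c;
          pthread_create(&c, NULL, consumer, NULL);
          pthread_create(&p, NULL, producer, NULL);
          pthread_join(p, NULL);
          pthread_join(c, NULL);
          return 0;
      }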

Fast&Furious tool

  • Existing multi-threaded applications perform synchronization either in an explicit way, e.g., making use of the functionality provided by synchronization libraries, or in an implicit way, e.g., using shared variables. Unfortunately, implicit synchronization constructs are error-prone and difficult to detect. We developed a tool that is able to detect implicit synchronization in multi-threaded applications. The detection works by checking that, during the execution of an application under a memory model that provides sequential consistency for data-race-free applications (SC for DRF), every read returns the same value as it would under sequential consistency. If this condition is not fulfilled by the execution, the application has data races, which may be intended to perform implicit synchronization. An example of such implicit synchronization through a plain shared variable is sketched below the link.

    Reference: Alberto Ros, Stefanos Kaxiras, "Fast&Furious: A Tool for Detecting Covert Racing". 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures (PARMA) and 4th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (DITAM), pages 1--6, January 2015. [PDF]

    Link to Fast&Furious
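
    The fragment below is not Fast&Furious code; it is a hedged C example (all names invented) of the covert-racing pattern the tool targets: two threads coordinate through a plain shared variable instead of a synchronization library, so the "synchronization" is a data race that only works if every read happens to return the value a sequentially consistent execution would produce.

      /* Illustrative only (not Fast&Furious code): implicit synchronization
       * through a plain shared variable -- the covert-racing pattern the
       * tool is designed to detect. */
      #include <pthread.h>
      #include <stdio.h>

      static int payload;
      static volatile int turn;   /* plain shared flag used as a covert lock */

      static void *thread_a(void *arg)
      {
          (void)arg;
          payload = 1;
          turn = 1;               /* racy write used as a release signal */
          return NULL;
      }

      static void *thread_b(void *arg)
      {
          (void)arg;
          while (turn == 0)       /* racy read used as an acquire wait      */
              ;                   /* busy-wait: synchronization is implicit */
          /* Under sequential consistency this read sees payload == 1; under
           * a weaker execution (or after compiler reordering) it may not --
           * exactly the value mismatch the detection approach checks for. */
          printf("payload = %d\n", payload);
          return NULL;
      }

      int main(void)
      {
          pthread_t a, b;
          pthread_create(&b, NULL, thread_b, NULL);
          pthread_create(&a, NULL, thread_a, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          return 0;
      }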