CAPS Home

Publications

To view the group's complete research catalog, including books, patents, and theses, follow this link.

Eduardo José Gómez-Hernández, Juan M. Cebrian, Stefanos Kaxiras, Alberto Ros (2022), "Splash-4: A Modern Benchmark Suite with Lock-Free Constructs", International Symposium on Workload Characterization (IISWC).

Abstract: The cornerstone for the performance evaluation of computer systems is the benchmark suite. Among the many benchmark suites used in high-performance computing and multicore research, Splash-2 has been instrumental in advancing knowledge for both academia and industry. Published in 1995 and with over 5276 citations and counting, this benchmark suite is still in use to evaluate novel architectural proposals. Recently, the Splash-3 suite eliminates important performance bugs, data races, and improper synchronization that plagued Splash-2 benchmarks after the formal definition of the C memory model. However, keeping up with architectural changes while maintaining the same workloads and algorithms (for comparative purposes) is a real challenge. Benchmark suites can misrepresent the performance characteristics of a computer system if they do not reflect the available features of the hardware and architects may end up overestimating the impact of proposed techniques or underestimating others. In this work we introduce a revised version of Splash-3, designated Splash-4, that introduces modern programming techniques to improve scalability on contemporary hardware. We then characterize Splash-3 and Splash-4 in a state-of-the-art simulated architecture, Intel's Ice Lake with the gem5-20 simulator, as well as a real contemporary hardware processor (AMD's EPYC 7002 series). Our evaluation shows that for a 64-thread execution Splash-4 reduces the normalized execution time by an average of 52% and 34% for AMD's EPYC and Intel's Ice Lake, respectively.

BibTeX:

                    @InProceedings{ejgomez-iiswc22,
                      author =       {Eduardo Jos{\'e} G{\'o}mez-Hern{\'a}ndez and Juan Manuel Cebrian and Stefanos Kaxiras and Alberto Ros},
                      title =        {Splash-4: A Modern Benchmark Suite with Lock-Free Constructs},
                      booktitle =    {2022 IEEE International Symposium on Workload Characterization (IISWC)},
                      doi =          {},
                      pages =        {},
                      year =         {2022},
                      editor =       {},
                      address =      {Austin, TX (USA)},
                      month =        nov,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {47.92% (23/48)},
                      isbn =         {},
                      url =          {http://webs.um.es/aros/papers/pdfs/ejgomez-iiswc22.pdf}
                    }

Agustín Navarro-Torres, Biswabandan Panda, J. Alastruey-Benedé, Pablo Ibáñez, Víctor Viñals-Yúfera, and Alberto Ros (2022), "Berti: an Accurate Local-Delta Data Prefetcher", International Symposium on Microarchitecture (MICRO).

Abstract: Data prefetching is a technique that plays a crucial role in modern high-performance processors by hiding long-latency memory accesses. Several state-of-the-art hardware prefetchers exploit the concept of deltas, defined as the difference between the cache line addresses of two demand accesses. Existing delta prefetchers, such as best offset prefetching (BOP) and multi-lookahead prefetching (MLOP), train and predict future accesses based on global deltas. We observed that the use of global deltas results in missed opportunities to anticipate memory accesses. In this paper, we propose Berti, a first-level data cache prefetcher that selects the best local deltas, i.e., those that consider only demand accesses issued by the same instruction. Thanks to a high-confidence mechanism that precisely detects the timely local deltas with high coverage, Berti generates accurate prefetch requests. Then, it orchestrates the prefetch requests to the memory hierarchy, using the selected deltas. Our empirical results using ChampSim and SPEC CPU2017 and GAP workloads show that, with a storage overhead of just 2.55 KB, Berti improves performance by 8.5% compared to a baseline IP-stride and 3.5% compared to IPCP, a state-of-the-art prefetcher. Our evaluation also shows that Berti reduces dynamic energy at the memory hierarchy by 33.6% compared to IPCP, thanks to its high prefetch accuracy.
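
Illustrative sketch (not taken from the paper): the distinction between global and local deltas can be shown on a toy access trace. The Python below is a minimal example under stated assumptions: the 64-byte line size, the trace, and the function names `global_deltas` and `local_deltas` are all hypothetical. It only computes deltas between consecutive accesses and does not model Berti's timeliness or confidence mechanisms.

```python
from collections import defaultdict

LINE_SIZE = 64  # bytes per cache line (assumed for this example)


def global_deltas(trace):
    """Deltas between consecutive demand accesses, ignoring which instruction issued them."""
    lines = [addr // LINE_SIZE for _, addr in trace]
    return [b - a for a, b in zip(lines, lines[1:])]


def local_deltas(trace):
    """Deltas between consecutive demand accesses issued by the same instruction pointer (IP)."""
    last_line = {}
    deltas = defaultdict(list)
    for ip, addr in trace:
        line = addr // LINE_SIZE
        if ip in last_line:
            deltas[ip].append(line - last_line[ip])
        last_line[ip] = line
    return dict(deltas)


# Hypothetical trace of (IP, byte address) pairs: two instructions whose
# strided streams are interleaved in program order.
TRACE = [
    (0x400, 0x1000), (0x500, 0x9000),
    (0x400, 0x1080), (0x500, 0x8FC0),
    (0x400, 0x1100), (0x500, 0x8F80),
]

if __name__ == "__main__":
    print("global:", global_deltas(TRACE))
    print("local: ", local_deltas(TRACE))
```

In this toy trace the global deltas jump between large positive and negative values, while each instruction's local deltas are constant (+2 lines for one IP, -1 line for the other). This is the kind of regularity that a per-instruction view can expose.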

BibTeX:

                    @INPROCEEDINGS{9923806, 
                       author={Navarro-Torres, Agustín and Panda, Biswabandan and Alastruey-Benedé, Jesús and Ibáñez, Pablo and Viñals-Yúfera, Víctor and Ros, Alberto},  
                       booktitle={2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)},   
                       title={Berti: an Accurate Local-Delta Data Prefetcher},   
                       year={2022},  
                       volume={},  
                       number={},  
                       pages={975-991},  
                       doi={10.1109/MICRO56248.2022.00072}}

Sawan Singh, Arthur Perais, Alexandra Jimborean, and Alberto Ros (2022), "Exploring Instruction Fusion Opportunities in General Purpose Processors", International Symposium on Microarchitecture (MICRO).

Abstract: The Complex Instruction Set Computer (CISC) paradigm has led to the introduction of instruction cracking in which an architectural instruction is divided into multiple microarchitectural instructions (u-ops). However, the dual concept, instruction fusion, is also prevalent in modern microarchitectures to maximize resource utilization. In essence, some architectural instructions are too complex to be executed as a unit, so they should be cracked, while others are too simple to waste resources on executing them as a unit, so they should be fused with others. In this paper, we focus on instruction fusion and explore opportunities for fusing additional instructions in a high-performance general purpose pipeline. We show that enabling fusion for common RISC-V idioms improves performance by 7%. Then, we determine experimentally that enabling fusion only for memory instructions achieves 86% of the potential of fusion in this particular case. Finally, we propose the Helios microarchitecture, able to fuse non-consecutive and non-contiguous memory instructions, and discuss microarchitectural changes required to do so efficiently while preserving correctness. Helios allows fusing an additional 5.5% of dynamic instructions, yielding a 14.2% performance uplift over no fusion (8.2% over baseline fusion).

BibTeX:

                    @INPROCEEDINGS{9923815,
                      author={Singh, Sawan and Perais, Arthur and Jimborean, Alexandra and Ros, Alberto},
                      booktitle={2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)}, 
                      title={Exploring Instruction Fusion Opportunities in General Purpose Processors}, 
                      year={2022},
                      volume={},
                      number={},
                      pages={199-212},
                      doi={10.1109/MICRO56248.2022.00026}}

Ashkan Asgharzadeh, Juan M. Cebrian, Arthur Perais, Stefanos Kaxiras, Alberto Ros (2022), "Free Atomics: Hardware Atomic Operations without Fences", International Symposium on Computer Architecture (ISCA).

Abstract: Atomic Read-Modify-Write (RMW) instructions are primitive synchronization operations implemented in hardware that provide the building blocks for higher-abstraction synchronization mechanisms to programmers. According to publicly available documentation, current x86 implementations serialize atomic RMW operations, i.e., the store buffer is drained before issuing atomic RMWs and subsequent memory operations are stalled until the atomic RMW commits. This serialization, carried out by memory fences, incurs a performance cost which is expected to increase with deeper pipelines. This work proposes Free atomics, a lightweight, speculative, deadlock-free implementation of atomic operations that removes the need for memory fences, thus improving performance, while preserving atomicity and consistency. Free atomics is, to the best of our knowledge, the first proposal to enable store-to-load forwarding for atomic RMWs. Free atomics only requires simple modifications and incurs a small area overhead (15 bytes). Our evaluation using gem5-20 shows that, for a 32-core configuration, Free atomics improves performance by 12.5%, on average, for a large range of parallel workloads and 25.2%, on average, for atomic-intensive parallel workloads over a fenced atomic RMW implementation.

BibTeX:

                    @InProceedings{aasgharzadeh-isca22,
                      author = 	 {Ashkan Asgharzadeh and Juan M. Cebrian and Arthur Perais and Stefanos Kaxiras and Alberto Ros},
                      title = 	 {Free Atomics: Hardware Atomic Operations without Fences},
                      booktitle =    {49th International Symposium on Computer Architecture (ISCA)},
                      doi =          {10.1145/3470496.3527385},
                      pages = 	 {},
                      year = 	 {2022},
                      editor = 	 {ACM},
                      address = 	 {New York, NY (USA)},
                      month = 	 jun,
                      publisher =    {Association for Computing Machinery (ACM)},
                      ratio-acep =   {16.75% (67/400)},
                      isbn =         {978-1-4503-8610-4},
                      issn =         {1063-6897},
                      url =          {http://webs.um.es/aros/papers/pdfs/aasgharzadeh-isca22.pdf}
                    }

Juan Manuel Cebrian, Thibaud Balem, Adrian Barredo, Marc Casas, Miquel Moreto, Alberto Ros, Alexandra Jimborean (2022), "Compiler-Assisted Compaction/Restoration of SIMD Instructions", IEEE Transactions on Parallel and Distributed Systems (TPDS).

Abstract: Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This paper proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements on scenarios when it would fail to form dense instructions. 
Simulated results (gem5) show that CACR improves performance by up to 29% and reduces dynamic energy by up to 24.2% on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6% performance and 14% energy improvements for the same configuration and applications.

BibTeX:

                    @Article{jcebrian-tpds22,
                      author = 	 {Juan Manuel Cebrian and Thibaud Balem and Adrian Barredo and Marc Casas and Miquel Moreto and Alberto Ros and Alexandra Jimborean},
                      title = 	 {Compiler-Assisted Compaction/Restoration of SIMD Instructions},
                      journal = 	 {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
                      doi =          {10.1109/TPDS.2021.3091015},
                      publisher =    {IEEE Computer Society},
                      address =      {},
                      year = 	 {2022},
                      volume = 	 {33},
                      number = 	 {4},
                      issn =         {1045-9219},
                      pages = 	 {779--791},
                      month = 	 apr,
                      impactfactor = {2.600, 29/108 (Q1) - COMPUTER SCIENCE, THEORY & METHODS (2019)},
                      url =          {http://webs.um.es/aros/papers/pdfs/jmcebrian-tpds22.pdf}
                    }

Víctor Nicolás-Conesa, Rubén Titos-Gil, Ricardo Fernández-Pascual, Alberto Ros, Manuel E. Acacio (2022), "Analysis of the Interactions Between ILP and TLP With Hardware Transactional Memory", Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP).

Abstract: Hardware Transactional Memory (HTM) allows the use of transactions by programmers, making parallel programming easier and theoretically obtaining the performance of fine-grained locks. However, transactions can abort for a variety of reasons, resulting in the squash of speculatively executed instructions and the consequent loss in both performance and energy efficiency. Among the different sources of abort, conflicting concurrent accesses to the same shared memory locations from different transactions are often the prevalent cause. In this work, we characterize, for the first time to the best of our knowledge, how the aggressiveness of the cores in terms of exploiting instruction-level parallelism can interact with thread-level speculation support brought by HTM systems. We observe that altering the size of the structures that support out-of-order and speculative execution changes the number of aborts produced in the execution of transactional workloads on a best-effort HTM implementation. Our results show that a small number of powerful cores is more suitable for high-contention scenarios, whereas under low contention it is preferable to use a larger number of less aggressive cores. In addition, an aggressive core can lead to performance loss in medium-contention scenarios due to an increase in the number of aborts. We conclude that depending on contention, a careful choice over processor aggressiveness can reduce abort ratios.

BibTeX:

                    @InProceedings{vnicolas-pdp22,
                      author = 	 {V{\'{\i}}ctor Nicol{\'a}s-Conesa and Rub{\'e}n Titos-Gil and Ricardo Fern{\'a}ndez-Pascual and Alberto Ros and Manuel E. Acacio},
                      title = 	 {Analysis of the Interactions Between ILP and TLP With Hardware Transactional Memory},
                      booktitle =    {23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)},
                      doi =          {10.1109/PDP55904.2022},
                      editor = 	 {IEEE Computer Society},
                      pages = 	 {157--164},
                      year = 	 {2022},
                      publisher =    {IEEE Computer Society},
                      address = 	 {Worldwide event},
                      month = 	 mar,
                      ratio-acep =   {37.84% (28/74)},
                      isbn =         {978-1-6654-6958-6},
                      url =          {http://webs.um.es/aros/papers/pdfs/vnicolas-pdp22.pdf}
                    }

Marina Shimchenko, Rubén Titos-Gil, Ricardo Fernández-Pascual, Manuel E. Acacio, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean (2022), "Analysing Software Prefetching Opportunities in Hardware Transactional Memory", Journal of Supercomputing (SUPE).

Abstract: Hardware Transactional Memory emerged to make parallel programming more accessible. However, the performance pitfall of this technique is squashing speculatively executed instructions and re-executing them in case of aborts, ultimately resorting to serialization in case of repeated conflicts. A significant fraction of aborts occur due to conflicts (concurrent reads and writes to the same memory location performed by different threads). Our proposal aims to reduce conflict aborts by reducing the window of time during which transactional regions can suffer conflicts. We achieve this by using software prefetching instructions inserted automatically at compile-time. Through these prefetch instructions, we intend to bring the necessary data for each transaction from the main memory to the cache before the transaction itself starts to execute, thus converting the otherwise long latency cache misses into hits during the execution of the transaction. The obtained results show that our approach decreases the number of aborts by 30% on average and improves performance by up to 19% and 10% for two out of the eight evaluated benchmarks. We provide insights into when our technique is beneficial given certain characteristics of the transactional regions, the advantages and disadvantages of our approach, and finally, discuss potential solutions to overcome some of its limitations.

BibTeX:

                    @Article{mshimchenko-supe22,
                      author = 	 {Marina Shimchenko and Rub{\'e}n Titos-Gil and Ricardo Fern{\'a}ndez-Pascual and Manuel E. Acacio and Stefanos Kaxiras and Alberto Ros and Alexandra Jimborean},
                      title = 	 {Analysing Software Prefetching Opportunities in Hardware Transactional Memory},
                      journal = 	 {Journal of Supercomputing (SUPE)},
                      doi =          {10.1007/s11227-021-03897-z},
                      publisher =    {Springer US},
                      address =      {},
                      year = 	 {2022},
                      volume = 	 {78},
                      number = 	 {1},
                      issn =         {0920-8542},
                      pages = 	 {919--944},
                      month = 	 jan,
                      impactfactor = {1.532, 44/103 (Q2) - COMPUTER SCIENCE, THEORY & METHODS (2017)},
                      url =          {http://webs.um.es/aros/papers/pdfs/mshimchenko-supe21.pdf}
                    }

Rubén Titos-Gil, Ricardo Fernández-Pascual, Manuel E. Acacio, Alberto Ros (2022), "DeTraS: Delaying Stores for Friendly-Fire Mitigation in Hardware Transactional Memory", IEEE Transactions on Parallel and Distributed Systems (TPDS).

Abstract: Commercial Hardware Transactional Memory (HTM) systems are best-effort designs that leverage the coherence substrate to detect conflicts eagerly. Resolving conflicts in favor of the requesting core is the simplest option for ensuring deadlock freedom, yet it is prone to livelocks. In this work, we propose and evaluate DeTraS (Delayed Transactional Stores), an HTM-aware store buffer design aimed at mitigating such livelocks. DeTraS takes advantage of the fact that modern commercial processors implement a large store buffer, and uses it to prevent transactional stores predicted to conflict from performing early in the transaction. By leveraging existing processor structures, we propose a simple design that improves the ability of requester-wins HTM systems to achieve forward progress in spite of high contention while side-stepping the performance penalty of falling back to mutual exclusion. With just over 50 extra bytes, DeTraS captures the advantages of lazy conflict management without the complexity brought into the coherence fabric by commit arbitration schemes nor the relaxation of the single-writer invariant of prior works. Through detailed simulations of a 16-core tiled CMP using gem5, we demonstrate that DeTraS brings reductions in average execution time of 25% when compared to an Intel RTM-like design.

BibTeX:

                    @Article{rtitos-tpds22,
                      author = 	 {Rub{\'e}n Titos-Gil and Ricardo Fern{\'a}ndez-Pascual and Manuel E. Acacio and Alberto Ros},
                      title = 	 {DeTraS: Delaying Stores for Friendly-Fire Mitigation in Hardware Transactional Memory},
                      journal = 	 {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
                      doi =          {10.1109/TPDS.2021.3085210},
                      publisher =    {IEEE Computer Society},
                      address =      {},
                      year = 	 {2022},
                      volume = 	 {33},
                      number = 	 {1},
                      issn =         {1045-9219},
                      pages = 	 {1--13},
                      month = 	 jan,
                      impactfactor = {2.600, 29/108 (Q1) - COMPUTER SCIENCE, THEORY & METHODS (2019)},
                      url =          {http://webs.um.es/aros/papers/pdfs/rtitos-tpds22.pdf}
                    }

Bhargavi R. Upadhyay, Alberto Ros, Jalpa Shah (2021), "Efficient Classification of Private Memory Blocks", Journal of Parallel Distributed Computing (JPDC).

Abstract: Shared memory architectures are pervasive in the multicore technology era. Still, sequential and parallel applications use most of the data as private in a multicore system. Recent proposals using this observation and driven by a classification of private/shared memory data can reduce the coherence directory area or the memory access latency. The effectiveness of these proposals depends on the accuracy of the classification. The existing proposals perform the private/shared classification at page granularity, leading to misclassification and reducing the number of detected private memory blocks. We propose a mechanism able to accurately classify memory blocks using the existing translation lookaside buffers (TLB), which increases the effectiveness of proposals relying on a private/shared classification. Our experimental results show that the proposed scheme reduces L1 cache misses by 25% compared to a page-grain classification approach, which translates into an improvement in system performance by 8.0% with respect to a page-grain approach. Keywords: chip multiprocessor, cache coherence, private-shared data classification

BibTeX:

                    @Article{bupadhyay-jpdc21,
                      author = 	 {Bhargavi R. Upadhyay and Alberto Ros and Jalpa Shah},
                      title = 	 {Efficient Classification of Private Memory Blocks},
                      journal = 	 {Journal of Parallel Distributed Computing (JPDC)},
                      doi =          {10.1016/j.jpdc.2021.07.005},
                      publisher =    {Academic Press, Inc.},
                      address =      {Orlando, FL (USA)},
                      year = 	 {2021},
                      volume = 	 {157},
                      number = 	 {},
                      issn =         {0743-7315},
                      pages = 	 {256--268},
                      month = 	 nov,
                      impactfactor = {2.296, 35/108 (Q2) - COMPUTER SCIENCE, THEORY & METHODS (2019)},
                      url =          {http://webs.um.es/aros/papers/pdfs/bupadhyay-jpdc21.pdf}
                    }

Eduardo José Gómez-Hernández, Rubén Titos-Gil, Juan Manuel Cebrian, Stefanos Kaxiras, Alberto Ros (2021), "Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations", International Symposium on Microarchitecture (MICRO).

Abstract: Critical sections that read, modify, and write (RMW) a small set of addresses are common in parallel applications and concurrent data structures. However, to escape from the intricacies of fine-grained locks, which require reasoning about all possible thread interleavings, programmers often resort to coarse-grained locks to ensure atomicity. This results in atomic protection of a much larger set of potentially conflicting addresses, and, consequently, increased lock contention and unneeded serialization. As many before us have observed, these problems would be solved if only general RMW multi-address atomic operations were available, but current proposals are impractical because of deadlock scenarios that appear due to resource limitations. Alternatively, transactional memory can detect conflicts at run-time aiming to maximize concurrency, but it has significant overheads in highly-contended critical sections. In this work, we propose multi-address atomic operations (MAD atomics). MAD atomics achieve complexity-effective, non-speculative, non-deadlocking, fine-grained locking for multiple addresses, relying solely on the coherence protocol and a predetermined locking order. Unlike prior works, MAD atomics address the challenge of enabling atomic modification over a set of cachelines with arbitrary addresses, simultaneously locking all of them while sidestepping deadlock. MAD atomics only require a small storage per core (around 68 bytes), while significantly outperforming typical lock implementations. Indeed, our evaluation using gem5-20 shows that MAD atomics can improve performance by up to 18× (3.4×, on average, for the applications and concurrent data structures evaluated in this work) over a baseline implemented with locks running on 16 cores. More importantly, the improvement still reaches 2.7×, on average, compared to an Intel hardware transactional memory implementation running on 16 cores.
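
Illustrative sketch (not taken from the paper): MAD atomics are a hardware mechanism built on the coherence protocol, but the deadlock-avoidance idea the abstract names, a predetermined locking order, has a familiar software analogue. The Python below is only that analogue, under stated assumptions: the `Cell` class, the `address` field used as the ordering key, and the `multi_address_rmw`, `transfer`, and `run_demo` helpers are all hypothetical, and the cells passed in are assumed to be distinct.

```python
import threading


class Cell:
    """A memory location guarded by its own lock (hypothetical stand-in for a cacheline)."""

    def __init__(self, address, value):
        self.address = address
        self.value = value
        self.lock = threading.Lock()


def multi_address_rmw(cells, update):
    """Run update() while holding the locks of all (distinct) cells.

    Locks are always acquired in ascending address order, so two threads
    locking overlapping sets can never wait on each other in a cycle.
    """
    ordered = sorted(cells, key=lambda c: c.address)
    for cell in ordered:
        cell.lock.acquire()
    try:
        update()
    finally:
        for cell in reversed(ordered):
            cell.lock.release()


def transfer(src, dst, amount):
    """Atomically move `amount` from src to dst (a two-address read-modify-write)."""
    def update():
        src.value -= amount
        dst.value += amount
    multi_address_rmw([src, dst], update)


def run_demo(iterations=1000):
    """Two threads transfer in opposite directions; the fixed order avoids deadlock."""
    a, b = Cell(0x100, 1000), Cell(0x200, 1000)
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(iterations)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(iterations)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.value, b.value


if __name__ == "__main__":
    print(run_demo())
```

Without the sort, one thread could hold the lock for `a` while waiting for `b` and the other the reverse. With a single global order, both threads contend for the lower-addressed lock first, so one of them always makes progress.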

BibTeX:

                    @InProceedings{ejgomez-micro21,
                      author =       {Eduardo Jos{\'e} G{\'o}mez-Hern{\'a}ndez and Rub{\'e}n Titos-Gil and Juan Manuel Cebrian and Stefanos Kaxiras and Alberto Ros},
                      title =        {Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations},
                      booktitle =    {54th International Symposium on Microarchitecture (MICRO)},
                      doi =          {10.1145/3466752.3480073},
                      pages =        {337--349},
                      year =         {2021},
                      editor =       {},
                      address =      {Worldwide event},
                      month =        oct,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {21.86% (94/430)},
                      isbn =         {978-1-4503-8557-2},
                      url =          {http://webs.um.es/aros/papers/pdfs/ejgomez-micro21.pdf}
                    }

Josué Feliu, Alberto Ros, Manuel E. Acacio, Stefanos Kaxiras (2021), "ITSLF: Inter-Thread Store-to-Load Forwarding in Simultaneous Multithreading", International Symposium on Microarchitecture (MICRO).

Abstract: In this paper, we argue that, for a class of fine-grain, synchronization-intensive, parallel workloads, it is advantageous to consolidate synchronization and communication as much as possible among the threads of simultaneous multithreading (SMT) cores. While, today, the shared L1 is the closest coherent level where synchronization and communication between SMT threads can take place, we observe that there is an even closer shared level, entirely inside a single core. This level comprises the load queues (LQ) and store queues (SQ) / store buffers (SB) of the SMT threads and to the best of our knowledge it has never been used as such. The reason is that if we allow communication of different SMT threads via their LQs and SQs/SBs, i.e., inter-thread store-to-load forwarding (ITSLF), we violate write atomicity with respect to the outside world, beyond the acceptable model of read-own-write-early multiple-copy atomicity (rMCA). The key insight of our work is that we can accelerate synchronization and communication among SMT threads with inter-thread store-to-load forwarding, without affecting the memory model, in particular without violating rMCA. We demonstrate how we can achieve this entirely through speculative interactions between LQs and SQs/SBs of different threads, while ensuring deadlock-free execution. Without changing the architectural model, the ISA, or the software, and without adding extra hardware in the form of a specialized accelerator, our insight enables a new design point for a standard architecture. We demonstrate that with ITSLF, workloads scale better on a single 8-way SMT core (with the resources of a single-threaded core) than on a baseline SMT (with or without optimizations), or on 8 single-threaded cores.

BibTeX:

                    @InProceedings{jfeliu-micro21,
                      author =       {Josu{\'e} Feliu and Alberto Ros and Manuel E. Acacio and Stefanos Kaxiras},
                      title =        {ITSLF: Inter-Thread Store-to-Load Forwarding in Simultaneous Multithreading},
                      booktitle =    {54th International Symposium on Microarchitecture (MICRO)},
                      doi =          {10.1145/3466752.3480086},
                      pages =        {1296--1308},
                      year =         {2021},
                      editor =       {},
                      address =      {Worldwide event},
                      month =        oct,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {21.86% (94/430)},
                      isbn =         {978-1-4503-8557-2},
                      url =          {http://webs.um.es/aros/papers/pdfs/jfeliu-micro21.pdf}
                    }

Christos Sakalis, Zamshed Chowdhury, Shayne Wadle, Ismail Akturk, Alberto Ros, Magnus Själander, Stefanos Kaxiras, Ulya R. Karpuzcu (2021), "Do Not Predict - Recompute! How Value Recomputation Can Truly Boost the Performance of Invisible Speculation", IEEE International Symposium on Secure and Private Execution Environment Design (SEED).

Abstract: Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the delay. However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by providing the value of loads that miss) without incurring its negative side-effect (delaying the release of precious resources), if we can safely, non-speculatively, recompute a value in isolation (without being seen from the outside), so that we do not expose any information by transferring such a value via the memory hierarchy. Value Recomputation, which trades computation for data transfer, was previously proposed in an entirely different context: to reduce energy-expensive data transfers in the memory hierarchy. In this paper, we demonstrate the potential of value recomputation in relation to the Delay-on-Miss approach of hiding speculation, discuss the trade-offs, and show that we can achieve the same level of security, reaching 93% of the unsecured baseline performance (5% higher than Delay-on-Miss), and exceeding (by 3%) what even an oracular (100% accuracy and coverage) value predictor could do.

BibTeX:

                    @InProceedings{csakalis-seed21,
                      author =       {Christos Sakalis and Zamshed Chowdhury and Shayne Wadle and Ismail Akturk and Alberto Ros and Magnus Sj{\"a}lander and Stefanos Kaxiras and Ulya R. Karpuzcu},
                      title =        {Do Not Predict - Recompute! How Value Recomputation Can Truly Boost the Performance of Invisible Speculation},
                      booktitle =    {1st IEEE International Symposium on Secure and Private Execution Environment Design (SEED)},
                      doi =          {10.1109/SEED51797.2021.00021},
                      pages =        {89--100},
                      year =         {2021},
                      editor =       {},
                      address =      {Worldwide event},
                      month =        sep,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {30.23% (13/43)},
                      isbn =         {978-1-6654-2026-6},
                      url =          {http://webs.um.es/aros/papers/pdfs/csakalis-seed21.pdf}
                    }

Alberto Ros (2021), "BL∪E: A Timely, IP-based Data Prefetcher", The 1st ML-Based Data Prefetching Competition. ML for Computer Architecture and Systems.

[Abstract] [BibTeX]

Abstract: High-performance prefetchers require not only predicting the future cache lines that will be requested but also when they will be requested. Timeliness is therefore an essential property for getting the maximum performance from a prefetcher. Bringing the cache line too early to cache can decrease the coverage of the prefetcher when such cache line is evicted before it is requested. On the other hand, prefetching the data too late can lead to late prefetches, and thus, sub-optimal performance. This paper presents BL∪E, a data prefetcher that predicts the prefetched cache lines based on timeliness. The prefetcher accounts for the time required to fetch a cache line and issues the prefetch request early enough, such that when it is accessed it will already be stored in cache. For each instruction pointer (or group of them) BL∪E i) correlates in a timely way the cache lines that have been requested and ii) infers their timely delta when the cache lines have not been accessed yet.

BibTeX:

                    @InProceedings{aros-mldpc21,
                      author = 	 {Alberto Ros},
                      title = 	 {BL$\cup$E: A Timely, IP-based Data Prefetcher},
                      booktitle =    {The 1st ML-Based Data Prefetching Competition. ML for Computer Architecture and Systems},
                      pages = 	 {},
                      year = 	 {2021},
                      editor = 	 {},
                      address = 	 {Worldwide event},
                      month = 	 jun,
                      publisher =    {},
                      ratio-acep =   {75.00% (3/4)},
                      isbn =         {},
                      url =          {http://webs.um.es/aros/papers/pdfs/aros-mldpc21.pdf}
                    }

Alberto Ros, Alexandra Jimborean (2021), "A Cost-Effective Entangling Prefetcher for Instructions", International Symposium on Computer Architecture (ISCA).

[Abstract] [BibTeX]

Abstract: Prefetching instructions in the instruction cache is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is essential, as bringing instructions too early increases the risk of the instructions being evicted from the cache before their use and requesting them too late can lead to the instructions arriving after they are demanded. Coverage is important to reduce the number of instruction cache misses and accuracy to ensure that the prefetcher does not pollute the cache or interact negatively with the other hardware mechanisms. This paper presents the Entangling Prefetcher for Instructions that entangles instructions to maximize timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that with 40KB of storage, Entangling can increase performance up to 23%, outperforming state-of-the-art prefetchers.

BibTeX:

                    @InProceedings{aros-isca21,
                      author = 	 {Alberto Ros and Alexandra Jimborean},
                      title = 	 {A Cost-Effective Entangling Prefetcher for Instructions},
                      booktitle =    {48th International Symposium on Computer Architecture (ISCA)},
                      doi =          {10.1109/ISCA52012.2021.00017},
                      pages = 	 {99--111},
                      year = 	 {2021},
                      editor = 	 {},
                      address = 	 {Worldwide event},
                      month = 	 jun,
                      publisher =    {Association for Computing Machinery (ACM)},
                      ratio-acep =   {18.72% (76/406)},
                      isbn =         {978-1-6654-3333-4},
                      issn =         {1063-6897},
                      url =          {http://webs.um.es/aros/papers/pdfs/aros-isca21.pdf}
                    }

Eduardo José Gómez-Hernández, Ruixiang Shao, Christos Sakalis, Stefanos Kaxiras, Alberto Ros (2021), "Splash-4: Improving Scalability with Lock-Free Constructs", International Symposium on Performance Analysis of Systems and Software (ISPASS).

[Abstract] [BibTeX]

Abstract: Over the past three decades, the parallel applications of the Splash-2 benchmark suite have been instrumental in advancing multiprocessor research. Recently, the Splash-3 benchmarks eliminated performance bugs, data races, and improper synchronization that plagued Splash-2 benchmarks after the definition of the C memory model. In this work, we revisit the Splash-3 benchmarks and adapt them for contemporary architectures with atomic operations and lock-free constructs. With our changes, we improve the scalability of most benchmarks for up to 32 and 64 cores, showing an improvement of up to 9x in actual machines, and up to 5x in simulation, over the unmodified Splash-3 benchmarks. To denote the substantive nature of the improvements in the Splash-3 benchmarks and to re-introduce them in contemporary research, we refer to the new collection as Splash-4.

BibTeX:

                    @InProceedings{ejgomez-ispass21,
                      author = 	 {Eduardo Jos{\'e} G{\'o}mez-Hern{\'a}ndez and Ruixiang Shao and Christos Sakalis and Stefanos Kaxiras and Alberto Ros},
                      title = 	 {Splash-4: Improving Scalability with Lock-Free Constructs},
                      booktitle =    {International Symposium on Performance Analysis of Systems and Software (ISPASS)},
                      doi =          {10.1109/ISPASS51385.2021.00044},
                      editor =       {IEEE Computer Society},
                      pages = 	 {235--236},
                      year = 	 {2021},
                      address = 	 {Worldwide event},
                      month = 	 mar,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {36.92% (24/65)},
                      isbn =         {978-1-7281-8643-6},
                      url =          {http://webs.um.es/aros/papers/pdfs/ejgomez-ispass21.pdf}
                    }

Per Ekemark, Yuan Yao, Alberto Ros, Konstantinos Sagonas, Stefanos Kaxiras (2021), "TSOPER: Efficient Coherence-Based Strict Persistency", High Performance Computer Architecture (HPCA).

[Abstract] [BibTeX]

Abstract: We propose a novel approach for hardware-based strict TSO persistency, called TSOPER. We allow a TSO persistency model to freely coalesce values in the caches, by forming atomic groups of cachelines to be persisted. A group persist is initiated for an atomic group if any of its newly written values are exposed to the outside world. A key difference with prior work is that our architecture is based on the concept of a TSO persist buffer, that sits in parallel to the shared LLC, and persists atomic groups directly from private caches to NVM, bypassing the coherence serialization of the LLC. To impose dependencies among atomic groups that are persisted from the private caches to the TSO persist buffer, we introduce a sharing-list coherence protocol that naturally captures the order of coherence operations in its sharing lists, and thus can reconstruct the dependencies among different atomic groups entirely at the private cache level without involving the shared LLC. The combination of the sharing-list coherence and the TSO persist buffer allows persist operations and writes to non-volatile memory to happen in the background and trail the coherence operations. Coherence runs ahead at full speed; persistency follows belatedly. Our evaluation shows that TSOPER provides the same level of reordering as a program-driven relaxed model, hence, approximately the same level of performance, albeit without needing the programmer or compiler to be concerned about false sharing, data-race-free semantics, etc., and guaranteeing all software that can run on top of TSO, automatically persists in TSO.

BibTeX:

                                    @InProceedings{pekemark-hpca21,
                                      author =       {Per Ekemark and Yuan Yao and Alberto Ros and Konstantinos Sagonas and Stefanos Kaxiras},
                                      title =        {TSOPER: Efficient Coherence-Based Strict Persistency},
                                      booktitle =    {27th Symposium on High Performance Computer Architecture (HPCA)},
                                      doi =          {10.1109/HPCA51647.2021.00021},
                                      pages =        {125--138},
                                      year =         {2021},
                                      editor =       {},
                                      address =      {Worldwide event},
                                      month =        feb,
                                      publisher =    {IEEE Computer Society},
                                      ratio-acep =   {24.42% (63/258)},
                                      isbn =         {978-1-6654-4670-9},
                                      issn =         {1530-0897},
                                      url =          {http://webs.um.es/aros/papers/pdfs/pekemark-hpca21.pdf}
                                    }

Christos Sakalis, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean, Magnus Själander (2020), "Understanding Selective Delay as a Method for Efficient Secure Speculative Execution", IEEE Transactions on Computers (TC).

[Abstract] [BibTeX]

Abstract: Since the introduction of Meltdown and Spectre, the research community has been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need to be general in nature, covering a wide array of system components, while at the same time keeping the performance, energy, area, and implementation complexity costs at a minimum. One such solution is our own delay-on-miss, which efficiently protects the memory hierarchy by i) selectively delaying speculative load instructions and ii) utilizing value prediction as an invisible form of speculation. In this work we dive deeper into delay-on-miss, offering insights into why and how it affects the performance of the system. We also reevaluate value prediction as an invisible form of speculation. Specifically, we focus on the implications that delaying memory loads has on the memory-level parallelism of the system and how this affects the value predictor and the overall performance of the system. We present new, updated results but more importantly, we also offer deeper insight into why delay-on-miss works so well and what this means for the future of secure speculative execution.

BibTeX:

                                    @Article{csakalis-tc20,
                                      author = 	 {Christos Sakalis and Stefanos Kaxiras and Alberto Ros and Alexandra Jimborean and Magnus Sj{\"a}lander},
                                      title = 	 {Understanding Selective Delay as a Method for Efficient Secure Speculative Execution},
                                      journal = 	 {IEEE Transactions on Computers (TC)},
                                      doi =          {10.1109/TC.2020.3014456},
                                      publisher =    {IEEE Computer Society},
                                      address =      {},
                                      year = 	 {2020},
                                      volume = 	 {69},
                                      number = 	 {11},
                                      issn =         {0018-9340},
                                      pages = 	 {1584--1595},
                                      month = 	 nov,
                                      impactfactor = {2.711, 19/53 (Q2) - COMPUTER SCIENCE, HARDWARE & ARCHITECTURE (2019)},
                                      url =          {http://webs.um.es/aros/papers/pdfs/csakalis-tc20.pdf}
                                    }

Alberto Ros and Stefanos Kaxiras (2020), "Speculative Enforcement of Store Atomicity", In International Symposium on Microarchitecture (MICRO).

[Abstract] [BibTeX]

Abstract: Various memory consistency model implementations (e.g., x86, SPARC) willfully allow a core to see its own stores while they are in limbo, i.e., executed (and perhaps retired) but not yet inserted in memory order. This is known as store-to-load forwarding and it is a necessity to safeguard the local thread's sequential program semantics while achieving high performance. However, this can lead to counter-intuitive behaviours, requiring fences to prevent such behaviours when needed. Other vendors (e.g., IBM 370 and the z/Architecture series) opt for enforcing what we call in this work store atomicity, that is, disallowing a core to see its own stores before they are written to memory, trading off performance for a more intuitive memory model. Ideally, we want a stricter model to ease programmability at the same time that architects can provide high-performance solutions. We make a simple observation. What holds for any other rule in a consistency model, also holds for store atomicity: it is not a crime to break the rule, unless we get caught. In this work, we detail the different ways of detecting a store atomicity violation. This leads us to a new insight: a load performed by a forwarding from an in-limbo store is not speculative; younger loads performed after that forwarding are. Based on this insight we propose an effective and cheap speculative approach to dynamically enforce store atomicity only when the detection of its violation actually occurs. In practice, these cases are rare during the execution of a program. In all other cases (the bulk of the execution of a program) store-to-load forwarding can be done without violating store atomicity.
The end result is that we provide the best of both worlds: a more intuitive store-atomic memory model, i.e., the 370 model, with the performance and cost approaching (at an average of just 2.5% and 2.7% overhead for parallel and sequential applications, respectively) that of a non-store-atomic model, i.e., the x86 model.

BibTeX:

                            @inproceedings{aros-micro20,
                              author = {Alberto Ros and Stefanos Kaxiras},
                              title = {Speculative Enforcement of Store Atomicity},
                              booktitle = {International Symposium on Microarchitecture (MICRO)},
                              year = {2020}
                            }

Cebrian JM, Kaxiras S and Ros A (2020), "Boosting Store Buffer Efficiency with Store-Prefetch Bursts", In International Symposium on Microarchitecture (MICRO).

[Abstract] [BibTeX]

Abstract: Virtually all processors today employ a store buffer (SB) to hide store latency. However, when the store buffer is full, store latency is exposed to the processor causing pipeline stalls. The default strategies to mitigate these stalls are to issue prefetch for ownership requests when store instructions commit and to continuously increase the store buffer size. While these strategies considerably increase memory-level parallelism for stores, there are still applications that suffer deeply from stalls caused by the store buffer. Even worse, store-buffer induced stalls increase considerably when simultaneous multi-threading is enabled, as the store buffer is statically partitioned among the threads.
In this paper, we propose a highly selective and very aggressive prefetching strategy to minimize store-buffer induced stalls. Our proposal, Store-Prefetch Burst (SPB), is based on the following insights: i) the majority of store-buffer induced stalls are caused by a few stores; ii) the access pattern of such stores is easily predictable; and iii) the latency of the stores is not commonly hidden by standard cache prefetchers, as hiding their latency would require tremendous prefetch aggressiveness. SPB accurately detects contiguous store-access patterns (requiring just 67 bits of storage) and prefetches the remaining memory blocks of the accessed page in a single burst request to the L1 controller. SPB matches the performance of a 1024-entry SB implementation on a 56-entry SB (i.e., Skylake architecture). For a 14-entry SB (e.g., running four logical cores), it achieves 95.0% of that ideal performance, on average, for SPEC CPU 2017. Additionally, a 20-entry store buffer that incorporates SPB achieves the average performance of a standard 56-entry store buffer.

BibTeX:

                            @inproceedings{jcebrian-micro20,
                              author = {Cebrian, Juan M and Kaxiras, Stefanos and Ros, Alberto},
                              title = {Boosting Store Buffer Efficiency with Store-Prefetch Bursts},
                              booktitle = {International Symposium on Microarchitecture (MICRO)},
                              year = {2020}
                            }

Ros A and Jimborean A (2020), "The Entangling Instruction Prefetcher", IEEE Computer Architecture Letters. Vol. 19(2), pp. 84-87.

[Abstract] [BibTeX] [DOI]

Abstract: Prefetching instructions is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is an essential property, as bringing instructions too early increases the risk of the instructions being evicted from the cache before their use while requesting them too late can lead to the instructions arriving past their designated execution time. Coverage is important to reduce the number of instruction cache misses (there is enough prefetching), and accuracy to ensure that the prefetcher does not pollute the cache or interact negatively with the other hardware mechanisms (there is not too much prefetching). This letter presents the Entangling instruction prefetcher that entangles instructions to provide timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that the Entangling I-prefetcher increases performance by 29.3 percent on average, with a coverage of 94.9 percent and accuracy of 77.4 percent.

BibTeX:

                            @article{aros-cal20,
                              author = {Ros, Alberto and Jimborean, Alexandra},
                              title = {The Entangling Instruction Prefetcher},
                              journal = {IEEE Computer Architecture Letters},
                              year = {2020},
                              volume = {19},
                              number = {2},
                              pages = {84--87},
                              doi = {10.1109/LCA.2020.3002947}
                            }

Ros A and Jimborean A (2020), "The Entangling Instruction Prefetcher", The 1st Instruction Prefetching Championship. Worldwide event Vol. 19

[Abstract] [BibTeX] [DOI]

BibTeX:

                            @article{aros-ipc20,
                              author = {Ros, Alberto and Jimborean, Alexandra},
                              title = {The Entangling Instruction Prefetcher},
                              journal = {The 1st Instruction Prefetching Championship},
                              year = {2020},
                              volume = {19},
                              doi = {10.1109/LCA.2020.3002947}
                            }

Singh S, Jimborean A and Ros A (2020), "Regional out-of-order writes in total store order", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. (Section 4), pp. 205-216.

[Abstract] [BibTeX] [DOI]

Abstract: The store buffer, an essential component in today's processors, is designed to hide memory latency by moving stores off the processor's critical path. Furthermore, under the Total Store Order (TSO) memory model, the store buffer ensures the in-order retirement of stores. Problems arise when the store buffer is full or, under TSO, when the leading store encounters a cache miss, which blocks all subsequent stores and incurs severe performance bottlenecks. This work presents a software-hardware co-designed approach to cope with this bottleneck for processors with strong consistency guarantees. Our proposal is driven by the insight that store operations can be reordered if their reordering does not change the observable program behavior. The compiler delineates safe regions within which stores can be shuffled while still delivering the same observable behavior as if they performed in program order and unsafe regions within which stores must be kept in program order. This is leveraged by a novel dual-mode store buffer that switches between the out-of-order and in-order execution of stores within the safe and respectively unsafe regions. Correctness is preserved through well-placed fences inserted by the compiler, which impede the execution of stores from the following regions until all stores of the current region complete. Our dual-mode store buffer only requires one extra bit per entry, significantly decreases processor stall cycles, and brings 8.13% performance improvements compared to a mainstream store buffer.

BibTeX:

                            @article{ssingh-pact20,
                              author = {Singh, Sawan and Jimborean, Alexandra and Ros, Alberto},
                              title = {Regional out-of-order writes in total store order},
                              journal = {Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT},
                              year = {2020},
                              number = {Section 4},
                              pages = {205--216},
                              doi = {10.1145/3410463.3414645}
                            }

Titos-Gil R, Fernandez-Pascual R, Ros A and Acacio ME (2020), "Concurrent Irrevocability in Best-Effort Hardware Transactional Memory", IEEE Transactions on Parallel and Distributed Systems. Vol. 31(6), pp. 1301-1315.

[Abstract] [BibTeX] [DOI]

Abstract: Existing best-effort requester-wins implementations of transactional memory must resort to non-speculative execution to provide forward progress in the presence of transactions that exceed hardware capacity, experience page faults or suffer high-contention leading to livelocks. Current approaches to irrevocability employ lock-based synchronization to achieve mutual exclusion when executing a transaction non-speculatively, conservatively precluding concurrency with any other transactions in order to guarantee atomicity at the cost of degrading performance. In this article, we propose a new form of concurrent irrevocability whose goal is to minimize the loss of concurrency paid when transactions resort to irrevocability to complete. By enabling optimistic concurrency control also during non-speculative execution of a transaction, our proposal allows for higher parallelism than existing schemes. We describe the extensions to the instruction set to provide concurrent irrevocable transactions as well as the architectural extensions required to realize them on a best-effort HTM system without requiring any modification to the cache coherence protocol. Our evaluation shows that our proposal achieves an average reduction of 12.5 percent in execution time across the STAMP benchmarks, with 15.8 percent on average for highly contended workloads.

BibTeX:

                            @article{rtitos-tpds20,
                              author = {Titos-Gil, Ruben and Fernandez-Pascual, Ricardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {Concurrent Irrevocability in Best-Effort Hardware Transactional Memory},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2020},
                              volume = {31},
                              number = {6},
                              pages = {1301--1315},
                              doi = {10.1109/TPDS.2019.2963030}
                            }

Titos-Gil R, Fernández-Pascual R, Ros A and Acacio ME (2020), "PfTouch: Concurrent page-fault handling for Intel restricted transactional memory", Journal of Parallel and Distributed Computing. Vol. 145, pp. 111-123.

[Abstract] [BibTeX] [DOI]

Abstract: Page faults occurring within transactions jeopardize concurrency in Intel Restricted Transactional Memory (RTM). To make progress in spite of page-fault-induced aborts, the program must resort to the non-speculative fallback path and re-execute the affected transaction. Since the atomicity of a non-speculative transaction is guaranteed by impeding the execution of any other speculative transactions until the former completes, taking the fallback path is particularly harmful for performance. Therefore, such page-fault-induced aborts currently lead to thread serialization during the potentially long period of time taken to resolve them. In this work we propose PfTouch, a simple extension to RTM that allows page-fault handling to be moved out of non-speculative transactional execution in mutual exclusion. Our proposal sidesteps taking the fallback path in these cases and thus avoids its associated performance loss, by triggering page faults in the abort handler while other speculative transactions can run concurrently. PfTouch requires minimal modifications in the Intel RTM specification and keeps the OS unaltered. Through full-system simulation, we show that PfTouch achieves average reductions in execution time of 7.7% (up to 24.4%) for the STAMP benchmarks, closely matching the performance of the more complex suspended transactional mode in the IBM Power ISA.

BibTeX:

                            @article{rtitos-jpdc20,
                              author = {Titos-Gil, Rubén and Fernández-Pascual, Ricardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {PfTouch: Concurrent page-fault handling for Intel restricted transactional memory},
                              journal = {Journal of Parallel and Distributed Computing},
                              year = {2020},
                              volume = {145},
                              pages = {111--123},
                              doi = {10.1016/j.jpdc.2020.06.009}
                            }

Tran KA, Sakalis C, Själander M, Ros A, Kaxiras S and Jimborean A (2020), "Clearing the shadows: Recovering lost performance for invisible speculative execution through HW/SW Co-design", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. (iii), pp. 241-254.

[Abstract] [BibTeX] [DOI]

Abstract: Out-of-order processors heavily rely on speculation to achieve high performance, allowing instructions to bypass other slower instructions in order to fully utilize the processor's resources. Speculatively executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect the micro-architectural behavior of the system. Until recently, these changes were considered to be safe but with the discovery of new security attacks that misuse speculative execution to leak secret information through observable micro-architectural changes (so-called side-channels), this is no longer the case. To solve this issue, a wave of software and hardware mitigations have been proposed, the majority of which delay and/or hide speculative execution until it is deemed to be safe, trading performance for security. These newly enforced restrictions change how speculation is applied and where the performance bottlenecks appear, forcing us to rethink how we design and optimize both the hardware and the software. We observe that many of the state-of-the-art hardware solutions targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the causes of loads' unsafety, generally caused by control and memory dependence speculation. As a result, we manage to make more loads safe to execute at an early stage, which enables us to schedule more loads at a time to overlap their delays and improve performance. We apply our techniques on the state-of-the-art Delay-on-Miss hardware defense and show that we reduce the performance gap to the unsafe baseline by 53% (on average).

BibTeX:

                            @article{ktran-pact20,
                              author = {Tran, Kim Anh and Sakalis, Christos and Själander, Magnus and Ros, Alberto and Kaxiras, Stefanos and Jimborean, Alexandra},
                              title = {Clearing the shadows: Recovering lost performance for invisible speculative execution through HW/SW Co-design},
                              journal = {Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT},
                              year = {2020},
                              number = {iii},
                              pages = {241--254},
                              doi = {10.1145/3410463.3414640}
                            }

Upadhyay BR, Ros A and Ns M (2020), "TLB-based block-grain classification of private data", Proceedings - 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2020, pp. 122-130.

[Abstract] [BibTeX] [DOI]

Abstract: Sequential and parallel applications use most of the data as private in a multi-core system. Recent proposals made use of this observation to reduce the area of the coherence directories or the memory access latency. The driving force of these proposals is the classification of private/shared memory data. The effectiveness of these proposals depends on the number of detected private data. The existing proposals perform the private/shared classification at page granularity, leading to a noticeable amount of misclassified memory blocks. We propose a mechanism that works on block granularity using the translation lookaside buffer (TLB) to make accurate detection of private data, which increases the effectiveness of proposals relying on a private/shared classification. Simulation results show that the block-grain approach obtains 17.0% more accessed private miss data than the page-grain approach, which translates to an improvement in system performance by 6.02% compared to a page-grain approach.

BibTeX:

                            @article{bupadhyay-pdp20,
                              author = {Upadhyay, Bhargavi R. and Ros, Alberto and Ns, Murty},
                              title = {TLB-based block-grain classification of private data},
                              journal = {Proceedings - 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2020},
                              year = {2020},
                              pages = {122--130},
                              doi = {10.1109/PDP50117.2020.00025}
                            }
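
The gap between page-grain and block-grain classification described in the abstract above can be illustrated with a small offline sketch. This is not the paper's TLB-based mechanism: the page size, block size, trace format, and the names `classify` and `private_blocks` are assumptions made for illustration only.

```python
from collections import defaultdict

PAGE_SIZE = 4096   # bytes; assumed for illustration
BLOCK_SIZE = 64    # bytes; assumed for illustration


def classify(trace, granule):
    """Map each granule (address // granule) to the set of cores that touched it."""
    sharers = defaultdict(set)
    for core, addr in trace:
        sharers[addr // granule].add(core)
    return sharers


def private_blocks(trace):
    """Return (blocks seen as private at page grain, blocks private at block grain)."""
    by_page = classify(trace, PAGE_SIZE)
    by_block = classify(trace, BLOCK_SIZE)
    block_grain = {b for b, cores in by_block.items() if len(cores) == 1}
    # At page grain, a block counts as private only if its whole page has one sharer.
    page_grain = {
        b for b in by_block
        if len(by_page[(b * BLOCK_SIZE) // PAGE_SIZE]) == 1
    }
    return page_grain, block_grain


# Hypothetical trace of (core id, byte address) pairs.
trace = [(0, 0), (0, 64), (1, 128), (1, 4096), (1, 4160)]
page_grain, block_grain = private_blocks(trace)
print(len(page_grain), len(block_grain))
```

In this toy trace every block is touched by a single core, yet the three blocks in the first page are treated as shared at page grain because two different cores touch that page.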

Alves R, Ros A, Black-Schaffer D and Kaxiras S (2019), "Filter caching for free: The untapped potential of the store-buffer", Proceedings - International Symposium on Computer Architecture, pp. 436-448.

[Abstract] [BibTeX] [DOI]

Abstract: Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic hit rate or whether a store-buffer hit can be predicted to save energy by disabling the L1 and TLB probes. In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries between the store-queue/buffer/cache, we can achieve nearly optimal reuse, without causing stalls. We are able to do this efficiently and cheaply by recognizing key properties of stores: free caching (since they must be written into the store-buffer for correctness we need no additional data movement), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling). As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvements of 1.5%, up to 4.7%). The cost for these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.

BibTeX:

                            @article{ralves-isca19,
                              author = {Alves, Ricardo and Ros, Alberto and Black-Schaffer, David and Kaxiras, Stefanos},
                              title = {Filter caching for free: The untapped potential of the store-buffer},
                              journal = {Proceedings - International Symposium on Computer Architecture},
                              year = {2019},
                              pages = {436--448},
                              doi = {10.1145/3307650.3322269}
                            }
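
The storage cost quoted in the abstract above (1 bit per line, about 0.2% of L1 capacity) can be checked with a line of arithmetic. The 64-byte line size and the name `extra_bit_overhead` are assumptions; the abstract does not state the line size, and tag and state bits are ignored here.

```python
LINE_BYTES = 64  # assumed cache line size; not stated in the abstract


def extra_bit_overhead(line_bytes, extra_bits=1):
    """Fraction of data capacity added by extra_bits per cache line."""
    return extra_bits / (line_bytes * 8)


print(f"{extra_bit_overhead(LINE_BYTES):.3%}")
```

Under that assumption, one extra bit per 512 data bits rounds to the 0.2% figure.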

Ros A (2019), "Berti: A Per-Page Best-Request-Time Delta Prefetcher", The 3rd Data Prefetching Championship. (1)

[Abstract] [BibTeX]

Abstract: Prefetching data blocks into the caches comprising the memory hierarchy is a fundamental technique for designing high-performance computers. In fact, current systems implement prefetchers at every cache level. Timeliness is an essential property for getting the maximum performance from the prefetcher, as bringing the data early to cache can increase its miss ratio and requesting the data too late can lead to sub-optimal performance. This paper presents Berti, a prefetcher that finds the delta that provides the best timeliness for memory blocks in each page. The prefetcher works in two modes: (i) on the first access to a block, in a certain period of time, the prefetcher issues a request for the next block according to the best delta found; (ii) for cold pages, a burst mechanism fetches blocks that cannot be reached adding the delta to the current accessed block.

BibTeX:

                            @article{aros-dpc19,
                              author = {Ros, Alberto},
                              title = {Berti: A Per-Page Best-Request-Time Delta Prefetcher},
                              journal = {The 3rd Data Prefetching Championship},
                              year = {2019},
                              number = {1}
                            }
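
The idea of choosing, per page, the delta with the best timeliness (see the abstract above) can be sketched roughly as follows. This is a toy model, not the Berti implementation: the fixed fetch latency, the unbounded history, the scoring rule, and the class name `PerPageDeltaSketch` are all assumptions, and the burst mechanism for cold pages is omitted.

```python
from collections import Counter, defaultdict


class PerPageDeltaSketch:
    """Toy per-page delta selector; not the Berti implementation."""

    def __init__(self, latency):
        self.latency = latency              # assumed fixed fetch latency, in cycles
        self.history = defaultdict(list)    # page -> [(block offset, access time)]
        self.scores = defaultdict(Counter)  # page -> delta -> timely count

    def access(self, page, offset, now):
        # A delta is timely if a prefetch issued at the earlier access
        # would have arrived before this access needs the block.
        for old_offset, old_time in self.history[page]:
            delta = offset - old_offset
            if delta != 0 and now - old_time >= self.latency:
                self.scores[page][delta] += 1
        self.history[page].append((offset, now))

    def best_delta(self, page):
        """Return the delta with the most timely hits, or None for a cold page."""
        if not self.scores[page]:
            return None
        return self.scores[page].most_common(1)[0][0]

    def prefetch_target(self, page, offset):
        """Block offset to prefetch for this access, or None if no delta is known."""
        delta = self.best_delta(page)
        return None if delta is None else offset + delta


# Hypothetical stream: offsets 0..9 in page 7, one access every 10 cycles.
p = PerPageDeltaSketch(latency=30)
for i, t in enumerate(range(0, 100, 10)):
    p.access(page=7, offset=i, now=t)
print(p.best_delta(7))
```

With a 30-cycle latency and one access every 10 cycles, deltas of 1 and 2 would arrive too late, so the smallest timely delta wins in this sketch.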

Sakalis C, Alipour M, Ros A, Jimborean A, Kaxiras S and Själander M (2019), "Ghost Loads: What is the Cost of Invisible Speculation?", ACM International Conference on Computing Frontiers 2019, CF 2019 - Proceedings, pp. 153-163.

[Abstract] [BibTeX] [DOI]

Abstract: Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectural state is restored to assure functional correctness but a multitude of microarchitectural changes (e.g., cache updates), caused by the speculatively executed instructions, are commonly left in the system. These changes can be used to leak sensitive information, which has led to a frantic search for solutions that can eliminate such security flaws. The contribution of this work is an evaluation of the cost of hiding speculative side-effects in the cache hierarchy, making them visible only after the speculation has been resolved. For this, we compare (for the first time) two broad approaches: i) waiting for loads to become non-speculative before issuing them to the memory system, and ii) eliminating the side-effects of speculation, a solution consisting of invisible loads (Ghost loads) and performance optimizations (Ghost Buffer and Materialization). While previous work, InvisiSpec, has proposed a similar solution to our latter approach, it has done so with only a minimal evaluation and at a significant performance cost. The detailed evaluation of our solutions shows that: i) waiting for loads to become non-speculative is no more costly than the previously proposed InvisiSpec solution, albeit much simpler, non-invasive in the memory system, and stronger security-wise; ii) hiding speculation with Ghost loads (in the context of a relaxed memory model) can be achieved at the cost of 12% performance degradation and 9% energy increase, which is significantly better than the previous state-of-the-art solution.

BibTeX:

                            @article{csakalis-cf19,
                              author = {Sakalis, Christos and Alipour, Mehdi and Ros, Alberto and Jimborean, Alexandra and Kaxiras, Stefanos and Själander, Magnus},
                              title = {Ghost Loads: What is the Cost of Invisible Speculation?},
                              journal = {ACM International Conference on Computing Frontiers 2019, CF 2019 - Proceedings},
                              year = {2019},
                              pages = {153--163},
                              doi = {10.1145/3310273.3321558}
                            }

Sakalis C, Kaxiras S, Ros A, Jimborean A and Själander M (2019), "Efficient invisible speculative execution through selective delay and value prediction", Proceedings - International Symposium on Computer Architecture, pp. 723-735.

[Abstract] [BibTeX] [DOI]

Abstract: Speculative execution, the base on which modern high-performance general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: During speculative execution, the architectural state of the system is kept unmodified, until the speculation can be verified. In the event that a misspeculation occurs, anything that can affect the architectural state is reverted (squashed) and re-executed correctly. However, the same is not true for the microarchitectural state. Normally invisible to the user, changes to the microarchitectural state can be observed through various side-channels, with timing differences caused by the memory hierarchy being one of the most common and easy to exploit. The speculative side-channels can then be exploited to perform attacks that can bypass software and hardware checks in order to leak information. These attacks, out of which the most infamous are perhaps Spectre and Meltdown, have led to a frantic search for solutions. In this work, we present our own solution for reducing the microarchitectural state-changes caused by speculative execution in the memory hierarchy. It is based on the observation that if we only allow accesses that hit in the L1 data cache to proceed, then we can easily hide any microarchitectural changes until after the speculation has been verified. At the same time, we propose to prevent stalls by value predicting the loads that miss in the L1. Value prediction, though speculative, constitutes an invisible form of speculation, not seen outside the core. We evaluate our solution and show that we can prevent observable microarchitectural changes in the memory hierarchy while keeping the performance and energy costs at 11% and 7%, respectively. In comparison, the current state-of-the-art solution, InvisiSpec, incurs a 46% performance loss and a 51% energy increase.

BibTeX:

                            @article{csakalis-isca19,
                              author = {Sakalis, Christos and Kaxiras, Stefanos and Ros, Alberto and Jimborean, Alexandra and Själander, Magnus},
                              title = {Efficient invisible speculative execution through selective delay and value prediction},
                              journal = {Proceedings - International Symposium on Computer Architecture},
                              year = {2019},
                              pages = {723--735},
                              doi = {10.1145/3307650.3322216}
                            }

Titos-Gil R, Flores A, Fernandez-Pascual R, Ros A, Petit S, Sahuquillo J and Acacio ME (2019), "Way Combination for an Adaptive and Scalable Coherence Directory", IEEE Transactions on Parallel and Distributed Systems. Vol. 30(11), pp. 2608-2623.

[Abstract] [BibTeX] [DOI]

Abstract: This manuscript opens the way to a new class of coherence directory structures that are based on the brand-new concept of way combining. A Way-Combining Directory (WC-dir) builds on a typical sparse directory but allows to take advantage of several ways in the same set to codify the sharing information of each memory block. The result is a sparse directory with variable effective associativity per set and variable length entries, thus being able to dynamically adapt the directory structure to the particular requirements of each application. In particular, our proposal uses just enough bits per entry to store a single pointer, which is optimal for the common case of having just one sharer. For those addresses that have more than one sharer, we have observed that in the majority of cases extra bits could be taken from other empty ways in the same set. All in all, our proposal minimizes the storage overheads without losing the flexibility to adapt to several sharing degrees and without the complexities of other previously proposed techniques. Detailed simulations of a 128-core multicore architecture running benchmarks from PARSEC-3.0 and SPLASH-3 demonstrate that WC-dir can closely approach the performance of a non-scalable bit vector sparse directory, beating the state-of-the-art Scalable Coherence Directory (SCD) and Pool directory proposals.

BibTeX:

                            @article{rtitos-tpds19,
                              author = {Titos-Gil, Ruben and Flores, Antonio and Fernandez-Pascual, Ricardo and Ros, Alberto and Petit, Salvador and Sahuquillo, Julio and Acacio, Manuel E.},
                              title = {Way Combination for an Adaptive and Scalable Coherence Directory},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2019},
                              volume = {30},
                              number = {11},
                              pages = {2608--2623},
                              doi = {10.1109/TPDS.2019.2917185}
                            }

Abdulla PA, Atig MF, Kaxiras S, Leonardsson C, Ros A and Zhu Y (2018), "Mending fences with self-invalidation and self-downgrade", Logical Methods in Computer Science. Vol. 14(1), pp. 1-33.

[Abstract] [BibTeX] [DOI]

Abstract: Cache coherence protocols based on self-invalidation and self-downgrade have recently seen increased popularity due to their simplicity, potential performance efficiency, and low energy consumption. However, such protocols result in memory instruction reordering, thus causing extra program behaviors that are often not intended by the programmers. We propose a novel formal model that captures the semantics of programs running under such protocols, and features a set of fences that interact with the coherence layer. Using the model, we design an algorithm to analyze the reachability and check whether a program satisfies a given safety property with the current set of fences. We describe a method for insertion of optimal sets of fences that ensure correctness of the program under such protocols. The method relies on a counter-example guided fence insertion procedure. One feature of our method is that it can handle a variety of fences (with different costs). This diversity makes optimization more difficult since one has to optimize the total cost of the inserted fences, rather than just their number. To demonstrate the strength of our approach, we have implemented a prototype and run it on a wide range of examples and benchmarks. We have also, using simulation, evaluated the performance of the resulting fenced programs.

BibTeX:

                            @article{paabdulla-lmcs18,
                              author = {Abdulla, Parosh Aziz and Atig, Mohamed Faouzi and Kaxiras, Stefanos and Leonardsson, Carl and Ros, Alberto and Zhu, Yunyun},
                              title = {Mending fences with self-invalidation and self-downgrade},
                              journal = {Logical Methods in Computer Science},
                              year = {2018},
                              volume = {14},
                              number = {1},
                              pages = {1--33},
                              doi = {10.23638/LMCS-14(1:6)2018}
                            }

Abellán JL, Padierna E, Ros A and Acacio ME (2018), "Photonic-based express coherence notifications for many-core CMPs", Journal of Parallel and Distributed Computing. Vol. 113, pp. 179-194.

[Abstract] [BibTeX] [DOI]

Abstract: Directory-based coherence protocols (Directory) are considered the design of choice to provide maximum performance in coherence maintenance for shared-memory many-core CMPs, despite their large memory overhead. New solutions are emerging to achieve acceptable levels of on-chip area overhead and energy consumption such as optimized encoding of block sharers in Directory (e.g., SCD) or broadcast-based coherence (e.g., Hammer). In this work, we propose a novel and efficient solution for the cache coherence problem in many-core systems based on the co-design of the coherence protocol and the interconnection network. Particularly, we propose ECONO, a cache coherence protocol tailored to future many-core systems that relies on PhotoBNoC, a special lightweight dedicated silicon-photonic subnetwork for efficient delivery of the atomic broadcast coherence messages used by the protocol. Considering a simulated 256-core system, as compared with Hammer, we demonstrate that ECONO+PhotoBNoC reduces performance and energy consumption by an average of 34% and 32%, respectively. Additionally, our proposal lowers the area overhead entailed by SCD by 2×.

BibTeX:

                            @article{jlabellan-jpdc18,
                              author = {Abellán, José L. and Padierna, Eduardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {Photonic-based express coherence notifications for many-core CMPs},
                              journal = {Journal of Parallel and Distributed Computing},
                              year = {2018},
                              volume = {113},
                              pages = {179--194},
                              doi = {10.1016/j.jpdc.2017.11.015}
                            }

Esteve A, Ros A, Robles A and E G (2018), "TokenTLB + CUP : A Token-Based Page", IEEE Transactions on Parallel and Distributed Systems (TPDS), pp. 1-14.