Publications



To view the group's complete research catalog, including books, patents, theses, etc., please follow this link.

Eduardo José Gómez-Hernández, Juan M. Cebrian, Stefanos Kaxiras, Alberto Ros (2022), "Splash-4: A Modern Benchmark Suite with Lock-Free Constructs", International Symposium on Workload Characterization (IISWC).
Abstract: The cornerstone for the performance evaluation of computer systems is the benchmark suite. Among the many benchmark suites used in high-performance computing and multicore research, Splash-2 has been instrumental in advancing knowledge for both academia and industry. Published in 1995 and with over 5276 citations and counting, this benchmark suite is still in use to evaluate novel architectural proposals. Recently, the Splash-3 suite eliminates important performance bugs, data races, and improper synchronization that plagued Splash-2 benchmarks after the formal definition of the C memory model. However, keeping up with architectural changes while maintaining the same workloads and algorithms (for comparative purposes) is a real challenge. Benchmark suites can misrepresent the performance characteristics of a computer system if they do not reflect the available features of the hardware and architects may end up overestimating the impact of proposed techniques or underestimating others. In this work we introduce a revised version of Splash-3, designated Splash-4, that introduces modern programming techniques to improve scalability on contemporary hardware. We then characterize Splash-3 and Splash-4 in a state-of-the-art simulated architecture, Intel's Ice Lake with gem5-20 simulator, as well as a real contemporary hardware processor (AMD's EPYC 7002 series). Our evaluation shows that for a 64-thread execution Splash-4 reduces the normalized execution time by an average of 52% and 34% for AMD's EPYC and Intel's Ice Lake, respectively.
BibTeX:
                    @InProceedings{ejgomez-iiswc22,
                      author =       {Eduardo Jos{\'e} G{\'o}mez-Hern{\'a}ndez and Juan Manuel Cebrian and Stefanos Kaxiras and Alberto Ros},
                      title =        {Splash-4: A Modern Benchmark Suite with Lock-Free Constructs},
                      booktitle =    {2022 IEEE International Symposium on Workload Characterization (IISWC)},
                      doi =          {},
                      pages =        {},
                      year =         {2022},
                      editor =       {},
                      address =      {Austin, TX (USA)},
                      month =        nov,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {47.92% (23/48)},
                      isbn =         {},
                      url =          {http://webs.um.es/aros/papers/pdfs/ejgomez-iiswc22.pdf}
                    }
                    
                          
Agustín Navarro-Torres, Biswabandan Panda, J. Alastruey-Benedé, Pablo Ibáñez, Víctor Viñals-Yúfera, and Alberto Ros (2022), "Berti: an Accurate Local-Delta Data Prefetcher", International Symposium on Microarchitecture (MICRO).
Abstract: Data prefetching is a technique that plays a crucial role in modern high-performance processors by hiding long latency memory accesses. Several state-of-the-art hardware prefetchers exploit the concept of deltas, defined as the difference between the cache line addresses of two demand accesses. Existing delta prefetchers, such as best offset prefetching (BOP) and multi-lookahead prefetching (MLOP), train and predict future accesses based on global deltas. We observed that the use of global deltas results in missed opportunities to anticipate memory accesses. In this paper, we propose Berti, a first-level data cache prefetcher that selects the best local deltas, i.e., those that consider only demand accesses issued by the same instruction. Thanks to a high-confidence mechanism that precisely detects the timely local deltas with high coverage, Berti generates accurate prefetch requests. Then, it orchestrates the prefetch requests to the memory hierarchy, using the selected deltas. Our empirical results using ChampSim and SPEC CPU2017 and GAP workloads show that, with a storage overhead of just 2.55 KB, Berti improves performance by 8.5% compared to a baseline IP-stride and 3.5% compared to IPCP, a state-of-the-art prefetcher. Our evaluation also shows that Berti reduces dynamic energy at the memory hierarchy by 33.6% compared to IPCP, thanks to its high prefetch accuracy.
BibTeX:
                    @INPROCEEDINGS{9923806, 
                       author={Navarro-Torres, Agustín and Panda, Biswabandan and Alastruey-Benedé, Jesús and Ibáñez, Pablo and Viñals-Yúfera, Víctor and Ros, Alberto},  
                       booktitle={2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)},   
                       title={Berti: an Accurate Local-Delta Data Prefetcher},   
                       year={2022},  
                       volume={},  
                       number={},  
                       pages={975-991},  
                       doi={10.1109/MICRO56248.2022.00072}}
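
Note: the distinction the abstract draws between global and local deltas can be illustrated with a minimal sketch. It assumes a hypothetical access trace of (instruction pointer, cache line) pairs and only looks at consecutive accesses; it is not the Berti mechanism itself, which additionally selects timely deltas with a confidence scheme. The function names and the trace are made up for illustration.

```python
from collections import defaultdict


def global_deltas(trace):
    """Deltas between consecutive cache-line addresses, ignoring which
    instruction issued each access."""
    lines = [line for _, line in trace]
    return [b - a for a, b in zip(lines, lines[1:])]


def local_deltas(trace):
    """Deltas between consecutive cache-line addresses issued by the
    same instruction pointer (IP)."""
    last_line = {}              # most recent cache line seen per IP
    deltas = defaultdict(list)  # per-IP list of observed deltas
    for ip, line in trace:
        if ip in last_line:
            deltas[ip].append(line - last_line[ip])
        last_line[ip] = line
    return dict(deltas)


# Hypothetical trace: two interleaved instructions, each with its own stride.
trace = [
    (0x400, 100), (0x500, 900),
    (0x400, 102), (0x500, 897),
    (0x400, 104), (0x500, 894),
]

print(global_deltas(trace))  # irregular values caused by the interleaving
print(local_deltas(trace))   # one regular stride per instruction
```

In this toy trace the interleaving hides both strides from the global view, while the per-instruction view exposes a constant delta for each IP.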
                    
                          
Sawan Singh, Arthur Perais, Alexandra Jimborean, and Alberto Ros (2022), "Exploring Instruction Fusion Opportunities in General Purpose Processors", International Symposium on Microarchitecture (MICRO).
Abstract: The Complex Instruction Set Computer (CISC) paradigm has led to the introduction of instruction cracking in which an architectural instruction is divided into multiple microarchitectural instructions (u-ops). However, the dual concept, instruction fusion, is also prevalent in modern microarchitectures to maximize resource utilization. In essence, some architectural instructions are too complex to be executed as a unit, so they should be cracked, while others are too simple to waste resources on executing them as a unit, so they should be fused with others. In this paper, we focus on instruction fusion and explore opportunities for fusing additional instructions in a high-performance general purpose pipeline. We show that enabling fusion for common RISC-V idioms improves performance by 7%. Then, we determine experimentally that enabling fusion only for memory instructions achieves 86% of the potential of fusion in this particular case. Finally, we propose the Helios microarchitecture, able to fuse non-consecutive and non-contiguous memory instructions, and discuss microarchitectural changes required to do so efficiently while preserving correctness. Helios allows fusing an additional 5.5% of dynamic instructions, yielding a 14.2% performance uplift over no fusion (8.2% over baseline fusion).
BibTeX:
                    @INPROCEEDINGS{9923815,
                      author={Singh, Sawan and Perais, Arthur and Jimborean, Alexandra and Ros, Alberto},
                      booktitle={2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)}, 
                      title={Exploring Instruction Fusion Opportunities in General Purpose Processors}, 
                      year={2022},
                      volume={},
                      number={},
                      pages={199-212},
                      doi={10.1109/MICRO56248.2022.00026}}
                    
                          
Ashkan Asgharzadeh, Juan M. Cebrian, Arthur Perais, Stefanos Kaxiras, Alberto Ros (2022), "Free Atomics: Hardware Atomic Operations without Fences", International Symposium on Computer Architecture (ISCA).
Abstract: Atomic Read-Modify-Write (RMW) instructions are primitive synchronization operations implemented in hardware that provide the building blocks for higher-abstraction synchronization mechanisms to programmers. According to publicly available documentation, current x86 implementations serialize atomic RMW operations, i.e., the store buffer is drained before issuing atomic RMWs and subsequent memory operations are stalled until the atomic RMW commits. This serialization, carried out by memory fences, incurs a performance cost which is expected to increase with deeper pipelines. This work proposes Free atomics, a lightweight, speculative, deadlock-free implementation of atomic operations that removes the need for memory fences, thus improving performance, while preserving atomicity and consistency. Free atomics is, to the best of our knowledge, the first proposal to enable store-to-load forwarding for atomic RMWs. Free atomics only requires simple modifications and incurs a small area overhead (15 bytes). Our evaluation using gem5-20 shows that, for a 32-core configuration, Free atomics improves performance by 12.5%, on average, for a large range of parallel workloads and 25.2%, on average, for atomic-intensive parallel workloads over a fenced atomic RMW implementation.
BibTeX:
                    @InProceedings{aasgharzadeh-isca22,
                      author = 	 {Ashkan Asgharzadeh and Juan M. Cebrian and Arthur Perais and Stefanos Kaxiras and Alberto Ros},
                      title = 	 {Free Atomics: Hardware Atomic Operations without Fences},
                      booktitle =    {49th International Symposium on Computer Architecture (ISCA)},
                      doi =          {10.1145/3470496.3527385},
                      pages = 	 {},
                      year = 	 {2022},
                      editor = 	 {ACM},
                      address = 	 {New York, NY (USA)},
                      month = 	 jun,
                      publisher =    {Association for Computing Machinery (ACM)},
                      ratio-acep =   {16.75% (67/400)},
                      isbn =         {978-1-4503-8610-4},
                      issn =         {1063-6897},
                      url =          {http://webs.um.es/aros/papers/pdfs/aasgharzadeh-isca22.pdf}
                    }
                    
                          
Juan Manuel Cebrian, Thibaud Balem, Adrian Barredo, Marc Casas, Miquel Moreto, Alberto Ros, Alexandra Jimborean (2022), "Compiler-Assisted Compaction/Restoration of SIMD Instructions", IEEE Transactions on Parallel and Distributed Systems (TPDS).
Abstract: Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This paper proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements on scenarios when it would fail to form dense instructions. 
Simulated results (gem5) show that CACR improves performance by up to 29% and reduces dynamic energy by up to 24.2% on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6% performance and 14% energy improvements for the same configuration and applications.
BibTeX:
                    @Article{jcebrian-tpds22,
                      author = 	 {Juan Manuel Cebrian and Thibaud Balem and Adrian Barredo and Marc Casas and Miquel Moreto and Alberto Ros and Alexandra Jimborean},
                      title = 	 {Compiler-Assisted Compaction/Restoration of SIMD Instructions},
                      journal = 	 {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
                      doi =          {10.1109/TPDS.2021.3091015},
                      publisher =    {IEEE Computer Society},
                      address =      {},
                      year = 	 {2022},
                      volume = 	 {33},
                      number = 	 {4},
                      issn =         {1045-9219},
                      pages = 	 {779--791},
                      month = 	 apr,
                      impactfactor = {2.600, 29/108 (Q1) - COMPUTER SCIENCE, THEORY & METHODS (2019)},
                      url =          {http://webs.um.es/aros/papers/pdfs/jmcebrian-tpds22.pdf}
                    }
                    
                          
Víctor Nicolás-Conesa, Rubén Titos-Gil, Ricardo Fernández-Pascual, Alberto Ros, Manuel E. Acacio (2022), "Analysis of the Interactions Between ILP and TLP With Hardware Transactional Memory", Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP).
Abstract: Hardware Transactional Memory (HTM) allows the use of transactions by programmers, making parallel programming easier and theoretically obtaining the performance of fine-grained locks. However, transactions can abort for a variety of reasons, resulting in the squash of speculatively executed instructions and the consequent loss in both performance and energy efficiency. Among the different sources of abort, conflicting concurrent accesses to the same shared memory locations from different transactions are often the prevalent cause. In this work, we characterize, for the first time to the best of our knowledge, how the aggressiveness of the cores in terms of exploiting instruction-level parallelism can interact with thread-level speculation support brought by HTM systems. We observe that altering the size of the structures that support out-of-order and speculative execution changes the number of aborts produced in the execution of transactional workloads on a best-effort HTM implementation. Our results show that a small number of powerful cores is more suitable for high-contention scenarios, whereas under low contention it is preferable to use a larger number of less aggressive cores. In addition, an aggressive core can lead to performance loss in medium-contention scenarios due to an increase in the number of aborts. We conclude that depending on contention, a careful choice over processor aggressiveness can reduce abort ratios.
BibTeX:
                    @InProceedings{vnicolas-pdp22,
                      author = 	 {V{\'{\i}}ctor Nicol{\'a}s-Conesa and Rub{\'e}n Titos-Gil and Ricardo Fern{\'a}ndez-Pascual and Alberto Ros and Manuel E. Acacio},
                      title = 	 {Analysis of the Interactions Between ILP and TLP With Hardware Transactional Memory},
                      booktitle =    {23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)},
                      doi =          {10.1109/PDP55904.2022},
                      editor = 	 {IEEE Computer Society},
                      pages = 	 {157--164},
                      year = 	 {2022},
                      publisher =    {IEEE Computer Society},
                      address = 	 {Worldwide event},
                      month = 	 mar,
                      ratio-acep =   {37.84% (28/74)},
                      isbn =         {978-1-6654-6958-6},
                      url =          {http://webs.um.es/aros/papers/pdfs/vnicolas-pdp22.pdf}
                    }
                    
                          
Marina Shimchenko, Rubén Titos-Gil, Ricardo Fernández-Pascual, Manuel E. Acacio, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean (2022), "Analysing Software Prefetching Opportunities in Hardware Transactional Memory", Journal of Supercomputing (SUPE).
Abstract: Hardware Transactional Memory emerged to make parallel programming more accessible. However, the performance pitfall of this technique is squashing speculatively executed instructions and re-executing them in case of aborts, ultimately resorting to serialization in case of repeated conflicts. A significant fraction of aborts occur due to conflicts (concurrent reads and writes to the same memory location performed by different threads). Our proposal aims to reduce conflict aborts by reducing the window of time during which transactional regions can suffer conflicts. We achieve this by using software prefetching instructions inserted automatically at compile-time. Through these prefetch instructions, we intend to bring the necessary data for each transaction from the main memory to the cache before the transaction itself starts to execute, thus converting the otherwise long latency cache misses into hits during the execution of the transaction. The obtained results show that our approach decreases the number of aborts by 30% on average and improves performance by up to 19% and 10% for two out of the eight evaluated benchmarks. We provide insights into when our technique is beneficial given certain characteristics of the transactional regions, the advantages and disadvantages of our approach, and finally, discuss potential solutions to overcome some of its limitations.
BibTeX:
                    @Article{mshimchenko-supe22,
                      author = 	 {Marina Shimchenko and Rub{\'e}n Titos-Gil and Ricardo Fern{\'a}ndez-Pascual and Manuel E. Acacio and Stefanos Kaxiras and Alberto Ros and Alexandra Jimborean},
                      title = 	 {Analysing Software Prefetching Opportunities in Hardware Transactional Memory},
                      journal = 	 {Journal of Supercomputing (SUPE)},
                      doi =          {10.1007/s11227-021-03897-z},
                      publisher =    {Springer US},
                      address =      {},
                      year = 	 {2022},
                      volume = 	 {78},
                      number = 	 {1},
                      issn =         {0920-8542},
                      pages = 	 {919--944},
                      month = 	 jan,
                      impactfactor = {1.532, 44/103 (Q2) - COMPUTER SCIENCE, THEORY & METHODS (2017)},
                      url =          {http://webs.um.es/aros/papers/pdfs/mshimchenko-supe21.pdf}
                    }
                    
                          
Rubén Titos-Gil, Ricardo Fernández-Pascual, Manuel E. Acacio, Alberto Ros (2022), "DeTraS: Delaying Stores for Friendly-Fire Mitigation in Hardware Transactional Memory", IEEE Transactions on Parallel and Distributed Systems (TPDS).
Abstract: Commercial Hardware Transactional Memory (HTM) systems are best-effort designs that leverage the coherence substrate to detect conflicts eagerly. Resolving conflicts in favor of the requesting core is the simplest option for ensuring deadlock freedom, yet it is prone to livelocks. In this work, we propose and evaluate DeTraS (Delayed Transactional Stores), an HTM-aware store buffer design aimed at mitigating such livelocks. DeTraS takes advantage of the fact that modern commercial processors implement a large store buffer, and uses it to prevent transactional stores predicted to conflict from performing early in the transaction. By leveraging existing processor structures, we propose a simple design that improves the ability of requester-wins HTM systems to achieve forward progress in spite of high contention while side-stepping the performance penalty of falling back to mutual exclusion. With just over 50 extra bytes, DeTraS captures the advantages of lazy conflict management without the complexity brought into the coherence fabric by commit arbitration schemes nor the relaxation of the single-writer invariant of prior works. Through detailed simulations of a 16-core tiled CMP using gem5, we demonstrate that DeTraS brings reductions in average execution time of 25% when compared to an Intel RTM-like design.
BibTeX:
                    @Article{rtitos-tpds22,
                      author = 	 {Rub{\'e}n Titos-Gil and Ricardo Fern{\'a}ndez-Pascual and Manuel E. Acacio and Alberto Ros},
                      title = 	 {DeTraS: Delaying Stores for Friendly-Fire Mitigation in Hardware Transactional Memory},
                      journal = 	 {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
                      doi =          {10.1109/TPDS.2021.3085210},
                      publisher =    {IEEE Computer Society},
                      address =      {},
                      year = 	 {2022},
                      volume = 	 {33},
                      number = 	 {1},
                      issn =         {1045-9219},
                      pages = 	 {1--13},
                      month = 	 jan,
                      impactfactor = {2.600, 29/108 (Q1) - COMPUTER SCIENCE, THEORY & METHODS (2019)},
                      url =          {http://webs.um.es/aros/papers/pdfs/rtitos-tpds22.pdf}
                    }
                    
                          
Bhargavi R. Upadhyay, Alberto Ros, Jalpa Shah (2021), "Efficient Classification of Private Memory Blocks", Journal of Parallel Distributed Computing (JPDC).
Abstract: Shared memory architectures are pervasive in the multicore technology era. Still, sequential and parallel applications use most of the data as private in a multicore system. Recent proposals using this observation and driven by a classification of private/shared memory data can reduce the coherence directory area or the memory access latency. The effectiveness of these proposals depends on the accuracy of the classification. The existing proposals perform the private/shared classification at page granularity, leading to misclassification and reducing the number of detected private memory blocks. We propose a mechanism able to accurately classify memory blocks using the existing translation lookaside buffers (TLB), which increases the effectiveness of proposals relying on a private/shared classification. Our experimental results show that the proposed scheme reduces L1 cache misses by 25% compared to a page-grain classification approach, which translates into an improvement in system performance by 8.0% with respect to a page-grain approach. Keywords: chip multiprocessor, cache coherence, private-shared data classification
BibTeX:
                    @Article{bupadhyay-jpdc21,
                      author = 	 {Bhargavi R. Upadhyay and Alberto Ros and Jalpa Shah},
                      title = 	 {Efficient Classification of Private Memory Blocks},
                      journal = 	 {Journal of Parallel Distributed Computing (JPDC)},
                      doi =          {10.1016/j.jpdc.2021.07.005},
                      publisher =    {Academic Press, Inc.},
                      address =      {Orlando, FL (USA)},
                      year = 	 {2021},
                      volume = 	 {157},
                      number = 	 {},
                      issn =         {0743-7315},
                      pages = 	 {256--268},
                      month = 	 nov,
                      impactfactor = {2.296, 35/108 (Q2) - COMPUTER SCIENCE, THEORY & METHODS (2019)},
                      url =          {http://webs.um.es/aros/papers/pdfs/bupadhyay-jpdc21.pdf}
                    }
                    
                          
Eduardo José Gómez-Hernández, Rubén Titos-Gil, Juan Manuel Cebrian, Stefanos Kaxiras, Alberto Ros (2021), "Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations", International Symposium on Microarchitecture (MICRO).
Abstract: Critical sections that read, modify, and write (RMW) a small set of addresses are common in parallel applications and concurrent data structures. However, to escape from the intricacies of fine-grained locks, which require reasoning about all possible thread interleavings, programmers often resort to coarse-grained locks to ensure atomicity. This results in atomic protection of a much larger set of potentially conflicting addresses, and, consequently, increased lock contention and unneeded serialization. As many before us have observed, these problems would be solved if only general RMW multi-address atomic operations were available, but current proposals are impractical because of deadlock scenarios that appear due to resource limitations. Alternatively, transactional memory can detect conflicts at run-time aiming to maximize concurrency, but it has significant overheads in highly-contended critical sections. In this work, we propose multi-address atomic operations (MAD atomics). MAD atomics achieve complexity-effective, non-speculative, non-deadlocking, fine-grained locking for multiple addresses, relying solely on the coherence protocol and a predetermined locking order. Unlike prior works, MAD atomics address the challenge of enabling atomic modification over a set of cachelines with arbitrary addresses, simultaneously locking all of them while sidestepping deadlock. MAD atomics only require a small storage per core (around 68 bytes), while significantly outperforming typical lock implementations. Indeed, our evaluation using gem5-20 shows that MAD atomics can improve performance by up to 18× (3.4×, on average, for the applications and concurrent data structures evaluated in this work) over a baseline implemented with locks running on 16 cores. More importantly, the improvement still reaches 2.7×, on average, compared to an Intel hardware transactional memory implementation running on 16 cores.
BibTeX:
                    @InProceedings{ejgomez-micro21,
                      author =       {Eduardo Jos{\'e} G{\'o}mez-Hern{\'a}ndez and Rub{\'e}n Titos-Gil and Juan Manuel Cebrian and Stefanos Kaxiras and Alberto Ros},
                      title =        {Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations},
                      booktitle =    {54th International Symposium on Microarchitecture (MICRO)},
                      doi =          {10.1145/3466752.3480073},
                      pages =        {337--349},
                      year =         {2021},
                      editor =       {},
                      address =      {Worldwide event},
                      month =        oct,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {21.86% (94/430)},
                      isbn =         {978-1-4503-8557-2},
                      url =          {http://webs.um.es/aros/papers/pdfs/ejgomez-micro21.pdf}
                    }
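
Note: the abstract attributes deadlock avoidance to a predetermined locking order. A software analogy of that single idea is sketched below, assuming hypothetical per-address locks and a made-up `multi_address_rmw` helper. MAD atomics perform the locking in hardware through the coherence protocol, so this is only an illustration of ordered acquisition, not the authors' design.

```python
import threading

# Hypothetical "cachelines": address -> value, with one lock per address.
memory = {0x10: 100, 0x20: 100}
locks = {addr: threading.Lock() for addr in memory}


def multi_address_rmw(addresses, update):
    """Atomically apply `update` to several addresses.

    Locks are always acquired in ascending address order, so two callers
    can never each hold a lock the other is waiting for.
    """
    ordered = sorted(set(addresses))
    for addr in ordered:
        locks[addr].acquire()
    try:
        update(memory)
    finally:
        for addr in reversed(ordered):
            locks[addr].release()


def transfer(src, dst, amount, repeats):
    """Move `amount` from src to dst, `repeats` times."""
    def move(mem):
        mem[src] -= amount
        mem[dst] += amount
    for _ in range(repeats):
        multi_address_rmw([src, dst], move)


# The two threads name the addresses in opposite order; acquiring locks in
# the order given by the caller could deadlock, the sorted order cannot.
t1 = threading.Thread(target=transfer, args=(0x10, 0x20, 1, 1000))
t2 = threading.Thread(target=transfer, args=(0x20, 0x10, 2, 1000))
t1.start()
t2.start()
t1.join()
t2.join()

print(memory)
```

Because every update holds both locks, the final values do not depend on how the two threads interleave.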
                    
                          
Josué Feliu, Alberto Ros, Manuel E. Acacio, Stefanos Kaxiras (2021), "ITSLF: Inter-Thread Store-to-Load Forwarding in Simultaneous Multithreading", International Symposium on Microarchitecture (MICRO).
Abstract: In this paper, we argue that, for a class of fine-grain, synchronization-intensive, parallel workloads, it is advantageous to consolidate synchronization and communication as much as possible among the threads of simultaneous multithreading (SMT) cores. While, today, the shared L1 is the closest coherent level where synchronization and communication between SMT threads can take place, we observe that there is an even closer shared level, entirely inside a single core. This level comprises the load queues (LQ) and store queues (SQ) / store buffers (SB) of the SMT threads and to the best of our knowledge it has never been used as such. The reason is that if we allow communication of different SMT threads via their LQs and SQs/SBs, i.e., inter-thread store-to-load forwarding (ITSLF), we violate write atomicity with respect to the outside world, beyond the acceptable model of read-own-write-early multiple-copy atomicity (rMCA). The key insight of our work is that we can accelerate synchronization and communication among SMT threads with inter-thread store-to-load forwarding, without affecting the memory model, in particular without violating rMCA. We demonstrate how we can achieve this entirely through speculative interactions between LQs and SQs/SBs of different threads, while ensuring deadlock-free execution. Without changing the architectural model, the ISA, or the software, and without adding extra hardware in the form of a specialized accelerator, our insight enables a new design point for a standard architecture. We demonstrate that with ITSLF, workloads scale better on a single 8-way SMT core (with the resources of a single-threaded core) than on a baseline SMT (with or without optimizations), or on 8 single-threaded cores.
BibTeX:
                    @InProceedings{jfeliu-micro21,
                      author =       {Josu{\'e} Feliu and Alberto Ros and Manuel E. Acacio and Stefanos Kaxiras},
                      title =        {ITSLF: Inter-Thread Store-to-Load Forwarding in Simultaneous Multithreading},
                      booktitle =    {54th International Symposium on Microarchitecture (MICRO)},
                      doi =          {10.1145/3466752.3480086},
                      pages =        {1296--1308},
                      year =         {2021},
                      editor =       {},
                      address =      {Worldwide event},
                      month =        oct,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {21.86% (94/430)},
                      isbn =         {978-1-4503-8557-2},
                      url =          {http://webs.um.es/aros/papers/pdfs/jfeliu-micro21.pdf}
                    }
                    
                          
Christos Sakalis, Zamshed Chowdhury, Shayne Wadle, Ismail Akturk, Alberto Ros, Magnus Själander, Stefanos Kaxiras, Ulya R. Karpuzcu (2021), "Do Not Predict - Recompute! How Value Recomputation Can Truly Boost the Performance of Invisible Speculation", IEEE International Symposium on Secure and Private Execution Environment Design (SEED).
Abstract: Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the delay. However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by providing the value of loads that miss) without incurring its negative side-effect (delaying the release of precious resources), if we can safely, non-speculatively, recompute a value in isolation (without being seen from the outside), so that we do not expose any information by transferring such a value via the memory hierarchy. Value Recomputation, which trades computation for data transfer, was previously proposed in an entirely different context: to reduce energy-expensive data transfers in the memory hierarchy. In this paper, we demonstrate the potential of value recomputation in relation to the Delay-on-Miss approach of hiding speculation, discuss the trade-offs, and show that we can achieve the same level of security, reaching 93% of the unsecured baseline performance (5% higher than Delay-on-Miss), and exceeding (by 3%) what even an oracular (100% accuracy and coverage) value predictor could do.
BibTeX:
                    @InProceedings{csakalis-seed21,
                      author =       {Christos Sakalis and Zamshed Chowdhury and Shayne Wadle and Ismail Akturk and Alberto Ros and Magnus Sj{\"a}lander and Stefanos Kaxiras and Ulya R. Karpuzcu},
                      title =        {Do Not Predict - Recompute! How Value Recomputation Can Truly Boost the Performance of Invisible Speculation},
                      booktitle =    {1st IEEE International Symposium on Secure and Private Execution Environment Design (SEED)},
                      doi =          {10.1109/SEED51797.2021.00021},
                      pages =        {89--100},
                      year =         {2021},
                      editor =       {},
                      address =      {Worldwide event},
                      month =        sep,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {30.23% (13/43)},
                      isbn =         {978-1-6654-2026-6},
                      url =          {http://webs.um.es/aros/papers/pdfs/csakalis-seed21.pdf}
                    }
                    
                          
Alberto Ros (2021), "BL∪E: A Timely, IP-based Data Prefetcher", The 1st ML-Based Data Prefetching Competition. ML for Computer Architecture and Systems.
Abstract: High-performance prefetchers require not only predicting the future cache lines that will be requested but also when they will be requested. Timeliness is therefore an essential property for getting the maximum performance from a prefetcher. Bringing a cache line into the cache too early can decrease the coverage of the prefetcher when that cache line is evicted before it is requested. On the other hand, prefetching the data too late can lead to late prefetches, and thus, sub-optimal performance. This paper presents BL∪E, a data prefetcher that predicts the prefetched cache lines based on timeliness. The prefetcher accounts for the time required to fetch a cache line and issues the prefetch request early enough, such that when it is accessed it will already be stored in cache. For each instruction pointer (or group of them) BL∪E i) correlates in a timely way the cache lines that have been requested and ii) infers their timely delta when the cache lines have not been accessed yet.
BibTeX:
                    @InProceedings{aros-mldpc21,
                      author = 	 {Alberto Ros},
                      title = 	 {BL$\cup$E: A Timely, IP-based Data Prefetcher},
                      booktitle =    {The 1st ML-Based Data Prefetching Competition. ML for Computer Architecture and Systems},
                      pages = 	 {},
                      year = 	 {2021},
                      editor = 	 {},
                      address = 	 {Worldwide event},
                      month = 	 jun,
                      publisher =    {},
                      ratio-acep =   {75.00% (3/4)},
                      isbn =         {},
                      url =          {http://webs.um.es/aros/papers/pdfs/aros-mldpc21.pdf}
                    }
                    
                          
Alberto Ros, Alexandra Jimborean (2021), "A Cost-Effective Entangling Prefetcher for Instructions", International Symposium on Computer Architecture (ISCA).
Abstract: Prefetching instructions in the instruction cache is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is essential, as bringing instructions too early increases the risk of the instructions being evicted from the cache before their use and requesting them too late can lead to the instructions arriving after they are demanded. Coverage is important to reduce the number of instruction cache misses and accuracy to ensure that the prefetcher does not pollute the cache or interact negatively with the other hardware mechanisms. This paper presents the Entangling Prefetcher for Instructions that entangles instructions to maximize timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that with 40KB of storage, Entangling can increase performance up to 23%, outperforming state-of-the-art prefetchers.
BibTeX:
                    @InProceedings{aros-isca21,
                      author = 	 {Alberto Ros and Alexandra Jimborean},
                      title = 	 {A Cost-Effective Entangling Prefetcher for Instructions},
                      booktitle =    {48th International Symposium on Computer Architecture (ISCA)},
                      doi =          {10.1109/ISCA52012.2021.00017},
                      pages = 	 {99--111},
                      year = 	 {2021},
                      editor = 	 {},
                      address = 	 {Worldwide event},
                      month = 	 jun,
                      publisher =    {Association for Computing Machinery (ACM)},
                      ratio-acep =   {18.72% (76/406)},
                      isbn =         {978-1-6654-3333-4},
                      issn =         {1063-6897},
                      url =          {http://webs.um.es/aros/papers/pdfs/aros-isca21.pdf}
                    }
                          
Eduardo José Gómez-Hernández, Ruixiang Shao, Christos Sakalis, Stefanos Kaxiras, Alberto Ros (2021), "Splash-4: Improving Scalability with Lock-Free Constructs", International Symposium on Performance Analysis of Systems and Software (ISPASS).
Abstract: Over the past three decades, the parallel applications of the Splash-2 benchmark suite have been instrumental in advancing multiprocessor research. Recently, the Splash-3 benchmarks eliminated performance bugs, data races, and improper synchronization that plagued Splash-2 benchmarks after the definition of the C memory model. In this work, we revisit the Splash-3 benchmarks and adapt them for contemporary architectures with atomic operations and lock-free constructs. With our changes, we improve the scalability of most benchmarks for up to 32 and 64 cores, showing an improvement of up to 9x in actual machines, and up to 5x in simulation, over the unmodified Splash-3 benchmarks. To denote the substantive nature of the improvements in the Splash-3 benchmarks and to re-introduce them in contemporary research, we refer to the new collection as Splash-4.
BibTeX:
                    @InProceedings{ejgomez-ispass21,
                      author = 	 {Eduardo Jos{\'e} G{\'o}mez-Hern{\'a}ndez and Ruixiang Shao and Christos Sakalis and Stefanos Kaxiras and Alberto Ros},
                      title = 	 {Splash-4: Improving Scalability with Lock-Free Constructs},
                      booktitle =    {International Symposium on Performance Analysis of Systems and Software (ISPASS)},
                      doi =          {10.1109/ISPASS51385.2021.00044},
                      editor =       {IEEE Computer Society},
                      pages = 	 {235--236},
                      year = 	 {2021},
                      address = 	 {Worldwide event},
                      month = 	 mar,
                      publisher =    {IEEE Computer Society},
                      ratio-acep =   {36.92% (24/65)},
                      isbn =         {978-1-7281-8643-6},
                      url =          {http://webs.um.es/aros/papers/pdfs/ejgomez-ispass21.pdf}
                    }
                          
Per Ekemark, Yuan Yao, Alberto Ros, Konstantinos Sagonas, Stefanos Kaxiras (2021), "TSOPER: Efficient Coherence-Based Strict Persistency", High Performance Computer Architecture (HPCA).
Abstract: We propose a novel approach for hardware-based strict TSO persistency, called TSOPER. We allow a TSO persistency model to freely coalesce values in the caches, by forming atomic groups of cachelines to be persisted. A group persist is initiated for an atomic group if any of its newly written values are exposed to the outside world. A key difference with prior work is that our architecture is based on the concept of a TSO persist buffer, that sits in parallel to the shared LLC, and persists atomic groups directly from private caches to NVM, bypassing the coherence serialization of the LLC. To impose dependencies among atomic groups that are persisted from the private caches to the TSO persist buffer, we introduce a sharing-list coherence protocol that naturally captures the order of coherence operations in its sharing lists, and thus can reconstruct the dependencies among different atomic groups entirely at the private cache level without involving the shared LLC. The combination of the sharing-list coherence and the TSO persist buffer allows persist operations and writes to non-volatile memory to happen in the background and trail the coherence operations. Coherence runs ahead at full speed; persistency follows belatedly. Our evaluation shows that TSOPER provides the same level of reordering as a program-driven relaxed model, hence, approximately the same level of performance, albeit without needing the programmer or compiler to be concerned about false sharing, data-race-free semantics, etc., and guaranteeing all software that can run on top of TSO, automatically persists in TSO.
BibTeX:
                                    @InProceedings{pekemark-hpca21,
                                      author =       {Per Ekemark and Yuan Yao and Alberto Ros and Konstantinos Sagonas and Stefanos Kaxiras},
                                      title =        {TSOPER: Efficient Coherence-Based Strict Persistency},
                                      booktitle =    {27th Symposium on High Performance Computer Architecture (HPCA)},
                                      doi =          {10.1109/HPCA51647.2021.00021},
                                      pages =        {125--138},
                                      year =         {2021},
                                      editor =       {},
                                      address =      {Worldwide event},
                                      month =        feb,
                                      publisher =    {IEEE Computer Society},
                                      ratio-acep =   {24.42% (63/258)},
                                      isbn =         {978-1-6654-4670-9},
                                      issn =         {1530-0897},
                                      url =          {http://webs.um.es/aros/papers/pdfs/pekemark-hpca21.pdf}
                                    }
                                    
                          
Christos Sakalis, Stefanos Kaxiras, Alberto Ros, Alexandra Jimborean, Magnus Själander (2020), "Understanding Selective Delay as a Method for Efficient Secure Speculative Execution", IEEE Transactions on Computers (TC).
Abstract: Since the introduction of Meltdown and Spectre, the research community has been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need to be general in nature, covering a wide array of system components, while at the same time keeping the performance, energy, area, and implementation complexity costs at a minimum. One such solution is our own delay-on-miss, which efficiently protects the memory hierarchy by i) selectively delaying speculative load instructions and ii) utilizing value prediction as an invisible form of speculation. In this work we dive deeper into delay-on-miss, offering insights into why and how it affects the performance of the system. We also reevaluate value prediction as an invisible form of speculation. Specifically, we focus on the implications that delaying memory loads has in the memory level parallelism of the system and how this affects the value predictor and the overall performance of the system. We present new, updated results but more importantly, we also offer deeper insight into why delay-on-miss works so well and what this means for the future of secure speculative execution.
BibTeX:
                                    @Article{csakalis-tc20,
                                      author = 	 {Christos Sakalis and Stefanos Kaxiras and Alberto Ros and Alexandra Jimborean and Magnus Sj{\"a}lander},
                                      title = 	 {Understanding Selective Delay as a Method for Efficient Secure Speculative Execution},
                                      journal = 	 {IEEE Transactions on Computers (TC)},
                                      doi =          {10.1109/TC.2020.3014456},
                                      publisher =    {IEEE Computer Society},
                                      address =      {},
                                      year = 	 {2020},
                                      volume = 	 {69},
                                      number = 	 {11},
                                      issn =         {0018-9340},
                                      pages = 	 {1584--1595},
                                      month = 	 nov,
                                      impactfactor = {2.711, 19/53 (Q2) - COMPUTER SCIENCE, HARDWARE & ARCHITECTURE (2019)},
                                      url =          {http://webs.um.es/aros/papers/pdfs/csakalis-tc20.pdf}
                                    }
                          
Alberto Ros and Stefanos Kaxiras (2020), "Speculative Enforcement of Store Atomicity", In International Symposium on Microarchitecture (MICRO).
Abstract: Various memory consistency model implementations (e.g., x86, SPARC) willfully allow a core to see its own stores while they are in limbo, i.e., executed (and perhaps retired) but not yet inserted in memory order. This is known as store-to-load forwarding and it is a necessity to safeguard the local thread's sequential program semantics while achieving high performance. However, this can lead to counter-intuitive behaviours, requiring fences to prevent such behaviours when needed. Other vendors (e.g., IBM 370 and the z/Architecture series) opt for enforcing what we call in this work store atomicity, that is, disallowing a core to see its own stores before they are written to memory, trading off performance for a more intuitive memory model. Ideally, we want a stricter model to ease programmability at the same time that architects can provide high-performance solutions. We make a simple observation. What holds for any other rule in a consistency model, also holds for store atomicity: it is not a crime to break the rule, unless we get caught. In this work, we detail the different ways of detecting a store atomicity violation. This leads us to a new insight: a load performed by a forwarding from an in-limbo store is not speculative; younger loads performed after that forwarding are. Based on this insight we propose an effective and cheap speculative approach to dynamically enforce store atomicity only when the detection of its violation actually occurs. In practice, these cases are rare during the execution of a program. In all other cases (the bulk of the execution of a program) store-to-load forwarding can be done without violating store atomicity.
The end result is that we provide the best of both worlds: a more intuitive store-atomic memory model, i.e., the 370 model, with the performance and cost approaching (at an average of just 2.5% and 2.7% overhead for parallel and sequential applications, respectively) that of a non-store-atomic model, i.e., the x86 model.
BibTeX:
                            @inproceedings{aros-micro20,
                              author = {Alberto Ros and Stefanos Kaxiras},
                              title = {Speculative Enforcement of Store Atomicity},
                              booktitle = {International Symposium on Microarchitecture (MICRO)},
                              year = {2020}
                            }
                            
Cebrian JM, Kaxiras S and Ros A (2020), "Boosting Store Buffer Efficiency with Store-Prefetch Bursts", In International Symposium on Microarchitecture (MICRO).
Abstract: Virtually all processors today employ a store buffer (SB) to hide store latency. However, when the store buffer is full, store latency is exposed to the processor causing pipeline stalls. The default strategies to mitigate these stalls are to issue prefetch for ownership requests when store instructions commit and to continuously increase the store buffer size. While these strategies considerably increase memory-level parallelism for stores, there are still applications that suffer deeply from stalls caused by the store buffer. Even worse, store-buffer induced stalls increase considerably when simultaneous multi-threading is enabled, as the store buffer is statically partitioned among the threads.
In this paper, we propose a highly selective and very aggressive prefetching strategy to minimize store-buffer induced stalls. Our proposal, Store-Prefetch Burst (SPB), is based on the following insights: i) the majority of store-buffer induced stalls are caused by a few stores; ii) the access patterns of such stores are easily predictable; and iii) the latency of the stores is not commonly hidden by standard cache prefetchers, as hiding their latency would require tremendous prefetch aggressiveness. SPB accurately detects contiguous store-access patterns (requiring just 67 bits of storage) and prefetches the remaining memory blocks of the accessed page in a single burst request to the L1 controller. SPB matches the performance of a 1024-entry SB implementation on a 56-entry SB (i.e., Skylake architecture). For a 14-entry SB (e.g., running four logical cores), it achieves 95.0% of that ideal performance, on average, for SPEC CPU 2017. Additionally, a 20-entry store buffer that incorporates SPB achieves the average performance of a standard 56-entry store buffer.
BibTeX:
                            @inproceedings{jcebrian-micro20,
                              author = {Cebrian, Juan M and Kaxiras, Stefanos and Ros, Alberto},
                              title = {Boosting Store Buffer Efficiency with Store-Prefetch Bursts},
                              booktitle = {International Symposium on Microarchitecture (MICRO)},
                              year = {2020}
                            }
                            
Ros A and Jimborean A (2020), "The Entangling Instruction Prefetcher", IEEE Computer Architecture Letters. Vol. 19(2), pp. 84-87.
Abstract: Prefetching instructions is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is an essential property, as bringing instructions too early increases the risk of the instructions being evicted from the cache before their use while requesting them too late can lead to the instructions arriving past their designated execution time. Coverage is important to reduce the number of instruction cache misses (there is enough prefetching), and accuracy to ensure that the prefetcher does not pollute the cache or interact negatively with the other hardware mechanisms (there is not too much prefetching). This letter presents the Entangling instruction prefetcher that entangles instructions to provide timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that the Entangling I-prefetcher increases performance by 29.3 percent on average, with a coverage of 94.9 percent and accuracy of 77.4 percent.
BibTeX:
                            @article{aros-cal20,
                              author = {Ros, Alberto and Jimborean, Alexandra},
                              title = {The Entangling Instruction Prefetcher},
                              journal = {IEEE Computer Architecture Letters},
                              year = {2020},
                              volume = {19},
                              number = {2},
                              pages = {84--87},
                              doi = {10.1109/LCA.2020.3002947}
                            }
                            
Ros A and Jimborean A (2020), "The Entangling Instruction Prefetcher", The 1st Instruction Prefetching Championship. Worldwide event.
Abstract: Prefetching instructions is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is an essential property, as bringing instructions too early increases the risk of the instructions being evicted from the cache before their use while requesting them too late can lead to the instructions arriving past their designated execution time. Coverage is important to reduce the number of instruction cache misses (there is enough prefetching), and accuracy to ensure that the prefetcher does not pollute the cache or interact negatively with the other hardware mechanisms (there is not too much prefetching). This letter presents the Entangling instruction prefetcher that entangles instructions to provide timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that the Entangling I-prefetcher increases performance by 29.3 percent on average, with a coverage of 94.9 percent and accuracy of 77.4 percent.
BibTeX:
                            @inproceedings{aros-ipc20,
                              author = {Ros, Alberto and Jimborean, Alexandra},
                              title = {The Entangling Instruction Prefetcher},
                              booktitle = {The 1st Instruction Prefetching Championship},
                              year = {2020},
                              address = {Worldwide event}
                            }
                            
                            
Singh S, Jimborean A and Ros A (2020), "Regional out-of-order writes in total store order", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. (Section 4), pp. 205-216.
Abstract: The store buffer, an essential component in today's processors, is designed to hide memory latency by moving stores off the processor's critical path. Furthermore, under the Total Store Order (TSO) memory model, the store buffer ensures the in-order retirement of stores. Problems arise when the store buffer is full or, under TSO, when the leading store encounters a cache miss, which blocks all subsequent stores and incurs severe performance bottlenecks. This work presents a software-hardware co-designed approach to cope with this bottleneck for processors with strong consistency guarantees. Our proposal is driven by the insight that store operations can be reordered if their reordering does not change the observable program behavior. The compiler delineates safe regions within which stores can be shuffled while still delivering the same observable behavior as if they performed in program order and unsafe regions within which stores must be kept in program order. This is leveraged by a novel dual-mode store buffer that switches between the out-of-order and in-order execution of stores within the safe and respectively unsafe regions. Correctness is preserved through well-placed fences inserted by the compiler, which impede the execution of stores from the following regions until all stores of the current region complete. Our dual-mode store buffer only requires one extra bit per entry, significantly decreases processor stall cycles, and brings 8.13% performance improvements compared to a mainstream store buffer.
BibTeX:
                            @article{ssingh-pact20,
                              author = {Singh, Sawan and Jimborean, Alexandra and Ros, Alberto},
                              title = {Regional out-of-order writes in total store order},
                              journal = {Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT},
                              year = {2020},
                              number = {Section 4},
                              pages = {205--216},
                              doi = {10.1145/3410463.3414645}
                            }
                            
Titos-Gil R, Fernandez-Pascual R, Ros A and Acacio ME (2020), "Concurrent Irrevocability in Best-Effort Hardware Transactional Memory", IEEE Transactions on Parallel and Distributed Systems. Vol. 31(6), pp. 1301-1315.
Abstract: Existing best-effort requester-wins implementations of transactional memory must resort to non-speculative execution to provide forward progress in the presence of transactions that exceed hardware capacity, experience page faults or suffer high-contention leading to livelocks. Current approaches to irrevocability employ lock-based synchronization to achieve mutual exclusion when executing a transaction non-speculatively, conservatively precluding concurrency with any other transactions in order to guarantee atomicity at the cost of degrading performance. In this article, we propose a new form of concurrent irrevocability whose goal is to minimize the loss of concurrency paid when transactions resort to irrevocability to complete. By enabling optimistic concurrency control also during non-speculative execution of a transaction, our proposal allows for higher parallelism than existing schemes. We describe the extensions to the instruction set to provide concurrent irrevocable transactions as well as the architectural extensions required to realize them on a best-effort HTM system without requiring any modification to the cache coherence protocol. Our evaluation shows that our proposal achieves an average reduction of 12.5 percent in execution time across the STAMP benchmarks, with 15.8 percent on average for highly contended workloads.
BibTeX:
                            @article{rtitos-tpds20,
                              author = {Titos-Gil, Ruben and Fernandez-Pascual, Ricardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {Concurrent Irrevocability in Best-Effort Hardware Transactional Memory},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2020},
                              volume = {31},
                              number = {6},
                              pages = {1301--1315},
                              doi = {10.1109/TPDS.2019.2963030}
                            }
                            
Titos-Gil R, Fernández-Pascual R, Ros A and Acacio ME (2020), "PfTouch: Concurrent page-fault handling for Intel restricted transactional memory", Journal of Parallel and Distributed Computing. Vol. 145, pp. 111-123.
Abstract: Page faults occurring within transactions jeopardize concurrency in Intel Restricted Transactional Memory (RTM). To make progress in spite of page-fault-induced aborts, the program must resort to the non-speculative fallback path and re-execute the affected transaction. Since the atomicity of a non-speculative transaction is guaranteed by impeding the execution of any other speculative transactions until the former completes, taking the fallback path is particularly harmful for performance. Therefore, such page-fault-induced aborts currently lead to thread serialization during the potentially long period of time taken to resolve them. In this work we propose PfTouch, a simple extension to RTM that allows page-fault handling to be moved out of non-speculative transactional execution in mutual exclusion. Our proposal sidesteps taking the fallback path in these cases and thus avoids its associated performance loss, by triggering page faults in the abort handler while other speculative transactions can run concurrently. PfTouch requires minimal modifications in the Intel RTM specification and keeps the OS unaltered. Through full-system simulation, we show that PfTouch achieves average reductions in execution time of 7.7% (up to 24.4%) for the STAMP benchmarks, closely matching the performance of the more complex suspended transactional mode in the IBM Power ISA.
BibTeX:
                            @article{rtitos-jpdc20,
                              author = {Titos-Gil, Rubén and Fernández-Pascual, Ricardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {PfTouch: Concurrent page-fault handling for Intel restricted transactional memory},
                              journal = {Journal of Parallel and Distributed Computing},
                              year = {2020},
                              volume = {145},
                              pages = {111--123},
                              doi = {10.1016/j.jpdc.2020.06.009}
                            }
                            
Tran KA, Sakalis C, Själander M, Ros A, Kaxiras S and Jimborean A (2020), "Clearing the shadows: Recovering lost performance for invisible speculative execution through HW/SW Co-design", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. (iii), pp. 241-254.
Abstract: Out-of-order processors heavily rely on speculation to achieve high performance, allowing instructions to bypass other slower instructions in order to fully utilize the processor's resources. Speculatively executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect the micro-architectural behavior of the system. Until recently, these changes were considered to be safe but with the discovery of new security attacks that misuse speculative execution to leak secret information through observable micro-architectural changes (so-called side-channels), this is no longer the case. To solve this issue, a wave of software and hardware mitigations have been proposed, the majority of which delay and/or hide speculative execution until it is deemed to be safe, trading performance for security. These newly enforced restrictions change how speculation is applied and where the performance bottlenecks appear, forcing us to rethink how we design and optimize both the hardware and the software. We observe that many of the state-of-the-art hardware solutions targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the causes of loads' unsafety, generally caused by control and memory dependence speculation. As a result, we manage to make more loads safe to execute at an early stage, which enables us to schedule more loads at a time to overlap their delays and improve performance. We apply our techniques on the state-of-the-art Delay-on-Miss hardware defense and show that we reduce the performance gap to the unsafe baseline by 53% (on average).
BibTeX:
                            @inproceedings{ktran-pact20,
                              author = {Tran, Kim Anh and Sakalis, Christos and Själander, Magnus and Ros, Alberto and Kaxiras, Stefanos and Jimborean, Alexandra},
                              title = {Clearing the shadows: Recovering lost performance for invisible speculative execution through HW/SW Co-design},
                              booktitle = {Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT},
                              year = {2020},
                              number = {iii},
                              pages = {241--254},
                              doi = {10.1145/3410463.3414640}
                            }
                            
Upadhyay BR, Ros A and Ns M (2020), "TLB-based block-grain classification of private data", Proceedings - 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2020, pp. 122-130.
Abstract: Sequential and parallel applications use most of the data as private in a multi-core system. Recent proposals made use of this observation to reduce the area of the coherence directories or the memory access latency. The driving force of these proposals is the classification of private/shared memory data. The effectiveness of these proposals depends on the amount of detected private data. The existing proposals perform the private/shared classification at page granularity, leading to a noticeable amount of misclassified memory blocks. We propose a mechanism that works on block granularity using the translation lookaside buffer (TLB) to make accurate detection of private data, which increases the effectiveness of proposals relying on a private/shared classification. Simulation results show that the block-grain approach obtains 17.0% more accessed private miss data than the page-grain approach, which translates to an improvement in system performance by 6.02% compared to a page-grain approach.
BibTeX:
                            @inproceedings{bupadhyay-pdp20,
                              author = {Upadhyay, Bhargavi R. and Ros, Alberto and Ns, Murty},
                              title = {TLB-based block-grain classification of private data},
                              booktitle = {Proceedings - 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2020},
                              year = {2020},
                              pages = {122--130},
                              doi = {10.1109/PDP50117.2020.00025}
                            }
                            
Alves R, Ros A, Black-Schaffer D and Kaxiras S (2019), "Filter caching for free: The untapped potential of the store-buffer", Proceedings - International Symposium on Computer Architecture, pp. 436-448.
Abstract: Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic hit rate or whether a store-buffer hit can be predicted to save energy by disabling the L1 and TLB probes. In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries between the store-queue/buffer/cache, we can achieve nearly optimal reuse, without causing stalls. We are able to do this efficiently and cheaply by recognizing key properties of stores: free caching (since they must be written into the store-buffer for correctness we need no additional data movement), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling). As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvements of 1.5%, up to 4.7%). The cost for these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.
BibTeX:
                            @inproceedings{ralves-isca19,
                              author = {Alves, Ricardo and Ros, Alberto and Black-Schaffer, David and Kaxiras, Stefanos},
                              title = {Filter caching for free: The untapped potential of the store-buffer},
                              booktitle = {Proceedings - International Symposium on Computer Architecture},
                              year = {2019},
                              pages = {436--448},
                              doi = {10.1145/3307650.3322269}
                            }
                            
Ros A (2019), "Berti: A Per-Page Best-Request-Time Delta Prefetcher", The 3rd Data Prefetching Championship. (1)
Abstract: Prefetching data blocks into the caches comprising the memory hierarchy is a fundamental technique for designing high-performance computers. In fact, current systems implement prefetchers at every cache level. Timeliness is an essential property for getting the maximum performance from the prefetcher, as bringing the data early to cache can increase its miss ratio and requesting the data too late can lead to sub-optimal performance. This paper presents Berti, a prefetcher that finds the delta that provides the best timeliness for memory blocks in each page. The prefetcher works in two modes: (i) on the first access to a block, in a certain period of time, the prefetcher issues a request for the next block according to the best delta found; (ii) for cold pages, a burst mechanism fetches blocks that cannot be reached adding the delta to the current accessed block.
BibTeX:
                            @inproceedings{aros-dpc19,
                              author = {Ros, Alberto},
                              title = {Berti: A Per-Page Best-Request-Time Delta Prefetcher},
                              booktitle = {The 3rd Data Prefetching Championship},
                              year = {2019},
                              number = {1}
                            }
                            
Sakalis C, Alipour M, Ros A, Jimborean A, Kaxiras S and Själander M (2019), "Ghost Loads: What is the Cost of Invisible Speculation?", ACM International Conference on Computing Frontiers 2019, CF 2019 - Proceedings, pp. 153-163.
Abstract: Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectural state is restored to assure functional correctness but a multitude of microarchitectural changes (e.g., cache updates), caused by the speculatively executed instructions, are commonly left in the system. These changes can be used to leak sensitive information, which has led to a frantic search for solutions that can eliminate such security flaws. The contribution of this work is an evaluation of the cost of hiding speculative side-effects in the cache hierarchy, making them visible only after the speculation has been resolved. For this, we compare (for the first time) two broad approaches: i) waiting for loads to become non-speculative before issuing them to the memory system, and ii) eliminating the side-effects of speculation, a solution consisting of invisible loads (Ghost loads) and performance optimizations (Ghost Buffer and Materialization). While previous work, InvisiSpec, has proposed a similar solution to our latter approach, it has done so with only a minimal evaluation and at a significant performance cost. The detailed evaluation of our solutions shows that: i) waiting for loads to become non-speculative is no more costly than the previously proposed InvisiSpec solution, albeit much simpler, non-invasive in the memory system, and stronger security-wise; ii) hiding speculation with Ghost loads (in the context of a relaxed memory model) can be achieved at the cost of 12% performance degradation and 9% energy increase, which is significantly better than the previous state-of-the-art solution.
BibTeX:
                            @inproceedings{csakalis-cf19,
                              author = {Sakalis, Christos and Alipour, Mehdi and Ros, Alberto and Jimborean, Alexandra and Kaxiras, Stefanos and Själander, Magnus},
                              title = {Ghost Loads: What is the Cost of Invisible Speculation?},
                              booktitle = {ACM International Conference on Computing Frontiers 2019, CF 2019 - Proceedings},
                              year = {2019},
                              pages = {153--163},
                              doi = {10.1145/3310273.3321558}
                            }
                            
Sakalis C, Kaxiras S, Ros A, Jimborean A and Själander M (2019), "Efficient invisible speculative execution through selective delay and value prediction", Proceedings - International Symposium on Computer Architecture, pp. 723-735.
Abstract: Speculative execution, the base on which modern high-performance general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: During speculative execution, the architectural state of the system is kept unmodified, until the speculation can be verified. In the event that a misspeculation occurs, anything that can affect the architectural state is reverted (squashed) and re-executed correctly. However, the same is not true for the microarchitectural state. Normally invisible to the user, changes to the microarchitectural state can be observed through various side-channels, with timing differences caused by the memory hierarchy being one of the most common and easy to exploit. The speculative side-channels can then be exploited to perform attacks that can bypass software and hardware checks in order to leak information. These attacks, out of which the most infamous are perhaps Spectre and Meltdown, have led to a frantic search for solutions. In this work, we present our own solution for reducing the microarchitectural state-changes caused by speculative execution in the memory hierarchy. It is based on the observation that if we only allow accesses that hit in the L1 data cache to proceed, then we can easily hide any microarchitectural changes until after the speculation has been verified. At the same time, we propose to prevent stalls by value predicting the loads that miss in the L1. Value prediction, though speculative, constitutes an invisible form of speculation, not seen outside the core. We evaluate our solution and show that we can prevent observable microarchitectural changes in the memory hierarchy while keeping the performance and energy costs at 11% and 7%, respectively. In comparison, the current state-of-the-art solution, InvisiSpec, incurs a 46% performance loss and a 51% energy increase.
BibTeX:
                            @inproceedings{csakalis-isca19,
                              author = {Sakalis, Christos and Kaxiras, Stefanos and Ros, Alberto and Jimborean, Alexandra and Själander, Magnus},
                              title = {Efficient invisible speculative execution through selective delay and value prediction},
                              booktitle = {Proceedings - International Symposium on Computer Architecture},
                              year = {2019},
                              pages = {723--735},
                              doi = {10.1145/3307650.3322216}
                            }
                            
Titos-Gil R, Flores A, Fernandez-Pascual R, Ros A, Petit S, Sahuquillo J and Acacio ME (2019), "Way Combination for an Adaptive and Scalable Coherence Directory", IEEE Transactions on Parallel and Distributed Systems. Vol. 30(11), pp. 2608-2623.
Abstract: This manuscript opens the way to a new class of coherence directory structures that are based on the brand-new concept of way combining. A Way-Combining Directory (WC-dir) builds on a typical sparse directory but allows to take advantage of several ways in the same set to codify the sharing information of each memory block. The result is a sparse directory with variable effective associativity per set and variable length entries, thus being able to dynamically adapt the directory structure to the particular requirements of each application. In particular, our proposal uses just enough bits per entry to store a single pointer, which is optimal for the common case of having just one sharer. For those addresses that have more than one sharer, we have observed that in the majority of cases extra bits could be taken from other empty ways in the same set. All in all, our proposal minimizes the storage overheads without losing the flexibility to adapt to several sharing degrees and without the complexities of other previously proposed techniques. Detailed simulations of a 128-core multicore architecture running benchmarks from PARSEC-3.0 and SPLASH-3 demonstrate that WC-dir can closely approach the performance of a non-scalable bit vector sparse directory, beating the state-of-the-art Scalable Coherence Directory (SCD) and Pool directory proposals.
BibTeX:
                            @article{rtitos-tpds19,
                              author = {Titos-Gil, Ruben and Flores, Antonio and Fernandez-Pascual, Ricardo and Ros, Alberto and Petit, Salvador and Sahuquillo, Julio and Acacio, Manuel E.},
                              title = {Way Combination for an Adaptive and Scalable Coherence Directory},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2019},
                              volume = {30},
                              number = {11},
                              pages = {2608--2623},
                              doi = {10.1109/TPDS.2019.2917185}
                            }
                            
Abdulla PA, Atig MF, Kaxiras S, Leonardsson C, Ros A and Zhu Y (2018), "Mending fences with self-invalidation and self-downgrade", Logical Methods in Computer Science. Vol. 14(1), pp. 1-33.
Abstract: Cache coherence protocols based on self-invalidation and self-downgrade have recently seen increased popularity due to their simplicity, potential performance efficiency, and low energy consumption. However, such protocols result in memory instruction reordering, thus causing extra program behaviors that are often not intended by the programmers. We propose a novel formal model that captures the semantics of programs running under such protocols, and features a set of fences that interact with the coherence layer. Using the model, we design an algorithm to analyze the reachability and check whether a program satisfies a given safety property with the current set of fences. We describe a method for insertion of optimal sets of fences that ensure correctness of the program under such protocols. The method relies on a counter-example guided fence insertion procedure. One feature of our method is that it can handle a variety of fences (with different costs). This diversity makes optimization more difficult since one has to optimize the total cost of the inserted fences, rather than just their number. To demonstrate the strength of our approach, we have implemented a prototype and run it on a wide range of examples and benchmarks. We have also, using simulation, evaluated the performance of the resulting fenced programs.
BibTeX:
                            @article{paabdulla-lmcs18,
                              author = {Abdulla, Parosh Aziz and Atig, Mohamed Faouzi and Kaxiras, Stefanos and Leonardsson, Carl and Ros, Alberto and Zhu, Yunyun},
                              title = {Mending fences with self-invalidation and self-downgrade},
                              journal = {Logical Methods in Computer Science},
                              year = {2018},
                              volume = {14},
                              number = {1},
                              pages = {1--33},
                              doi = {10.23638/LMCS-14(1:6)2018}
                            }
                            
Abellán JL, Padierna E, Ros A and Acacio ME (2018), "Photonic-based express coherence notifications for many-core CMPs", Journal of Parallel and Distributed Computing. Vol. 113, pp. 179-194.
Abstract: Directory-based coherence protocols (Directory) are considered the design of choice to provide maximum performance in coherence maintenance for shared-memory many-core CMPs, despite their large memory overhead. New solutions are emerging to achieve acceptable levels of on-chip area overhead and energy consumption such as optimized encoding of block sharers in Directory (e.g., SCD) or broadcast-based coherence (e.g., Hammer). In this work, we propose a novel and efficient solution for the cache coherence problem in many-core systems based on the co-design of the coherence protocol and the interconnection network. Particularly, we propose ECONO, a cache coherence protocol tailored to future many-core systems that resorts on PhotoBNoC, a special lightweight dedicated silicon-photonic subnetwork for efficient delivery of the atomic broadcast coherence messages used by the protocol. Considering a simulated 256-core system, as compared with Hammer, we demonstrate that ECONO+PhotoBNoC reduces performance and energy consumption by an average of 34% and 32%, respectively. Additionally, our proposal lowers the area overhead entailed by SCD by 2×.
BibTeX:
                            @article{jlabellan-jpdc18,
                              author = {Abellán, José L. and Padierna, Eduardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {Photonic-based express coherence notifications for many-core CMPs},
                              journal = {Journal of Parallel and Distributed Computing},
                              year = {2018},
                              volume = {113},
                              pages = {179--194},
                              doi = {10.1016/j.jpdc.2017.11.015}
                            }
                            
Esteve A, Ros A, Robles A and E G (2018), "TokenTLB + CUP : A Token-Based Page", IEEE Transactions on Parallel and Distributed Systems (TPDS), pp. 1-14.
BibTeX:
                            @article{aesteve-tpds18,
                              author = {Esteve, Albert and Ros, Alberto and Robles, Antonio and G, E},
                              title = {TokenTLB + CUP : A Token-Based Page},
                              journal = {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
                              year = {2018},
                              pages = {1--14}
                            }
                            
Jimborean A, Ekemark P, Waern J, Kaxiras S and Ros A (2018), "Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation", IEEE Transactions on Parallel and Distributed Systems. Vol. 29(3), pp. 527-541.
Abstract: Data-race-free (DRF) parallel programming becomes a standard as newly adopted memory models of mainstream programming languages such as C++ or Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free (xDRF) regions, namely regions of code that provide the same guarantees as the synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges and preserve the data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. We further enlarge xDRF regions with a conflict isolation (CI) technique, delineating what we call xDRF-CI regions while preserving the same properties as xDRF regions. Our compiler (1) precisely analyzes the threads' memory accessing behavior and data sharing in shared-memory, general-purpose parallel applications, (2) isolates data-sharing and (3) marks the limits of xDRF-CI code regions. The contribution of this work consists in a simple but effective method to alleviate the drawbacks of the compiler's conservative nature in order to be competitive with (and even surpass) an expert in delineating xDRF regions manually. We evaluate the potential of our technique by employing xDRF and xDRF-CI region classification in a state-of-the-art, dual-mode cache coherence protocol. We show that xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.4 percent) and energy efficiency (12.2 percent) compared to a standard directory-based coherence protocol. Enhancing the xDRF analysis with the conflict isolation technique improves performance by 7.1 percent and energy efficiency by 15.9 percent.
BibTeX:
                            @article{ajimborean-tpds18,
                              author = {Jimborean, Alexandra and Ekemark, Per and Waern, Jonatan and Kaxiras, Stefanos and Ros, Alberto},
                              title = {Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2018},
                              volume = {29},
                              number = {3},
                              pages = {527--541},
                              doi = {10.1109/TPDS.2017.2771509}
                            }
                            
Kaxiras S, Carlson TE and Ros A (2018), "Non-Speculative Load Reordering in TSO", IEEE Micro (TopPicks).
BibTeX:
                            @article{skaxiras-toppicks18,
                              author = {Kaxiras, Stefanos and Carlson, Trevor E and Ros, Alberto},
                              title = {Non-Speculative Load Reordering in TSO},
                              journal = {IEEE Micro (TopPicks)},
                              year = {2018}
                            }
                            
Ros A and Kaxiras S (2018), "Non-Speculative store coalescing in total store order", Proceedings - International Symposium on Computer Architecture, pp. 221-234.
Abstract: We present a non-speculative solution for a coalescing store buffer in total store order (TSO) consistency. Coalescing violates TSO with respect to both conflicting loads and conflicting stores, if partial state is exposed to the memory system. Proposed solutions for coalescing in TSO resort to speculation-and-rollback or centralized arbitration to guarantee atomicity for the set of stores whose order is affected by coalescing. These solutions can suffer from scalability, complexity, resource-conflict deadlock, and livelock problems. A non-speculative solution that writes out coalesced cachelines, one at a time, over a typical directory-based MESI coherence layer, has the potential to transcend these problems if it can guarantee absence of deadlock in a practical way. There are two major problems for a non-speculative coalescing store buffer: i) how to present to the memory system a group of coalesced writes as atomic, and ii) how to not deadlock while attempting to do so. For this, we introduce a new lexicographical order. Relying on this order, conflicting atomic groups of coalesced writes can be individually performed per cache block, without speculation, rollback, or replay, and without deadlock or livelock, yet appear atomic to conflicting parties and preserve TSO. One of our major contributions is to show that lexicographical orders based on a small part of the physical address (sub-address order) are deadlock-free throughout the system when taking into account resource-conflict deadlocks. Our approach exceeds the performance and energy benefits of two baseline TSO store buffers and matches the coalescing (and energy savings) of a release-consistency store buffer, at comparable cost.
BibTeX:
                            @inproceedings{aros-isca18,
                              author = {Ros, Alberto and Kaxiras, Stefanos},
                              title = {Non-Speculative store coalescing in total store order},
                              booktitle = {Proceedings - International Symposium on Computer Architecture},
                              year = {2018},
                              pages = {221--234},
                              doi = {10.1109/ISCA.2018.00028}
                            }
                            
Ros A and Kaxiras S (2018), "The superfluous load queue", Proceedings of the Annual International Symposium on Microarchitecture, MICRO. Vol. 2018-Octob, pp. 95-107.
Abstract: In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are responsible for ensuring: i) correct forwarding of stores to loads and ii) correct ordering among loads (with respect to external stores). The first requirement safeguards the sequential semantics of program execution and applies to both serial and parallel code; the second requirement safeguards the semantics of coherence and consistency (e.g., TSO). In particular, loads search the SQ/SB for the latest value that may have been produced by a store, and stores and invalidations search the LQ to find speculative loads in case they violate uniprocessor or multiprocessor ordering. To meet timing constraints the LQ and SQ/SB system is composed of CAM structures that are frequently searched. This results in high complexity, cost, and significant difficulty to scale, but is the current state of the art. Prior research demonstrated the feasibility of a non-associative LQ by replaying loads at commit. There is a steep cost however: A significant increase in L1 accesses and contention for L1 ports. This is because prior work assumes Sequential Consistency and completely ignores the existence of a SB in the system. In contrast, we intentionally delay stores in the SB to achieve a total management of stores and loads in a core, while still supporting TSO. Our main result is that we eliminate the LQ without burdening the L1 with extra accesses. Store forwarding is achieved by delaying our own stores until speculatively issued loads are validated on commit, entirely in-core; TSO load→load ordering is preserved by delaying remote external stores in their SB until our own speculative reordered loads commit. While the latter is inspired by recent work on non-speculative load reordering, our contribution here is to show that this can be accomplished without having a load queue. Eliminating the LQ results in both energy savings and performance improvement from the elimination of LQ-induced stalls.
BibTeX:
                            @inproceedings{aros-micro18,
                              author = {Ros, Alberto and Kaxiras, Stefanos},
                              title = {The superfluous load queue},
                              booktitle = {Proceedings of the Annual International Symposium on Microarchitecture, MICRO},
                              year = {2018},
                              volume = {2018-Octob},
                              pages = {95--107},
                              doi = {10.1109/MICRO.2018.00017}
                            }
                            
Esteve A, Ros A, Gomez ME, Robles A and Duato J (2017), "TLB-Based Temporality-Aware Classification in CMPs with Multilevel TLBs", IEEE Transactions on Parallel and Distributed Systems. Vol. 28(8), pp. 2401-2413.
Abstract: Recent proposals are based on classifying memory accesses into private or shared in order to process private accesses more efficiently and reduce coherence overhead. The classification mechanisms previously proposed are either not able to adapt to the dynamic sharing behavior of the applications or require frequent broadcast messages. Additionally, most of these classification approaches assume single-level translation lookaside buffers (TLBs). However, deeper and more efficient TLB hierarchies, such as the ones implemented in current commodity processors, have not been appropriately explored. This paper analyzes accurate classification mechanisms in multilevel TLB hierarchies. In particular, we propose an efficient data classification strategy for systems with distributed shared last-level TLBs. Our approach classifies data accounting for temporal private accesses and constrains TLB-related traffic by issuing unicast messages on first-level TLB misses. When our classification is employed to deactivate coherence for private data in directory-based protocols, it improves the directory efficiency and, consequently, reduces coherence traffic to merely 53.0 percent, on average. Additionally, it avoids some of the overheads of previous classification approaches for purely private TLBs, improving average execution time by nearly 9 percent for large-scale systems.
BibTeX:
                            @article{aesteve-tpds17,
                              author = {Esteve, Albert and Ros, Alberto and Gomez, Maria E. and Robles, Antonio and Duato, Jose},
                              title = {TLB-Based Temporality-Aware Classification in CMPs with Multilevel TLBs},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2017},
                              volume = {28},
                              number = {8},
                              pages = {2401--2413},
                              doi = {10.1109/TPDS.2017.2658576}
                            }
                            
Fernández-Pascual R, Ros A and Acacio ME (2017), "To be silent or not: on the impact of evictions of clean data in cache-coherent multicores", Journal of Supercomputing. Vol. 73(10), pp. 4428-4443.
Abstract: Maintaining coherence across hundreds or even thousands of cores is not an easy task. Among all of the proposed solutions until now, directory-based cache coherence has been advocated as the most feasible way of beating the scalability hurdles that arise at such large scale. Thanks to the knowledge accumulated during the last four decades, there is general consensus on the impact of most of the design aspects of directory coherence on performance, energy consumption and cost. However, there is one subtle design point for which we have observed some divergences in contemporary research works on cache-coherent multicores. Specifically, while some recent works assume a silent replacement policy for evictions of clean data in the last-level private caches, others implement just the opposite that we call a noisy replacement policy, and even others do not mention how these evictions are managed. In this work, we put this important aspect into the spotlight, demonstrating that the way in which evictions of clean data are managed can have important influence on the performance and energy consumption of a directory-based cache coherence protocol. We show that the noisy replacement policy leads to a significant increase in the total traffic (around 20% in several cases, 9.6% on average) compared with the silent policy. Given the important fraction of the total power budget that the on-chip interconnection network of future manycores is expected to consume, assuming the silent replacement policy for clean data will lead to non-negligible energy savings. Moreover, and what is more important, we have observed that depending on the particular directory structure used, assuming silent replacements could affect performance or not. This means that the use of noisy replacements is not justified in all cases, since it would increase unnecessarily network traffic without leading to any performance advantages.
BibTeX:
                            @article{rfernandez-jsc17,
                              author = {Fernández-Pascual, Ricardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {To be silent or not: on the impact of evictions of clean data in cache-coherent multicores},
                              journal = {Journal of Supercomputing},
                              year = {2017},
                              volume = {73},
                              number = {10},
                              pages = {4428--4443},
                              doi = {10.1007/s11227-017-2026-6}
                            }
                            
Jimborean A, Waern J, Ekemark P, Kaxiras S and Ros A (2017), "Automatic detection of extended data-race-free regions", CGO 2017 - Proceedings of the 2017 International Symposium on Code Generation and Optimization. , pp. 14-26.
Abstract: Data-race-free (DRF) parallel programming becomes a standard as newly adopted memory models of mainstream programming languages such as C++ or Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free regions (xDRF), namely regions of code which provide the same guarantees as the synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges and preserve the data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. Our compiler techniques precisely analyze the threads' memory accessing behavior and data sharing in shared-memory, general-purpose parallel applications and can therefore infer the limits of xDRF code regions. We evaluate the potential of our technique by employing the xDRF region classification in a state-of-the-art, dual-mode cache coherence protocol. Larger xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.8%) and energy efficiency (11.7%) compared to a standard directory-based coherence protocol.
BibTeX:
                            @article{ajimborean-cgo17,
                              author = {Jimborean, Alexandra and Waern, Jonatan and Ekemark, Per and Kaxiras, Stefanos and Ros, Alberto},
                              title = {Automatic detection of extended data-race-free regions},
                              journal = {CGO 2017 - Proceedings of the 2017 International Symposium on Code Generation and Optimization},
                              year = {2017},
                              pages = {14--26},
                              doi = {10.1109/CGO.2017.7863725}
                            }
                            
Ros A, Carlson TE, Alipour M and Kaxiras S (2017), "Non-Speculative Load-Load Reordering in TSO", ACM SIGARCH Computer Architecture News. Vol. 45(2), pp. 187-200.
Abstract: In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show, for the first time, that it is not necessary to squash and re-execute speculatively reordered loads in TSO when their reordering is seen. Instead, the reordering can be hidden from other cores by the coherence protocol. The implication is that we can irrevocably bind speculative loads. This allows us to commit reordered loads out-of-order without having to wait (for the loads to become non-speculative) or without having to checkpoint committed state (and rollback if needed), just to ensure correctness in the rare case of some core seeing the reordering. We show that by exposing a reordering to the coherence layer and by appropriately modifying a typical directory protocol we can successfully hide load-load reordering without perceptible performance cost and without deadlock. Our solution is cost-effective and increases the performance of out-of-order commit by a sizable margin, compared to the base case where memory operations are not allowed to commit if the consistency model could be violated.
BibTeX:
                            @article{aros-isca17,
                              author = {Ros, Alberto and Carlson, Trevor E. and Alipour, Mehdi and Kaxiras, Stefanos},
                              title = {Non-Speculative Load-Load Reordering in TSO},
                              journal = {ACM SIGARCH Computer Architecture News},
                              year = {2017},
                              volume = {45},
                              number = {2},
                              pages = {187--200},
                              doi = {10.1145/3140659.3080220}
                            }
                            
Ros A, Leonardsson C, Sakalis C and Kaxiras S (2017), "Efficient self-invalidation/self-downgrade for critical sections with relaxed semantics", IEEE Transactions on Parallel and Distributed Systems. Vol. 28(12), pp. 3413-3425.
Abstract: Cache coherence protocols based on self-invalidation allow simpler hardware implementation compared to traditional write-invalidation protocols, by relying on data-race-free semantics and applying self-invalidation on synchronization points. Their simplicity lies in the absence of invalidation traffic. This eliminates the need to track readers in a directory, and reduces the number of transient protocol states. Similarly, the use of self-downgrade on synchronization eliminates directory indirection, and hence the need to track writers in a directory. These protocols, effectively without a directory, have the potential to reduce area, energy consumption, and complexity, without sacrificing performance, provided that self-invalidation and self-downgrade are performed prudently. In this work we examine how self-invalidation and self-downgrade are performed in relation to atomicity and ordering. We show that self-invalidation and self-downgrade do not need to be applied conservatively, as so far implemented. Our key observation is that, often, critical sections which are not ordered in time are intended to provide only atomicity and not thread synchronization. We thus propose a new type of self-invalidation, forward self-invalidation (FSI), which invalidates solely data that are going to be accessed inside a critical section. Based on the same reasoning, we propose a new type of self-downgrade, forward self-downgrade (FSD), also restricted to writes in critical sections. Finally, we define the semantics of locks using FSI and FSD, which resemble the semantics of relaxed atomic operations in C++.
Our evaluation for 64-core multiprocessors shows significant improvements using the proposed FSI and FSD - where applicable - in Splash-3 and PARSEC benchmarks, over a directory-based protocol (17.1 percent in execution time and 33.9 percent in energy consumption) and also over a state-of-the-art self-invalidation/self-downgrade protocol (7.6 percent in execution time and 9.1 percent in energy consumption), while still retaining the design simplicity of the protocol.
BibTeX:
                            @article{aros-tpds17,
                              author = {Ros, Alberto and Leonardsson, Carl and Sakalis, Christos and Kaxiras, Stefanos},
                              title = {Efficient self-invalidation/self-downgrade for critical sections with relaxed semantics},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2017},
                              volume = {28},
                              number = {12},
                              pages = {3413--3425},
                              doi = {10.1109/TPDS.2017.2720744}
                            }
                            
Titos-Gil R, Flores A, Fernández-Pascual R, Ros A and Acacio ME (2017), "Way-combining directory: An adaptive and scalable low-cost coherence directory", Proceedings of the International Conference on Supercomputing. Vol. Part F1284
Abstract: Today, general-purpose commercial multicores approaching one hundred cores are already a reality and even thousand core chips are being prototyped. Maintaining coherence across such a high number of cores in these manycore architectures requires careful design of the coherence directory used to keep track of current locations of the memory blocks at the private cache level. In this work we propose a novel organization for the coherence directory that builds on the brand-new concept of way combining. Particularly, our proposal employs just one pointer per entry, which is optimal for the common case of having just one sharer. For those addresses that require more than one pointer, we have observed that in the majority of cases extra pointers could be taken from other empty ways in the same set. Thus, our proposal minimizes the storage overheads without losing the flexibility to adapt to several sharing degrees and without the complexities of other previously proposed techniques. Through detailed simulations of a 128-core architecture, we show that the way-combining directory closely approaches the performance of a non-scalable bit-vector sparse directory, and beats other scalable state-of-the-art proposals.
BibTeX:
                            @article{rtitos-ics17,
                              author = {Titos-Gil, Rubén and Flores, Antonio and Fernández-Pascual, Ricardo and Ros, Alberto and Acacio, Manuel E.},
                              title = {Way-combining directory: An adaptive and scalable low-cost coherence directory},
                              journal = {Proceedings of the International Conference on Supercomputing},
                              year = {2017},
                              volume = {Part F1284},
                              doi = {10.1145/3079079.3079096}
                            }
                            
Valls JJ, Ros A, Gómez ME and Sahuquillo J (2017), "The Tag Filter Architecture: An energy-efficient cache and directory design", Journal of Parallel and Distributed Computing. Vol. 100, pp. 193-202. Elsevier Inc.
Abstract: Power consumption in current high-performance chip multiprocessors (CMPs) has become a major design concern that aggravates with the current trend of increasing the core count. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. On the other hand, coherence protocols also must implement efficient directory caches that scale in terms of power consumption. Most of the state-of-the-art techniques that reduce the energy consumption of directories are at the cost of performance, which may become unacceptable for high-performance CMPs. In this paper, we propose an energy-efficient architectural design that can be effectively applied to any kind of cache memory. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that, on average, the TF Architecture reduces the dynamic power consumption across the studied applications up to 74.9%, 85.9%, and 84.5% when applied to L1 caches, L2 caches, and directory caches, respectively.
BibTeX:
                            @article{jjvalls-jpdc17,
                              author = {Valls, Joan J. and Ros, Alberto and Gómez, María E. and Sahuquillo, Julio},
                              title = {The Tag Filter Architecture: An energy-efficient cache and directory design},
                              journal = {Journal of Parallel and Distributed Computing},
                              publisher = {Elsevier Inc.},
                              year = {2017},
                              volume = {100},
                              pages = {193--202},
                              doi = {10.1016/j.jpdc.2016.04.016}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2015), "Efficient hardware-supported synchronization mechanisms for manycores", Handbook on Data Centers. , pp. 753-803.
Abstract: In this chapter, we analyze and propose techniques to mitigate the problem of synchronization at server (manycore processor) level in datacenters. Particularly, we propose two different strategies that provide very efficient, scalable and lightweight hardware implementations for barriers and highly-contended locks. We implement our synchronization architectures using two different technologies. The first is a state-of-the-art full-custom technology, namely G-Lines, whilst the second is a cost-effective mainstream industrial toolflow with an advanced 45 nm technology, or Standard technology.
BibTeX:
                            @article{Abellan2015,
                              author = {Abellán, José L. and Fernández, Juan and Acacio, Manuel E.},
                              title = {Efficient hardware-supported synchronization mechanisms for manycores},
                              journal = {Handbook on Data Centers},
                              year = {2015},
                              pages = {753--803},
                              doi = {10.1007/978-1-4939-2092-1_26}
                            }
                            
Bernabé G, Guerrero G and Fernández J (2012), "CUDA and OpenCL Implementations of 3D Fast Wavelet Transform", In 3rd IEEE Latin American Symposium on Circuits and Systems (LASCAS). Playa del Carmen (Mexico), feb, 2012. IEEE Computer Society.
BibTeX:
                            @inproceedings{bernabe_lascas12,
                              author = {Bernabé, Gregorio and Guerrero, Ginés and Fernández, Juan},
                              title = {CUDA and OpenCL Implementations of 3D Fast Wavelet Transform},
                              booktitle = {3rd IEEE Latin American Symposium on Circuits and Systems (LASCAS)},
                              publisher = {IEEE Computer Society},
                              year = {2012},
                              doi = {10.1109/LASCAS.2012.6180318}
                            }
                            
García-Guirado A, Fernández-Pascual R, Ros A and García JM (2012), "DAPSCO: Distance-Aware Partially Shared Cache Organization", Journal of ACM Transactions on Architecture and Code Optimization (TACO)., jan, 2012. Vol. 8(4), pp. 25. Association for Computing Machinery (ACM).
BibTeX:
                            @article{garcia_taco12,
                              author = {García-Guirado, Antonio and Fernández-Pascual, Ricardo and Ros, Alberto and García, José M},
                              title = {DAPSCO: Distance-Aware Partially Shared Cache Organization},
                              journal = {Journal of ACM Transactions on Architecture and Code Optimization (TACO)},
                              publisher = {Association for Computing Machinery (ACM)},
                              year = {2012},
                              volume = {8},
                              number = {4},
                              pages = {25}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2011), "GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs", In Proc. of the 25th Int'l Parallel & Distributed Processing Symposium (IPDPS'11) (BEST PAPER in the Architectures Track). Anchorage, USA, may, 2011. , pp. 1-13. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{abellan_ipdps11,
                              author = {Abellán, José L and Fernández, Juan and Acacio, Manuel E},
                              title = {GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs},
                              booktitle = {Proc. of the 25th Int'l Parallel \& Distributed Processing Symposium (IPDPS'11) (BEST PAPER in the Architectures Track)},
                              publisher = {IEEE Computer Society Press},
                              year = {2011},
                              pages = {1--13}
                            }
                            
Cebrián JM, Aragón JL, García JM and Kaxiras S (2011), "Leakage-efficient Design of Value Predictors through State and Non-state Preserving Techniques", Journal of Supercomputing., jan, 2011. Vol. 55(1), pp. 28-50. Springer Netherlands.
BibTeX:
                            @article{cebrian_jsc11,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M and Kaxiras, Stefanos},
                              title = {Leakage-efficient Design of Value Predictors through State and Non-state Preserving Techniques},
                              journal = {Journal of Supercomputing},
                              publisher = {Springer Netherlands},
                              year = {2011},
                              volume = {55},
                              number = {1},
                              pages = {28--50}
                            }
                            
Cebrián JM, Aragón JL and Kaxiras S (2011), "Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads", In Proc. of the 25th IEEE Int. Parallel and Distributed Processing Symposium (IPDPS). Anchorage, Alaska, USA, may, 2011. , pp. 431-442.
BibTeX:
                            @inproceedings{cebrian_ipdps11,
                              author = {Cebrián, Juan M and Aragón, Juan L and Kaxiras, Stefanos},
                              title = {Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads},
                              booktitle = {Proc. of the 25th IEEE Int. Parallel and Distributed Processing Symposium (IPDPS)},
                              year = {2011},
                              pages = {431--442}
                            }
                            
Cebrián JM, Aragón JL and Kaxiras S (2011), "Token3D: Reducing Temperature in 3D die-stacked CMPs through Cycle-level Power Control Mechanisms", In Proc. of the 17th Int. Conference on Parallel and Distributed Computing (Euro-Par). Bordeaux, France, aug, 2011. , pp. 295-309.
BibTeX:
                            @inproceedings{cebrian_europar11,
                              author = {Cebrián, Juan M and Aragón, Juan L and Kaxiras, Stefanos},
                              title = {Token3D: Reducing Temperature in 3D die-stacked CMPs through Cycle-level Power Control Mechanisms},
                              booktitle = {Proc. of the 17th Int. Conference on Parallel and Distributed Computing (Euro-Par)},
                              year = {2011},
                              pages = {295--309}
                            }
                            
Cuesta B, Ros A, Gómez ME, Robles A and Duato J (2011), "Increasing the Effectiveness of Directory Caches by Avoiding the Tracking of Non-Coherent Memory Blocks", IEEE Transactions on Computers (TC)., dec, 2011. IEEE Computer Society.
BibTeX:
                            @article{cuesta_tc11,
                              author = {Cuesta, Blas and Ros, Alberto and Gómez, Maria E and Robles, Antonio and Duato, José},
                              title = {Increasing the Effectiveness of Directory Caches by Avoiding the Tracking of Non-Coherent Memory Blocks},
                              journal = {IEEE Transactions on Computers (TC)},
                              publisher = {IEEE Computer Society},
                              year = {2011}
                            }
                            
Cuesta B, Ros A, Gómez ME, Robles A and Duato J (2011), "Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks", In Proc. of the 38th International Symposium on Computer Architecture (ISCA-38). San José, California, jun, 2011. , pp. 93-103. Association for Computing Machinery (ACM).
BibTeX:
                            @inproceedings{cuesta_isca11,
                              author = {Cuesta, Blas and Ros, Alberto and Gómez, María E and Robles, Antonio and Duato, José},
                              title = {Increasing the Effectiveness of Directory Caches by Deactivating Coherence for Private Memory Blocks},
                              booktitle = {Proc. of the 38th International Symposium on Computer Architecture (ISCA-38)},
                              publisher = {Association for Computing Machinery (ACM)},
                              year = {2011},
                              pages = {93--103}
                            }
                            
Cuesta B, Ros A, Gómez ME, Robles A and Duato J (2011), "Overriding the Coherence Protocol to Improve Directory Caches", In Actas de las XXII Jornadas de Paralelismo. La Laguna, Tenerife, sep, 2011. , pp. 197-202.
BibTeX:
                            @inproceedings{cuesta_jp11,
                              author = {Cuesta, Blas and Ros, Alberto and Gómez, María E and Robles, Antonio and Duato, José},
                              title = {Overriding the Coherence Protocol to Improve Directory Caches},
                              booktitle = {Actas de las XXII Jornadas de Paralelismo},
                              year = {2011},
                              pages = {197--202}
                            }
                            
García-Guirado A, Fernández-Pascual R, Ros A and García JM (2011), "DAPSCO: Distance-Aware Partially Shared Cache Organization", Proc. of the 7th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC-7).. Thesis at: Ponencia. Paris, France, jan, 2011. HiPEAC Network of Excellence.
BibTeX:
                            @article{garcia_hipeac12,
                              author = {García-Guirado, Antonio and Fernández-Pascual, Ricardo and Ros, Alberto and García, José M},
                              title = {DAPSCO: Distance-Aware Partially Shared Cache Organization},
                              journal = {Proc. of the 7th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC-7)},
                              publisher = {HiPEAC Network of Excellence},
                              school = {Ponencia},
                              year = {2011}
                            }
                            
García-Guirado A, Fernández-Pascual R, Ros A and García JM (2011), "Energy-Efficient Cache Coherence Protocols in Chip-Multiprocessors for Server Consolidation", In Proc. of the 40th International Conference on Parallel Processing (ICPP-2011). Taipei, Taiwan, sep, 2011. , pp. 51-62. IEEE Computer Society.
BibTeX:
                            @inproceedings{garcia_icpp11,
                              author = {García-Guirado, Antonio and Fernández-Pascual, Ricardo and Ros, Alberto and García, José M},
                              title = {Energy-Efficient Cache Coherence Protocols in Chip-Multiprocessors for Server Consolidation},
                              booktitle = {Proc. of the 40th International Conference on Parallel Processing (ICPP-2011)},
                              publisher = {IEEE Computer Society},
                              year = {2011},
                              pages = {51--62}
                            }
                            
Negi A, Titos R, Acacio ME, García JM and Stenström P (2011), "Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory", In Proc. of the 40th Int'l Conference on Parallel Processing (ICPP-2011). Taipei, Taiwan, sep, 2011. , pp. 73-82. IEEE Computer Society.
BibTeX:
                            @inproceedings{negi_icpp11,
                              author = {Negi, Anurag and Titos, Rubén and Acacio, Manuel E and García, José M and Stenström, Per},
                              title = {Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory},
                              booktitle = {Proc. of the 40th Int'l Conference on Parallel Processing (ICPP-2011)},
                              publisher = {IEEE Computer Society},
                              year = {2011},
                              pages = {73--82}
                            }
                            
Negi A, Titos R, Acacio ME, García JM and Stenström P (2011), "The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems", In Proc. of the 13th Workshop on Advances on Parallel and Distributed Processing Symposium (APDCM 2011), in conjunction with IPDPS 2011. Anchorage, Alaska, USA, may, 2011. , pp. 700-707. IEEE Computer Society.
BibTeX:
                            @inproceedings{negi_apdcm11,
                              author = {Negi, Anurag and Titos, Rubén and Acacio, Manuel E and García, José M and Stenström, Per},
                              title = {The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems},
                              booktitle = {Proc. of the 13th Workshop on Advances on Parallel and Distributed Processing Symposium (APDCM 2011), in conjunction with IPDPS 2011},
                              publisher = {IEEE Computer Society},
                              year = {2011},
                              pages = {700--707}
                            }
                            
Negi A, Titos-Gil R, Acacio ME, García JM and Stenstrom P (2011), "Eager meets lazy: The impact of write-buffering on hardware transactional memory", Proceedings of the International Conference on Parallel Processing. , pp. 73-82.
Abstract: Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several design policies has been discussed at length in prior work within the framework of scalable chip-multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated and performance deviations due to the presence or absence of these optimizations remain unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights, related to the interplay between buffering mechanisms, system policies and workload characteristics, contained in this paper clearly distinguish gains in performance to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information would facilitate sound design decisions when incorporating HTMs into parallel architectures.
BibTeX:
                            @article{Negi2011a,
                              author = {Negi, Anurag and Titos-Gil, Rubén and Acacio, Manuel E. and García, José M. and Stenstrom, Per},
                              title = {Eager meets lazy: The impact of write-buffering on hardware transactional memory},
                              journal = {Proceedings of the International Conference on Parallel Processing},
                              year = {2011},
                              pages = {73--82},
                              doi = {10.1109/ICPP.2011.63}
                            }
                            
Negi A, Titos-Gil R, Acacio ME, García JM and Stenstrom P (2011), "The impact of non-coherent buffers on lazy hardware transactional memory systems", IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. , pp. 700-707.
Abstract: When supported in silicon, transactional memory (TM) promises to become a fast, simple and scalable parallel programming paradigm for future shared memory multiprocessor systems. Among the multitude of hardware TM design points and policies that have been studied so far, lazy conflict resolution designs often extract the most concurrency, but their inherent need for lazy versioning requires careful management of speculative updates. In this paper we study how coherent buffering, in private caches for example, as has been proposed in several hardware TM proposals, can lead to inefficiencies. We then show how such inefficiencies can be substantially mitigated by using complete or partial non-coherent buffering of speculative writes in dedicated structures or suitably adapted standard per-core write-buffers. These benefits are particularly noticeable in scenarios involving large coarse grained transactions that may write a lot of non-contended data in addition to actively shared data. We believe our analysis provides important insights into some overlooked aspects of TM behaviour and would prove useful to designers wishing to implement lazy TM schemes in hardware.
BibTeX:
                            @article{Negi2011,
                              author = {Negi, Anurag and Titos-Gil, Rubén and Acacio, Manuel E. and García, José M. and Stenstrom, Per},
                              title = {The impact of non-coherent buffers on lazy hardware transactional memory systems},
                              journal = {IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum},
                              year = {2011},
                              pages = {700--707},
                              doi = {10.1109/IPDPS.2011.205}
                            }
                            
Ros A, Cuesta B, Fernández-Pascual R, Gómez ME, Acacio ME, Robles A, García JM and Duato J (2011), "Extending Magny-Cours Cache Coherence", IEEE Transactions on Computers (TC)., apr, 2011. IEEE Computer Society.
BibTeX:
                            @article{ros_tc11,
                              author = {Ros, Alberto and Cuesta, Blas and Fernández-Pascual, Ricardo and Gómez, Maria E and Acacio, Manuel E and Robles, Antonio and García, José M and Duato, José},
                              title = {Extending Magny-Cours Cache Coherence},
                              journal = {IEEE Transactions on Computers (TC)},
                              publisher = {IEEE Computer Society},
                              year = {2011}
                            }
                            
Ros A, Cuesta B, Fernández-Pascual R, Gómez ME, Acacio ME, Robles A, García JM and Duato J (2011), "Overcoming the Scalability Constraints of Coherence Protocols of Commodity Systems", In Actas de las XXII Jornadas de Paralelismo. La Laguna, Tenerife, sep, 2011. , pp. 203-208.
BibTeX:
                            @inproceedings{ros_jp11,
                              author = {Ros, Alberto and Cuesta, Blas and Fernández-Pascual, Ricardo and Gómez, María E and Acacio, Manuel E and Robles, Antonio and García, José M and Duato, José},
                              title = {Overcoming the Scalability Constraints of Coherence Protocols of Commodity Systems},
                              booktitle = {Actas de las XXII Jornadas de Paralelismo},
                              year = {2011},
                              pages = {203--208}
                            }
                            
Sánchez D, Aragón JL and García JM (2011), "A Fault-Tolerant Architecture for Parallel Applications in Tiled-CMPs", Journal of Supercomputing, [doi: 10.1007/s11227-011-0670-9]. Springer Netherlands.
BibTeX:
                            @article{sanchez_jsc11,
                              author = {Sánchez, Daniel and Aragón, Juan L and García, José M},
                              title = {A Fault-Tolerant Architecture for Parallel Applications in Tiled-CMPs},
                              journal = {Journal of Supercomputing},
                              doi = {10.1007/s11227-011-0670-9},
                              publisher = {Springer Netherlands},
                              year = {2011}
                            }
                            
Sánchez D, Sazeides Y, Aragón JL and García JM (2011), "An Analytical Model for the Calculation of the Expected Miss Ratio in Faulty Caches", In Proc. of the 17th IEEE International On-Line Testing Symposium (IOLTS). Athens, Greece, jul, 2011. , pp. 287-292.
BibTeX:
                            @inproceedings{sanchez_iolts11,
                              author = {Sánchez, Daniel and Sazeides, Yiannakis and Aragón, Juan L and García, José M},
                              title = {An Analytical Model for the Calculation of the Expected Miss Ratio in Faulty Caches},
                              booktitle = {Proc. of the 17th IEEE International On-Line Testing Symposium (IOLTS)},
                              year = {2011},
                              pages = {287--292}
                            }
                            
Titos R, Negi A, Acacio ME, García JM and Stenström P (2011), "ZEBRA: A Data-Centric, Hybrid-Policy Hardware Transactional Memory Design", In Proc. of the 25th Int'l Conference on Supercomputing (ICS-2011). Tucson, Arizona, USA, jun, 2011. , pp. 53-62. ACM Press.
BibTeX:
                            @inproceedings{titos_ics11,
                              author = {Titos, Rubén and Negi, Anurag and Acacio, Manuel E and García, José M and Stenström, Per},
                              title = {ZEBRA: A Data-Centric, Hybrid-Policy Hardware Transactional Memory Design},
                              booktitle = {Proc. of the 25th Int'l Conference on Supercomputing (ICS-2011)},
                              publisher = {ACM Press},
                              year = {2011},
                              pages = {53--62}
                            }
                            
Triviño F, Andujar FJ, Sánchez JL, Alfaro FJ and Ros A (2011), "Self-Related Traces: An Alternative to Full-System Simulation for Networks-On-Chip", In Proc. of the International Conference on High Performance Computing & Simulation (HPCS-2011). Istanbul, Turkey, jul, 2011. , pp. 819-824. IEEE Computer Society.
BibTeX:
                            @inproceedings{triviño_hpcs11,
                              author = {Triviño, Francisco and Andujar, Francisco J and Sánchez, José L and Alfaro, Francisco J and Ros, Alberto},
                              title = {Self-Related Traces: An Alternative to Full-System Simulation for Networks-On-Chip},
                              booktitle = {Proc. of the International Conference on High Performance Computing \& Simulation (HPCS-2011)},
                              publisher = {IEEE Computer Society},
                              year = {2011},
                              pages = {819--824}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2010), "A G-line-based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs", In Proc. of the 39th Int'l Conference on Parallel Processing (ICPP'10). San Diego, USA, sep, 2010. , pp. 267-276. IEEE Computer Society.
BibTeX:
                            @inproceedings{abellan_icpp10,
                              author = {Abellán, José L and Fernández, Juan and Acacio, Manuel E},
                              title = {A G-line-based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs},
                              booktitle = {Proc. of the 39th Int'l Conference on Parallel Processing (ICPP'10)},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              pages = {267--276}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2010), "Characterizing the Basic Synchronization and Communication Operations in Dual Cell-based Blades through CellStats", Journal of Supercomputing., aug, 2010. Vol. 53(2), pp. 247-268. Springer-Verlag.
BibTeX:
                            @article{abellan_jsc10,
                              author = {Abellán, José L and Fernández, Juan and Acacio, Manuel E},
                              title = {Characterizing the Basic Synchronization and Communication Operations in Dual Cell-based Blades through CellStats},
                              journal = {Journal of Supercomputing},
                              publisher = {Springer-Verlag},
                              year = {2010},
                              volume = {53},
                              number = {2},
                              pages = {247--268}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2010), "Efficient and Scalable Barrier Synchronization for Many-Core CMPs", In Proc. of the ACM Int'l Conference on Computing Frontiers (CF'10). Bertinoro, Italy, may, 2010. , pp. 73-74. ACM.
BibTeX:
                            @inproceedings{abellan_cf10,
                              author = {Abellán, José L and Fernández, Juan and Acacio, Manuel E},
                              title = {Efficient and Scalable Barrier Synchronization for Many-Core CMPs},
                              booktitle = {Proc. of the ACM Int'l Conference on Computing Frontiers (CF'10)},
                              publisher = {ACM},
                              year = {2010},
                              pages = {73--74}
                            }
                            
Adhianto L, Banerjee S, Fagan M, Krentel M, Marin G, Mellor-Crummey J and Tallent NR (2010), "HPCTOOLKIT: Tools for performance analysis of optimized parallel programs", Concurrency Computation Practice and Experience. Vol. 22(6), pp. 685-701.
Abstract: HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space-time diagrams based on traces of asynchronous call path samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications. Copyright © 2009 John Wiley & Sons, Ltd.
BibTeX:
                            @article{jmcebrian-cpe17,
                              author = {Adhianto, L. and Banerjee, S. and Fagan, M. and Krentel, M. and Marin, G. and Mellor-Crummey, J. and Tallent, N. R.},
                              title = {HPCTOOLKIT: Tools for performance analysis of optimized parallel programs},
                              journal = {Concurrency Computation Practice and Experience},
                              year = {2010},
                              volume = {22},
                              number = {6},
                              pages = {685--701},
                              doi = {10.1002/cpe}
                            }
                            
Avilés-González A, Piernas J and Gonzalez-Ferez P (2010), "A Metadata Cluster Based on OSD+ Devices", In Actas de las XXI Jornadas de Paralelismo. Valencia, España , pp. 331-338. IBERGARCETA PUBLICACIONES, S.L.
BibTeX:
                            @inproceedings{aviles_jornadas10,
                              author = {Avilés-González, Ana and Piernas, Juan and Gonzalez-Ferez, Pilar},
                              title = {A Metadata Cluster Based on OSD+ Devices},
                              booktitle = {Actas de las XXI Jornadas de Paralelismo},
                              publisher = {IBERGARCETA PUBLICACIONES, S.L},
                              year = {2010},
                              pages = {331--338}
                            }
                            
Cebrián JM, Aragón JL and Kaxiras S (2010), "Mechanisms to Match Power Constraints in CMPs", In Actas de las XXI Jornadas de Paralelismo. Valencia, sep, 2010. , pp. 185-192.
BibTeX:
                            @inproceedings{cebrian_jp10,
                              author = {Cebrián, Juan M and Aragón, Juan L and Kaxiras, Stefanos},
                              title = {Mechanisms to Match Power Constraints in CMPs},
                              booktitle = {Actas de las XXI Jornadas de Paralelismo},
                              year = {2010},
                              pages = {185--192}
                            }
                            
Fernández-Pascual R, García JM, Acacio ME and Duato J (2010), "Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level", IEEE Transactions on Parallel and Distributed Systems., aug, 2010. Vol. 21(8), pp. 1117-1131. IEEE Computer Society.
BibTeX:
                            @article{fernandez_tpds10,
                              author = {Fernández-Pascual, Ricardo and García, José M and Acacio, Manuel E and Duato, José},
                              title = {Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              volume = {21},
                              number = {8},
                              pages = {1117--1131}
                            }
                            
Flores A, Acacio ME and Aragón JL (2010), "Exploiting Address Compression and Heterogeneous Interconnects for Efficient Message Management in Tiled CMPs", Journal of Systems Architecture., sep, 2010. Vol. 56(9), pp. 429-441. Elsevier Science Publishers.
BibTeX:
                            @article{flores_jsa10,
                              author = {Flores, Antonio and Acacio, Manuel E and Aragón, Juan L},
                              title = {Exploiting Address Compression and Heterogeneous Interconnects for Efficient Message Management in Tiled CMPs},
                              journal = {Journal of Systems Architecture},
                              publisher = {Elsevier Science Publishers},
                              year = {2010},
                              volume = {56},
                              number = {9},
                              pages = {429--441}
                            }
                            
Flores A, Aragón JL and Acacio ME (2010), "Energy-Efficient Hardware Prefetching for CMPs Using Heterogeneous Interconnects", In Proc. of the 18th Euromicro Int'l Conference on Parallel, Distributed and Network-Based Computing (EUROMICRO-PDP'10). Pisa, Italy, feb, 2010. , pp. 147-154. IEEE Computer Society.
BibTeX:
                            @inproceedings{flores_pdp10,
                              author = {Flores, Antonio and Aragón, Juan L and Acacio, Manuel E},
                              title = {Energy-Efficient Hardware Prefetching for CMPs Using Heterogeneous Interconnects},
                              booktitle = {Proc. of the 18th Euromicro Int'l Conference on Parallel, Distributed and Network-Based Computing (EUROMICRO-PDP'10)},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              pages = {147--154}
                            }
                            
Flores A, Aragón JL and Acacio ME (2010), "Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs", IEEE Transactions on Computers. Washington, DC, USA, jan, 2010. Vol. 59(1), pp. 16-28. IEEE Computer Society.
BibTeX:
                            @article{flores_tc10,
                              author = {Flores, Antonio and Aragón, Juan L and Acacio, Manuel E},
                              title = {Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs},
                              journal = {IEEE Transactions on Computers},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              volume = {59},
                              number = {1},
                              pages = {16--28}
                            }
                            
Franco J, Bernabé G, Fernández J and Ujaldón M (2010), "Parallel 3D Fast Wavelet Transform on Manycore GPUs and Multicore CPUs", In Proc. of the Int'l Workshop on Emerging Parallel Architectures, held in conjunction with ICCS'10. Amsterdam, The Netherlands, jun, 2010. Vol. 1(1), pp. 1101-1110. Elsevier Science Publishers.
BibTeX:
                            @inproceedings{franco_iccs10,
                              author = {Franco, Joaquín and Bernabé, Gregorio and Fernández, Juan and Ujaldón, Manuel},
                              series = {Procedia Computer Science},
                              title = {Parallel 3D Fast Wavelet Transform on Manycore GPUs and Multicore CPUs},
                              booktitle = {Proc. of the Int'l Workshop on Emerging Parallel Architectures, held in conjunction with ICCS'10},
                              publisher = {Elsevier Science Publishers},
                              year = {2010},
                              volume = {1},
                              number = {1},
                              pages = {1101--1110}
                            }
                            
Gaona E, Titos R, Fernández J and Acacio ME (2010), "Characterizing Energy Consumption in Hardware Transactional Memory Systems", In Proc. of the 22nd Int'l Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'10). Petropolis, Brazil, oct, 2010. , pp. 9-16. IEEE Computer Society.
BibTeX:
                            @inproceedings{gaona_sbacpad10,
                              author = {Gaona, Epifanio and Titos, Rubén and Fernández, Juan and Acacio, Manuel E},
                              title = {Characterizing Energy Consumption in Hardware Transactional Memory Systems},
                              booktitle = {Proc. of the 22nd Int'l Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'10)},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              pages = {9--16}
                            }
                            
Gonzalez-Ferez P, Piernas J and Cortes T (2010), "Simultaneous Evaluation of Multiple I/O Strategies", In Actas de las XXI Jornadas de Paralelismo. Valencia, España , pp. 315-322. IBERGARCETA PUBLICACIONES, S.L.
BibTeX:
                            @inproceedings{gonzalez_jornadas10,
                              author = {Gonzalez-Ferez, Pilar and Piernas, Juan and Cortes, Toni},
                              title = {Simultaneous Evaluation of Multiple I/O Strategies},
                              booktitle = {Actas de las XXI Jornadas de Paralelismo},
                              publisher = {IBERGARCETA PUBLICACIONES, S.L},
                              year = {2010},
                              pages = {315--322}
                            }
                            
Gonzalez-Ferez P, Piernas J and Cortes T (2010), "Simultaneous Evaluation of Multiple I/O Strategies", In Proc. of the 22nd Int'l Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2010). Petrópolis, Brasil, oct, 2010. , pp. 183-190. IEEE Computer Society.
BibTeX:
                            @inproceedings{gonzalez_sbac_pad10,
                              author = {Gonzalez-Ferez, Pilar and Piernas, Juan and Cortes, Toni},
                              title = {Simultaneous Evaluation of Multiple I/O Strategies},
                              booktitle = {Proc. of the 22nd Int'l Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2010)},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              pages = {183--190}
                            }
                            
Petoumenos P, Psychou G, Kaxiras S, Cebrián JM and Aragón JL (2010), "MLP-aware Instruction Queue Resizing: The Key to Power-Efficient Performance", In Proc. of the 23rd Int. Conference on Architecture of Computing Systems (ARCS). Hannover, Germany, feb, 2010. , pp. 113-125. Springer-Verlag.
BibTeX:
                            @inproceedings{petoumenos_arcs10,
                              author = {Petoumenos, Pavlos and Psychou, Georgia and Kaxiras, Stefanos and Cebrián, Juan M and Aragón, Juan L},
                              title = {MLP-aware Instruction Queue Resizing: The Key to Power-Efficient Performance},
                              booktitle = {Proc. of the 23rd Int. Conference on Architecture of Computing Systems (ARCS)},
                              publisher = {Springer-Verlag},
                              year = {2010},
                              pages = {113--125}
                            }
                            
Piernas J and Nieplocha J (2010), "Implementation and Evaluation of Active Storage in Modern Parallel File Systems", Parallel Computing., jan, 2010. Vol. 36(1), pp. 26-47. Elsevier Science BV.
BibTeX:
                            @article{piernas_parco10,
                              author = {Piernas, Juan and Nieplocha, Jarek},
                              title = {Implementation and Evaluation of Active Storage in Modern Parallel File Systems},
                              journal = {Parallel Computing},
                              publisher = {Elsevier Science BV},
                              year = {2010},
                              volume = {36},
                              number = {1},
                              pages = {26--47}
                            }
                            
Ros A (2010), "Efficient and Scalable Cache Coherence for Chip Multiprocessors: Novel Proposals for Managing Cache Coherence in Future Many-Core Chip Multiprocessors", feb, 2010. , pp. 196. LAP Lambert Academic Publishing.
BibTeX:
                            @inbook{ros_lap10,
                              author = {Ros, Alberto},
                              title = {Efficient and Scalable Cache Coherence for Chip Multiprocessors: Novel Proposals for Managing Cache Coherence in Future Many-Core Chip Multiprocessors},
                              publisher = {LAP Lambert Academic Publishing},
                              year = {2010},
                              pages = {196}
                            }
                            
Ros A and Acacio ME (2010), "Low-Overhead Organizations for the Directory in Future Many-Core CMPs", In Proc. of the 4th Workshop on Highly Parallel Processing on a Chip (HPPC'10). Ischia, Italy, aug, 2010. , pp. 87-97. Springer-Verlag.
BibTeX:
                            @inproceedings{ros_hppc10,
                              author = {Ros, Alberto and Acacio, Manuel E},
                              title = {Low-Overhead Organizations for the Directory in Future Many-Core CMPs},
                              booktitle = {Proc. of the 4th Workshop on Highly Parallel Processing on a Chip (HPPC'10)},
                              publisher = {Springer-Verlag},
                              year = {2010},
                              pages = {87--97}
                            }
                            
Ros A, Acacio ME and García JM (2010), "Parallel and Distributed Computing", jan, 2010. , pp. 93-118. In-Tech.
BibTeX:
                            @inbook{ros_pdc10,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              editor = {Ros, Alberto},
                              title = {Parallel and Distributed Computing},
                              publisher = {In-Tech},
                              year = {2010},
                              pages = {93--118}
                            }
                            
Ros A, Acacio ME and García JM (2010), "A Direct Coherence Protocol for Many-Core Chip Multiprocessors", IEEE Transactions on Parallel and Distributed Systems (TPDS)., dec, 2010. Vol. 21(12), pp. 1779-1792. IEEE Computer Society.
BibTeX:
                            @article{ros_tpds10,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {A Direct Coherence Protocol for Many-Core Chip Multiprocessors},
                              journal = {IEEE Transactions on Parallel and Distributed Systems (TPDS)},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              volume = {21},
                              number = {12},
                              pages = {1779--1792}
                            }
                            
Ros A, Acacio ME and García JM (2010), "A Scalable Organization for Distributed Directories", Journal of Systems Architecture., mar, 2010. Vol. 56(2-3), pp. 77-87. Elsevier Science Publishers.
BibTeX:
                            @article{ros_jsa10,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {A Scalable Organization for Distributed Directories},
                              journal = {Journal of Systems Architecture},
                              publisher = {Elsevier Science Publishers},
                              year = {2010},
                              volume = {56},
                              number = {2--3},
                              pages = {77--87}
                            }
                            
Ros A, Acacio ME and García JM (2010), "A scalable organization for distributed directories", Journal of Systems Architecture. Vol. 56(2-3), pp. 77-87. Elsevier B.V..
Abstract: Although directory-based cache-coherence protocols are the best choice when designing chip multiprocessors with tens of cores on-chip, the memory overhead introduced by the directory structure may not scale gracefully with the number of cores. Many approaches aimed at improving the scalability of directories have been proposed. However, they do not bring perfect scalability and usually reduce the directory memory overhead by compressing coherence information, which in turn results in extra unnecessary coherence messages and, therefore, wasted energy and some performance degradation. In this work, we present a distributed directory organization based on duplicate tags for tiled CMP architectures whose size is independent on the number of tiles of the system up to a certain number of tiles. We demonstrate that this number of tiles corresponds to the number of sets in the private caches. Additionally, we show that the area overhead of the proposed directory structure is 0.56% with respect to the on-chip data caches. Moreover, the proposed directory structure keeps the same information than a non-scalable full-map directory. Finally, we propose a mechanism that takes advantage of this directory organization to remove the network traffic caused by replacements. This mechanism reduces total traffic by 15% for a 16-core configuration compared to a traditional directory-based protocol. © 2009 Elsevier B.V. All rights reserved.
BibTeX:
                            @article{Ros2010a,
                              author = {Ros, Alberto and Acacio, Manuel E. and García, José M.},
                              title = {A scalable organization for distributed directories},
                              journal = {Journal of Systems Architecture},
                              publisher = {Elsevier B.V.},
                              year = {2010},
                              volume = {56},
                              number = {2-3},
                              pages = {77--87},
                              url = {http://dx.doi.org/10.1016/j.sysarc.2009.11.006},
                              doi = {10.1016/j.sysarc.2009.11.006}
                            }
                            
Ros A, Cintra M, Acacio ME and García JM (2010), "A Novel Mapping Policy for Distributed Shared Caches", In Actas de las XXI Jornadas de Paralelismo (JP'10). Valencia , pp. 209-216.
BibTeX:
                            @inproceedings{ros_jp10,
                              author = {Ros, Alberto and Cintra, Marcelo and Acacio, Manuel E and García, José M},
                              title = {A Novel Mapping Policy for Distributed Shared Caches},
                              booktitle = {Actas de las XXI Jornadas de Paralelismo (JP'10)},
                              year = {2010},
                              pages = {209--216}
                            }
                            
Ros A, Cuesta B, Fernández-Pascual R, Gómez ME, Acacio ME, Robles A, García JM and Duato J (2010), "EMC$^2$: Extending Magny-Cours Coherence for Large-Scale Servers", In Proc. of the 17th Int'l Conference on High Performance Computing (HiPC-2010). Goa, India, dec, 2010. , pp. 1-10. IEEE Computer Society.
BibTeX:
                            @inproceedings{ros_hipc10,
                              author = {Ros, Alberto and Cuesta, Blas and Fernández-Pascual, Ricardo and Gómez, Maria E and Acacio, Manuel E and Robles, Antonio and García, José M and Duato, José},
                              title = {EMC$^2$: Extending Magny-Cours Coherence for Large-Scale Servers},
                              booktitle = {Proc. of the 17th Int'l Conference on High Performance Computing (HiPC-2010)},
                              publisher = {IEEE Computer Society},
                              year = {2010},
                              pages = {1--10}
                            }
                            
Sánchez D, Aragón JL and García JM (2010), "A Log-Based Redundant Architecture for Reliable Parallel Computation", In Proc. of the 17th Int. Conference on High Performance Computing (HiPC). Goa, India, dec, 2010. , pp. 1-10.
BibTeX:
                            @inproceedings{sanchez_hipc10,
                              author = {Sánchez, Daniel and Aragón, Juan L and García, José M},
                              title = {A Log-Based Redundant Architecture for Reliable Parallel Computation},
                              booktitle = {Proc. of the 17th Int. Conference on High Performance Computing (HiPC)},
                              year = {2010},
                              pages = {1--10}
                            }
                            
Østby KE, Aragón JL, García JM and Ujaldón M (2010), "FATSEA – An Architectural Simulator for General Purpose Computing on GPUs", In Proc. of the 2nd Workshop on Rapid Simulation & Performance Evaluation: Methods and Tools (RAPIDO), held in conjunction with HiPEAC'10. Pisa, Italy, jan, 2010. , pp. 1-6.
BibTeX:
                            @inproceedings{ostby_rapido10,
                              author = {Østby, Kenneth E and Aragón, Juan L and García, José M and Ujaldón, Manuel},
                              title = {FATSEA – An Architectural Simulator for General Purpose Computing on GPUs},
                              booktitle = {Proc. of the 2nd Workshop on Rapid Simulation \& Performance Evaluation: Methods and Tools (RAPIDO), held in conjunction with HiPEAC'10},
                              year = {2010},
                              pages = {1--6}
                            }
                            
Bernabé G, García JM and González J (2009), "A Lossy 3D Wavelet Transform for High-Quality Compression of Medical Video", Journal of Systems & Software., mar, 2009. Vol. 82(3), pp. 526-534. Elsevier.
BibTeX:
                            @article{bernabe_jss09,
                              author = {Bernabé, Gregorio and García, José M and González, José},
                              title = {A Lossy 3D Wavelet Transform for High-Quality Compression of Medical Video},
                              journal = {Journal of Systems \& Software},
                              publisher = {Elsevier},
                              year = {2009},
                              volume = {82},
                              number = {3},
                              pages = {526--534},
                              url = {http://www.sciencedirect.com/science/article/pii/S0164121208002185},
                              doi = {10.1016/j.jss.2008.09.034}
                            }
                            
Bernabé G, García JM and González J (2009), "A lossy 3D wavelet transform for high-quality compression of medical video", Journal of Systems and Software. Vol. 82(3), pp. 526-534. Elsevier Inc..
Abstract: In this paper, we present a lossy compression scheme based on the application of the 3D fast wavelet transform to code medical video. This type of video has special features, such as its representation in gray scale, its very few interframe variations, and the quality requirements of the reconstructed images. These characteristics as well as the social impact of the desired applications demand a design and implementation of coding schemes especially oriented to exploit them. We analyze different parameters of the codification process, such as the utilization of different wavelets functions, the number of steps the wavelet function is applied to, the way the thresholds are chosen, and the selected methods in the quantization and entropy encoder. In order to enhance our original encoder, we propose several improvements in the entropy encoder: 3D-conscious run-length, hexadecimal coding and the application of arithmetic coding instead of Huffman. Our coder achieves a good trade-off between compression ratio and quality of the reconstructed video. We have also compared our scheme with MPEG-2 and EZW, obtaining better compression ratios up to 119% and 46%, respectively for the same PSNR. © 2008 Elsevier Inc. All rights reserved.
BibTeX:
                            @article{Bernabe2009,
                              author = {Bernabé, Gregorio and García, Jose M. and González, José},
                              title = {A lossy 3D wavelet transform for high-quality compression of medical video},
                              journal = {Journal of Systems and Software},
                              publisher = {Elsevier Inc.},
                              year = {2009},
                              volume = {82},
                              number = {3},
                              pages = {526--534},
                              url = {http://dx.doi.org/10.1016/j.jss.2008.09.034},
                              doi = {10.1016/j.jss.2008.09.034}
                            }
                            
Cebrián JM, Aragón JL, García JM, Petoumenos P and Kaxiras S (2009), "Efficient Microarchitecture Policies for Accurately Adapting to Power Constraints", In Proc. of the 23rd IEEE Int. Parallel and Distributed Processing Symposium (IPDPS). Rome, Italy, may, 2009. , pp. 1-12.
BibTeX:
                            @inproceedings{cebrian_ipdps09,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M and Petoumenos, Pavlos and Kaxiras, Stefanos},
                              title = {Efficient Microarchitecture Policies for Accurately Adapting to Power Constraints},
                              booktitle = {Proc. of the 23rd IEEE Int. Parallel and Distributed Processing Symposium (IPDPS)},
                              year = {2009},
                              pages = {1--12}
                            }
                            
Cebrián JM, Aragón JL, García JM, Petoumenos P and Kaxiras S (2009), "Energy-Efficient Power Budget Matching", In Actas de las XX Jornadas de Paralelismo. A Coruña, sep, 2009. , pp. 195-200.
BibTeX:
                            @inproceedings{cebrian_jp09,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M and Petoumenos, Pavlos and Kaxiras, Stefanos},
                              title = {Energy-Efficient Power Budget Matching},
                              booktitle = {Actas de las XX Jornadas de Paralelismo},
                              year = {2009},
                              pages = {195--200}
                            }
                            
Franco J, Bernabé G, Acacio ME and Fernández J (2009), "A Parallel Implementation of the 2D Wavelet Transform Using CUDA", In Proc. of the 17th Int'l Euromicro Conference on Parallel, Distributed and Network-Based Computing (PDP'09). Weimar, Germany, feb, 2009. , pp. 111-118. IEEE Computer Society.
BibTeX:
                            @inproceedings{franco_pdp09,
                              author = {Franco, Joaquín and Bernabé, Gregorio and Acacio, Manuel E and Fernández, Juan},
                              title = {A Parallel Implementation of the 2D Wavelet Transform Using CUDA},
                              booktitle = {Proc. of the 17th Int'l Euromicro Conference on Parallel, Distributed and Network-Based Computing (PDP'09)},
                              publisher = {IEEE Computer Society},
                              year = {2009},
                              pages = {111--118}
                            }
                            
Gaona E, Fernández J and Acacio ME (2009), "Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-based Blades", In Proc. of the 15th Int'l Conference on Parallel and Distributed Computing (Euro-Par'09). Delft, The Netherlands, aug, 2009. , pp. 900-911. Springer-Verlag.
BibTeX:
                            @inproceedings{gaona_europar09,
                              author = {Gaona, Epifanio and Fernández, Juan and Acacio, Manuel E},
                              title = {Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-based Blades},
                              booktitle = {Proc. of the 15th Int'l Conference on Parallel and Distributed Computing (Euro-Par'09)},
                              publisher = {Springer-Verlag},
                              year = {2009},
                              pages = {900--911}
                            }
                            
Ros A, Acacio ME and García JM (2009), "Achieving Directory Scalability and Lessening Network Traffic in Manycore CMPs", In Actas de las XX Jornadas de Paralelismo (JP'09). A Coruña , pp. 219-224.
BibTeX:
                            @inproceedings{ros_jp09,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Achieving Directory Scalability and Lessening Network Traffic in Manycore CMPs},
                              booktitle = {Actas de las XX Jornadas de Paralelismo (JP'09)},
                              year = {2009},
                              pages = {219--224}
                            }
                            
Ros A, Acacio ME and García JM (2009), "Dealing with Traffic-Area Trade-Off in Direct Coherence Protocols for Many-Core CMPs", In Proc. of the 8th Int'l Conference on Advanced Parallel Processing Technologies (APPT-2009). Rapperswil, Switzerland, aug, 2009. , pp. 11-27. Springer-Verlag.
BibTeX:
                            @inproceedings{ros_appt09,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Dealing with Traffic-Area Trade-Off in Direct Coherence Protocols for Many-Core CMPs},
                              booktitle = {Proc. of the 8th Int'l Conference on Advanced Parallel Processing Technologies (APPT-2009)},
                              publisher = {Springer-Verlag},
                              year = {2009},
                              pages = {11--27}
                            }
                            
Ros A, Cintra M, Acacio ME and García JM (2009), "Distance-Aware Round-Robin Mapping for Large NUCA Caches", In Proc. of the 16th Int'l Conference on High Performance Computing (HiPC-2009). Kochi, India, dec, 2009. , pp. 79-88. IEEE Computer Society.
BibTeX:
                            @inproceedings{ros_hipc09,
                              author = {Ros, Alberto and Cintra, Marcelo and Acacio, Manuel E and García, José M},
                              title = {Distance-Aware Round-Robin Mapping for Large NUCA Caches},
                              booktitle = {Proc. of the 16th Int'l Conference on High Performance Computing (HiPC-2009)},
                              publisher = {IEEE Computer Society},
                              year = {2009},
                              pages = {79--88}
                            }
                            
Ros A, Cintra M, Acacio ME and García JM (2009), "Distance-aware round-robin mapping for large NUCA caches", 16th International Conference on High Performance Computing, HiPC 2009 - Proceedings. (1), pp. 79-88.
Abstract: In many-core architectures, memory blocks are commonly assigned to the banks of a NUCA cache by following a physical mapping. This mapping assigns blocks to cache banks in a round-robin fashion, thus neglecting the distance between the cores that most frequently access every block and the corresponding NUCA bank for the block. This issue impacts both cache access latency and the amount of on-chip network traffic generated. On the other hand, first-touch mapping policies, which take into account distance, can lead to an unbalanced utilization of cache banks, and consequently, to an increased number of expensive off-chip accesses. In this work, we propose the distance-aware round-robin mapping policy, an OS-managed policy which addresses the trade-off between cache access latency and number of off-chip accesses. Our policy tries to map the pages accessed by a core to its closest (local) bank, like in a first-touch policy. However, our policy also introduces an upper bound on the deviation of the distribution of memory pages among cache banks, which lessens the number of off-chip accesses. This tradeoff is addressed without requiring any extra hardware structure. We also show that the private cache indexing commonly used in many-core architectures is not the most appropriate for OS-managed distance-aware mapping policies, and propose to employ different bits for such indexing. Using GEMS simulator we show that our proposal obtains average improvements of 11% for parallel applications and 14% for multi-programmed workloads in terms of execution time, and significant reductions in network traffic, over a traditional physical mapping. Moreover, when compared to a first-touch mapping policy, our proposal improves average execution time by 5% for parallel applications and 6% for multi-programmed workloads, slightly increasing on-chip network traffic. © 2009 IEEE.
BibTeX:
                            @article{Ros2009a,
                              author = {Ros, Alberto and Cintra, Marcelo and Acacio, Manuel E. and García, José M.},
                              title = {Distance-aware round-robin mapping for large NUCA caches},
                              journal = {16th International Conference on High Performance Computing, HiPC 2009 - Proceedings},
                              year = {2009},
                              number = {1},
                              pages = {79--88},
                              doi = {10.1109/HIPC.2009.5433220}
                            }
                            
Sánchez D, Aragón JL and García JM (2009), "Extending SRT for Parallel Applications in Tiled-CMP Architectures", In Proc. of the 14th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), held in conjunction with IPDPS'09. Rome, Italy, may, 2009. , pp. 1-10.
BibTeX:
                            @inproceedings{sanchez_dpdns09,
                              author = {Sánchez, Daniel and Aragón, Juan L and García, José M},
                              title = {Extending SRT for Parallel Applications in Tiled-CMP Architectures},
                              booktitle = {Proc. of the 14th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), held in conjunction with IPDPS'09},
                              year = {2009},
                              pages = {1--10}
                            }
                            
Sánchez D, Aragón JL and García JM (2009), "REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs", In Proc. of the 15th Int. Conference on Parallel and Distributed Computing (Euro-Par). Delft, The Netherlands, aug, 2009. , pp. 321-333. Springer-Verlag.
BibTeX:
                            @inproceedings{sanchez_europar09,
                              author = {Sánchez, Daniel and Aragón, Juan L and García, José M},
                              title = {REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs},
                              booktitle = {Proc. of the 15th Int. Conference on Parallel and Distributed Computing (Euro-Par)},
                              publisher = {Springer-Verlag},
                              year = {2009},
                              pages = {321--333}
                            }
                            
Titos R, Acacio ME and García JM (2009), "Speculation-Based Conflict Resolution in Hardware Transactional Memory", In Proc. of the 23rd IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS 2009). Rome, Italy, may, 2009. , pp. 1-12. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{titos_ipdps09,
                              author = {Titos, Rubén and Acacio, Manuel E and García, José M},
                              title = {Speculation-Based Conflict Resolution in Hardware Transactional Memory},
                              booktitle = {Proc. of the 23rd IEEE Int'l Parallel and Distributed Processing Symposium (IPDPS 2009)},
                              publisher = {IEEE Computer Society Press},
                              year = {2009},
                              pages = {1--12}
                            }
                            
Triviño F, Andujar FJ, Ros A, Sánchez JL and Alfaro FJ (2009), "Sistema Integrado de Simulación de NoCs", In Actas de las XX Jornadas de Paralelismo (JP'09). A Coruña , pp. 481-486.
BibTeX:
                            @inproceedings{triviño_jp09,
                              author = {Triviño, Francisco and Andujar, Francisco J and Ros, Alberto and Sánchez, José L and Alfaro, Francisco J},
                              title = {Sistema Integrado de Simulación de NoCs},
                              booktitle = {Actas de las XX Jornadas de Paralelismo (JP'09)},
                              year = {2009},
                              pages = {481--486}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2008), "CellStats: a Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE", In Proc. of the 16th Int'l Euromicro Conference on Parallel, Distributed and Network-Based Computing (PDP'08). Toulouse, France, feb, 2008. , pp. 261-268. IEEE Computer Society.
BibTeX:
                            @inproceedings{abellan_pdp08,
                              author = {Abellán, José L and Fernández, Juan and Acacio, Manuel E},
                              title = {CellStats: a Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE},
                              booktitle = {Proc. of the 16th Int'l Euromicro Conference on Parallel, Distributed and Network-Based Computing (PDP'08)},
                              publisher = {IEEE Computer Society},
                              year = {2008},
                              pages = {261--268}
                            }
                            
Abellán JL, Fernández J and Acacio ME (2008), "Characterizing the Basic Synchronization and Communication Operation in Dual Cell-based Blades", In Proc. of the Int'l Conference on Computational Science (ICCS'08). Krakow, Poland, jun, 2008. , pp. 456-465. Springer-Verlag.
BibTeX:
                            @inproceedings{abellan_iccs08,
                              author = {Abellán, José L and Fernández, Juan and Acacio, Manuel E},
                              title = {Characterizing the Basic Synchronization and Communication Operation in Dual Cell-based Blades},
                              booktitle = {Proc. of the Int'l Conference on Computational Science (ICCS'08)},
                              publisher = {Springer-Verlag},
                              year = {2008},
                              pages = {456--465}
                            }
                            
Aragón JL and Veidenbaum AV (2008), "Optimizing CAM-based Instruction Cache Designs for Low-Power Embedded Systems", Journal of Systems Architecture., jan, 2008. Vol. 54(12), pp. 1155-1163. Elsevier Science Publishers.
BibTeX:
                            @article{aragon_jsa08,
                              author = {Aragón, Juan L and Veidenbaum, Alexander V},
                              title = {Optimizing CAM-based Instruction Cache Designs for Low-Power Embedded Systems},
                              journal = {Journal of Systems Architecture},
                              publisher = {Elsevier Science Publishers},
                              year = {2008},
                              volume = {54},
                              number = {12},
                              pages = {1155--1163}
                            }
                            
Fernández J, Acacio ME, Bernabé G, Abellán JL and Franco J (2008), "Multicore Platforms for Scientific Computing: Cell BE and NVIDIA Tesla", In Proc. of the Int'l Conference on Scientific Computing (CSC'08). Las Vegas, USA, jul, 2008. , pp. 167-173. CSREA Press.
BibTeX:
                            @inproceedings{fernandez_csc08,
                              author = {Fernández, Juan and Acacio, Manuel E and Bernabé, Gregorio and Abellán, José L and Franco, Joaquín},
                              title = {Multicore Platforms for Scientific Computing: Cell BE and NVIDIA Tesla},
                              booktitle = {Proc. of the Int'l Conference on Scientific Computing (CSC'08)},
                              publisher = {CSREA Press},
                              year = {2008},
                              pages = {167--173}
                            }
                            
Fernández-Pascual R, García JM, Acacio ME and Duato J (2008), "A Fault-Tolerant Directory-Based Cache Coherence Protocol for CMP Architectures", In Proc. of the 38th Annual IEEE/IFIP Int'l Conference on Dependable Systems and Networks (DSN 2008). Anchorage, USA, jun, 2008. , pp. 267-276. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{fernandez_dsn08,
                              author = {Fernández-Pascual, Ricardo and García, José M and Acacio, Manuel E and Duato, José},
                              title = {A Fault-Tolerant Directory-Based Cache Coherence Protocol for CMP Architectures},
                              booktitle = {Proc. of the 38th Annual IEEE/IFIP Int'l Conference on Dependable Systems and Networks (DSN 2008)},
                              publisher = {IEEE Computer Society Press},
                              year = {2008},
                              pages = {267--276}
                            }
                            
Fernández-Pascual R, García JM, Acacio ME and Duato J (2008), "Fault-Tolerant Cache Coherence Protocols for CMPs: Evaluation and Trade-Offs", In Proc. of the 15th Int'l Conference on High-Performance Computing (HIPC 2008). Bangalore, India, dec, 2008. , pp. 555-568. Springer-Verlag.
BibTeX:
                            @inproceedings{fernandez_hipc08,
                              author = {Fernández-Pascual, Ricardo and García, José M and Acacio, Manuel E and Duato, José},
                              title = {Fault-Tolerant Cache Coherence Protocols for CMPs: Evaluation and Trade-Offs},
                              booktitle = {Proc. of the 15th Int'l Conference on High-Performance Computing (HIPC 2008)},
                              publisher = {Springer-Verlag},
                              year = {2008},
                              pages = {555--568}
                            }
                            
Fernández-Pascual R, García JM, Acacio ME and Duato J (2008), "Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures", IEEE Transactions on Parallel and Distributed Systems., aug, 2008. Vol. 19(8), pp. 1044-1056. IEEE Computer Society.
BibTeX:
                            @article{fernandez_tpds08,
                              author = {Fernández-Pascual, Ricardo and García, José M and Acacio, Manuel E and Duato, José},
                              title = {Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              publisher = {IEEE Computer Society},
                              year = {2008},
                              volume = {19},
                              number = {8},
                              pages = {1044--1056}
                            }
                            
Flores A, Acacio ME and Aragón JL (2008), "Address Compression and Heterogeneous Interconnects for Energy-Efficient High-Performance in Tiled CMPs", In Proc. of the 37th Int'l Conference on Parallel Processing (ICPP-2008). Portland, USA, sep, 2008. , pp. 295-303. IEEE Computer Society.
BibTeX:
                            @inproceedings{flores_icpp08,
                              author = {Flores, Antonio and Acacio, Manuel E and Aragón, Juan L},
                              title = {Address Compression and Heterogeneous Interconnects for Energy-Efficient High-Performance in Tiled CMPs},
                              booktitle = {Proc. of the 37th Int'l Conference on Parallel Processing (ICPP-2008)},
                              publisher = {IEEE Computer Society},
                              year = {2008},
                              pages = {295--303}
                            }
                            
Flores A, Aragón JL and Acacio ME (2008), "An Energy Consumption Characterization of On-Chip Interconnection Networks for Tiled CMP Architectures", Journal of Supercomputing. Hingham, MA, USA, sep, 2008. Vol. 45(3), pp. 341-364. Springer-Verlag.
BibTeX:
                            @article{flores_jsc08,
                              author = {Flores, Antonio and Aragón, Juan L and Acacio, Manuel E},
                              title = {An Energy Consumption Characterization of On-Chip Interconnection Networks for Tiled CMP Architectures},
                              journal = {Journal of Supercomputing},
                              publisher = {Springer-Verlag},
                              year = {2008},
                              volume = {45},
                              number = {3},
                              pages = {341--364}
                            }
                            
Gonzalez-Ferez P, Piernas J and Cortes T (2008), "Evaluating the Effectiveness of REDCAP to Recover the Locality Missed by Today's Linux Systems", In Proc. of the 16th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2008). Baltimore, USA, sep, 2008. , pp. 367-370. IEEE Computer Society.
BibTeX:
                            @inproceedings{gonzalez_mascots08,
                              author = {Gonzalez-Ferez, Pilar and Piernas, Juan and Cortes, Toni},
                              title = {Evaluating the Effectiveness of REDCAP to Recover the Locality Missed by Today's Linux Systems},
                              booktitle = {Proc. of the 16th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2008)},
                              publisher = {IEEE Computer Society},
                              year = {2008},
                              pages = {367--370}
                            }
                            
Gonzalez-Ferez P, Piernas J and Cortes T (2008), "The RAM Enhanced Disk Cache Project (REDCAP)", In Actas de la XIX Jornadas de Paralelismo. Castellón, España. Publicacions de la Universitat Jaume I. Servei de Comunicació i Publicacions.
BibTeX:
                            @inproceedings{gonzalez_jornadas08,
                              author = {Gonzalez-Ferez, Pilar and Piernas, Juan and Cortes, Toni},
                              title = {The RAM Enhanced Disk Cache Project (REDCAP)},
                              booktitle = {Actas de la XIX Jornadas de Paralelismo},
                              publisher = {Publicacions de la Universitat Jaume I. Servei de Comunicació i Publicacions},
                              year = {2008}
                            }
                            
Piernas J and Nieplocha J (2008), "Efficient Management of Complex Striped Files in Active Storage", In Proc. of the 14th International European Conference on Parallel and Distributed Computing (Euro-Par 2008). Las Palmas de Gran Canaria, Spain, aug, 2008. , pp. 676-685. Springer-Verlag.
BibTeX:
                            @inproceedings{piernas_europar08,
                              author = {Piernas, Juan and Nieplocha, Jarek},
                              title = {Efficient Management of Complex Striped Files in Active Storage},
                              booktitle = {Proc. of the 14th International European Conference on Parallel and Distributed Computing (Euro-Par 2008)},
                              publisher = {Springer-Verlag},
                              year = {2008},
                              pages = {676--685}
                            }
                            
Ros A, Acacio ME and García JM (2008), "Efficient Cache Coherence Protocol in Tiled Chip Multiprocessors", In Actas de las XIX Jornadas de Paralelismo (JP'08). Castellón , pp. 199-204.
BibTeX:
                            @inproceedings{ros_jp08,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Efficient Cache Coherence Protocol in Tiled Chip Multiprocessors},
                              booktitle = {Actas de las XIX Jornadas de Paralelismo (JP'08)},
                              year = {2008},
                              pages = {199--204}
                            }
                            
Ros A, Acacio ME and García JM (2008), "DiCo-CMP: Efficient Cache Coherency in Tiled CMP Architectures", In Proc. of the 22nd Int'l Parallel & Distributed Processing Symposium (IPDPS-2008). Miami, Florida, apr, 2008. , pp. 1-11. IEEE Computer Society.
BibTeX:
                            @inproceedings{ros_ipdps08,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {DiCo-CMP: Efficient Cache Coherency in Tiled CMP Architectures},
                              booktitle = {Proc. of the 22nd Int'l Parallel \& Distributed Processing Symposium (IPDPS-2008)},
                              publisher = {IEEE Computer Society},
                              year = {2008},
                              pages = {1--11}
                            }
                            
Ros A, Acacio ME and García JM (2008), "DiCo-CMP: Efficient cache coherency in tiled CMP architectures", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM.
Abstract: Future CMP designs that will integrate tens of processor cores on-chip will be constrained by area and power. Area constraints make impractical the use of a bus or a crossbar as the on-chip interconnection network, and tiled CMPs organized around a direct interconnection network will probably be the architecture of choice. Power constraints make impractical to rely on broadcasts (as Token-CMP does) or any other brute-force method for keeping cache coherence, and directory-based cache coherence protocols are currently being employed. Unfortunately, directory protocols introduce indirection to access directory information, which negatively impacts performance. In this work, we present DiCo-CMP, a novel cache coherence protocol especially suited to future tiled CMP architectures. In DiCo-CMP the role of storing up-to-date sharing information and ensuring totally ordered accesses for every memory block is assigned to the cache that must provide the block on a miss. Therefore, DiCo-CMP reduces the miss latency compared to a directory protocol by sending coherence messages directly from the requesting caches to those that must observe them (as it would be done in brute-force protocols), and reduces the network traffic compared to Token-CMP (and consequently, power consumption in the interconnection network) by sending just one request message for each miss. Using an extended version of GEMS simulator we show that DiCo-CMP achieves improvements in execution time of up to 8% on average over a directory protocol, and reductions in terms of network traffic of up to 42% on average compared to Token-CMP. © 2008 IEEE.
BibTeX:
                            @article{Ros2008a,
                              author = {Ros, Alberto and Acacio, Manuel E. and García, José M.},
                              title = {DiCo-CMP: Efficient cache coherency in tiled CMP architectures},
                              journal = {IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM},
                              year = {2008},
                              doi = {10.1109/IPDPS.2008.4536287}
                            }
                            
Ros A, Acacio ME and García JM (2008), "Scalable Directory Organization for Tiled CMP Architectures", In Proc. of the Int'l Conference on Computer Design (CDES-2008). Las Vegas, USA, jul, 2008. , pp. 112-118. CSREA Press.
BibTeX:
                            @inproceedings{ros_cdes08,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Scalable Directory Organization for Tiled CMP Architectures},
                              booktitle = {Proc. of the Int'l Conference on Computer Design (CDES-2008)},
                              publisher = {CSREA Press},
                              year = {2008},
                              pages = {112--118}
                            }
                            
Ros A, Acacio ME and García JM (2008), "Scalable directory organization for tiled CMP architectures", Proceedings of the 2008 International Conference on Computer Design, CDES 2008. , pp. 112-118.
Abstract: Although directory-based cache coherence protocols are the best choice when designing chip multiprocessor architectures (CMPs) with tens of processor cores on chip, the memory overhead introduced by the directory structure may not scale gracefully with the number of cores. In this work, we show that a directory organization based on duplicating tags, which are distributed among the tiles of a tiled CMP with a fine-grained interleaving, is scalable. That is to say, the size of each directory bank is independent on the number of tiles of the system. Moreover, based on this directory organization we propose and evaluate the implicit replacements mechanism which leads to savings of up to 32% in terms of number of messages in the interconnection network.
BibTeX:
                            @article{Ros2008,
                              author = {Ros, Alberto and Acacio, Manuel E. and García, José M.},
                              title = {Scalable directory organization for tiled CMP architectures},
                              journal = {Proceedings of the 2008 International Conference on Computer Design, CDES 2008},
                              year = {2008},
                              pages = {112--118}
                            }
                            
Ros A, Fernández-Pascual R, Acacio ME and García JM (2008), "Two Proposals for the Inclusion of Directory Information in the Last-Level Private Caches of Glueless Shared-Memory Multiprocessors.", Journal of Parallel and Distributed Computing., nov, 2008. Vol. 68(11), pp. 1413-1424. Elsevier Science Publishers.
BibTeX:
                            @article{ros_jpdc08,
                              author = {Ros, Alberto and Fernández-Pascual, Ricardo and Acacio, Manuel E and García, José M},
                              title = {Two Proposals for the Inclusion of Directory Information in the Last-Level Private Caches of Glueless Shared-Memory Multiprocessors.},
                              journal = {Journal of Parallel and Distributed Computing},
                              publisher = {Elsevier Science Publishers},
                              year = {2008},
                              volume = {68},
                              number = {11},
                              pages = {1413--1424}
                            }
                            
Ros A and García JM (2008), "La plataforma Simics como herramienta de aprendizaje", In Actas de las XIV Jornadas de Enseñanza Universitaria de Informática (JENUI'08). Granada , pp. 291-298.
BibTeX:
                            @inproceedings{ros_jenui08,
                              author = {Ros, Alberto and García, José M},
                              title = {La plataforma Simics como herramienta de aprendizaje},
                              booktitle = {Actas de las XIV Jornadas de Enseñanza Universitaria de Informática (JENUI'08)},
                              year = {2008},
                              pages = {291--298}
                            }
                            
Sánchez D, Aragón JL and García JM (2008), "Adapting Dynamic Core Coupling to a Direct-Network Environment", In Actas de las XIX Jornadas de Paralelismo. Castellón, sep, 2008. , pp. 253-258.
BibTeX:
                            @inproceedings{sanchez_jp08,
                              author = {Sánchez, Daniel and Aragón, Juan L and García, José M},
                              title = {Adapting Dynamic Core Coupling to a Direct-Network Environment},
                              booktitle = {Actas de las XIX Jornadas de Paralelismo},
                              year = {2008},
                              pages = {253--258}
                            }
                            
Sánchez D, Aragón JL and García JM (2008), "Evaluating Dynamic Core Coupling in a Scalable Tiled-CMP Architecture", In Proc. of the 7th Int. Workshop on Duplicating, Deconstructing, and Debunking (WDDD), held in conjunction with ISCA'08. Beijing, China, jun, 2008. , pp. 1-10.
BibTeX:
                            @inproceedings{sanchez_wddd08,
                              author = {Sánchez, Daniel and Aragón, Juan L and García, José M},
                              title = {Evaluating Dynamic Core Coupling in a Scalable Tiled-CMP Architecture},
                              booktitle = {Proc. of the 7th Int. Workshop on Duplicating, Deconstructing, and Debunking (WDDD), held in conjunction with ISCA'08},
                              year = {2008},
                              pages = {1--10}
                            }
                            
Titos R, Acacio ME and García JM (2008), "A Characterization of Conflicts in Log-Based Transactional Memory (LogTM)", In Proc. of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP-08). Toulouse, France, feb, 2008. , pp. 30-37. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{titos_pdp08,
                              author = {Titos, Rubén and Acacio, Manuel E and García, José M},
                              title = {A Characterization of Conflicts in Log-Based Transactional Memory (LogTM)},
                              booktitle = {Proc. of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP-08)},
                              publisher = {IEEE Computer Society Press},
                              year = {2008},
                              pages = {30--37}
                            }
                            
Titos R, Acacio ME and García JM (2008), "Directory-Based Conflict Detection in Hardware Transactional Memory", In Proc. of the 15th Int'l Conference on High-Performance Computing (HIPC 2008). Bangalore, India, dec, 2008. , pp. 541-554. Springer-Verlag.
BibTeX:
                            @inproceedings{titos_hipc08,
                              author = {Titos, Rubén and Acacio, Manuel E and García, José M},
                              title = {Directory-Based Conflict Detection in Hardware Transactional Memory},
                              booktitle = {Proc. of the 15th Int'l Conference on High-Performance Computing (HIPC 2008)},
                              publisher = {Springer-Verlag},
                              year = {2008},
                              pages = {541--554}
                            }
                            
Bernabé G, Fernández R, García JM, Acacio ME and González J (2007), "An Efficient implementation of a 3D Wavelet Transform based Encoder on Hyper-Threading Technology", Journal of Parallel Computing., jan, 2007. Vol. 33(1), pp. 54-72. Elsevier.
BibTeX:
                            @article{bernabe_parco07,
                              author = {Bernabé, Gregorio and Fernández, Ricardo and García, José M and Acacio, Manuel E and González, José},
                              title = {An Efficient implementation of a 3D Wavelet Transform based Encoder on Hyper-Threading Technology},
                              journal = {Journal of Parallel Computing},
                              publisher = {Elsevier},
                              year = {2007},
                              volume = {33},
                              number = {1},
                              pages = {54--72},
                              url = {http://www.sciencedirect.com/science/article/pii/S0167819106001207},
                              doi = {10.1016/j.parco.2006.11.011}
                            }
                            
Bernabé G, Fernández R, García JM, Acacio ME and González J (2007), "An Efficient Implementation of a 3D Wavelet Transform Based Encoder on Hyper-Threading Technology", Parallel Computing., feb, 2007. Vol. 33(1), pp. 54-72. Elsevier Science Publishers.
BibTeX:
                            @article{bernabe_jpc07,
                              author = {Bernabé, Gregorio and Fernández, Ricardo and García, José M and Acacio, Manuel E and González, José},
                              title = {An Efficient Implementation of a 3D Wavelet Transform Based Encoder on Hyper-Threading Technology},
                              journal = {Parallel Computing},
                              publisher = {Elsevier Science Publishers},
                              year = {2007},
                              volume = {33},
                              number = {1},
                              pages = {54--72}
                            }
                            
Cebrián JM, Aragón JL and García JM (2007), "Leakage Energy Reduction in Value Predictors through Static Decay", In Proc. of the Int. Workshop on High-Performance, Power-Aware Computing (HP-PAC), held in conjunction with IPDPS'07. Long Beach, California, USA, mar, 2007. , pp. 269-275.
BibTeX:
                            @inproceedings{cebrian_hppac07,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M},
                              title = {Leakage Energy Reduction in Value Predictors through Static Decay},
                              booktitle = {Proc. of the Int. Workshop on High-Performance, Power-Aware Computing (HP-PAC), held in conjunction with IPDPS'07},
                              year = {2007},
                              pages = {269--275}
                            }
                            
Cebrián JM, Aragón JL, García JM and Kaxiras S (2007), "Adaptive VP Decay: Making Value Predictors Leakage-efficient Designs for High Performance Processors", In Proc. of the 4th ACM Int. Conference on Computing Frontiers (CF). Ischia, Italy, may, 2007. , pp. 113-122.
BibTeX:
                            @inproceedings{cebrian_cf07,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M and Kaxiras, Stefanos},
                              title = {Adaptive VP Decay: Making Value Predictors Leakage-efficient Designs for High Performance Processors},
                              booktitle = {Proc. of the 4th ACM Int. Conference on Computing Frontiers (CF)},
                              year = {2007},
                              pages = {113--122}
                            }
                            
Cebrián JM, Aragón JL, García JM and Kaxiras S (2007), "An Adaptive Approach for Reducing Leakage Energy Consumption in Value Predictors", In Actas de las XVIII Jornadas de Paralelismo. Zaragoza, sep, 2007. , pp. 107-114.
BibTeX:
                            @inproceedings{cebrian_jp07,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M and Kaxiras, Stefanos},
                              title = {An Adaptive Approach for Reducing Leakage Energy Consumption in Value Predictors},
                              booktitle = {Actas de las XVIII Jornadas de Paralelismo},
                              year = {2007},
                              pages = {107--114}
                            }
                            
Fernández-Pascual R, García JM, Acacio ME and Duato J (2007), "A Low Overhead Fault-Tolerant Coherence Protocol for CMP Architectures", In Proc. of the 13th Int'l Symposium on High Performance Computer Architecture (HPCA-13). Phoenix, USA, feb, 2007. , pp. 157-168. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{fernandez_hpca07,
                              author = {Fernández-Pascual, Ricardo and García, José M and Acacio, Manuel E and Duato, José},
                              title = {A Low Overhead Fault-Tolerant Coherence Protocol for CMP Architectures},
                              booktitle = {Proc. of the 13th Int'l Symposium on High Performance Computer Architecture (HPCA-13)},
                              publisher = {IEEE Computer Society Press},
                              year = {2007},
                              pages = {157--168}
                            }
                            
Flores A, Aragón JL and Acacio ME (2007), "Efficient Message Management in Tiled CMP Architectures Using a Heterogeneous Interconnection Network", In Proc. of the 14th Int'l Conference on High Performance Computing (HiPC-2007). Goa, India, dec, 2007. , pp. 133-146. Springer-Verlag.
BibTeX:
                            @inproceedings{flores_hipc07,
                              author = {Flores, Antonio and Aragón, Juan L and Acacio, Manuel E},
                              title = {Efficient Message Management in Tiled CMP Architectures Using a Heterogeneous Interconnection Network},
                              booktitle = {Proc. of the 14th Int'l Conference on High Performance Computing (HiPC-2007)},
                              publisher = {Springer-Verlag},
                              year = {2007},
                              pages = {133--146}
                            }
                            
Flores A, Aragón JL and Acacio ME (2007), "Sim-PowerCMP: A Detailed Simulator for Energy Consumption Analysis in Future Embedded CMP Architectures", In Proc. of the 4th Int'l Symposium on Embedded Computing (SEC-4). Niagara Falls, Canada, may, 2007. , pp. 752-757. IEEE Computer Society.
BibTeX:
                            @inproceedings{flores_sec07,
                              author = {Flores, Antonio and Aragón, Juan L and Acacio, Manuel E},
                              title = {Sim-PowerCMP: A Detailed Simulator for Energy Consumption Analysis in Future Embedded CMP Architectures},
                              booktitle = {Proc. of the 4th Int'l Symposium on Embedded Computing (SEC-4)},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              pages = {752--757}
                            }
                            
Gonzalez-Ferez P, Piernas J and Cortes T (2007), "The RAM Enhanced Disk Cache Project (REDCAP)", In Proc. of the 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007). San Diego, USA, sep, 2007. , pp. 251-256. IEEE Computer Society.
BibTeX:
                            @inproceedings{gonzalez_msst07,
                              author = {Gonzalez-Ferez, Pilar and Piernas, Juan and Cortes, Toni},
                              title = {The RAM Enhanced Disk Cache Project (REDCAP)},
                              booktitle = {Proc. of the 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007)},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              pages = {251--256}
                            }
                            
Krishnamoorthy S, Canovas JP, Tipparaju V, Nieplocha J and Sadayappan P (2007), "Non-collective Parallel I/O for Global Address Space Programming Models", In Proc. of the 2007 IEEE International Conference on Cluster Computing. Austin, USA, sep, 2007. , pp. 41-49. IEEE Computer Society.
BibTeX:
                            @inproceedings{krishnamoorthy_cluster07,
                              author = {Krishnamoorthy, Sriram and Canovas, Juan Piernas and Tipparaju, Vinod and Nieplocha, Jarek and Sadayappan, P},
                              title = {Non-collective Parallel I/O for Global Address Space Programming Models},
                              booktitle = {Proc. of the 2007 IEEE International Conference on Cluster Computing},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              pages = {41--49}
                            }
                            
Petrini F, Fossum G, Fernández J, Varbanescu AL, Kistler M and Perrone M (2007), "Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine", In Proc. of the 21st Int'l Parallel & Distributed Processing Symposium (IPDPS'07). Long Beach, USA, apr, 2007. , pp. 1-10. IEEE Computer Society.
BibTeX:
                            @inproceedings{petrini_ipdps07,
                              author = {Petrini, Fabrizio and Fossum, Gordon and Fernández, Juan and Varbanescu, Ana L and Kistler, Mike and Perrone, Michael},
                              title = {Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine},
                              booktitle = {Proc. of the 21st Int'l Parallel & Distributed Processing Symposium (IPDPS'07)},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              pages = {1--10}
                            }
                            
Piernas J, Cortes T and Carrasco JMG (2007), "The Design of New Journaling File Systems: The DualFS Case", IEEE Transactions on Computers., feb, 2007. Vol. 56(2), pp. 267-281. IEEE Computer Society.
BibTeX:
                            @article{piernas_toc07,
                              author = {Piernas, Juan and Cortes, Toni and Carrasco, José M García},
                              title = {The Design of New Journaling File Systems: The DualFS Case},
                              journal = {IEEE Transactions on Computers},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              volume = {56},
                              number = {2},
                              pages = {267--281}
                            }
                            
Piernas J, Nieplocha J and Felix EJ (2007), "Evaluation of Active Storage Strategies for the Lustre Parallel File System", In Proc. of the ACM/IEEE 2007 Supercomputing Conference (SC'07). Reno, USA, nov, 2007. , pp. 1-10. IEEE Computer Society.
BibTeX:
                            @inproceedings{piernas_sc07,
                              author = {Piernas, Juan and Nieplocha, Jarek and Felix, Evan J},
                              title = {Evaluation of Active Storage Strategies for the Lustre Parallel File System},
                              booktitle = {Proc. of the ACM/IEEE 2007 Supercomputing Conference (SC'07)},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              pages = {1--10}
                            }
                            
Ros A, Acacio ME and García JM (2007), "Direct Coherence: Bringing Together Performance and Scalability in Shared-Memory Multiprocessors", In Proc. of the 14th Int'l Conference on High Performance Computing (HiPC-2007). Goa, India, dec, 2007. , pp. 147-160. Springer-Verlag.
BibTeX:
                            @inproceedings{ros_hipc07,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Direct Coherence: Bringing Together Performance and Scalability in Shared-Memory Multiprocessors},
                              booktitle = {Proc. of the 14th Int'l Conference on High Performance Computing (HiPC-2007)},
                              publisher = {Springer-Verlag},
                              year = {2007},
                              pages = {147--160}
                            }
                            
Ros A, Acacio ME and García JM (2007), "Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 4873 LNCS, pp. 147-160.
Abstract: Traditional directory-based cache coherence protocols suffer from long-latency cache misses as a consequence of the indirection introduced by the home node, which must be accessed on every cache miss before any coherence action can be performed. In this work we present a new protocol that moves the role of storing up-to-date coherence information (and thus ensuring totally ordered accesses) from the home node to one of the sharing caches. Our protocol allows most cache misses to be directly solved from the corresponding remote caches, without requiring the intervention of the home node. In this way, cache miss latencies are reduced. Detailed simulations show that this protocol leads to improvements in total execution time of 8% on average over a highly optimized MOESI directory-based protocol. © Springer-Verlag Berlin Heidelberg 2007.
BibTeX:
                            @article{Ros2007,
                              author = {Ros, Alberto and Acacio, Manuel E. and García, José M.},
                              title = {Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors},
                              journal = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
                              year = {2007},
                              volume = {4873 LNCS},
                              pages = {147--160},
                              doi = {10.1007/978-3-540-77220-0_17}
                            }
                            
Ros A, Acacio ME and García JM (2007), "Exploiting Cache-to-Cache Transfers of Clean Data in Glueless Shared-Memory Multiprocessors", In Actas de las XVIII Jornadas de Paralelismo (JP'07). Zaragoza , pp. 123-130.
BibTeX:
                            @inproceedings{ros_jp07,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Exploiting Cache-to-Cache Transfers of Clean Data in Glueless Shared-Memory Multiprocessors},
                              booktitle = {Actas de las XVIII Jornadas de Paralelismo (JP'07)},
                              year = {2007},
                              pages = {123--130}
                            }
                            
Villa O, Scarpazza DP, Petrini F and Fernández J (2007), "Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors", In Proc. of the 21st Int'l Parallel & Distributed Processing Symposium (IPDPS'07). Long Beach, USA, apr, 2007. , pp. 1-10. IEEE Computer Society.
BibTeX:
                            @inproceedings{villa_ipdps07,
                              author = {Villa, Oreste and Scarpazza, Daniele P and Petrini, Fabrizio and Fernández, Juan},
                              title = {Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors},
                              booktitle = {Proc. of the 21st Int'l Parallel & Distributed Processing Symposium (IPDPS'07)},
                              publisher = {IEEE Computer Society},
                              year = {2007},
                              pages = {1--10}
                            }
                            
Villa O, Scarpazza DP, Petrini F and Peinador JF (2007), "Challenges in mapping graph exploration algorithms on advanced multi-core processors", Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM. (1), pp. 1-10.
Abstract: Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But multi-core processors also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a breadth-first search (BFS) algorithm on a state-of-the-art multi-core processor, the Cell Broadband Engine (Cell BE). Our experiments obtained on a pre-production Cell BE board running at 3.2 GHz show almost linear speedups when using multiple synergistic processing units, and an impressive level of performance when compared to other processors. The Cell BE is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, an order of magnitude faster than the MTA-2 multi-threaded processor, and two orders of magnitude faster than a BlueGene/L processor. Copyright © 2007 IEEE.
BibTeX:
                            @article{Villa2007,
                              author = {Villa, Oreste and Scarpazza, Daniele Paolo and Petrini, Fabrizio and Peinador, Juan Fernández},
                              title = {Challenges in mapping graph exploration algorithms on advanced multi-core processors},
                              journal = {Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM},
                              year = {2007},
                              number = {1},
                              pages = {1--10},
                              doi = {10.1109/IPDPS.2007.370253}
                            }
                            
Aragón JL, González J and González A (2006), "Control Speculation for Energy-Efficient Next-Generation Superscalar Processors", IEEE Transactions on Computers., mar, 2006. Vol. 55(3), pp. 281-291. IEEE Computer Society.
BibTeX:
                            @article{aragon_tc06,
                              author = {Aragón, Juan L and González, José and González, Antonio},
                              title = {Control Speculation for Energy-Efficient Next-Generation Superscalar Processors},
                              journal = {IEEE Transactions on Computers},
                              publisher = {IEEE Computer Society},
                              year = {2006},
                              volume = {55},
                              number = {3},
                              pages = {281--291}
                            }
                            
Cebrián JM, Aragón JL and García JM (2006), "Reducing Leakage in Value Predictors by Using Decay Techniques", In Actas de las XVII Jornadas de Paralelismo. Albacete, sep, 2006. , pp. 61-66.
BibTeX:
                            @inproceedings{cebrian_jp06,
                              author = {Cebrián, Juan M and Aragón, Juan L and García, José M},
                              title = {Reducing Leakage in Value Predictors by Using Decay Techniques},
                              booktitle = {Actas de las XVII Jornadas de Paralelismo},
                              year = {2006},
                              pages = {61--66}
                            }
                            
Fernández J, Frachtenberg E, Petrini F and Sancho JC (2006), "An Abstract Interface for System Software on Large-Scale Clusters", The Computer Journal. Washington, DC, USA, may, 2006. Vol. 49(4), pp. 454-469. IEEE Computer Society.
BibTeX:
                            @article{fernandez_cj06,
                              author = {Fernández, Juan and Frachtenberg, Eitan and Petrini, Fabrizio and Sancho, José C},
                              title = {An Abstract Interface for System Software on Large-Scale Clusters},
                              journal = {The Computer Journal},
                              publisher = {IEEE Computer Society},
                              year = {2006},
                              volume = {49},
                              number = {4},
                              pages = {454--469}
                            }
                            
Fernández J, Petrini F and Frachtenberg E (2006), "Engineering the Grid: Status and Perspective", jan, 2006. , pp. 1-22. Beniamino di Martino et al. editors.
BibTeX:
                            @inbook{fernandez_eg06,
                              author = {Fernández, Juan and Petrini, Fabrizio and Frachtenberg, Eitan},
                              editor = {di Martino, Beniamino and others},
                              title = {Engineering the Grid: Status and Perspective},
                              publisher = {Beniamino di Martino et al. editors},
                              year = {2006},
                              pages = {1--22}
                            }
                            
Fernández-Pascual R, García JM and Acacio ME (2006), "Validating a Token Coherence Protocol for Scientific Workloads", In the 5th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD 2006). Boston, USA, jun, 2006.
BibTeX:
                            @inproceedings{fernandez_wddd06,
                              author = {Fernández-Pascual, Ricardo and García, José M and Acacio, Manuel E},
                              title = {Validating a Token Coherence Protocol for Scientific Workloads},
                              booktitle = {5th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD 2006)},
                              year = {2006}
                            }
                            
Flores A, Aragón JL and Acacio ME (2006), "SimPower-CMP: Un Simulador de Consumo para CMPs", In Actas de las XVII Jornadas de Paralelismo (JP'06). Albacete , pp. 85-90.
BibTeX:
                            @inproceedings{flores_jp06,
                              author = {Flores, Antonio and Aragón, Juan L and Acacio, Manuel E},
                              title = {SimPower-CMP: Un Simulador de Consumo para CMPs},
                              booktitle = {Actas de las XVII Jornadas de Paralelismo (JP'06)},
                              year = {2006},
                              pages = {85--90}
                            }
                            
Frachtenberg E, Petrini F, Fernández J and Pakin S (2006), "STORM: Scalable Resource Management for Large-Scale Parallel Computers", IEEE Transactions on Computers. Washington, DC, USA, dec, 2006. Vol. 55(12), pp. 1572-1587. IEEE Computer Society.
BibTeX:
                            @article{frachtenberg_tc06,
                              author = {Frachtenberg, Eitan and Petrini, Fabrizio and Fernández, Juan and Pakin, Scott},
                              title = {STORM: Scalable Resource Management for Large-Scale Parallel Computers},
                              journal = {IEEE Transactions on Computers},
                              publisher = {IEEE Computer Society},
                              year = {2006},
                              volume = {55},
                              number = {12},
                              pages = {1572--1587}
                            }
                            
Petrini F, Moody A, Fernández J, Frachtenberg E and Panda DK (2006), "NIC-based Reduction Algorithms for Large-Scale Clusters", International Journal of High Performance Computing and Networking. Washington, DC, USA, aug, 2006. Vol. 4(3/4), pp. 122-136. Inderscience Publishers.
BibTeX:
                            @article{petrini_ijhpcn06,
                              author = {Petrini, Fabrizio and Moody, Adam and Fernández, Juan and Frachtenberg, Eitan and Panda, Dhabaleswar K},
                              title = {NIC-based Reduction Algorithms for Large-Scale Clusters},
                              journal = {International Journal of High Performance Computing and Networking},
                              publisher = {Inderscience Publishers},
                              year = {2006},
                              volume = {4},
                              number = {3/4},
                              pages = {122--136}
                            }
                            
Ros A, Acacio ME and García JM (2006), "The SGluM Cache for Scalable Glueless Shared-Memory Multiprocessors", In Actas de las XVII Jornadas de Paralelismo (JP'06). Albacete , pp. 73-78.
BibTeX:
                            @inproceedings{ros_jp06,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {The SGluM Cache for Scalable Glueless Shared-Memory Multiprocessors},
                              booktitle = {Actas de las XVII Jornadas de Paralelismo (JP'06)},
                              year = {2006},
                              pages = {73--78}
                            }
                            
Ros A, Acacio ME and García JM (2006), "An Efficient Cache Design for Scalable Glueless Shared-Memory Multiprocessors", In Proc. of the ACM International Conference on Computing Frontiers. Ischia, Italy, may, 2006. , pp. 321-330. ACM.
BibTeX:
                            @inproceedings{ros_cf06,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {An Efficient Cache Design for Scalable Glueless Shared-Memory Multiprocessors},
                              booktitle = {Proc. of the ACM International Conference on Computing Frontiers},
                              publisher = {ACM},
                              year = {2006},
                              pages = {321--330}
                            }
                            
Ros A, Acacio ME and García JM (2006), "An efficient cache design for scalable glueless shared-memory multiprocessors", Proceedings of the 3rd Conference on Computing Frontiers 2006, CF '06. Vol. 2006, pp. 321-330.
Abstract: Traditionally, cache coherence in large-scale shared-memory multiprocessors has been ensured by means of a distributed directory structure stored in main memory. In this way, the access to main memory to recover the sharing status of the block is generally put in the critical path of every cache miss, increasing its latency. Considering the ever-increasing distance to memory, these cache coherence protocols are far from being optimal from the perspective of performance. On the other hand, shared-memory multiprocessors formed by connecting chips that integrate the processor, caches, coherence logic, switch and memory controller through a low-cost, low-latency point-to-point network (glueless shared-memory multiprocessors) are a reality. In this work, we propose a novel design for the L2 cache level, at which coherence has to be maintained, aimed at being used in glueless shared-memory multiprocessors. Our proposal splits the cache structure into two different parts: one for storing data and directory information for the blocks requested by the local processor, and another one for storing only directory information for blocks accessed by remote processors. Using this cache scheme we remove the directory from main memory. Besides saving memory space, our proposal brings very significant reductions in terms of latency of the cache misses (speed-ups of 3.0 on average), which translate into reductions in applications' execution time of 31% on average. Copyright 2006 ACM.
BibTeX:
                            @article{Ros2006,
                              author = {Ros, Alberto and Acacio, Manuel E. and García, José M.},
                              title = {An efficient cache design for scalable glueless shared-memory multiprocessors},
                              journal = {Proceedings of the 3rd Conference on Computing Frontiers 2006, CF '06},
                              year = {2006},
                              volume = {2006},
                              pages = {321--330},
                              doi = {10.1145/1128022.1128065}
                            }
                            
Villa FJ, Acacio ME and García JM (2006), "On the Evaluation of Dense Chip-Multiprocessor Architectures", In Proc. of the 2006 Int'l Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (IC-SAMOS VI). Samos, Greece, jul, 2006. , pp. 21-27. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{villa_icsamos06,
                              author = {Villa, Francisco J and Acacio, Manuel E and García, José M},
                              title = {On the Evaluation of Dense Chip-Multiprocessor Architectures},
                              booktitle = {Proc. of the 2006 Int'l Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (IC-SAMOS VI)},
                              publisher = {IEEE Computer Society Press},
                              year = {2006},
                              pages = {21--27}
                            }
                            
Villa FJ, Acacio ME and García JM (2006), "Toward Energy-Efficient High-Performance Organizations of the Memory Hierarchy in Chip-Multiprocessor Architectures", Journal of Computer Science & Technology., apr, 2006. Vol. 6(1), pp. 1-7. Iberoamerican Science & Technology Education Consortium.
BibTeX:
                            @article{villa_jcst06,
                              author = {Villa, Francisco J and Acacio, Manuel E and García, José M},
                              title = {Toward Energy-Efficient High-Performance Organizations of the Memory Hierarchy in Chip-Multiprocessor Architectures},
                              journal = {Journal of Computer Science \& Technology},
                              publisher = {Iberoamerican Science \& Technology Education Consortium},
                              year = {2006},
                              volume = {6},
                              number = {1},
                              pages = {1--7}
                            }
                            
A. Navarro C. J. Hernández, J A Velez, L E Munuera, G Bernabé, C A Gamboa and A J Reyes (2005), "Virtual Surgical Telesimulations in Otolaryngology", In Proc. of the 13th Annual Medicine Meets Virtual Reality (MMVR). Long Beach, USA, jan, 2005. Vol. 11, pp. 353-355. Studies in Health Technology and Informatics. IOS Press.
BibTeX:
                            @inproceedings{navarro_mmvr05,
                              author = {A. Navarro C. J. Hernández and J A Velez and L E Munuera and G Bernabé and C A Gamboa and A J Reyes},
                              title = {Virtual Surgical Telesimulations in Otolaryngology},
                              booktitle = {Proc. of the 13th Annual Medicine Meets Virtual Reality (MMVR)},
                              publisher = {Studies in Health Technology and Informatics. IOS Press},
                              year = {2005},
                              volume = {11},
                              pages = {353--355}
                            }
                            
Acacio ME, González J, García JM and Duato J (2005), "A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors", IEEE Transactions on Parallel and Distributed Systems., jan, 2005. Vol. 16(1), pp. 67-79. IEEE Computer Society.
BibTeX:
                            @article{acacio_tpds05,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              publisher = {IEEE Computer Society},
                              year = {2005},
                              volume = {16},
                              number = {1},
                              pages = {67--79}
                            }
                            
Aragón JL and Veidenbaum AV (2005), "Energy-Effective Instruction Fetch Unit for Wide Issue Processors", In Proc. of the 10th Asia-Pacific Computer Systems Architecture Conference (ACSAC). Singapore, Singapore, oct, 2005. , pp. 15-27. Springer-Verlag.
BibTeX:
                            @inproceedings{aragon_acsac05,
                              author = {Aragón, Juan L and Veidenbaum, Alexander V},
                              title = {Energy-Effective Instruction Fetch Unit for Wide Issue Processors},
                              booktitle = {Proc. of the 10th Asia-Pacific Computer Systems Architecture Conference (ACSAC)},
                              publisher = {Springer-Verlag},
                              year = {2005},
                              pages = {15--27}
                            }
                            
Bernabé G, González J and García JM (2005), "Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions", Journal of VLSI Signal Processing Systems., sep, 2005. Vol. 41(2), pp. 209-223. Springer Science + Business Media, Inc. Manufactured in The Netherlands.
BibTeX:
                            @article{bernabe_vlsi05,
                              author = {Bernabé, Gregorio and González, José and García, José M},
                              title = {Reducing 3D Fast Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions},
                              journal = {Journal of VLSI Signal Processing Systems},
                              publisher = {Springer Science + Business Media, Inc. Manufactured in The Netherlands},
                              year = {2005},
                              volume = {41},
                              number = {2},
                              pages = {209--223}
                            }
                            
Fernández J, Petrini F and Frachtenberg E (2005), "Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters", In Proc. of the Int'l Workshop on System Management Tools for Large-Scale Parallel Systems, held in conjunction with IPDPS'05. Denver, USA, apr, 2005. , pp. 1-8. IEEE Computer Society.
BibTeX:
                            @inproceedings{fernandez_ipdps05,
                              author = {Fernández, Juan and Petrini, Fabrizio and Frachtenberg, Eitan},
                              title = {Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters},
                              booktitle = {Proc. of the Int'l Workshop on System Management Tools for Large-Scale Parallel Systems, held in conjunction with IPDPS'05},
                              publisher = {IEEE Computer Society},
                              year = {2005},
                              pages = {1--8}
                            }
                            
Fernández R, García JM, Bernabé G and Acacio ME (2005), "Optimizing a 3D-FWT Video Encoder for SMPs and Hyperthreading Architectures", In Proc. of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP-05). Lugano, Switzerland, feb, 2005. , pp. 76-83. IEEE Computer Society Press.
BibTeX:
                            @inproceedings{fernandez_pdp05,
                              author = {Fernández, Ricardo and García, José M and Bernabé, Gregorio and Acacio, Manuel E},
                              title = {Optimizing a 3D-FWT Video Encoder for SMPs and Hyperthreading Architectures},
                              booktitle = {Proc. of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP-05)},
                              publisher = {IEEE Computer Society Press},
                              year = {2005},
                              pages = {76--83}
                            }
                            
Frachtenberg E, Feitelson DG, Petrini F and Fernández J (2005), "Adaptive Parallel Job Scheduling with Flexible Coscheduling", IEEE Transactions on Parallel and Distributed Systems. Washington, DC, USA, nov, 2005. Vol. 16(11), pp. 1066-1077. IEEE Computer Society.
BibTeX:
                            @article{frachtenberg_tpds05,
                              author = {Frachtenberg, Eitan and Feitelson, Dror G and Petrini, Fabrizio and Fernández, Juan},
                              title = {Adaptive Parallel Job Scheduling with Flexible Coscheduling},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              publisher = {IEEE Computer Society},
                              year = {2005},
                              volume = {16},
                              number = {11},
                              pages = {1066--1077}
                            }
                            
García PE, Fernández J, Petrini F and García JM (2005), "Assessing MPI Performance on QsNetII", In Proc. of the 12th Int'l EuroPVM/MPI Conference (EuroPVM/MPI'05). Sorrento, Italy, sep, 2005. , pp. 399-406. Springer-Verlag.
BibTeX:
                            @inproceedings{garcia_europvmmp05,
                              author = {García, Pablo E and Fernández, Juan and Petrini, Fabrizio and García, José Manuel},
                              title = {Assessing MPI Performance on QsNetII},
                              booktitle = {Proc. of the 12th Int'l EuroPVM/MPI Conference (EuroPVM/MPI'05)},
                              publisher = {Springer-Verlag},
                              year = {2005},
                              pages = {399--406}
                            }
                            
Ros A, Acacio ME and García JM (2005), "Diseño y Evaluación de una Arquitectura de Directorio Ligero para Multiprocesadores de Memoria Compartida Escalables", In Actas de las XVI Jornadas de Paralelismo (JP'05). Granada , pp. 91-98.
BibTeX:
                            @inproceedings{ros_jp05,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {Diseño y Evaluación de una Arquitectura de Directorio Ligero para Multiprocesadores de Memoria Compartida Escalables},
                              booktitle = {Actas de las XVI Jornadas de Paralelismo (JP'05)},
                              year = {2005},
                              pages = {91--98}
                            }
                            
Ros A, Acacio ME and García JM (2005), "A Novel Lightweight Directory Architecture for Scalable Shared-Memory Multiprocessors", In Proc. of the 11th International Euro-Par Conference. Lisbon, Portugal, aug, 2005. , pp. 582-591. Springer-Verlag.
BibTeX:
                            @inproceedings{ros_europar05,
                              author = {Ros, Alberto and Acacio, Manuel E and García, José M},
                              title = {A Novel Lightweight Directory Architecture for Scalable Shared-Memory Multiprocessors},
                              booktitle = {Proc. of the 11th International Euro-Par Conference},
                              publisher = {Springer-Verlag},
                              year = {2005},
                              pages = {582--591}
                            }
                            
Villa FJ, Acacio ME and García JM (2005), "Evaluating IA-32 Web Servers Through SIMCS: A Practical Experience", Journal of Systems Architecture., jan, 2005. Vol. 51(4), pp. 251-264. Elsevier Science Publishers.
BibTeX:
                            @article{villa_jsa05,
                              author = {Villa, Francisco J and Acacio, Manuel E and García, José M},
                              title = {Evaluating IA-32 Web Servers Through SIMCS: A Practical Experience},
                              journal = {Journal of Systems Architecture},
                              publisher = {Elsevier Science Publishers},
                              year = {2005},
                              volume = {51},
                              number = {4},
                              pages = {251--264}
                            }
                            
Villa FJ, Acacio ME and García JM (2005), "Memory Subsystem Characterization in a 16-Core Snoop-Based Chip-Multiprocessor Architecture", In Proc. of the 2005 International Conference on High Performance Computing and Communications (HPCC 2005). Sorrento, Italy, sep, 2005. , pp. 223-232. Springer-Verlag.
BibTeX:
                            @inproceedings{villa_hpcc05,
                              author = {Villa, Francisco J and Acacio, Manuel E and García, José M},
                              title = {Memory Subsystem Characterization in a 16-Core Snoop-Based Chip-Multiprocessor Architecture},
                              booktitle = {Proc. of the 2005 International Conference on High Performance Computing and Communications (HPCC 2005)},
                              publisher = {Springer-Verlag},
                              year = {2005},
                              pages = {223--232}
                            }
                            
Acacio ME, González J, García JM and Duato J (2004), "An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration", IEEE Transactions on Parallel and Distributed Systems., dec, 2004. Vol. 15(8), pp. 755-768. IEEE Computer Society.
BibTeX:
                            @article{acacio_tpds04,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {An Architecture for High-Performance Scalable Shared-Memory Multiprocessors Exploiting On-Chip Integration},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              publisher = {IEEE Computer Society},
                              year = {2004},
                              volume = {15},
                              number = {8},
                              pages = {755--768}
                            }
                            
Acacio ME, González J, García JM and Duato J (2004), "An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration", IEEE Transactions on Parallel and Distributed Systems. Vol. 15(8), pp. 755-768.
Abstract: Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture, as well as it adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by having the second and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in several cases in the application execution time. © 2004 IEEE.
BibTeX:
                            @article{Acacio2004,
                              author = {Acacio, Manuel E. and González, José and García, José M. and Duato, José},
                              title = {An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration},
                              journal = {IEEE Transactions on Parallel and Distributed Systems},
                              year = {2004},
                              volume = {15},
                              number = {8},
                              pages = {755--768},
                              doi = {10.1109/TPDS.2004.27}
                            }
                            
Aragón JL, Nicolaescu D, Veidenbaum AV and Badulescu A-M (2004), "Energy-Efficient Design for Highly Associative Instruction Caches in Next-Generation Embedded Processors", In Proc. of the IEEE Int. Conference on Design, Automation and Test in Europe (DATE). Paris, France, feb, 2004. , pp. 1374-1375.
BibTeX:
                            @inproceedings{aragon_date04,
                              author = {Aragón, Juan L and Nicolaescu, Dan and Veidenbaum, Alexander V and Badulescu, Ana-Maria},
                              title = {Energy-Efficient Design for Highly Associative Instruction Caches in Next-Generation Embedded Processors},
                              booktitle = {Proc. of the IEEE Int. Conference on Design, Automation and Test in Europe (DATE)},
                              year = {2004},
                              pages = {1374--1375}
                            }
                            
Aragón JL and Veidenbaum AV (2004), "Low-power Fetch Unit Design for Superscalar Processors", In Actas de las XV Jornadas de Paralelismo. Almería, sep, 2004. , pp. 173-178.
BibTeX:
                            @inproceedings{aragon_jp04,
                              author = {Aragón, Juan L and Veidenbaum, Alexander V},
                              title = {Low-power Fetch Unit Design for Superscalar Processors},
                              booktitle = {Actas de las XV Jornadas de Paralelismo},
                              year = {2004},
                              pages = {173--178}
                            }
                            
Fernández J, Frachtenberg E, Petrini F, Davis K and Sancho JC (2004), "Architectural Support for System Software on Large-Scale Clusters", In Proc. of the 33rd Int'l Conference on Parallel Processing (ICPP'04). Montreal, Canada, aug, 2004. , pp. 519-528. IEEE Computer Society.
BibTeX:
                            @inproceedings{fernandez_icpp04,
                              author = {Fernández, Juan and Frachtenberg, Eitan and Petrini, Fabrizio and Davis, Kei and Sancho, José C},
                              title = {Architectural Support for System Software on Large-Scale Clusters},
                              booktitle = {Proc. of the 33rd Int'l Conference on Parallel Processing (ICPP'04)},
                              publisher = {IEEE Computer Society},
                              year = {2004},
                              pages = {519--528}
                            }
                            
Frachtenberg E, Davis K, Petrini F, Fernández J and Sancho JC (2004), "Designing Parallel Operating Systems via Parallel Programming", In Proc. of the 10th Int'l Conference on Parallel and Distributed Computing (Euro-Par'04). Pisa, Italy, aug, 2004. , pp. 689-696. Springer-Verlag.
BibTeX:
                            @inproceedings{frachtenberg_europar04,
                              author = {Frachtenberg, Eitan and Davis, Kei and Petrini, Fabrizio and Fernández, Juan and Sancho, José C},
                              title = {Designing Parallel Operating Systems via Parallel Programming},
                              booktitle = {Proc. of the 10th Int'l Conference on Parallel and Distributed Computing (Euro-Par'04)},
                              publisher = {Springer-Verlag},
                              year = {2004},
                              pages = {689--696}
                            }
                            
Piernas J, Cortes T and Carrasco JMG (2004), "Traditional File Systems versus DualFS: a Performance Comparison Approach", IEICE Transactions on Information and Systems., jul, 2004. Vol. E87-D(7), pp. 1703-1711. The Institute of Electronics, Information and Communication Engineers.
BibTeX:
                            @article{piernas_ieice_tis04,
                              author = {Piernas, Juan and Cortes, Toni and Carrasco, José M García},
                              title = {Traditional File Systems versus DualFS: a Performance Comparison Approach},
                              journal = {IEICE Transactions on Information and Systems},
                              publisher = {The Institute of Electronics, Information and Communication Engineers},
                              year = {2004},
                              volume = {E87-D},
                              number = {7},
                              pages = {1703--1711}
                            }
                            
Sancho JC, Petrini F, Johnson G, Fernández J and Frachtenberg E (2004), "On the Feasibility of Incremental Checkpointing for Scientific Applications", In Proc. of the 18th Int'l IEEE International Parallel & Distributed Processing Symposium (IPDPS'04). Santa Fe, USA, apr, 2004. , pp. 1-10. IEEE Computer Society.
BibTeX:
                            @inproceedings{sancho_ipdps04,
                              author = {Sancho, José C and Petrini, Fabrizio and Johnson, Greg and Fernández, Juan and Frachtenberg, Eitan},
                              title = {On the Feasibility of Incremental Checkpointing for Scientific Applications},
                              booktitle = {Proc. of the 18th Int'l IEEE International Parallel \& Distributed Processing Symposium (IPDPS'04)},
                              publisher = {IEEE Computer Society},
                              year = {2004},
                              pages = {1--10}
                            }
                            
Villa FJ, Acacio ME and García JM (2004), "On the Evaluation of x86 Web Servers Using Simics: Limitations and Trade-Offs", In Proc. of the Int'l Conference on Computational Science 2004 (ICCS-2004). Kraków, Poland, jun, 2004. , pp. 541-544. Springer-Verlag.
BibTeX:
                            @inproceedings{villa_iccs04,
                              author = {Villa, Francisco J and Acacio, Manuel E and García, José Manuel},
                              title = {On the Evaluation of x86 Web Servers Using Simics: Limitations and Trade-Offs},
                              booktitle = {Proc. of the Int'l Conference on Computational Science 2004 (ICCS-2004)},
                              publisher = {Springer-Verlag},
                              year = {2004},
                              pages = {541--544}
                            }
                            
Acacio ME and García JM (2003), "Techniques for Improving the Performance and Scalability of Directory-based Shared-Memory Multiprocessors: A Survey", Journal of Computer Science & Technology., oct, 2003. Vol. 3(2), pp. 1-8. Iberoamerican Science & Technology Education Consortium.
BibTeX:
                            @article{acacio_jcst03,
                              author = {Acacio, Manuel E and García, José Manuel},
                              title = {Techniques for Improving the Performance and Scalability of Directory-based Shared-Memory Multiprocessors: A Survey},
                              journal = {Journal of Computer Science \& Technology},
                              publisher = {Iberoamerican Science \& Technology Education Consortium},
                              year = {2003},
                              volume = {3},
                              number = {2},
                              pages = {1--8}
                            }
                            
Aragón JL, González J and González A (2003), "Power-Aware Control Speculation through Selective Throttling", In Proc. of the 9th IEEE International Symposium on High Performance Computer Architecture (HPCA). Anaheim, California, USA, feb, 2003. , pp. 220-229.
BibTeX:
                            @inproceedings{aragon_hpca03,
                              author = {Aragón, Juan L and González, José and González, Antonio},
                              title = {Power-Aware Control Speculation through Selective Throttling},
                              booktitle = {Proc. of the 9th IEEE International Symposium on High Performance Computer Architecture (HPCA)},
                              year = {2003},
                              pages = {220--229}
                            }
                            
Bernabé G, García JM and González J (2003), "Reducing 3D Wavelet Transform Execution Time through the Streaming SIMD Extensions", In Proc. of the 11th Euromicro Conference on Parallel Distributed and Network based Processing (PDP). Genoa, Italy, feb, 2003. , pp. 49-56. IEEE Computer Society.
BibTeX:
                            @inproceedings{bernabe_pdp03,
                              author = {Bernabé, Gregorio and García, José M and González, José},
                              title = {Reducing 3D Wavelet Transform Execution Time through the Streaming SIMD Extensions},
                              booktitle = {Proc. of the 11th Euromicro Conference on Parallel Distributed and Network based Processing (PDP)},
                              publisher = {IEEE Computer Society},
                              year = {2003},
                              pages = {49--56}
                            }
                            
Bernabé G, García JM and González J (2003), "Reducing 3D wavelet transform execution time through the Streaming SIMD Extensions", Proceedings - 11th Euromicro Conference on Parallel, Distributed and Network-Based Processing, Euro-PDP 2003. , pp. 49-56.
Abstract: This paper focuses on reducing the execution time of the video compression algorithms based on the 3D wavelet transform. We present several optimizations that could not be applied by the compiler due to the characteristics of the algorithm. First, we use the Streaming SIMD Extensions (SSE) for some of the dimensions of the sequence (y and time), in order to reduce the number of floating point instructions, exploiting data level parallelism. Then, we apply loop unrolling and data prefetching to critical parts of the code, and finally the algorithm is vectorized by columns, allowing the use of SIMD instructions for the y dimension. Results show improvements of up to 1.54 over a version compiled with the maximum optimizations of the Intel C/C++ compiler. Our experiments also show that allowing the compiler to perform some of these optimizations (i.e. automatic code vectorization) causes performance slowdown, which demonstrates the effectiveness of our optimizations.
BibTeX:
                            @article{Bernabe2003,
                              author = {Bernabé, G. and García, J. M. and González, J.},
                              title = {Reducing 3D wavelet transform execution time through the Streaming SIMD Extensions},
                              journal = {Proceedings - 11th Euromicro Conference on Parallel, Distributed and Network-Based Processing, Euro-PDP 2003},
                              year = {2003},
                              pages = {49--56},
                              doi = {10.1109/EMPDP.2003.1183565}
                            }
                            
Fernández J, Frachtenberg E and Petrini F (2003), "BCS-MPI: a New Approach in the System Software Design for Large-Scale Parallel Computers", In Proc. of the 15th Int'l ACM/IEEE Conference on Supercomputing (SC'03). Phoenix, USA, nov, 2003. , pp. 1-16. ACM/IEEE Computer Society.
BibTeX:
                            @inproceedings{fernandez_sc03,
                              author = {Fernández, Juan and Frachtenberg, Eitan and Petrini, Fabrizio},
                              title = {BCS-MPI: a New Approach in the System Software Design for Large-Scale Parallel Computers},
                              booktitle = {Proc. of the 15th Int'l ACM/IEEE Conference on Supercomputing (SC'03)},
                              publisher = {ACM/IEEE Computer Society},
                              year = {2003},
                              pages = {1--16}
                            }
                            
Frachtenberg E, Feitelson DG, Fernández J and Petrini F (2003), "Parallel Job Scheduling under Dynamic Workloads", In Proc. of the 9th Int'l Workshop on Job Scheduling Strategies for Parallel Processing, held in conjunction with HPDC'03. Seattle, USA, jun, 2003. , pp. 208-227. Springer-Verlag.
BibTeX:
                            @inproceedings{frachtenberg_hpdc03,
                              author = {Frachtenberg, Eitan and Feitelson, Dror G and Fernández, Juan and Petrini, Fabrizio},
                              title = {Parallel Job Scheduling under Dynamic Workloads},
                              booktitle = {Proc. of the 9th Int'l Workshop on Job Scheduling Strategies for Parallel Processing, held in conjunction with HPDC'03},
                              publisher = {Springer-Verlag},
                              year = {2003},
                              pages = {208--227}
                            }
                            
Frachtenberg E, Feitelson DG, Petrini F and Fernández J (2003), "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources", In Proc. of the 17th Int'l IEEE International Parallel & Distributed Processing Symposium (IPDPS 2003) (BEST PAPER in the Architectures Track). Nice, France, apr, 2003. , pp. 1-10. IEEE Computer Society.
BibTeX:
                            @inproceedings{frachtenberg_ipdps03,
                              author = {Frachtenberg, Eitan and Feitelson, Dror G and Petrini, Fabrizio and Fernández, Juan},
                              title = {Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources},
                              booktitle = {Proc. of the 17th Int'l IEEE International Parallel \& Distributed Processing Symposium (IPDPS 2003) (BEST PAPER in the Architectures Track)},
                              publisher = {IEEE Computer Society},
                              year = {2003},
                              pages = {1--10}
                            }
                            
Moody A, Fernández J, Petrini F and Panda DK (2003), "Scalable NIC-based Reduction on Large-Scale Clusters", In Proc. of the 15th Int'l ACM/IEEE Conference on Supercomputing (SC'03). Phoenix, USA, nov, 2003. , pp. 1-19. ACM/IEEE Computer Society.
BibTeX:
                            @inproceedings{moody_sc03,
                              author = {Moody, Adam and Fernández, Juan and Petrini, Fabrizio and Panda, Dhabaleswar K},
                              title = {Scalable NIC-based Reduction on Large-Scale Clusters},
                              booktitle = {Proc. of the 15th Int'l ACM/IEEE Conference on Supercomputing (SC'03)},
                              publisher = {ACM/IEEE Computer Society},
                              year = {2003},
                              pages = {1--19}
                            }
                            
Navarro AA, Vélez JA, Satizabal JE, Múnuera LE and Bernabé G (2003), "Virtual Surgical Telesimulations in Ophtalmology", In Proc. of the 17th International Congress on Computer Assisted Radiology and Surgery (CARS 2003). London, United Kingdom, jun, 2003. , pp. 145-150. Elsevier Science.
BibTeX:
                            @inproceedings{navarro_cars03,
                              author = {Navarro, Andrés A and Vélez, Jorge A and Satizabal, J E and Múnuera, Luis E and Bernabé, Gregorio},
                              title = {Virtual Surgical Telesimulations in Ophtalmology},
                              booktitle = {Proc. of the 17th International Congress on Computer Assisted Radiology and Surgery (CARS 2003)},
                              publisher = {Elsevier Science},
                              year = {2003},
                              pages = {145--150}
                            }
                            
Petrini F, Fernández J, Frachtenberg E and Coll S (2003), "Scalable Collective Communication on the ASCI Q Machine", In Proc. of the 11th Int'l Hot Interconnects Conference (HOTi'11). Stanford, USA, aug, 2003. , pp. 54-59. IEEE Computer Society.
BibTeX:
                            @inproceedings{petrini_hoti03,
                              author = {Petrini, Fabrizio and Fernández, Juan and Frachtenberg, Eitan and Coll, Salvador},
                              title = {Scalable Collective Communication on the ASCI Q Machine},
                              booktitle = {Proc. of the 11th Int'l Hot Interconnects Conference (HOTi'11)},
                              publisher = {IEEE Computer Society},
                              year = {2003},
                              pages = {54--59}
                            }
                            
Piernas J, Cortes T and García JM (2003), "Traditional versus Next-Generation Journaling File Systems: a Performance Comparison Approach", In Proc. of the 2nd Workshop on Hardware/Software Support for High Performance Scientific and Engineering Computing (SHPSEC'03), held in conjunction with PACT'03. New Orleans, USA, sep, 2003.
BibTeX:
                            @inproceedings{piernas_shpsec03,
                              author = {Piernas, Juan and Cortes, Toni and García, José M},
                              title = {Traditional versus Next-Generation Journaling File Systems: a Performance Comparison Approach},
                              booktitle = {Proc. of the 2nd Workshop on Hardware/Software Support for High Performance Scientific and Engineering Computing (SHPSEC'03), held in conjunction with PACT'03},
                              year = {2003}
                            }
                            
Vélez JA, Navarro AA, Múnuera LE and Bernabé G (2003), "A Software Architecture for Virtual Simulation of Endoscopic Surgery", In Proc. of the 17th International Congress on Computer Assisted Radiology and Surgery (CARS 2003). London, United Kingdom, jun, 2003. , pp. 151-155. Elsevier Science.
BibTeX:
                            @inproceedings{velez_cars03,
                              author = {Vélez, Jorge A and Navarro, Andrés A and Múnuera, Luis E and Bernabé, Gregorio},
                              title = {A Software Architecture for Virtual Simulation of Endoscopic Surgery},
                              booktitle = {Proc. of the 17th International Congress on Computer Assisted Radiology and Surgery (CARS 2003)},
                              publisher = {Elsevier Science},
                              year = {2003},
                              pages = {151--155}
                            }
                            
Acacio ME, Cánovas Ó, García JM and López-de-Teruel PE (2002), "MPI-Delphi: An MPI Implementation for Visual Programming Environments and Heterogeneous Computing", Journal of Future Generation Computer Systems., jan, 2002. Vol. 18(3), pp. 317-333. Elsevier Science Publishers.
BibTeX:
                            @article{acacio_fgcs02,
                              author = {Acacio, Manuel E and Cánovas, Óscar and García, José M and López-de-Teruel, Pedro E},
                              title = {MPI-Delphi: An MPI Implementation for Visual Programming Environments and Heterogeneous Computing},
                              journal = {Journal of Future Generation Computer Systems},
                              publisher = {Elsevier Science Publishers},
                              year = {2002},
                              volume = {18},
                              number = {3},
                              pages = {317--333}
                            }
                            
Acacio ME, González J, García JM and Duato J (2002), "Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture", In Proc. of the ACM/IEEE 2002 Conf. Supercomputing (SC-2002). Baltimore, USA, nov, 2002. , pp. 1-12. IEEE Computer Society.
BibTeX:
                            @inproceedings{acacio_sc02,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture},
                              booktitle = {Proc. of the ACM/IEEE 2002 Conf. Supercomputing (SC-2002)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {1--12}
                            }
                            
Acacio ME, González J, García JM and Duato J (2002), "A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors", In Proc. of the 16th Int'l Parallel & Distributed Processing Symposium (IPDPS-2002). Fort Lauderdale, USA, apr, 2002. , pp. 62-69. IEEE Computer Society.
BibTeX:
                            @inproceedings{acacio_ipdps02,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors},
                              booktitle = {Proc. of the 16th Int'l Parallel \& Distributed Processing Symposium (IPDPS-2002)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {62--69}
                            }
                            
Acacio ME, González J, García JM and Duato J (2002), "Reducing the Latency of L2 Misses in Shared-Memory Multiprocessors through On-Chip Directory Integration", In Proc. of 10th Euromicro Int'l Conference on Parallel, Distributed and Network-Based Computing (EUROMICRO-PDP'02). Las Palmas, Spain, jan, 2002. , pp. 368-375. IEEE Computer Society.
BibTeX:
                            @inproceedings{acacio_pdp02,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {Reducing the Latency of L2 Misses in Shared-Memory Multiprocessors through On-Chip Directory Integration},
                              booktitle = {Proc. of 10th Euromicro Int'l Conference on Parallel, Distributed and Network-Based Computing (EUROMICRO-PDP'02)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {368--375}
                            }
                            
Acacio ME, González J, García JM and Duato J (2002), "The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors", In Proc. of the 2002 Int'l Conference on Parallel Architectures and Compilation Techniques (PACT 2002). Charlottesville, USA, sep, 2002. , pp. 155-164. IEEE Computer Society.
BibTeX:
                            @inproceedings{acacio_pact02,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors},
                              booktitle = {Proc. of the 2002 Int'l Conference on Parallel Architectures and Compilation Techniques (PACT 2002)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {155--164}
                            }
                            
Aragón JL, González J, González A and Smith JE (2002), "Dual Path Instruction Processing", In Proc. of the 16th ACM International Conference on Supercomputing (ICS). New York, New York, USA, jun, 2002. , pp. 220-229.
BibTeX:
                            @inproceedings{aragon_ics02,
                              author = {Aragón, Juan L and González, José and González, Antonio and Smith, James E},
                              title = {Dual Path Instruction Processing},
                              booktitle = {Proc. of the 16th ACM International Conference on Supercomputing (ICS)},
                              year = {2002},
                              pages = {220--229}
                            }
                            
Bernabé G, González J, García JM and Duato J (2002), "Memory Conscious 3D Wavelet Transform", In Proc. of the 28th Euromicro Conference. Multimedia and Telecommunications. (EUROMICRO). Dortmund, Germany, sep, 2002. , pp. 108-113. IEEE Computer Society.
BibTeX:
                            @inproceedings{bernabe_euromicro02,
                              author = {Bernabé, Gregorio and González, José and García, José M and Duato, José},
                              title = {Memory Conscious 3D Wavelet Transform},
                              booktitle = {Proc. of the 28th Euromicro Conference. Multimedia and Telecommunications. (EUROMICRO)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {108--113}
                            }
                            
Fernández J, García JM and Duato J (2002), "Improving the Performance of Real-Time Communication Services on High-Speed LANs under Topology Changes", In Proc. of the 27th Int'l IEEE Conference on Local Computer Networks (LCN'02). Tampa, USA, nov, 2002. , pp. 385-394. IEEE Computer Society.
BibTeX:
                            @inproceedings{fernandez_lcn02,
                              author = {Fernández, Juan and García, José Manuel and Duato, José},
                              title = {Improving the Performance of Real-Time Communication Services on High-Speed LANs under Topology Changes},
                              booktitle = {Proc. of the 27th Int'l IEEE Conference on Local Computer Networks (LCN'02)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {385--394}
                            }
                            
Frachtenberg E, Petrini F, Fernández J and Coll S (2002), "Scalable Resource Management in High Performance Computers", In Proc. of the Int'l IEEE Conference on Cluster Computing (Cluster'02). Chicago, USA, sep, 2002. , pp. 305-314. IEEE Computer Society.
BibTeX:
                            @inproceedings{frachtenberg_cluster02,
                              author = {Frachtenberg, Eitan and Petrini, Fabrizio and Fernández, Juan and Coll, Salvador},
                              title = {Scalable Resource Management in High Performance Computers},
                              booktitle = {Proc. of the Int'l IEEE Conference on Cluster Computing (Cluster'02)},
                              publisher = {IEEE Computer Society},
                              year = {2002},
                              pages = {305--314}
                            }
                            
Frachtenberg E, Petrini F, Fernández J, Pakin S and Coll S (2002), "STORM: Lightning-Fast Resource Management", In Proc. of the 14th Int'l ACM/IEEE conference on Supercomputing (SC'02). Baltimore, USA, nov, 2002. , pp. 1-26. ACM/IEEE Computer Society.
BibTeX:
                            @inproceedings{frachtenberg_sc02,
                              author = {Frachtenberg, Eitan and Petrini, Fabrizio and Fernández, Juan and Pakin, Scott and Coll, Salvador},
                              title = {STORM: Lightning-Fast Resource Management},
                              booktitle = {Proc. of the 14th Int'l ACM/IEEE conference on Supercomputing (SC'02)},
                              publisher = {ACM/IEEE Computer Society},
                              year = {2002},
                              pages = {1--26}
                            }
                            
Piernas J, Cortes T and García JM (2002), "DualFS: a New Journaling File System without Meta-Data Duplication", In Proc. of the 16th ACM International Conference on Supercomputing (ICS'02). New York, USA, jun, 2002. , pp. 137-146. ACM Press.
BibTeX:
                            @inproceedings{piernas_ics02,
                              author = {Piernas, Juan and Cortes, Toni and García, José M},
                              title = {DualFS: a New Journaling File System without Meta-Data Duplication},
                              booktitle = {Proc. of the 16th ACM International Conference on Supercomputing (ICS'02)},
                              publisher = {ACM Press},
                              year = {2002},
                              pages = {137--146}
                            }
                            
Piernas J, Cortes T and García JM (2002), "DualFS: Toward a New Journaling File Systems", In Actas de las XIII Jornadas de Paralelismo. Lleida, Spain, pp. 299-303. Edicions de la Universitat de Lleida.
BibTeX:
                            @inproceedings{piernas_jornadas02,
                              author = {Piernas, Juan and Cortes, Toni and García, José M},
                              title = {DualFS: Toward a New Journaling File Systems},
                              booktitle = {Actas de las XIII Jornadas de Paralelismo},
                              publisher = {Edicions de la Universitat de Lleida},
                              year = {2002},
                              pages = {299--303}
                            }
                            
Acacio ME, González J, García JM and Duato J (2001), "A New Scalable Directory Architecture for Large-Scale Multiprocessors", In Proc. of the 7th Int'l Symposium on High-Performance Computer Architecture (HPCA-7). Monterrey, Mexico, jan, 2001. , pp. 97-106. IEEE Computer Society.
BibTeX:
                            @inproceedings{acacio_hpca01,
                              author = {Acacio, Manuel E and González, José and García, José Manuel and Duato, José},
                              title = {A New Scalable Directory Architecture for Large-Scale Multiprocessors},
                              booktitle = {Proc. of the 7th Int'l Symposium on High-Performance Computer Architecture (HPCA-7)},
                              publisher = {IEEE Computer Society},
                              year = {2001},
                              pages = {97--106}
                            }
                            
Aragón JL, González J, García JM and González A (2001), "Branch Prediction Reversal by Correlating with Data Values", In Actas de las XII Jornadas de Paralelismo. Valencia, sep, 2001. , pp. 9-14.
BibTeX:
                            @inproceedings{aragon_jp01,
                              author = {Aragón, Juan L and González, José and García, José M and González, Antonio},
                              title = {Branch Prediction Reversal by Correlating with Data Values},
                              booktitle = {Actas de las XII Jornadas de Paralelismo},
                              year = {2001},
                              pages = {9--14}
                            }
                            
Aragón JL, González J, García JM and González A (2001), "Confidence Estimation for Branch Prediction Reversal", In Proc. of the 8th Int. Conference on High Performance Computing (HiPC). Hyderabad, India, dec, 2001. , pp. 214-223. Springer-Verlag.
BibTeX:
                            @inproceedings{aragon_hipc01,
                              author = {Aragón, Juan L and González, José and García, José M and González, Antonio},
                              title = {Confidence Estimation for Branch Prediction Reversal},
                              booktitle = {Proc. of the 8th Int. Conference on High Performance Computing (HiPC)},
                              publisher = {Springer-Verlag},
                              year = {2001},
                              pages = {214--223}
                            }
                            
Aragón JL, González J, García JM and González A (2001), "Selective Branch Prediction Reversal by Correlating with Data Values and Control Flow", In Proc. of the 19th IEEE International Conference on Computer Design (ICCD). Austin, Texas, USA, sep, 2001. , pp. 228-233.
BibTeX:
                            @inproceedings{aragon_iccd01,
                              author = {Aragón, Juan L and González, José and García, José M and González, Antonio},
                              title = {Selective Branch Prediction Reversal by Correlating with Data Values and Control Flow},
                              booktitle = {Proc. of the 19th IEEE International Conference on Computer Design (ICCD)},
                              year = {2001},
                              pages = {228--233}
                            }
                            
Bernabé G, González J, García JM and Duato J (2001), "Enhancing the Entropy Encoder of a 3D-FWT for High-Quality Compression of Medical Video", In Proc. of IEEE International Symposium for Intelligent Signal Processing and Communication Systems (ISPACS). Nashville, USA, nov, 2001. , pp. 181-184. IEEE Signal Processing Society.
BibTeX:
                            @inproceedings{bernabe_ispacs01,
                              author = {Bernabé, Gregorio and González, José and García, José M and Duato, José},
                              title = {Enhancing the Entropy Encoder of a 3D-FWT for High-Quality Compression of Medical Video},
                              booktitle = {Proc. of IEEE International Symposium for Intelligent Signal Processing and Communication Systems (ISPACS)},
                              publisher = {IEEE Signal Processing Society},
                              year = {2001},
                              pages = {181--184}
                            }
                            
Fernández J, García JM and Duato J (2001), "A New Approach to Provide Real-Time Services on High-Speed Local Area Networks", In Proc. of the Int'l Workshop on Fault-Tolerant Parallel and Distributed Systems, held in conjunction with IPDPS'01. San Francisco, USA, apr, 2001. , pp. 1-8. IEEE Computer Society.
BibTeX:
                            @inproceedings{fernandez_ipdps01,
                              author = {Fernández, Juan and García, José Manuel and Duato, José},
                              title = {A New Approach to Provide Real-Time Services on High-Speed Local Area Networks},
                              booktitle = {Proc. of the Int'l Workshop on Fault-Tolerant Parallel and Distributed Systems, held in conjunction with IPDPS'01},
                              publisher = {IEEE Computer Society},
                              year = {2001},
                              pages = {1--8}
                            }
                            
Fernández J, García JM and Duato J (2001), "Performance Evaluation of Real-Time Communication Services on High-Speed LANs under Topology Changes", In Proc. of the 8th Int'l Conference on High Performance Computing (HiPC'01). Hyderabad, India, dec, 2001. , pp. 341-350. Springer-Verlag.
BibTeX:
                            @inproceedings{fernandez_hipc01,
                              author = {Fernández, Juan and García, José Manuel and Duato, José},
                              title = {Performance Evaluation of Real-Time Communication Services on High-Speed LANs under Topology Changes},
                              booktitle = {Proc. of the 8th Int'l Conference on High Performance Computing (HiPC'01)},
                              publisher = {Springer-Verlag},
                              year = {2001},
                              pages = {341--350}
                            }
                            
Bernabé G, González J, García JM and Duato J (2000), "A New Lossy 3-D Wavelet Transform for High-Quality Compression of Medical Video", In Proc. of IEEE EMBS International Conference on Information Technology Applications in Biomedicine (ITAB ITIS). Washington, USA, nov, 2000. , pp. 226-231. IEEE Networking the World.
BibTeX:
                            @inproceedings{bernabe_itab00,
                              author = {Bernabé, Gregorio and González, José and García, José M and Duato, José},
                              title = {A New Lossy 3-D Wavelet Transform for High-Quality Compression of Medical Video},
                              booktitle = {Proc. of IEEE EMBS International Conference on Information Technology Applications in Biomedicine (ITAB ITIS)},
                              publisher = {IEEE Networking the World},
                              year = {2000},
                              pages = {226--231}
                            }
                            
Bernabé G, González J, García JM and Duato J (2000), "Applying the 3-D Wavelet Transform to Transmit Medical Video in Telemedicine", In Proc. of the 5th World Congress on the Internet in Medicine (MEDNET). Brussels, Belgium, nov, 2000. , pp. 204-205. IOS Press.
BibTeX:
                            @inproceedings{bernabe_mednet00,
                              author = {Bernabé, Gregorio and González, José and García, José M and Duato, José},
                              title = {Applying the 3-D Wavelet Transform to Transmit Medical Video in Telemedicine},
                              booktitle = {Proc. of the 5th World Congress on the Internet in Medicine (MEDNET)},
                              publisher = {IOS Press},
                              year = {2000},
                              pages = {204--205}
                            }
                            
Gómez-Skarmeta AF, Piernas J and Delgado M (1997), "Fuzzy Clustering and Image Reduction", In Proc. of the 5th Fuzzy Days. International Conference on Computational Intelligence. Dortmund, Germany, apr, 1997. , pp. 241-249. Springer-Verlag.
BibTeX:
                            @inproceedings{gomez_fuzzy97,
                              author = {Gómez-Skarmeta, Antonio F and Piernas, Juan and Delgado, Miguel},
                              title = {Fuzzy Clustering and Image Reduction},
                              booktitle = {Proc. of the 5th Fuzzy Days. International Conference on Computational Intelligence},
                              publisher = {Springer-Verlag},
                              year = {1997},
                              pages = {241--249}
                            }
                            
Piernas J, Flores A and García JM (1997), "Analyzing the Performance of MPI in a Cluster of Workstations Based on Fast Ethernet", In Proc. of the 4th European PVM/MPI User's Group Meeting. Cracow, Poland, nov, 1997. , pp. 17-24. Springer-Verlag.
BibTeX:
                            @inproceedings{piernas_pvmmpi97,
                              author = {Piernas, Juan and Flores, Antonio and García, José M},
                              title = {Analyzing the Performance of MPI in a Cluster of Workstations Based on Fast Ethernet},
                              booktitle = {Proc. of the 4th European PVM/MPI User's Group Meeting},
                              publisher = {Springer-Verlag},
                              year = {1997},
                              pages = {17--24}
                            }