Publications from Uppsala University
Publications (10 of 91)
Kvalsvik, A. B., Aimoniotis, P., Kaxiras, S. & Själander, M. (2023). Doppelganger Loads: A Safe, Complexity-Effective Optimization for Secure Speculation Schemes. In: ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture: . Paper presented at 50th Annual International Symposium on Computer Architecture (ISCA), JUN 17-21, 2023, Orlando, FL, USA. New York, NY: Association for Computing Machinery (ACM), Article ID 53.
Doppelganger Loads: A Safe, Complexity-Effective Optimization for Secure Speculation Schemes
2023 (English). In: ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, New York, NY: Association for Computing Machinery (ACM), 2023, article id 53. Conference paper, Published paper (Refereed).
Abstract [en]

Speculative side-channel attacks have forced computer architects to rethink speculative execution. Effectively preventing microarchitectural state from leaking sensitive information will be a key requirement in future processor design.

An important limitation of many secure speculation schemes is a reduction in the available memory parallelism: loads deemed unsafe (the definition depends on the particular scheme) are blocked because they might leak information. Our contribution is to show that it is possible to recover some of this lost memory parallelism by safely predicting the addresses of these loads in a threat-model-transparent way, i.e., without weakening the security guarantees of the underlying secure scheme. To demonstrate the generality of the approach, we apply it to three different secure speculation schemes: Non-speculative Data Access (NDA), Speculative Taint Tracking (STT), and Delay-on-Miss (DoM).

An address predictor is trained on non-speculative data and can afterwards predict the addresses of unsafe, slow-to-issue loads, preloading their target registers with speculative values that, on correct predictions, can be released faster than restarting the entire load. This new perspective on speculative execution encompasses all loads and delivers speedups independently of prefetching.

We call the address-predicted counterparts of loads Doppelganger Loads. They give notable performance improvements for the three secure speculation schemes we evaluate, NDA, STT, and DoM. The Doppelganger Loads reduce the geometric mean slowdown by 42%, 48%, and 30% respectively, as compared to an unsafe baseline, for a wide variety of SPEC2006 and SPEC2017 benchmarks. Furthermore, Doppelganger Loads can be efficiently implemented with only minor core modifications, reusing existing resources such as a stride prefetcher, and most importantly, requiring no changes to the memory hierarchy outside the core.
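As a concrete illustration of the mechanism described above, the following is a minimal C sketch of a stride-based address predictor of the kind an existing stride prefetcher already provides; the table size, indexing, confidence policy, and function names are illustrative assumptions, not the paper's exact design.

```c
/* Minimal sketch of a stride-based address predictor; sizes, indexing, and
 * the confidence policy are illustrative assumptions, not the exact
 * Doppelganger design. */
#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 256

typedef struct {
    uint64_t last_addr;   /* address of the last non-speculative instance */
    int64_t  stride;      /* observed stride between instances            */
    uint8_t  confidence;  /* saturating counter; predict only when high   */
} addr_pred_entry_t;

static addr_pred_entry_t table[PRED_ENTRIES];

static inline unsigned idx(uint64_t pc) { return (pc >> 2) & (PRED_ENTRIES - 1); }

/* Train only on loads that have become non-speculative (safe to observe). */
void predictor_train(uint64_t pc, uint64_t addr)
{
    addr_pred_entry_t *e = &table[idx(pc)];
    int64_t new_stride = (int64_t)(addr - e->last_addr);
    if (new_stride == e->stride && e->confidence < 3) {
        e->confidence++;
    } else if (new_stride != e->stride) {
        e->stride = new_stride;
        e->confidence = 0;
    }
    e->last_addr = addr;
}

/* For an unsafe load that the secure scheme would delay, return a predicted
 * address so a "doppelganger" access can run ahead; the real load still
 * verifies the prediction before anything is released architecturally. */
bool predictor_predict(uint64_t pc, uint64_t *pred_addr)
{
    const addr_pred_entry_t *e = &table[idx(pc)];
    if (e->confidence < 3)
        return false;
    *pred_addr = e->last_addr + (uint64_t)e->stride;
    return true;
}
```

The key property in the paper's setting is that training uses only non-speculative loads, so the predictor itself does not become a new speculative leak.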

Place, publisher, year, edition, pages
New York, NY: Association for Computing Machinery (ACM), 2023
Series
Conference Proceedings Annual International Symposium on Computer Architecture, ISSN 1063-6897
Keywords
computer architecture, security, speculative side-channels, spectre
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-509800 (URN); 10.1145/3579371.3589088 (DOI); 001098723900053 (); 979-8-4007-0095-8 (ISBN)
Conference
50th Annual International Symposium on Computer Architecture (ISCA), JUN 17-21, 2023, Orlando, FL, USA
Funder
Vinnova, 2021-02422; Swedish Research Council, 2018-05254; Swedish Research Council, 2022-04959; Uppsala University; Swedish Foundation for Strategic Research, FUS21-0067
Available from: 2023-08-22. Created: 2023-08-22. Last updated: 2024-02-21. Bibliographically approved.
Chen, X., Aimoniotis, P. & Kaxiras, S. (2023). How addresses are made. In: 2023 IEEE International Symposium on Workload Characterization, IISWC: . Paper presented at 26th IEEE International Symposium on Workload Characterization (IISWC), OCT 01-03, 2023, Gent, Belgium (pp. 223-225). IEEE
How addresses are made
2023 (English). In: 2023 IEEE International Symposium on Workload Characterization, IISWC, IEEE, 2023, p. 223-225. Conference paper, Published paper (Refereed).
Abstract [en]

This work uses Dynamic Information Flow Tracking (DIFT) to characterize how memory addresses are made by studying the transformation of data values into memory addresses. We show that in SPEC CPU 2017 benchmarks, a high proportion of values in memory are transformed into memory addresses. The majority of the transformations are done directly without explicit arithmetic instructions. Most of the addresses are made from one or more loaded values.
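The kind of transformation the abstract refers to is illustrated by the hypothetical C fragments below (invented for this example): a loaded value is either used directly as the next address (pointer chasing) or folded into address generation when indexing an array.

```c
/* Illustration of the phenomenon the paper characterizes: loaded data values
 * becoming memory addresses. The data structures are made up for the example. */
#include <stddef.h>

struct node { int payload; struct node *next; };

int sum_list(const struct node *head)
{
    int sum = 0;
    while (head) {
        sum += head->payload;
        head = head->next;   /* a loaded value (next) is used directly as the
                                next address, with no explicit arithmetic      */
    }
    return sum;
}

void histogram_bump(int *hist, const unsigned char *keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hist[keys[i]]++;     /* a loaded value (keys[i]) is scaled and added to
                                a base pointer, i.e., transformed into an address */
}
```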

Place, publisher, year, edition, pages
IEEE, 2023
Series
International Symposium on Workload Characterization Proceedings
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-523358 (URN); 10.1109/IISWC59245.2023.00031 (DOI); 001103166400023 (); 979-8-3503-0317-9 (ISBN); 979-8-3503-0318-6 (ISBN)
Conference
26th IEEE International Symposium on Workload Characterization (IISWC), OCT 01-03, 2023, Gent, Belgium
Funder
Swedish Research Council, 2018-05254; Vinnova, 2021-02422; Swedish Foundation for Strategic Research, FUS21-0067; Swedish Research Council, NAISS 2023/22-203; Swedish Research Council, 2022-06725
Available from: 2024-02-19. Created: 2024-02-19. Last updated: 2024-02-19. Bibliographically approved.
Song, W., Kaxiras, S., Mottola, L., Voigt, T. & Yao, Y. (2023). Silent Stores in the Battery-less Internet of Things: A Good Idea? Paper presented at International Conference on Embedded Wireless Systems and Networks.
Silent Stores in the Battery-less Internet of Things: A Good Idea?
2023 (English). Conference paper, Published paper (Refereed).
National Category
Embedded Systems
Identifiers
urn:nbn:se:uu:diva-509586 (URN)
Conference
International Conference on Embedded Wireless Systems and Networks
Available from: 2023-08-21. Created: 2023-08-21. Last updated: 2023-08-21.
Feliu, J., Ros, A., Acacio, M. E. & Kaxiras, S. (2023). Speculative inter-thread store-to-load forwarding in SMT architectures. Journal of Parallel and Distributed Computing, 173, 94-106
Speculative inter-thread store-to-load forwarding in SMT architectures
2023 (English). In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 173, p. 94-106. Article in journal (Refereed), Published.
Abstract [en]

Applications running on out-of-order cores have benefited for decades from store-to-load forwarding, which accelerates the communication of store values to loads of the same thread. Although threads running on a simultaneous multithreading (SMT) core could also access the load queues (LQ) and store queues (SQ) / store buffers (SB) of other threads to allow inter-thread store-to-load forwarding, this opportunity has not been exploited: if different SMT threads communicate via their LQs and SQs/SBs, write atomicity may be violated with respect to the outside world beyond the acceptable model of read-own-write-early multiple-copy atomicity (rMCA). In our prior work, we leveraged this idea to propose inter-thread store-to-load forwarding (ITSLF). ITSLF accelerates synchronization and communication between threads running on a simultaneous multithreading processor by allowing stores in the store queue of one thread to forward data to loads of another thread running on the same core without violating rMCA. In this work, we extend the original ITSLF mechanism to allow inter-thread forwarding from speculative stores (Spec-ITSLF). Spec-ITSLF forwards store values to other threads earlier, which further accelerates synchronization. Spec-ITSLF outperforms a baseline SMT core by 15%, which is 2% better on average (and up to 5% for the TATP workload) than the original ITSLF mechanism. More importantly, Spec-ITSLF is on par with the original ITSLF mechanism in terms of storage overhead but does not need to keep track of the speculative state of stores, which was an important source of overhead and complexity in the original mechanism.
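The mechanism is entirely in hardware and invisible to software; the hypothetical C11 fragment below only shows the kind of fine-grained, same-core producer/consumer handoff whose latency inter-thread store-to-load forwarding reduces (the names shared_value and ready are illustrative).

```c
/* Flag-based handoff between two threads running on the same SMT core; the
 * comments describe how ITSLF, as described in the abstract, would serve the
 * consumer's loads from the producer's store queue. */
#include <stdatomic.h>

static _Atomic int ready = 0;
static int shared_value;

void producer(int v)
{
    shared_value = v;                            /* store: may still sit in the SB */
    atomic_store_explicit(&ready, 1,
                          memory_order_release); /* store: may still sit in the SB */
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;   /* with inter-thread forwarding, this load can be satisfied from the
               sibling thread's store queue before the store reaches the cache  */
    return shared_value;
}
```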

Place, publisher, year, edition, pages
Elsevier, 2023
Keywords
Simultaneous multithreading, Memory consistency, Store-to-load forwarding, Multiple-copy atomicity
National Category
Computer Sciences Computer Engineering
Identifiers
urn:nbn:se:uu:diva-492103 (URN); 10.1016/j.jpdc.2022.11.007 (DOI); 000891766200008 ()
Funder
EU, European Research Council, 819134; Swedish Research Council, 2018-05254
Available from: 2023-01-31. Created: 2023-01-31. Last updated: 2024-01-15. Bibliographically approved.
Shimchenko, M., Titos-Gil, R., Fernández-Pascual, R., Acacio, M. E., Kaxiras, S., Ros, A. & Jimborean, A. (2022). Analysing software prefetching opportunities in hardware transactional memory. Journal of Supercomputing, 78(1), 919-944
Analysing software prefetching opportunities in hardware transactional memory
2022 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 78, no 1, p. 919-944. Article in journal (Refereed), Published.
Abstract [en]

Hardware transactional memory emerged to make parallel programming more accessible. However, the performance pitfall of this technique is squashing speculatively executed instructions and re-executing them in case of aborts, ultimately resorting to serialization in case of repeated conflicts. A significant fraction of aborts occurs due to conflicts (concurrent reads and writes to the same memory location performed by different threads). Our proposal aims to reduce conflict aborts by reducing the window of time during which transactional regions can suffer conflicts. We achieve this by using software prefetching instructions inserted automatically at compile-time. Through these prefetch instructions, we intend to bring the necessary data for each transaction from the main memory to the cache before the transaction itself starts to execute, thus converting the otherwise long latency cache misses into hits during the execution of the transaction. The obtained results show that our approach decreases the number of aborts by 30% on average and improves performance by up to 19% and 10% for two out of the eight evaluated benchmarks. We provide insights into when our technique is beneficial given certain characteristics of the transactional regions, the advantages and disadvantages of our approach, and finally, discuss potential solutions to overcome some of its limitations.
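A hand-written sketch of the transformation the compile-time pass performs automatically is shown below; it assumes x86 RTM intrinsics (compile with -mrtm), a hypothetical fallback lock, and an illustrative data layout, and is not code from the paper.

```c
/* Prefetch a transaction's data set before _xbegin(), so otherwise-conflicting
 * accesses inside the transaction hit in the cache and the conflict window
 * shrinks. fallback_lock_*() are assumed to exist elsewhere. */
#include <immintrin.h>
#include <stddef.h>

extern void fallback_lock_acquire(void);
extern void fallback_lock_release(void);

void tx_add(double *table, const size_t *idx, size_t n, double delta)
{
    /* Prefetch phase, inserted ahead of the transaction. */
    for (size_t i = 0; i < n; i++)
        __builtin_prefetch(&table[idx[i]], /*write=*/1, /*locality=*/3);

    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        for (size_t i = 0; i < n; i++)
            table[idx[i]] += delta;        /* likely cache hits now */
        _xend();
    } else {
        /* Abort path: fall back to a lock, as best-effort HTM requires. */
        fallback_lock_acquire();
        for (size_t i = 0; i < n; i++)
            table[idx[i]] += delta;
        fallback_lock_release();
    }
}
```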

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Hardware transactional memory, Parallel programming, Compiler, Software prefetching
National Category
Computer Engineering Computer Sciences
Identifiers
urn:nbn:se:uu:diva-468639 (URN); 10.1007/s11227-021-03897-z (DOI); 000657204400008 ()
Funder
EU, Horizon 2020, 819134; Swedish Research Council, 2016-05086; European Commission, RTI2018-098156B-C53
Available from: 2022-03-01. Created: 2022-03-01. Last updated: 2024-01-15. Bibliographically approved.
Chen, X., Aimoniotis, P. & Kaxiras, S. (2022). Clueless: A Tool Characterising Values Leaking as Addresses. In: Proceedings of the 11th International Workshop on Hardware and Architectural Support for Security And Privacy, HASP 2022: . Paper presented at 11th International Workshop on Hardware and Architectural Support for Security and Privacy (HASP), October 1, 2022, Chicago, IL (pp. 27-34). Association for Computing Machinery (ACM)
Clueless: A Tool Characterising Values Leaking as Addresses
2022 (English). In: Proceedings of the 11th International Workshop on Hardware and Architectural Support for Security And Privacy, HASP 2022, Association for Computing Machinery (ACM), 2022, p. 27-34. Conference paper, Published paper (Refereed).
Abstract [en]

Clueless is a binary instrumentation tool that characterises explicit cache side-channel vulnerabilities of programs. It detects the transformation of data values into addresses by tracking dynamic instruction dependencies. Clueless tags data values in memory if it discovers that they are used in address calculations to further access other data. Clueless can report on the amount of data that are used as addresses at each point during execution. It can also be specifically instructed to track certain data in memory (e.g., a password) to see if they are turned into addresses at any point during execution. If they are, it returns a trace of how the tracked data are turned into addresses. We demonstrate Clueless on SPEC 2006 and characterise, for the first time, the amount of data values that are turned into addresses in these programs. We further demonstrate Clueless on a micro benchmark and on a case study. The case study is the different implementations of AES in OpenSSL: T-table, Vector Permutation AES (VPAES), and Intel Advanced Encryption Standard New Instructions (AES-NI). Clueless shows how the encryption key is transformed into addresses in the T-table implementation, while explicit cache side-channel vulnerabilities are not detected in the other implementations.
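For intuition, the simplified fragment below (not OpenSSL's actual code) shows why the T-table implementation is flagged: a secret-derived byte indexes a lookup table, turning the secret into a memory address that can leave a cache footprint.

```c
/* Simplified illustration of a first-round T-table lookup; Te0 is a stand-in
 * for the real AES lookup table, assumed to be defined elsewhere. */
#include <stdint.h>

extern const uint32_t Te0[256];   /* AES T-table (assumed provided elsewhere) */

uint32_t first_round_lookup(uint8_t plaintext_byte, uint8_t key_byte)
{
    uint8_t index = plaintext_byte ^ key_byte;  /* secret-dependent value      */
    return Te0[index];                          /* ...used here as an address  */
}

/* By contrast, AES-NI keeps the key in registers and performs the round in a
 * single instruction, so no secret value is ever turned into an address. */
```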

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2022
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-523359 (URN); 10.1145/3569562.3569566 (DOI); 001135045800004 (); 978-1-4503-9871-8 (ISBN)
Conference
11th International Workshop on Hardware and Architectural Support for Security and Privacy (HASP), October 1, 2022, Chicago, IL
Funder
Swedish Research Council, 2018-05254; Vinnova, 2021-02422; Swedish Foundation for Strategic Research, FUS21-0067
Available from: 2024-02-19. Created: 2024-02-19. Last updated: 2024-02-19. Bibliographically approved.
Sakalis, C., Kaxiras, S. & Själander, M. (2022). Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks. ACM Transactions on Architecture and Code Optimization (TACO), 20(1), Article ID 9.
Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks
2022 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 20, no 1, article id 9. Article in journal (Refereed), Published.
Abstract [en]

MicroScope and other similar microarchitectural replay attacks take advantage of the characteristics of speculative execution to trap the execution of the victim application in a loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack software that is shielded against replay, even under conditions where a side-channel attack would not be possible (e.g., in secure enclaves). At the same time, unlike speculative side-channel attacks, microarchitectural replay attacks can be used to amplify the correct path of execution, rendering many existing speculative side-channel defenses ineffective. In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defense against them. We make the observation that such attacks rely on repeated squashes of so-called "replay handles" and that the instructions causing the side-channel must reside in the same reorder buffer window as the handles. We propose Delay-on-Squash, a hardware-only technique for tracking squashed instructions and preventing them from being replayed by speculative replay handles. Our evaluation shows that it is possible to achieve full security against microarchitectural replay attacks with very modest hardware requirements while still maintaining 97% of the insecure baseline performance.
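Purely for intuition, here is a toy software model of the tracking idea sketched in the abstract; the real mechanism is hardware-only, and the filter size, indexing, and single handle counter below are simplifications invented for this example.

```c
/* Toy model: remember which instructions were recently squashed and delay
 * their speculative re-issue while an unresolved "replay handle" is in flight. */
#include <stdint.h>
#include <stdbool.h>

#define FILTER_SLOTS 1024

static uint8_t  squashed_filter[FILTER_SLOTS]; /* set when a PC was squashed    */
static unsigned unresolved_handles;            /* speculative handles in flight */

static inline unsigned slot(uint64_t pc) { return (pc >> 2) % FILTER_SLOTS; }

void on_squash(uint64_t pc)   { squashed_filter[slot(pc)] = 1; }
void on_handle_dispatch(void) { unresolved_handles++; }
void on_handle_resolve(void)  { if (unresolved_handles) unresolved_handles--; }

/* Called when an instruction is about to issue speculatively. */
bool may_issue(uint64_t pc)
{
    if (unresolved_handles == 0)
        return true;                    /* nothing left that could replay it    */
    return !squashed_filter[slot(pc)];  /* previously squashed: delay re-issue  */
}
```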

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2022
Keywords
Microarchitecture, side-channels, security, replay attacks
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-501608 (URN); 10.1145/3563695 (DOI); 000934935100009 ()
Funder
Swedish Research Council, 2015-05159; Swedish Research Council, 2018-05254
Available from: 2023-05-10. Created: 2023-05-10. Last updated: 2023-05-10. Bibliographically approved.
Asgharzadeh, A., Cebrian, J. M., Perais, A., Kaxiras, S. & Ros, A. (2022). Free Atomics: Hardware Atomic Operations without Fences. In: Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22): . Paper presented at 49th IEEE/ACM Annual International Symposium on Computer Architecture (ISCA), JUN 18-22, 2022, New York, NY (pp. 14-26). Association for Computing Machinery (ACM)
Free Atomics: Hardware Atomic Operations without Fences
2022 (English). In: Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22), Association for Computing Machinery (ACM), 2022, p. 14-26. Conference paper, Published paper (Refereed).
Abstract [en]

Atomic Read-Modify-Write (RMW) instructions are primitive synchronization operations implemented in hardware that provide the building blocks for higher-abstraction synchronization mechanisms to programmers. According to publicly available documentation, current x86 implementations serialize atomic RMW operations, i.e., the store buffer is drained before issuing atomic RMWs and subsequent memory operations are stalled until the atomic RMW commits. This serialization, carried out by memory fences, incurs a performance cost which is expected to increase with deeper pipelines. This work proposes Free atomics, a lightweight, speculative, deadlock-free implementation of atomic operations that removes the need for memory fences, thus improving performance, while preserving atomicity and consistency. Free atomics is, to the best of our knowledge, the first proposal to enable store-to-load forwarding for atomic RMWs. Free atomics only requires simple modifications and incurs a small area overhead (15 bytes). Our evaluation using gem5-20 shows that, for a 32-core configuration, Free atomics improves performance by 12.5%, on average, for a large range of parallel workloads and 25.2%, on average, for atomic-intensive parallel workloads over a fenced atomic RMW implementation.
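For context, the hypothetical C11 fragment below shows the software-visible side of the problem: on current x86 processors the atomic read-modify-write compiles to a lock-prefixed instruction with full-fence semantics, which is the serialization Free Atomics removes in hardware (no source changes are implied by the paper).

```c
/* A plain C11 atomic RMW on a hot path; names are illustrative. */
#include <stdatomic.h>

static _Atomic long counter;
static long other_data;

void hot_path(long payload)
{
    other_data = payload;           /* ordinary store, sits in the store buffer  */
    atomic_fetch_add(&counter, 1);  /* atomic RMW: on today's x86 the store
                                       buffer is drained first and younger
                                       memory operations wait for it to commit   */
}
```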

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2022
Series
Conference Proceedings Annual International Symposium on Computer Architecture, ISSN 1063-6897
Keywords
Multi-core architectures, microarchitecture, atomic Read-Modify-Write instructions, Total-Store-Order (TSO), store-to-load forwarding
National Category
Computer Engineering Computer Sciences
Identifiers
urn:nbn:se:uu:diva-485382 (URN); 10.1145/3470496.3527385 (DOI); 000852702500002 (); 978-1-4503-8610-4 (ISBN)
Conference
49th IEEE/ACM Annual International Symposium on Computer Architecture (ISCA), JUN 18-22, 2022, New York, NY
Funder
EU, Horizon 2020, 819134; Swedish Research Council, 2018-05254; EU, Horizon 2020
Available from: 2022-09-22. Created: 2022-09-22. Last updated: 2024-01-15. Bibliographically approved.
Gómez-Hernández, E. J., Cebrian, J. M., Kaxiras, S. & Ros, A. (2022). Splash-4: A Modern Benchmark Suite with Lock-Free Constructs. In: 2022 IEEE International Symposium on Workload Characterization (IISWC): . Paper presented at IEEE International Symposium on Workload Characterization (IISWC), NOV 06-08, 2022, Austin, TX (pp. 51-64). Institute of Electrical and Electronics Engineers (IEEE)
Splash-4: A Modern Benchmark Suite with Lock-Free Constructs
2022 (English). In: 2022 IEEE International Symposium on Workload Characterization (IISWC), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 51-64. Conference paper, Published paper (Refereed).
Abstract [en]

The cornerstone for the performance evaluation of computer systems is the benchmark suite. Among the many benchmark suites used in high-performance computing and multicore research, Splash-2 has been instrumental in advancing knowledge for both academia and industry. Published in 1995 and with over 5276 citations and counting, this benchmark suite is still in use to evaluate novel architectural proposals. Recently, the Splash-3 suite eliminated important performance bugs, data races, and improper synchronization that plagued Splash-2 benchmarks after the formal definition of the C memory model.

However, keeping up with architectural changes while maintaining the same workloads and algorithms (for comparative purposes) is a real challenge. Benchmark suites can misrepresent the performance characteristics of a computer system if they do not reflect the available features of the hardware, and architects may end up overestimating the impact of proposed techniques or underestimating others.

In this work we introduce a revised version of Splash-3, designated Splash-4, that introduces modern programming techniques to improve scalability on contemporary hardware. We then characterize Splash-3 and Splash-4 both in a state-of-the-art simulated architecture (Intel's Ice Lake, modeled with the gem5-20 simulator) and on a real contemporary processor (AMD's EPYC 7002 series). Our evaluation shows that for a 64-thread execution Splash-4 reduces the normalized execution time by an average of 52% and 34% for AMD's EPYC and Intel's Ice Lake, respectively.
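The fragment below is representative of the kind of rewrite involved (it is not code taken from the suite): a short lock-protected update is replaced with a lock-free C11 atomic read-modify-write, removing lock traffic on the hot path.

```c
/* Before: a global counter guarded by a lock.  After: the same update as a
 * single lock-free atomic RMW.  Names are illustrative. */
#include <stdatomic.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter_locked;

void bump_locked(void)
{
    pthread_mutex_lock(&lock);
    counter_locked++;
    pthread_mutex_unlock(&lock);
}

static _Atomic long counter_atomic;

void bump_lockfree(void)
{
    atomic_fetch_add_explicit(&counter_atomic, 1, memory_order_relaxed);
}
```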

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022
Series
Proceedings of the IEEE International Symposium on Workload Characterization, ISSN 2835-222X, E-ISSN 2835-2238
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-498071 (URN); 10.1109/IISWC55918.2022.00015 (DOI); 000904205700005 (); 978-1-6654-8798-6 (ISBN); 978-1-6654-8799-3 (ISBN)
Conference
IEEE International Symposium on Workload Characterization (IISWC), NOV 06-08, 2022, Austin, TX
Funder
EU, Horizon 2020, 819134
Available from: 2023-03-13. Created: 2023-03-13. Last updated: 2023-03-13. Bibliographically approved.
Sakalis, C., Chowdhury, Z. I., Wadle, S., Akturk, I., Ros, A., Själander, M., . . . Karpuzcu, U. R. (2021). Do Not Predict – Recompute!: How Value Recomputation Can Truly Boost the Performance of Invisible Speculation. In: 2021 International Symposium on Secure and Private Execution Environment Design (SEED): . Paper presented at 2021 International Symposium on Secure and Private Execution Environment Design (SEED), Online, September 20-21, 2021 (pp. 89-100). Institute of Electrical and Electronics Engineers (IEEE)
Do Not Predict – Recompute!: How Value Recomputation Can Truly Boost the Performance of Invisible Speculation
2021 (English). In: 2021 International Symposium on Secure and Private Execution Environment Design (SEED), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 89-100. Conference paper, Published paper (Refereed).
Abstract [en]

Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the lost performance.

However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss.

In this paper, our insight is that we can achieve the same goal as VP (increasing performance by providing the value of loads that miss) without incurring its negative side-effect (delaying the release of precious resources), if we can safely, non-speculatively, recompute a value in isolation (without being seen from the outside), so that we do not expose any information by transferring such a value via the memory hierarchy. Value Recomputation, which trades computation for data transfer, was previously proposed in an entirely different context: to reduce energy-expensive data transfers in the memory hierarchy. In this paper, we demonstrate the potential of value recomputation in relation to the Delay-on-Miss approach of hiding speculation, discuss the trade-offs, and show that we can achieve the same level of security, reaching 93% of the unsecured baseline performance (5% higher than Delay-on-Miss), and exceeding (by 3%) what even an oracular (100% accuracy and coverage) value predictor could do.
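As a software analogy only (the paper's mechanism recomputes values in hardware, in isolation, for loads that Delay-on-Miss would block), the hypothetical C fragment below contrasts fetching a previously stored result back from memory with re-deriving it from operands that are still cheaply available.

```c
/* Contrast: reload a stored result vs. recompute it from its producers.
 * All names and the cost assumptions are illustrative. */
static inline long producer(long a, long b) { return a * 7 + b; }

long with_reload(const long *spilled, long i)
{
    /* The earlier result was written to memory; fetching it back may miss in
       the cache, and a secure scheme may additionally delay the load. */
    return spilled[i];
}

long with_recomputation(long a, long b)
{
    /* The inputs are still at hand (think: in registers), so the value is
       recomputed instead of being transferred through the memory hierarchy. */
    return producer(a, b);
}
```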

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-453758 (URN); 10.1109/SEED51797.2021.00021 (DOI); 000799181700013 (); 978-1-6654-2025-9 (ISBN)
Conference
2021 International Symposium on Secure and Private Execution Environment Design (SEED), Online, September 20-21, 2021
Funder
Swedish Research Council, 2015-05159; Swedish Research Council, 2018-05254
Available from: 2021-09-22. Created: 2021-09-22. Last updated: 2022-06-28. Bibliographically approved.
Projects
Interval-Based Approach to Power Modeling in Multicores [2010-04741_VR]; Uppsala University
Efficient Modeling of Heterogeneity in the Era of Dark Silicon [2012-05332_VR]; Uppsala University
Enabling Near Data Processing for Emerging Workloads [2018-05254_VR]; Uppsala University
Don’t hack my memory: Towards efficient, ubiquitous memory protection [2022-04959_VR]; Uppsala University
Mitigating Side-Channel Attacks: Foundations and Applications [2023-05242_VR]; Uppsala University
Identifiers
ORCID iD: orcid.org/0000-0001-8267-0232
