Publications from Uppsala University (uu.se)
Tran, Kim-Anh
Publications (9 of 9)
Tran, K.-A., Sakalis, C., Själander, M., Ros, A., Kaxiras, S. & Jimborean, A. (2020). Clearing the Shadows: Recovering Lost Performance for Invisible Speculative Execution through HW/SW Co-Design. In: PACT ’20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. Paper presented at PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 3-7, 2020 (pp. 241-254). Association for Computing Machinery (ACM)
Clearing the Shadows: Recovering Lost Performance for Invisible Speculative Execution through HW/SW Co-Design
2020 (English). In: PACT ’20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Association for Computing Machinery (ACM), 2020, pp. 241-254. Conference paper, Published paper (Refereed)
Abstract [en]

Out-of-order processors rely heavily on speculation to achieve high performance, allowing instructions to bypass other, slower instructions in order to fully utilize the processor's resources. Speculatively executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect the micro-architectural behavior of the system. Until recently, these changes were considered safe, but with the discovery of new security attacks that misuse speculative execution to leak secret information through observable micro-architectural changes (so-called side channels), this is no longer the case. To solve this issue, a wave of software and hardware mitigations has been proposed, the majority of which delay and/or hide speculative execution until it is deemed safe, trading performance for security. These newly enforced restrictions change how speculation is applied and where the performance bottlenecks appear, forcing us to rethink how we design and optimize both the hardware and the software.

We observe that many of the state-of-the-art hardware solutions targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the causes of loads' unsafety, which are generally control and memory dependence speculation. As a result, we manage to make more loads safe to execute at an early stage, which enables us to schedule more loads at a time, overlap their delays, and improve performance. We apply our techniques to the state-of-the-art Delay-on-Miss hardware defense and show that we reduce the performance gap to the unsafe baseline by 53% on average.
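To make the scheduling argument concrete, the Python sketch below is a deliberately simple cost model, not the authors' simulator or their Delay-on-Miss implementation. It assumes, hypothetically, that cache hits proceed even while speculative, that misses of loads already deemed safe can be in flight together up to a fixed number of outstanding misses (the `mshrs` parameter), and that misses still in the shadow of unresolved speculation are delayed and, pessimistically, serialized. The latencies and the names `Load` and `toy_delay_on_miss_cycles` are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    misses: bool  # True if the load misses in the L1 cache
    unsafe: bool  # True if still shadowed by unresolved control/memory speculation


def toy_delay_on_miss_cycles(loads, hit_lat=4, miss_lat=200, mshrs=8):
    """Toy memory-stall estimate under a Delay-on-Miss-style policy.

    Illustrative assumptions only: speculative hits proceed at `hit_lat`,
    safe misses overlap in batches of `mshrs`, and delayed (unsafe) misses
    are serialized, each paying the full `miss_lat`.
    """
    hits = sum(1 for ld in loads if not ld.misses)
    safe_misses = sum(1 for ld in loads if ld.misses and not ld.unsafe)
    delayed_misses = sum(1 for ld in loads if ld.misses and ld.unsafe)

    overlapped = math.ceil(safe_misses / mshrs) * miss_lat  # misses in flight together
    serialized = delayed_misses * miss_lat                  # each waits its turn
    return hits * hit_lat + overlapped + serialized


shadowed = [Load(misses=True, unsafe=True)] * 8   # misses blocked by speculation
cleared = [Load(misses=True, unsafe=False)] * 8   # same misses after removing the shadows

print(toy_delay_on_miss_cycles(shadowed))  # 1600
print(toy_delay_on_miss_cycles(cleared))   # 200
```

Under these assumptions, removing the causes of unsafety for eight independent misses turns eight serialized miss latencies into one overlapped batch; real hardware behavior depends on many factors this model ignores.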

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Series
International Conference on Parallel Architectures and Compilation Techniques, ISSN 1089-795X
Keywords
speculative execution, side-channel attacks, caches, compiler, instruction reordering, coherence protocol
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-428516 (URN); 10.1145/3410463.3414640 (DOI); 000723645400023 (ISI); 978-1-4503-8075-1 (ISBN)
Conference
PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 3-7, 2020
Funder
Swedish Research Council, 2015-05159; Swedish Research Council, 2016-05086; Swedish Research Council, 2018-05254; EU, Horizon 2020, 819134
Available from: 2020-12-14 Created: 2020-12-14 Last updated: 2021-12-21. Bibliographically approved
Tran, K.-A. (2020). Finding and Exploiting Memory-Level-Parallelism in Constrained Speculative Architectures. (Doctoral dissertation). Uppsala: Acta Universitatis Upsaliensis
Finding and Exploiting Memory-Level-Parallelism in Constrained Speculative Architectures
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

One of the main performance bottlenecks of processors today is the discrepancy between processor and memory speed, known as the memory wall. While the processor executes instructions at a high pace, the memory is too slow to provide data in a timely manner. Load instructions that require an access to memory are referred to as long-latency or delinquent loads. To prevent the processor from stalling, independent instructions past the load may execute, including independent loads. Overlapping load operations, and thus their latencies, is referred to as memory-level parallelism (MLP). MLP can significantly improve performance. Today's out-of-order processors are therefore equipped with complex hardware that allows them to look into the future and to select independent loads that can be overlapped. However, the ability to choose future instructions and speculatively execute them in advance introduces complexity, increased power consumption and potential security risks. In this thesis we look at constrained speculative architectures that struggle to hide memory latencies because they are constrained by design, by their resources, or by security. We investigate ways for the compiler to help them find MLP, with the ultimate goal of avoiding processor stalls as much as possible. This includes small energy-efficient processors that lack the ability to look ahead far enough to find independent loads, but also large processors that are not allowed to speculatively execute independent loads due to security measures enforced to prevent side-channel attacks. We identify the reasons for their limitations and propose software transformations and hardware extensions to overcome their restrictions.
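Why a limited look-ahead restricts MLP can be illustrated with a toy model. This is an assumption made for illustration, not a model taken from the thesis: misses are identified by their position in the dynamic instruction stream, the oldest outstanding miss opens a window of `window` instructions, every miss inside that window overlaps with it, and each such group costs one full miss latency. The function name and latency value are hypothetical.

```python
def window_limited_stall_cycles(miss_positions, window, miss_lat=200):
    """Toy estimate of memory stall cycles for a core that can only look
    `window` instructions past the oldest outstanding miss.

    Assumptions (illustrative only): all misses are independent, each miss
    costs `miss_lat` cycles, and misses inside the same window fully overlap.
    """
    positions = sorted(miss_positions)
    groups = 0
    i = 0
    while i < len(positions):
        start = positions[i]  # oldest outstanding miss opens a new window
        groups += 1
        # Every miss that fits inside this window overlaps with the oldest one.
        while i < len(positions) and positions[i] < start + window:
            i += 1
    return groups * miss_lat


spread = [0, 50, 100, 150]   # misses far apart in program order
clustered = [0, 1, 2, 3]     # the same misses moved close together

print(window_limited_stall_cycles(spread, window=32))     # 800: no overlap
print(window_limited_stall_cycles(spread, window=256))    # 200: large window overlaps all
print(window_limited_stall_cycles(clustered, window=32))  # 200: clustering helps small cores
```

In this model, a small window finds no overlap for widely spaced misses, while either a larger window or a code layout that places the misses close together recovers it; this mirrors the intuition for why compiler help matters on constrained cores.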

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2020. p. 50
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1897
Keywords
Memory-level-parallelism, Energy-efficiency, Performance, Compiler, Instruction Scheduling, SW/HW Co-Design
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-402642 (URN); 978-91-513-0860-9 (ISBN)
Public defence
2020-09-24, ITC 1406, ITC, Lägerhyddsvägen 2, Uppsala, 10:00 (English)
Available from: 2020-02-19 Created: 2020-01-17 Last updated: 2020-09-21
Tran, K.-A. (2018). Static instruction scheduling for high performance on energy-efficient processors. (Licentiate dissertation). Uppsala University
Static instruction scheduling for high performance on energy-efficient processors
2018 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

New trends such as the internet of things and smart homes push the demand for energy efficiency. Choosing energy-efficient hardware, however, often comes at the expense of high performance. In order to strike a good balance between the two, we propose software solutions to tackle the performance bottlenecks of small, energy-efficient processors.

One of the main performance bottlenecks of processors is the discrepancy between processor and memory speed, known as the memory wall. While the processor executes instructions at a high pace, the memory is too slow to provide data in a timely manner if the data has not been cached in advance. Load instructions that require an access to memory are referred to as long-latency or delinquent loads. The long latencies caused by delinquent loads put a strain on small processors, which have few or no resources to effectively hide them. As a result, the processor may stall.

In this thesis we propose compile-time transformation techniques to mitigate the penalties of delinquent loads on small out-of-order processors, with the ultimate goal of avoiding processor stalls as much as possible. Our code transformation is applicable to general-purpose code, including code with unknown memory dependencies, complex control flow and pointers. We further propose a software-hardware co-design that combines the code transformation technique with lightweight hardware support to hide latencies on a stall-on-use in-order processor.
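Why reordering helps a stall-on-use in-order core can be shown with a small cycle-counting sketch. It is a hypothetical model, not the thesis's evaluation setup: every instruction takes one cycle to issue, every load is assumed to miss with a fixed latency, the core keeps issuing after a load, and it stalls only when an instruction uses a register whose value has not yet arrived. The tuple-based program encoding and the function name are assumptions for illustration.

```python
def stall_on_use_cycles(program, miss_lat=200):
    """Count cycles for a toy stall-on-use in-order core.

    Assumptions (illustrative): one instruction issues per cycle, every load
    misses with latency `miss_lat`, and the core stalls only at the first
    instruction that needs a value that has not arrived yet.
    """
    ready = {}  # register name -> cycle when its loaded value is available
    cycle = 0
    for op, reg in program:
        if op == "load":
            ready[reg] = cycle + miss_lat  # miss issued; keep going (no stall yet)
        elif op == "use":
            cycle = max(cycle, ready.get(reg, 0))  # stall until the value arrives
        else:
            raise ValueError(f"unknown op: {op}")
        cycle += 1  # the instruction itself issues in one cycle
    return cycle


interleaved = [("load", "a"), ("use", "a"), ("load", "b"), ("use", "b")]
hoisted = [("load", "a"), ("load", "b"), ("use", "a"), ("use", "b")]

print(stall_on_use_cycles(interleaved))  # 402: the two misses are serialized
print(stall_on_use_cycles(hoisted))      # 202: the two misses overlap
```

The only difference between the two programs is the order of independent instructions, which is exactly the freedom a compile-time scheduler has when the hardware cannot reorder on its own.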

Place, publisher, year, edition, pages
Uppsala University, 2018
Series
Information technology licentiate theses: Licentiate theses from the Department of Information Technology, ISSN 1404-5117 ; 2018-001
National Category
Computer Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-349420 (URN)
Projects
UPMARC
Available from: 2017-12-18 Created: 2018-04-26 Last updated: 2019-02-25. Bibliographically approved
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2018). Static instruction scheduling for high performance on limited hardware. IEEE Transactions on Computers, 67(4), 513-527
Static instruction scheduling for high performance on limited hardware
2018 (English). In: IEEE Transactions on Computers, ISSN 0018-9340, E-ISSN 1557-9956, Vol. 67, no. 4, pp. 513-527. Article in journal (Refereed), Published
Abstract [en]

Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic block boundaries, Clairvoyance overlaps their outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining the high performance of compute-bound applications.
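The basic idea of clustering loads so that their misses can overlap is sketched below at source level. This is a generic illustration, not Clairvoyance's transformation, which works on compiler IR and additionally handles unknown dependencies, control flow and register pressure. The sketch unrolls a loop by a hypothetical `batch` factor and hoists the indirect loads of each batch ahead of their uses; it assumes nothing in the loop writes to `a` or `idx`, which is what makes the hoisting legal here. All names are illustrative.

```python
def sum_squares_original(a, idx):
    total = 0
    for i in range(len(idx)):
        x = a[idx[i]]      # potentially long-latency indirect load
        total += x * x     # used immediately: a small core may stall here
    return total


def sum_squares_clustered(a, idx, batch=4):
    # Assumption: nothing in the loop writes to `a` or `idx`, so loads can be
    # hoisted above earlier uses without changing the result.
    total = 0
    for base in range(0, len(idx), batch):
        end = min(base + batch, len(idx))
        # Access part: issue all loads of the batch back to back.
        values = [a[idx[i]] for i in range(base, end)]
        # Execute part: consume the loaded values.
        for x in values:
            total += x * x
    return total


a = [3, 1, 4, 1, 5, 9, 2, 6]
idx = [7, 0, 5, 2, 6]
print(sum_squares_original(a, idx))            # 146
print(sum_squares_clustered(a, idx, batch=2))  # 146
```

In Python both functions run at essentially the same speed; the point is the shape of the code, where a compiled equivalent would let a simple core have several misses in flight before the first use.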

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-334011 (URN); 10.1109/TC.2017.2769641 (DOI); 000427420800005 (ISI)
Projects
UPMARC
Funder
Swedish Research Council, 2016-05086
Available from: 2017-11-03 Created: 2017-11-20 Last updated: 2023-03-28. Bibliographically approved
Tran, K.-A., Jimborean, A., Carlson, T. E., Koukos, K., Själander, M. & Kaxiras, S. (2018). SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. Paper presented at PLDI 2018, the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22, 2018, Philadelphia, USA (pp. 328-343). Association for Computing Machinery (ACM)
SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores
2018 (English). In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Association for Computing Machinery (ACM), 2018, pp. 328-343. Conference paper, Published paper (Refereed)
Abstract [en]

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both high performance and high energy efficiency.

In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory- and instruction-level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we rely on neither software nor hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, which enables it to transform applications that include highly complex control flow and indirect memory accesses.
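Decoupled access-execution, one of the techniques the abstract builds on, separates a loop into an access stream that issues loads early and an execute stream that consumes their values. The sketch below models that producer-consumer split in plain Python with a FIFO (`collections.deque`) and a hypothetical run-ahead `distance`. It is a generic illustration of decoupled access-execute, not the SWOOP compiler's output, and it omits the lightweight hardware support the paper relies on; the function name is an assumption.

```python
from collections import deque


def decoupled_access_execute(a, idx, distance=4):
    """Compute sum(a[j] ** 2 for j in idx) with a decoupled access/execute split.

    Illustrative sketch: the access stream runs up to `distance` iterations
    ahead of the execute stream and hands loaded values over through a FIFO.
    """
    queue = deque()   # values loaded by the access stream, in iteration order
    n = len(idx)
    next_access = 0   # next iteration whose load the access stream will issue
    total = 0
    for i in range(n):
        # Access stream: issue loads for iterations i .. i + distance.
        while next_access < n and next_access <= i + distance:
            queue.append(a[idx[next_access]])
            next_access += 1
        # Execute stream: consume the value belonging to iteration i.
        x = queue.popleft()
        total += x * x
    return total


print(decoupled_access_execute([1, 2, 3, 4, 5], [4, 0, 2]))  # 35
```

Because the FIFO preserves iteration order, the result is independent of `distance`; only how far the loads run ahead of their uses changes, which is what would let misses overlap on real hardware.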

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-361359 (URN); 10.1145/3192366.3192393 (DOI); 000452469600023 (ISI); 978-1-4503-5698-5 (ISBN)
Conference
PLDI 2018, the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22, 2018, Philadelphia, USA
Projects
UPMARC
Funder
Swedish Research Council, 2016-05086
Available from: 2018-09-23 Created: 2018-09-23 Last updated: 2020-01-17. Bibliographically approved
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2017). Clairvoyance: Look-ahead compile-time scheduling. In: Proc. 15th International Symposium on Code Generation and Optimization. Paper presented at CGO 2017, February 4–8, Austin, TX (pp. 171-184). Piscataway, NJ: IEEE Press
Clairvoyance: Look-ahead compile-time scheduling
2017 (English). In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, pp. 171-184. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-316480 (URN); 000402548700015 (ISI); 978-1-5090-4931-8 (ISBN)
Conference
CGO 2017, February 4–8, Austin, TX
Projects
UPMARC
Funder
Swedish Research Council, 2010-4741
Available from: 2017-02-04 Created: 2017-03-01 Last updated: 2020-01-17. Bibliographically approved
Carlson, T. E., Tran, K.-A., Jimborean, A., Koukos, K., Själander, M. & Kaxiras, S. (2017). Transcending hardware limits with software out-of-order processing. IEEE Computer Architecture Letters, 16(2), 162-165
Transcending hardware limits with software out-of-order processing
2017 (English). In: IEEE Computer Architecture Letters, ISSN 1556-6056, Vol. 16, no. 2, pp. 162-165. Article in journal (Refereed), Published
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-334012 (URN); 10.1109/LCA.2017.2672559 (DOI); 000418870500018 (ISI)
Projects
UPMARC
Available from: 2017-02-22 Created: 2017-11-20 Last updated: 2018-04-26. Bibliographically approved
Tran, K.-A. (2016). Software Out-of-Order Execution for In-Order Architectures. In: Proc. 25th International Conference on Parallel Architectures and Compilation Techniques. Paper presented at PACT 2016, September 11–15, Haifa, Israel (p. 458). New York: ACM Press
Software Out-of-Order Execution for In-Order Architectures
2016 (English). In: Proc. 25th International Conference on Parallel Architectures and Compilation Techniques, New York: ACM Press, 2016, p. 458. Conference paper, Poster (with or without abstract) (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2016
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-309768 (URN); 10.1145/2967938.2971466 (DOI); 000392249100055 (ISI); 978-1-4503-4121-9 (ISBN)
Conference
PACT 2016, September 11–15, Haifa, Israel
Projects
UPMARC
Available from: 2016-09-11 Created: 2016-12-07 Last updated: 2018-04-26. Bibliographically approved
Tran, K.-A., Sakalis, C., Själander, M., Ros, A., Kaxiras, S. & Jimborean, A. Clearing the Shadows: Recovering Lost Performance for Invisible Speculative Execution through HW/SW Co-Design.
Clearing the Shadows: Recovering Lost Performance for Invisible Speculative Execution through HW/SW Co-Design
(English). In: Article in journal (Other academic), Submitted
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-402638 (URN)
Funder
Swedish Research Council, 2016-05086; Swedish National Infrastructure for Computing (SNIC), 2019/3-227
Available from: 2020-01-17 Created: 2020-01-17 Last updated: 2020-02-04. Bibliographically approved