Publications from Uppsala University (uu.se)
Jimborean, Alexandra (ORCID iD: orcid.org/0000-0001-8642-2447)
Publications (10 of 27)
Shimchenko, M., Titos-Gil, R., Fernández-Pascual, R., Acacio, M. E., Kaxiras, S., Ros, A. & Jimborean, A. (2022). Analysing software prefetching opportunities in hardware transactional memory. Journal of Supercomputing, 78(1), 919-944
2022 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 78, no 1, p. 919-944. Article in journal (Refereed), Published
Abstract [en]

Hardware transactional memory emerged to make parallel programming more accessible. However, the performance pitfall of this technique is squashing speculatively executed instructions and re-executing them in case of aborts, ultimately resorting to serialization in case of repeated conflicts. A significant fraction of aborts occurs due to conflicts (concurrent reads and writes to the same memory location performed by different threads). Our proposal aims to reduce conflict aborts by reducing the window of time during which transactional regions can suffer conflicts. We achieve this by using software prefetching instructions inserted automatically at compile-time. Through these prefetch instructions, we intend to bring the necessary data for each transaction from the main memory to the cache before the transaction itself starts to execute, thus converting the otherwise long latency cache misses into hits during the execution of the transaction. The obtained results show that our approach decreases the number of aborts by 30% on average and improves performance by up to 19% and 10% for two out of the eight evaluated benchmarks. We provide insights into when our technique is beneficial given certain characteristics of the transactional regions, the advantages and disadvantages of our approach, and finally, discuss potential solutions to overcome some of its limitations.

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
Hardware transactional memory, Parallel programming, Compiler, Software prefetching
National Category
Computer Engineering; Computer Sciences
Identifiers
urn:nbn:se:uu:diva-468639 (URN); 10.1007/s11227-021-03897-z (DOI); 000657204400008
Funder
EU, Horizon 2020, 819134; Swedish Research Council, 2016-05086; European Commission, RTI2018-098156B-C53
Available from: 2022-03-01. Created: 2022-03-01. Last updated: 2024-01-15. Bibliographically approved
Tran, K.-A., Sakalis, C., Själander, M., Ros, A., Kaxiras, S. & Jimborean, A. (2020). Clearing the Shadows: Recovering Lost Performance for Invisible Speculative Execution through HW/SW Co-Design. In: PACT ’20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. Paper presented at PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 3-7, 2020 (pp. 241-254). Association for Computing Machinery (ACM)
2020 (English). In: PACT ’20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Association for Computing Machinery (ACM), 2020, p. 241-254. Conference paper, Published paper (Refereed)
Abstract [en]

Out-of-order processors heavily rely on speculation to achieve high performance, allowing instructions to bypass other slower instructions in order to fully utilize the processor's resources. Speculatively executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect the micro-architectural behavior of the system. Until recently, these changes were considered to be safe, but with the discovery of new security attacks that misuse speculative execution to leak secret information through observable micro-architectural changes (so-called side-channels), this is no longer the case. To solve this issue, a wave of software and hardware mitigations have been proposed, the majority of which delay and/or hide speculative execution until it is deemed to be safe, trading performance for security. These newly enforced restrictions change how speculation is applied and where the performance bottlenecks appear, forcing us to rethink how we design and optimize both the hardware and the software.

We observe that many of the state-of-the-art hardware solutions targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the causes of loads' unsafety, generally caused by control and memory dependence speculation. As a result, we manage to make more loads safe to execute at an early stage, which enables us to schedule more loads at a time to overlap their delays and improve performance. We apply our techniques on the state-of-the-art Delay-on-Miss hardware defense and show that we reduce the performance gap to the unsafe baseline by 53% (on average).

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Series
International Conference on Parallel Architectures and Compilation Techniques, ISSN 1089-795X
Keywords
speculative execution, side-channel attacks, caches, compiler, instruction reordering, coherence protocol
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-428516 (URN); 10.1145/3410463.3414640 (DOI); 000723645400023; 978-1-4503-8075-1 (ISBN)
Conference
PACT '20: International Conference on Parallel Architectures and Compilation Techniques, Virtual Event, GA, USA, October 3-7, 2020
Funder
Swedish Research Council, 2015-05159; Swedish Research Council, 2016-05086; Swedish Research Council, 2018-05254; EU, Horizon 2020, 819134
Available from: 2020-12-14. Created: 2020-12-14. Last updated: 2021-12-21. Bibliographically approved
Sakalis, C., Jimborean, A., Kaxiras, S. & Själander, M. (2020). Evaluating the Potential Applications of Quaternary Logic for Approximate Computing. ACM Journal on Emerging Technologies in Computing Systems, 16(1), Article ID 5.
2020 (English). In: ACM Journal on Emerging Technologies in Computing Systems, ISSN 1550-4832, E-ISSN 1550-4840, Vol. 16, no 1, article id 5. Article in journal (Refereed), Published
Abstract [en]

There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today’s complementary metal-oxide-semiconductor technologies. A common feature among the investigated technologies is that of multi-level devices, particularly the possibility of implementing quaternary logic gates and memory cells. However, for such multi-level devices to be used reliably, an increase in energy dissipation and operation time is required. Building on the principle of approximate computing, we present a set of combinational logic circuits and memory based on multi-level logic gates in which we can trade reliability against energy efficiency. Keeping the energy and timing constraints constant, important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We analyze the behavior of the logic circuits when exposed to transient errors caused as a side effect of this encoding. We also evaluate the potential benefit of the logic circuits and memory by embedding them in a conventional computer system on which we execute jpeg, sobel, and blackscholes approximately. We demonstrate that blackscholes is not suitable for such a system and explain why. However, we also achieve dynamic energy reductions of 10% and 13% for jpeg and sobel, respectively, and improve execution time by 38% for sobel, while maintaining adequate output quality.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Keywords
approximate computing, quaternary
National Category
Computer Systems
Research subject
Computer Systems Sciences
Identifiers
urn:nbn:se:uu:diva-396028 (URN); 10.1145/3359620 (DOI); 000535717000005
Funder
Swedish Research Council, 2015-05159; Swedish National Infrastructure for Computing (SNIC)
Available from: 2019-10-29. Created: 2019-10-29. Last updated: 2024-02-21. Bibliographically approved
Sakalis, C., Kaxiras, S., Ros, A., Jimborean, A. & Själander, M. (2020). Understanding Selective Delay as a Method for Efficient Secure Speculative Execution. IEEE Transactions on Computers, 69(11), 1584-1595
2020 (English). In: IEEE Transactions on Computers, ISSN 0018-9340, E-ISSN 1557-9956, Vol. 69, no 11, p. 1584-1595. Article in journal (Refereed), Published
Abstract [en]

Since the introduction of Meltdown and Spectre, the research community has been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need to be general in nature, covering a wide array of system components, while at the same time keeping the performance, energy, area, and implementation complexity costs at a minimum. One such solution is our own delay-on-miss, which efficiently protects the memory hierarchy by i) selectively delaying speculative load instructions and ii) utilizing value prediction as an invisible form of speculation. In this article we dive deeper into delay-on-miss, offering insights into why and how it affects the performance of the system. We also reevaluate value prediction as an invisible form of speculation. Specifically, we focus on the implications that delaying memory loads has on the memory-level parallelism of the system and how this affects the value predictor and the overall performance of the system. We present new, updated results but, more importantly, we also offer deeper insight into why delay-on-miss works so well and what this means for the future of secure speculative execution.

Keywords
Speculative execution, side-channel attacks, memory, security
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-404312 (URN); 10.1109/TC.2020.3014456 (DOI); 000576255400003
Funder
Swedish Research Council, 2015-05159; Swedish Foundation for Strategic Research, SM17-0064; European Regional Development Fund (ERDF), RTI2018098156-B-C53; Swedish National Infrastructure for Computing (SNIC)
Available from: 2020-02-17. Created: 2020-02-17. Last updated: 2023-03-28. Bibliographically approved
Sakalis, C., Kaxiras, S., Ros, A., Jimborean, A. & Själander, M. (2019). Efficient invisible speculative execution through selective delay and value prediction. In: Proc. 46th International Symposium on Computer Architecture. Paper presented at ISCA 2019, June 22–26, Phoenix, AZ, USA (pp. 723-735). New York: ACM Press
2019 (English). In: Proc. 46th International Symposium on Computer Architecture, New York: ACM Press, 2019, p. 723-735. Conference paper, Published paper (Refereed)
Abstract [en]

Speculative execution, the base on which modern high-performance general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: During speculative execution, the architectural state of the system is kept unmodified, until the speculation can be verified. In the event that a misspeculation occurs, then anything that can affect the architectural state is reverted (squashed) and re-executed correctly. However, the same is not true for the microarchitectural state. Normally invisible to the user, changes to the microarchitectural state can be observed through various side-channels, with timing differences caused by the memory hierarchy being one of the most common and easy to exploit. The speculative side-channels can then be exploited to perform attacks that can bypass software and hardware checks in order to leak information. These attacks, out of which the most infamous are perhaps Spectre and Meltdown, have led to a frantic search for solutions.

In this work, we present our own solution for reducing the microarchitectural state-changes caused by speculative execution in the memory hierarchy. It is based on the observation that if we only allow accesses that hit in the L1 data cache to proceed, then we can easily hide any microarchitectural changes until after the speculation has been verified. At the same time, we propose to prevent stalls by value predicting the loads that miss in the L1. Value prediction, though speculative, constitutes an invisible form of speculation, not seen outside the core. We evaluate our solution and show that we can prevent observable microarchitectural changes in the memory hierarchy while keeping the performance and energy costs at 11% and 7%, respectively. In comparison, the current state-of-the-art solution, InvisiSpec, incurs a 46% performance loss and a 51% energy increase.

Place, publisher, year, edition, pages
New York: ACM Press, 2019
Keywords
caches, side-channel attacks, speculative execution
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-387329 (URN); 10.1145/3307650.3322216 (DOI); 000521059600056; 978-1-4503-6669-4 (ISBN)
Conference
ISCA 2019, June 22–26, Phoenix, AZ, USA
Funder
Swedish Research Council, 2015-05159; Swedish Foundation for Strategic Research, SM17-0064
Available from: 2019-06-22. Created: 2019-06-21. Last updated: 2021-10-15. Bibliographically approved
Popov, M., Jimborean, A. & Black-Schaffer, D. (2019). Efficient thread/page/parallelism autotuning for NUMA systems. In: ACM (Ed.), ICS '19: Proceedings of the ACM International Conference on Supercomputing. Paper presented at 33rd ACM International Conference on Supercomputing (ICS), Phoenix, AZ, USA, June 26–28, 2019 (pp. 342-353). New York, NY, USA: Association for Computing Machinery (ACM)
2019 (English). In: ICS '19: Proceedings of the ACM International Conference on Supercomputing / [ed] ACM, New York, NY, USA: Association for Computing Machinery (ACM), 2019, p. 342-353 (12 p.). Conference paper, Published paper (Refereed)
Abstract [en]

Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Access (NUMA) effects: memory performance depends on the location of the data and the thread. This complexity means that thread- and data-mappings have a significant impact on performance. However, it is hard to find efficient data mappings and thread configurations due to the complex interactions between applications and systems.

In this paper we explore the combined search space of thread mappings, data mappings, number of NUMA nodes, and degree-of-parallelism, per application phase, and across multiple systems. We show that there are significant performance benefits from optimizing this wide range of parameters together. However, such an optimization presents two challenges: accurately modeling the performance impact of configurations across applications and systems, and exploring the vast space of configurations. To overcome the modeling challenge, we use native execution of small, representative codelets, which reproduce the system and application interactions. To make the search practical, we build a search space by combining a range of state-of-the-art thread- and data-mapping policies.

Combining these two approaches results in a tractable search space that can be quickly and accurately evaluated without sacrificing significant performance. This search finds non-intuitive configurations that perform significantly better than previous works. With this approach we are able to achieve an average speedup of 1.97× on a four node NUMA system.

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2019. p. 12
Keywords
NUMA, autotuning, thread placement, page placement, code isolation, OpenMP, performance optimization
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-396173 (URN); 10.1145/3330345.3330376 (DOI); 000546022700031; 978-1-4503-6079-1 (ISBN)
Conference
33rd ACM International Conference on Supercomputing (ICS), Phoenix, AZ, USA, June 26–28, 2019
Funder
Swedish Foundation for Strategic Research, FFL12-0051; Knut and Alice Wallenberg Foundation; EU, Horizon 2020, 715283; Swedish Foundation for Strategic Research, RIT15-0012
Available from: 2019-10-30. Created: 2019-10-30. Last updated: 2020-09-11. Bibliographically approved
Sakalis, C., Alipour, M., Ros, A., Jimborean, A., Kaxiras, S. & Själander, M. (2019). Ghost Loads: What is the cost of invisible speculation?. In: Proceedings of the 16th ACM International Conference on Computing Frontiers. Paper presented at CF 2019, April 30 – May 2, Alghero, Sardinia, Italy (pp. 153-163). New York: ACM Press
2019 (English). In: Proceedings of the 16th ACM International Conference on Computing Frontiers, New York: ACM Press, 2019, p. 153-163. Conference paper, Published paper (Refereed)
Abstract [en]

Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectural state is restored to assure functional correctness but a multitude of microarchitectural changes (e.g., cache updates), caused by the speculatively executed instructions, are commonly left in the system. These changes can be used to leak sensitive information, which has led to a frantic search for solutions that can eliminate such security flaws. The contribution of this work is an evaluation of the cost of hiding speculative side-effects in the cache hierarchy, making them visible only after the speculation has been resolved. For this, we compare (for the first time) two broad approaches: i) waiting for loads to become non-speculative before issuing them to the memory system, and ii) eliminating the side-effects of speculation, a solution consisting of invisible loads (Ghost loads) and performance optimizations (Ghost Buffer and Materialization). While previous work, InvisiSpec, has proposed a similar solution to our latter approach, it has done so with only a minimal evaluation and at a significant performance cost. The detailed evaluation of our solutions shows that: i) waiting for loads to become non-speculative is no more costly than the previously proposed InvisiSpec solution, albeit much simpler, non-invasive in the memory system, and stronger security-wise; ii) hiding speculation with Ghost loads (in the context of a relaxed memory model) can be achieved at the cost of 12% performance degradation and 9% energy increase, which is significantly better than the previous state-of-the-art solution.

Place, publisher, year, edition, pages
New York: ACM Press, 2019
Keywords
speculation, security, side-channel attacks, caches
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383173 (URN); 10.1145/3310273.3321558 (DOI); 000474686400019; 978-1-4503-6685-4 (ISBN)
Conference
CF 2019, April 30 – May 2, Alghero, Sardinia, Italy
Funder
Swedish Research Council, 2015-05159; Swedish National Infrastructure for Computing (SNIC)
Available from: 2019-05-10. Created: 2019-05-10. Last updated: 2021-10-15. Bibliographically approved
Jimborean, A., Ekemark, P., Waern, J., Kaxiras, S. & Ros, A. (2018). Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation. IEEE Transactions on Parallel and Distributed Systems, 29(3), 527-541
2018 (English). In: IEEE Transactions on Parallel and Distributed Systems, ISSN 1045-9219, E-ISSN 1558-2183, Vol. 29, no 3, p. 527-541. Article in journal (Refereed), Published
Abstract [en]

Data-race-free (DRF) parallel programming becomes a standard as newly adopted memory models of mainstream programming languages such as C++ or Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free (xDRF) regions, namely regions of code that provide the same guarantees as the synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges and preserve the data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. We further enlarge xDRF regions with a conflict isolation (CI) technique, delineating what we call xDRF-CI regions while preserving the same properties as xDRF regions. Our compiler (1) precisely analyzes the threads' memory accessing behavior and data sharing in shared-memory, general-purpose parallel applications, (2) isolates data-sharing and (3) marks the limits of xDRF-CI code regions. The contribution of this work consists in a simple but effective method to alleviate the drawbacks of the compiler's conservative nature in order to be competitive with (and even surpass) an expert in delineating xDRF regions manually. We evaluate the potential of our technique by employing xDRF and xDRF-CI region classification in a state-of-the-art, dual-mode cache coherence protocol. We show that xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.4 percent) and energy efficiency (12.2 percent) compared to a standard directory-based coherence protocol. Enhancing the xDRF analysis with the conflict isolation technique improves performance by 7.1 percent and energy efficiency by 15.9 percent.

Place, publisher, year, edition, pages
IEEE COMPUTER SOC, 2018
Keywords
Compile-time analysis, inter-procedural analysis, inter-thread analysis, data sharing, data races, cache coherence
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-348845 (URN); 10.1109/TPDS.2017.2771509 (DOI); 000425173200004
Funder
Swedish Research Council, 2016-05086
Available from: 2018-04-25. Created: 2018-04-25. Last updated: 2023-10-31. Bibliographically approved
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2018). Static instruction scheduling for high performance on limited hardware. IEEE Transactions on Computers, 67(4), 513-527
2018 (English). In: IEEE Transactions on Computers, ISSN 0018-9340, E-ISSN 1557-9956, Vol. 67, no 4, p. 513-527. Article in journal (Refereed), Published
Abstract [en]

Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic block boundaries, Clairvoyance overlaps the outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining compute-bound applications' high performance.

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-334011 (URN); 10.1109/TC.2017.2769641 (DOI); 000427420800005
Projects
UPMARC
Funder
Swedish Research Council, 2016-05086
Available from: 2017-11-03. Created: 2017-11-20. Last updated: 2023-03-28. Bibliographically approved
Tran, K.-A., Jimborean, A., Carlson, T. E., Koukos, K., Själander, M. & Kaxiras, S. (2018). SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. Paper presented at PLDI 2018, the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22 2018, Philadelphia, USA (pp. 328-343). Association for Computing Machinery (ACM)
2018 (English). In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Association for Computing Machinery (ACM), 2018, p. 328-343. Conference paper, Published paper (Refereed)
Abstract [en]

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.

In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory and instruction level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation that can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, thus being able to transform applications that include highly complex control-flow and indirect memory accesses.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-361359 (URN); 10.1145/3192366.3192393 (DOI); 000452469600023; 978-1-4503-5698-5 (ISBN)
Conference
PLDI 2018, the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22 2018, Philadelphia, USA
Projects
UPMARC
Funder
Swedish Research Council, 2016-05086
Available from: 2018-09-23. Created: 2018-09-23. Last updated: 2020-01-17. Bibliographically approved
Projects
Optimizing for performance and energy efficiency with speculative compilers and co-designed hardware [2016-05086_VR]; Uppsala University
Identifiers
ORCID iD: orcid.org/0000-0001-8642-2447
