Uppsala University Publications
Publications (10 of 69)
Alipour, M., Kumar, R., Kaxiras, S. & Black-Schaffer, D. (2020). Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors. In: IEEE (Ed.), The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA). Paper presented at The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Feb. 22-26, 2020, San Diego, CA, USA.
2020 (English). In: The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA) / [ed] IEEE, 2020. Conference paper, Published paper (Refereed)
Abstract [en]

Flexible instruction scheduling is essential for performance in out-of-order processors. This is typically achieved by using CAM-based Instruction Queues (IQs) that provide complete flexibility in choosing ready instructions for execution, but at the cost of significant scheduling energy.

In this work we seek to reduce the instruction scheduling energy by reducing the depth and width of the IQ. We do so by classifying instructions based on their readiness and criticality, and using this information to bypass the IQ for instructions that will not benefit from its expensive scheduling structures and to delay instructions that can wait without harming performance. Combined, these approaches allow us to offload a significant portion of the instructions from the IQ to much cheaper FIFO-based scheduling structures without hurting performance. As a result we can reduce the IQ depth and width by half, thereby saving energy.

Our design, Delay and Bypass (DNB), is the first design to explicitly address both readiness and criticality to reduce scheduling energy. By handling both classes we are able to achieve 95% of the baseline out-of-order performance while only using 33% of the scheduling energy. This represents a significant improvement over previous designs which addressed only criticality or readiness (91%/89% performance at 74%/53% energy).
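To make the steering idea concrete, here is a minimal sketch of the dispatch-time classification the abstract describes; all names (Inst, Queue, steer) and the exact rule are our own illustrative assumptions, not the paper's implementation:

```cpp
#include <cstdio>

// Hypothetical instruction descriptor; field names are illustrative only.
struct Inst {
    bool operands_ready;  // both source operands available at dispatch
    bool critical;        // e.g., feeds a long dependence chain
};

enum class Queue { OutOfOrderIQ, BypassFIFO, DelayFIFO };

// Sketch of the dispatch-time steering idea: ready instructions gain nothing
// from out-of-order wakeup (bypass the IQ), non-critical instructions can
// tolerate in-order delay (delay FIFO), and only the remainder consume
// entries in the expensive CAM-based IQ.
Queue steer(const Inst& in) {
    if (in.operands_ready) return Queue::BypassFIFO;
    if (!in.critical)      return Queue::DelayFIFO;
    return Queue::OutOfOrderIQ;
}

int main() {
    Inst ready_add{true, false}, critical_load{false, true};
    std::printf("%d %d\n", (int)steer(ready_add), (int)steer(critical_load));
}
```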

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-403674 (URN)
Conference
The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Feb. 22-26, 2020, San Diego, CA, USA
Available from: 2020-02-02. Created: 2020-02-02. Last updated: 2020-03-24. Bibliographically approved.
Sakalis, C., Kaxiras, S., Ros, A., Jimborean, A. & Själander, M. (2019). Efficient invisible speculative execution through selective delay and value prediction. In: Proc. 46th International Symposium on Computer Architecture. Paper presented at ISCA 2019, June 22–26, Phoenix, AZ, USA (pp. 723-735). New York: ACM Press
2019 (English). In: Proc. 46th International Symposium on Computer Architecture, New York: ACM Press, 2019, p. 723-735. Conference paper, Published paper (Refereed)
Abstract [en]

Speculative execution, the base on which modern high-performance, general-purpose CPUs are built, has recently been shown to enable a slew of security attacks. All these attacks are centered around a common set of behaviors: during speculative execution, the architectural state of the system is kept unmodified until the speculation can be verified. In the event of a misspeculation, anything that can affect the architectural state is reverted (squashed) and re-executed correctly. However, the same is not true for the microarchitectural state. Normally invisible to the user, changes to the microarchitectural state can be observed through various side-channels, with timing differences caused by the memory hierarchy being one of the most common and easiest to exploit. These speculative side-channels can then be exploited to perform attacks that bypass software and hardware checks in order to leak information. These attacks, of which the most infamous are perhaps Spectre and Meltdown, have led to a frantic search for solutions.

In this work, we present our own solution for reducing the microarchitectural state-changes caused by speculative execution in the memory hierarchy. It is based on the observation that if we only allow accesses that hit in the L1 data cache to proceed, then we can easily hide any microarchitectural changes until after the speculation has been verified. At the same time, we propose to prevent stalls by value predicting the loads that miss in the L1. Value prediction, though speculative, constitutes an invisible form of speculation, not seen outside the core. We evaluate our solution and show that we can prevent observable microarchitectural changes in the memory hierarchy while keeping the performance and energy costs at 11% and 7%, respectively. In comparison, the current state-of-the-art solution, InvisiSpec, incurs a 46% performance loss and a 51% energy increase.
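The load-handling policy can be sketched as follows; this is a toy model under our own assumptions (the map and helper names stand in for the L1, the value predictor, and the replay machinery, and are not the paper's interfaces):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// All names here are illustrative, not from the paper's artifact.
struct Load { uint64_t addr; bool speculative; };

std::unordered_map<uint64_t, uint64_t> l1;  // stand-in for the L1 data cache

std::optional<uint64_t> l1_lookup(uint64_t a) {
    auto it = l1.find(a);
    if (it == l1.end()) return std::nullopt;
    return it->second;
}
uint64_t value_predict(uint64_t) { return 0; }     // stand-in predictor
void delay_until_nonspeculative(const Load&) {}    // issue/validate later

// Core idea: a speculative load that hits in the L1 may proceed (no new
// microarchitectural state is created); a speculative load that misses is
// value-predicted, and the real access is delayed until speculation resolves.
uint64_t execute_load(const Load& ld) {
    if (auto hit = l1_lookup(ld.addr)) return *hit;
    if (ld.speculative) {
        delay_until_nonspeculative(ld);
        return value_predict(ld.addr);
    }
    return 0;  // non-speculative miss: normal memory access (elided)
}

int main() { l1[16] = 42; return (int)execute_load({16, true}); }
```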

Place, publisher, year, edition, pages
New York: ACM Press, 2019
Keywords
caches, side-channel attacks, speculative execution
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-387329 (URN), 10.1145/3307650.3322216 (DOI), 978-1-4503-6669-4 (ISBN)
Conference
ISCA 2019, June 22–26, Phoenix, AZ, USA
Funder
Swedish Research Council, 2015-05159; Swedish Foundation for Strategic Research, SM17-0064
Available from: 2019-06-22. Created: 2019-06-21. Last updated: 2020-01-30. Bibliographically approved.
Sakalis, C., Jimborean, A., Kaxiras, S. & Själander, M. (2019). Evaluating the Potential Applications of Quaternary Logic for Approximate Computing. ACM Journal on Emerging Technologies in Computing Systems (JETC), 16(1), Article ID 5.
2019 (English). In: ACM Journal on Emerging Technologies in Computing Systems (JETC), ISSN 1550-4832, Vol. 16, no 1, article id 5. Article in journal (Refereed), Published
Abstract [en]

There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today's complementary metal-oxide-semiconductor technologies. A common feature among the investigated technologies is that of multi-level devices, particularly the possibility of implementing quaternary logic gates and memory cells. However, for such multi-level devices to be used reliably, an increase in energy dissipation and operation time is required. Building on the principle of approximate computing, we present a set of combinational logic circuits and memory based on multi-level logic gates in which we can trade reliability against energy efficiency. Keeping the energy and timing constraints constant, important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We analyze the behavior of the logic circuits when exposed to transient errors caused as a side effect of this encoding. We also evaluate the potential benefit of the logic circuits and memory by embedding them in a conventional computer system on which we execute jpeg, sobel, and blackscholes approximately. We demonstrate that blackscholes is not suitable for such a system and explain why. However, we also achieve dynamic energy reductions of 10% and 13% for jpeg and sobel, respectively, and improve execution time by 38% for sobel, while maintaining adequate output quality.
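As a rough illustration of the mixed binary/quaternary encoding idea (our own toy encoding, not the paper's circuits), the important high bits of a byte can be kept in robust binary cells while the error-tolerant low bits share denser quaternary cells:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch: the high nibble sits in robust binary cells (one bit
// per cell), the low nibble in two quaternary cells (two bits per cell).
struct EncodedByte {
    uint8_t binary_high;    // bits 7..4, one bit per robust binary cell
    uint8_t quat_cells[2];  // bits 3..0, two quaternary digits (values 0..3)
};

EncodedByte encode(uint8_t v) {
    return { static_cast<uint8_t>(v >> 4),
             { static_cast<uint8_t>((v >> 2) & 0x3),
               static_cast<uint8_t>(v & 0x3) } };
}

uint8_t decode(const EncodedByte& e) {
    return static_cast<uint8_t>((e.binary_high << 4) |
                                (e.quat_cells[0] << 2) | e.quat_cells[1]);
}

int main() {
    EncodedByte e = encode(0xAB);
    // A transient error in a quaternary cell perturbs only the low bits:
    e.quat_cells[1] ^= 1;
    std::printf("%02X\n", decode(e));  // close to 0xAB; high bits intact
}
```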

Place, publisher, year, edition, pages
New York, NY, USA: ACM, 2019
Keywords
approximate computing, quaternary
National Category
Computer Systems
Research subject
Computer Systems Sciences
Identifiers
urn:nbn:se:uu:diva-396028 (URN), 10.1145/3359620 (DOI)
Funder
Swedish Research Council, 2015-05159
Available from: 2019-10-29. Created: 2019-10-29. Last updated: 2020-02-14. Bibliographically approved.
Alipour, M., Kumar, R., Kaxiras, S. & Black-Schaffer, D. (2019). FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors. In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). Paper presented at Design, Automation & Test in Europe Conference & Exhibition (DATE), March 25-29, 2019, Florence, Italy (pp. 716-721). IEEE
2019 (English). In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2019, p. 716-721. Conference paper, Published paper (Refereed)
Abstract [en]

The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or depth of the instruction queue is very costly due to the content-addressable logic needed to wake up and select instructions out-of-order. This work makes the observation that a large number of instructions have both operands ready at dispatch, and therefore do not benefit from out-of-order scheduling. We leverage this to place such ready-at-dispatch instructions in separate, simpler, in-order FIFO queues for scheduling. With such additional queues, we can reduce the size and width of the expensive out-of-order instruction queue without reducing the processor's overall issue width and depth. Our design, FIFOrder, is able to steer more than 60% of instructions to the cheaper FIFO queues, providing a 50% energy savings over a traditional out-of-order instruction queue design, while delivering 8% higher performance.
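The ready-at-dispatch test itself is cheap; a minimal sketch against a register scoreboard (all names invented here, not the paper's) looks like this:

```cpp
#include <bitset>
#include <cstdio>

// Illustrative names only. A scoreboard marks which physical registers hold
// ready values; an instruction whose sources are all ready at dispatch gains
// nothing from out-of-order wakeup and can be steered to an in-order FIFO.
constexpr int kNumPhysRegs = 128;
std::bitset<kNumPhysRegs> ready;  // set when a register's value is produced

struct MicroOp { int src1, src2; };  // -1 = no operand

bool ready_at_dispatch(const MicroOp& op) {
    auto ok = [](int r) { return r < 0 || ready.test(static_cast<size_t>(r)); };
    return ok(op.src1) && ok(op.src2);
}

int main() {
    ready.set(3);
    MicroOp add{3, -1}, dependent_load{7, -1};
    std::printf("%d %d\n", ready_at_dispatch(add),
                           ready_at_dispatch(dependent_load));
    // add goes to a cheap FIFO; dependent_load still needs the OoO IQ.
}
```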

Place, publisher, year, edition, pages
IEEE, 2019
Series
Design Automation and Test in Europe Conference and Exhibition, ISSN 1530-1591
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:uu:diva-389930 (URN), 10.23919/DATE.2019.8715034 (DOI), 000470666100132 (ISI), 978-3-9819263-2-3 (ISBN)
Conference
Design, Automation & Test in Europe Conference & Exhibition (DATE), March 25-29, 2019, Florence, Italy
Funder
Knut and Alice Wallenberg Foundation
Available from: 2019-08-01. Created: 2019-08-01. Last updated: 2020-02-02. Bibliographically approved.
Alves, R., Ros, A., Black-Schaffer, D. & Kaxiras, S. (2019). Filter caching for free: The untapped potential of the store-buffer. In: Proc. 46th International Symposium on Computer Architecture. Paper presented at ISCA 2019, June 22–26, Phoenix, AZ (pp. 436-448). New York: ACM Press
2019 (English). In: Proc. 46th International Symposium on Computer Architecture, New York: ACM Press, 2019, p. 436-448. Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic hit rate or whether a store-buffer hit can be predicted to save energy by disabling the L1 and TLB probes.

In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries between the store-queue/buffer/cache, we can achieve nearly optimal reuse, without causing stalls. We are able to do this efficiently and cheaply by recognizing key properties of stores: free caching (since they must be written into the store-buffer for correctness we need no additional data movement), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling).

As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvements of 1.5%, up to 4.7%). The cost for these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.
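A toy model of the load path described above, with invented names throughout; the real design reuses the memory dependence predictor for hit prediction, which the stub below only stands in for:

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Illustrative sketch. Retired, already-written-back stores remain cached in
// the same structure as the store-queue/buffer, and a predicted store-buffer
// hit lets the core skip the L1 and TLB probes entirely.
struct StoreEntry { uint64_t addr; uint64_t data; };

std::deque<StoreEntry> sq_sb_cache;  // unified store-queue/buffer/cache

bool predict_sb_hit(uint64_t) { return true; }     // stand-in; the paper
                                                   // reuses the memory
                                                   // dependence predictor
uint64_t probe_l1_and_tlb(uint64_t) { return 0; }  // conventional probe

std::optional<uint64_t> sb_lookup(uint64_t addr) {
    for (auto it = sq_sb_cache.rbegin(); it != sq_sb_cache.rend(); ++it)
        if (it->addr == addr) return it->data;  // youngest matching store wins
    return std::nullopt;
}

uint64_t load(uint64_t addr) {
    if (predict_sb_hit(addr))
        if (auto v = sb_lookup(addr)) return *v;  // L1 and TLB never probed
    return probe_l1_and_tlb(addr);  // predicted miss, or misprediction
}

int main() { sq_sb_cache.push_back({8, 7}); return (int)load(8); }
```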

Place, publisher, year, edition, pages
New York: ACM Press, 2019
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383473 (URN), 10.1145/3307650.3322269 (DOI), 978-1-4503-6669-4 (ISBN)
Conference
ISCA 2019, June 22–26, Phoenix, AZ
Funder
Knut and Alice Wallenberg Foundation; EU, Horizon 2020, 715283; EU, Horizon 2020, 801051; Swedish Foundation for Strategic Research, SM17-0064
Available from: 2019-06-22. Created: 2019-05-16. Last updated: 2019-07-03. Bibliographically approved.
Sakalis, C., Alipour, M., Ros, A., Jimborean, A., Kaxiras, S. & Själander, M. (2019). Ghost Loads: What is the cost of invisible speculation? In: Proceedings of the 16th ACM International Conference on Computing Frontiers. Paper presented at CF 2019, April 30 – May 2, Alghero, Sardinia, Italy (pp. 153-163). New York: ACM Press
2019 (English). In: Proceedings of the 16th ACM International Conference on Computing Frontiers, New York: ACM Press, 2019, p. 153-163. Conference paper, Published paper (Refereed)
Abstract [en]

Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectural state is restored to assure functional correctness, but a multitude of microarchitectural changes (e.g., cache updates) caused by the speculatively executed instructions are commonly left in the system. These changes can be used to leak sensitive information, which has led to a frantic search for solutions that can eliminate such security flaws. The contribution of this work is an evaluation of the cost of hiding speculative side-effects in the cache hierarchy, making them visible only after the speculation has been resolved. For this, we compare (for the first time) two broad approaches: i) waiting for loads to become non-speculative before issuing them to the memory system, and ii) eliminating the side-effects of speculation, a solution consisting of invisible loads (Ghost loads) and performance optimizations (Ghost Buffer and Materialization). While previous work, InvisiSpec, has proposed a similar solution to our latter approach, it has done so with only a minimal evaluation and at a significant performance cost. The detailed evaluation of our solutions shows that: i) waiting for loads to become non-speculative is no more costly than the previously proposed InvisiSpec solution, while being much simpler, non-invasive in the memory system, and stronger security-wise; ii) hiding speculation with Ghost loads (in the context of a relaxed memory model) can be achieved at the cost of a 12% performance degradation and a 9% energy increase, which is significantly better than the previous state-of-the-art solution.
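A minimal sketch of the second approach, under our own naming (ghost_load, ghost_buffer, and materialize are illustrative, not the paper's interfaces):

```cpp
#include <cstdint>
#include <unordered_map>

// A "ghost" load fetches data without installing it into the cache; the line
// is parked in a small Ghost Buffer and only materialized into the visible
// hierarchy once the speculation has been verified.
std::unordered_map<uint64_t, uint64_t> cache;         // visible state
std::unordered_map<uint64_t, uint64_t> ghost_buffer;  // invisible, speculative

uint64_t memory_fetch(uint64_t) { return 0; }  // stand-in for a DRAM access

uint64_t ghost_load(uint64_t addr) {
    if (auto it = cache.find(addr); it != cache.end())
        return it->second;       // hit; a real design would also avoid
                                 // updating replacement state here
    uint64_t v = memory_fetch(addr);
    ghost_buffer[addr] = v;      // do NOT install into the cache yet
    return v;
}

void materialize(uint64_t addr) {  // called when speculation is verified
    if (auto it = ghost_buffer.find(addr); it != ghost_buffer.end()) {
        cache[addr] = it->second;  // now safe to make the change visible
        ghost_buffer.erase(it);
    }
}

int main() { ghost_load(64); materialize(64); }
```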

Place, publisher, year, edition, pages
New York: ACM Press, 2019
Keywords
speculation, security, side-channel attacks, caches
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383173 (URN), 10.1145/3310273.3321558 (DOI), 000474686400019 (ISI), 978-1-4503-6685-4 (ISBN)
Conference
CF 2019, April 30 – May 2, Alghero, Sardinia, Italy
Funder
Swedish Research Council, 2015-05159; Swedish National Infrastructure for Computing (SNIC)
Available from: 2019-05-10. Created: 2019-05-10. Last updated: 2020-01-30. Bibliographically approved.
Alipour, M., Carlson, T. E., Black-Schaffer, D. & Kaxiras, S. (2019). Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit. Journal of Signal Processing Systems, 91(3-4), 379-397
2019 (English). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 91, no 3-4, p. 379-397. Article in journal (Refereed), Published
Abstract [en]

Out-of-order execution is essential for high-performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction execution, in other words, in-order instruction commit. While in-order commit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in program order. In contrast, out-of-order commit can release some resources much earlier, yielding improved performance and/or lower resource requirements. Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti (2004). In this paper we revisit out-of-order commit by examining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. Through this analysis of the potential of out-of-order commit, we learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) the focus on implementing only a subset of the out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increase with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy.
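As a sketch, early release can be pictured as a predicate over each reorder-buffer entry. The field names below are invented, and the conditions are paraphrased in the spirit of Bell and Lipasti (2004), not quoted from that work:

```cpp
#include <cstdio>

// Illustrative gating of early (out-of-order) commit: the instruction must be
// done and unable to fault, and no older instruction may still squash it or
// reorder memory underneath it.
struct RobEntry {
    bool completed;                 // finished execution
    bool may_fault;                 // could still raise an exception
    bool older_branch_unresolved;   // an older branch could squash this
    bool older_store_addr_unknown;  // an older store might alias this access
};

bool can_commit_out_of_order(const RobEntry& e) {
    return e.completed && !e.may_fault &&
           !e.older_branch_unresolved && !e.older_store_addr_unknown;
}

int main() {
    RobEntry safe{true, false, false, false};
    RobEntry risky{true, false, true, false};
    std::printf("%d %d\n", can_commit_out_of_order(safe),
                           can_commit_out_of_order(risky));
}
```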

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-365899 (URN), 10.1007/s11265-018-1369-4 (DOI), 000459428200012 (ISI)
Available from: 2018-04-26. Created: 2018-11-14. Last updated: 2020-02-02. Bibliographically approved.
Alves, R., Kaxiras, S. & Black-Schaffer, D. (2019). Minimizing Replay under Way-Prediction.
2019 (English). Report (Other academic)
Abstract [en]

Way-predictors are effective at reducing dynamic cache energy by reducing the number of ways accessed, but introduce additional latency for incorrect way-predictions. While previous work has studied the impact of the increased latency for incorrect way-predictions, we show that the latency variability has a far greater effect, as it forces replay of in-flight instructions on an incorrect way-prediction. To address the problem, we propose a solution that learns the confidence of the way-prediction and dynamically disables it when it is likely to mispredict. We further improve this approach by biasing the confidence to reduce latency variability further, at the cost of making fewer way-predictions. Our results show that instruction replay in a way-predictor reduces IPC by 6.9% due to 10% of the instructions being replayed. Our confidence-based way-predictor degrades IPC by only 2.9% by replaying just 3.4% of the instructions, reducing the way-predictor's cache energy overhead (compared to a serial-access cache) from 8.5% to 1.9%.
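A minimal sketch of the confidence mechanism, with invented constants; the asymmetric penalty models the "biasing" the abstract mentions, trading prediction coverage for fewer replays:

```cpp
#include <algorithm>
#include <cstdio>

// A saturating counter tracks how often the way-predictor has been right;
// below a threshold the cache falls back to a parallel (all-ways) access,
// trading energy for a fixed latency and, crucially, no instruction replay.
struct WayPredictorGate {
    int confidence = 0;  // saturating counter
    static constexpr int kMax = 7, kThreshold = 4;
    static constexpr int kCorrectGain = 1, kWrongPenalty = 3;  // biased down

    bool use_prediction() const { return confidence >= kThreshold; }

    void update(bool prediction_was_correct) {
        confidence = prediction_was_correct
            ? std::min(confidence + kCorrectGain, kMax)
            : std::max(confidence - kWrongPenalty, 0);  // punish replays hard
    }
};

int main() {
    WayPredictorGate gate;
    for (bool correct : {true, true, true, true, false, true}) {
        std::printf("predict=%d\n", gate.use_prediction());
        gate.update(correct);
    }
}
```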

Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2019-003
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383596 (URN)
Available from: 2019-05-17. Created: 2019-05-17. Last updated: 2019-07-03. Bibliographically approved.
Jimborean, A., Ekemark, P., Waern, J., Kaxiras, S. & Ros, A. (2018). Automatic Detection of Large Extended Data-Race-Free Regions with Conflict Isolation. IEEE Transactions on Parallel and Distributed Systems, 29(3), 527-541
2018 (English). In: IEEE Transactions on Parallel and Distributed Systems, ISSN 1045-9219, E-ISSN 1558-2183, Vol. 29, no 3, p. 527-541. Article in journal (Refereed), Published
Abstract [en]

Data-race-free (DRF) parallel programming is becoming the standard, as the newly adopted memory models of mainstream programming languages such as C++ and Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free (xDRF) regions, namely regions of code that provide the same guarantees as synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges while preserving the data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. We further enlarge xDRF regions with a conflict isolation (CI) technique, delineating what we call xDRF-CI regions while preserving the same properties as xDRF regions. Our compiler (1) precisely analyzes the threads' memory-access behavior and data sharing in shared-memory, general-purpose parallel applications, (2) isolates data sharing and (3) marks the limits of xDRF-CI code regions. The contribution of this work is a simple but effective method to alleviate the drawbacks of the compiler's conservative nature, making it competitive with (and even able to surpass) an expert delineating xDRF regions manually. We evaluate the potential of our technique by employing xDRF and xDRF-CI region classification in a state-of-the-art, dual-mode cache coherence protocol. We show that xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.4 percent) and energy efficiency (12.2 percent) compared to a standard directory-based coherence protocol. Enhancing the xDRF analysis with the conflict isolation technique improves performance by 7.1 percent and energy efficiency by 15.9 percent.
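An invented toy example of the property the analysis detects: the accesses around the critical section are thread-private, so a single region can safely stretch across the synchronization boundary:

```cpp
#include <mutex>
#include <thread>

std::mutex m;
int shared_counter = 0;

// An xDRF region may span this whole function: the accesses to `local` are
// thread-private and cannot conflict, so ending a synchronization-free
// region at the lock acquire/release would be unnecessarily conservative.
void worker(int id) {
    int local = id * 2;           // thread-private work before the lock
    {
        std::lock_guard<std::mutex> g(m);
        shared_counter += local;  // the only shared access, race-free
    }
    local *= 3;                   // thread-private work after the lock
    (void)local;
}

int main() {
    std::thread a(worker, 1), b(worker, 2);
    a.join();
    b.join();
}
```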

Place, publisher, year, edition, pages
IEEE Computer Society, 2018
Keywords
Compile-time analysis, inter-procedural analysis, inter-thread analysis, data sharing, data races, cache coherence
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-348845 (URN), 10.1109/TPDS.2017.2771509 (DOI), 000425173200004 (ISI)
Funder
Swedish Research Council, 2016-05086
Available from: 2018-04-25. Created: 2018-04-25. Last updated: 2018-12-03. Bibliographically approved.
Alves, R., Kaxiras, S. & Black-Schaffer, D. (2018). Dynamically Disabling Way-prediction to Reduce Instruction Replay. In: 2018 IEEE 36th International Conference on Computer Design (ICCD). Paper presented at IEEE 36th International Conference on Computer Design (ICCD), October 7–10, 2018, Orlando, FL, USA (pp. 140-143). IEEE
2018 (English). In: 2018 IEEE 36th International Conference on Computer Design (ICCD), IEEE, 2018, p. 140-143. Conference paper, Published paper (Refereed)
Abstract [en]

Way-predictors have long been used to reduce dynamic cache energy without the performance loss of serial caches. However, they produce variable-latency hits, as incorrect predictions increase load-to-use latency. While the performance impact of these extra cycles has been well-studied, the need to replay subsequent instructions in the pipeline due to the load latency increase has been ignored. In this work we show that way-predictors pay a significant performance penalty beyond previously studied effects, due to instruction replays caused by mispredictions. To address this, we propose a solution that learns the confidence of the way-prediction and dynamically disables it when it is likely to mispredict and cause replays. This allows us to reduce cache latency (when we can trust the way-prediction) while still avoiding the need to replay instructions in the pipeline (by avoiding way-mispredictions). Standard way-predictors degrade IPC by 6.9% vs. a parallel cache due to 10% of the instructions being replayed (worst case 42.3%). While our solution decreases way-prediction accuracy by turning off the way-predictor in some cases when it would have been correct, it delivers higher performance than a standard way-predictor. Our confidence-based way-predictor degrades IPC by only 4.4% by replaying just 5.6% of the instructions (worst case 16.3%). This reduces the way-predictor's cache energy overhead, compared to a serial-access cache, from 8.5% to 3.7% on average and from 33.8% to 9.5% in the worst case.

Place, publisher, year, edition, pages
IEEE, 2018
Series
Proceedings IEEE International Conference on Computer Design, ISSN 1063-6404, E-ISSN 2576-6996
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-361215 (URN), 10.1109/ICCD.2018.00029 (DOI), 000458293200018 (ISI), 978-1-5386-8477-1 (ISBN)
Conference
IEEE 36th International Conference on Computer Design (ICCD), October 7–10, 2018, Orlando, FL, USA
Available from: 2018-09-21. Created: 2018-09-21. Last updated: 2019-05-22. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0001-8267-0232
