Uppsala University Publications
Black-Schaffer, David
Publications (10 of 54)
Alves, R., Kaxiras, S. & Black-Schaffer, D. (2020). Efficient temporal and spatial load to load forwarding. In: Proc. 26th International Symposium on High-Performance Computer Architecture. Paper presented at HPCA 2020, February 22–26, San Diego, CA. IEEE Computer Society
Efficient temporal and spatial load to load forwarding
2020 (English). In: Proc. 26th International Symposium on High-Performance Computer Architecture, IEEE Computer Society, 2020. Conference paper, Published paper (Refereed).
Place, publisher, year, edition, pages
IEEE Computer Society, 2020
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383477 (URN)
Conference
HPCA 2020, February 22–26, San Diego, CA
Note
to appear
Available from: 2019-08-21. Created: 2019-05-16. Last updated: 2019-08-21. Bibliographically approved.
Alipour, M., Kumar, R., Kaxiras, S. & Black-Schaffer, D. (2019). FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors. In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). Paper presented at Design, Automation & Test in Europe Conference & Exhibition (DATE), March 25–29, 2019, Florence, Italy (pp. 716-721). IEEE
FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
2019 (English). In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2019, p. 716-721. Conference paper, Published paper (Refereed).
Abstract [en]

The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or depth of the instruction queue is very costly due to the content-addressable logic needed to wake up and select instructions out-of-order. This work makes the observation that a large number of instructions have both operands ready at dispatch, and therefore do not benefit from out-of-order scheduling. We leverage this to place such ready-at-dispatch instructions in separate, simpler, in-order FIFO queues for scheduling. With such additional queues, we can reduce the size and width of the expensive out-of-order instruction queue, without reducing the processor's overall issue width and depth. Our design, FIFOrder, is able to steer more than 60% of instructions to the cheaper FIFO queues, providing a 50% energy savings over a traditional out-of-order instruction queue design, while delivering 8% higher performance.
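For illustration, a minimal Python sketch of the steering idea described in the abstract: instructions whose source operands are all ready at dispatch are sent to cheap in-order FIFO queues instead of the out-of-order issue queue. The class and field names (Instruction, operands_ready, the two-FIFO round-robin) are assumptions made for this sketch, not details taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Instruction:
    opcode: str
    # True for each source operand whose value is already available at dispatch.
    operands_ready: tuple = ()

@dataclass
class Scheduler:
    """Illustrative steering between an out-of-order issue queue and cheap in-order FIFOs."""
    ooo_queue: list = field(default_factory=list)                        # expensive CAM-based issue queue
    fifo_queues: list = field(default_factory=lambda: [deque(), deque()])
    next_fifo: int = 0

    def dispatch(self, inst: Instruction) -> str:
        # Instructions with all operands ready gain nothing from OoO wakeup/select,
        # so they are steered to a simple in-order FIFO instead.
        if inst.operands_ready and all(inst.operands_ready):
            self.fifo_queues[self.next_fifo].append(inst)
            self.next_fifo = (self.next_fifo + 1) % len(self.fifo_queues)
            return "fifo"
        self.ooo_queue.append(inst)
        return "ooo"

# Example: an add with both inputs ready bypasses the OoO queue; a dependent load does not.
s = Scheduler()
print(s.dispatch(Instruction("add", (True, True))))    # -> fifo
print(s.dispatch(Instruction("load", (True, False))))  # -> ooo
```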

Place, publisher, year, edition, pages
IEEE, 2019
Series
Design Automation and Test in Europe Conference and Exhibition, ISSN 1530-1591
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:uu:diva-389930 (URN); 10.23919/DATE.2019.8715034 (DOI); 000470666100132 (ISI); 978-3-9819263-2-3 (ISBN)
Conference
Design, Automation & Test in Europe Conference & Exhibition (DATE), March 25–29, 2019, Florence, Italy
Funder
Knut and Alice Wallenberg Foundation
Available from: 2019-08-01. Created: 2019-08-01. Last updated: 2019-08-01. Bibliographically approved.
Alves, R., Ros, A., Black-Schaffer, D. & Kaxiras, S. (2019). Filter caching for free: The untapped potential of the store-buffer. In: Proc. 46th International Symposium on Computer Architecture. Paper presented at ISCA 2019, June 22–26, Phoenix, AZ (pp. 436-448). New York: ACM Press
Filter caching for free: The untapped potential of the store-buffer
2019 (English). In: Proc. 46th International Symposium on Computer Architecture, New York: ACM Press, 2019, p. 436-448. Conference paper, Published paper (Refereed).
Abstract [en]

Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic hit rate or whether a store-buffer hit can be predicted to save energy by disabling the L1 and TLB probes.

In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries between the store-queue/buffer/cache, we can achieve nearly optimal reuse, without causing stalls. We are able to do this efficiently and cheaply by recognizing key properties of stores: free caching (since they must be written into the store-buffer for correctness we need no additional data movement), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling).

As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvements of 1.5%, up to 4.7%). The cost for these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.
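As a rough illustration of the probe-filtering idea in the abstract, the following Python sketch skips the L1/TLB lookup when a hit in the unified store-queue/buffer/cache is predicted. The dictionary-based storage and the set-based hit predictor are simplifications assumed here, not the paper's actual structures (which reuse the memory dependence predictor).

```python
class StoreBufferCache:
    """Toy unified store-queue/buffer/cache keyed by address, holding the latest store data."""

    def __init__(self):
        self.entries = {}            # addr -> data (in-flight and already written-back stores)
        self.predicted_hits = set()  # addresses the (illustrative) predictor expects to hit

    def store(self, addr, data):
        self.entries[addr] = data
        self.predicted_hits.add(addr)

    def load(self, addr, l1_lookup):
        # If a store-buffer hit is predicted and confirmed, skip the L1/TLB probe entirely.
        if addr in self.predicted_hits and addr in self.entries:
            return self.entries[addr], "sb-hit, L1/TLB probe skipped"
        # Otherwise probe the store-buffer in parallel with the L1, as a conventional core would.
        if addr in self.entries:
            return self.entries[addr], "sb-hit"
        return l1_lookup(addr), "L1"

sbc = StoreBufferCache()
sbc.store(0x40, 7)
print(sbc.load(0x40, lambda addr: None))  # served from the store-buffer, no L1/TLB probe
print(sbc.load(0x80, lambda addr: 0))     # misses the store-buffer, served by the L1
```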

Place, publisher, year, edition, pages
New York: ACM Press, 2019
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383473 (URN); 10.1145/3307650.3322269 (DOI); 978-1-4503-6669-4 (ISBN)
Conference
ISCA 2019, June 22–26, Phoenix, AZ
Funder
Knut and Alice Wallenberg Foundation; EU, Horizon 2020, 715283; EU, Horizon 2020, 801051; Swedish Foundation for Strategic Research, SM17-0064
Available from: 2019-06-22. Created: 2019-05-16. Last updated: 2019-07-03. Bibliographically approved.
Kumar, R., Alipour, M. & Black-Schaffer, D. (2019). Freeway: Maximizing MLP for Slice-Out-of-Order Execution. In: 2019 25th IEEE International Symposium on High Performance Computer Architecture (HPCA). Paper presented at 25th IEEE International Symposium on High Performance Computer Architecture (HPCA), February 16–20, 2019, Washington, DC (pp. 558-569). IEEE
Freeway: Maximizing MLP for Slice-Out-of-Order Execution
2019 (English). In: 2019 25th IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2019, p. 558-569. Conference paper, Published paper (Refereed).
Abstract [en]

Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex hardware and the resulting energy overheads. As energy efficiency becomes the prime design constraint, we investigate low complexity/energy mechanisms to exploit MLP. This work revisits slice-out-of-order (sOoO) cores as an energy efficient alternative to OoO cores for MLP exploitation. These cores construct slices of MLP-generating instructions and execute them out-of-order with respect to the rest of the instructions. However, the slices and the remaining instructions, by themselves, execute in-order. Though their energy overhead is low compared to full OoO cores, sOoO cores fall considerably behind in terms of MLP extraction. We observe that their dependence-oblivious in-order slice execution causes dependent slices to frequently block MLP generation. To boost MLP generation in sOoO cores, we introduce Freeway, an sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them out of the way of MLP extraction. The proposed core incurs minimal area and power overheads, yet approaches the MLP benefits of fully OoO cores. Our evaluation shows that Freeway outperforms the state-of-the-art sOoO core by 12% and is within 7% of the MLP limits of full OoO execution.
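A minimal sketch, assuming a simplified model of the dependence-aware steering described above: slices that depend on an older, still-executing slice are parked in a separate queue so they do not block independent MLP-generating slices. The queue names and the dispatch interface are illustrative, not taken from the paper.

```python
from collections import deque

class SliceScheduler:
    """Illustrative dependence-aware steering of MLP-generating slices."""

    def __init__(self):
        self.main_queue = deque()   # independent slices: generate MLP as soon as possible
        self.yield_queue = deque()  # slices that depend on an older, unfinished slice
        self.unfinished = set()     # ids of slices still executing

    def dispatch(self, slice_id, depends_on=None):
        self.unfinished.add(slice_id)
        # A slice waiting on an unfinished producer would block the in-order slice queue,
        # so it is parked in a separate queue until its producer completes.
        if depends_on is not None and depends_on in self.unfinished:
            self.yield_queue.append(slice_id)
            return "parked"
        self.main_queue.append(slice_id)
        return "issued"

    def complete(self, slice_id):
        self.unfinished.discard(slice_id)

s = SliceScheduler()
print(s.dispatch(1))                # -> issued
print(s.dispatch(2, depends_on=1))  # -> parked (would otherwise block the next slice)
print(s.dispatch(3))                # -> issued: slice 3 can still generate MLP
```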

Place, publisher, year, edition, pages
IEEE, 2019
Series
International Symposium on High-Performance Computer Architecture Proceedings, ISSN 1530-0897
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-387993 (URN); 10.1109/HPCA.2019.00009 (DOI); 000469766300044 (ISI); 978-1-7281-1444-6 (ISBN)
Conference
25th IEEE International Symposium on High Performance Computer Architecture (HPCA), February 16–20, 2019, Washington, DC
Funder
Knut and Alice Wallenberg Foundation; EU, European Research Council, 715283
Available from: 2019-06-27. Created: 2019-06-27. Last updated: 2019-06-27. Bibliographically approved.
Alipour, M., Carlson, T. E., Black-Schaffer, D. & Kaxiras, S. (2019). Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit. Journal of Signal Processing Systems, 91(3-4), 379-397
Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit
2019 (English). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 91, no. 3-4, p. 379-397. Article in journal (Refereed). Published.
Abstract [en]

Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction execution; in other words, in-order instruction commit. While in-order commit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in program order. In contrast, out-of-order commit can release some resources much earlier, yielding improved performance and/or lower resource requirements. Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti (2004). In this paper we revisit out-of-order commit by examining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. Through this analysis of the potential of out-of-order commit, we learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) the focus on implementing only a subset of the out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increase with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy.
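A simplified illustration of the idea of lifting commit conditions selectively: an instruction may commit out of order only if every safety condition that has not been explicitly lifted (speculated past) still holds. The condition names below are generic placeholders chosen for this sketch, not the exact conditions of Bell and Lipasti (2004).

```python
from dataclasses import dataclass

@dataclass
class CommitConditions:
    # Generic placeholder conditions; the paper studies lifting such conditions
    # individually and in combination (see Bell and Lipasti, 2004, for the real set).
    completed: bool
    no_unresolved_older_branch: bool
    no_possible_older_exception: bool
    memory_order_safe: bool

def can_commit_out_of_order(c: CommitConditions, lifted: frozenset = frozenset()) -> bool:
    """Return True if every condition not explicitly 'lifted' (speculated past) holds."""
    checks = {
        "completed": c.completed,
        "branch": c.no_unresolved_older_branch,
        "exception": c.no_possible_older_exception,
        "memory": c.memory_order_safe,
    }
    return all(ok for name, ok in checks.items() if name not in lifted)

c = CommitConditions(True, False, True, True)
print(can_commit_out_of_order(c))                               # False: an older branch is unresolved
print(can_commit_out_of_order(c, lifted=frozenset({"branch"}))) # True: speculatively lift that condition
```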

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-365899 (URN); 10.1007/s11265-018-1369-4 (DOI); 000459428200012 (ISI)
Available from: 2018-04-26. Created: 2018-11-14. Last updated: 2019-03-21. Bibliographically approved.
Alves, R., Kaxiras, S. & Black-Schaffer, D. (2019). Minimizing Replay under Way-Prediction.
Minimizing Replay under Way-Prediction
2019 (English). Report (Other academic).
Abstract [en]

Way-predictors are effective at reducing dynamic cache energy by reducing the number of ways accessed, but introduce additional latency for incorrect way-predictions. While previous work has studied the impact of the increased latency for incorrect way-predictions, we show that the latency variability has a far greater effect as it forces replay of in-flight instructions on an incorrect way-prediction. To address the problem, we propose a solution that learns the confidence of the way-prediction and dynamically disables it when it is likely to mispredict. We further improve this approach by biasing the confidence to reduce latency variability further at the cost of reduced way-predictions. Our results show that instruction replay in a way-predictor reduces IPC by 6.9% due to 10% of the instructions being replayed. Our confidence-based way-predictor degrades IPC by only 2.9% by replaying just 3.4% of the instructions, reducing way-predictor cache energy overhead (compared to serial access cache) from 8.5% to 1.9%.

Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203; 2019-003
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383596 (URN)
Available from: 2019-05-17. Created: 2019-05-17. Last updated: 2019-07-03. Bibliographically approved.
Ceballos, G., Grass, T., Hugo, A. & Black-Schaffer, D. (2018). Analyzing performance variation of task schedulers with TaskInsight. Parallel Computing, 75, 11-27
Analyzing performance variation of task schedulers with TaskInsight
2018 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 75, p. 11-27. Article in journal (Refereed). Published.
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-340202 (URN); 10.1016/j.parco.2018.02.003 (DOI); 000433655700002 (ISI)
Projects
UPMARC; Resource Sharing Modeling
Funder
Swedish Research Council, FFL12-0051; Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2018-02-22. Created: 2018-01-26. Last updated: 2018-11-16. Bibliographically approved.
Ceballos, G., Sembrant, A., Carlson, T. E. & Black-Schaffer, D. (2018). Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs. In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018. Paper presented at ISPASS 2018, April 2–4, Belfast, UK (pp. 1-11). IEEE Computer Society
Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs
2018 (English). In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018, IEEE Computer Society, 2018, p. 1-11. Conference paper, Published paper (Refereed).
Place, publisher, year, edition, pages
IEEE Computer Society, 2018
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-361214 (URN); 10.1109/ISPASS.2018.00009 (DOI); 978-1-5386-5010-3 (ISBN)
Conference
ISPASS 2018, April 2–4, Belfast, UK
Projects
UPMARC
Available from: 2018-09-21. Created: 2018-09-21. Last updated: 2018-11-16. Bibliographically approved.
Alves, R., Kaxiras, S. & Black-Schaffer, D. (2018). Dynamically Disabling Way-prediction to Reduce Instruction Replay. In: 2018 IEEE 36th International Conference on Computer Design (ICCD). Paper presented at IEEE 36th International Conference on Computer Design (ICCD), October 7–10, 2018, Orlando, FL, USA (pp. 140-143). IEEE
Dynamically Disabling Way-prediction to Reduce Instruction Replay
2018 (English). In: 2018 IEEE 36th International Conference on Computer Design (ICCD), IEEE, 2018, p. 140-143. Conference paper, Published paper (Refereed).
Abstract [en]

Way-predictors have long been used to reduce dynamic cache energy without the performance loss of serial caches. However, they produce variable-latency hits, as incorrect predictions increase load-to-use latency. While the performance impact of these extra cycles has been well-studied, the need to replay subsequent instructions in the pipeline due to the load latency increase has been ignored. In this work we show that way-predictors pay a significant performance penalty beyond previously studied effects due to instruction replays caused by mispredictions. To address this, we propose a solution that learns the confidence of the way prediction and dynamically disables it when it is likely to mispredict and cause replays. This allows us to reduce cache latency (when we can trust the way-prediction) while still avoiding the need to replay instructions in the pipeline (by avoiding way-mispredictions). Standard way-predictors degrade IPC by 6.9% vs. a parallel cache due to 10% of the instructions being replayed (worst case 42.3%). While our solution decreases way-prediction accuracy by turning off the way-predictor in some cases when it would have been correct, it delivers higher performance than a standard way-predictor. Our confidence-based way-predictor degrades IPC by only 4.4% by replaying just 5.6% of the instructions (worst case 16.3%). This reduces the way-predictor cache energy overhead, compared to a serial-access cache, from 8.5% to 3.7% on average and from 33.8% to 9.5% in the worst case.
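A minimal sketch of a confidence-gated way-predictor of the kind described in this abstract (and in the related technical report above): a saturating confidence counter decides whether to trust the way-prediction or to fall back to probing all ways in parallel, avoiding replays. The counter width, threshold, and misprediction penalty are illustrative assumptions, not the paper's tuned values.

```python
class ConfidenceGatedWayPredictor:
    """Illustrative confidence gate: fall back to a parallel probe of all ways
    when the way-prediction is likely to miss and trigger instruction replay."""

    def __init__(self, max_conf=7, threshold=4):
        self.conf = max_conf       # saturating confidence counter (width is illustrative)
        self.max_conf = max_conf
        self.threshold = threshold

    def use_prediction(self) -> bool:
        # Only trust the way-prediction when confidence is high enough;
        # otherwise probe all ways in parallel and avoid a possible replay.
        return self.conf >= self.threshold

    def update(self, prediction_correct: bool):
        if prediction_correct:
            self.conf = min(self.max_conf, self.conf + 1)
        else:
            self.conf = max(0, self.conf - 2)  # biasing: penalize mispredictions harder

wp = ConfidenceGatedWayPredictor()
for correct in [False, False, True, False]:
    wp.update(correct)
print(wp.use_prediction())  # False: recent mispredictions, so way-prediction is disabled
```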

Place, publisher, year, edition, pages
IEEE, 2018
Series
Proceedings IEEE International Conference on Computer Design, ISSN 1063-6404, E-ISSN 2576-6996
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-361215 (URN); 10.1109/ICCD.2018.00029 (DOI); 000458293200018 (ISI); 978-1-5386-8477-1 (ISBN)
Conference
IEEE 36th International Conference on Computer Design (ICCD), October 7–10, 2018, Orlando, FL, USA
Available from: 2018-09-21. Created: 2018-09-21. Last updated: 2019-05-22. Bibliographically approved.
Ceballos, G., Hagersten, E. & Black-Schaffer, D. (2018). Tail-PASS: Resource-based Cache Management for Tiled Graphics Rendering Hardware. In: Proc. 16th International Conference on Parallel and Distributed Processing with Applications. Paper presented at ISPA 2018, December 11–13, Melbourne, Australia (pp. 55-63). IEEE
Tail-PASS: Resource-based Cache Management for Tiled Graphics Rendering Hardware
2018 (English). In: Proc. 16th International Conference on Parallel and Distributed Processing with Applications, IEEE, 2018, p. 55-63. Conference paper, Published paper (Refereed).
Abstract [en]

Modern graphics rendering is a very expensive process and can account for 60% of the battery consumption in current games. Much of the cost comes from the high memory bandwidth of rendering complex graphics. To render a frame, multiple smaller rendering passes called scenes are executed, with each one tiled for parallel execution. The data for each scene comes from hundreds of software resources (textures). We observe that each frame can consume up to thousands of megabytes of data, but that over 75% of the graphics memory accesses are to the top-10 resources, and that bypassing the remaining infrequently accessed (tail) resources reduces cache pollution. Bypassing the tail can save up to 35% of the main memory traffic over resource-oblivious replacement policies and cache management techniques. In this paper, we propose Tail-PASS, a cache management technique that detects the most accessed resources at runtime, learns if it is worth bypassing the least accessed ones, and then dynamically enables/disables bypassing to reduce cache pollution on a per-scene basis. Overall, we see an average reduction in bandwidth-per-frame of 22% (up to 46%) by bypassing all but the top-10 resources and an 11% (up to 44%) reduction if only the top-2 resources are cached.
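As a rough sketch of the per-scene bypassing idea, the Python code below counts resource accesses during a scene, learns the hottest resources at scene end, and bypasses the cache for accesses to the remaining (tail) resources in the next scene. The top-N value, class name, and interface are assumptions for illustration only, not the mechanism as implemented in the paper.

```python
from collections import Counter

class TailBypassManager:
    """Toy per-scene resource tracking: cache only the hottest resources,
    bypass the long tail to reduce cache pollution (top-N is illustrative)."""

    def __init__(self, top_n=10):
        self.top_n = top_n
        self.access_counts = Counter()
        self.hot_resources = set()

    def end_of_scene(self):
        # Learn the most-accessed resources from the finished scene and
        # use them to steer caching decisions for the next scene.
        self.hot_resources = {r for r, _ in self.access_counts.most_common(self.top_n)}
        self.access_counts.clear()

    def should_bypass(self, resource_id) -> bool:
        self.access_counts[resource_id] += 1
        # Accesses to tail (infrequently used) resources bypass the cache.
        return bool(self.hot_resources) and resource_id not in self.hot_resources

mgr = TailBypassManager(top_n=2)
for r in ["tex0", "tex0", "tex1", "tex1", "tex2"]:
    mgr.should_bypass(r)
mgr.end_of_scene()
print(mgr.should_bypass("tex2"))  # True: not among the top-2 resources of the last scene
print(mgr.should_bypass("tex0"))  # False: hot resource, keep it in the cache
```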

Place, publisher, year, edition, pages
IEEE, 2018
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:uu:diva-363920 (URN); 10.1109/BDCloud.2018.00022 (DOI); 000467843200008 (ISI); 978-1-7281-1141-4 (ISBN)
Conference
ISPA 2018, December 11–13, Melbourne, Australia
Funder
EU, European Research Council, 715283
Available from: 2018-10-21. Created: 2018-10-21. Last updated: 2019-06-17. Bibliographically approved.