uu.seUppsala University Publications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Filter caching for free: The untapped potential of the store-buffer
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. (UART)ORCID iD: 0000-0002-6259-7821
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
2019 (English)In: Proc. 46th International Symposium on Computer Architecture, New York: ACM Press, 2019, p. 436-448Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic hit rate or whether a store-buffer hit can be predicted to save energy by disabling the L1 and TLB probes.

In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries between the store-queue/buffer/cache, we can achieve nearly optimal reuse, without causing stalls. We are able to do this efficiently and cheaply by recognizing key properties of stores: free caching (since they must be written into the store-buffer for correctness we need no additional data movement), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling).

As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvements of 1.5%, up to 4.7%).The cost for these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.

Place, publisher, year, edition, pages
New York: ACM Press, 2019. p. 436-448
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:uu:diva-383473DOI: 10.1145/3307650.3322269ISBN: 978-1-4503-6669-4 (print)OAI: oai:DiVA.org:uu-383473DiVA, id: diva2:1316126
Conference
ISCA 2019, June 22–26, Phoenix, AZ
Funder
Knut and Alice Wallenberg FoundationEU, Horizon 2020, 715283EU, Horizon 2020, 801051Swedish Foundation for Strategic Research , SM17-0064Available from: 2019-06-22 Created: 2019-05-16 Last updated: 2019-07-03Bibliographically approved
In thesis
1. Leveraging Existing Microarchitectural Structures to Improve First-Level Caching Efficiency
Open this publication in new window or tab >>Leveraging Existing Microarchitectural Structures to Improve First-Level Caching Efficiency
2019 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Low-latency data access is essential for performance. To achieve this, processors use fast first-level caches combined with out-of-order execution, to decrease and hide memory access latency respectively. While these approaches are effective for performance, they cost significant energy, leading to the development of many techniques that require designers to trade-off performance and efficiency.

Way-prediction and filter caches are two of the most common strategies for improving first-level cache energy efficiency while still minimizing latency. They both have compromises as way-prediction trades off some latency for better energy efficiency, while filter caches trade off some energy efficiency for lower latency. However, these strategies are not mutually exclusive. By borrowing elements from both, and taking into account SRAM memory layout limitations, we proposed a novel MRU-L0 cache that mitigates many of their shortcomings while preserving their benefits. Moreover, while first-level caches are tightly integrated into the cpu pipeline, existing work on these techniques largely ignores the impact they have on instruction scheduling. We show that the variable hit latency introduced by way-misspredictions causes instruction replays of load dependent instruction chains, which hurts performance and efficiency. We study this effect and propose a variable latency cache-hit instruction scheduler, that identifies potential misschedulings, reduces instruction replays, reduces negative performance impact, and further improves cache energy efficiency.

Modern pipelines also employ sophisticated execution strategies to hide memory latency and improve performance. While their primary use is for performance and correctness, they require intermediate storage that can be used as a cache as well. In this work we demonstrate how the store-buffer, paired with the memory dependency predictor, can be used to efficiently cache dirty data; and how the physical register file, paired with a value predictor, can be used to efficiently cache clean data. These strategies not only improve both performance and energy, but do so with no additional storage and minimal additional complexity, since they recycle existing cpu structures to detect reuse, memory ordering violations, and misspeculations.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2019. p. 42
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1821
Keywords
Energy Efficient Caching, Memory Architecture, Single Thread Performance, First-Level Caching, Out-of-Order Pipelines, Instruction Scheduling, Filter-Cache, Way-Prediction, Value-Prediction, Register-Sharing.
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383811 (URN)978-91-513-0681-0 (ISBN)
Public defence
2019-08-26, Sal VIII, Universitetshuset, Biskopsgatan 3, Uppsala, 09:00 (English)
Opponent
Supervisors
Available from: 2019-06-11 Created: 2019-05-22 Last updated: 2019-07-01

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Authority records BETA

Alves, RicardoRos, AlbertoBlack-Schaffer, DavidKaxiras, Stefanos

Search in DiVA

By author/editor
Alves, RicardoRos, AlbertoBlack-Schaffer, DavidKaxiras, Stefanos
By organisation
Computer Architecture and Computer Communication
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar

doi
isbn
urn-nbn

Altmetric score

doi
isbn
urn-nbn
Total: 182 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf