uu.se: Publications from Uppsala University
Filter caching for free: The untapped potential of the store-buffer
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication (UART). ORCID iD: 0000-0002-6259-7821
Univ Murcia, Murcia, Spain.
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
2019 (English). In: Proc. 46th International Symposium on Computer Architecture, New York: ACM Press, 2019, p. 436-448. Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors contain store-buffers to allow stores to retire under a miss, thus hiding store-miss latency. The store-buffer needs to be large (for performance) and searched on every load (for correctness), thereby making it a costly structure in both area and energy. Yet on every load, the store-buffer is probed in parallel with the L1 and TLB, with no concern for the store-buffer's intrinsic hit rate or whether a store-buffer hit can be predicted to save energy by disabling the L1 and TLB probes.

In this work we cache data that have been written back to memory in a unified store-queue/buffer/cache, and predict hits to avoid L1/TLB probes and save energy. By dynamically adjusting the allocation of entries between the store-queue/buffer/cache, we can achieve nearly optimal reuse, without causing stalls. We are able to do this efficiently and cheaply by recognizing key properties of stores: free caching (since they must be written into the store-buffer for correctness we need no additional data movement), cheap coherence (since we only need to track state changes of the local, dirty data in the store-buffer), and free and accurate hit prediction (since the memory dependence predictor already does this for scheduling).

As a result, we are able to increase the store-buffer hit rate and reduce store-buffer/TLB/L1 dynamic energy by 11.8% (up to 26.4%) on SPEC2006 without hurting performance (average IPC improvements of 1.5%, up to 4.7%). The cost for these improvements is a 0.2% increase in L1 cache capacity (1 bit per line) and one additional tail pointer in the store-buffer.
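
As a rough illustration of the mechanism the abstract describes, the C++ sketch below models a unified store-queue/buffer/cache in which written-back stores remain as cached entries until their space is needed, and a load that the memory dependence predictor expects to be forwarded probes only this structure, leaving the L1 and TLB idle. All structure names, sizes, and interfaces are assumptions made for the sketch, not the paper's actual design.

```cpp
// Minimal sketch, with hypothetical names and sizes, of a unified
// store-queue/buffer/cache: stores that have drained to memory are kept as
// "cached" entries and reclaimed lazily, and a load that the memory
// dependence predictor (MDP) expects to hit probes only this structure,
// skipping the L1 and TLB. This is an illustration, not the paper's design.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

struct SBEntry {
    uint64_t addr;          // store address (line/word granularity elided)
    uint64_t data;          // store data
    bool     written_back;  // true once the store has drained to memory
};

class UnifiedStoreBuffer {
    std::deque<SBEntry> entries_;                 // oldest entry at the front
    static constexpr std::size_t kCapacity = 56;  // hypothetical entry count

public:
    // Allocate on store retire. Because stores drain in order, written-back
    // ("cached") entries form the oldest prefix, so reclaiming at the head
    // frees cached data first and caching never adds stalls.
    bool insert(uint64_t addr, uint64_t data) {
        if (entries_.size() == kCapacity) {
            if (!entries_.front().written_back) return false;  // genuine SB-full stall
            entries_.pop_front();                               // reclaim a cached entry
        }
        entries_.push_back({addr, data, false});
        return true;
    }

    // Mark the oldest still-pending store as drained, but keep it as a cached copy.
    void write_back_oldest_pending() {
        for (auto& e : entries_)
            if (!e.written_back) { e.written_back = true; return; }
    }

    // Youngest-first search, as ordinary store-to-load forwarding would do.
    std::optional<uint64_t> probe(uint64_t addr) const {
        for (auto it = entries_.rbegin(); it != entries_.rend(); ++it)
            if (it->addr == addr) return it->data;
        return std::nullopt;
    }
};

// Load path: when the MDP already predicts forwarding from a store, probe only
// the store-buffer and leave the L1/TLB idle; a wrong prediction simply falls
// back to the normal access.
uint64_t load(UnifiedStoreBuffer& sb, uint64_t addr, bool mdp_predicts_sb_hit,
              uint64_t (*l1_tlb_access)(uint64_t)) {
    if (mdp_predicts_sb_hit) {
        if (auto hit = sb.probe(addr)) return *hit;  // energy saved: no L1/TLB probe
    }
    return l1_tlb_access(addr);  // baseline parallel SB/L1/TLB path (not modeled)
}
```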

Place, publisher, year, edition, pages
New York: ACM Press, 2019. p. 436-448
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:uu:diva-383473
DOI: 10.1145/3307650.3322269
ISI: 000521059600034
ISBN: 978-1-4503-6669-4 (print)
OAI: oai:DiVA.org:uu-383473
DiVA, id: diva2:1316126
Conference
ISCA 2019, June 22–26, Phoenix, AZ
Funder
Knut and Alice Wallenberg Foundation
EU, Horizon 2020, 715283
EU, Horizon 2020, 801051
Swedish Foundation for Strategic Research, SM17-0064
Available from: 2019-06-22 Created: 2019-05-16 Last updated: 2020-04-27 Bibliographically approved
In thesis
1. Leveraging Existing Microarchitectural Structures to Improve First-Level Caching Efficiency
2019 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Low-latency data access is essential for performance. To achieve this, processors use fast first-level caches combined with out-of-order execution, to decrease and hide memory access latency, respectively. While these approaches are effective for performance, they cost significant energy, leading to the development of many techniques that require designers to trade off performance against efficiency.

Way-prediction and filter caches are two of the most common strategies for improving first-level cache energy efficiency while still minimizing latency. Both involve compromises: way-prediction trades some latency for better energy efficiency, while filter caches trade some energy efficiency for lower latency. However, these strategies are not mutually exclusive. By borrowing elements from both, and taking into account SRAM memory layout limitations, we propose a novel MRU-L0 cache that mitigates many of their shortcomings while preserving their benefits. Moreover, while first-level caches are tightly integrated into the CPU pipeline, existing work on these techniques largely ignores the impact they have on instruction scheduling. We show that the variable hit latency introduced by way-mispredictions causes instruction replays of load-dependent instruction chains, which hurts performance and efficiency. We study this effect and propose a variable-latency cache-hit instruction scheduler that identifies potential mis-schedulings, reduces instruction replays and their negative performance impact, and further improves cache energy efficiency.
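
To make the scheduling interaction concrete, here is a small, hedged C++ sketch of the trade-off described above: dependents of a load are normally woken up assuming the fast, predicted-way latency, so a way-misprediction forces them to replay, whereas a confidence-aware variable-latency scheduler can fall back to the slow wakeup. The constants, names, and threshold are illustrative assumptions, not the thesis' implementation.

```cpp
// Hedged sketch of the way-prediction / scheduling interaction. All values
// and names are assumptions for illustration only.
#include <cstdint>

constexpr int kFastHitCycles = 2;  // predicted way correct: single data-array probe
constexpr int kSlowHitCycles = 4;  // wrong way predicted: remaining ways re-probed

struct LoadOutcome {
    int  latency;        // cycles until dependents can consume the loaded value
    bool replay_needed;  // dependents were woken too early and must re-issue
};

// Baseline scheduler: always wake dependents assuming the fast case.
LoadOutcome baseline_schedule(bool way_prediction_correct) {
    return way_prediction_correct
        ? LoadOutcome{kFastHitCycles, false}
        : LoadOutcome{kSlowHitCycles, true};   // whole dependent chain replays
}

// Variable-latency scheduling: with a low-confidence prediction, wake the
// dependents for the slow case instead, trading a couple of cycles for no
// replays (and none of the wasted issue slots and energy they cost).
LoadOutcome confidence_aware_schedule(bool way_prediction_correct,
                                      uint8_t confidence, uint8_t threshold = 3) {
    if (confidence >= threshold)
        return baseline_schedule(way_prediction_correct);
    return LoadOutcome{kSlowHitCycles, false};  // conservative wakeup, never replays
}
```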

Modern pipelines also employ sophisticated execution strategies to hide memory latency and improve performance. While their primary use is for performance and correctness, they require intermediate storage that can be used as a cache as well. In this work we demonstrate how the store-buffer, paired with the memory dependence predictor, can be used to efficiently cache dirty data; and how the physical register file, paired with a value predictor, can be used to efficiently cache clean data. These strategies not only improve both performance and energy efficiency, but do so with no additional storage and minimal additional complexity, since they recycle existing CPU structures to detect reuse, memory-ordering violations, and misspeculations.
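
The following C++ sketch illustrates the overall idea of recycling existing structures as first-level caches: dirty data is served from the store-buffer, clean data from a physical register that a value/reuse predictor points to, and only the remaining loads pay for an L1/TLB access. The stub functions and names are assumptions standing in for the real structures, not the thesis' actual microarchitecture; speculative register-file reuse would be verified by the existing misspeculation machinery.

```cpp
// Illustrative sketch (assumed names, stubbed structures) of serving loads
// from recycled CPU structures before falling back to the L1.
#include <cstdint>
#include <optional>

// Stand-ins for the real structures; they always miss / return dummy data here.
std::optional<uint64_t> store_buffer_probe(uint64_t /*addr*/)  { return std::nullopt; }
std::optional<uint64_t> register_file_reuse(uint64_t /*addr*/) { return std::nullopt; }
uint64_t                l1_access(uint64_t /*addr*/)           { return 0; }

struct LoadResult {
    uint64_t value;
    bool     speculative;  // register-file reuse must be verified by the existing
                           // memory-ordering / misspeculation machinery
};

LoadResult issue_load(uint64_t addr) {
    if (auto dirty = store_buffer_probe(addr))
        return {*dirty, false};       // store-to-load forwarding: non-speculative
    if (auto clean = register_file_reuse(addr))
        return {*clean, true};        // predicted value reuse: verify later
    return {l1_access(addr), false};  // fall back to the normal L1/TLB access
}
```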

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2019. p. 42
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1821
Keywords
Energy Efficient Caching, Memory Architecture, Single Thread Performance, First-Level Caching, Out-of-Order Pipelines, Instruction Scheduling, Filter-Cache, Way-Prediction, Value-Prediction, Register-Sharing.
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-383811 (URN)
978-91-513-0681-0 (ISBN)
Public defence
2019-08-26, Sal VIII, Universitetshuset, Biskopsgatan 3, Uppsala, 09:00 (English)
Available from: 2019-06-11 Created: 2019-05-22 Last updated: 2019-08-23

Open Access in DiVA

fulltext (766 kB), 777 downloads
File information
File name: FULLTEXT02.pdf
File size: 766 kB
Checksum (SHA-512): a7121ae828c7fc3918e8d66358e235b90f97b7f950da9c0c90634b13a5da0352cb6cf2d1fd10bec4d4f5a21502da1acc4cbd3df21725df7107fdf246aa4ced72
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text

Authority records

Alves, Ricardo; Ros, Alberto; Black-Schaffer, David; Kaxiras, Stefanos

