Low Overhead Instruction-Cache Modeling Using Instruction Reuse Profiles
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
2012 (English). In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'12), IEEE Computer Society, 2012, pp. 260-269. Conference paper, Published paper (Refereed)
Abstract [en]

Performance loss caused by L1 instruction cache misses varies between different architectures and cache sizes. For processors employing power-efficient in-order execution with small caches, performance can be significantly affected by instruction cache misses. The growing use of low-power multi-threaded CPUs (with shared L1 caches) in general-purpose computing platforms requires new, efficient techniques for analyzing application instruction cache usage. Such insight can be achieved using traditional simulation technologies modeling several cache sizes, but the overhead of simulators may be prohibitive for practical optimization usage. In this paper, we present a statistical method to quickly model application instruction cache performance. Most importantly, we propose a very low-overhead sampling mechanism to collect runtime data from the application's instruction stream. This data is fed to the statistical model, which accurately estimates the instruction cache miss ratio for the sampled execution. Our sampling method is about 10x faster than previously suggested sampling approaches, with average runtime overhead as low as 25% over native execution. The architecturally independent data collected is used to accurately model the miss ratio for several cache sizes simultaneously, with an average absolute error of 0.2%. Finally, we show how our tool can be used to identify program phases with a large instruction cache footprint. Such phases can then be targeted to optimize for reduced code footprint.
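
The abstract does not spell out the statistical model, so the following is only an illustrative sketch of the general idea that one architecture-independent reuse profile can yield miss ratios for several cache sizes at once. It swaps in a simpler, classic technique: exact LRU stack distances computed over a full trace for a fully associative cache, rather than the sampled reuse data and statistical model described in the paper. The function names `stack_distances` and `miss_ratios` and the toy trace are hypothetical.

```python
from collections import OrderedDict


def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch).

    The stack distance is the number of distinct cache lines touched
    since the previous access to the same line.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack)
            # Lines positioned after `line` were touched more recently.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)  # cold (first-touch) access
            stack[line] = True
    return distances


def miss_ratios(trace, cache_sizes):
    """Estimate miss ratios for several cache sizes from one profile.

    Assumes a fully associative LRU cache whose size is given in cache
    lines: an access misses if it is cold or if its stack distance is at
    least the cache size.
    """
    dists = stack_distances(trace)
    total = len(dists)
    if total == 0:
        return {size: 0.0 for size in cache_sizes}
    return {
        size: sum(1 for d in dists if d is None or d >= size) / total
        for size in cache_sizes
    }


if __name__ == "__main__":
    # Toy instruction stream cycling over three cache lines.
    toy_trace = list("ABCABCABC")
    print(miss_ratios(toy_trace, [2, 3]))
```

For this toy trace, a 2-line cache misses on every access, while a 3-line cache misses only on the three cold accesses. The profile is computed once and then evaluated for every cache size, which is the property the abstract relies on; a real tool would additionally sample the instruction stream to keep overhead low.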

Place, publisher, year, edition, pages
IEEE Computer Society, 2012, pp. 260-269.
Series
Computer Architecture and High Performance Computing, ISSN 1550-6533
National Category
Computer Systems; Computer Science; Computer Engineering
Research subject
Computer Science; Computer Systems
Identifiers
URN: urn:nbn:se:uu:diva-180148
DOI: 10.1109/SBAC-PAD.2012.25
ISBN: 978-1-4673-4790-7 (print)
OAI: oai:DiVA.org:uu-180148
DiVA: diva2:548389
Conference
24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), October 2012, New York, NY, USA
Projects
CoDeR-MP; UPMARC
Available from: 2012-08-30 Created: 2012-08-30 Last updated: 2016-03-09
In thesis
1. Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Modern processors apply sophisticated techniques, such as deep cache hierarchies and hardware prefetching, to increase performance. Such complex hardware structures have helped improve performance in general; however, their full potential is not realized, as software often utilizes the memory hierarchy inefficiently. Performance can be improved further by ensuring careful interaction between software and hardware. Performance can typically improve by increasing the cache utilization and by conserving the DRAM bandwidth, i.e., retaining more useful data in the caches and lowering data requests to the DRAM. One way to achieve this is to conserve space across the cache hierarchy and increase the opportunity for temporal reuse of cached data. Similarly, conserving the DRAM bandwidth is essential for performance in highly utilized multicores, as it can easily become a critical resource. When multiple cores are active and the per-core share of DRAM bandwidth shrinks, its efficient utilization plays an important role in improving the overall performance. Together, the cache hierarchy and the DRAM bandwidth play a significant role in defining the overall performance in multicores.

Based on deep insight from memory behavior modeling of software, this thesis explores five software-only methods to analyze and increase performance in multicores. The underlying philosophy that drives these techniques is to increase cache utilization and conserve DRAM bandwidth by 1) focusing on making data prefetching more accurate, and 2) lowering the miss rate in the cache hierarchy, either by preserving useful data longer by cache-bypassing the less useful data or via code size compaction using compiler options. First, we show how microarchitecture-independent memory access profiles can be used to analyze the instruction cache performance of software. We use this information in a compiler pass to recompile application phases (with a large instruction cache miss rate) for smaller code size in an effort to improve the application's instruction cache behavior. Second, we demonstrate how a resource-efficient software prefetching method can be combined with hardware prefetching to improve performance in multicores when running software that exhibits irregular memory access patterns. Third, we show that hardware prefetching on high-performance commodity multicores is sub-optimal and demonstrate how a resource-efficient software-only prefetching method can perform better in fully utilized multicores. Fourth, we present an adaptive prefetching approach that dynamically combines software and hardware prefetching in a runtime system to improve performance in highly utilized multicores. Finally, in the fifth work, we develop a method to predict per-core prefetching configurations that deliver near-optimal overall multicore performance. These software techniques enable us to tap greater performance in multicores (up to 50%), without requiring more processing resources.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. 54 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1335
Keyword
Performance, Optimization, Prefetching, multicore, memory hierarchy
National Category
Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-272095
ISBN: 978-91-554-9450-6
Public defence
2016-03-21, ITC/2446, Informationsteknologiskt centrum, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Available from: 2016-02-25 Created: 2016-01-11 Last updated: 2016-04-18 (Bibliographically approved)

Authority records

Sembrant, Andreas; Hagersten, Erik
