Uppsala University Publications
Modeling performance variation due to cache sharing
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems (UART)
ORCID iD: 0000-0001-9349-5791
2013 (English). In: Proc. 19th IEEE International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2013, pp. 155-166. Conference paper, published paper (refereed).
Abstract [en]

Shared cache contention can cause significant variability in the performance of co-running applications from run to run. This variability arises from different overlappings of the applications' phases, which can be the result of offsets in application start times or other delays in the system. Understanding this variability is important for generating an accurate view of the expected impact of cache contention. However, variability effects are typically ignored due to the high overhead of modeling or simulating the many executions needed to expose them.

This paper introduces a method for efficiently investigating the performance variability due to cache contention. Our method relies on input data captured from native execution of applications running in isolation and a fast, phase-aware, cache sharing performance model. This allows us to assess the performance interactions and bandwidth demands of co-running applications by quickly evaluating hundreds of overlappings.
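The idea of sweeping many overlappings can be illustrated with a toy model. This is a sketch only: the per-phase data, the contention rule, and the function names below are invented for illustration and are not the paper's actual cache-sharing model.

```python
# Toy sketch of sweeping phase overlappings (NOT the paper's model).
# Each application is reduced to a list of fixed-length phases,
# (baseline_cpi, cache_pressure), as if profiled in isolation.

def co_run_slowdowns(phases_a, phases_b):
    """Estimate A's slowdown for every start offset of B relative to A.

    For each offset, overlapping phases contend for the shared cache
    according to a made-up rule, and A's CPI is inflated accordingly.
    Returns one estimated slowdown (co-run CPI / solo CPI) per offset.
    """
    solo = sum(cpi for cpi, _ in phases_a)
    slowdowns = []
    for offset in range(len(phases_b)):
        shared = 0.0
        for i, (cpi_a, press_a) in enumerate(phases_a):
            _, press_b = phases_b[(i + offset) % len(phases_b)]
            # Invented rule: CPI penalty grows with B's share of the
            # combined cache pressure during the overlapped phase.
            contention = press_b / (press_a + press_b + 1e-9)
            shared += cpi_a * (1.0 + contention)
        slowdowns.append(shared / solo)
    return slowdowns

# A cache-light/cache-heavy phase pattern for each application:
a = [(1.0, 0.2), (1.5, 0.9), (1.0, 0.2), (1.5, 0.9)]
b = [(1.2, 0.8), (1.2, 0.1), (1.2, 0.8), (1.2, 0.1)]
s = co_run_slowdowns(a, b)
# The spread between min(s) and max(s) is the run-to-run variability
# that different phase overlappings expose.
print(min(s), max(s))
```

Because each offset is just an array sweep over precaptured profiles, hundreds of overlappings can be evaluated in a fraction of the time a single native co-run would take, which is what makes the variability visible at all.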

We evaluate our method on a contemporary multicore machine and show that performance and bandwidth demands can vary significantly across runs of the same set of co-running applications. We show that our method can predict both application slowdown, with an average relative error of 0.41% (maximum 1.8%), and bandwidth consumption. Using our method, we can estimate an application pair's performance variation 213x faster, on average, than native execution.

Place, publisher, year, edition, pages
IEEE Computer Society, 2013. pp. 155-166.
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
URN: urn:nbn:se:uu:diva-196181
DOI: 10.1109/HPCA.2013.6522315
ISI: 000323775000014
ISBN: 978-1-4673-5585-8 (print)
OAI: oai:DiVA.org:uu-196181
DiVA: diva2:612407
Conference
HPCA 2013, February 23-27, Shenzhen, China
Projects
CoDeR-MP, UPMARC
Available from: 2013-03-21. Created: 2013-03-05. Last updated: 2014-04-29. Bibliographically approved.
In thesis
1. Understanding Multicore Performance: Efficient Memory System Modeling and Simulation
2014 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, the high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimization, where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications be simulated multiple times with different interleavings to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by simulating only a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results.

In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, which can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleavings without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.
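The third strategy can be sketched schematically. The thesis's simulator builds on detailed microarchitectural simulation with native fast-forwarding; the functions below are invented stand-ins, and no real simulation happens here.

```python
# Schematic of parallel sampling simulation (stand-in code, not the
# thesis's simulator). Native execution fast-forwards to each sample
# point at near-native speed, so sample windows can be simulated in
# detail concurrently and independently.
from concurrent.futures import ThreadPoolExecutor

def simulate_sample(start_inst):
    # Stand-in for detailed simulation of a short window starting at
    # instruction `start_inst`; returns a fake per-window CPI.
    return 1.0 + (start_inst % 7) * 0.05

def sampled_cpi(total_insts, sample_every, workers=4):
    # Pick evenly spaced sample points; in the real scheme, reaching
    # each point is cheap because the gap is executed natively.
    starts = range(0, total_insts, sample_every)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cpis = list(pool.map(simulate_sample, starts))
    # Aggregate: average CPI over independently simulated samples.
    return sum(cpis) / len(cpis)

print(sampled_cpi(100, 10))
```

Because the samples are independent, throughput grows almost linearly with the number of workers, which is the source of the near-linear scalability the abstract describes.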

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2014. 54 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1136
Keywords
Computer Architecture, Simulation, Modeling, Sampling, Caches, Memory Systems, gem5, Parallel Simulation, Virtualization, Multicore
National Category
Computer Engineering
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-220652
ISBN: 978-91-554-8922-9
Public defence
2014-05-22, Room 2446, Polacksbacken, Lägerhyddsvägen 2, Uppsala, 09:30 (English)
Projects
CoDeR-MP, UPMARC
Available from: 2014-04-28. Created: 2014-03-18. Last updated: 2014-07-21. Bibliographically approved.

Open Access in DiVA

fulltext (834 kB), 387 downloads
File information
File name: FULLTEXT02.pdf
File size: 834 kB
Checksum (SHA-512): 482c7dc26860b30801a824df0b7a55abd1baefd3d8fb502e16c9323e03ca26ee549f9d02f2fc88730d19e7a009bc7225001660ca3c03417aa623ee7e6db50643
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text

Authority records

Sandberg, Andreas; Sembrant, Andreas; Hagersten, Erik; Black-Schaffer, David

Total: 387 downloads

Total: 991 hits