Understanding Multicore Performance: Efficient Memory System Modeling and Simulation
Sandberg, Andreas
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems (UART).
ORCID iD: 0000-0001-9349-5791
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, the high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimization, where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. Estimating an application's performance distribution therefore requires simulating it multiple times with different interleavings, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually addressed by simulating only a relatively small number of instructions near the start of an application, at the risk of reporting unrepresentative results.

In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, which can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleavings without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.
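The sample-level parallelism of the third strategy can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the thesis's gem5/KVM implementation: fast_forward() and simulate_sample() are hypothetical placeholders, fast-forwarding in the thesis runs natively under hardware virtualization, and os.fork() stands in for the efficient copying of simulator state.

```python
# Illustrative sketch only: fast_forward() and simulate_sample() are
# placeholders, not the thesis's actual gem5/KVM code. Requires POSIX.
import os
import time

def fast_forward(interval):
    """Placeholder for near-native execution between sample points."""
    time.sleep(0.001)

def simulate_sample(index):
    """Placeholder for slow, detailed simulation of one sample."""
    time.sleep(0.05)

def run_sampling(num_samples, interval):
    children = []
    for i in range(num_samples):
        fast_forward(interval)        # cheap: runs at near-native speed
        pid = os.fork()               # clone state cheaply (copy-on-write)
        if pid == 0:                  # child: simulate sample i in detail
            simulate_sample(i)
            os._exit(0)
        children.append(pid)          # parent: keeps fast-forwarding at once
    for pid in children:
        os.waitpid(pid, 0)            # collect all detailed samples

if __name__ == "__main__":
    run_sampling(num_samples=8, interval=100_000_000)
```

Because the parent never waits for a sample to finish before moving on, detailed simulation of many samples overlaps with fast-forwarding, which is what yields the near-linear scalability claimed above.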

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2014. 54 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1136
Keyword [en]
Computer Architecture, Simulation, Modeling, Sampling, Caches, Memory Systems, gem5, Parallel Simulation, Virtualization, Multicore
National Category
Computer Engineering
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-220652
ISBN: 978-91-554-8922-9 (print)
OAI: oai:DiVA.org:uu-220652
DiVA: diva2:708754
Public defence
2014-05-22, Room 2446, Polacksbacken, Lägerhyddsvägen 2, Uppsala, 09:30 (English)
Projects
CoDeR-MP, UPMARC
Available from: 2014-04-28. Created: 2014-03-18. Last updated: 2014-07-21. Bibliographically approved.
List of papers
1. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses
2010 (English). In: Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC 2010, Piscataway, NJ: IEEE, 2010, pp. 11–. Conference paper, Published paper (Refereed)
Abstract [en]

Contention for shared cache resources has been recognized as a major bottleneck for multicores, especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy. We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance by up to 35% on x86 multicore hardware.
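The detection step can be illustrated with a small reuse-based classifier. This is a hedged sketch of the general idea, not the paper's actual algorithm: the cache parameters and the time-distance proxy for stack distance are assumptions. Instructions whose fetched data is almost never reused before roughly a cache's worth of other lines has been touched are candidates for rewriting with the processor's cache management (non-temporal) instructions.

```python
# Sketch only: classify instructions (PCs) whose data is rarely reused
# within the cache's capacity as non-temporal candidates. The time-distance
# test below is a crude proxy for true stack distance.
from collections import defaultdict

LINE = 64                          # assumed cache line size (bytes)
CACHE_LINES = 512 * 1024 // LINE   # assumed 512 kB last-level cache

def classify(trace, threshold=0.05):
    """trace: iterable of (pc, address) pairs; returns non-temporal PCs."""
    last_use = {}                  # cache line -> (time, fetching pc)
    reused = defaultdict(int)      # pc -> reuses seen within capacity
    total = defaultdict(int)       # pc -> total accesses
    for t, (pc, addr) in enumerate(trace):
        line = addr // LINE
        if line in last_use:
            prev_t, prev_pc = last_use[line]
            if t - prev_t < CACHE_LINES:   # reused before likely eviction
                reused[prev_pc] += 1
        last_use[line] = (t, pc)
        total[pc] += 1
    return {pc for pc in total if reused[pc] / total[pc] < threshold}

# A purely streaming access pattern is flagged as non-temporal:
trace = [(0x10, addr) for addr in range(0, 64 * 10000, 64)]
print(classify(trace))             # -> {16}
```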

Place, publisher, year, edition, pages
Piscataway, NJ: IEEE, 2010
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-134386 (URN)
10.1109/SC.2010.44 (DOI)
978-1-4244-7557-5 (ISBN)
Conference
2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, November 13–19, 2010
Projects
CoDeR-MP, UPMARC
Available from: 2010-11-25. Created: 2010-11-24. Last updated: 2014-04-29. Bibliographically approved.
2. Efficient techniques for predicting cache sharing and throughput
2012 (English). In: Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, New York: ACM Press, 2012, pp. 305–314. Conference paper, Published paper (Refereed)
Abstract [en]

This work addresses the modeling of shared cache contention in multicore systems and its impact on throughput and bandwidth. We develop two simple and fast cache sharing models for accurately predicting shared cache allocations for random and LRU caches.

To accomplish this we use low-overhead input data that captures the behavior of applications running on real hardware as a function of their shared cache allocation. This data enables us to determine how much and how aggressively data is reused by an application depending on how much shared cache it receives. From this we can model how applications compete for cache space, their aggregate performance (throughput), and bandwidth.

We evaluate our models for two- and four-application workloads in simulation and on modern hardware. On a four-core machine, we demonstrate an average relative fetch ratio error of 6.7% for groups of four applications. We are able to predict workload bandwidth with an average relative error of less than 5.2% and throughput with an average error of less than 1.8%. The model can predict cache size with an average error of 1.3% compared to simulation.
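One simplified instance of this modeling style, under loudly labeled assumptions: suppose that in a random-replacement shared cache each application's steady-state share is proportional to its fetch (insertion) rate, and that each application's standalone miss-ratio curve is available as input (the paper's measured profiles play this role). A damped fixed-point iteration then predicts the competing allocations. This is a sketch of the idea, not the paper's exact equations.

```python
# Sketch of a cache-sharing model: allocations proportional to fetch rates,
# found by damped fixed-point iteration. Not the paper's exact formulation.
import math

def share_cache(mrcs, access_rates, cache_size, iters=200, damp=0.5):
    """mrcs: per-app functions size -> miss ratio; returns allocations."""
    n = len(mrcs)
    alloc = [cache_size / n] * n                 # start from an even split
    for _ in range(iters):
        fetch = [a * mrc(c)                      # fetches = accesses * misses
                 for a, mrc, c in zip(access_rates, mrcs, alloc)]
        total = sum(fetch)
        target = [cache_size * f / total for f in fetch]
        alloc = [damp * c + (1 - damp) * t       # damping aids convergence
                 for c, t in zip(alloc, target)]
    return alloc

# Example: a streaming app (high, flat miss ratio) squeezes a cache-friendly
# app (miss ratio falls as its allocation grows) out of a 1 MB cache.
streaming = lambda c: 0.9
friendly = lambda c: math.exp(-c / 256e3)
print(share_cache([streaming, friendly], [1.0, 1.0], 1024e3))
```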

Place, publisher, year, edition, pages
New York: ACM Press, 2012
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-178207 (URN)
10.1145/2370816.2370861 (DOI)
978-1-4503-1182-3 (ISBN)
Conference
PACT 2012, September 19–23, Minneapolis, MN
Projects
CoDeR-MP, UPMARC
Available from: 2012-10-09. Created: 2012-07-30. Last updated: 2014-04-29. Bibliographically approved.
3. Modeling performance variation due to cache sharing
2013 (English). In: Proc. 19th IEEE International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2013, pp. 155–166. Conference paper, Published paper (Refereed)
Abstract [en]

Shared cache contention can cause significant variability in the performance of co-running applications from run to run. This variability arises from different overlappings of the applications' phases, which can be the result of offsets in application start times or other delays in the system. Understanding this variability is important for generating an accurate view of the expected impact of cache contention. However, variability effects are typically ignored due to the high overhead of modeling or simulating the many executions needed to expose them.

This paper introduces a method for efficiently investigating the performance variability due to cache contention. Our method relies on input data captured from native execution of applications running in isolation and a fast, phase-aware, cache sharing performance model. This allows us to assess the performance interactions and bandwidth demands of co-running applications by quickly evaluating hundreds of overlappings.

We evaluate our method on a contemporary multicore machine and show that performance and bandwidth demands can vary significantly across runs of the same set of co-running applications. We show that our method can predict application slowdown with an average relative error of 0.41% (maximum 1.8%), as well as the applications' bandwidth consumption. Using our method, we can estimate an application pair's performance variation 213× faster, on average, than native execution.
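The core of the method, evaluating many phase overlappings quickly, can be sketched as follows. The per-window profiles and the predict_slowdown() model are hypothetical stand-ins here; in the paper, the inputs come from native execution in isolation and the model is its phase-aware cache sharing performance model.

```python
# Sketch: slide one application's phase trace against the other's and
# evaluate a sharing model per offset; predict_slowdown() is a placeholder.
def slowdown_distribution(phases_a, phases_b, predict_slowdown):
    """Returns predicted mean slowdowns of app A, one per start offset."""
    results = []
    n = len(phases_b)
    for offset in range(n):                  # one simulated co-run per offset
        total = sum(predict_slowdown(pa, phases_b[(i + offset) % n])
                    for i, pa in enumerate(phases_a))
        results.append(total / len(phases_a))
    return results

# Toy model: slowdown grows with the two windows' combined cache pressure.
toy = lambda pa, pb: 1.0 + 0.5 * pa * pb
print(slowdown_distribution([0.2, 0.9, 0.4], [0.1, 0.8, 0.5], toy))
```

Evaluating hundreds of offsets with such a model costs far less than running hundreds of native or simulated co-runs, which is the source of the 213× speedup reported above.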

Place, publisher, year, edition, pages
IEEE Computer Society, 2013
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-196181 (URN)
10.1109/HPCA.2013.6522315 (DOI)
000323775000014 (ISI)
978-1-4673-5585-8 (ISBN)
Conference
HPCA 2013, February 23–27, Shenzhen, China
Projects
CoDeR-MP, UPMARC
Available from: 2013-03-21. Created: 2013-03-05. Last updated: 2014-04-29. Bibliographically approved.
4. Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed
2014 (English). Report (Other academic)
Abstract [en]

Popular microarchitecture simulators are typically several orders of magnitude slower than the systems they simulate. This leads to two problems: First, due to the slow simulation rate, simulation studies are usually limited to the first few billion instructions, which corresponds to less than 10% of the execution time of many standard benchmarks. Since such studies cover only a small fraction of an application's execution, they run the risk of reporting unrepresentative behavior unless sampling strategies are employed. Second, the high overhead of traditional simulators makes them unsuitable for hardware/software co-design studies where rapid turn-around is required.

In spite of previous efforts to parallelize simulators, the most commonly used full-system simulators remain single-threaded. In this paper, we explore a simple and effective way to parallelize sampling full-system simulators. To simulate at high speed, we need to be able to efficiently fast-forward between sample points. We demonstrate how hardware virtualization can be used to implement highly efficient fast-forwarding in the standard gem5 simulator and how this enables efficient execution between sample points. This extremely rapid fast-forwarding enables us to reach new sample points much more quickly than a single sample can be simulated. Together with efficient copying of simulator state, this enables samples to be simulated in parallel. These techniques allow us to implement a highly scalable sampling simulator that exploits sample-level parallelism.

We demonstrate how virtualization can be used to fast-forward simulators at 90% of native execution speed on average. Using virtualized fast-forwarding, we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. We demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.
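The throughput claims can be sanity-checked with back-of-envelope arithmetic. The sketch below uses assumed parameters (native rate, fraction of instructions simulated in detail); only the 90% fast-forwarding speed and the 19,000× detailed-simulation slowdown are taken from the abstract.

```python
# Back-of-envelope throughput model for a sampling simulator; the specific
# numbers below are illustrative assumptions, not the paper's measurements.
def serial_rate(f_detail, r_ffwd, r_detail):
    """Overall rate when detailed samples (fraction f_detail of instructions,
    simulated at r_detail) alternate with fast-forwarding at r_ffwd."""
    return 1.0 / (f_detail / r_detail + (1.0 - f_detail) / r_ffwd)

r_native = 3.2e9              # assumed native rate: 3.2 GIPS
r_ffwd = 0.9 * r_native       # virtualized fast-forwarding at 90% of native
r_detail = r_native / 19000   # detailed-simulation slowdown from the paper

f = 1e-4                      # assumed: 0.01% of instructions in detail
print(f"serial:   {serial_rate(f, r_ffwd, r_detail) / 1e9:.2f} GIPS")
# With enough cores, detailed samples run concurrently with fast-forwarding,
# so the parallel simulator's rate is bounded by fast-forwarding alone:
print(f"parallel: {r_ffwd / 1e9:.2f} GIPS bound")
```

Under these assumptions the serial design reaches about 1 GIPS, while the parallel design approaches the fast-forwarding bound of roughly 2.9 GIPS, consistent in spirit with the 2.0 GIPS (63% of native) reported above.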

Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2014-005
Keyword
Computer Architecture, Simulation, Sampling, Native Execution, Virtualization, pFSA, FSA, KVM
National Category
Computer Engineering
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-220649 (URN)
Projects
UPMARC, CoDeR-MP
Available from: 2014-03-18. Created: 2014-03-18. Last updated: 2014-04-29. Bibliographically approved.

Open Access in DiVA

fulltext: FULLTEXT02.pdf (448 kB, application/pdf)