Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART) ORCID iD: 0000-0001-9349-5791
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
2014 (English) Report (Other academic)
Abstract [en]

Popular microarchitecture simulators are typically several orders of magnitude slower than the systems they simulate. This leads to two problems: First, due to the slow simulation rate, simulation studies are usually limited to the first few billion instructions, which corresponds to less than 10% of the execution time of many standard benchmarks. Since such studies only cover a small fraction of the applications, they run the risk of reporting unrepresentative application behavior unless sampling strategies are employed. Second, the high overhead of traditional simulators makes them unsuitable for hardware/software co-design studies where rapid turn-around is required.

In spite of previous efforts to parallelize simulators, most commonly used full-system simulators remain single-threaded. In this paper, we explore a simple and effective way to parallelize sampling full-system simulators. In order to simulate at high speed, we need to be able to efficiently fast-forward between sample points. We demonstrate how hardware virtualization can be used to implement highly efficient fast-forwarding in the standard gem5 simulator and how this enables efficient execution between sample points. This extremely rapid fast-forwarding enables us to reach new sample points much more quickly than a single sample can be simulated. Together with efficient copying of simulator state, this enables parallel execution of sample simulation. These techniques allow us to implement a highly scalable sampling simulator that exploits sample-level parallelism.
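
The idea of sample-level parallelism described above can be illustrated with a toy sketch. It is not the authors' implementation and does not use gem5's API: `ToySimulator`, `run_parallel_sampling`, and the fake IPC formula are all hypothetical, and a deep copy stands in for whatever state-copy mechanism a real simulator would use. The sketch only shows the control flow: a parent fast-forwards to each sample point, snapshots its state, and hands the snapshot to a worker that simulates the sample in detail while the parent keeps going.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class ToySimulator:
    """Hypothetical stand-in for a simulator; tracks only an instruction count."""

    def __init__(self):
        self.instructions = 0

    def fast_forward(self, count):
        # Stands in for fast (e.g. virtualized) execution between samples.
        self.instructions += count

    def simulate_detailed(self, count):
        # Stands in for slow, detailed simulation of one sample.
        start = self.instructions
        self.instructions += count
        toy_ipc = 1.0 + (start % 3) * 0.5  # made-up value, not a real measurement
        return start, toy_ipc


def run_parallel_sampling(total, interval, sample_len, workers=4):
    """Fast-forward in the parent; simulate each sample on a copied state."""
    parent = ToySimulator()
    pending = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while parent.instructions + interval <= total:
            parent.fast_forward(interval)
            # Copy the state so the parent can continue immediately.
            snapshot = copy.deepcopy(parent)
            pending.append(pool.submit(snapshot.simulate_detailed, sample_len))
        # Collect results in sample order.
        return [future.result() for future in pending]


if __name__ == "__main__":
    samples = run_parallel_sampling(total=1000, interval=250, sample_len=10)
    mean_ipc = sum(ipc for _, ipc in samples) / len(samples)
    print(samples, mean_ipc)
```

Because the parent never waits for a sample to finish, throughput in this sketch is bounded by how fast the parent can fast-forward and copy state, which is why both need to be cheap for the approach to scale.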

We demonstrate how virtualization can be used to fast-forward simulators at 90% of native execution speed on average. Using virtualized fast-forwarding, we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. We demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.
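
As a rough reading aid, the average figures quoted above can be related to each other with simple arithmetic. The helper below is hypothetical, and the derived values are only approximations: the abstract reports averages, and dividing one average by another is not guaranteed to reproduce the per-benchmark numbers behind them.

```python
def implied_native_gips(rate_gips, fraction_of_native):
    """Native rate implied by a simulation rate and its fraction of native speed."""
    return rate_gips / fraction_of_native


# 2.0 GIPS is reported as 63% of native execution speed on average.
native_gips = implied_native_gips(2.0, 0.63)

# Fast-forwarding is reported at 90% of native speed on average.
fast_forward_gips = 0.90 * native_gips

print(f"implied native rate:       {native_gips:.2f} GIPS")
print(f"implied fast-forward rate: {fast_forward_gips:.2f} GIPS")
```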

Place, publisher, year, edition, pages
2014.
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2014-005
Keyword [en]
Computer Architecture, Simulation, Sampling, Native Execution, Virtualization, pFSA, FSA, KVM
National Category
Computer Engineering
Research subject
Computer Systems
Identifiers
URN: urn:nbn:se:uu:diva-220649
OAI: oai:DiVA.org:uu-220649
DiVA: diva2:705982
Projects
UPMARC, CoDeR-MP
Available from: 2014-03-18 Created: 2014-03-18 Last updated: 2014-04-29 Bibliographically approved
In thesis
1. Understanding Multicore Performance: Efficient Memory System Modeling and Simulation
2014 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results.

In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles that can be measured on existing hardware can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2014. 54 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1136
Keyword
Computer Architecture, Simulation, Modeling, Sampling, Caches, Memory Systems, gem5, Parallel Simulation, Virtualization, Multicore
National Category
Computer Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-220652 (URN)
978-91-554-8922-9 (ISBN)
Public defence
2014-05-22, Room 2446, Polacksbacken, Lägerhyddsvägen 2, Uppsala, 09:30 (English)
Opponent
Supervisors
Projects
CoDeR-MP, UPMARC
Available from: 2014-04-28 Created: 2014-03-18 Last updated: 2014-07-21 Bibliographically approved

Open Access in DiVA

fulltext (175 kB), 1098 downloads
File information
File name FULLTEXT01.pdf
File size 175 kB
Checksum SHA-512
058f9856afc081d5513a61b025d724c766e0bd9d4d741b7618536045704c0e34dbabe39c762af65ed6c491742f086a182986a0e9dbc2e6b0f069e8e9e2ae3846
Type fulltext
Mimetype application/pdf

Other links

http://www.it.uu.se/research/publications/reports/2014-005/2014-005-nc.pdf

Authority records BETA

Sandberg, Andreas; Hagersten, Erik; Black-Schaffer, David
