Uppsala University Publications

Publications (10 of 12)
Nikoleris, N., Sandberg, A., Hagersten, E. & Carlson, T. E. (2016). CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling. In: Najjar, W Gerstlauer, A (Ed.), Proceedings Of 2016 International Conference On Embedded Computer Systems: Architectures, Modeling And Simulation (Samos). Paper presented at International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS), JUL 17-21, 2016, Samos, GREECE (pp. 106-115). IEEE
CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling
2016 (English). In: Proceedings Of 2016 International Conference On Embedded Computer Systems: Architectures, Modeling And Simulation (Samos) / [ed] Najjar, W Gerstlauer, A, IEEE, 2016, p. 106-115. Conference paper, Published paper (Refereed)
Abstract [en]

Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-accurate simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more through the reduction of large workloads into a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored into checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture or fundamental pieces of the microarchitecture (e.g., hardware-software co-design) require checkpoint regeneration. The end result for co-design degenerates to creating checkpoints for each modification, a task checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation, without the need for checkpoints. Virtualized fast-forwarding (VFF), an alternative to using checkpoints, allows for execution at near-native speed between simulation points. Warming the microarchitectural state prior to each simulation point, however, requires functional simulation, a costly operation for large caches (e.g., 8 MB). Simulating future systems with caches of many MBs can require warming of billions of instructions, dominating simulation time. This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim uses VFF to advance between simulation points while collecting sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional simulation. At the simulation point, detailed simulation with a statistical cache model is used to evaluate the design. The previously acquired MRI is used to estimate whether each memory request hits in the cache.
The MRI is an architecturally independent metric and a single profile can be used in simulations of any size cache. We describe a prototype implementation of CoolSim based on KVM and gem5 running 19× faster than state-of-the-art sampled simulation, while estimating the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.
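The statistical cache model described above can be illustrated with a toy sketch (hypothetical code, not the CoolSim implementation): given sparse samples of reuse behavior, estimate the miss ratio of a fully associative LRU cache of any size from one profile.

```python
def estimate_miss_ratio(reuse_samples, cache_lines):
    """Estimate the miss ratio of a fully associative LRU cache.

    Each sample is the stack distance of one sampled access: the number
    of unique lines touched since the previous access to the same line.
    None marks a cold miss. An access hits iff its stack distance is
    smaller than the number of lines in the cache.
    """
    misses = sum(1 for d in reuse_samples if d is None or d >= cache_lines)
    return misses / len(reuse_samples)

# One architecture-independent profile evaluated for two cache sizes.
samples = [3, 70, None, 8, 200, 5, 90, 1]
print(estimate_miss_ratio(samples, 64))    # small cache: frequent misses
print(estimate_miss_ratio(samples, 1024))  # large cache: only the cold miss
```

Because the samples are independent of any cache parameters, the same profile answers "what if" questions for every cache size, mirroring why a single MRI profile suffices across configurations.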

Place, publisher, year, edition, pages
IEEE, 2016
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-322061 (URN), 000399143000015 (), 9781509030767 (ISBN)
Conference
International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS), JUL 17-21, 2016, Samos, GREECE
Funder
Swedish Foundation for Strategic Research; EU, FP7, Seventh Framework Programme, 610490
Available from: 2017-05-16 Created: 2017-05-16 Last updated: 2018-12-14. Bibliographically approved
Khan, M., Sandberg, A. & Hagersten, E. (2014). A case for resource efficient prefetching in multicores. In: Proc. 43rd International Conference on Parallel Processing: . Paper presented at 2014 43rd International Conference on Parallel Processing (ICPP), September 9-12, Minneapolis, MN (pp. 101-110). IEEE Computer Society
A case for resource efficient prefetching in multicores
2014 (English). In: Proc. 43rd International Conference on Parallel Processing, IEEE Computer Society, 2014, p. 101-110. Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors typically employ sophisticated prefetching techniques for hiding memory latency. Hardware prefetching has proven very effective and can speed up some SPEC CPU 2006 benchmarks by more than 40% when running in isolation. However, this speedup often comes at the cost of prefetching a significant volume of useless data (sometimes more than twice the data required) which wastes shared last level cache space and off-chip bandwidth. This paper explores how an accurate resource-efficient prefetching scheme can benefit performance by conserving shared resources in multicores. We present a framework that uses low-overhead runtime sampling and fast cache modeling to accurately identify memory instructions that frequently miss in the cache. We then use this information to automatically insert software prefetches in the application. Our prefetching scheme has good accuracy and employs cache bypassing whenever possible. These properties help reduce off-chip bandwidth consumption and last-level cache pollution. While single-thread performance remains comparable to hardware prefetching, the full advantage of the scheme is realized when several cores are used and demand for shared resources grows. We evaluate our method on two modern commodity multicores. Across 180 mixed workloads that fully utilize a multicore, the proposed software prefetching mechanism achieves up to 24% better throughput than hardware prefetching, and performs 10% better on average.
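The selection step sketched in the abstract, identifying instructions that frequently miss and choosing between a plain prefetch and a cache-bypassing one, could look roughly like this (names and thresholds are invented for illustration, not taken from the paper):

```python
def plan_prefetches(profile, miss_threshold=0.2):
    """Decide, per memory instruction, whether to insert a software
    prefetch and whether it should bypass the cache.

    profile maps an instruction address to (miss_ratio, reused), where
    reused says whether the fetched data is reused while still cached.
    """
    plan = {}
    for pc, (miss_ratio, reused) in profile.items():
        if miss_ratio < miss_threshold:
            plan[pc] = "leave"          # the cache already works well here
        elif reused:
            plan[pc] = "prefetch"       # fetch into the cache ahead of use
        else:
            plan[pc] = "prefetch-nt"    # non-temporal: bypass the shared LLC
    return plan

profile = {0x400a10: (0.05, True),   # rarely misses
           0x400a2c: (0.60, True),   # misses often, data is reused
           0x400a48: (0.90, False)}  # streaming access, never reused
plan = plan_prefetches(profile)
```

The bypassing branch is what conserves shared last-level cache space and off-chip bandwidth: data that is never reused is fetched around the cache instead of into it.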

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-234547 (URN), 10.1109/ICPP.2014.19 (DOI), 978-1-4799-5618-0 (ISBN)
Conference
2014 43rd International Conference on Parallel Processing (ICPP), September 9-12, Minneapolis, MN
Available from: 2014-11-25 Created: 2014-10-20 Last updated: 2018-01-11. Bibliographically approved
Khan, M., Sandberg, A. & Hagersten, E. (2014). A case for resource efficient prefetching in multicores. In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2014. Paper presented at ISPASS 2014, March 23-25, Monterey, CA (pp. 137-138). IEEE Computer Society
A case for resource efficient prefetching in multicores
2014 (English). In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2014, IEEE Computer Society, 2014, p. 137-138. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Hardware prefetching has proven very effective for hiding memory latency and can speed up some applications by more than 40%. However, this speedup comes at the cost of often prefetching a significant volume of useless data which wastes shared last level cache space and off-chip bandwidth. This directly impacts the performance of co-scheduled applications which compete for shared resources in multicores. This paper explores how a resource-efficient prefetching scheme can benefit performance by conserving shared resources in multicores. We present a framework that uses fast cache modeling to accurately identify memory instructions that benefit most from prefetching. The framework inserts software prefetches in the application only when they benefit performance, and employs cache bypassing whenever possible. These properties help reduce off-chip bandwidth consumption and last-level cache pollution. While single-thread performance remains comparable to hardware prefetching, the full advantage of the scheme is realized when several cores are used and demand for shared resources grows.

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-234546 (URN), 10.1109/ISPASS.2014.6844473 (DOI), 978-1-4799-3604-5 (ISBN)
Conference
ISPASS 2014, March 23-25, Monterey, CA
Projects
UPMARC
Available from: 2014-05-06 Created: 2014-10-20 Last updated: 2018-01-11. Bibliographically approved
Sandberg, A., Hagersten, E. & Black-Schaffer, D. (2014). Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed.
Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed
2014 (English). Report (Other academic)
Abstract [en]

Popular microarchitecture simulators are typically several orders of magnitude slower than the systems they simulate. This leads to two problems: First, due to the slow simulation rate, simulation studies are usually limited to the first few billion instructions, which corresponds to less than 10% of the execution time of many standard benchmarks. Since such studies only cover a small fraction of the applications' execution, they run the risk of reporting unrepresentative application behavior unless sampling strategies are employed. Second, the high overhead of traditional simulators makes them unsuitable for hardware/software co-design studies where rapid turn-around is required.

In spite of previous efforts to parallelize simulators, most commonly used full-system simulations remain single threaded. In this paper, we explore a simple and effective way to parallelize sampling full-system simulators. In order to simulate at high speed, we need to be able to efficiently fast-forward between sample points. We demonstrate how hardware virtualization can be used to implement highly efficient fast-forwarding in the standard gem5 simulator and how this enables efficient execution between sample points. This extremely rapid fast-forwarding enables us to reach new sample points much quicker than a single sample can be simulated. Together with efficient copying of simulator state, this enables parallel execution of sample simulation. These techniques allow us to implement a highly scalable sampling simulator that exploits sample-level parallelism.

We demonstrate how virtualization can be used to fast-forward simulators at 90% of native execution speed on average. Using virtualized fast-forwarding, we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. We demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000× faster than detailed simulation, while using 8 cores.
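The sample-level parallelism described above can be mimicked with a toy model (hypothetical functions stand in for virtualized fast-forwarding and for gem5's detailed mode): because fast-forwarding reaches new sample points far faster than any one sample can be simulated, the detailed simulations of many samples can run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def fast_forward(n_samples):
    """Cheap stand-in for virtualized fast-forwarding: produce the
    (toy) architectural state at each sample point almost instantly."""
    return [{"sample": i} for i in range(n_samples)]

def simulate_detailed(state):
    """Stand-in for the expensive detailed simulation of one sample."""
    return 1.0 + 0.01 * state["sample"]   # toy per-sample IPC

def sampled_simulation(n_samples, workers=4):
    # Samples are independent, so their detailed simulations can run in
    # parallel while fast-forwarding races ahead to the next sample point.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ipcs = list(pool.map(simulate_detailed, fast_forward(n_samples)))
    return sum(ipcs) / n_samples

mean_ipc = sampled_simulation(8)
```

In the real system each "state" is a copy of simulator state produced by efficiently cloning the fast-forwarding process, not a Python dict; the sketch only shows why sample independence yields near-linear scaling.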

Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2014-005
Keywords
Computer Architecture, Simulation, Sampling, Native Execution, Virtualization, pFSA, FSA, KVM
National Category
Computer Engineering
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-220649 (URN)
Projects
UPMARC, CoDeR-MP
Available from: 2014-03-18 Created: 2014-03-18 Last updated: 2018-01-11. Bibliographically approved
Sandberg, A. (2014). Understanding Multicore Performance: Efficient Memory System Modeling and Simulation. (Doctoral dissertation). Uppsala: Acta Universitatis Upsaliensis
Understanding Multicore Performance: Efficient Memory System Modeling and Simulation
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results.

In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, that can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2014. p. 54
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1136
Keywords
Computer Architecture, Simulation, Modeling, Sampling, Caches, Memory Systems, gem5, Parallel Simulation, Virtualization, Multicore
National Category
Computer Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-220652 (URN)978-91-554-8922-9 (ISBN)
Public defence
2014-05-22, Room 2446, Polacksbacken, Lägerhyddsvägen 2, Uppsala, 09:30 (English)
Opponent
Supervisors
Projects
CoDeR-MP, UPMARC
Available from: 2014-04-28 Created: 2014-03-18 Last updated: 2018-01-11. Bibliographically approved
Bischoff, S., Sandberg, A., Hansson, A., Dam, S., Saidi, A., Horsnell, M. & Al-Hashimi, B. (2013). Flexible and High-Speed System-Level Performance Analysis using Hardware-Accelerated Simulation. In: : . Paper presented at Design, Automation & Test in Europe (DATE), 18-22 March, 2013, Grenoble, France. Grenoble, France: Design, Automation & Test in Europe (DATE)
Flexible and High-Speed System-Level Performance Analysis using Hardware-Accelerated Simulation
2013 (English). Conference paper, Oral presentation with published abstract (Other academic)
Place, publisher, year, edition, pages
Grenoble, France: Design, Automation & Test in Europe (DATE), 2013
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-197299 (URN)
Conference
Design, Automation & Test in Europe (DATE), 18-22 March, 2013, Grenoble, France
Available from: 2013-07-19 Created: 2013-03-21 Last updated: 2014-01-09. Bibliographically approved
Sandberg, A., Sembrant, A., Hagersten, E. & Black-Schaffer, D. (2013). Modeling performance variation due to cache sharing. In: Proc. 19th IEEE International Symposium on High Performance Computer Architecture: . Paper presented at HPCA 2013, February 23-27, Shenzhen, China (pp. 155-166). IEEE Computer Society
Modeling performance variation due to cache sharing
2013 (English). In: Proc. 19th IEEE International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2013, p. 155-166. Conference paper, Published paper (Refereed)
Abstract [en]

Shared cache contention can cause significant variability in the performance of co-running applications from run to run. This variability arises from different overlappings of the applications' phases, which can be the result of offsets in application start times or other delays in the system. Understanding this variability is important for generating an accurate view of the expected impact of cache contention. However, variability effects are typically ignored due to the high overhead of modeling or simulating the many executions needed to expose them.

This paper introduces a method for efficiently investigating the performance variability due to cache contention. Our method relies on input data captured from native execution of applications running in isolation and a fast, phase-aware, cache sharing performance model. This allows us to assess the performance interactions and bandwidth demands of co-running applications by quickly evaluating hundreds of overlappings.

We evaluate our method on a contemporary multicore machine and show that performance and bandwidth demands can vary significantly across runs of the same set of co-running applications. We show that our method can predict application slowdown with an average relative error of 0.41% (maximum 1.8%) as well as bandwidth consumption. Using our method, we can estimate an application pair's performance variation 213× faster, on average, than native execution.
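The core idea, evaluating many phase overlappings instead of one, can be sketched with a toy phase-aware model (the traces, the slowdown formula, and all names here are invented for illustration; the paper's model is calibrated from native executions):

```python
def mean_slowdown(a, b):
    """Toy combining rule: per-phase slowdown grows with the product of
    the two applications' cache pressure in the overlapping phases."""
    per_phase = [1.0 + 0.5 * pa * pb for pa, pb in zip(a, b)]
    return sum(per_phase) / len(per_phase)

def slowdown_range(a, b):
    """Evaluate every cyclic offset of B's phase trace against A's and
    report the spread of A's mean slowdown across overlappings."""
    means = [mean_slowdown(a, b[off:] + b[:off]) for off in range(len(b))]
    return min(means), max(means)

# Per-phase cache pressure (0 = cache-insensitive, 1 = very cache-hungry).
app_a = [0.9, 0.9, 0.1, 0.1]
app_b = [1.0, 1.0, 0.0, 0.0]
lo, hi = slowdown_range(app_a, app_b)  # spread caused purely by phase offsets
```

Even in this four-phase toy, the same application pair shows different mean slowdowns depending only on how the phases line up, which is exactly the run-to-run variability the paper exposes by quickly evaluating hundreds of overlappings.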

Place, publisher, year, edition, pages
IEEE Computer Society, 2013
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-196181 (URN), 10.1109/HPCA.2013.6522315 (DOI), 000323775000014 (), 978-1-4673-5585-8 (ISBN)
Conference
HPCA 2013, February 23-27, Shenzhen, China
Projects
CoDeR-MP, UPMARC
Available from: 2013-03-21 Created: 2013-03-05 Last updated: 2014-04-29. Bibliographically approved
Sandberg, A., Black-Schaffer, D. & Hagersten, E. (2012). Efficient techniques for predicting cache sharing and throughput. In: Proc. 21st International Conference on Parallel Architectures and Compilation Techniques: . Paper presented at PACT 2012, September 19–23, Minneapolis, MN (pp. 305-314). New York: ACM Press
Efficient techniques for predicting cache sharing and throughput
2012 (English). In: Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, New York: ACM Press, 2012, p. 305-314. Conference paper, Published paper (Refereed)
Abstract [en]

This work addresses the modeling of shared cache contention in multicore systems and its impact on throughput and bandwidth. We develop two simple and fast cache sharing models for accurately predicting shared cache allocations for random and LRU caches.

To accomplish this we use low-overhead input data that captures the behavior of applications running on real hardware as a function of their shared cache allocation. This data enables us to determine how much and how aggressively data is reused by an application depending on how much shared cache it receives. From this we can model how applications compete for cache space, their aggregate performance (throughput), and bandwidth.

We evaluate our models for two- and four-application workloads in simulation and on modern hardware. On a four-core machine, we demonstrate an average relative fetch ratio error of 6.7% for groups of four applications. We are able to predict workload bandwidth with an average relative error of less than 5.2% and throughput with an average error of less than 1.8%. The model can predict cache size with an average error of 1.3% compared to simulation.
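A minimal fixed-point sketch of such a cache sharing model (illustrative only; the paper derives distinct models for random and LRU caches from measured miss-ratio curves): in steady state each application holds cache roughly in proportion to its insertion rate, i.e., its miss rate at its current allocation, so that balance can be iterated to a fixed point.

```python
def share_cache(miss_curves, cache_size, iters=200, step=0.2):
    """miss_curves: one function per application mapping a cache
    allocation to that application's miss ratio. Iterate the balance
    'allocation proportional to miss rate' to a fixed point; the damped
    step keeps the iteration from oscillating."""
    alloc = [cache_size / len(miss_curves)] * len(miss_curves)
    for _ in range(iters):
        rates = [curve(a) for curve, a in zip(miss_curves, alloc)]
        total = sum(rates)
        target = [cache_size * r / total for r in rates]
        alloc = [a + step * (t - a) for a, t in zip(alloc, target)]
    return alloc

# App 0 stops missing once it gets about 2 MB; app 1 streams and always misses.
curves = [lambda a: max(0.05, 1.0 - a / 2.0),
          lambda a: 0.8]
alloc = share_cache(curves, cache_size=8.0)  # MB of shared cache
```

In this toy setting the streaming application ends up holding most of the cache even though the extra space does not help it, which is the kind of contention outcome the random and LRU models quantify from real-hardware input data.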

Place, publisher, year, edition, pages
New York: ACM Press, 2012
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-178207 (URN), 10.1145/2370816.2370861 (DOI), 978-1-4503-1182-3 (ISBN)
Conference
PACT 2012, September 19–23, Minneapolis, MN
Projects
CoDeR-MP, UPMARC
Available from: 2012-10-09 Created: 2012-07-30 Last updated: 2014-04-29. Bibliographically approved
Sandberg, A., Black-Schaffer, D. & Hagersten, E. (2011). A simple statistical cache sharing model for multicores. In: Kessler, Christoph (Ed.), Proc. 4th Swedish Workshop on Multi-Core Computing: . Paper presented at MCC-2011, 4th Swedish Workshop on Multicore Computing, November 23-25, 2011, Linköping, Sweden (pp. 31-36). Linköping, Sweden: Linköping University
A simple statistical cache sharing model for multicores
2011 (English). In: Proc. 4th Swedish Workshop on Multi-Core Computing / [ed] Kessler, Christoph, Linköping, Sweden: Linköping University, 2011, p. 31-36. Conference paper, Published paper (Other academic)
Abstract [en]

The introduction of multicores has made analysis of shared resources, such as shared caches and shared DRAM bandwidth, an important topic to study. We present two simple, but accurate, cache sharing models that use high-level data that can easily be measured on existing systems. We evaluate our model using a simulated multicore processor with four cores and a shared L2 cache. Our evaluation shows that we can predict average sharing in groups of four benchmarks with an average error smaller than 0.79% for random caches and 1.34% for LRU caches.

Place, publisher, year, edition, pages
Linköping, Sweden: Linköping University, 2011
National Category
Computer Engineering
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-165779 (URN)
Conference
MCC-2011, 4th Swedish Workshop on Multicore Computing, November 23-25, 2011, Linköping, Sweden
Projects
CoDeR-MP, UPMARC
Available from: 2012-01-10 Created: 2012-01-09 Last updated: 2018-01-12. Bibliographically approved
Sandberg, A., Eklöv, D. & Hagersten, E. (2010). A Software Technique for Reducing Cache Pollution. In: Proc. 3rd Swedish Workshop on Multi-Core Computing: . Paper presented at 3rd Swedish Workshop on Multi-Core Computing (MCC 2010) (pp. 59-62). Göteborg, Sweden: Chalmers University of Technology
A Software Technique for Reducing Cache Pollution
2010 (English). In: Proc. 3rd Swedish Workshop on Multi-Core Computing, Göteborg, Sweden: Chalmers University of Technology, 2010, p. 59-62. Conference paper, Published paper (Other academic)
Abstract [en]

Contention for shared cache resources has been recognized as a major bottleneck for multicores, especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them.

We propose an automatic, low-overhead method to reduce cache contention by finding instructions that are prone to cache thrashing and a method to automatically disable caching for such instructions. Practical experiments demonstrate that our software-only method can improve application performance by up to 35% on x86 multicore hardware.
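The selection rule behind such a technique can be sketched as follows (hypothetical names and thresholds; the paper derives the classification automatically from profiling): caching is disabled for instructions whose data is evicted before it is ever reused.

```python
def find_thrashing_instructions(profile, cache_lines):
    """profile maps an instruction address to the typical forward reuse
    distance of its accesses (None if the data is never touched again).
    Data whose reuse distance exceeds the cache size is evicted before
    it can be reused, so caching it only pollutes the cache."""
    return sorted(pc for pc, dist in profile.items()
                  if dist is None or dist > cache_lines)

profile = {0x400100: 12,      # reused while still cached: keep caching
           0x400140: 50_000,  # evicted long before reuse: bypass
           0x400180: None}    # streaming store, never reused: bypass
bypass = find_thrashing_instructions(profile, cache_lines=4096)
```

On x86, the flagged instructions would then be rewritten to use the processor's cache-management facilities (e.g., non-temporal moves) so their data flows around the shared cache.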

Place, publisher, year, edition, pages
Göteborg, Sweden: Chalmers University of Technology, 2010
National Category
Computer Sciences, Computer Engineering
Identifiers
urn:nbn:se:uu:diva-134388 (URN)
Conference
3rd Swedish Workshop on Multi-Core Computing (MCC 2010)
Projects
CoDeR-MP, UPMARC
Funder
Swedish Research Council
Available from: 2010-12-10 Created: 2010-11-25 Last updated: 2018-01-12. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-9349-5791