
Publications from Uppsala University
Hagersten, Erik
Publications (10 of 137)
Nikoleris, N., Eeckhout, L., Hagersten, E. & Carlson, T. E. (2019). Directed Statistical Warming through Time Traveling. In: MICRO'52: The 52nd Annual IEEE/ACM International Symposium On Microarchitecture. Paper presented at 52nd Annual IEEE/ACM International Symposium On Microarchitecture, Columbus, Ohio, USA, Oct 12-16, 2019 (pp. 1037-1049).
Directed Statistical Warming through Time Traveling
2019 (English). In: MICRO'52: The 52nd Annual IEEE/ACM International Symposium On Microarchitecture, 2019, p. 1037-1049. Conference paper, Published paper (Refereed)
Abstract [en]

Improving the speed of computer architecture evaluation is of paramount importance to shorten the time-to-market when developing new platforms. Sampling is a widely used methodology to speed up workload analysis and performance evaluation by extrapolating from a set of representative detailed regions. Installing an accurate cache state for each detailed region is critical to achieving high accuracy. Prior work requires either huge amounts of storage (checkpoint-based warming), an excessive number of memory accesses to warm up the cache (functional warming), or the collection of a large number of reuse distances (randomized statistical warming) to accurately predict cache warm-up effects. This work proposes DeLorean, a novel statistical warming and sampling methodology that builds upon two key contributions: directed statistical warming and time traveling. Instead of collecting a large number of randomly selected reuse distances as in randomized statistical warming, directed statistical warming collects a select number of key reuse distances, i.e., the most recent reuse distance for each unique memory location referenced in the detailed region. Time traveling leverages virtualized fast-forwarding to quickly 'look into the future' - to determine the key cachelines - and then 'go back in time' - to collect the reuse distances for those key cachelines at near-native hardware speed through virtualized directed profiling. Directed statistical warming reduces the number of warm-up references by 30x compared to randomized statistical warming. Time traveling translates this reduction into a 5.7x simulation speedup. In addition to improving simulation speed, DeLorean reduces the prediction error from around 9% to around 3% on average. We further demonstrate how to amortize warm-up cost across multiple parallel simulations in design space exploration studies. Implementing DeLorean in gem5 enables detailed cycle-accurate simulation at a speed of 126 MIPS.
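To make the directed-warming step concrete, the sketch below is an illustrative toy only (the function name and trace format are invented; it is not the authors' gem5/KVM implementation). For each cacheline touched by the upcoming detailed region, it walks backwards through the preceding reference stream and records how many unique cachelines separate the region start from that line's most recent prior access.

```python
# Illustrative sketch only -- not the DeLorean implementation.
# warmup_trace: cacheline addresses preceding the detailed region, oldest first.
# key_lines:    cachelines referenced inside the detailed region.
def directed_reuse_distances(warmup_trace, key_lines):
    distances = {}
    pending = {line: set() for line in key_lines}   # key lines still being resolved
    for addr in reversed(warmup_trace):             # walk back from the region start
        for line in list(pending):
            if addr == line:                        # most recent prior access found
                distances[line] = len(pending.pop(line))
            else:
                pending[line].add(addr)             # count unique intervening lines
    for line in pending:                            # never seen in the warm-up window
        distances[line] = None
    return distances
```

In the paper these per-line reuse distances feed a statistical cache model that installs warm cache state before the detailed region; the sketch only shows the collection step, not the modeling or the virtualized fast-forwarding that makes it fast.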

Keywords
performance analysis, architectural simulation, sampled simulation, statistical cache modeling, cache warming
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-408058 (URN), 10.1145/3352460.3358264 (DOI), 000519057400077 (ISI)
Conference
52nd Annual IEEE/ACM International Symposium On Microarchitecture, Columbus, Ohio, USA, Oct 12-16, 2019
Funder
Swedish Foundation for Strategic Research; Swedish Research Council; EU, European Research Council, 741097
Available from: 2020-04-03 Created: 2020-04-03 Last updated: 2020-04-03. Bibliographically approved
Nikoleris, N., Hagersten, E. & Carlson, T. E. (2018). Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation.
Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation
2018 (English). Report (Other academic)
Abstract [en]

Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily-sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected, which needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goal of improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

Delorean uses a multi-pass approach to understand the memory regions accessed by the upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software through the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).
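As a rough illustration of the multi-pass idea, the sketch below is hypothetical (the names are invented, and profile_window() stands in for the virtualized profiling step rather than any API from the report): it keeps widening the look-back window before the next detailed region until the most recent prior access of every key cacheline has been observed, or an upper bound is reached.

```python
# Illustrative sketch only -- not Delorean's implementation.
def multipass_lookback(profile_window, key_lines, max_window=1 << 30):
    """profile_window(length): assumed helper that replays the `length`
    instructions immediately preceding the detailed region at near-native
    speed and returns the set of cachelines they touch."""
    window = 1 << 20                            # initial look-back, in instructions
    unresolved = set(key_lines)
    while unresolved and window <= max_window:
        touched = profile_window(window)        # near-native profiling pass
        unresolved = set(key_lines) - touched   # key lines whose last access is still unknown
        if unresolved:
            window *= 2                         # widen the look-back and try again
    return window, unresolved                   # unresolved lines are modeled as cold
```

A real multi-pass scan would profile only the newly added, earlier part of the window in each pass; the sketch re-profiles the whole window for simplicity.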

Publisher
p. 12
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2018-014
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-369320 (URN)
Available from: 2018-12-12 Created: 2018-12-12 Last updated: 2024-05-29. Bibliographically approved
Ceballos, G., Hagersten, E. & Black-Schaffer, D. (2018). Tail-PASS: Resource-based Cache Management for Tiled Graphics Rendering Hardware. In: Proc. 16th International Conference on Parallel and Distributed Processing with Applications. Paper presented at ISPA 2018, December 11–13, Melbourne, Australia (pp. 55-63). IEEE
Tail-PASS: Resource-based Cache Management for Tiled Graphics Rendering Hardware
2018 (English). In: Proc. 16th International Conference on Parallel and Distributed Processing with Applications, IEEE, 2018, p. 55-63. Conference paper, Published paper (Refereed)
Abstract [en]

Modern graphics rendering is a very expensive process and can account for 60% of the battery consumption in current games. Much of the cost comes from the high memory bandwidth of rendering complex graphics. To render a frame, multiple smaller rendering passes called scenes are executed, with each one tiled for parallel execution. The data for each scene comes from hundreds of software resources (textures). We observe that each frame can consume thousands of megabytes of data, but that over 75% of the graphics memory accesses are to the top-10 resources, and that bypassing the remaining infrequently accessed (tail) resources reduces cache pollution. Bypassing the tail can save up to 35% of the main memory traffic over resource-oblivious replacement policies and cache management techniques. In this paper, we propose Tail-PASS, a cache management technique that detects the most accessed resources at runtime, learns if it is worth bypassing the least accessed ones, and then dynamically enables/disables bypassing to reduce cache pollution on a per-scene basis. Overall, we see an average reduction in bandwidth-per-frame of 22% (up to 46%) by bypassing all but the top-10 resources and an 11% (up to 44%) reduction if only the top-2 resources are cached.
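A software analogue of the per-scene head/tail decision might look like the sketch below. It is a hypothetical illustration only (the class and method names are invented, and the real mechanism is a hardware cache-management policy, not Python): accesses are counted per resource during a scene, and in the following scene only the top-N resources are allowed to allocate in the cache while the tail is bypassed.

```python
# Illustrative sketch only -- not the Tail-PASS hardware mechanism.
from collections import Counter

class TopNBypassFilter:
    def __init__(self, top_n=10):
        self.top_n = top_n
        self.counts = Counter()      # per-resource access counts for the current scene
        self.head = set()            # resources allowed to allocate cache lines

    def should_cache(self, resource_id):
        """Per-access decision: cache the head, bypass the tail."""
        self.counts[resource_id] += 1
        return not self.head or resource_id in self.head   # cache everything until we have a head set

    def end_of_scene(self):
        """At a scene boundary, promote the most accessed resources and restart counting."""
        self.head = {r for r, _ in self.counts.most_common(self.top_n)}
        self.counts.clear()
```

The paper additionally learns whether bypassing the tail is worthwhile at all and can disable bypassing per scene; the sketch omits that check.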

Place, publisher, year, edition, pages
IEEE, 2018
National Category
Computer Systems Computer Sciences
Identifiers
urn:nbn:se:uu:diva-363920 (URN), 10.1109/BDCloud.2018.00022 (DOI), 000467843200008 (ISI), 978-1-7281-1141-4 (ISBN)
Conference
ISPA 2018, December 11–13, Melbourne, Australia
Funder
EU, European Research Council, 715283
Available from: 2018-10-21 Created: 2018-10-21 Last updated: 2019-06-17. Bibliographically approved
Sembrant, A., Carlson, T. E., Hagersten, E. & Black-Schaffer, D. (2017). A graphics tracing framework for exploring CPU+GPU memory systems. In: Proc. 20th International Symposium on Workload Characterization. Paper presented at IISWC 2017, October 1–3, Seattle, WA (pp. 54-65). IEEE
A graphics tracing framework for exploring CPU+GPU memory systems
2017 (English). In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 54-65. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-357055 (URN), 10.1109/IISWC.2017.8167756 (DOI), 000428206700006 (ISI), 978-1-5386-1233-0 (ISBN)
Conference
IISWC 2017, October 1–3, Seattle, WA
Available from: 2017-12-07 Created: 2018-08-17 Last updated: 2018-09-24. Bibliographically approved
Sembrant, A., Hagersten, E. & Black-Schaffer, D. (2017). A split cache hierarchy for enabling data-oriented optimizations. In: Proc. 23rd International Symposium on High Performance Computer Architecture. Paper presented at HPCA 2017, February 4–8, Austin, TX (pp. 133-144). IEEE Computer Society
A split cache hierarchy for enabling data-oriented optimizations
2017 (English). In: Proc. 23rd International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2017, p. 133-144. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE Computer Society, 2017
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-306368 (URN), 10.1109/HPCA.2017.25 (DOI), 000403330300012 (ISI), 978-1-5090-4985-1 (ISBN)
Conference
HPCA 2017, February 4–8, Austin, TX
Projects
UPMARC
Available from: 2017-05-08 Created: 2016-10-27 Last updated: 2020-12-10. Bibliographically approved
Ceballos, G., Hugo, A., Hagersten, E. & Black-Schaffer, D. (2017). Exploring scheduling effects on task performance with TaskInsight. Supercomputing frontiers and innovations, 4(3), 91-98
Exploring scheduling effects on task performance with TaskInsight
2017 (English). In: Supercomputing frontiers and innovations, ISSN 2214-3270, E-ISSN 2313-8734, Vol. 4, no 3, p. 91-98. Article in journal (Refereed), Published
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-335528 (URN), 10.14529/jsfi170306 (DOI)
Projects
UPMARC
Funder
Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2017-12-06 Created: 2017-12-06 Last updated: 2018-11-16. Bibliographically approved
Sembrant, A., Carlson, T. E., Hagersten, E. & Black-Schaffer, D. (2017). POSTER: Putting the G back into GPU/CPU Systems Research. In: 2017 26TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT). Paper presented at 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), SEP 09-13, 2017, Portland, OR, USA (pp. 130-131).
POSTER: Putting the G back into GPU/CPU Systems Research
2017 (English). In: 2017 26TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT), 2017, p. 130-131. Conference paper, Published paper (Refereed)
Abstract [en]

Modern SoCs contain several CPU cores and many GPU cores to execute both general purpose and highly-parallel graphics workloads. In many SoCs, more area is dedicated to graphics than to general purpose compute. Despite this, the micro-architecture research community primarily focuses on GPGPU and CPU-only research, and not on graphics (the primary workload for many SoCs). The main reason for this is the lack of efficient tools and simulators for modern graphics applications. This work focuses on the GPU's memory traffic generated by graphics. We describe a new graphics tracing framework and use it both to study graphics applications' memory behavior and to examine how CPUs and GPUs affect system performance. Our results show that graphics applications exhibit a wide range of memory behavior between applications and across time, and slow down co-running SPEC applications by 59% on average.

Series
International Conference on Parallel Architectures and Compilation Techniques, ISSN 1089-795X
National Category
Computer Systems Computer Engineering
Identifiers
urn:nbn:se:uu:diva-347752 (URN), 10.1109/PACT.2017.60 (DOI), 000417411300011 (ISI), 978-1-5090-6764-0 (ISBN)
Conference
26th International Conference on Parallel Architectures and Compilation Techniques (PACT), SEP 09-13, 2017, Portland, OR, USA.
Available from: 2018-04-17 Created: 2018-04-17 Last updated: 2018-04-17. Bibliographically approved
Davari, M., Hagersten, E. & Kaxiras, S. (2017). Scope-Aware Classification: Taking the hierarchical private/shared data classification to the next level.
Scope-Aware Classification: Taking the hierarchical private/shared data classification to the next level
2017 (English). Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2017-008
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-320324 (URN)
Available from: 2017-04-27 Created: 2017-04-19 Last updated: 2024-05-29. Bibliographically approved
Davari, M., Hagersten, E. & Kaxiras, S. (2017). The best of both works: A hybrid data-race-free cache coherence scheme.
The best of both works: A hybrid data-race-free cache coherence scheme
2017 (English). Report (Other academic)
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-320320 (URN)
Available from: 2017-04-19 Created: 2017-04-19 Last updated: 2020-12-10. Bibliographically approved
Ceballos, G., Hagersten, E. & Black-Schaffer, D. (2017). Understanding the interplay between task scheduling, memory and performance. In: Proc. Companion 8th ACM International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. Paper presented at SPLASH 2017, October 22–27, Vancouver, Canada (pp. 21-23). New York: ACM Press
Understanding the interplay between task scheduling, memory and performance
2017 (English). In: Proc. Companion 8th ACM International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, New York: ACM Press, 2017, p. 21-23. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2017
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-335556 (URN), 10.1145/3135932.3135942 (DOI), 978-1-4503-5514-8 (ISBN)
Conference
SPLASH 2017, October 22–27, Vancouver, Canada
Projects
UPMARC
Funder
Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2017-10-22 Created: 2017-12-06 Last updated: 2018-11-16. Bibliographically approved