uu.seUppsala University Publications
Change search
Refine search result
1 - 10 of 10
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Rows per page
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sort
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Disputation date (earliest first)
  • Disputation date (latest first)
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Disputation date (earliest first)
  • Disputation date (latest first)
Select
The maximal number of hits you can export is 250. When you want to export more records please use the Create feeds function.
  • 1.
    Eklöv, David
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Nikoleris, Nikkos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hägersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Bandwidth bandit: Quantitative characterization of memory contention2012In: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 2012, p. 457-458Conference paper (Refereed)
    Abstract [en]

    Applications that are co-scheduled on a multi-core compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires quantitative insight into how applications are impacted by such contention. In this paper we present a quantitative method to measure applications' sensitivities to different degrees of contention for off-chip memory bandwidth on real hardware. We then use the data captured with our profiling method to estimate the throughput of a set of co-running instances of a single threaded application.

  • 2.
    Eklöv, David
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Bandwidth Bandit: Quantitative Characterization of Memory Contention2013In: Proc. 11th International Symposium on Code Generation and Optimization: CGO 2013, IEEE Computer Society, 2013, p. 99-108Conference paper (Refereed)
    Abstract [en]

    On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing. While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end we introduce the Bandwidth Bandit, a general, quantitative, profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines. The profiling data captured by the Bandwidth Bandit is presented in a bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of single threaded application.

  • 3.
    Eklöv, David
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Cache Pirating: Measuring the curse of the shared cache2011Report (Other academic)
  • 4.
    Eklöv, David
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Cache Pirating: Measuring the Curse of the Shared Cache2011In: Proc. 40th International Conference on Parallel Processing, IEEE Computer Society, 2011, p. 165-175Conference paper (Refereed)
  • 5.
    Eklöv, David
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hägersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Quantitative Characterization of Memory Contention2012Report (Other academic)
    Abstract [en]

    On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing.

    While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end we introduce the Bandwidth Bandit, a general, quantitative, profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines.

    The profiling data captured by the Bandwidth Bandit is presented in a it bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of single threaded application.

  • 6.
    Eklöv, David
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    A software based profiling method for obtaining speedup stacks on commodity multi-cores2014In: 2014 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS): ISPASS 2014, IEEE Computer Society, 2014, p. 148-157Conference paper (Refereed)
    Abstract [en]

    A key goodness metric of multi-threaded programs is how their execution times scale when increasing the number of threads. However, there are several bottlenecks that can limit the scalability of a multi-threaded program, e.g., contention for shared cache capacity and off-chip memory bandwidth; and synchronization overheads. In order to improve the scalability of a multi-threaded program, it is vital to be able to quantify how the program is impacted by these scalability bottlenecks. We present a software profiling method for obtaining speedup stacks. A speedup stack reports how much each scalability bottleneck limits the scalability of a multi-threaded program. It thereby quantifies how much its scalability can be improved by eliminating a given bottleneck. A software developer can use this information to determine what optimizations are most likely to improve scalability, while a computer architect can use it to analyze the resource demands of emerging workloads. The proposed method profiles the program on real commodity multi-cores (i.e., no simulations required) using existing performance counters. Consequently, the obtained speedup stacks accurately account for all idiosyncrasies of the machine on which the program is profiled. While the main contribution of this paper is the profiling method to obtain speedup stacks, we present several examples of how speedup stacks can be used to analyze the resource requirements of multi-threaded programs. Furthermore, we discuss how their scalability can be improved by both software developers and computer architects.

  • 7.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Efficient Memory Modeling During Simulation and Native Execution2019Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Application performance on computer processors depends on a number of complex architectural and microarchitectural design decisions. Consequently, computer architects rely on performance modeling to improve future processors without building prototypes. This thesis focuses on performance modeling and proposes methods that quantify the impact of the memory system on application performance.

    Detailed architectural simulation, a common approach to performance modeling, can be five orders of magnitude slower than execution on the actual processor. At this rate, simulating realistic workloads requires years of CPU time. Prior research uses sampling to speed up simulation. Using sampled simulation, only a number of small but representative portions of the workload are evaluated in detail. To fully exploit the speed potential of sampled simulation, the simulation method has to efficiently reconstruct the architectural and microarchitectural state prior to the simulation samples. Practical approaches to sampled simulation use either functional simulation at the expense of performance or checkpoints at the expense of flexibility. This thesis proposes three approaches that use statistical cache modeling to efficiently address the problem of cache warm up and speed up sampled simulation, without compromising flexibility. The statistical cache model uses sparse memory reuse information obtained with native techniques to model the performance of the cache. The proposed sampled simulation framework evaluates workloads 150 times faster than approaches that use functional simulation to warm up the cache.

    Other approaches to performance modeling use analytical models based on data obtained from execution on native hardware. These native techniques allow for better understanding of the performance bottlenecks on existing hardware. Efficient resource utilization in modern multicore processors is necessary to exploit their peak performance. This thesis proposes native methods that characterize shared resource utilization in modern multicores. These methods quantify the impact of cache sharing and off-chip memory sharing on overall application performance. Additionally, they can quantify scalability bottlenecks for data-parallel, symmetric workloads.

    List of papers
    1. Extending statistical cache models to support detailed pipeline simulators
    Open this publication in new window or tab >>Extending statistical cache models to support detailed pipeline simulators
    2014 (English)In: 2014 IEEE International Symposium On Performance Analysis Of Systems And Software (Ispass), IEEE Computer Society, 2014, p. 86-95Conference paper, Published paper (Refereed)
    Abstract [en]

    Simulators are widely used in computer architecture research. While detailed cycle-accurate simulations provide useful insights, studies using modern workloads typically require days or weeks. Evaluating many design points, only exacerbates the simulation overhead. Recent works propose methods with good accuracy that reduce the simulated overhead either by sampling the execution (e.g., SMARTS and SimPoint) or by using fast analytical models of the simulated designs (e.g., Interval Simulation). While these techniques reduce significantly the simulation overhead, modeling processor components with large state, such as the last-level cache, requires costly simulation to warm them up. Statistical simulation methods, such as SMARTS, report that the warm-up overhead accounts for 99% of the simulation overhead, while only 1% of the time is spent simulating the target design. This paper proposes WarmSim, a method that eliminates the need to warm up the cache. WarmSim builds on top of a statistical cache modeling technique and extends it to model accurately not only the miss ratio but also the outcome of every cache request. WarmSim uses as input, an application's memory reuse information which is hardware independent. Therefore, different cache configurations can be simulated using the same input data. We demonstrate that this approach can be used to estimate the CPI of the SPEC CPU2006 benchmarks with an average error of 1.77%, reducing the overhead compared to a simulation with a 10M instruction warm-up by a factor of 50x.

    Place, publisher, year, edition, pages
    IEEE Computer Society, 2014
    Series
    IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-224221 (URN)10.1109/ISPASS.2014.6844464 (DOI)000364102000010 ()978-1-4799-3604-5 (ISBN)
    Conference
    ISPASS 2014, March 23-25, Monterey, CA
    Projects
    UPMARC
    Available from: 2014-05-06 Created: 2014-05-06 Last updated: 2018-12-14Bibliographically approved
    2. CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling
    Open this publication in new window or tab >>CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling
    2016 (English)In: Proceedings Of 2016 International Conference On Embedded Computer Systems: Architectures, Modeling And Simulation (Samos) / [ed] Najjar, W Gerstlauer, A, IEEE , 2016, p. 106-115Conference paper, Published paper (Refereed)
    Abstract [en]

    Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-accurate simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more through the reduction of large workloads into a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored into checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture or fundamental pieces of the microarchitecture (e.g., hardware-software co-design) require checkpoint regeneration. The end result for co-design degenerates to creating checkpoints for each modification, a task checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation, without the need for checkpoints. Virtualized fast-forwarding (VFF), an alternative to using checkpoints, allows for execution at near-native speed between simulation points. Warming the micro-architectural state prior to each simulation point, however, requires functional simulation, a costly operation for large caches (e.g., 8 M B). Simulating future systems with caches of many MBs can require warming of billions of instructions, dominating simulation time. This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim uses VFF to advance between simulation points collecting at the same time sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional simulation. At the simulation point, detailed simulation with a statistical cache model is used to evaluate the design. The previously acquired MRI is used to estimate whether each memory request hits in the cache. The MRI is an architecturally independent metric and a single profile can be used in simulations of any size cache. We describe a prototype implementation of CoolSim based on KVM and gem5 running 19 x faster than the state-of-the-art sampled simulation, while it estimates the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.

    Place, publisher, year, edition, pages
    IEEE, 2016
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-322061 (URN)000399143000015 ()9781509030767 (ISBN)
    Conference
    International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS), JUL 17-21, 2016, Samos, GREECE
    Funder
    Swedish Foundation for Strategic Research EU, FP7, Seventh Framework Programme, 610490
    Available from: 2017-05-16 Created: 2017-05-16 Last updated: 2018-12-14Bibliographically approved
    3. Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation
    Open this publication in new window or tab >>Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation
    2018 (English)Report (Other academic)
    Abstract [en]

    Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

    Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily-sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected which needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goals for improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

    Delorean uses a multi-pass approach to understand the memory regions accessed by the next, upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software though the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).

    Publisher
    p. 12
    Series
    Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203
    National Category
    Computer Systems
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-369320 (URN)
    Available from: 2018-12-12 Created: 2018-12-12 Last updated: 2019-01-08Bibliographically approved
    4. Cache Pirating: Measuring the Curse of the Shared Cache
    Open this publication in new window or tab >>Cache Pirating: Measuring the Curse of the Shared Cache
    2011 (English)In: Proc. 40th International Conference on Parallel Processing, IEEE Computer Society, 2011, p. 165-175Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    IEEE Computer Society, 2011
    National Category
    Computer Engineering
    Identifiers
    urn:nbn:se:uu:diva-181254 (URN)10.1109/ICPP.2011.15 (DOI)978-1-4577-1336-1 (ISBN)
    Conference
    ICPP 2011
    Projects
    UPMARCCoDeR-MP
    Available from: 2011-10-17 Created: 2012-09-20 Last updated: 2018-12-14Bibliographically approved
    5. Bandwidth Bandit: Quantitative Characterization of Memory Contention
    Open this publication in new window or tab >>Bandwidth Bandit: Quantitative Characterization of Memory Contention
    2013 (English)In: Proc. 11th International Symposium on Code Generation and Optimization: CGO 2013, IEEE Computer Society, 2013, p. 99-108Conference paper, Published paper (Refereed)
    Abstract [en]

    On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing. While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end we introduce the Bandwidth Bandit, a general, quantitative, profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines. The profiling data captured by the Bandwidth Bandit is presented in a bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of single threaded application.

    Place, publisher, year, edition, pages
    IEEE Computer Society, 2013
    Keywords
    bandwidth, memory, caches
    National Category
    Computer Sciences
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-194101 (URN)10.1109/CGO.2013.6494987 (DOI)000318700200010 ()978-1-4673-5524-7 (ISBN)
    Conference
    CGO 2013, 23-27 February, Shenzhen, China
    Projects
    UPMARC
    Funder
    Swedish Research Council
    Available from: 2013-04-18 Created: 2013-02-08 Last updated: 2018-12-14Bibliographically approved
    6. A software based profiling method for obtaining speedup stacks on commodity multi-cores
    Open this publication in new window or tab >>A software based profiling method for obtaining speedup stacks on commodity multi-cores
    2014 (English)In: 2014 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS): ISPASS 2014, IEEE Computer Society, 2014, p. 148-157Conference paper, Published paper (Refereed)
    Abstract [en]

    A key goodness metric of multi-threaded programs is how their execution times scale when increasing the number of threads. However, there are several bottlenecks that can limit the scalability of a multi-threaded program, e.g., contention for shared cache capacity and off-chip memory bandwidth; and synchronization overheads. In order to improve the scalability of a multi-threaded program, it is vital to be able to quantify how the program is impacted by these scalability bottlenecks. We present a software profiling method for obtaining speedup stacks. A speedup stack reports how much each scalability bottleneck limits the scalability of a multi-threaded program. It thereby quantifies how much its scalability can be improved by eliminating a given bottleneck. A software developer can use this information to determine what optimizations are most likely to improve scalability, while a computer architect can use it to analyze the resource demands of emerging workloads. The proposed method profiles the program on real commodity multi-cores (i.e., no simulations required) using existing performance counters. Consequently, the obtained speedup stacks accurately account for all idiosyncrasies of the machine on which the program is profiled. While the main contribution of this paper is the profiling method to obtain speedup stacks, we present several examples of how speedup stacks can be used to analyze the resource requirements of multi-threaded programs. Furthermore, we discuss how their scalability can be improved by both software developers and computer architects.

    Place, publisher, year, edition, pages
    IEEE Computer Society, 2014
    Series
    IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-224230 (URN)10.1109/ISPASS.2014.6844479 (DOI)000364102000025 ()978-1-4799-3604-5 (ISBN)
    Conference
    ISPASS 2014, March 23-25, Monterey, CA
    Projects
    UPMARC
    Available from: 2014-05-06 Created: 2014-05-06 Last updated: 2018-12-14Bibliographically approved
  • 8.
    Nikoleris, Nikos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Eklöv, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Extending statistical cache models to support detailed pipeline simulators2014In: 2014 IEEE International Symposium On Performance Analysis Of Systems And Software (Ispass), IEEE Computer Society, 2014, p. 86-95Conference paper (Refereed)
    Abstract [en]

    Simulators are widely used in computer architecture research. While detailed cycle-accurate simulations provide useful insights, studies using modern workloads typically require days or weeks. Evaluating many design points, only exacerbates the simulation overhead. Recent works propose methods with good accuracy that reduce the simulated overhead either by sampling the execution (e.g., SMARTS and SimPoint) or by using fast analytical models of the simulated designs (e.g., Interval Simulation). While these techniques reduce significantly the simulation overhead, modeling processor components with large state, such as the last-level cache, requires costly simulation to warm them up. Statistical simulation methods, such as SMARTS, report that the warm-up overhead accounts for 99% of the simulation overhead, while only 1% of the time is spent simulating the target design. This paper proposes WarmSim, a method that eliminates the need to warm up the cache. WarmSim builds on top of a statistical cache modeling technique and extends it to model accurately not only the miss ratio but also the outcome of every cache request. WarmSim uses as input, an application's memory reuse information which is hardware independent. Therefore, different cache configurations can be simulated using the same input data. We demonstrate that this approach can be used to estimate the CPI of the SPEC CPU2006 benchmarks with an average error of 1.77%, reducing the overhead compared to a simulation with a 10M instruction warm-up by a factor of 50x.

  • 9.
    Nikoleris, Nikos
    et al.
    Arm Research, Cambridge UK.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Department of Computer Science, National University of Singapore.
    Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation2018Report (Other academic)
    Abstract [en]

    Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

    Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily-sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected which needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goals for improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

    Delorean uses a multi-pass approach to understand the memory regions accessed by the next, upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software though the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).

  • 10.
    Nikoleris, Nikos
    et al.
    ARM Res, Cambridge, England..
    Sandberg, Andreas
    ARM Res, Cambridge, England..
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling2016In: Proceedings Of 2016 International Conference On Embedded Computer Systems: Architectures, Modeling And Simulation (Samos) / [ed] Najjar, W Gerstlauer, A, IEEE , 2016, p. 106-115Conference paper (Refereed)
    Abstract [en]

    Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-accurate simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more through the reduction of large workloads into a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored into checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture or fundamental pieces of the microarchitecture (e.g., hardware-software co-design) require checkpoint regeneration. The end result for co-design degenerates to creating checkpoints for each modification, a task checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation, without the need for checkpoints. Virtualized fast-forwarding (VFF), an alternative to using checkpoints, allows for execution at near-native speed between simulation points. Warming the micro-architectural state prior to each simulation point, however, requires functional simulation, a costly operation for large caches (e.g., 8 M B). Simulating future systems with caches of many MBs can require warming of billions of instructions, dominating simulation time. This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim uses VFF to advance between simulation points collecting at the same time sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional simulation. At the simulation point, detailed simulation with a statistical cache model is used to evaluate the design. The previously acquired MRI is used to estimate whether each memory request hits in the cache. The MRI is an architecturally independent metric and a single profile can be used in simulations of any size cache. We describe a prototype implementation of CoolSim based on KVM and gem5 running 19 x faster than the state-of-the-art sampled simulation, while it estimates the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.

1 - 10 of 10
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf