Publications from Uppsala University (uu.se)
1 - 16 of 16
  • 1.
    Alipour, Mehdi
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit (2019). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 91, no. 3-4, p. 379-397. Article in journal (Refereed)
    Abstract [en]

    Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction execution; in other words, in-order instruction commit. While in-order commit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in program order. In contrast, out-of-order commit can release some resources much earlier, yielding improved performance and/or lower resource requirements. Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti (2004). In this paper we revisit out-of-order commit by examining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. Through this analysis of the potential of out-of-order commit, we learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) the focus on implementing only a subset of the out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increase with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy.

    Download full text (PDF)
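
    The early-release idea summarized in the abstract above can be pictured with a small toy model: scan the reorder buffer and mark entries whose (simplified) commit conditions hold. The sketch below is only an illustration of that idea in Python, with invented field names and a reduced subset of the Bell and Lipasti conditions; it is not the mechanism evaluated in the paper.

```python
# Toy model of out-of-order commit: scan a reorder buffer (ROB) and mark
# entries that could release their resources early under a simplified
# subset of Bell & Lipasti-style conditions. Purely illustrative.

from dataclasses import dataclass
from typing import List

@dataclass
class RobEntry:
    completed: bool          # has finished execution
    is_branch: bool = False  # unresolved branches block younger commits
    resolved: bool = True    # branch outcome known (ignored for non-branches)
    may_fault: bool = False  # could still raise an exception
    is_store: bool = False   # stores constrain memory ordering

def early_committable(rob: List[RobEntry]) -> List[int]:
    """Indices (0 = ROB head) that could commit out of order."""
    releasable = []
    older_unresolved_branch = False
    older_may_fault = False
    older_incomplete_store = False
    for i, e in enumerate(rob):
        ok = (e.completed
              and not e.may_fault
              and not older_unresolved_branch   # control speculation resolved
              and not older_may_fault           # precise exceptions preserved
              and not older_incomplete_store)   # simplified memory ordering
        if ok:
            releasable.append(i)
        # update the state that younger entries must respect
        older_unresolved_branch |= e.is_branch and not e.resolved
        older_may_fault |= (not e.completed) or e.may_fault
        older_incomplete_store |= e.is_store and not e.completed
    return releasable

if __name__ == "__main__":
    rob = [RobEntry(completed=True),
           RobEntry(completed=False, is_branch=True, resolved=False),
           RobEntry(completed=True)]   # blocked by the unresolved branch
    print(early_committable(rob))      # -> [0]
```
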
  • 2.
    Alipour, Mehdi
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    A Taxonomy of Out-of-Order Instruction Commit (2017). In: 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Los Alamitos: IEEE Computer Society, 2017, p. 135-136. Conference paper (Refereed)
    Abstract [en]

    While in-order instruction commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance without the need for additional hardware resources. In this paper, we revisit out-of-order commit from a different perspective, not by proposing another hardware technique, but by introducing a taxonomy and evaluating three different micro-architectures that have this technique enabled. We show how smaller processors can benefit from simple out-of-order commit strategies, but that larger, aggressive cores require more aggressive strategies to improve performance.

  • 3.
    Alipour, Mehdi
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Exploring the performance limits of out-of-order commit (2017). In: Proc. 14th Computing Frontiers Conference, New York: ACM Press, 2017, p. 211-220. Conference paper (Refereed)
    Abstract [en]

    Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is limited by the requirement of visibly sequential, atomic instruction execution; in other words, in-order instruction commit. While in-order commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance with fewer traditional hardware resources. However, out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti. In this paper we revisit out-of-order commit from a different perspective, not by proposing another hardware technique, but by examining these conditions one by one and in combination with respect to their potential performance benefit for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. We learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the commit depth, or the search distance for out-of-order commit, for a balanced design: smaller cores can benefit from shorter depths while larger cores continue to benefit from aggressive parameters; c) the focus on a subset of out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increase with higher memory latency, and out-of-order commit works well in conjunction with prefetching to continue to improve performance.

    Download attachment (PDF)
  • 4.
    Carlson, Trevor E.
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Heirman, Wim
    Intel, ExaSci Lab, Santa Clara, CA, USA.
    Allam, Osman
    Univ Ghent, B-9000 Ghent, Belgium.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Eeckhout, Lieven
    Univ Ghent, B-9000 Ghent, Belgium.
    The Load Slice Core Microarchitecture (2015). In: 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015, p. 272-284. Conference paper (Refereed)
    Abstract [en]

    Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7x, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
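
    The backward-slice extraction described above can be illustrated, at a very high level, by iteratively marking the instructions that produce the address operands of loads and stores. The Python sketch below operates on a small, invented instruction representation and is only a software-level illustration of the slicing idea, not the hardware table-based algorithm from the paper.

```python
# Illustrative backward-slice extraction: mark the address-generating
# instructions that feed loads and stores. A simplified software sketch
# of the idea, not the hardware mechanism from the paper.

from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Insn:
    op: str                           # e.g. "add", "load", "store"
    dest: Optional[str] = None        # register written, if any
    srcs: List[str] = field(default_factory=list)
    addr_srcs: List[str] = field(default_factory=list)  # regs used for the address

def address_slice(prog: List[Insn]) -> Set[int]:
    """Indices of instructions in the backward slices of memory addresses."""
    wanted: Set[str] = set()          # registers whose producers we still need
    in_slice: Set[int] = set()
    changed = True
    while changed:                    # iterate until a fixed point is reached
        changed = False
        for i in reversed(range(len(prog))):
            insn = prog[i]
            if insn.op in ("load", "store"):
                wanted |= set(insn.addr_srcs)
            if insn.dest in wanted and i not in in_slice:
                in_slice.add(i)       # this instruction helps form an address
                wanted |= set(insn.srcs)
                changed = True
    return in_slice

if __name__ == "__main__":
    prog = [Insn("add", dest="r1", srcs=["r2", "r3"]),     # address producer
            Insn("mul", dest="r5", srcs=["r6", "r7"]),     # not address-related
            Insn("load", dest="r4", addr_srcs=["r1"])]
    print(sorted(address_slice(prog)))                     # -> [0]
```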

  • 5.
    Carlson, Trevor E.
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Tran, Kim-Anh
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Koukos, Konstantinos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Transcending hardware limits with software out-of-order processing (2017). In: IEEE Computer Architecture Letters, ISSN 1556-6056, Vol. 16, no. 2, p. 162-165. Article in journal (Refereed)
  • 6.
    Ceballos, Germán
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Analyzing Graphics Workloads on Tile-based GPUs (2017). In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 108-109. Conference paper (Refereed)
  • 7.
    Ceballos, Germán
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs (2018). In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018, IEEE Computer Society, 2018, p. 1-11. Conference paper (Refereed)
    Download full text (PDF)
  • 8.
    Krzywda, Jakub
    et al.
    Umea Univ, Dept Comp Sci, SE-90187 Umea, Sweden.
    Ali-Eldin, Ahmed
    Umea Univ, Dept Comp Sci, SE-90187 Umea, Sweden.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. Natl Univ Singapore, Sch Comp, 13 Comp Dr, Singapore 117417, Singapore.
    Östberg, Per-Olov
    Umea Univ, Dept Comp Sci, SE-90187 Umea, Sweden.
    Elmroth, Erik
    Umea Univ, Dept Comp Sci, SE-90187 Umea, Sweden.
    Power-performance tradeoffs in data center servers: DVFS, CPU pinning, horizontal, and vertical scaling (2018). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 81, p. 114-128. Article in journal (Refereed)
    Abstract [en]

    Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, horizontal, and vertical scaling are four techniques that have been proposed as actuators to control the performance and energy consumption on data center servers. This work investigates the utility of these four actuators, and quantifies the power-performance tradeoffs associated with them. Using replicas of the German Wikipedia running on our local testbed, we perform a set of experiments to quantify the influence of DVFS, vertical and horizontal scaling, and CPU pinning on end-to-end response time (average and tail), throughput, and power consumption with different workloads. Results of the experiments show that DVFS rarely reduces the power consumption of underloaded servers by more than 5%, but it can be used to limit the maximal power consumption of a saturated server by up to 20% (at a cost of performance degradation). CPU pinning reduces the power consumption of underloaded servers (by up to 7%) at the cost of performance degradation, which can be limited by choosing an appropriate CPU pinning scheme. Horizontal and vertical scaling improve both the average and tail response time, but the improvement is not proportional to the amount of resources added. The load balancing strategy has a large impact on the tail response time of horizontally scaled applications.
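
    As a rough illustration of using DVFS as a power/performance actuator, the sketch below picks the lowest frequency setting whose measured response time still meets a latency target and reports the power saved relative to the fastest setting. The sample measurements are invented for the example; only the selection logic is of interest.

```python
# Hedged illustration of DVFS as a power/performance actuator: choose the
# lowest frequency setting that still meets a response-time target.
# The sample measurements below are invented for the example.

# (frequency in GHz, measured average response time in ms, power in W)
measurements = [
    (1.2, 240.0, 62.0),
    (1.8, 150.0, 71.0),
    (2.4, 110.0, 83.0),
    (3.0,  95.0, 98.0),
]

def pick_frequency(samples, latency_target_ms):
    """Lowest frequency whose response time meets the target, else the fastest."""
    feasible = [s for s in samples if s[1] <= latency_target_ms]
    return min(feasible) if feasible else max(samples)

if __name__ == "__main__":
    freq, latency, power = pick_frequency(measurements, latency_target_ms=160.0)
    max_power = max(p for _, _, p in measurements)
    print(f"run at {freq} GHz: {latency} ms, {power} W "
          f"({100 * (1 - power / max_power):.0f}% below peak power)")
```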

  • 9.
    Nikoleris, Nikos
    et al.
    Arm Research, Cambridge UK.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Department of Computer Science, National University of Singapore.
    Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation (2018). Report (Other academic)
    Abstract [en]

    Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

    Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily-sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected which needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goals for improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

    Delorean uses a multi-pass approach to understand the memory regions accessed by the next, upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software through the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).

    Download full text (PDF)
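
    The multi-pass scanning pattern described above can be sketched conceptually: for each address touched by the upcoming detailed region, look for its most recent earlier access by examining progressively larger history windows. The Python sketch below works on an explicit trace purely for illustration; Delorean itself uses hardware virtualization rather than stored traces, and the function and variable names here are invented.

```python
# Conceptual sketch of a multi-pass, progressively widening backward scan:
# for each address touched by the upcoming detailed region, find its most
# recent earlier access by examining exponentially larger history windows.
# This only illustrates the scanning pattern; it is not the Delorean
# implementation.

from typing import Dict, List, Set

def last_prior_access(history: List[int], keys: Set[int]) -> Dict[int, int]:
    """Map each key address to the index of its most recent access in history."""
    found: Dict[int, int] = {}
    window = min(1024, len(history))    # start with a small, recent window
    while len(found) < len(keys) and window <= len(history):
        start = len(history) - window   # widen the scanned region each pass
        for idx in range(len(history) - 1, start - 1, -1):
            addr = history[idx]
            if addr in keys and addr not in found:
                found[addr] = idx
        if window == len(history):
            break
        window = min(window * 2, len(history))
    return found

if __name__ == "__main__":
    history = [i % 4096 for i in range(100_000)]   # synthetic access stream
    keys = {7, 42, 4095}
    hits = last_prior_access(history, keys)
    print({k: hits.get(k) for k in sorted(keys)})
```
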
  • 10.
    Nikoleris, Nikos
    et al.
    ARM Res, Cambridge, England.
    Sandberg, Andreas
    ARM Res, Cambridge, England.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling (2016). In: Proceedings of the 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) / [ed] Najjar, W.; Gerstlauer, A., IEEE, 2016, p. 106-115. Conference paper (Refereed)
    Abstract [en]

    Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-accurate simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more through the reduction of large workloads into a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored into checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture or fundamental pieces of the microarchitecture (e.g., hardware-software co-design) require checkpoint regeneration. The end result for co-design degenerates to creating checkpoints for each modification, a task checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation, without the need for checkpoints. Virtualized fast-forwarding (VFF), an alternative to using checkpoints, allows for execution at near-native speed between simulation points. Warming the micro-architectural state prior to each simulation point, however, requires functional simulation, a costly operation for large caches (e.g., 8 MB). Simulating future systems with caches of many MBs can require warming of billions of instructions, dominating simulation time. This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim uses VFF to advance between simulation points while at the same time collecting sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional simulation. At the simulation point, detailed simulation with a statistical cache model is used to evaluate the design. The previously acquired MRI is used to estimate whether each memory request hits in the cache. The MRI is an architecturally independent metric and a single profile can be used in simulations of any size cache. We describe a prototype implementation of CoolSim based on KVM and gem5 running 19x faster than the state-of-the-art sampled simulation, while it estimates the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.
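
    The role of the statistical cache model can be illustrated with the classic stack-distance view of reuse: under fully associative LRU, an access hits in a cache of N lines if fewer than N distinct lines were touched since the previous access to the same line. The sketch below uses this simplified model as a stand-in; it is not CoolSim's MRI-based estimator, and the trace and parameters are invented.

```python
# Simplified illustration of estimating cache hits from reuse information:
# with fully associative LRU, an access hits in a cache of `lines` entries
# if fewer than `lines` distinct cache lines were touched since the previous
# access to the same line (its stack distance). A classic stand-in for a
# statistical cache model; not CoolSim itself.

from typing import List, Optional

LINE = 64  # bytes per cache line

def stack_distances(addresses: List[int]) -> List[Optional[int]]:
    """Per access: number of distinct lines since the last use (None = cold)."""
    dists: List[Optional[int]] = []
    history: List[int] = []            # lines ordered from most recently used
    for addr in addresses:
        line = addr // LINE
        if line in history:
            pos = history.index(line)  # distinct lines touched in between
            dists.append(pos)
            history.pop(pos)
        else:
            dists.append(None)
        history.insert(0, line)
    return dists

def hit_ratio(addresses: List[int], cache_bytes: int) -> float:
    lines = cache_bytes // LINE
    dists = stack_distances(addresses)
    hits = sum(1 for d in dists if d is not None and d < lines)
    return hits / len(dists)

if __name__ == "__main__":
    trace = [64 * (i % 100) for i in range(10_000)]   # 100-line working set
    print(hit_ratio(trace, cache_bytes=4 * 1024))     # 64 lines: all misses
    print(hit_ratio(trace, cache_bytes=16 * 1024))    # 256 lines: mostly hits
```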

  • 11.
    Ros, Alberto
    et al.
    Univ Murcia, Dept Comp Engn, Murcia, Spain.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Alipour, Mehdi
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Non-speculative load-load reordering in TSO (2017). In: Proc. 44th International Symposium on Computer Architecture, New York: ACM Press, 2017, p. 187-200. Conference paper (Refereed)
    Abstract [en]

    In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show, for the first time, that it is not necessary to squash and re-execute speculatively reordered loads in TSO when their reordering is seen. Instead, the reordering can be hidden from other cores by the coherence protocol. The implication is that we can irrevocably bind speculative loads. This allows us to commit reordered loads out-of-order without having to wait (for the loads to become non-speculative) or to checkpoint committed state (and roll back if needed), just to ensure correctness in the rare case of some core seeing the reordering. We show that by exposing a reordering to the coherence layer and by appropriately modifying a typical directory protocol we can successfully hide load-load reordering without perceptible performance cost and without deadlock. Our solution is cost-effective and increases the performance of out-of-order commit by a sizable margin, compared to the base case where memory operations are not allowed to commit if the consistency model could be violated.
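
    For context, the conventional safeguard that this work removes the need for can be sketched as a check over the load queue: if an invalidation from another core matches a younger load that already obtained its value while an older load is still pending, the baseline squashes the younger load because the reordering has become visible. The Python sketch below illustrates only that baseline check, with invented field names; it is not the modified directory protocol proposed in the paper.

```python
# Sketch of the conventional TSO safeguard: if another core's invalidation
# hits the address of a younger load that already completed while an older
# load is still pending, the younger load's value may expose an illegal
# load->load reordering, so the baseline squashes it. Illustrative only;
# the field names are invented for the example.

from dataclasses import dataclass
from typing import List

@dataclass
class LoadEntry:
    addr: int
    completed: bool      # value already returned to the pipeline
    committed: bool      # retired; no longer speculative

def loads_to_squash(load_queue: List[LoadEntry], invalidated_addr: int) -> List[int]:
    """Baseline policy: indices of speculatively reordered loads to squash."""
    squash = []
    older_pending = False
    for i, ld in enumerate(load_queue):          # index 0 is the oldest load
        if ld.completed and not ld.committed and older_pending \
                and ld.addr == invalidated_addr:
            squash.append(i)                     # reordering became visible
        older_pending |= not ld.completed
    return squash

if __name__ == "__main__":
    lq = [LoadEntry(addr=0x100, completed=False, committed=False),  # older, pending
          LoadEntry(addr=0x200, completed=True,  committed=False)]  # younger, done
    print(loads_to_squash(lq, invalidated_addr=0x200))              # -> [1]
```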

  • 12.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    A graphics tracing framework for exploring CPU+GPU memory systems (2017). In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 54-65. Conference paper (Refereed)
  • 13.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    POSTER: Putting the G back into GPU/CPU Systems Research (2017). In: 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017, p. 130-131. Conference paper (Refereed)
    Abstract [en]

    Modern SoCs contain several CPU cores and many GPU cores to execute both general purpose and highly-parallel graphics workloads. In many SoCs, more area is dedicated to graphics than to general purpose compute. Despite this, the micro-architecture research community primarily focuses on GPGPU and CPU-only research, and not on graphics (the primary workload for many SoCs). The main reason for this is the lack of efficient tools and simulators for modern graphics applications. This work focuses on the GPU's memory traffic generated by graphics. We describe a new graphics tracing framework and use it to both study graphics applications' memory behavior as well as how CPUs and GPUs affect system performance. Our results show that graphics applications exhibit a wide range of memory behavior between applications and across time, and that they slow down co-running SPEC applications by 59% on average.

  • 14.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Perais, Arthur
    INRIA.
    Seznec, André
    INRIA.
    Michaud, Pierre
    INRIA.
    Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors (2015). In: Proc. 48th International Symposium on Microarchitecture, 2015, p. 334-346. Conference paper (Refereed)
    Abstract [en]

    Modern processors employ large structures (IQ, LSQ, register file, etc.) to expose instruction-level parallelism (ILP) and memory-level parallelism (MLP). These resources are typically allocated to instructions in program order. This wastes resources by allocating them to instructions that are not yet ready to be executed and by eagerly allocating them to instructions that are not part of the application's critical path.

    This work explores the possibility of allocating pipeline resources only when needed to expose MLP, and thereby enabling a processor design with significantly smaller structures, without sacrificing performance. First we identify the classes of instructions that should not reserve resources in program order and evaluate the potential performance gains we could achieve by delaying their allocations. We then use this information to “park” such instructions in a simpler, and therefore more efficient, Long Term Parking (LTP) structure. The LTP stores instructions until they are ready to execute, without allocating pipeline resources, and thereby keeps the pipeline available for instructions that can generate further MLP.

    LTP can accurately and rapidly identify which instructions to park, park them before they execute, wake them when needed to preserve performance, and do so using a simple queue instead of a complex IQ. We show that even a very simple queue-based LTP design allows us to significantly reduce IQ (64→32) and register file (128→96) sizes while retaining MLP performance and improving energy efficiency.
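
    The parking idea can be illustrated with a toy dispatch policy: instructions that are not needed to expose MLP and whose operands are not yet ready go into a simple FIFO instead of reserving issue-queue entries, and they move to the issue queue once their sources become ready. The sketch below is a deliberately simplified policy with invented structures; the paper's design uses prediction and more careful criticality classification.

```python
# Simplified illustration of criticality-aware allocation: at dispatch,
# instructions that are not needed to expose MLP and whose operands are not
# ready are "parked" in a FIFO instead of occupying issue-queue slots; they
# move to the issue queue once their sources are ready. A toy policy, not
# the predictor-driven mechanism described in the paper.

from collections import deque
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Insn:
    name: str
    srcs: List[str] = field(default_factory=list)
    mlp_critical: bool = False   # e.g. loads and address-generating instructions

def dispatch(insns, ready_regs: Set[str], iq: List[Insn], ltp: deque):
    for insn in insns:
        ready = all(s in ready_regs for s in insn.srcs)
        if insn.mlp_critical or ready:
            iq.append(insn)      # keep the pipeline busy generating MLP
        else:
            ltp.append(insn)     # park: no IQ entry reserved yet

def wake(ready_regs: Set[str], iq: List[Insn], ltp: deque):
    """Move parked instructions whose operands became ready into the IQ."""
    for _ in range(len(ltp)):
        insn = ltp.popleft()
        if all(s in ready_regs for s in insn.srcs):
            iq.append(insn)
        else:
            ltp.append(insn)     # still waiting; stay parked (FIFO order)

if __name__ == "__main__":
    iq: List[Insn] = []
    ltp: deque = deque()
    prog = [Insn("load r1", mlp_critical=True),
            Insn("add r2, r1, r3", srcs=["r1", "r3"])]
    dispatch(prog, ready_regs={"r3"}, iq=iq, ltp=ltp)
    print([i.name for i in iq], [i.name for i in ltp])  # load issued, add parked
    wake(ready_regs={"r1", "r3"}, iq=iq, ltp=ltp)       # load produced r1
    print([i.name for i in iq], [i.name for i in ltp])
```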

  • 15.
    Tran, Kim-Anh
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Koukos, Konstantinos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Clairvoyance: Look-ahead compile-time scheduling (2017). In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, p. 171-184. Conference paper (Refereed)
    Download full text (PDF)
  • 16.
    Van den Steen, Sam
    et al.
    Univ Ghent, Dept Elect & Informat Syst, Ghent, Belgium.
    Eyerman, Stijn
    Intel, Kontich, Belgium.
    De Pestel, Sander
    Univ Ghent, Dept Elect & Informat Syst, Ghent, Belgium.
    Mechri, Moncef
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Eeckhout, Lieven
    Univ Ghent, Dept Elect & Informat Syst, Ghent, Belgium.
    Analytical Processor Performance and Power Modeling Using Micro-Architecture Independent Characteristics (2016). In: IEEE Transactions on Computers, ISSN 0018-9340, E-ISSN 1557-9956, Vol. 65, no. 12, p. 3537-3551. Article in journal (Refereed)
    Abstract [en]

    Optimizing processors for (a) specific application(s) can substantially improve energy-efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific processors requires fast design space exploration tools to optimize for the targeted application(s). Analytical models can be a good fit for such design space exploration as they provide fast performance and power estimates and insight into the interaction between an application's characteristics and the micro-architecture of a processor. Unfortunately, prior analytical models for superscalar out-of-order processors require micro-architecture dependent inputs, such as cache miss rates, branch miss rates and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration of interest, which is far more time-consuming than evaluating the analytical performance models. In this work we present a micro-architecture independent profiler and associated analytical models that allow us to produce performance and power estimates across a large superscalar out-of-order processor design space almost instantaneously. We show that using a micro-architecture independent profile leads to a speedup of 300x compared to detailed simulation for our evaluated design space. Over a large design space, the model has a 9.3 percent average error for performance and a 4.3 percent average error for power, compared to detailed cycle-level simulation. The model is able to accurately determine the optimal processor configuration for different applications under power or performance constraints, and provides insight into performance through cycle stacks.
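
    The flavor of such analytical models can be conveyed by a first-order, interval-style estimate in which total cycles are a base dispatch term plus penalties for miss events. The sketch below is a much simplified stand-in for the paper's model, and the example inputs are invented; the paper derives the corresponding quantities from a micro-architecture independent profile rather than fixing them by hand.

```python
# First-order analytical performance estimate in the spirit of interval
# modeling: total cycles = base dispatch cycles + miss-event penalties.
# A much simplified stand-in for the paper's model; the example inputs
# below are invented.

def estimate_cycles(insns: int,
                    dispatch_width: int,
                    branch_mispredicts: int,
                    branch_penalty: int,
                    llc_misses: int,
                    mem_latency: int,
                    avg_mlp: float) -> float:
    base = insns / dispatch_width                      # ideal, no miss events
    branch_cost = branch_mispredicts * branch_penalty  # pipeline refill cycles
    memory_cost = (llc_misses / avg_mlp) * mem_latency # overlapped LLC misses
    return base + branch_cost + memory_cost

if __name__ == "__main__":
    cycles = estimate_cycles(insns=1_000_000, dispatch_width=4,
                             branch_mispredicts=2_000, branch_penalty=15,
                             llc_misses=5_000, mem_latency=200, avg_mlp=2.0)
    print(f"estimated CPI = {cycles / 1_000_000:.2f}")
```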
