Logo: to the web site of Uppsala University

uu.sePublications from Uppsala University
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Rethinking Dynamic Instruction Scheduling and Retirement for Efficient Microarchitectures
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. (UART)ORCID iD: 0000-0001-9842-8715
2020 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Out-of-order execution is one of the main micro-architectural techniques used to improve the performance of both single- and multi-threaded processors. The application of such a processor varies from mobile devices to server computers. This technique achieves higher performance by finding independent instructions and hiding execution latency and uses the cycles which otherwise would be wasted or caused a CPU stall. To accomplish this, it uses scheduling resources including the ROB, IQ, LSQ and physical registers, to store and prioritize instructions.

The pipeline of an out-of-order processor has three macro-stages: the front-end, the scheduler, and the back-end. The front-end fetches instructions, places them in the out-of-order resources, and analyzes them to prepare for their execution. The scheduler identifies which instructions are ready for execution and prioritizes them for scheduling. The back-end updates the processor state with the results of the oldest completed instructions, deallocates the resources and commits the instructions in the program order to maintain correct execution.

Since out-of-order execution needs to be able to choose any available instructions for execution, its scheduling resources must have complex circuits for identifying and prioritizing instructions, which makes them very expansive, therefore, limited. Due to their cost, the scheduling resources are constrained in size. This limited size leads to two stall points respectively at the front-end and the back-end of the pipeline. The front-end can stall due to fully allocated resources and therefore no more new instructions can be placed in the scheduler. The back-end can stall due to the unfinished execution of an instruction at the head of the ROB which prevents other resources from being deallocated, preventing new instructions from being inserted into the pipeline.

To address these two stalls, this thesis focuses on reducing the time instructions occupy the scheduling resources. Our front-end technique tackles IQ pressure while our back-end approach considers the rest of the resources. To reduce front-end stalls we reduce the pressure on the IQ for both storing (depth) and issuing (width) instructions by bypassing them to cheaper storage structures. To reduce back-end stalls, we explore how we can retire instructions earlier, and out-of-order, to reduce the pressure on the out-of-order resource.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2020. , p. 76
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1902
Keywords [en]
Out-of-Order Processors, Energy-Efficient, High-Performance, Instruction Scheduling
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-403675ISBN: 978-91-513-0868-5 (print)OAI: oai:DiVA.org:uu-403675DiVA, id: diva2:1390628
Public defence
2020-10-02, ITC 2446, ITC, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Opponent
Supervisors
Available from: 2020-02-27 Created: 2020-02-02 Last updated: 2020-09-21Bibliographically approved
List of papers
1. A Taxonomy of Out-of-Order Instruction Commit
Open this publication in new window or tab >>A Taxonomy of Out-of-Order Instruction Commit
2017 (English)In: 2017 Ieee International Symposium On Performance Analysis Of Systems And Software (Ispass), Los Alamitos: IEEE Computer Society, 2017, p. 135-136Conference paper, Published paper (Refereed)
Abstract [en]

While in-order instruction commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance without the need for additional hardware resources. In this paper, we revisit out-of-order commit from a different perspective, not by proposing another hardware technique, but by introducing a taxonomy and evaluating three different micro-architectures that have this technique enabled. We show how smaller processors can benefit from simple out-oforder commit strategies, but that larger, aggressive cores require more aggressive strategies to improve performance.

Place, publisher, year, edition, pages
Los Alamitos: IEEE Computer Society, 2017
National Category
Computer Systems Computer Sciences
Identifiers
urn:nbn:se:uu:diva-352938 (URN)10.1109/ISPASS.2017.7975283 (DOI)000426905600020 ()978-1-5386-3890-3 (ISBN)978-1-5386-3891-0 (ISBN)978-1-5386-3889-7 (ISBN)
Conference
2017 Ieee International Symposium On Performance Analysis Of Systems And Software (Ispass), Santa Rosa, CA, USA.
Available from: 2018-06-12 Created: 2018-06-12 Last updated: 2020-02-02Bibliographically approved
2. Exploring the performance limits of out-of-order commit
Open this publication in new window or tab >>Exploring the performance limits of out-of-order commit
2017 (English)In: Proc. 14th Computing Frontiers Conference, New York: ACM Press, 2017, p. 211-220Conference paper, Published paper (Refereed)
Abstract [en]

Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is limited by the requirement of visibly sequential, atomic instruction execution --- in other words in-order instruction commit. While in-order commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance with fewer traditional hardware resources. However, out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti. In this paper we revisit out-of-order commit from a different perspective, not by proposing another hardware technique, but by examining these conditions one by one and in combination with respect to their potential performance benefit for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. We learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the commit depth, or the search distance for out-of-order commit, for a balanced design: smaller cores can benefit from shorter depths while larger cores continue to benefit from aggressive parameters; c) the focus on a subset of out-of-order commit conditions could lead to efficient implementations; d) the benefits for out-of-order commit increase with higher memory latency and works well in conjunction with prefetching to continue to improve performance.

Place, publisher, year, edition, pages
New York: ACM Press, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-334601 (URN)10.1145/3075564.3075581 (DOI)000626242600024 ()978-1-4503-4487-6 (ISBN)
Conference
CF 2017, May 15–17, Siena, Italy
Projects
UPMARC
Available from: 2017-05-15 Created: 2017-11-24 Last updated: 2024-01-23Bibliographically approved
3. Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit
Open this publication in new window or tab >>Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit
2019 (English)In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 91, no 3-4, p. 379-397Article in journal (Refereed) Published
Abstract [en]

Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction executionin other words, in-order instruction commit. While in-order commit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in program order. In contrast, out-of-order commit can release some resources much earlier, yielding improved performance and/or lower resource requirements. Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti (2004). In this paper we revisit out-of-order commit by examining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. Through this analysis of the potential of out-of-order commit, we learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) the focus on implementing only a subset of the out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increases with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy.

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-365899 (URN)10.1007/s11265-018-1369-4 (DOI)000459428200012 ()
Available from: 2018-04-26 Created: 2018-11-14 Last updated: 2020-02-02Bibliographically approved
4. FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
Open this publication in new window or tab >>FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
2019 (English)In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2019, p. 716-721Conference paper, Published paper (Refereed)
Abstract [en]

The number of instructions a processor's instruction queue can examine (depth) and the number it can issue together (width) determine its ability to take advantage of the ILP in an application. Unfortunately, increasing either the width or depth of the instruction queue is very costly due to the content-addressable logic needed to wakeup and select instructions out-of-order. This work makes the observation that a large number of instructions have both operands ready at dispatch, and therefore do not benefit from out-of-order scheduling. We leverage this to place such ready-at-dispatch instructions in separate, simpler, in-order FIFO queues for scheduling. With such additional queues, we can reduce the size and width of the expensive out-of-order instruction queue, without reducing the processor's overall issue width and depth. Our design, FIFOrder, is able to steer more than 60% of instructions to the cheaper FIFO queues, providing a 50% energy savings over a traditional out-of-order instruction queue design, while delivering 8% higher performance.

Place, publisher, year, edition, pages
IEEE, 2019
Series
Design Automation and Test in Europe Conference and Exhibition, ISSN 1530-1591
National Category
Computer Systems Computer Sciences
Identifiers
urn:nbn:se:uu:diva-389930 (URN)10.23919/DATE.2019.8715034 (DOI)000470666100132 ()978-3-9819263-2-3 (ISBN)
Conference
Design, Automation & Test in Europe Conference & Exhibition (DATE), MAR 25-29, 2019, Florence, ITALY
Funder
Knut and Alice Wallenberg Foundation
Available from: 2019-08-01 Created: 2019-08-01 Last updated: 2020-02-02Bibliographically approved
5. Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors
Open this publication in new window or tab >>Delay and Bypass: Ready and Criticality Aware Instruction Scheduling in Out-of-Order Processors
2020 (English)In: 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, p. 424-434Conference paper, Published paper (Refereed)
Abstract [en]

Flexible instruction scheduling is essential for performance in out-of-order processors. This is typically achieved by using CAM-based Instruction Queues (IQs) that provide complete flexibility in choosing ready instructions for execution, but at the cost of significant scheduling energy.

In this work we seek to reduce the instruction scheduling energy by reducing the depth and width of the IQ. We do so by classifying instructions based on their readiness and criticality, and using this information to bypass the IQ for instructions that will not benefit from its expensive scheduling structures and delay instructions that will not harm performance. Combined, these approaches allow us to offload a significant portion of the instructions from the IQ to much cheaper FIFO-based scheduling structures without hurting performance. As a result we can reduce the IQ depth and width by half, thereby saving energy.

Our design, Delay and Bypass (DNB), is the first design to explicitly address both readiness and criticality to reduce scheduling energy. By handling both classes we are able to achieve 95% of the baseline out-of-order performance while only using 33% of the scheduling energy. This represents a significant improvement over previous designs which addressed only criticality or readiness (91%/89% performance at 74%/53% energy).

Series
International Symposium on High-Performance Computer Architecture-Proceedings, ISSN 1530-0897, E-ISSN 2378-203X
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-403674 (URN)10.1109/HPCA47549.2020.00042 (DOI)000531494100032 ()978-1-7281-6149-5 (ISBN)
Conference
The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA), Feb. 22-26, 2020, San Diego, CA, USA
Note

As originally published there was an error in the document's author byline. The order was intended to be: Mehdi Alipour (Uppsala University); Rakesh Kumar (Norwegian University of Science and Technology (NTNU)); Stefanos Kaxiras and David Black-Schaffer (Uppsala University), as noted here. The article PDF remains unchanged.

Available from: 2020-02-02 Created: 2020-02-02 Last updated: 2020-06-17Bibliographically approved

Open Access in DiVA

fulltext(7332 kB)1256 downloads
File information
File name FULLTEXT01.pdfFile size 7332 kBChecksum SHA-512
4eeaf1d5cbd820317e0a4f8cd360fffe09a02bf14c5e77b3ac2e46eb7d3de64f01e42438c4eb6635f89e0484bf4bc64d90b621ee14ed0490fc0d09093bb881bc
Type fulltextMimetype application/pdf

Other links

Join Zoom Meeting

Authority records

Alipour, Mehdi

Search in DiVA

By author/editor
Alipour, Mehdi
By organisation
Computer Architecture and Computer CommunicationDivision of Computer Systems
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 1257 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

isbn
urn-nbn

Altmetric score

isbn
urn-nbn
Total: 1981 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf