Uppsala University Publications
Carlson, Trevor E.
Publications (10 of 16)
Alipour, M., Carlson, T. E., Black-Schaffer, D. & Kaxiras, S. (2019). Maximizing limited resources: A limit-based study and taxonomy of out-of-order commit. Journal of Signal Processing Systems, 91(3-4), 379-397
2019 (English) In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 91, no 3-4, p. 379-397. Article in journal (Refereed). Published
Abstract [en]

Out-of-order execution is essential for high-performance, general-purpose computation, as it can find and execute useful work instead of stalling. However, it is typically limited by the requirement of visibly sequential, atomic instruction execution: in other words, in-order instruction commit. While in-order commit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in program order. In contrast, out-of-order commit can release some resources much earlier, yielding improved performance and/or lower resource requirements. Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti (2004). In this paper we revisit out-of-order commit by examining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order commit. While correctly handling recovery for all out-of-order commit conditions currently requires complex tracking and expensive checkpointing, this work aims to demonstrate the potential for selective, speculative out-of-order commit using an oracle implementation without speculative rollback costs. Through this analysis of the potential of out-of-order commit, we learn that: a) there is significant untapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) focusing on implementing only a subset of the out-of-order commit conditions could lead to efficient implementations; d) the benefits of out-of-order commit increase with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy.
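To make the resource-release argument concrete, the following toy Python sketch contrasts how many reorder-buffer entries in-order and out-of-order commit can free from the same snapshot. It is an illustration for this listing only, not the paper's simulator, and its "safe to release" test is a deliberately simplified stand-in for the commit conditions surveyed in the paper.

```python
# Toy model contrasting in-order vs. out-of-order resource release.
# Illustrative sketch only: the release condition below is a simplified
# stand-in for the commit conditions discussed in the paper.
from dataclasses import dataclass

@dataclass
class RobEntry:
    completed: bool    # instruction has finished executing
    may_fault: bool    # instruction could still raise an exception

def in_order_release(rob):
    """Free only the contiguous completed, fault-free prefix (in-order commit)."""
    freed = 0
    for entry in rob:
        if entry.completed and not entry.may_fault:
            freed += 1
        else:
            break
    return freed

def out_of_order_release(rob):
    """Free every completed, fault-free entry, even past a stalled older one
    (an aggressive, simplified out-of-order commit policy)."""
    return sum(1 for entry in rob if entry.completed and not entry.may_fault)

# A stalled long-latency load (entry 1) blocks the completed entries behind it
# under in-order commit, but not under out-of-order commit.
rob = [RobEntry(True, False), RobEntry(False, True),
       RobEntry(True, False), RobEntry(True, False)]
print("in-order commit frees:    ", in_order_release(rob))      # 1
print("out-of-order commit frees:", out_of_order_release(rob))  # 3
```

The stalled entry pinning otherwise-releasable reorder buffer, load/store queue, and register resources is exactly the pressure that the limit study above quantifies.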

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-365899 (URN), 10.1007/s11265-018-1369-4 (DOI), 000459428200012
Available from: 2018-04-26. Created: 2018-11-14. Last updated: 2019-03-21. Bibliographically approved
Ceballos, G., Sembrant, A., Carlson, T. E. & Black-Schaffer, D. (2018). Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs. In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018. Paper presented at ISPASS 2018, April 2–4, Belfast, UK (pp. 1-11). IEEE Computer Society
2018 (English) In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018, IEEE Computer Society, 2018, p. 1-11. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE Computer Society, 2018
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-361214 (URN), 10.1109/ISPASS.2018.00009 (DOI), 978-1-5386-5010-3 (ISBN)
Conference
ISPASS 2018, April 2–4, Belfast, UK
Projects
UPMARC
Available from: 2018-09-21. Created: 2018-09-21. Last updated: 2018-11-16. Bibliographically approved
Nikoleris, N., Hagersten, E. & Carlson, T. E. (2018). Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation.
2018 (English) Report (Other academic)
Abstract [en]

Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected, which then needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goals of improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly, without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

Delorean uses a multi-pass approach to understand the memory regions accessed by the next, upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software through the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).
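As a rough illustration of the role an on-the-fly cache model plays in such a methodology, the sketch below is a plain trace-driven, fully associative LRU model in Python. It is not Delorean and uses no virtualization, but it shows how a stream of memory accesses can be turned into miss-ratio estimates for arbitrary cache sizes.

```python
# Minimal trace-driven cache model (fully associative, LRU).  This is NOT the
# Delorean implementation; it only illustrates how an address stream yields a
# miss ratio for an arbitrary cache size.
from collections import OrderedDict

def miss_ratio(trace, cache_lines, line_bytes=64):
    """Estimate the miss ratio of an LRU cache with `cache_lines` lines."""
    lru = OrderedDict()              # cache-line address -> None, in LRU order
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        if line in lru:
            lru.move_to_end(line)            # hit: refresh recency
        else:
            misses += 1
            lru[line] = None
            if len(lru) > cache_lines:
                lru.popitem(last=False)      # evict least recently used line
    return misses / len(trace)

# A tiny synthetic trace: a hot 4 KiB region plus a cold streaming region.
trace = [i % 4096 for i in range(0, 100000, 8)]
trace += list(range(1 << 20, (1 << 20) + 65536, 64))
for lines in (16, 64, 256):
    print(f"{lines:4d} lines: miss ratio = {miss_ratio(trace, lines):.3f}")
```

Delorean's contribution, per the abstract above, is gathering the inputs to such a model quickly and on demand using hardware virtualization, instead of requiring rigid, up-front checkpoint data.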

Publisher
p. 12
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-369320 (URN)
Available from: 2018-12-12. Created: 2018-12-12. Last updated: 2019-01-08. Bibliographically approved
Krzywda, J., Ali-Eldin, A., Carlson, T. E., Östberg, P.-O. & Elmroth, E. (2018). Power-performance tradeoffs in data center servers: DVFS, CPU pinning, horizontal, and vertical scaling. Future Generation Computer Systems, 81, 114-128
2018 (English) In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 81, p. 114-128. Article in journal (Refereed). Published
Abstract [en]

Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, horizontal scaling, and vertical scaling are four techniques that have been proposed as actuators to control the performance and energy consumption of data center servers. This work investigates the utility of these four actuators and quantifies the power-performance tradeoffs associated with them. Using replicas of the German Wikipedia running on our local testbed, we perform a set of experiments to quantify the influence of DVFS, vertical and horizontal scaling, and CPU pinning on end-to-end response time (average and tail), throughput, and power consumption under different workloads. The results show that DVFS rarely reduces the power consumption of underloaded servers by more than 5%, but it can be used to limit the maximal power consumption of a saturated server by up to 20% (at a cost of performance degradation). CPU pinning reduces the power consumption of an underloaded server (by up to 7%) at the cost of performance degradation, which can be limited by choosing an appropriate CPU pinning scheme. Horizontal and vertical scaling improve both the average and tail response time, but the improvement is not proportional to the amount of resources added. The load balancing strategy has a large impact on the tail response time of horizontally scaled applications.
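The headroom DVFS has on a saturated server, and the small savings on an underloaded one, are consistent with the standard CMOS dynamic-power relation P_dyn ≈ C·V²·f. The back-of-envelope Python sketch below uses purely illustrative constants, not measurements from the paper:

```python
# Back-of-envelope sketch of why DVFS trims power on a busy core: dynamic
# power scales roughly as C * V^2 * f.  All constants are illustrative.
def dynamic_power(c_eff, voltage, freq_hz):
    """Classic CMOS dynamic-power estimate: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz

C_EFF = 3.0e-8                                  # assumed effective capacitance (F), chosen so nominal ~90 W
nominal = dynamic_power(C_EFF, 1.00, 3.0e9)     # assumed nominal operating point
scaled  = dynamic_power(C_EFF, 0.90, 2.4e9)     # assumed lower DVFS operating point

print(f"nominal: {nominal:.1f} W")
print(f"scaled : {scaled:.1f} W  ({100 * (1 - scaled / nominal):.0f}% less dynamic power)")
# On an underloaded server, static power and non-CPU components dominate the
# total draw, which is consistent with the paper's observation that DVFS
# saves little there while capping power effectively on a saturated server.
```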

Place, publisher, year, edition, pages
Elsevier Science BV, 2018
Keywords
Power-performance tradeoffs, Dynamic Voltage and Frequency Scaling (DVFS), CPU pinning, Horizontal scaling, Vertical scaling
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-345707 (URN), 10.1016/j.future.2017.10.044 (DOI), 000423652200010
Funder
Swedish Research Council, C0590801; eSSENCE - An eScience Collaboration; EU, FP7, Seventh Framework Programme, 610711, 610490; EU, Horizon 2020, 732667
Available from: 2018-03-14. Created: 2018-03-14. Last updated: 2018-03-14. Bibliographically approved
Sembrant, A., Carlson, T. E., Hagersten, E. & Black-Schaffer, D. (2017). A graphics tracing framework for exploring CPU+GPU memory systems. In: Proc. 20th International Symposium on Workload Characterization: . Paper presented at IISWC 2017, October 1–3, Seattle, WA (pp. 54-65). IEEE
2017 (English) In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 54-65. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-357055 (URN), 10.1109/IISWC.2017.8167756 (DOI), 000428206700006, 978-1-5386-1233-0 (ISBN)
Conference
IISWC 2017, October 1–3, Seattle, WA
Available from: 2017-12-07. Created: 2018-08-17. Last updated: 2018-09-24. Bibliographically approved
Alipour, M., Carlson, T. E. & Kaxiras, S. (2017). A Taxonomy of Out-of-Order Instruction Commit. In: 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): . Paper presented at the 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Santa Rosa, CA, USA (pp. 135-136). Los Alamitos: IEEE Computer Society
2017 (English) In: 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Los Alamitos: IEEE Computer Society, 2017, p. 135-136. Conference paper, Published paper (Refereed)
Abstract [en]

While in-order instruction commit has its advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, registers) until they are released in program order. In contrast, out-of-order commit releases resources much earlier, yielding improved performance without the need for additional hardware resources. In this paper, we revisit out-of-order commit from a different perspective: not by proposing another hardware technique, but by introducing a taxonomy and evaluating three different micro-architectures with this technique enabled. We show how smaller processors can benefit from simple out-of-order commit strategies, while larger, aggressive cores require more aggressive strategies to improve performance.
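As a rough sketch of what such a taxonomy enumerates, the Python fragment below lists design points as combinations of relaxed commit conditions. The condition names are paraphrased placeholders for this listing, not the labels used in the paper.

```python
# Hypothetical sketch of a commit taxonomy: each design point is the set of
# commit conditions it relaxes.  Condition names are paraphrased placeholders,
# not the paper's exact labels.
from itertools import combinations

CONDITIONS = (
    "older_branches_resolved",   # no unresolved older branch
    "cannot_raise_exception",    # no older instruction that may still fault
    "older_stores_complete",     # no older store still outstanding
    "consistency_order_safe",    # no pending memory-consistency constraint
)

def design_points():
    """Enumerate all subsets of relaxed conditions; the empty set is
    conventional in-order commit."""
    for k in range(len(CONDITIONS) + 1):
        yield from combinations(CONDITIONS, k)

for point in design_points():
    print("in-order commit" if not point else "relaxes: " + ", ".join(point))
```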

Place, publisher, year, edition, pages
Los Alamitos: IEEE Computer Society, 2017
National Category
Computer Systems Computer Sciences
Identifiers
urn:nbn:se:uu:diva-352938 (URN), 10.1109/ISPASS.2017.7975283 (DOI), 000426905600020, 978-1-5386-3890-3 (ISBN), 978-1-5386-3891-0 (ISBN), 978-1-5386-3889-7 (ISBN)
Conference
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Santa Rosa, CA, USA
Available from: 2018-06-12. Created: 2018-06-12. Last updated: 2018-06-12. Bibliographically approved
Ceballos, G., Sembrant, A., Carlson, T. E. & Black-Schaffer, D. (2017). Analyzing Graphics Workloads on Tile-based GPUs. In: Proc. 20th International Symposium on Workload Characterization: . Paper presented at IISWC 2017, October 1–3, Seattle, WA (pp. 108-109). IEEE
2017 (English) In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 108-109. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer Systems Computer Engineering
Identifiers
urn:nbn:se:uu:diva-335559 (URN), 10.1109/IISWC.2017.8167761 (DOI), 000428206700011, 978-1-5386-1233-0 (ISBN)
Conference
IISWC 2017, October 1–3, Seattle, WA
Projects
UPMARC
Funder
Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2017-12-06. Created: 2017-12-06. Last updated: 2018-11-15. Bibliographically approved
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2017). Clairvoyance: Look-ahead compile-time scheduling. In: Proc. 15th International Symposium on Code Generation and Optimization: . Paper presented at CGO 2017, February 4–8, Austin, TX (pp. 171-184). Piscataway, NJ: IEEE Press
2017 (English) In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, p. 171-184. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-316480 (URN), 000402548700015, 978-1-5090-4931-8 (ISBN)
Conference
CGO 2017, February 4–8, Austin, TX
Projects
UPMARC
Funder
Swedish Research Council, 2010-4741
Available from: 2017-02-04. Created: 2017-03-01. Last updated: 2018-04-26. Bibliographically approved
Alipour, M., Carlson, T. E. & Kaxiras, S. (2017). Exploring the performance limits of out-of-order commit. In: Proc. 14th Computing Frontiers Conference: . Paper presented at CF 2017, May 15–17, Siena, Italy (pp. 211-220). New York: ACM Press
2017 (English) In: Proc. 14th Computing Frontiers Conference, New York: ACM Press, 2017, p. 211-220. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-334601 (URN), 10.1145/3075564.3075581 (DOI), 978-1-4503-4487-6 (ISBN)
Conference
CF 2017, May 15–17, Siena, Italy
Projects
UPMARC
Available from: 2017-05-15. Created: 2017-11-24. Last updated: 2018-01-13. Bibliographically approved
Ros, A., Carlson, T. E., Alipour, M. & Kaxiras, S. (2017). Non-speculative load-load reordering in TSO. In: Proc. 44th International Symposium on Computer Architecture: . Paper presented at ISCA 2017, June 24–28, Toronto, Canada (pp. 187-200). New York: ACM Press
2017 (English) In: Proc. 44th International Symposium on Computer Architecture, New York: ACM Press, 2017, p. 187-200. Conference paper, Published paper (Refereed)
Abstract [en]

In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, the speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show, for the first time, that it is not necessary to squash and re-execute speculatively reordered loads in TSO when their reordering is seen. Instead, the reordering can be hidden from other cores by the coherence protocol. The implication is that we can irrevocably bind speculative loads. This allows us to commit reordered loads out of order without having to wait for the loads to become non-speculative, and without having to checkpoint committed state (and roll back if needed), just to ensure correctness in the rare case of some core seeing the reordering. We show that by exposing a reordering to the coherence layer and by appropriately modifying a typical directory protocol, we can successfully hide load-load reordering without perceptible performance cost and without deadlock. Our solution is cost-effective and increases the performance of out-of-order commit by a sizable margin, compared to the base case where memory operations are not allowed to commit if the consistency model could be violated.
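The situation the protocol must hide can be seen in the classic message-passing litmus test. The Python sketch below enumerates interleavings of a writer (data = 1; flag = 1) with a reader (r1 = flag; r2 = data), with and without the reader's loads locally reordered. It only shows which outcome would expose the reordering to an observer; it is not the paper's directory protocol.

```python
# Message-passing litmus test: which observable outcome betrays a load-load
# reordering under TSO.  Illustrative enumeration only.
from itertools import permutations

def outcomes(reader_loads):
    """Enumerate interleavings of the writer's two stores with the reader's
    two loads (issued in the given order); collect (flag, data) results."""
    writer = [("data", 1), ("flag", 1)]            # writer: data = 1, then flag = 1
    events = [("W", 0), ("W", 1), ("R", 0), ("R", 1)]
    seen = set()
    for perm in permutations(events):
        # keep each thread's own order within the interleaving
        if perm.index(("W", 0)) > perm.index(("W", 1)):
            continue
        if perm.index(("R", 0)) > perm.index(("R", 1)):
            continue
        mem = {"data": 0, "flag": 0}
        regs = {}
        for kind, idx in perm:
            if kind == "W":
                var, val = writer[idx]
                mem[var] = val
            else:
                var = reader_loads[idx]
                regs[var] = mem[var]
        seen.add((regs["flag"], regs["data"]))
    return seen

program_order = ["flag", "data"]   # r1 = flag; r2 = data, as written
reordered     = ["data", "flag"]   # the two loads issued out of program order

print("loads in order :", sorted(outcomes(program_order)))
print("loads reordered:", sorted(outcomes(reordered)))
# Only the reordered case can yield (flag, data) == (1, 0).  If another core
# could observe that outcome, TSO would be violated; a conventional core
# squashes and re-executes, whereas the paper modifies the directory protocol
# so the reordering stays invisible and the loads can commit without rollback.
```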

Place, publisher, year, edition, pages
New York: ACM Press, 2017
Keywords
Cache coherence, memory consistency, TSO, load reordering, out-of-order commit
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-323468 (URN), 10.1145/3079856.3080220 (DOI), 000426483300015, 978-1-4503-4892-8 (ISBN)
Conference
ISCA 2017, June 24–28, Toronto, Canada
Projects
UPMARC
Funder
Swedish Research Council, 621-2012-5332
Available from: 2017-06-24. Created: 2017-06-07. Last updated: 2018-06-08. Bibliographically approved