Uppsala University Publications (uu.se)
Sembrant, Andreas
Publications (10 of 19)
Ceballos, G., Sembrant, A., Carlson, T. E. & Black-Schaffer, D. (2018). Behind the Scenes: Memory Analysis of Graphical Workloads on Tile-based GPUs. In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018. Paper presented at ISPASS 2018, April 2–4, Belfast, UK (pp. 1-11). IEEE Computer Society
2018 (English). In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2018, IEEE Computer Society, 2018, p. 1-11. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE Computer Society, 2018
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-361214 (URN); 10.1109/ISPASS.2018.00009 (DOI); 978-1-5386-5010-3 (ISBN)
Conference
ISPASS 2018, April 2–4, Belfast, UK
Projects
UPMARC
Available from: 2018-09-21. Created: 2018-09-21. Last updated: 2018-11-16. Bibliographically approved.
Sembrant, A., Carlson, T. E., Hagersten, E. & Black-Schaffer, D. (2017). A graphics tracing framework for exploring CPU+GPU memory systems. In: Proc. 20th International Symposium on Workload Characterization. Paper presented at IISWC 2017, October 1–3, Seattle, WA (pp. 54-65). IEEE
2017 (English). In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 54-65. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-357055 (URN); 10.1109/IISWC.2017.8167756 (DOI); 000428206700006; 978-1-5386-1233-0 (ISBN)
Conference
IISWC 2017, October 1–3, Seattle, WA
Available from: 2017-12-07. Created: 2018-08-17. Last updated: 2018-09-24. Bibliographically approved.
Sembrant, A., Hagersten, E. & Black-Schaffer, D. (2017). A split cache hierarchy for enabling data-oriented optimizations. In: Proc. 23rd International Symposium on High Performance Computer Architecture. Paper presented at HPCA 2017, February 4–8, Austin, TX (pp. 133-144). IEEE Computer Society
2017 (English). In: Proc. 23rd International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2017, p. 133-144. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE Computer Society, 2017
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-306368 (URN); 10.1109/HPCA.2017.25 (DOI); 000403330300012; 978-1-5090-4985-1 (ISBN)
Conference
HPCA 2017, February 4–8, Austin, TX
Available from: 2017-05-08. Created: 2016-10-27. Last updated: 2018-01-14. Bibliographically approved.
Borgström, G., Sembrant, A. & Black-Schaffer, D. (2017). Adaptive cache warming for faster simulations. In: Proc. 9th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools. Paper presented at RAPIDO 2017, January 23–25, Stockholm, Sweden. New York: ACM Press, Article ID 1.
2017 (English). In: Proc. 9th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, New York: ACM Press, 2017, article id 1. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2017
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-310625 (URN); 10.1145/3023973.3023974 (DOI); 978-1-4503-4840-9 (ISBN)
Conference
RAPIDO 2017, January 23–25, Stockholm, Sweden
Projects
UPMARC
Funder
Swedish Research Council, 2014-5480; Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2017-01-23. Created: 2016-12-16. Last updated: 2018-01-13. Bibliographically approved.
Ceballos, G., Sembrant, A., Carlson, T. E. & Black-Schaffer, D. (2017). Analyzing Graphics Workloads on Tile-based GPUs. In: Proc. 20th International Symposium on Workload Characterization. Paper presented at IISWC 2017, October 1–3, Seattle, WA (pp. 108-109). IEEE
2017 (English). In: Proc. 20th International Symposium on Workload Characterization, IEEE, 2017, p. 108-109. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer Systems; Computer Engineering
Identifiers
urn:nbn:se:uu:diva-335559 (URN); 10.1109/IISWC.2017.8167761 (DOI); 000428206700011; 978-1-5386-1233-0 (ISBN)
Conference
IISWC 2017, October 1–3, Seattle, WA
Projects
UPMARC
Funder
Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2017-12-06. Created: 2017-12-06. Last updated: 2018-11-15. Bibliographically approved.
Sembrant, A., Carlson, T. E., Hagersten, E. & Black-Schaffer, D. (2017). POSTER: Putting the G back into GPU/CPU Systems Research. In: Proc. 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). Paper presented at PACT 2017, September 9–13, Portland, OR, USA (pp. 130-131).
2017 (English). In: 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017, p. 130-131. Conference paper, Published paper (Refereed)
Abstract [en]

Modern SoCs contain several CPU cores and many GPU cores to execute both general-purpose and highly parallel graphics workloads. In many SoCs, more area is dedicated to graphics than to general-purpose compute. Despite this, the micro-architecture research community primarily focuses on GPGPU and CPU-only research, not on graphics (the primary workload for many SoCs). The main reason for this is the lack of efficient tools and simulators for modern graphics applications. This work focuses on the GPU memory traffic generated by graphics. We describe a new graphics tracing framework and use it both to study the memory behavior of graphics applications and to examine how CPUs and GPUs affect system performance. Our results show that graphics applications exhibit a wide range of memory behavior, both across applications and over time, and that they slow down co-running SPEC applications by 59% on average.

Series
International Conference on Parallel Architectures and Compilation Techniques, ISSN 1089-795X
National Category
Computer Systems; Computer Engineering
Identifiers
urn:nbn:se:uu:diva-347752 (URN); 10.1109/PACT.2017.60 (DOI); 000417411300011; 978-1-5090-6764-0 (ISBN)
Conference
PACT 2017, September 9–13, Portland, OR, USA
Available from: 2018-04-17. Created: 2018-04-17. Last updated: 2018-04-17. Bibliographically approved.
Spiliopoulos, V., Sembrant, A., Keramidas, G., Hagersten, E. & Kaxiras, S. (2016). A unified DVFS-cache resizing framework.
2016 (English). Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203; 2016-014
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-300840 (URN)
Available from: 2016-08-15. Created: 2016-08-15. Last updated: 2018-01-10. Bibliographically approved.
Sembrant, A., Hagersten, E. & Black-Schaffer, D. (2016). Data placement across the cache hierarchy: Minimizing data movement with reuse-aware placement. In: Proc. 34th International Conference on Computer Design. Paper presented at ICCD 2016, October 2–5, Phoenix, AZ (pp. 117-124). Piscataway, NJ: IEEE
2016 (English). In: Proc. 34th International Conference on Computer Design, Piscataway, NJ: IEEE, 2016, p. 117-124. Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors employ multiple levels of caching to address bandwidth, latency and performance requirements. The behavior of these hierarchies is determined by their approach to data placement and data eviction. Recent research has developed many intelligent data eviction policies, but cache hierarchies remain primarily either exclusive or inclusive with regard to data placement. This means that today's cache hierarchies typically install accessed data into all cache levels at one point or another, regardless of whether the data is reused in each level. Such data movement wastes energy by installing data into cache levels where the data is not reused. This paper presents Reuse Aware Placement (RAP), an efficient data placement mechanism that determines where to place data in the cache hierarchy based on whether the data will be reused at each level. RAP dynamically identifies data sets and measures their reuse at each level in the hierarchy. This enables RAP to determine where to move data upon installation or eviction to maximize reuse. To accomplish this, each cache line is associated with a data set and consults that data set's policy upon eviction or installation. The RAP data placement mechanism is orthogonal to the replacement policy and can be combined with any number of proposed eviction mechanisms. By itself, the RAP data placement mechanism reduces traffic in the cache hierarchy by 21 to 64%, depending on the level, without hurting performance. As a result of this traffic reduction, RAP reduces dynamic cache energy by 28% and total cache energy by 17%.
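The per-data-set reuse tracking described in the abstract above can be sketched in a few lines of Python. This is purely an illustration, not the paper's hardware mechanism: the class name, the warm-up length, and the reuse threshold are all invented for the example.

```python
# Toy sketch of reuse-aware placement: each data set keeps per-level
# install and reuse counters, and new lines are only installed into
# levels where the data set has demonstrated reuse. Illustrative only.

class DataSetPolicy:
    def __init__(self, levels):
        self.installs = [0] * levels   # lines installed per level
        self.reuses = [0] * levels     # hits after install per level

    def record_install(self, level):
        self.installs[level] += 1

    def record_reuse(self, level):
        self.reuses[level] += 1

    def should_install(self, level, threshold=0.1):
        # Keep installing while we lack evidence; bypass once the
        # observed reuse rate at this level is clearly low.
        if self.installs[level] < 16:
            return True                # warm-up: keep sampling
        return self.reuses[level] / self.installs[level] >= threshold


policy = DataSetPolicy(levels=3)
for _ in range(16):
    policy.record_install(1)           # 16 installs into L2, never reused
print(policy.should_install(1))        # -> False: bypass L2 for this data set
print(policy.should_install(2))        # -> True: still sampling L3
```

A real implementation would of course live in hardware next to the cache controllers; the point here is only that the placement decision is a per-data-set, per-level policy separate from the eviction policy.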

Place, publisher, year, edition, pages
Piscataway, NJ: IEEE, 2016
Series
Proceedings IEEE International Conference on Computer Design, ISSN 1063-6404
National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-305232 (URN); 10.1109/ICCD.2016.7753269 (DOI); 000391829200016; 9781509051427 (ISBN)
Conference
ICCD 2016, October 2–5, Phoenix, AZ
Projects
UPMARC
Funder
Swedish Foundation for Strategic Research, FFL12-0051
Available from: 2016-11-24. Created: 2016-10-13. Last updated: 2018-01-14. Bibliographically approved.
Perais, A., Seznec, A., Michaud, P., Sembrant, A. & Hagersten, E. (2015). Cost-effective speculative scheduling in high performance processors. In: Proc. 42nd International Symposium on Computer Architecture. Paper presented at ISCA 2015, June 13–17, Portland, OR (pp. 247-259). New York: ACM Press
2015 (English). In: Proc. 42nd International Symposium on Computer Architecture, New York: ACM Press, 2015, p. 247-259. Conference paper, Published paper (Refereed)
Abstract [en]

To maximize performance, out-of-order execution processors sometimes issue instructions without having the guarantee that operands will be available in time; e.g., loads are typically assumed to hit in the L1 cache and dependent instructions are issued accordingly. This form of speculation, which we refer to as speculative scheduling, has been used for two decades in real processors, but has received little attention from the research community. In particular, as pipeline depth grows and the distance between the Issue and Execute stages increases, it becomes critical to issue instructions dependent on variable-latency instructions as soon as possible rather than wait for the actual cycle at which the result becomes available. Unfortunately, due to the uncertain nature of speculative scheduling, the scheduler may wrongly issue an instruction that will not have its source(s) available on the bypass network when it reaches the Execute stage. In that event, the instruction is canceled and replayed, potentially impairing performance and increasing energy consumption. In this work, we do not present a new replay mechanism. Rather, we focus on ways to reduce the number of replays that are agnostic of the replay scheme. First, we propose schedule shifting, an easily implementable, low-cost solution to reduce the number of replays caused by L1 bank conflicts: it always assumes that, given a dual-load issue capacity, the second load issued in a given cycle will be delayed because of a bank conflict, so its dependents are always issued with the corresponding delay. Second, we also improve on existing L1 hit/miss prediction schemes by taking into account instruction criticality. That is, for some criterion of criticality and for loads whose hit/miss behavior is hard to predict, we show that it is more cost-effective to stall dependents if the load is not predicted critical.
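The schedule-shifting idea in the abstract above can be illustrated with a toy model in which dependents are issued against a predicted load latency and replayed when the load actually takes longer. This is not the paper's microarchitecture: the latencies, the conflict penalty, and the function names are all invented for the sketch.

```python
# Toy model of speculative scheduling: dependents of a load are issued
# assuming an L1-hit latency, and must be replayed if the load actually
# took longer. "Schedule shifting" pessimistically delays the dependents
# of the second load issued in a cycle by the bank-conflict penalty.

L1_HIT = 3          # assumed load-to-use latency (cycles)
CONFLICT_DELAY = 1  # extra cycles on an L1 bank conflict

def replays(loads, schedule_shifting=False):
    """loads: list of (issue_slot, actual_latency); two load slots per cycle."""
    count = 0
    for slot, actual in loads:
        predicted = L1_HIT
        if schedule_shifting and slot == 1:
            predicted += CONFLICT_DELAY  # always assume the 2nd load conflicts
        if actual > predicted:
            count += 1                   # dependents issued too early: replay
    return count

# Two loads issued in the same cycle; the second hits a bank conflict.
loads = [(0, 3), (1, 4)]
print(replays(loads))                          # -> 1 replay
print(replays(loads, schedule_shifting=True))  # -> 0 replays
```

The trade-off the abstract describes is visible even here: shifting removes the replay, at the cost of delaying the second load's dependents by one cycle even when no conflict occurs.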

Place, publisher, year, edition, pages
New York: ACM Press, 2015
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-272467 (URN); 10.1145/2749469.2749470 (DOI); 000380455700020; 9781450334020 (ISBN)
Conference
ISCA 2015, June 13–17, Portland, OR
Projects
UPMARC; UART
Available from: 2015-06-13. Created: 2016-01-14. Last updated: 2016-12-05. Bibliographically approved.
Sembrant, A., Carlson, T. E., Hagersten, E., Black-Schaffer, D., Perais, A., Seznec, A. & Michaud, P. (2015). Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors. In: Proc. 48th International Symposium on Microarchitecture. Paper presented at MICRO 2015, December 5–9, Waikiki, HI.
2015 (English). In: Proc. 48th International Symposium on Microarchitecture, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors employ large structures (IQ, LSQ, register file, etc.) to expose instruction-level parallelism (ILP) and memory-level parallelism (MLP). These resources are typically allocated to instructions in program order. This wastes resources by allocating them to instructions that are not yet ready to be executed and by eagerly allocating them to instructions that are not part of the application's critical path.

This work explores the possibility of allocating pipeline resources only when needed to expose MLP, and thereby enabling a processor design with significantly smaller structures, without sacrificing performance. First we identify the classes of instructions that should not reserve resources in program order and evaluate the potential performance gains we could achieve by delaying their allocations. We then use this information to “park” such instructions in a simpler, and therefore more efficient, Long Term Parking (LTP) structure. The LTP stores instructions until they are ready to execute, without allocating pipeline resources, and thereby keeps the pipeline available for instructions that can generate further MLP.

LTP can accurately and rapidly identify which instructions to park, park them before they execute, wake them when needed to preserve performance, and do so using a simple queue instead of a complex IQ. We show that even a very simple queue-based LTP design allows us to significantly reduce IQ (64→32) and register file (128→96) sizes while retaining MLP performance and improving energy efficiency.
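The two-structure design described above can be sketched as a toy model: instructions whose operands are not ready and that do not generate MLP are held in a simple FIFO instead of occupying issue-queue entries. All names and the readiness test below are invented for illustration and do not reflect the paper's actual hardware.

```python
# Toy sketch of Long Term Parking: not-ready, non-MLP instructions are
# "parked" in a cheap FIFO and only enter the issue queue (IQ) once
# their operands become available. Illustrative only.
from collections import deque

class Pipeline:
    def __init__(self):
        self.iq = []            # small, expensive out-of-order issue queue
        self.ltp = deque()      # cheap FIFO holding parked instructions
        self.ready_regs = set()

    def dispatch(self, name, srcs, exposes_mlp):
        """Park instructions that are not ready and do not expose MLP."""
        insn = (name, srcs)
        if exposes_mlp or all(r in self.ready_regs for r in srcs):
            self.iq.append(insn)    # MLP-generating or ready work: into the IQ
        else:
            self.ltp.append(insn)   # parked: no IQ entry allocated yet

    def wake(self, reg):
        """A value became ready: unpark instructions in FIFO order."""
        self.ready_regs.add(reg)
        while self.ltp and all(r in self.ready_regs for r in self.ltp[0][1]):
            self.iq.append(self.ltp.popleft())

p = Pipeline()
p.dispatch("load A", [], exposes_mlp=True)     # load generates MLP: into the IQ
p.dispatch("add B", ["A"], exposes_mlp=False)  # waits on the load: parked
print(len(p.iq), len(p.ltp))                   # -> 1 1
p.wake("A")                                    # load completes: unpark the add
print(len(p.iq), len(p.ltp))                   # -> 2 0
```

The design point the abstract makes is that the FIFO is far cheaper per entry than an IQ slot, which is what lets the IQ and register file shrink without losing MLP.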

National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-272468 (URN)
Conference
MICRO 2015, December 5–9, Waikiki, HI
Projects
UPMARC; UART
Available from: 2016-01-14. Created: 2016-01-14. Last updated: 2018-01-10.