2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]
This thesis proposes novel, efficient execution paradigms for parallel heterogeneous architectures. The end of Dennard scaling is threatening the effectiveness of dynamic voltage and frequency scaling (DVFS) in future technology nodes; therefore, new execution paradigms are required to exploit the non-linear relationship between performance and energy efficiency of memory-bound application regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program level) into two coarse-grain phases: the access phase and the execute phase, whose voltage and frequency we can scale independently. The access phase is intended to prefetch the data into the cache and is therefore expected to be predominantly memory-bound, while the execute phase runs immediately after the access phase (which has warmed up the cache) and is therefore expected to be compute-bound.
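The two-phase structure described above can be illustrated on a small loop with indirect accesses. The following is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the `chunk` granularity, and the two DVFS hooks (`enter_low_frequency`, `enter_high_frequency`) are hypothetical and appear only for illustration; the hooks are empty placeholders, and the choice of a lower frequency for the memory-bound access phase is an assumption drawn from the abstract's description. `__builtin_prefetch` is the GCC/Clang prefetch builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical DVFS hooks (assumption): empty placeholders standing in for
 * whatever mechanism a real system uses to change core frequency. */
static void enter_low_frequency(void)  { /* memory-bound phase: frequency matters less */ }
static void enter_high_frequency(void) { /* compute-bound phase: run fast */ }

/* Original coupled loop: memory accesses and computation are interleaved. */
double baseline_indirect_sum(const double *data, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}

/* DAE-style version: each chunk of iterations is split into an access phase
 * that only prefetches, followed by an execute phase that computes on the
 * (ideally) cache-resident data. `chunk` is an illustrative parameter; it
 * would be chosen so that one chunk's working set fits in the cache. */
double dae_indirect_sum(const double *data, const size_t *idx, size_t n,
                        size_t chunk)
{
    assert(chunk > 0);
    double sum = 0.0;

    for (size_t start = 0; start < n; start += chunk) {
        /* Clamp the last chunk without risking overflow of start + chunk. */
        size_t end = (n - start < chunk) ? n : start + chunk;

        /* Access phase: issue prefetches only, no computation. */
        enter_low_frequency();
        for (size_t i = start; i < end; i++)
            __builtin_prefetch(&data[idx[i]], 0 /* read */, 3 /* keep in cache */);

        /* Execute phase: the original computation on warmed-up data. */
        enter_high_frequency();
        for (size_t i = start; i < end; i++)
            sum += data[idx[i]];
    }
    return sum;
}
```

Both functions return the same result; the DAE version only changes when the memory traffic is issued. Because the access phase contains no dependent computation, its prefetches can overlap with one another.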
DAE achieves good energy savings (on average 25% lower energy-delay product, EDP) without performance degradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memory-level parallelism (MLP) of memory-bound regions, which results in performance improvements for memory-bound applications. To automatically transform application regions to DAE, we propose compiler techniques that generate and incorporate the access phase(s) in the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments and to handle codes with statically unknown access-phase overheads. In general, applications automatically transformed to DAE by our compiler maintain (or even exceed in some cases) the good performance and energy efficiency of manually optimized DAE codes.
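As a rough reading of the 25% figure, recall that EDP is the product of energy and execution time. The derivation below is an idealized sketch: it assumes a single application whose runtime is exactly unchanged, whereas the abstract reports an average across applications, so the result is indicative only. The subscripts "base" and "DAE" are labels introduced here for illustration.

```latex
\[
\mathrm{EDP} = E \cdot T,
\qquad
\frac{\mathrm{EDP}_{\mathrm{DAE}}}{\mathrm{EDP}_{\mathrm{base}}}
  = \frac{E_{\mathrm{DAE}}}{E_{\mathrm{base}}} \cdot
    \frac{T_{\mathrm{DAE}}}{T_{\mathrm{base}}}
  = 0.75 .
\]
\[
T_{\mathrm{DAE}} = T_{\mathrm{base}}
\;\Longrightarrow\;
\frac{E_{\mathrm{DAE}}}{E_{\mathrm{base}}} = 0.75 ,
\]
```

that is, under this assumption a 25% lower EDP corresponds to roughly 25% lower energy at equal runtime.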
Finally, to ease the programming of heterogeneous systems (with integrated GPUs), we propose a novel system architecture that provides unified virtual memory with low overhead. The insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous race-free model). This allows us to simplify the coherence protocol between the CPU and the GPU, as well as the GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDP over the corresponding SC implementation.
Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. p. 54
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1405
Keywords
Decoupled Execution, Performance, Energy, DVFS, Compiler Optimizations, Heterogeneous Coherence
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-300831 (URN)
978-91-554-9654-8 (ISBN)
Public defence
2016-09-30, ITC/1111, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Opponent
Supervisors
Projects
UPMARC
Funder
EU, FP7, Seventh Framework Programme, FP7-ICT-288653
Swedish Research Council
2016-09-07, 2016-08-15, 2019-02-25