Uppsala University Publications (uu.se)
Koukos, Konstantinos (ORCID iD: orcid.org/0000-0002-9460-1290)
Publications (10 of 10)
Tran, K.-A., Jimborean, A., Carlson, T. E., Koukos, K., Själander, M. & Kaxiras, S. (2018). SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. Paper presented at PLDI 2018 the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22 2018, Philadelphia, USA (pp. 328-343). Association for Computing Machinery (ACM)
2018 (English). In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Association for Computing Machinery (ACM), 2018, p. 328-343. Conference paper, Published paper (Refereed)
Abstract [en]

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.

In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach is able to hide memory latency and attain increased memory and instruction level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation that can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, and is thus able to transform applications that include highly complex control flow and indirect memory accesses.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-361359 (URN); 10.1145/3192366.3192393 (DOI); 000452469600023 (ISI); 978-1-4503-5698-5 (ISBN)
Conference
PLDI 2018 the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22 2018, Philadelphia, USA
Projects
UPMARC
Funder
Swedish Research Council, 2016-05086
Available from: 2018-09-23 Created: 2018-09-23 Last updated: 2019-02-01. Bibliographically approved
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2017). Clairvoyance: Look-ahead compile-time scheduling. In: Proc. 15th International Symposium on Code Generation and Optimization. Paper presented at CGO 2017, February 4–8, Austin, TX (pp. 171-184). Piscataway, NJ: IEEE Press
2017 (English). In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, p. 171-184. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-316480 (URN); 000402548700015 (ISI); 978-1-5090-4931-8 (ISBN)
Conference
CGO 2017, February 4–8, Austin, TX
Projects
UPMARC
Funder
Swedish Research Council, 2010-4741
Available from: 2017-02-04 Created: 2017-03-01 Last updated: 2018-04-26. Bibliographically approved
Carlson, T. E., Tran, K.-A., Jimborean, A., Koukos, K., Själander, M. & Kaxiras, S. (2017). Transcending hardware limits with software out-of-order processing. IEEE Computer Architecture Letters, 16(2), 162-165
2017 (English). In: IEEE Computer Architecture Letters, ISSN 1556-6056, Vol. 16, no 2, p. 162-165. Article in journal (Refereed), Published
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-334012 (URN); 10.1109/LCA.2017.2672559 (DOI); 000418870500018 (ISI)
Projects
UPMARC
Available from: 2017-02-22 Created: 2017-11-20 Last updated: 2018-04-26. Bibliographically approved
Koukos, K., Ros, A., Hagersten, E. & Kaxiras, S. (2016). Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead. ACM Transactions on Architecture and Code Optimization (TACO), 13(1), Article ID 1.
2016 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 13, no 1, article id 1. Article in journal (Refereed), Published
Abstract [en]

This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Research proposals implement coherence protocols for sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. The use of HRF simplifies the coherency protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes CPU and GPU demands separately, with the GPU part being simpler while the CPU is more elaborate and latency aware. We achieve an average 45% speedup and 45% energy-delay product reduction (20% energy) over the corresponding SC implementation.
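
As a rough consistency check on the figures quoted above, the energy-delay product (EDP) is the product of energy $E$ and execution time $D$. The calculation below is only a sketch: it assumes that "45% speedup" means the SC execution time divided by the new execution time is 1.45, that "20% energy" means energy drops to 0.80 of the SC value, and that the reported averages can be multiplied together, which per-benchmark averages need not satisfy.

```latex
\[
\mathrm{EDP} = E \cdot D,
\qquad
\frac{\mathrm{EDP}_{\text{new}}}{\mathrm{EDP}_{\text{SC}}}
  = \frac{E_{\text{new}}}{E_{\text{SC}}} \cdot \frac{D_{\text{new}}}{D_{\text{SC}}}
  = 0.80 \times \frac{1}{1.45}
  \approx 0.552
\]
```

Under those assumptions the EDP falls by roughly 44.8%, which is in line with the reported 45% reduction.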

Keywords
Multicore; heterogeneous coherence; GPU MMU design; virtual coherence protocol; directory-less protocol
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-295765 (URN); 10.1145/2889488 (DOI); 000373904600001 (ISI)
Projects
UPMARC
Funder
EU, FP7, Seventh Framework Programme, FP7-ICT-288653; EU, European Research Council, TIN2012-38341-C04-03
Available from: 2016-04-05 Created: 2016-06-09 Last updated: 2017-11-30. Bibliographically approved
Koukos, K. (2016). Efficient Execution Paradigms for Parallel Heterogeneous Architectures. (Doctoral dissertation). Uppsala: Acta Universitatis Upsaliensis
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis proposes novel, efficient execution paradigms for parallel heterogeneous architectures. The end of Dennard scaling is threatening the effectiveness of DVFS in future nodes; therefore, new execution paradigms are required to exploit the non-linear relationship between performance and energy efficiency of memory-bound application regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program level) into two coarse-grain phases, the access phase and the execute phase, to which we can apply DVFS independently. The access phase is intended to prefetch the data into the cache and is therefore expected to be predominantly memory-bound, while the execute phase runs immediately after the access phase (which has warmed up the cache) and is therefore expected to be compute-bound.
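
The two-phase structure described above can be illustrated with a small hand-written sketch. It is not the thesis's compiler output: the names `access_phase`, `execute_phase`, `dae_sum`, and the `CHUNK` size are hypothetical, the loop body is an arbitrary indirect-access reduction, and the points where a runtime might change frequency are marked only by comments. `__builtin_prefetch` is the GCC/Clang prefetch builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk size: elements covered by one access/execute pair.
   A real system would size it so the chunk's working set fits in cache. */
#define CHUNK 1024

/* Access phase: only loads the indices and issues prefetch hints for the
   data. Prefetches are hints and never change program state. */
static void access_phase(const double *a, const size_t *idx,
                         size_t begin, size_t end) {
    for (size_t i = begin; i < end; i++)
        __builtin_prefetch(&a[idx[i]], 0 /* read */, 3 /* high locality */);
}

/* Execute phase: the original computation on the same chunk, now expected
   (but not guaranteed) to find its data in the cache. */
static double execute_phase(const double *a, const size_t *idx,
                            size_t begin, size_t end) {
    double sum = 0.0;
    for (size_t i = begin; i < end; i++)
        sum += 2.0 * a[idx[i]];
    return sum;
}

/* Original, coupled loop: data access and computation are interleaved. */
double coupled_sum(const double *a, const size_t *idx, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += 2.0 * a[idx[i]];
    return sum;
}

/* Decoupled version: each chunk runs an access phase, then an execute phase. */
double dae_sum(const double *a, const size_t *idx, size_t n) {
    assert(n == 0 || (a != NULL && idx != NULL));
    double total = 0.0;
    for (size_t begin = 0; begin < n; begin += CHUNK) {
        size_t end = (n - begin > CHUNK) ? begin + CHUNK : n;
        /* A DVFS-aware runtime could select a low frequency here ... */
        access_phase(a, idx, begin, end);
        /* ... and a high frequency here. */
        total += execute_phase(a, idx, begin, end);
    }
    return total;
}
```

Both functions compute the same reduction; because `dae_sum` accumulates per-chunk partial sums, floating-point results can differ from `coupled_sum` in the last bits for general inputs.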

DAE achieves good energy savings (on average 25% lower EDP) without performance degradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memory-level parallelism (MLP) of memory-bound regions, which results in performance improvements for memory-bound applications. To automatically transform application regions to DAE, we propose compiler techniques that generate the access phase(s) and incorporate them in the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments and to handle codes with statically unknown access-phase overheads. In general, applications automatically transformed to DAE by our compiler maintain (or in some cases even exceed) the good performance and energy efficiency of manually optimized DAE codes.

Finally, to ease the programming environment of heterogeneous systems (with integrated GPUs), we propose a novel system architecture that provides unified virtual memory with low overhead. The underlying insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous-race-free model). This allows us to simplify the coherency protocol between the CPU and the GPU, as well as the GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDP over the corresponding SC implementation.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. p. 54
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1405
Keywords
Decoupled Execution, Performance, Energy, DVFS, Compiler Optimizations, Heterogeneous Coherence
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-300831 (URN); 978-91-554-9654-8 (ISBN)
Public defence
2016-09-30, ITC/1111, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Opponent
Supervisors
Projects
UPMARC
Funder
EU, FP7, Seventh Framework Programme, FP7-ICT-288653; Swedish Research Council
Available from: 2016-09-07 Created: 2016-08-15 Last updated: 2019-02-25
Koukos, K., Ekemark, P., Zacharopoulos, G., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2016). Multiversioned decoupled access-execute: The key to energy-efficient compilation of general-purpose programs. In: Proc. 25th International Conference on Compiler Construction. Paper presented at CC 2016, March 17–18, Barcelona, Spain (pp. 121-131). New York: ACM Press
2016 (English). In: Proc. 25th International Conference on Compiler Construction, New York: ACM Press, 2016, p. 121-131. Conference paper, Published paper (Refereed)
Abstract [en]

Computer architecture design faces an era of great challenges in an attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management have become severely limited, and thus compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grain phases that are well matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method has proved efficient for statically analyzable codes, general-purpose applications pose significant challenges due to pointer aliasing, complex control flow, and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications, using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance boosts. Overall, our technique improves the energy-delay product (EDP) by over 20% on average (energy by over 15% and performance by over 5%) across 14 benchmarks from the SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.

Place, publisher, year, edition, pages
New York: ACM Press, 2016
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-283200 (URN); 10.1145/2892208.2892209 (DOI); 000389808800012 (ISI); 9781450342414 (ISBN)
Conference
CC 2016, March 17–18, Barcelona, Spain
Projects
UPMARC
Available from: 2016-03-17 Created: 2016-04-11 Last updated: 2018-12-03. Bibliographically approved
Waern, J., Ekemark, P., Koukos, K., Kaxiras, S. & Jimborean, A. (2016). Profiling-Assisted Decoupled Access-Execute. In: Proc. 4th International Workshop on High Performance Energy Efficient Embedded Systems. Paper presented at HIP3ES 2016, January 18, Prague, Czech Republic.
2016 (English). In: Proc. 4th International Workshop on High Performance Energy Efficient Embedded Systems, 2016. Conference paper, Published paper (Refereed)
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-286769 (URN)
Conference
HIP3ES 2016, January 18, Prague, Czech Republic
Projects
UPMARC
Available from: 2016-01-07 Created: 2016-04-21 Last updated: 2018-12-03. Bibliographically approved
Jimborean, A., Koukos, K., Spiliopoulos, V., Black-Schaffer, D. & Kaxiras, S. (2014). Fix the code. Don't tweak the hardware: A new compiler approach to Voltage–Frequency scaling. In: Proc. 12th International Symposium on Code Generation and Optimization. Paper presented at CGO 2014, February 15-19, Orlando, FL (pp. 262-272). New York: ACM Press
2014 (English). In: Proc. 12th International Symposium on Code Generation and Optimization, New York: ACM Press, 2014, p. 262-272. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2014
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-212778 (URN); 978-1-4503-2670-4 (ISBN)
Conference
CGO 2014, February 15-19, Orlando, FL
Projects
UPMARC
Available from: 2014-02-19 Created: 2013-12-13 Last updated: 2018-01-11. Bibliographically approved
Koukos, K., Black-Schaffer, D., Spiliopoulos, V. & Kaxiras, S. (2013). Towards more efficient execution: a decoupled access-execute approach. In: Proc. 27th ACM International Conference on Supercomputing. Paper presented at ICS 2013, June 10-14, Eugene, OR (pp. 253-262). New York: ACM Press
2013 (English). In: Proc. 27th ACM International Conference on Supercomputing, New York: ACM Press, 2013, p. 253-262. Conference paper, Published paper (Refereed)
Abstract [en]

The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.

The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency, while maintaining good performance. To demonstrate this we built a task based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model's accuracy.
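
To make the idea of per-phase voltage-frequency selection concrete, here is a toy model, not the paper's power models. Everything in it is an assumption made for illustration: the types `phase_t` and `vf_point_t`, the constant `K_CAP`, the rule that memory stall time does not scale with core frequency, the use of dynamic power only ($P = K V^2 f$, static power ignored), and the choice of minimum EDP as the selection criterion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one phase: core cycles that scale with
   frequency, plus memory stall time that (in this toy model) does not. */
typedef struct { double compute_cycles; double mem_seconds; } phase_t;

/* Hypothetical voltage-frequency operating point. */
typedef struct { double f_hz; double volts; } vf_point_t;

/* Assumed switched-capacitance constant in farads; illustrative only. */
#define K_CAP 1e-9

/* Execution time of the phase at the given operating point. */
double phase_time(phase_t p, vf_point_t o) {
    assert(o.f_hz > 0.0);
    return p.compute_cycles / o.f_hz + p.mem_seconds;
}

/* Dynamic energy only: P = K_CAP * V^2 * f and E = P * t. */
double phase_energy(phase_t p, vf_point_t o) {
    return K_CAP * o.volts * o.volts * o.f_hz * phase_time(p, o);
}

/* Energy-delay product of the phase at the given operating point. */
double phase_edp(phase_t p, vf_point_t o) {
    return phase_energy(p, o) * phase_time(p, o);
}

/* Index of the operating point with the lowest EDP for this phase. */
size_t best_point(phase_t p, const vf_point_t *pts, size_t n) {
    assert(pts != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (phase_edp(p, pts[i]) < phase_edp(p, pts[best]))
            best = i;
    return best;
}
```

With made-up operating points of 1 GHz at 0.8 V and 2 GHz at 1.0 V, a memory-dominated phase (for example 1e8 cycles plus 1 s of stalls) selects the low point, while a compute-dominated phase (2e9 cycles, no stalls) selects the high point, which mirrors the qualitative argument in the abstract.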

Based on real hardware measurements, we project that the combination of decoupled access-execute and DVFS has the potential to improve EDP by 25% without hurting performance. On memory-bound applications we significantly improve performance due to increased MLP in the access phase and ILP in the execute phase. Furthermore, we demonstrate that our method can achieve high performance both in the presence and in the absence of a hardware prefetcher.

Place, publisher, year, edition, pages
New York: ACM Press, 2013
Keywords
Task-Based Execution, Decoupled Execution, Performance, Energy, DVFS
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-203239 (URN); 10.1145/2464996.2465012 (DOI); 978-1-4503-2130-3 (ISBN)
Conference
ICS 2013, June 10-14, Eugene, OR
Projects
LPGPU FP7-ICT-288653; UPMARC
Funder
EU, FP7, Seventh Framework Programme, ICT-288653; Swedish Research Council
Available from: 2013-07-06 Created: 2013-07-05 Last updated: 2016-09-02. Bibliographically approved
Koukos, K., Black-Schaffer, D., Spiliopoulos, V. & Kaxiras, S. (2013). Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models. In: PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures. Paper presented at PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures, Berlin, Germany, January 23, 2013.
2013 (English). In: PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures, 2013. Conference paper, Published paper (Refereed)
Abstract [en]

This work demonstrates the potential of hardware and software optimization to improve the effectiveness of dynamic voltage and frequency scaling (DVFS). For software, we decouple data prefetch (access) and computation (execute) to enable optimal DVFS selection for each phase. For hardware, we use measurements from state-of-the-art multicore processors to accurately model the potential of per-core, zero-latency DVFS. We demonstrate that the combination of decoupled access-execute and precise DVFS has the potential to decrease EDP by 25-30% without reducing performance.

The underlying insight in this work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency. For the memory-bound access phase, where we prefetch data into the cache from main memory, we can run at a reduced frequency and voltage without hurting performance. Thereafter, the execute phase can run much faster, thanks to the prefetching of the access phase, and achieve higher performance. This decoupled program behavior allows us to achieve more effective use of DVFS than standard coupled executions, which mix data access and compute.

To understand the potential of this approach, we measure application performance and power consumption on a modern multicore system across a range of frequencies and voltages. From this data we build a model that allows us to analyze the effects of per-core, zero-latency DVFS. The results of this work demonstrate the significant potential for finer-grain DVFS in combination with DVFS-optimized software.

Keywords
DVFS, energy, decoupled execution, performance, task-based execution
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-203249 (URN)
Conference
PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures, Berlin, Germany, January 23, 2013
Available from: 2013-07-06 Created: 2013-07-06 Last updated: 2013-07-09. Bibliographically approved