Uppsala University Publications
Spiliopoulos, Vasileios
Publications (10 of 14)
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2018). Static instruction scheduling for high performance on limited hardware. IEEE Transactions on Computers, 67(4), 513-527
2018 (English) In: IEEE Transactions on Computers, ISSN 0018-9340, Vol. 67, no 4, p. 513-527. Article in journal (Refereed), Published
Abstract [en]

Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses, at the cost of increased energy consumption. Simple, limited OoO processors are a compromise between energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic block boundaries, Clairvoyance overlaps their outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geometric-mean execution time improvement of 14 percent for memory-bound applications, on top of standard O3 optimizations, while maintaining the high performance of compute-bound applications.
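
Load clustering can be illustrated with a toy list scheduler. The sketch below is not Clairvoyance itself, which also handles unknown dependencies and register pressure. It assumes a dependency-annotated instruction list and uses hypothetical names (`Instr`, `cluster_loads`). It issues every ready load before any other ready instruction, so that independent misses can overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    is_load: bool
    deps: tuple = ()  # names of instructions this one depends on


def cluster_loads(block):
    """Toy list scheduler (hypothetical): among ready instructions, prefer loads,
    so independent loads are issued back-to-back and their misses can overlap."""
    done = set()
    remaining = list(block)
    order = []
    while remaining:
        ready = [i for i in remaining if all(d in done for d in i.deps)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        # Earliest ready load in program order, else the earliest ready instruction.
        pick = next((i for i in ready if i.is_load), ready[0])
        order.append(pick.name)
        done.add(pick.name)
        remaining.remove(pick)
    return order


block = [
    Instr("a", True),           # a = load x
    Instr("b", False, ("a",)),  # b = a + 1
    Instr("c", True),           # c = load y
    Instr("d", False, ("c",)),  # d = c + 1
]
print(cluster_loads(block))  # ['a', 'c', 'b', 'd']
```

After reordering, the two loads sit back-to-back. On a limited OoO core this lets both misses be outstanding at once, instead of the first use of `a` stalling before `load y` is even issued.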

National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-334011 (URN); 10.1109/TC.2017.2769641 (DOI); 000427420800005
Projects
UPMARC
Funder
Swedish Research Council, 2016-05086
Available from: 2017-11-03; Created: 2017-11-20; Last updated: 2020-01-17; Bibliographically approved
Tran, K.-A., Carlson, T. E., Koukos, K., Själander, M., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2017). Clairvoyance: Look-ahead compile-time scheduling. In: Proc. 15th International Symposium on Code Generation and Optimization. Paper presented at CGO 2017, February 4–8, Austin, TX (pp. 171-184). Piscataway, NJ: IEEE Press
2017 (English) In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, p. 171-184. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-316480 (URN); 000402548700015; 978-1-5090-4931-8 (ISBN)
Conference
CGO 2017, February 4–8, Austin, TX
Projects
UPMARC
Funder
Swedish Research Council, 2010-4741
Available from: 2017-02-04; Created: 2017-03-01; Last updated: 2020-01-17; Bibliographically approved
Spiliopoulos, V., Sembrant, A., Keramidas, G., Hagersten, E. & Kaxiras, S. (2016). A unified DVFS-cache resizing framework.
2016 (English) Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2016-014
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-300840 (URN)
Available from: 2016-08-15; Created: 2016-08-15; Last updated: 2018-01-10; Bibliographically approved
Forsberg, B., Lampka, K. & Spiliopoulos, V. (2016). An Online Overclocking Scheme for Bursty Real-time Tasks and an Evaluation of its Thermal Impact. In: 14th ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia 2016). Paper presented at 14th ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia), October 6-7, 2016, Pittsburgh, PA (pp. 104-113).
2016 (English) In: 14th ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia 2016), 2016, p. 104-113. Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a scheme that drives a processor beyond its rated operating frequency, e.g., by exploiting Intel's boost technology, to process the peak workload of the system in time. For deadline-constrained workloads, this is far from trivial: the boost mode can only be used for short time spans, so it can only help to absorb the peak workload rather than serve the normal case. A lowered processor frequency, used outside the peak-workload periods, builds up a backlog of uncompleted jobs. This backlog may result in deadline violations or buffer overflows if the next burst of job arrivals comes too early. To overcome this problem, we propose a peak-workload-aware speed assignment strategy that only allows the system to build up a computation backlog if the absence of high computation demands is assured. In contrast to the existing body of work, we take advantage of bursty arrival patterns of compute jobs, thereby going beyond the standard (non-bursty, sporadic) job release model. Together with our scheme, we present a tool chain and simulations of synthetic workloads for investigating the thermal effects of different speed assignment strategies.
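
The backlog problem described above can be made concrete with a toy fluid model. The sketch below does not implement the paper's peak-workload-aware strategy. It assumes a simple, hypothetical policy: boost only while the backlog exceeds a threshold, and only for a limited number of time steps. The function name `simulate_backlog` and all parameters are illustrative.

```python
def simulate_backlog(demand, base_speed, boost_speed, boost_budget, threshold):
    """Toy fluid model (hypothetical policy): each time step, add the arriving
    work, boost only while the backlog exceeds `threshold` and boost steps
    remain, then serve at the chosen speed. Returns the backlog after each step."""
    backlog = 0.0
    budget = boost_budget  # number of time steps in which boost may be used
    trace = []
    for work in demand:
        backlog += work
        if backlog > threshold and budget > 0:
            speed = boost_speed
            budget -= 1
        else:
            speed = base_speed
        backlog = max(0.0, backlog - speed)
        trace.append(backlog)
    return trace


# A burst (6, 6) arrives; two boosted steps cannot clear it immediately.
print(simulate_backlog([1, 1, 6, 6, 1, 1], base_speed=2, boost_speed=4,
                       boost_budget=2, threshold=3))
# [0.0, 0.0, 2.0, 4.0, 3.0, 2.0]
```

Once the boost budget is exhausted, the residual backlog drains only at the base speed. This is the situation in which a next burst arriving too early would cause deadline misses.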

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:uu:diva-315016 (URN); 10.1145/2993452.2993568 (DOI); 000390612200004; 9781450345439 (ISBN)
Conference
14th ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia), October 6-7, 2016, Pittsburgh, PA
Available from: 2017-02-08; Created: 2017-02-08; Last updated: 2018-01-13; Bibliographically approved
Spiliopoulos, V. (2016). Improving Energy-Efficiency of Multicores using First-Order Modeling. (Doctoral dissertation). Uppsala: Acta Universitatis Upsaliensis
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In recent decades, power consumption has become one of the most critical resources in a computer system. Whether in the form of electricity bills in data centers, battery life in mobile devices, or thermal constraints in desktops and laptops, power consumption imposes several limitations on today’s processors, and improving power and energy efficiency is one of the most urgent research topics in computer architecture.

Dynamic Voltage and Frequency Scaling (DVFS) and Cache Resizing are among the most popular energy-saving techniques. Previous work, however, has focused on developing heuristics and trial-and-error methods that yield acceptable savings but fail to provide insight into how these techniques affect the power and performance of a computer system. In contrast, this thesis proposes the use of first-order modeling to improve the energy efficiency of computer systems. A first-order model needs to be (i) accurate enough to efficiently drive DVFS and Cache Resizing decisions, and (ii) simple enough to eliminate the overhead of collecting the required inputs to the model. We show that such models can be constructed and successfully applied in modern systems.

For DVFS, we propose to scale frequency down to exploit applications’ memory slack, i.e., periods during which the processor waits for data to be fetched from main memory. During such periods, the processor frequency can be scaled down to save energy without an inordinate performance penalty. Our DVFS models can detect slack and predict the impact of DVFS on both power and performance with high accuracy. Cache Resizing, on the other hand, relies on the fact that many applications do not benefit from the large caches with which modern processors are equipped. In such cases, the cache can be downsized to reduce static energy consumption at a limited performance cost. Since both techniques are related to the memory behavior of applications, we propose a unified model that manages the two techniques in tandem and maximizes energy efficiency through synergistic DVFS and Cache Resizing.
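
The intuition behind exploiting memory slack fits in a first-order model. Only the part of the runtime spent computing scales with core frequency, while memory stall time stays roughly constant. The sketch below uses that textbook split together with a simple miss-curve rule for cache sizing. Both functions (`predict_time`, `smallest_adequate_cache`) and their inputs are illustrative assumptions, not the models developed in the thesis.

```python
def predict_time(t_total, t_mem, f_old, f_new):
    """First-order DVFS model (assumed): only the non-memory part of the
    runtime scales with core frequency; memory stall time (the slack) is
    treated as frequency-independent."""
    t_core = t_total - t_mem
    return t_core * f_old / f_new + t_mem


def smallest_adequate_cache(miss_curve, tolerance):
    """Pick the smallest cache size whose miss rate is within `tolerance`
    (relative) of the miss rate at the largest size. `miss_curve` maps
    size -> misses per kilo-instruction (hypothetical input)."""
    full = miss_curve[max(miss_curve)]
    for size in sorted(miss_curve):
        if miss_curve[size] <= full * (1 + tolerance):
            return size


# 60% of the runtime is memory stall: halving frequency costs 1.4x, not 2x.
print(predict_time(10.0, 6.0, 2.0, 1.0))  # 14.0
print(smallest_adequate_cache({512: 9.0, 1024: 5.2, 2048: 5.0, 4096: 5.0}, 0.05))  # 1024
```

The more of an application's time is memory stall, the smaller the slowdown from a lower frequency. This is why slack detection is what makes frequency scaling cheap in performance terms.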

Finally, our experience with DVFS on real systems motivated us to contribute to the integration of DVFS into the gem5 simulator. Unlike other simulators, which ignore the role of the OS in DVFS, we extend gem5 with the hardware and software components that allow the existing Linux DVFS infrastructure to be seamlessly integrated into the simulator.
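
On Linux, the cpufreq subsystem exposes each CPU's frequency policy through sysfs under `/sys/devices/system/cpu/cpuN/cpufreq/`. Files such as `scaling_governor` and `scaling_cur_freq` report the active governor and the current frequency in kHz. This is the existing infrastructure that a full-system simulator with OS-visible DVFS can expose unchanged. The sketch below reads those files. The helper `read_cpufreq` is hypothetical, and the `root` parameter is an assumption that lets it point at a copy of the tree. Whether a particular simulated or real system exposes every file depends on its kernel and cpufreq driver.

```python
from pathlib import Path


def read_cpufreq(cpu=0, root="/sys/devices/system/cpu"):
    """Read one CPU's cpufreq policy from the Linux sysfs interface.
    The kernel reports frequencies in kHz."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"

    def read(name):
        return (base / name).read_text().strip()

    return {
        "governor": read("scaling_governor"),
        "cur_khz": int(read("scaling_cur_freq")),
        "min_khz": int(read("scaling_min_freq")),
        "max_khz": int(read("scaling_max_freq")),
    }
```

On a Linux machine with cpufreq enabled, `read_cpufreq(0)` returns the governor name and the current, minimum and maximum frequencies of CPU 0.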

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. p. 52
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1404
Keywords
Computer Architecture, DVFS, Cache Resizing, Interval modeling, Power modeling
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-300947 (URN); 978-91-554-9652-4 (ISBN)
Public defence
2016-09-29, ITC/2446, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Projects
UPMARC
Available from: 2016-09-06; Created: 2016-08-16; Last updated: 2019-02-25
Lampka, K., Forsberg, B. & Spiliopoulos, V. (2016). Keep it cool and in time: With runtime monitoring to thermal-aware execution speeds for deadline constrained systems. Journal of Parallel and Distributed Computing, 95, 79-91
2016 (English) In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 95, p. 79-91. Article in journal (Refereed), Published
Abstract [en]

The Dynamic Power and Thermal Management (DPTM) system of processors that support Dynamic Voltage and Frequency Scaling (DVFS) counters peak temperatures by slowing down, or even powering down, parts of the system. While this ensures the integrity of computations, it comes at the cost of lost performance. In hard real-time systems, such unpredictable performance losses are unacceptable, as they may lead to deadline misses, which in turn may compromise the integrity of the system. To safely execute hard real-time workloads on such systems, this article presents an online scheme for assigning speeds such that (a) the system executes at a low clock speed as often as possible, while (b) deadline violations are strictly ruled out. The proposed scheme is compared with an offline scheme that has complete knowledge of the arrival times and execution demands of the workload. The benchmarking shows that, for a workload that is always very close to the modelled maximum, our approach performs on par with the offline scheme. For a workload that diverges from the modelled maximum more often, the speed assignments produced by our scheme become more pessimistic, so as to ensure that all deadlines are met.
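
The core constraint in speed assignment ("run slowly, but never miss a deadline") can be expressed with a standard processor-demand check for EDF scheduling. For already-released jobs on one core at constant speed, all deadlines are met if, for every deadline, the work due by then fits into the time remaining until it. The sketch below applies that check. It deliberately ignores work that has not yet arrived, which the article's online scheme must account for through its modelled maximum workload; that part is not reproduced here. `min_safe_speed` is a hypothetical name.

```python
def min_safe_speed(now, jobs, speeds):
    """Lowest available speed at which EDF meets every deadline, assuming all
    pending jobs are already released and no further work arrives.
    jobs: list of (remaining_work, absolute_deadline); time = work / speed."""
    required = 0.0
    cumulative = 0.0
    for work, deadline in sorted(jobs, key=lambda job: job[1]):
        window = deadline - now
        if window <= 0:
            raise ValueError("a deadline has already passed")
        cumulative += work
        # Processor-demand check: all work due by `deadline` must fit before it.
        required = max(required, cumulative / window)
    for speed in sorted(speeds):
        if speed >= required:
            return speed
    raise ValueError("no available speed meets all deadlines")


print(min_safe_speed(0, [(2, 4), (3, 5)], [0.5, 1.0, 1.5]))  # 1.0
```

Choosing the lowest speed that passes the check corresponds to goal (a) above, and the check itself enforces goal (b) for the jobs it knows about.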

Keywords
Real-time computing; Multicore architectures; Dynamic Voltage Frequency Scaling; Dynamic power and temperature management; Run-time monitoring; Online real-time scheduling
National Category
Computer Engineering
Research subject
Computer Science with specialization in Embedded Systems
Identifiers
urn:nbn:se:uu:diva-283423 (URN); 10.1016/j.jpdc.2016.03.002 (DOI); 000378977800008
Available from: 2016-03-18; Created: 2016-04-12; Last updated: 2018-01-10; Bibliographically approved
Koukos, K., Ekemark, P., Zacharopoulos, G., Spiliopoulos, V., Kaxiras, S. & Jimborean, A. (2016). Multiversioned decoupled access-execute: The key to energy-efficient compilation of general-purpose programs. In: Proc. 25th International Conference on Compiler Construction. Paper presented at CC 2016, March 17–18, Barcelona, Spain (pp. 121-131). New York: ACM Press
2016 (English) In: Proc. 25th International Conference on Compiler Construction, New York: ACM Press, 2016, p. 121-131. Conference paper, Published paper (Refereed)
Abstract [en]

Computer architecture design faces an era of great challenges in an attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management are becoming severely limited, and thus compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grain phases that are well matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method has proven efficient for statically analyzable codes, general-purpose applications pose significant challenges due to pointer aliasing, complex control flow, and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance boosts. Overall, our technique achieves an average energy-delay-product (EDP) improvement of over 20% (over 15% in energy and over 5% in performance) across 14 benchmarks from the SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.
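
The shape of a software-decoupled loop can be shown by hand. Each chunk of iterations first runs an access phase, which only computes addresses and prefetches and performs no stores, so it has no side effects. It then runs an execute phase, which is the original computation on data that should now be cached. The access phase is memory-bound and suits a low frequency, while the execute phase suits a high one. Python cannot issue hardware prefetches, so `prefetch` below is a hypothetical hook. The function names and the chunk size are also illustrative assumptions; the paper's compiler derives this split automatically.

```python
def coupled(a, idx, out):
    """Original loop: indirect loads and computation interleaved."""
    for i, j in enumerate(idx):
        out[i] = a[j] * 2


def decoupled(a, idx, out, chunk=4, prefetch=lambda array, j: None):
    """Software DAE sketch: per chunk, an access phase (address computation and
    prefetches only, no stores) followed by an execute phase (the original
    computation). `prefetch` stands in for a hardware prefetch instruction."""
    for start in range(0, len(idx), chunk):
        stop = min(start + chunk, len(idx))
        # Access phase: memory-bound, a candidate for a low frequency.
        for i in range(start, stop):
            prefetch(a, idx[i])
        # Execute phase: compute-bound once data is cached, a candidate for a high frequency.
        for i in range(start, stop):
            out[i] = a[idx[i]] * 2
```

Because the access phase writes nothing, running it cannot change the program's results. This is what makes the split safe even when aliasing cannot be resolved statically. The chunk size bounds how much prefetched data must stay in the cache until the execute phase consumes it.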

Place, publisher, year, edition, pages
New York: ACM Press, 2016
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-283200 (URN); 10.1145/2892208.2892209 (DOI); 000389808800012; 9781450342414 (ISBN)
Conference
CC 2016, March 17–18, Barcelona, Spain
Projects
UPMARC
Available from: 2016-03-17; Created: 2016-04-11; Last updated: 2018-12-03; Bibliographically approved
Jimborean, A., Koukos, K., Spiliopoulos, V., Black-Schaffer, D. & Kaxiras, S. (2014). Fix the code. Don't tweak the hardware: A new compiler approach to Voltage–Frequency scaling. In: Proc. 12th International Symposium on Code Generation and Optimization. Paper presented at CGO 2014, February 15-19, Orlando, FL (pp. 262-272). New York: ACM Press
2014 (English) In: Proc. 12th International Symposium on Code Generation and Optimization, New York: ACM Press, 2014, p. 262-272. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2014
National Category
Computer Sciences
Identifiers
urn:nbn:se:uu:diva-212778 (URN); 978-1-4503-2670-4 (ISBN)
Conference
CGO 2014, February 15-19, Orlando, FL
Projects
UPMARC
Available from: 2014-02-19; Created: 2013-12-13; Last updated: 2018-01-11; Bibliographically approved
Spiliopoulos, V., Bagdia, A., Hansson, A., Aldworth, P. & Kaxiras, S. (2013). Introducing DVFS-Management in a Full-System Simulator. In: Proc. 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. Paper presented at MASCOTS 2013. IEEE Computer Society
2013 (English) In: Proc. 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, IEEE Computer Society, 2013. Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic Voltage and Frequency Scaling (DVFS) is an essential part of controlling the power consumption of any computer system, from mobile phones to servers. DVFS efficiency relies on hardware-software co-optimization; thus, experiments on existing hardware cannot reveal the full optimization potential beyond the current implementation’s characteristics. To explore the vast design space for DVFS efficiency, which straddles software and hardware, a simulation infrastructure must provide features that are not readily available today, for example: software-controllable clock and voltage domains, support for the OS and its frequency-scaling module, and an online power estimation methodology. As the main contribution, this work enables DVFS studies in a full-system simulator: we extend the gem5 simulator to support full-system DVFS modeling. By doing so, we enable energy-efficiency experiments to be performed in gem5, and we showcase such studies. Finally, we show that both existing and novel frequency governors for Linux and Android can be effortlessly integrated into the framework, and we evaluate the efficiency of different DVFS schemes.
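
A frequency governor is a policy that maps observed CPU load to a target frequency. The sketch below is loosely modelled on the Linux ondemand governor. Above an up-threshold it jumps to the maximum frequency; otherwise it scales linearly with load between the minimum and maximum, then snaps to an available operating point. The real governor has further tunables (for example, sampling rate and powersave bias). The function name `ondemand_like`, the threshold value, and the choice to round up are assumptions made here for illustration.

```python
def ondemand_like(load_pct, min_khz, max_khz, available_khz, up_threshold=80):
    """Simplified load-based governor: jump to the maximum frequency above
    `up_threshold`, otherwise scale linearly with load, then round up to the
    nearest available operating point."""
    if load_pct > up_threshold:
        target = max_khz
    else:
        target = min_khz + load_pct * (max_khz - min_khz) / 100
    for f in sorted(available_khz):
        if f >= target:
            return f
    return max(available_khz)


levels = [800_000, 1_200_000, 1_600_000, 2_000_000]
print(ondemand_like(50, 800_000, 2_000_000, levels))  # 1600000
```

Rounding up rather than to the nearest level trades a little energy for never undershooting the target. A governor written this way needs only load samples and the list of operating points, which is what an OS-visible simulator has to provide for existing governors to run unmodified.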

Place, publisher, year, edition, pages
IEEE Computer Society, 2013
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-212809 (URN)
Conference
MASCOTS 2013
Projects
UPMARC
Available from: 2013-12-15; Created: 2013-12-15; Last updated: 2016-09-02
Koukos, K., Black-Schaffer, D., Spiliopoulos, V. & Kaxiras, S. (2013). Towards more efficient execution: a decoupled access-execute approach. In: Proc. 27th ACM International Conference on Supercomputing. Paper presented at ICS 2013, June 10-14, Eugene, OR (pp. 253-262). New York: ACM Press
2013 (English) In: Proc. 27th ACM International Conference on Supercomputing, New York: ACM Press, 2013, p. 253-262. Conference paper, Published paper (Refereed)
Abstract [en]

The end of Dennard scaling is expected to shrink the range of DVFS in future technology nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling data access from execution allows us to apply optimal voltage-frequency selection to each phase and therefore improve energy efficiency over standard coupled execution.

The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency while maintaining good performance. To demonstrate this, we built a task-based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model's accuracy.

Based on real hardware measurements, we project that the combination of decoupled access-execute and DVFS has the potential to improve EDP by 25% without hurting performance. On memory-bound applications, we significantly improve performance due to increased memory-level parallelism (MLP) in the access phase and instruction-level parallelism (ILP) in the execute phase. Furthermore, we demonstrate that our method can achieve high performance both in the presence and in the absence of a hardware prefetcher.
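
Per-phase voltage-frequency selection can be illustrated with a deliberately simple first-order model. Runtime is compute cycles divided by frequency plus a frequency-independent memory time. Power is static power plus C·V²·f, and the operating point with the lowest energy-delay product is picked for each phase. The constants, the two operating points, and the function names (`phase_edp`, `best_operating_point`) are assumptions for illustration, not the power models used in the paper. Note that minimizing each phase's EDP separately is also a simplification of minimizing the EDP of the whole run.

```python
def phase_edp(compute_gcycles, mem_time_s, f_ghz, volts,
              p_static_w=1.0, c_eff=2.0):
    """First-order model (assumed constants): runtime = cycles / f + memory time,
    power = static + C * V^2 * f, EDP = energy * delay = power * time^2."""
    time_s = compute_gcycles / f_ghz + mem_time_s
    power_w = p_static_w + c_eff * volts ** 2 * f_ghz
    return power_w * time_s * time_s


def best_operating_point(compute_gcycles, mem_time_s, points):
    """Pick the (f_ghz, volts) pair with the lowest EDP for one phase."""
    return min(points, key=lambda p: phase_edp(compute_gcycles, mem_time_s, *p))


points = [(1.0, 0.8), (2.0, 1.0)]
print(best_operating_point(0.2, 1.0, points))  # access phase  -> (1.0, 0.8)
print(best_operating_point(2.0, 0.0, points))  # execute phase -> (2.0, 1.0)
```

Under these assumed numbers, the memory-bound access phase gains almost nothing from the higher frequency and prefers the low operating point. The compute-bound execute phase halves its runtime at the high point and prefers it. This mirrors the insight stated in the abstract.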

Place, publisher, year, edition, pages
New York: ACM Press, 2013
Keywords
Task-Based Execution, Decoupled Execution, Performance, Energy, DVFS
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-203239 (URN); 10.1145/2464996.2465012 (DOI); 978-1-4503-2130-3 (ISBN)
Conference
ICS 2013, June 10-14, Eugene, OR
Projects
LPGPU FP7-ICT-288653; UPMARC
Funder
EU, FP7, Seventh Framework Programme, ICT-288653; Swedish Research Council
Available from: 2013-07-06; Created: 2013-07-05; Last updated: 2016-09-02; Bibliographically approved