uu.seUppsala University Publications
Change search
Refine search result
1 - 49 of 49
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Rows per page
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sort
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
Select
The maximal number of hits you can export is 250. When you want to export more records please use the 'Create feeds' function.
  • 1.
    Abdulla, Parosh Aziz
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Atig, Mohamed Faouzi
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Leonardsson, Carl
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Ros, Alberto
    Zhu, Yunyun
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Fencing programs with self-invalidation and self-downgrade2016In: Formal Techniques for Distributed Objects, Components, and Systems, Springer, 2016, 19-35 p.Conference paper (Refereed)
  • 2.
    Carlson, Trevor E.
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Heirman, Wim
    Intel, ExaSci Lab, Santa Clara, CA USA..
    Allam, Osman
    Univ Ghent, B-9000 Ghent, Belgium..
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Eeckhout, Lieven
    Univ Ghent, B-9000 Ghent, Belgium..
    The Load Slice Core Microarchitecture2015In: 2015 ACM/IEEE 42Nd Annual International Symposium On Computer Architecture (ISCA), 2015, 272-284 p.Conference paper (Refereed)
    Abstract [en]

    Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7x, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.

  • 3.
    Carlson, Trevor E.
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Tran, Kim-Anh
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Koukos, Konstantinos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. Norwegian University of Science and Technology.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Transcending Hardware Limits with Software Out-of-order ProcessingIn: IEEE Computer Architecture LettersArticle in journal (Refereed)
    Abstract [en]

    Building high-performance, next-generation processors require novel techniques to allow improved performance given today’s power- and energy-efficiency requirements. Additionally, a widening gap between processor and memory performance makes it even more difficult to improve efficiency with conventional techniques. While out-of-order architectures attempt to hide this memory latency with dynamically reordered instructions, they lack the energy efficiency seen in in-order processors. Thus, our goal is to reorder the instruction stream to avoid stalls and improve utilization for energy efficiency and performance. To accomplish this goal, we propose an enhanced stall-on-use in-order core that improves energy efficiency (and therefore performance in these power-limited designes) through out-of-program-order execution. During long latency loads, the SWOOP (Software Out-of-Order Processing) core exposes additional memory- and instruction-level parallelism to perform useful, non-speculative work. The resulting instruction lookahead of the SWOOP core reaches beyond the conventional fixed-sized processor structures with the help of transparent hardware register contexts. Our results show that SWOOP demonstrates a 34% performance improvement on average compared with an in-order, stall-on-use core, with an energy reduction of 23%.

  • 4. Cebrian, Juan M.
    et al.
    Sanchez, Daniel
    Aragon, Juan L.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Efficient inter-core power and thermal balancing for multicore processors2013In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 95, no 7, 537-566 p.Article in journal (Refereed)
    Abstract [en]

    Nowadays the market is dominated by processor architectures that employ multiple cores per chip. These architectures have different behavior depending on the applications running on the processor (parallel, multiprogrammed, sequential), but all happen to meet what is called the power and temperature wall. For future technologies (less than 22 nm) and a fixed die size, it is still uncertain the percentage of processor that can be simultaneously powered on. Power saving and power budget mechanisms can be useful to precisely control the amount of power been dissipated by the processor. After an initial analysis we discover that legacy power saving techniques work properly for matching a power budget in thread-independent and multi-programmed workloads, but not in parallel workloads. When running parallel shared-memory applications sacrificing some performance in a single core (thread) in order to be more energy-efficient can unintentionally delay the rest of cores (threads) due to synchronization points (locks/barriers), having a negative impact on global performance. In order to solve this problem we propose power token balancing (PTB) aimed at accurately matching an external power constraint by balancing the power consumed among the different cores. Experimental results show that PTB matches more accurately a predefined power budget (50 % of the original peak power) than other mechanisms like DVFS. The total energy consumed over the budget is reduced to only 8 % for a 16-core CMP with only a 3 % energy increase (overhead). We also introduce a novel mechanism named "Nitro". Nitro will overclock the core that enters a critical section (delimited by locks) in order to free the lock as soon as possible. Experimental results have shown that Nitro is able to reduce the execution time of lock-intensive applications in more than 4 % by overclocking the frequency by 15 % in selected program phases over a period of time that represents a 22 % of the total execution time. We conclude the work with an analysis of the thermal effects of PTB in different CMP configurations using realistic power numbers and heatsink/fan configurations. Results show how PTB not only balances temperature between the different cores, reducing temperature gradient and increasing signal reliability, but also allows a reduction of 28-30 % of both average and peak temperatures for the studied benchmarks when a peak power budget of 50 % is exceeded.

  • 5. Cebrian, Juan M.
    et al.
    Sanchez, Daniel
    Aragon, Juan L.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Managing power constraints in a single-core scenario through power tokens2014In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 68, no 1, 414-442 p.Article in journal (Refereed)
    Abstract [en]

    Current microprocessors face constant thermal and power-related problems during their everyday use, usually solved by applying a power budget to the processor/core. Dynamic voltage and frequency scaling (DVFS) has been an effective technique that allowed microprocessors to match a predefined power budget. However, the continuous increase of leakage power due to technology scaling along with low resolution of DVFS makes it less attractive as a technique to match a predefined power budget as technology goes to deep-submicron. In this paper, we propose the use of microarchitectural techniques to accurately match a power constraint while maximizing the energy-efficiency of the processor. We will predict the processor power dissipation at cycle level (power token throttling) or at a basic block level (basic block level mechanism), using the dissipated power translated into tokens to select between different power-saving microarchitectural techniques. We also introduce a two-level approach in which DVFS acts as a coarse-grain technique to lower the average power dissipation towards the power budget, while microarchitectural techniques focus on removing the numerous power spikes. Experimental results show that the use of power-saving microarchitectural techniques in conjunction with DVFS is up to six times more precise, in terms of total energy consumed over the power budget, than only using DVFS to match a predefined power budget.

  • 6. Cebrián, Juan M.
    et al.
    Aragón, Juan L.
    García, José M.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Leakage-efficient design of value predictors through state and non-state preserving techniques2011In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 55, no 1, 28-50 p.Article in journal (Refereed)
  • 7. Cebrián, Juan M.
    et al.
    Aragón, Juan L.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Power Token Balancing: Adapting CMPs to power constraints for parallel multithreaded workloads2011In: Proc. 25th International Parallel and Distributed Processing Symposium, Piscataway, NJ: IEEE , 2011, 431-442 p.Conference paper (Refereed)
  • 8.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Scope-Aware Classification: Taking the hierarchical private/shared data classification to the next level2017Report (Other academic)
  • 9.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    The best of both works: A hybrid data-race-free cache coherence scheme2017Report (Other academic)
  • 10.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ros, Alberto
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    An efficient, self-contained, on-chip directory: DIR1-SISD2015In: Proc. 24th International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, 2015, 317-330 p.Conference paper (Refereed)
  • 11.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ros, Alberto
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Effects of Granularity/Adaptivity on Private/Shared Classification for Coherence2015Conference paper (Other academic)
  • 12.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ros, Alberto
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    The effects of granularity and adaptivity on private/shared classification for coherence2015In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 12, no 3, 26Article in journal (Refereed)
    Abstract [en]

    Classification of data into private and shared has proven to be a catalyst for techniques to reduce coherence cost, since private data can be taken out of coherence and resources can be concentrated on providing coherence for shared data. In this article, we examine how granularity-page-level versus cache-line level- and adaptivity-going from shared to private-affect the outcome of classification and its final impact on coherence. We create a classification technique, called Generational Classification, and a coherence protocol called Generational Coherence, which treats data as private or shared based on cache-line generations. We compare two coherence protocols based on self-invalidation/self-downgrade with respect to data classification. Our findings are enlightening: (i) Some programs benefit from finer granularity, some benefit further from adaptivity, but some do not benefit from either. (ii) Reducing the amount of shared data has no perceptible impact on coherence misses caused by self-invalidation of shared data, hence no impact on performance. (iii) In contrast, classifying more data as private has implications for protocols that employ write-through as a means of self-downgrade, resulting in network traffic reduction-up to 30%-by reducing write-through traffic.

  • 13.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Ros, Alberto
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence2014Conference paper (Refereed)
  • 14.
    Davari, Mahdad
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Ros, Alberto
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    System and method for data classification and efficient virtual cache coherence without reverse translation2013Patent (Other (popular science, discussion, etc.))
    Abstract [en]

    An on-chip memory hierarchy organization for a multicore processing system is disclosed. The hierarchy supports virtual- addressed private caches and a physical-addressed shared cache. The hierarchy classifies cache line data as private or shared to support a one-directional request response protocol. The classification can be determined from the generational behavior of a cache line in the private caches. Cache lines having a single generation in a private cache are Private, and cache lines having overlapping generations in two or more private caches are Shared. The Private or Shared classification is performed dynamically at run-time in hardware using a single translation lookaside buffer at the interface between the private and shared caches. The coherence protocol uses the data classification in a dynamic write policy for both shared data race free data and private data, differentiating in when data is put back to the shared cache based on the classification.

  • 15.
    Jimborean, Alexandra
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computing Science.
    Koukos, Konstantinos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Fix the code. Don't tweak the hardware: A new compiler approach to Voltage–Frequency scaling2014In: Proc. 12th International Symposium on Code Generation and Optimization, New York: ACM Press, 2014, 262-272 p.Conference paper (Refereed)
  • 16.
    Jimborean, Alexandra
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Waern, Jonatan
    Ekemark, Per
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ros, Alberto
    Automatic detection of extended data-race-free regions2017In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, 14-26 p.Conference paper (Refereed)
  • 17.
    Kaxiras, Stefanos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Keramidas, Georgios
    SARC coherence: Scaling directory cache coherence in performance and power2010In: IEEE Micro, ISSN 0272-1732, Vol. 30, no 5, 54-65 p.Article in journal (Refereed)
  • 18. Kaxiras, Stefanos
    et al.
    Ros, A.
    Efficient, snoopless, System-on-Chip coherence2012In: SOC Conference (SOCC), 2012 IEEE International, 2012, 230-235 p.Conference paper (Refereed)
    Abstract [en]

    Coherence in a System-on-Chip (SoC) introduces complexity and overhead (snooping caches/directory, state bits, invalidations, etc.) in exchange for a clean and uniform shared memory model. As it is typical today, a SoC comprises a variety of cores with local caches, accelerators with local memories, and some form of shared last-level cache (LLC), all interconnected with shared buses. We propose a very simple coherence protocol, fit for this environment, that eliminates L1 snooping and its associated complexity and costs (power). In essence, we remove all coherence decisions from local caches by simply determining at the LLC whether data are private or shared. This makes a write-through policy a practical and effective alternative to maintain coherence. In the local caches, we dynamically select between writeback for private data, or write-through for shared data. Self-invalidation of the shared data on synchronization points eliminates the need to snoop, with just a data-race-free guarantee from software. Our evaluation shows that this simple protocol outperforms a traditional snooping protocol while at the same time significantly reducing L1, shared cache, and bus energy consumption.

  • 19.
    Kaxiras, Stefanos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Ros, Alberto
    A New Perspective for Efficient Virtual-Cache Coherence2013In: Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013, 535-546 p.Conference paper (Refereed)
    Abstract [en]

    Coherent shared virtual memory (cSVM) is highly coveted for heterogeneous architectures as it will simplify program- ming across different cores and manycore accelerators. In this context, virtual L1 caches can be used to great advan- tage, e.g., saving energy consumption by eliminating address translation for hits. Unfortunately, multicore virtual-cache coherence is complex and costly because it requires reverse translation for any coherence request directed towards a vir- tual L1. The reason is the ambiguity of the virtual address due to the possibility of synonyms. In this paper, we take a radically different approach than all prior work which is focused on reverse translation. We examine the problem from the perspective of the coherence protocol. We show that if a coherence protocol adheres to certain conditions, it operates effortlessly with virtual caches, without requir- ing reverse translations even in the presence of synonyms. We show that these conditions hold in a new class of simple and efficient request-response protocols that use both self- invalidation and self-downgrade.This results in a new solu- tion for virtual-cache coherence, significantly less complex and more efficient than prior proposals. We study design choices for TLB placement under our proposal and compare them against those under a directory-MESI protocol. Our approach allows for choices that are particularly effective as for example combining all per-core TLBs in a single logical TLB in front of the last level cache. Significant area, energy, and performance benefits ensue as a result of simplifying the entire multicore memory organization. 

  • 20. Keramidas, Georgios
    et al.
    Petoumenos, Pavlos
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Where replacement algorithms fail: a thorough analysis2010In: Proc. 7th International Conference on Computing Frontiers, New York: ACM Press , 2010, 141-150 p.Conference paper (Refereed)
  • 21. Keramidas, Georgios
    et al.
    Spiliopoulos, Vasileios
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Interval-based models for run-time DVFS orchestration in superscalar processors2010In: Proc. 7th International Conference on Computing Frontiers, New York: ACM Press , 2010, 287-296 p.Conference paper (Refereed)
  • 22.
    Koukos, Konstantinos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Towards more efficient execution: a decoupled access-execute approach2013In: Proc. 27th ACM International Conference on Supercomputing, New York: ACM Press, 2013, 253-262 p.Conference paper (Refereed)
    Abstract [en]

    The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.

    The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency, while maintaining good performance. To demonstrate this we built a task based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model's accuracy.

    Based on real hardware measurements we project that the combination of decoupled access-execute and DVFS has the potential to improve EDP by 25% without hurting performance. On memory-bound applications we significantly improve performance due to increased MLP in the access phase and ILP in the execute phase. Furthermore we demonstrate that our method can achieve high performance both in presence or absence of a hardware prefetcher.

  • 23.
    Koukos, Konstantinos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models2013In: PARMA 2013, 4th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures, 2013Conference paper (Refereed)
    Abstract [en]

    This work demonstrates the potential of hardware and software optimization to improve theeffectiveness of dynamic voltage and frequency scaling (DVFS). For software, we decouple data prefetch (access) and computation (execute) to enable optimal DVFS selectionfor each phase. For hardware, we use measurements from state-of-the-art multicore processors to accurately model the potential of per-core, zero-latency DVFS. We demonstrate that the combinationof decoupled access-execute and precise DVFS has the potential to decrease EDP by 25-30% without reducing performance.

    The underlying insight in this work is that by decoupling access and execute we can take advantageof the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency. For the memory-bound access phase, where we prefetch data into the cachefrom main memory, we can run at a reduced frequency and voltage without hurting performance. Thereafter, the execute phase can run much faster, thanks to the prefetching of the access phase, and achieve higher performance. This decoupled program behavior allows us to achieve more effective use of DVFS than standard coupled executions which mix data access and compute.

    To understand the potential of this approach, we measure application performance and power consumption on a modern multicore system across a range of frequencies and voltages. From this data we build a model that allows us to analyze the effects of per-core, zero-latency DVFS. The results of this work demonstrate the significant potential for finer-grain DVFS in combination with DVFS-optimized software.

  • 24.
    Koukos, Konstantinos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ekemark, Per
    Zacharopoulos, Georgios
    Switzerland Univ Svizzera Italiana, Lugano, Switzerland.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Multiversioned decoupled access-execute: The key to energy-efficient compilation of general-purpose programs2016In: Proc. 25th International Conference on Compiler Construction, New York: ACM Press, 2016, 121-131 p.Conference paper (Refereed)
    Abstract [en]

    Computer architecture design faces an era of great challenges in an attempt to simultaneously improve performance and energy efficiency. Previous hardware techniques for energy management become severely limited, and thus, compilers play an essential role in matching the software to the more restricted hardware capabilities. One promising approach is software decoupled access-execute (DAE), in which the compiler transforms the code into coarse-grain phases that are well-matched to the Dynamic Voltage and Frequency Scaling (DVFS) capabilities of the hardware. While this method is proved efficient for statically analyzable codes, general purpose applications pose significant challenges due to pointer aliasing, complex control flow and unknown runtime events. We propose a universal compile-time method to decouple general-purpose applications, using simple but efficient heuristics. Our solutions overcome the challenges of complex code and show that automatic decoupled execution significantly reduces the energy expenditure of irregular or memory-bound applications and even yields slight performance boosts. Overall, our technique achieves over 20% on average energy-delay-product (EDP) improvements (energy over 15% and performance over 5%) across 14 bench-marks from SPEC CPU 2006 and Parboil benchmark suites, with peak EDP improvements surpassing 70%.

  • 25.
    Koukos, Konstantinos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ros, Alberto
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead2016In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 13, no 1, 1Article in journal (Refereed)
    Abstract [en]

    This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Research proposals implement coherence protocols for sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. The use of HRF simplifies the coherency protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes CPU and GPU demands separately, with the GPU part being simpler while the CPU is more elaborate and latency aware. We achieve an average 45% speedup and 45% energy-delay product reduction (20% energy) over the corresponding SC implementation.

  • 26. Petoumenos, Pavlos
    et al.
    Psychou, Georgia
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Cebrián Gonzalez, Juan Manuel
    Aragón, Juan Luis
    MLP-aware instruction queue resizing: The key to power-efficient performance2010In: Architecture of Computing Systems – ARCS 2010, Berlin: Springer-Verlag , 2010, 113-125 p.Conference paper (Refereed)
  • 27. Ros, Alberto
    et al.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Alipour, Mehdi
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Non-speculative load-load reordering in TSO2017In: Proc. 44th International Symposium on Computer Architecture, New York: ACM Press, 2017, 187-200 p.Conference paper (Refereed)
  • 28.
    Ros, Alberto
    et al.
    Univ Murcia, Dept Comp Engn, E-30001 Murcia, Spain.
    Davari, Mahdad
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies2015In: Proc. 21st International Symposium on High Performance Computer Architecture, IEEE Computer Society Digital Library, 2015, 186-197 p.Conference paper (Refereed)
    Abstract [en]

    Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design pattern, keeping the whole hierarchy coherent requires more effort and consideration. The reason is that, in hierarchical coherence, even basic operations must be recursive. As a consequence, intermediate-level caches behave both as directories and as leaf caches. This leads to an explosion of states, protocol-races, and protocol complexity. While there have been previous efforts to extend directory-based coherence to hierarchical designs their increased complexity and verification cost is a serious impediment to their adoption. We aim to address these concerns by encapsulating all hierarchical complexity in a simple function: that of determining when a data block is shared entirely within a cluster (sub-tree of the hierarchy) and is private from the outside. This allows us to eliminate complex recursive operations that span the hierarchy and instead employ simple coherence mechanisms such as self-invalidation and write-through-now restricted to operate within the cluster where a data block is shared. We examine two inclusivity options and discuss the relation of our approach to the recently proposed Hierarchical-Race-Free (HRF) memory models. Finally, comparisons to a hierarchical directory-based MOESI, VIPS-M, and TokenCMP protocols show that, despite its simplicity our approach results in competitive performance and decreased network traffic.

  • 29.
    Ros, Alberto
    et al.
    Univ Murcia, Dept Comp Engn, E-30001 Murcia, Spain..
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Callback: Efficient Synchronization without Invalidation with a Directory Just for Spin-Waiting2015In: 2015 ACM/IEEE 42Nd Annual International Symposium On Computer Architecture (ISCA), 2015, 427-438 p.Conference paper (Refereed)
    Abstract [en]

    Cache coherence protocols based on self-invalidation allow a simpler design compared to traditional invalidation-based protocols, by relying on data-race-free (DRF) semantics and applying self-invalidation on racy synchronization points exposed to the hardware. Their simplicity lies in the absence of invalidation traffic, which eliminates the need to track readers in a directory, and reduces the number of transient protocol states. With the addition of self-downgrade these protocols can become effectively directory-free. While this works well for race free data, unfortunately, lack of explicit invalidations compromises the effectiveness of any synchronization that relies on races. This includes any form of spin waiting, which is employed for signaling, locking, and barrier primitives. In this work we propose a new solution for spin-waiting in these protocols, the callback mechanism, that is simpler and more efficient than explicit invalidation. Callbacks are set by reads involved in spin waiting, and are satisfied by writes (that can even precede these reads). To implement callbacks we use a small (just a few entries) directory-cache structure that is intended to service only these "spin-waiting" races. This directory structure is self-contained and is not backed up in any way. Entries are created on demand and can be evicted without the need to preserve their information. Our evaluation shows a significant improvement both over explicit invalidation and over exponential back-off the state-of-the-art mechanism for self-invalidation protocols to avoid spinning in the shared cache.

  • 30. Ros, Alberto
    et al.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Complexity-effective multicore coherence2012In: Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, New York: ACM Press, 2012, 241-251 p.Conference paper (Refereed)
  • 31.
    Ros, Alberto
    et al.
    Univ Murcia, Dept Comp Engn, E-30001 Murcia, Spain..
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Racer: TSO Consistency via Race Detection2016In: 2016 49Th Annual IEEE/ACM International Symposium On Microarchitecture (MICRO), 2016Conference paper (Refereed)
    Abstract [en]

    Several recent efforts aim to simplify coherence and its associate costs (e.g., directory size, complexity) in multicores. The bulk of these efforts rely on program data-race-free (DRF) semantics to eliminate explicit invalidations and use self-invalidation instead. While such protocols are simple, they require software cooperation. This is acceptable only for ( correct) software that abides by the SC-for-DRF semantics defined in many modern programming language standards (e.g., C++11, Java, the latest C standards) but many are unwilling to trust coherence that relies solely on program semantics for its correctness. To address this important issue, this work proposes Racer, an efficient self-invalidation/write-through approach that guarantees the memory consistency model of the most common family of processors (TSO-x86), and at the same time maintains the relaxed-ordering advantages of SC-for-DRF protocols. Lacking a directory and explicit invalidations, Racer achieves this by detecting read-after-write races and causing self-invalidation on the racing reader's cache. Racer also uses a coalescing store buffer (at the L1 level) that allows coalescing and reordering of stores but upon detecting a race, delays the racing read until all its stores appear in order to the read. Race detection is performed using an efficient signature-based mechanism at the level of the shared cache. Racer performs significantly better than a traditional non-scalable directory-based protocol that does not allow reordering at the protocol level (14.2% in time and 26.4% in energy), a directory protocol for TSO (1.9% in time and 15.5% in energy), and state-of-the-art SC-for-DRF protocol that relies on acquire-release annotations in the programs (6.7% in time and 9.5% in energy). Racer self-invalidates less than program-level annotations as it only enforces ordering on dynamically detected races and provides significant reductions in network traffic and memory-system energy consumption.

  • 32.
    Ros, Alberto
    et al.
    Univ Murcia, E-30001 Murcia, Spain..
    Leonardsson, Carl
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Sakalis, Christos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    POSTER: Efficient Self-Invalidation/Self-Downgrade for Critical Sections with Relaxed Semantics2016In: 2016 INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURE AND COMPILATION TECHNIQUES (PACT), 2016, 433-434 p.Conference paper (Refereed)
    Abstract [en]

    Cache coherence protocols based on self-invalidation allow simpler hardware implementation compared to traditional write-invalidation protocols, by relying on data-race-free semantics and applying self-invalidation and self-downgrade on synchronization points. This work examines how self invalidation and self-downgrade are performed in relation to atomicity and ordering and shows that they do not need to be applied conservatively, as so far implemented. Our key observation is that, often, critical sections which are not ordered in time, are intended to provide only atomicity but not thread synchronization.

  • 33. Sakalis, Christos
    et al.
    Leonardsson, Carl
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Ros, Alberto
    Univ Murcia, E-30001 Murcia, Spain.
    Splash-3: A properly synchronized benchmark suite for contemporary research2016In: Proc. International Symposium on Performance Analysis of Systems and Software: ISPASS 2016, IEEE Computer Society, 2016, 101-111 p.Conference paper (Refereed)
    Abstract [en]

    Benchmarks are indispensable in evaluating the performance implications of new research ideas. However, their usefulness is compromised if they do not work correctly on a system under evaluation or, in general, if they cannot be used consistently to compare different systems. A well-known benchmark suite of parallel applications is the Splash-2 suite. Since its creation in the context of the DASH project, Splash-2 benchmarks have been widely used in research. However, Splash-2 was released over two decades ago and does not adhere to the recent C memory consistency model. This leads to unexpected and often incorrect behavior when some Splash-2 benchmarks are used in conjunction with contemporary compilers and hardware (simulated or real). Most importantly, we discovered critical performance bugs that may question some of the reported benchmark results. In this work, we analyze the Splash-2 benchmarks and expose data races and related performance bugs. We rectify the problematic benchmarks and evaluate the resulting performance. Our work contributes to the community a new sanitized version of the Splash-2 benchmarks, called the Splash-3 benchmark suite.

  • 34.
    Sandberg, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Full speed ahead: Detailed architectural simulation at near-native speed2015In: Proc. 18th International Symposium on Workload Characterization, IEEE Computer Society, 2015, 183-192 p.Conference paper (Refereed)
    Abstract [en]

    Cycle-level microarchitectural simulation is the defacto standard to estimate performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run at speeds that are thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through microarchitectural state checkpointing, one can quickly and easily determine the performance characteristics of a workload for a variety of microarchitectural changes. While this strategy of sampling simulations to generate checkpoints performs well for static applications, more complex scenarios involving hardware-software co-design (such as co-optimizing both a Java virtual machine and the microarchitecture it is running on) cause this methodology to break down, as new microarchitectural checkpoints are needed for each memory hierarchy configuration and software version. Solutions are therefore needed to enable fast and accurate simulation that also address the needs of hardware-software co-design and exploration. In this work we present a methodology to enhance checkpoint-based sampled simulation. Our solution integrates hardware virtualization to provide near-native speed, virtualized fast-forwarding to regions of interest, and parallel detailed simulation. However, as we cannot warm the simulated caches during virtualized fast-forwarding, we develop a novel approach to estimate the error introduced by limited cache warming, through the use of optimistic and pessimistic warming simulations. Using virtualized fast-forwarding (which operates at 90% of native speed on average), we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. Additionally, we demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.

  • 35.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Borgström, Gustaf
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Improving Error-Resilience of Emerging Multi-Value TechnologiesManuscript (preprint) (Other academic)
    Abstract [en]

    There exist extensive ongoing research efforts on emerging technologies that have the potential to become an alternative to today’s CMOS technologies. A common feature among the investigated technologies is that of multi- value devices and the possibility of implementing quaternary logic and memory. However, multi-value devices tend to be more sensitive to interferences and, thus, have reduced error resilience. We present an architecture based on multi-value devices where we can trade energy efficiency against error resilience. Important data are encoded in a more robust binary format while error tolerant data is encoded in a quaternary format. We show for eight benchmarks an energy reduction of 32% and 36% for the register file and level-one data cache, respectively, and for the two integer benchmarks, an energy reduction for arithmetic operations of 13% and 23%. We also show that for a quaternary technology to be viable it need to have a raw bit error rate of one error in 100 million or better.

  • 36.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Borgström, Gustaf
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Klymenko, Mykhailo V.
    Remacle, Françoise
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Techniques for modulating error resilience in emerging multi-value technologies2016In: Proc. 13th International Conference on Computing Frontiers, New York: ACM Press, 2016, 55-63 p.Conference paper (Refereed)
  • 37.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Martonosi, Margaret
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Power-Efficient Computer Architectures: Recent Advances2014Book (Refereed)
  • 38.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Shariati Nilsson, Nina
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    A tunable cache for approximate computing2014In: Proc. 10th International Symposium on Nanoscale Architectures, Piscataway, NJ: IEEE , 2014, 88-89 p.Conference paper (Refereed)
    Abstract [en]

    CMOS scaling is near its end but new emerging devices are being developed to replace CMOS. These devices have different features than CMOS, such as the possibility for multi-value logic, which present new opportunities when designing computer systems. In this work we investigate the use of multi-value devices to design a cache that can tune the amount of resources used to store application data. We leverage work on approximate computing to store data that are not application critical in a compact quaternary format while critical data is stored in a more error resilient binary format.

  • 39.
    Spiliopoulos, Vasileios
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Bagdia, Akash
    Hansson, Andreas
    Aldworth, Peter
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Introducing DVFS-Management in a Full-System Simulator2013In: Proc. 21st International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, IEEE Computer Society, 2013Conference paper (Refereed)
    Abstract [en]

    Dynamic Voltage and Frequency Scaling (DVFS) is an essential part of controlling the power consumption of any computer system, ranging from mobile phones to servers. DVFS efficiency relies on hardware-software co-optimization, thus using existing hardware cannot reveal the full optimization potential beyond the current implementation’s characteristics. To explore the vast design space for DVFS efficiency, that straddles software and hardware, a simulation infrastructure must provide features that are not readily available today, for example: software controllable clock and voltage domains, support for the OS and the frequency scaling module of it, and an online power estimation methodology. As the main contribution,this work enables DVFS studies in a full-system simulator. We extend the gem5 simulator to support full-system DVFS modeling. By doing so, we enable energy-efficiency experiments to be performed in gem5 and we showcase such studies. Finally, we show that both existing and novel frequency governors for Linux and Android can be effortlessly integrated in the framework, and we evaluate the efficiency of different DVFS schemes.

  • 40.
    Spiliopoulos, Vasileios
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Keramidas, Georgios
    Green governors: A framework for continuously adaptive DVFS2011In: Proc. International Green Computing Conference and Workshops: IGCC 2011, Piscataway, NJ: IEEE , 2011, 1-8 p.Conference paper (Refereed)
  • 41.
    Spiliopoulos, Vasileios
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Keramidas, Georgios
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Efstathiou, Konstantinos
    Power-performance adaptation in Intel core i72011In: Proc. 2nd Workshop on Computer Architecture and Operating System co-design, Cambridge, MA: Computer Science and Artificial Intelligence Laboratory, MIT , 2011, 10- p.Conference paper (Other academic)
  • 42.
    Spiliopoulos, Vasileios
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Power-Sleuth: A Tool for Investigating your Program's Power Behavior2012In: International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'12), 2012, 241-250 p.Conference paper (Refereed)
    Abstract [en]

    Modern processors support aggressive power saving techniques to reduce energy consumption. However, traditional profiling techniques have mainly focused on performance, which does not accurately reflect the power behavior of applications. For example, the longest running function is not always the most energy-hungry function. Thus software developers cannot always take full advantage of these power-saving features.

    We present \powersleuth, a power/performance estimation tool which is able to provide a full description of an application's behavior for any frequency from a single profiling run. The tool combines three techniques: a power and a performance estimation model with a program phase detection technique to deliver accurate, per-phase, per-frequency analysis.

    Our evaluation (against real power measurements) shows that we can accurately predict power and performance across different frequencies with average errors of 3.5% and 3.9% respectively.

  • 43.
    Spiliopoulos, Vasileios
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Keramidas, Georgios
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    A unified DVFS-cache resizing framework2016Report (Other academic)
  • 44. Strikos, Nikolaos
    et al.
    Keramidas, Georgios
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Parallelizing multicore cache simulations on GPUs2010In: Proc. 3rd Swedish Workshop on Multi-Core Computing, Göteborg, Sweden: Chalmers University of Technology , 2010, 3-8 p.Conference paper (Other academic)
  • 45.
    Tran, Kim-Anh
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Koukos, Konstantinos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Norwegian University of Science and Technology.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Clairvoyance: Look-ahead compile-time scheduling2017In: Proc. 15th International Symposium on Code Generation and Optimization, Piscataway, NJ: IEEE Press, 2017, 171-184 p.Conference paper (Refereed)
  • 46.
    Tran, Kim-Anh
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor Erik
    National University of Singapore.
    Koukos, Konstantinos
    Royal Institute of Technology.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. Norwegian University of Science and Technology.
    Spiliopoulos, Vasileios
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Static Instruction Scheduling for High Performance on Limited HardwareIn: IEEE Transactions on ComputersArticle in journal (Refereed)
    Abstract [en]

    Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic block boundaries, Clairvoyance overlaps the outstanding latencies to increases memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution time improvement of 14% for memory-bound applications, on top of standard O3 optimizations, while maintaining compute-bound applications' high-performance.

  • 47.
    Voigt, Thiemo
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hermans, Frederik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Gunningberg, Per
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Approximation: A New Paradigm also for Wireless Sensing2016Conference paper (Refereed)
  • 48. Waern, Jonatan
    et al.
    Ekemark, Per
    Koukos, Konstantinos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Profiling-Assisted Decoupled Access-Execute2016In: Proc. 4th International Workshop on High Performance Energy Efficient Embedded Systems, 2016Conference paper (Refereed)
  • 49. Weber, Anton
    et al.
    Tran, Kim-Anh
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Decoupled Access-Execute on ARM big.LITTLE2017In: Proc. 5th Workshop on High Performance Energy Efficient Embedded Systems, 2017Conference paper (Refereed)
1 - 49 of 49
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf