Uppsala University Publications
Search results 201 - 250 of 276
  • 201.
    Mottola, Luca
    et al.
    Politecnico di Milano, Italy; SICS Swedish ICT.
    Whitehouse, Kamin
    University of Virginia, US.
    Fundamental Concepts of Reactive Control for Autonomous Drones (2018). In: Communications of the ACM, ISSN 0001-0782, E-ISSN 1557-7317, Vol. 61, no 10, p. 96-104. Article in journal (Refereed)
    Abstract [en]

    Autonomous drones represent a new breed of mobile computing system. Compared to smartphones and connected cars that only opportunistically sense or communicate, drones allow motion control to become part of the application logic. The efficiency of their movements is largely dictated by the low-level control enabling their autonomous operation based on high-level inputs. Existing implementations of such low-level control operate in a time-triggered fashion. In contrast, we conceive a notion of reactive control that allows drones to execute the low-level control logic only upon recognizing the need to, based on the influence of the environment onto the drone operation. As a result, reactive control can dynamically adapt the control rate. This brings fundamental benefits, including more accurate motion control, extended lifetime, and better quality of service in end-user applications. Based on 260+ hours of real-world experiments using three aerial drones, three different control logic, and three hardware platforms, we demonstrate, for example, up to 41% improvements in motion accuracy and up to 22% improvements in flight time.
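
    As a rough illustration of the contrast between time-triggered and reactive low-level control described above, the following Python sketch (my own simplification, not the authors' implementation; the sensor model and threshold are made up) runs the controller at a fixed rate in one case and only when the sensed deviation exceeds a threshold in the other, so the effective control rate adapts to the environment.

```python
# Minimal sketch (not the authors' implementation): a time-triggered loop runs the
# low-level controller on every iteration regardless of need, while a reactive loop
# runs it only when the sensed deviation exceeds a threshold, so the effective
# control rate adapts to how strongly the environment perturbs the drone.
import random

def read_attitude_error():
    # Hypothetical sensor read: deviation from the commanded attitude.
    return random.gauss(0.0, 0.05)

def run_low_level_control(error):
    # Placeholder for the actual attitude/position controller update.
    pass

def time_triggered(steps):
    invocations = 0
    for _ in range(steps):           # one controller run per fixed period
        run_low_level_control(read_attitude_error())
        invocations += 1
    return invocations

def reactive(steps, threshold=0.08):
    invocations = 0
    for _ in range(steps):
        error = read_attitude_error()
        if abs(error) > threshold:   # run the control logic only when needed
            run_low_level_control(error)
            invocations += 1
    return invocations

if __name__ == "__main__":
    steps = 1000
    print("time-triggered controller runs:", time_triggered(steps))
    print("reactive controller runs:      ", reactive(steps))
```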

  • 202.
    Mustini, Jeton
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
    Development of a cloud service and a mobile client that visualizes business data stored in Microsoft Dynamics CRM (2015). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    In this master thesis, a prototype application is developed to help decision makers analyze data and present it so that they can make business decisions more easily. The application consists of a client application, a cloud service, and a Microsoft Dynamics CRM system. The client application is developed as a Windows Store App, and the cloud service is developed as a web application using ASP.NET Web API. From the client, users connect to the cloud service by providing a set of user credentials. These credentials are then used against the user's Microsoft Dynamics CRM server to retrieve business data. The data is transformed in a component on the cloud service into useful information defined by key performance indicators. The user's hierarchical organization structure is also replicated in the cloud service, enabling users to drill down and up between organizational units and view their key performance indicators. These key performance indicators are finally returned to the client and presented on a dashboard using interactive charts.
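
    The drill-down over the replicated organization hierarchy can be pictured with a small sketch. This is only an illustration with a hypothetical data model; the actual service is an ASP.NET Web API application talking to Dynamics CRM, and the names below (OrgUnit, kpi_total, drill_down) are mine, not the thesis code.

```python
# Illustrative sketch (hypothetical data model, not the thesis implementation): the
# cloud service replicates the organisational hierarchy and aggregates a KPI per
# unit, so the client can drill down from a unit to its children.
from dataclasses import dataclass, field
from typing import List

@dataclass
class OrgUnit:
    name: str
    revenue: float = 0.0            # example KPI measured at this unit
    children: List["OrgUnit"] = field(default_factory=list)

def kpi_total(unit: OrgUnit) -> float:
    # The KPI for a unit includes all units below it in the hierarchy.
    return unit.revenue + sum(kpi_total(c) for c in unit.children)

def drill_down(unit: OrgUnit):
    # What a dashboard request for `unit` might return: one KPI value per child.
    return {child.name: kpi_total(child) for child in unit.children}

if __name__ == "__main__":
    sales = OrgUnit("Sales", 0.0, [OrgUnit("Nordics", 120.0), OrgUnit("DACH", 200.0)])
    company = OrgUnit("Company", 0.0, [sales, OrgUnit("Support", 40.0)])
    print(kpi_total(company))       # 360.0
    print(drill_down(company))      # {'Sales': 320.0, 'Support': 40.0}
```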

  • 203.
    Nakajima, Masayuki
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Arts, Department of Game Design.
    Current Topics in Computer Graphics: Report of SIGGRAPH 2013 (2013). In: ITE Technical Report, ITE, 2013, p. 13-20. Conference paper (Other (popular science, discussion, etc.))
    Abstract [en]

    CG, human interfaces, multimedia, and virtual reality technology are improving rapidly these days across many fields, from entertainment such as movies, TV, and games to visualization in engineering, science, and art. I report on current topics from the 40th SIGGRAPH conference, SIGGRAPH 2013, held at the Anaheim Convention Center, California.

  • 204.
    Nakajima, Masayuki
    Gotland University, School of Game Design, Technology and Learning Processes.
    Intelligent CG Making Technology and Intelligent Media (2013). In: ITE Transactions on Media Technology and Applications, ISSN 2186-7364, Vol. 1, no 1, p. 20-26. Article in journal (Refereed)
    Abstract [en]

    In this invited research paper, I will describe the Intelligent CG Making Technology (ICGMT) production methodology and Intelligent Media (IM). I will begin with an explanation of the key aspects of the ICGMT and a definition of IM. Thereafter I will explain the three approaches of the ICGMT: the reuse of animation data, making animation from text, and making animation from natural spoken language. Finally, I will explain current approaches of the ICGMT under development by the Nakajima laboratory.

  • 205.
    Nakajima, Masayuki
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Arts, Department of Game Design.
    Chang, Youngha
    Tokyo City University.
    Mukai, Nobuhiko
    Tokyo City University.
    Color Similarity Metric Based on Categorical Color Perception (2013). In: ITE Journal, ISSN 1342-6893, Vol. 67, no 3, p. 116-119. Article in journal (Refereed)
    Abstract [en]

    The calculation of color difference is one of the most basic techniques in image processing fields. For example, color clustering and edge detection are the first steps of most image processes and we compute them by using a color difference formula. Although the CIELAB color difference formula is a commonly used one, the results obtained with it are not in accordance with human feelings when the color difference becomes large. In this paper, we have performed psychophysical experiments on color similarity between colors that have large color differences. We have then analyzed the results and found that the similarity is strongly restricted by the basic color categories. In accordance with this result, we propose a new color similarity metric based on the CIEDE2000 color difference formula and categorical color perception.
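
    A rough sketch of the idea, not the paper's metric: combine a Lab colour difference with a categorical term. CIE76 (plain Euclidean distance in Lab) stands in for CIEDE2000 below, and the category centroids and penalty are illustrative values of my own, not values from the paper.

```python
# Rough sketch only: penalise colour pairs that fall into different basic colour
# categories on top of a Lab colour difference. CIE76 stands in for CIEDE2000 here.
import math

BASIC_CATEGORIES = {            # hypothetical Lab centroids for a few categories
    "red":   (53.0, 80.0, 67.0),
    "green": (88.0, -86.0, 83.0),
    "blue":  (32.0, 79.0, -108.0),
    "white": (100.0, 0.0, 0.0),
    "black": (0.0, 0.0, 0.0),
}

def delta_e76(lab1, lab2):
    # CIE76 colour difference: Euclidean distance in Lab space.
    return math.dist(lab1, lab2)

def category(lab):
    # Categorical colour perception, approximated as nearest-centroid classification.
    return min(BASIC_CATEGORIES, key=lambda name: delta_e76(lab, BASIC_CATEGORIES[name]))

def similarity_distance(lab1, lab2, penalty=20.0):
    # Add a fixed penalty when the two colours are perceived as different categories.
    d = delta_e76(lab1, lab2)
    if category(lab1) != category(lab2):
        d += penalty
    return d

if __name__ == "__main__":
    dark_red = (35.0, 60.0, 40.0)
    orange   = (65.0, 45.0, 70.0)
    print(similarity_distance(dark_red, orange))
```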

  • 206.
    Nakajima, Masayuki
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Arts, Department of Game Design.
    Miyai, Ayumi
    Tokyo University.
    Yamaguchi, Yasushi
    Tokyo University.
    How to Evaluate Learning Outcomes of Stereoscopic 3D Computer Graphics by Scene Rendering (2013). In: ITE Technical Report, Vol. 37, No. 45, ME2013-117, ITE, 2013, p. 21-24. Conference paper (Other (popular science, discussion, etc.))
    Abstract [en]

    The use of stereoscopic 3DCG (S3DCG) is increasing in movies, games, and animations. However, a method for objectively evaluating production capability has not been established. If production capability could be measured on the basis of certain criteria, a unified evaluation would be useful in schools, and also for human resource development and recruitment in industry. We therefore conducted practical tests in which subjects used 3DCG software to create an S3DCG scene. The tests were carried out before and after the subjects' learning. As a result, we were able to measure the improvement in each subject's capability after learning, as well as the differences in capability between subjects. In this paper, we report on the experimental method and results.

  • 207.
    Nakajima, Masayuki
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Arts, Department of Game Design.
    Ono, Sumiaki
    Alexis, Andre
    Chang, Youngha
    Tokyo City University.
    Automatic Generation of LEGO from the Polygonal Data (2013). In: IWAIT 2013, Nagoya, 2013, p. 262-267. Conference paper (Refereed)
    Abstract [en]

    In this work, we propose a method that converts a 3D polygonal model into a corresponding LEGO brick assembly. For this, we first convert the polygonal model into a voxel model, and then convert it to the brick representation. The difficulty lies in guaranteeing the connections between bricks. To achieve this, we define a replacement priority, and the conversion from voxel to brick representation is done according to this priority. We show some experimental results, which show that our method can keep the connections and achieves a robust and optimized method for assembling LEGO building bricks.
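
    The replacement-priority idea can be sketched in a few lines. The following is a simplified 2-D, single-layer illustration under my own assumptions (brick sizes and priority order), not the paper's algorithm: voxels are greedily replaced by bricks, trying larger bricks first so that neighbouring voxels end up covered by the same brick.

```python
# Simplified 2-D sketch of the idea (not the paper's algorithm): voxels in one layer
# are greedily replaced by bricks in a fixed priority order, larger bricks first.
BRICK_PRIORITY = [(2, 4), (2, 2), (1, 4), (1, 2), (1, 1)]   # (rows, cols), largest first

def fits(filled, covered, r, c, h, w):
    cells = [(r + dr, c + dc) for dr in range(h) for dc in range(w)]
    return all(cell in filled and cell not in covered for cell in cells)

def layer_to_bricks(filled):
    """filled: set of (row, col) voxels in one layer -> list of placed bricks."""
    covered, bricks = set(), []
    for r, c in sorted(filled):
        if (r, c) in covered:
            continue
        for h, w in BRICK_PRIORITY:              # replacement priority
            if fits(filled, covered, r, c, h, w):
                bricks.append(((r, c), (h, w)))
                covered.update((r + dr, c + dc) for dr in range(h) for dc in range(w))
                break
    return bricks

if __name__ == "__main__":
    layer = {(r, c) for r in range(2) for c in range(6)}     # a 2x6 slab of voxels
    print(layer_to_bricks(layer))   # a 2x4 and a 2x2 brick instead of twelve 1x1s
```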

  • 208.
    Nakajima, Masayuki
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Arts, Department of Game Design.
    Ono, Sumiaki
    Chang, Yang
    Tokyo City University.
    Andre, Alexis
    LEGO Builder: Automatic Generation of LEGO Assembly Manual from 3D Polygon Model (2013). In: ITE English Journal, ISSN 1342-6893, Vol. 1, no 4, p. 354-360. Article in journal (Refereed)
    Abstract [en]

    The LEGO brick system is one of the most popular toys in the world. It can stimulate one's creativity while being lots of fun. It is however very hard for the naive user to assemble complex models without instructions. In this work, we propose a method that converts 3D polygonal models into LEGO brick building instructions automatically. The most important part of the conversion is that the connectivity between the bricks should be assured. For this, we introduce a graph structure named "legograph" that allows us to generate physically sound models that do not fall apart by managing the connections between the bricks. We show some experimental results and evaluation results. These show that the 3D brick models generated following the instructions generated by our method do not fall apart and that one can learn how to efficiently build 3D structures from our instructions.

  • 209.
    Ngo, Tuan-Phong
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems.
    Model Checking of Software Systems under Weak Memory Models (2019). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    When a program is compiled and run on a modern architecture, different optimizations may be applied to gain efficiency. In particular, the access operations (e.g., reads and writes) to the shared memory may be performed in an out-of-order manner, i.e., in a different order than the order in which the operations were issued by the program. The reordering of memory access operations leads to efficient use of instruction pipelines and thus an improvement in program execution times. However, this gain in efficiency comes at a price. More precisely, programs running on modern architectures may exhibit behaviors that are unexpected by programmers. Out-of-order execution has led to the invention of new program semantics, called weak memory models (WMMs). One crucial problem is to ensure the correctness of concurrent programs running under weak memory models.

    The thesis proposes three techniques for reasoning about and analyzing concurrent programs running under WMMs. The first one is a sound and complete analysis technique for finite-state programs running under the TSO semantics (Paper II). This technique is based on a novel and equivalent semantics for TSO, called Dual TSO semantics, and on the use of the framework of well-structured transition systems. The second technique is an under-approximation technique that can be used to detect bugs under the POWER semantics (Paper III). This technique is based on bounding the number of contexts in an explored execution where, in each context, there is only one active process. The third technique is also an under-approximation technique, based on systematic testing (a.k.a. stateless model checking). This approach has been used to develop an optimal and efficient systematic testing approach for concurrent programs running under the Release-Acquire semantics (Paper IV).

    The thesis also considers the problem of effectively finding a minimal set of fences that guarantees the correctness of a concurrent program running under WMMs (Paper I). A fence (a.k.a. barrier) is an operation that can be inserted in the program to prohibit certain reorderings between operations issued before and after the fence. Since fences are expensive, it is crucial to automatically find a minimal set of fences to ensure the program correctness. This thesis presents a method for automatic fence insertion in programs running under the TSO semantics that offers the best-known trade-off between the efficiency and optimality of the algorithm. The technique is based on a novel notion of correctness, called Persistence, that compares the behaviors of a program running under WMMs to that running under the SC semantics.
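
    To see why reasoning under TSO differs from SC, consider the classic store-buffering litmus test. The sketch below (my own toy model, not the thesis's tooling) exhaustively enumerates its outcomes under SC and under a TSO-like semantics with per-thread FIFO store buffers: the outcome r1 = r2 = 0 is unreachable under SC but reachable under TSO, which is exactly the kind of behavior that fences are inserted to forbid.

```python
# Toy enumeration of the store-buffering litmus test under SC and a TSO-like model.
from itertools import permutations

PROGRAM = [                                  # two threads: one store, then one load
    [("store", "x", 1), ("load", "y", "r1")],
    [("store", "y", 1), ("load", "x", "r2")],
]

def sc_outcomes():
    # Under SC every access goes straight to memory; enumerate all interleavings.
    outcomes = set()
    for order in set(permutations([0, 0, 1, 1])):
        mem, regs, pcs = {"x": 0, "y": 0}, {}, [0, 0]
        for tid in order:
            op, var, arg = PROGRAM[tid][pcs[tid]]
            pcs[tid] += 1
            if op == "store":
                mem[var] = arg
            else:
                regs[arg] = mem[var]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes

def tso_outcomes():
    # TSO-like model: stores go into a per-thread FIFO buffer and are flushed to
    # memory at any later point; loads read their own buffer first (store forwarding).
    results = set()

    def step(pcs, bufs, mem, regs):
        if pcs == [2, 2] and not bufs[0] and not bufs[1]:
            results.add((regs["r1"], regs["r2"]))
            return
        for t in (0, 1):
            if bufs[t]:                                    # flush oldest buffered store
                var, val = bufs[t][0]
                nbufs = [b[1:] if i == t else b for i, b in enumerate(bufs)]
                step(pcs, nbufs, {**mem, var: val}, regs)
            if pcs[t] < 2:                                 # execute next instruction
                op, var, arg = PROGRAM[t][pcs[t]]
                npcs = [p + 1 if i == t else p for i, p in enumerate(pcs)]
                if op == "store":
                    nbufs = [b + [(var, arg)] if i == t else b for i, b in enumerate(bufs)]
                    step(npcs, nbufs, mem, regs)
                else:
                    pending = [v for w, v in bufs[t] if w == var]
                    val = pending[-1] if pending else mem[var]
                    step(npcs, bufs, mem, {**regs, arg: val})

    step([0, 0], [[], []], {"x": 0, "y": 0}, {})
    return results

if __name__ == "__main__":
    print("SC  outcomes:", sorted(sc_outcomes()))    # (0, 0) is absent
    print("TSO outcomes:", sorted(tso_outcomes()))   # (0, 0) is reachable
```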

    List of papers
    1. The Best of Both Worlds: Trading efficiency and optimality in fence insertion for TSO
    2015 (English). In: Programming Languages and Systems: ESOP 2015, Springer Berlin/Heidelberg, 2015, p. 308-332. Conference paper, Published paper (Refereed)
    Abstract [en]

    We present a method for automatic fence insertion in concurrent programs running under weak memory models that provides the best known trade-off between efficiency and optimality. On the one hand, the method can efficiently handle complex aspects of program behaviors such as unbounded buffers and large numbers of processes. On the other hand, it is able to find small sets of fences needed for ensuring correctness of the program. To this end, we propose a novel notion of correctness, called persistence, that compares the behavior of the program under the weak memory semantics with that under the classical interleaving (SC) semantics. We instantiate our framework for the Total Store Ordering (TSO) memory model, and give an algorithm that reduces the fence insertion problem under TSO to the reachability problem for programs running under SC. Furthermore, we provide an abstraction scheme that substantially increases scalability to large numbers of processes. Based on our method, we have implemented a tool and run it successfully on a wide range of benchmarks.

    Place, publisher, year, edition, pages
    Springer Berlin/Heidelberg, 2015
    Series
    Lecture Notes in Computer Science, ISSN 0302-9743 ; 9032
    Keywords
    weak memory, correctness, verification, TSO, concurrent program
    National Category
    Computer Sciences
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-253645 (URN)10.1007/978-3-662-46669-8_13 (DOI)000361751400013 ()978-3-662-46668-1 (ISBN)
    Conference
    24th European Symposium on Programming, ESOP 2015, April 11–18, London, UK
    Projects
    UPMARC
    Available from: 2015-05-29 Created: 2015-05-29 Last updated: 2018-11-21
    2. A load-buffer semantics for total store ordering
    2018 (English). In: Logical Methods in Computer Science, ISSN 1860-5974, E-ISSN 1860-5974, Vol. 14, no 1, article id 9. Article in journal (Refereed) Published
    Abstract [en]

    We address the problem of verifying safety properties of concurrent programs running over the Total Store Order (TSO) memory model. Known decision procedures for this model are based on complex encodings of store buffers as lossy channels. These procedures assume that the number of processes is fixed. However, it is important in general to prove the correctness of a system/algorithm in a parametric way with an arbitrarily large number of processes. 

    In this paper, we introduce an alternative (yet equivalent) semantics to the classical one for TSO that is more amenable to efficient algorithmic verification and to extension to parametric verification. For that, we adopt a dual view where load buffers are used instead of store buffers. The flow of information is now from the memory to the load buffers. We show that this new semantics allows us (1) to drastically simplify the safety analysis under TSO, (2) to obtain a spectacular gain in efficiency and scalability compared to existing procedures, and (3) to easily extend the decision procedure to the parametric case, which allows us to obtain a new decidability result, and more importantly, a verification algorithm that is more general and more efficient in practice than the one for bounded instances.

    Keywords
    Verification, TSO, concurrent program, safety property, well-structured transition system
    National Category
    Computer Sciences
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-337278 (URN)000426512000008 ()
    Projects
    UPMARC
    Available from: 2018-01-23 Created: 2017-12-21 Last updated: 2018-11-21
    3. Context-bounded analysis for POWER
    2017 (English). In: Tools and Algorithms for the Construction and Analysis of Systems: Part II, Springer, 2017, p. 56-74. Conference paper, Published paper (Refereed)
    Abstract [en]

    We propose an under-approximate reachability analysis algorithm for programs running under the POWER memory model, in the spirit of the work on context-bounded analysis initiated by Qadeer et al. in 2005 for detecting bugs in concurrent programs (supposed to be running under the classical SC model). To that end, we first introduce a new notion of context-bounding that is suitable for reasoning about computations under POWER, which generalizes the one defined by Atig et al. in 2011 for the TSO memory model. Then, we provide a polynomial size reduction of the context-bounded state reachability problem under POWER to the same problem under SC: Given an input concurrent program P, our method produces a concurrent program P' such that, for a fixed number of context switches, running P' under SC yields the same set of reachable states as running P under POWER. The generated program P' contains the same number of processes as P and operates on the same data domain. By leveraging the standard model checker CBMC, we have implemented a prototype tool and applied it on a set of benchmarks, showing the feasibility of our approach.

    Place, publisher, year, edition, pages
    Springer, 2017
    Series
    Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 10206
    Keywords
    POWER, weak memory model, under approximation, translation, concurrent program, testing
    National Category
    Computer Systems
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-314901 (URN)10.1007/978-3-662-54580-5_4 (DOI)000440733400004 ()978-3-662-54579-9 (ISBN)
    Conference
    23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 2017, April 22–29, Uppsala, Sweden
    Projects
    UPMARC
    Available from: 2017-03-31 Created: 2017-02-07 Last updated: 2018-11-21. Bibliographically approved
    4. Optimal Stateless Model Checking under the Release-Acquire Semantics
    2018 (English). In: SPLASH OOPSLA 2018, Boston, Nov 4-9, 2018, ACM Digital Library, 2018. Conference paper, Published paper (Refereed)
    Abstract [en]

    We present a framework for efficient application of stateless model checking (SMC) to concurrent programs running under the Release-Acquire (RA) fragment of the C/C++11 memory model. Our approach is based on exploring the possible program orders, which define the order in which instructions of a thread are executed, and read-from relations, which define how reads obtain their values from writes. This is in contrast to previous approaches, which in addition explore the possible coherence orders, i.e., orderings between conflicting writes. Since unexpected test results such as program crashes or assertion violations depend only on the read-from relation, we avoid a potentially large source of redundancy. Our framework is based on a novel technique for determining whether a particular read-from relation is feasible under the RA semantics. We define an SMC algorithm which is provably optimal in the sense that it explores each program order and read-from relation exactly once. This optimality result is strictly stronger than previous analogous optimality results, which also take coherence order into account. We have implemented our framework in the tool Tracer. Experiments show that Tracer can be significantly faster than state-of-the-art tools that can handle the RA semantics.

    Place, publisher, year, edition, pages
    ACM Digital Library, 2018
    Keywords
    Software model checking, C/C++11, Release-Acquire, Concurrent program
    National Category
    Computer Systems
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-358241 (URN)
    Conference
    SPLASH OOPSLA 2018
    Projects
    UPMARC
    Available from: 2018-08-26 Created: 2018-08-26 Last updated: 2019-01-09. Bibliographically approved
  • 210.
    Ngo, Tuan-Phong
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Abdulla, Parosh
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Jonsson, Bengt
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
    Atig, Mohamed Faouzi
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Optimal Stateless Model Checking under the Release-Acquire Semantics (2018). In: SPLASH OOPSLA 2018, Boston, Nov 4-9, 2018, ACM Digital Library, 2018. Conference paper (Refereed)
    Abstract [en]

    We present a framework for efficient application of stateless model checking (SMC) to concurrent programs running under the Release-Acquire (RA) fragment of the C/C++11 memory model. Our approach is based on exploring the possible program orders, which define the order in which instructions of a thread are executed, and read-from relations, which define how reads obtain their values from writes. This is in contrast to previous approaches, which in addition explore the possible coherence orders, i.e., orderings between conflicting writes. Since unexpected test results such as program crashes or assertion violations depend only on the read-from relation, we avoid a potentially large source of redundancy. Our framework is based on a novel technique for determining whether a particular read-from relation is feasible under the RA semantics. We define an SMC algorithm which is provably optimal in the sense that it explores each program order and read-from relation exactly once. This optimality result is strictly stronger than previous analogous optimality results, which also take coherence order into account. We have implemented our framework in the tool Tracer. Experiments show that Tracer can be significantly faster than state-of-the-art tools that can handle the RA semantics.
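
    A back-of-the-envelope illustration (mine, not the paper's algorithm) of why enumerating read-from relations rather than full interleavings removes redundancy: for n writer threads each storing once to x and one reader loading x, the number of interleavings grows factorially, while the number of distinct read-from choices for the load is only n + 1.

```python
# Counting sketch: interleavings versus read-from choices for a tiny program with
# n one-store writer threads and one one-load reader thread.
from math import factorial

def interleavings(n_writers):
    # (n_writers + 1) one-instruction threads: every ordering is a distinct interleaving.
    return factorial(n_writers + 1)

def read_from_choices(n_writers):
    # The load can read the initial value or the value of any one of the writes.
    return n_writers + 1

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "writers:", interleavings(n), "interleavings vs",
              read_from_choices(n), "read-from choices")
```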

  • 211.
    Nikoleris, Nikos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Efficient Memory Modeling During Simulation and Native Execution (2019). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Application performance on computer processors depends on a number of complex architectural and microarchitectural design decisions. Consequently, computer architects rely on performance modeling to improve future processors without building prototypes. This thesis focuses on performance modeling and proposes methods that quantify the impact of the memory system on application performance.

    Detailed architectural simulation, a common approach to performance modeling, can be five orders of magnitude slower than execution on the actual processor. At this rate, simulating realistic workloads requires years of CPU time. Prior research uses sampling to speed up simulation. Using sampled simulation, only a number of small but representative portions of the workload are evaluated in detail. To fully exploit the speed potential of sampled simulation, the simulation method has to efficiently reconstruct the architectural and microarchitectural state prior to the simulation samples. Practical approaches to sampled simulation use either functional simulation at the expense of performance or checkpoints at the expense of flexibility. This thesis proposes three approaches that use statistical cache modeling to efficiently address the problem of cache warm up and speed up sampled simulation, without compromising flexibility. The statistical cache model uses sparse memory reuse information obtained with native techniques to model the performance of the cache. The proposed sampled simulation framework evaluates workloads 150 times faster than approaches that use functional simulation to warm up the cache.

    Other approaches to performance modeling use analytical models based on data obtained from execution on native hardware. These native techniques allow for better understanding of the performance bottlenecks on existing hardware. Efficient resource utilization in modern multicore processors is necessary to exploit their peak performance. This thesis proposes native methods that characterize shared resource utilization in modern multicores. These methods quantify the impact of cache sharing and off-chip memory sharing on overall application performance. Additionally, they can quantify scalability bottlenecks for data-parallel, symmetric workloads.

    List of papers
    1. Extending statistical cache models to support detailed pipeline simulators
    2014 (English). In: 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE Computer Society, 2014, p. 86-95. Conference paper, Published paper (Refereed)
    Abstract [en]

    Simulators are widely used in computer architecture research. While detailed cycle-accurate simulations provide useful insights, studies using modern workloads typically require days or weeks. Evaluating many design points, only exacerbates the simulation overhead. Recent works propose methods with good accuracy that reduce the simulated overhead either by sampling the execution (e.g., SMARTS and SimPoint) or by using fast analytical models of the simulated designs (e.g., Interval Simulation). While these techniques reduce significantly the simulation overhead, modeling processor components with large state, such as the last-level cache, requires costly simulation to warm them up. Statistical simulation methods, such as SMARTS, report that the warm-up overhead accounts for 99% of the simulation overhead, while only 1% of the time is spent simulating the target design. This paper proposes WarmSim, a method that eliminates the need to warm up the cache. WarmSim builds on top of a statistical cache modeling technique and extends it to model accurately not only the miss ratio but also the outcome of every cache request. WarmSim uses as input, an application's memory reuse information which is hardware independent. Therefore, different cache configurations can be simulated using the same input data. We demonstrate that this approach can be used to estimate the CPI of the SPEC CPU2006 benchmarks with an average error of 1.77%, reducing the overhead compared to a simulation with a 10M instruction warm-up by a factor of 50x.

    Place, publisher, year, edition, pages
    IEEE Computer Society, 2014
    Series
    IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-224221 (URN)10.1109/ISPASS.2014.6844464 (DOI)000364102000010 ()978-1-4799-3604-5 (ISBN)
    Conference
    ISPASS 2014, March 23-25, Monterey, CA
    Projects
    UPMARC
    Available from: 2014-05-06 Created: 2014-05-06 Last updated: 2018-12-14. Bibliographically approved
    2. CoolSim: Statistical Techniques to Replace Cache Warming with Efficient, Virtualized Profiling
    2016 (English). In: Proceedings of 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) / [ed] Najjar, W; Gerstlauer, A, IEEE, 2016, p. 106-115. Conference paper, Published paper (Refereed)
    Abstract [en]

    Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-accurate simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more through the reduction of large workloads into a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored into checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture or fundamental pieces of the microarchitecture (e.g., hardware-software co-design) require checkpoint regeneration. The end result for co-design degenerates to creating checkpoints for each modification, a task checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation, without the need for checkpoints. Virtualized fast-forwarding (VFF), an alternative to using checkpoints, allows for execution at near-native speed between simulation points. Warming the micro-architectural state prior to each simulation point, however, requires functional simulation, a costly operation for large caches (e.g., 8 M B). Simulating future systems with caches of many MBs can require warming of billions of instructions, dominating simulation time. This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim uses VFF to advance between simulation points collecting at the same time sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional simulation. At the simulation point, detailed simulation with a statistical cache model is used to evaluate the design. The previously acquired MRI is used to estimate whether each memory request hits in the cache. The MRI is an architecturally independent metric and a single profile can be used in simulations of any size cache. We describe a prototype implementation of CoolSim based on KVM and gem5 running 19 x faster than the state-of-the-art sampled simulation, while it estimates the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.

    Place, publisher, year, edition, pages
    IEEE, 2016
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-322061 (URN)000399143000015 ()9781509030767 (ISBN)
    Conference
    International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS), JUL 17-21, 2016, Samos, GREECE
    Funder
    Swedish Foundation for Strategic Research; EU, FP7, Seventh Framework Programme, 610490
    Available from: 2017-05-16 Created: 2017-05-16 Last updated: 2018-12-14. Bibliographically approved
    3. Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation
    2018 (English). Report (Other academic)
    Abstract [en]

    Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

    Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily-sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected which needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goals for improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

    Delorean uses a multi-pass approach to understand the memory regions accessed by the next, upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software though the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).

    Publisher
    p. 12
    Series
    Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203
    National Category
    Computer Systems
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-369320 (URN)
    Available from: 2018-12-12 Created: 2018-12-12 Last updated: 2019-01-08. Bibliographically approved
    4. Cache Pirating: Measuring the Curse of the Shared Cache
    2011 (English). In: Proc. 40th International Conference on Parallel Processing, IEEE Computer Society, 2011, p. 165-175. Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    IEEE Computer Society, 2011
    National Category
    Computer Engineering
    Identifiers
    urn:nbn:se:uu:diva-181254 (URN)10.1109/ICPP.2011.15 (DOI)978-1-4577-1336-1 (ISBN)
    Conference
    ICPP 2011
    Projects
    UPMARC, CoDeR-MP
    Available from: 2011-10-17 Created: 2012-09-20 Last updated: 2018-12-14. Bibliographically approved
    5. Bandwidth Bandit: Quantitative Characterization of Memory Contention
    2013 (English). In: Proc. 11th International Symposium on Code Generation and Optimization: CGO 2013, IEEE Computer Society, 2013, p. 99-108. Conference paper, Published paper (Refereed)
    Abstract [en]

    On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing. While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end we introduce the Bandwidth Bandit, a general, quantitative, profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines. The profiling data captured by the Bandwidth Bandit is presented in a bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of single threaded application.

    Place, publisher, year, edition, pages
    IEEE Computer Society, 2013
    Keywords
    bandwidth, memory, caches
    National Category
    Computer Sciences
    Research subject
    Computer Science
    Identifiers
    urn:nbn:se:uu:diva-194101 (URN)10.1109/CGO.2013.6494987 (DOI)000318700200010 ()978-1-4673-5524-7 (ISBN)
    Conference
    CGO 2013, 23-27 February, Shenzhen, China
    Projects
    UPMARC
    Funder
    Swedish Research Council
    Available from: 2013-04-18 Created: 2013-02-08 Last updated: 2018-12-14. Bibliographically approved
    6. A software based profiling method for obtaining speedup stacks on commodity multi-cores
    2014 (English). In: 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): ISPASS 2014, IEEE Computer Society, 2014, p. 148-157. Conference paper, Published paper (Refereed)
    Abstract [en]

    A key goodness metric of multi-threaded programs is how their execution times scale when increasing the number of threads. However, there are several bottlenecks that can limit the scalability of a multi-threaded program, e.g., contention for shared cache capacity and off-chip memory bandwidth; and synchronization overheads. In order to improve the scalability of a multi-threaded program, it is vital to be able to quantify how the program is impacted by these scalability bottlenecks. We present a software profiling method for obtaining speedup stacks. A speedup stack reports how much each scalability bottleneck limits the scalability of a multi-threaded program. It thereby quantifies how much its scalability can be improved by eliminating a given bottleneck. A software developer can use this information to determine what optimizations are most likely to improve scalability, while a computer architect can use it to analyze the resource demands of emerging workloads. The proposed method profiles the program on real commodity multi-cores (i.e., no simulations required) using existing performance counters. Consequently, the obtained speedup stacks accurately account for all idiosyncrasies of the machine on which the program is profiled. While the main contribution of this paper is the profiling method to obtain speedup stacks, we present several examples of how speedup stacks can be used to analyze the resource requirements of multi-threaded programs. Furthermore, we discuss how their scalability can be improved by both software developers and computer architects.
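
    A simplified illustration of what a speedup stack expresses (not the paper's exact attribution method; the stall-cycle numbers are made up): the gap between the ideal and the achieved speedup is split across bottlenecks in proportion to the stall cycles each one caused in the parallel run.

```python
# Sketch: build a speedup stack from per-bottleneck stall cycles measured with
# performance counters. The attribution rule here is a simplification of mine.
def speedup_stack(t_single, t_parallel, n_threads, stalls):
    achieved = t_single / t_parallel
    lost = n_threads - achieved                      # ideal speedup is n_threads
    total_stall = sum(stalls.values()) or 1.0
    stack = {name: lost * cycles / total_stall for name, cycles in stalls.items()}
    stack["achieved speedup"] = achieved
    return stack

if __name__ == "__main__":
    stalls = {                     # hypothetical stall-cycle counts from counters
        "shared cache contention": 3.0e9,
        "memory bandwidth": 5.0e9,
        "synchronization": 2.0e9,
    }
    for name, value in speedup_stack(t_single=100e9, t_parallel=20e9,
                                     n_threads=8, stalls=stalls).items():
        print(f"{name:25s} {value:5.2f}")
```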

    Place, publisher, year, edition, pages
    IEEE Computer Society, 2014
    Series
    IEEE International Symposium on Performance Analysis of Systems and Software-ISPASS
    National Category
    Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-224230 (URN)10.1109/ISPASS.2014.6844479 (DOI)000364102000025 ()978-1-4799-3604-5 (ISBN)
    Conference
    ISPASS 2014, March 23-25, Monterey, CA
    Projects
    UPMARC
    Available from: 2014-05-06 Created: 2014-05-06 Last updated: 2018-12-14. Bibliographically approved
  • 212.
    Nikoleris, Nikos
    et al.
    Arm Res, Cambridge, England.
    Eeckhout, Lieven
    Univ Ghent, Ghent, Belgium.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Natl Univ Singapore, Singapore, Singapore.
    Directed Statistical Warming through Time Traveling (2019). In: MICRO'52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, p. 1037-1049. Conference paper (Refereed)
    Abstract [en]

    Improving the speed of computer architecture evaluation is of paramount importance to shorten the time-to-market when developing new platforms. Sampling is a widely used methodology to speed up workload analysis and performance evaluation by extrapolating from a set of representative detailed regions. Installing an accurate cache state for each detailed region is critical to achieving high accuracy. Prior work requires either huge amounts of storage (checkpoint-based warming), an excessive number of memory accesses to warm up the cache (functional warming), or the collection of a large number of reuse distances (randomized statistical warming) to accurately predict cache warm-up effects. This work proposes DeLorean, a novel statistical warming and sampling methodology that builds upon two key contributions: directed statistical warming and time traveling. Instead of collecting a large number of randomly selected reuse distances as in randomized statistical warming, directed statistical warming collects a select number of key reuse distances, i.e., the most recent reuse distance for each unique memory location referenced in the detailed region. Time traveling leverages virtualized fast-forwarding to quickly 'look into the future' - to determine the key cachelines - and then 'go back in time' - to collect the reuse distances for those key cachelines at near-native hardware speed through virtualized directed profiling. Directed statistical warming reduces the number of warm-up references by 30x compared to randomized statistical warming. Time traveling translates this reduction into a 5.7x simulation speedup. In addition to improving simulation speed, DeLorean reduces the prediction error from around 9% to around 3% on average. We further demonstrate how to amortize warm-up cost across multiple parallel simulations in design space exploration studies. Implementing DeLorean in gem5 enables detailed cycle-accurate simulation at a speed of 126 MIPS.

  • 213.
    Nikoleris, Nikos
    et al.
    Arm Research, Cambridge UK.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Department of Computer Science, National University of Singapore.
    Delorean: Virtualized Directed Profiling for Cache Modeling in Sampled Simulation (2018). Report (Other academic)
    Abstract [en]

    Current practice for accurate and efficient simulation (e.g., SMARTS and Simpoint) makes use of sampling to significantly reduce the time needed to evaluate new research ideas. By evaluating a small but representative portion of the original application, sampling can allow for both fast and accurate performance analysis. However, as cache sizes of modern architectures grow, simulation time is dominated by warming microarchitectural state and not by detailed simulation, reducing overall simulation efficiency. While checkpoints can significantly reduce cache warming, improving efficiency, they limit the flexibility of the system under evaluation, requiring new checkpoints for software updates (such as changes to the compiler and compiler flags) and many types of hardware modifications. An ideal solution would allow for accurate cache modeling for each simulation run without the need to generate rigid checkpointing data a priori.

    Enabling this new direction for fast and flexible simulation requires a combination of (1) a methodology that allows for hardware and software flexibility and (2) the ability to quickly and accurately model arbitrarily-sized caches. Current approaches that rely on checkpointing or statistical cache modeling require rigid, up-front state to be collected which needs to be amortized over a large number of simulation runs. These earlier methodologies are insufficient for our goals for improved flexibility. In contrast, our proposed methodology, Delorean, outlines a unique solution to this problem. The Delorean simulation methodology enables both flexibility and accuracy by quickly generating a targeted cache model for the next detailed region on the fly without the need for up-front simulation or modeling. More specifically, we propose a new, more accurate statistical cache modeling method that takes advantage of hardware virtualization to precisely determine the memory regions accessed and to minimize the time needed for data collection while maintaining accuracy.

    Delorean uses a multi-pass approach to understand the memory regions accessed by the next, upcoming detailed region. Our methodology collects the entire set of key memory accesses and, through fast virtualization techniques, progressively scans larger, earlier regions to learn more about these key accesses in an efficient way. Using these techniques, we demonstrate that Delorean allows for the fast evaluation of systems and their software though the generation of accurate cache models on the fly. Delorean outperforms previous proposals by an order of magnitude, with a simulation speed of 150 MIPS and a similar average CPI error (below 4%).

  • 214. Noda, Claro
    et al.
    Prabh, Shashi
    Alves, Mario
    Voigt, Thiemo
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    On Packet Size and Error Correction Optimisations in Low-Power Wireless Networks (2013). In: IEEE International Conference on Sensing, Communication and Networking (IEEE SECON), 2013. Conference paper (Refereed)
  • 215. Noda, Claro
    et al.
    Prabh, Shashi
    Boano, Carlo Alberto
    Voigt, Thiemo
    Alves, Mário
    Poster abstract: A channel quality metric for interference-aware wireless sensor networks (2011). In: IPSN, 2011, p. 167-168. Conference paper (Refereed)
  • 216.
    Norgren, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences.
    Wishbone compliant smart Pulse-Width Modulation (PWM) IP: Uppsala Universitet - ÅAC Microtec AB (2012). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
  • 217.
    Olofsson, Simon
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Systems and Control.
    Probabilistic Feature Learning Using Gaussian Process Auto-Encoders (2016). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    The focus of this report is the problem of probabilistic dimensionality reduction and feature learning from high-dimensional data (images). Extracting features and being able to learn from high-dimensional sensory data is an important ability in a general-purpose intelligent system. Dimensionality reduction and feature learning have in the past primarily been done using (convolutional) neural networks or linear mappings, e.g. in principal component analysis. However, these methods do not yield any error bars in the features or predictions. In this report, theory and a model for how dimensionality reduction and feature learning can be done using Gaussian process auto-encoders (GP-AEs) are presented. By using GP-AEs, the variance in the feature space is computed, thus, yielding a measure of the uncertainty in the constructed model. This measure is useful in order to avoid making over-confident system predictions. Results show that GP-AEs are capable of dimensionality reduction and feature learning, but that they suffer from scalability issues and problems with weak gradient signal propagation. Results in reconstruction quality are not as good as those achieved by state-of-the-art methods, and it takes very long to train the model. The model has potential though, since it can scale to large inputs.
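
    For readers unfamiliar with why Gaussian processes give error bars, the following self-contained sketch (standard GP regression with an RBF kernel, not the thesis's GP-AE model; the toy data and hyperparameters are my own) maps a 1-D "latent code" to observations and reports a predictive mean and standard deviation; the variance grows away from the training data, which is exactly the uncertainty measure the thesis exploits.

```python
# Standard GP regression posterior with an RBF kernel, written out with numpy only.
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_posterior(z_train, y_train, z_test, noise=1e-2):
    K = rbf(z_train, z_train) + noise * np.eye(len(z_train))
    Ks = rbf(z_train, z_test)
    Kss = rbf(z_test, z_test)
    K_inv = np.linalg.inv(K)
    mean = Ks.T @ K_inv @ y_train
    cov = Kss - Ks.T @ K_inv @ Ks
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))   # mean and std per test point

if __name__ == "__main__":
    z_train = np.array([-2.0, -0.5, 0.0, 1.0])     # latent codes (toy data)
    y_train = np.sin(z_train)                       # corresponding observed values
    z_test = np.linspace(-3, 3, 7)
    mean, std = gp_posterior(z_train, y_train, z_test)
    for z, m, s in zip(z_test, mean, std):
        print(f"z={z:+.1f}  prediction={m:+.2f} ± {s:.2f}")   # larger ± far from data
```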

  • 218.
    Oltner, Alexander Mac
    Gotland University, School of Game Design, Technology and Learning Processes.
    Att överföra en turordningsbaserad spelprototyp till realtid: ett projekt rörande Victorious Skies och dess utveckling [Transferring a turn-based game prototype to real time: a project on Victorious Skies and its development] (2011). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    This project details the process of converting and transferring a turn-based paper prototype to a digital real-time format. The project's goals were to see how well the original feeling could be transferred to real time and how the transition itself went. The project was completed with the help of the programmer Mikael Gullberg. The practical part of the project was carried out between 25/4 and 2/5. This project is a part of the larger Victorious Skies project. During this project, values were converted and properties from the turn-based paper prototype were reformatted for use in real time. This has been a very interesting and rewarding project that has challenged us and presented numerous choices about how we wanted to execute the conversions. The work was organized with the help of a priority list. I have had sole responsibility for design choices, while Mikael Gullberg has provided the programming knowledge needed to convert the prototype to a digital format. The result of this project is a real-time digital prototype that serves as Victorious Skies' first such prototype. Knowledge has been gathered about conversions of this kind through practical testing. The digital prototype is true to the original in such a way that a clear connection can be seen between the two, though they differ in several key aspects due to the changes the real-time format brought with it.

    I have been able to arrive at several conclusions during this project. The most important conclusion is that there is no way to keep the exact original feeling, as the real-time format simply brings too many new factors into play. The formula used to transfer games from turn-based to real time is not simple and requires a lot of thought. To be done right, attention must be given to the minute details, which makes the process of converting both challenging and entertaining.

  • 219.
    Orfanidis, Charalampos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Robustness in low power wide area networks (2018). Licentiate thesis, comprehensive summary (Other academic)
    Abstract [en]

    During the past few years we have witnessed an emergence of Wide Area Networks in the Internet of Things area. There are several new technologies like LoRa, Wi-SUN, Sigfox, that offer long range communication and low power for low-bitrate applications. These new technologies enable new application scenarios, such as smart cities, smart agriculture, and many more. However, when these networks co-exist in the same frequency band, they may cause problems to each other since they are heterogeneous and independent. Therefore it is very likely to have frame collisions between the different networks.

    In this thesis we first explore how tolerant these networks are to Cross Technology Interference (CTI). CTI can be described as the interference from heterogeneous wireless technologies that share the same frequency band and is able to affect the robustness and reliability of the network. In particular, we select two of them, LoRa and Wi-SUN and carry out a series of experiments with real hardware using several configurations. In this way, we quantify the tolerance of cross technology interference of each network against the other as well as which configuration settings are important.

    The next thing we explored is how well channel sensing mechanisms can detect the other network technologies and how they can be improved. To explore these aspects, we used the default Clear Channel Assessment (CCA) mechanism of Wi-SUN against LoRa interference and evaluated how accurate it is. We also improved this mechanism in order to achieve higher detection accuracy against LoRa interference.

    Finally, we propose an architecture for WSNs that enables flexible reconfiguration of the nodes. The idea is based on Software Defined Networking (SDN) principles and could help in our case by reconfiguring a node in order to mitigate the cross-technology interference from other networks.
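
    As a toy illustration of the energy-detection CCA discussed above (my own sketch, not the Wi-SUN stack's implementation; the thresholds and RSSI values are made up), a channel is declared busy when the RSSI measured during the assessment window exceeds a threshold, which is why a weak LoRa transmission can be missed with an insensitive threshold.

```python
# Toy energy-detection Clear Channel Assessment: busy if any RSSI sample in the
# assessment window exceeds the threshold. Values below are hypothetical.
def cca_busy(rssi_samples_dbm, threshold_dbm=-75.0):
    """Return True if the channel should be treated as busy."""
    return max(rssi_samples_dbm) > threshold_dbm

if __name__ == "__main__":
    idle_window = [-95.2, -94.8, -96.1]             # hypothetical RSSI readings (dBm)
    lora_window = [-82.0, -79.5, -81.3]             # weak but present LoRa transmission
    print(cca_busy(idle_window))                    # False
    print(cca_busy(lora_window))                    # False with -75 dBm threshold: missed
    print(cca_busy(lora_window, threshold_dbm=-85)) # True with a more sensitive threshold
```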

    List of papers
    1. Investigating interference between LoRa and IEEE 802.15.4g networks
    2017 (English). In: Proc. 13th International Conference on Wireless and Mobile Computing, Networking and Communications, IEEE, 2017, p. 441-448. Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    IEEE, 2017
    National Category
    Communication Systems
    Identifiers
    urn:nbn:se:uu:diva-331851 (URN)10.1109/WiMOB.2017.8115772 (DOI)000419818000061 ()978-1-5386-3839-2 (ISBN)
    Conference
    WiMob 2017, October 9–11, Rome, Italy
    Available from: 2017-11-23 Created: 2017-10-18 Last updated: 2018-05-31Bibliographically approved
    2. Improving LoRa/IEEE 802.15.4g co-existence
    (English)Manuscript (preprint) (Other academic)
    National Category
    Communication Systems
    Identifiers
    urn:nbn:se:uu:diva-351504 (URN)
    Available from: 2018-05-28 Created: 2018-05-28 Last updated: 2018-05-31
    3. Using software-defined networking principles for wireless sensor networks
    2015 (English)In: Proc. 11th Swedish National Computer Networking Workshop, 2015Conference paper, Published paper (Refereed)
    National Category
    Computer Systems
    Identifiers
    urn:nbn:se:uu:diva-254172 (URN)
    Conference
    SNCNW 2015, May 28–29, Karlstad, Sweden
    Projects
    ProFuN
    Funder
    Swedish Foundation for Strategic Research, RIT08-0065
    Available from: 2015-06-05 Created: 2015-06-05 Last updated: 2018-05-31Bibliographically approved
    Download full text (pdf)
    fulltext
  • 220.
    Pan, Xiaoyue
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Jonsson, Bengt
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    A Modeling Framework for Reuse Distance-based Estimation of Cache Performance2015In: Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium on, IEEE, 2015, p. 62-71Conference paper (Refereed)
    Abstract [en]

    We develop an analytical modeling framework for efficient prediction of cache miss ratios based on reuse distance distributions. The only input needed for our predictions is the reuse distance distribution of a program execution: previous work has shown that it can be obtained with very small overhead by sampling from native executions. This should be contrasted with previous approaches that base predictions on stack distance distributions, whose collection needs significantly larger overhead or additional hardware support. The predictions are based on a uniform modeling framework which can be specialized for a variety of cache replacement policies, including Random, LRU, PLRU, and MRU (a.k.a. bit-PLRU), and for arbitrary values of cache size and cache associativity. We evaluate our modeling framework with the SPEC CPU 2006 benchmark suite over a set of cache configurations with varying cache size, associativity and replacement policy. The introduced inaccuracies were generally below 1% for the model of the policy, and additionally around 2% when set-local reuse distances must be estimated from global reuse distance distributions. The inaccuracy introduced by sampling is significantly smaller.
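    To make the idea concrete, one textbook-style specialization of such a model is the case of random replacement: a block reused after d intervening cache-line accesses survives in a cache of L lines with probability roughly (1 - 1/L)^d, so the miss ratio follows directly from the reuse distance distribution. The sketch below implements only this simplification and is not the authors' framework; the distribution and cache size are invented example values.

```python
# Illustrative sketch: predicting the miss ratio of a fully associative cache
# with random replacement from a reuse distance distribution.
# This is a textbook-style simplification, not the paper's full framework.

def miss_ratio_random(reuse_dist, cache_lines):
    """reuse_dist maps a reuse distance (number of intervening cache-line
    accesses) to its probability; the key None stands for cold accesses."""
    p_stay = 1.0 - 1.0 / cache_lines   # chance one access leaves the block in place
    miss = 0.0
    for d, p in reuse_dist.items():
        if d is None:
            miss += p                   # cold (first-time) accesses always miss
        else:
            miss += p * (1.0 - p_stay ** d)
    return miss

# Invented example distribution: short reuses, long reuses, and cold accesses.
dist = {10: 0.7, 5000: 0.2, None: 0.1}
print(f"predicted miss ratio: {miss_ratio_random(dist, cache_lines=4096):.3f}")
```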

  • 221.
    Pan, Xiaoyue
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Jonsson, Bengt
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Modeling cache coherence misses on multicores2014In: 2014 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS), IEEE, 2014, p. 96-105Conference paper (Refereed)
    Abstract [en]

    While maintaining the coherency of private caches, invalidation-based cache coherence protocols introduce cache coherence misses. We address the problem of predicting the number of cache coherence misses in the private cache of a parallel application when running on a multicore system with an invalidation-based cache coherence protocol. We propose three new performance models (uniform, phased and symmetric) for estimating the number of coherence misses from information about inter-core data sharing patterns and the individual core's data reuse patterns. The inputs to the uniform and phased models are the write frequency and reuse distance distribution of shared data from different cores. This input can be obtained either from profiling the target application on a single core or by analyzing the data access pattern statically, and does not need a detailed simulation of the pattern of interleaving accesses to shared data. The output of the models is an estimated number of coherence misses of the target application. The output can be combined with the number of other kinds of misses to estimate the total number of misses in each core's private cache. This output can also be used to guide program optimization to improve cache performance. We evaluate our models with a set of benchmarks from the PARSEC benchmark suite on real hardware.

  • 222.
    Paçacı, Görkem
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Social Sciences, Department of Informatics and Media, Information Systems.
    Hamfelt, Andreas
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Social Sciences, Department of Informatics and Media, Information Systems.
    A Visual System for Compositional Relational Programming2013Conference paper (Refereed)
    Abstract [en]

    Combilog is a compositional relational programming language that allows writing relational logic programs by functionally composing relational predicates. Higraphs, a diagram formalism, are used to reduce some of the textual complexity of compositional relational programming and to achieve a visual system that can represent these declarative meta-programs, with the ultimate aim of designing an intuitive, visually assisted, complete development practice. As a proof of concept, an implementation of a two-way parser/visualizer is presented.

    Download full text (pdf)
    fulltext
  • 223.
    Perais, Arthur
    et al.
    IRISA INRIA, Rennes, France.
    Seznec, André
    IRISA INRIA, Rennes, France.
    Michaud, Pierre
    IRISA INRIA, Rennes, France.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Cost-effective speculative scheduling in high performance processors2015In: Proc. 42nd International Symposium on Computer Architecture, New York: ACM Press, 2015, p. 247-259Conference paper (Refereed)
    Abstract [en]

    To maximize performance, out-of-order execution processors sometimes issue instructions without having the guarantee that operands will be available in time; e.g. loads are typically assumed to hit in the L1 cache and dependent instructions are issued accordingly. This form of speculation - that we refer to as speculative scheduling - has been used for two decades in real processors, but has received little attention from the research community. In particular, as pipeline depth grows, and the distance between the Issue and the Execute stages increases, it becomes critical to issue instructions dependent on variable-latency instructions as soon as possible rather than wait for the actual cycle at which the result becomes available. Unfortunately, due to the uncertain nature of speculative scheduling, the scheduler may wrongly issue an instruction that will not have its source(s) available on the bypass network when it reaches the Execute stage. In that event, the instruction is canceled and replayed, potentially impairing performance and increasing energy consumption. In this work, we do not present a new replay mechanism. Rather, we focus on ways to reduce the number of replays that are agnostic of the replay scheme. First, we propose an easily implementable, low-cost solution to reduce the number of replays caused by L1 bank conflicts. Schedule shifting always assumes that, given a dual-load issue capacity, the second load issued in a given cycle will be delayed because of a bank conflict. Its dependents are thus always issued with the corresponding delay. Second, we also improve on existing L1 hit/miss prediction schemes by taking into account instruction criticality. That is, for some criterion of criticality and for loads whose hit/miss behavior is hard to predict, we show that it is more cost-effective to stall dependents if the load is not predicted critical.

  • 224.
    Persson, Måns
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Waern, Tom
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Automatic adjustments of NC programs in machining centers2018Independent thesis Advanced level (professional degree), 20 credits / 30 HE creditsStudent thesis
    Abstract [en]

    The goal of this master thesis was to automate the compensation of NC programs. Automatic compensation can reduce errors and make production more efficient, which is vital for increased precision and for meeting the quality demands of the market. The project started with a study of how the feedback loop between production and measurements was handled at the time, and of how the data could be sent between the different machines. This was done by researching solutions to similar problems and by interviewing the machine operators. Simulations of how automation could be done with more in-depth measurements of the production machine were also made. The limitations were also evaluated: research was done on errors and practical flaws which could be problematic for automation. The automation was implemented using Java to send the data from the measuring machine to the production machine. Furthermore, a UI was created for the machine operators so that the information flow was under supervision at all times. The UI would suggest a compensation from a pre-programmed algorithm together with the measurement data, and the operator could then decide whether or not to diverge from the suggested compensation.

    Download full text (pdf)
    fulltext
  • 225.
    Persson, Tobias
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Fredlund, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Motor control under strong vibrations2018Independent thesis Advanced level (professional degree), 20 credits / 30 HE creditsStudent thesis
    Download full text (pdf)
    fulltext
  • 226.
    Popov, Mihail
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computing Science. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Efficient thread/page/parallelism autotuning for NUMA systems2019In: International Conference on Supercomputing / [ed] ACM, New York, NY, USA: Association for Computing Machinery (ACM), 2019, p. 12Conference paper (Refereed)
    Abstract [en]

    Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Access (NUMA) effects: memory performance depends on the location of the data and the thread. This complexity means that thread- and data-mappings have a significant impact on performance. However, it is hard to find efficient data mappings and thread configurations due to the complex interactions between applications and systems.

    In this paper we explore the combined search space of thread mappings, data mappings, number of NUMA nodes, and degree of parallelism, per application phase, and across multiple systems. We show that there are significant performance benefits from optimizing this wide range of parameters together. However, such an optimization presents two challenges: accurately modeling the performance impact of configurations across applications and systems, and exploring the vast space of configurations. To overcome the modeling challenge, we use native execution of small, representative codelets, which reproduce the system and application interactions. To make the search practical, we build a search space by combining a range of state-of-the-art thread- and data-mapping policies.

    Combining these two approaches results in a tractable search space that can be quickly and accurately evaluated without sacrificing significant performance. This search finds non-intuitive configurations that perform significantly better than previous works. With this approach we are able to achieve an average speedup of 1.97× on a four-node NUMA system.
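    Schematically, the search described here enumerates combinations of thread mapping, data mapping, node count and thread count, and scores each configuration by timing a small representative codelet. The sketch below only shows that enumeration; the policy names are assumed examples and run_codelet() is a placeholder for the paper's native codelet measurements.

```python
# Schematic sketch of the combined configuration search; policies and the
# scoring function are placeholders, not the paper's implementation.
import itertools

THREAD_MAPPINGS = ["compact", "scatter"]        # assumed example policies
DATA_MAPPINGS = ["first-touch", "interleave"]   # assumed example policies
NUMA_NODES = [1, 2, 4]
THREAD_COUNTS = [8, 16, 32]

def run_codelet(config):
    """Placeholder: a real implementation would pin threads, set the NUMA
    memory policy, run a representative codelet natively and return its
    wall-clock time for this configuration."""
    return (hash(config) % 1000) / 1000.0 + 0.1   # arbitrary stand-in "runtime"

configs = itertools.product(THREAD_MAPPINGS, DATA_MAPPINGS, NUMA_NODES, THREAD_COUNTS)
best = min(configs, key=run_codelet)
print("best (thread map, data map, nodes, threads):", best)
```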

    Download full text (pdf)
    fulltext
  • 227.
    Qiu, Lanxin
    et al.
    Beijing Univ Technol, Beijing, Peoples R China.
    Huang, Zhuangqin
    Beijing Univ Technol, Beijing, Peoples R China.
    Wirström, Niklas
    SICS Swedish ICT, Kista, Sweden.
    Voigt, Thiemo
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. SICS Swedish ICT, Kista, Sweden.
    3DinSAR: Object 3D Localization for Indoor RFID Applications2016In: IEEE RFID, 2016, p. 191-198Conference paper (Refereed)
    Abstract [en]

    More and more objects can be identified and sensed with RFID tags. Existing schemes for 2D indoor localization have achieved impressive accuracy. In this paper we propose an accurate 3D localization scheme for objects. Our scheme leverages spatial-domain phase differences to estimate the height of objects, inspired by the phase-based Interferometric Synthetic Aperture Radar (InSAR) height determination theory. We further leverage a density-based spatial clustering method to choose the most likely position and show that it improves the accuracy. Our localization method does not need any reference tags. Only one antenna is required to move in a known way in order to construct the synthetic arrays to implement the locating system. We present experimental results from an indoor office environment with EPC C1G2 passive tags and a COTS RFID reader. Our 3D experiments demonstrate a spatial median error of 0.24 m. This novel 3D localization scheme is a simple, yet promising, solution. We believe that it is especially applicable for both portable readers and transport vehicles.
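    The underlying geometry can be pictured as follows: an RFID backscatter phase reading is (4π·d/λ) mod 2π for reader-to-tag distance d, so the phase difference between two known antenna positions constrains the tag height. The toy example below recovers a height by grid search from two synthetic phase readings; the frequency, geometry and search range are invented, and the paper's synthetic-array construction and density-based clustering (which resolve the wrap-around ambiguities a single baseline leaves) are omitted.

```python
# Illustrative sketch of InSAR-style height estimation from RFID phase readings.
# All values are made-up examples; this is not the 3DinSAR system itself.
import math

C = 3e8
FREQ = 920e6                    # assumed UHF RFID carrier frequency
LAM = C / FREQ

def backscatter_phase(antenna, tag):
    """Phase of a backscattered reading: the signal travels reader->tag->reader,
    i.e. twice the distance (constant reader/tag phase offsets are ignored)."""
    return (4 * math.pi * math.dist(antenna, tag) / LAM) % (2 * math.pi)

def circular_diff(a, b):
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

def estimate_height(phase_low, phase_high, ant_low, ant_high, xy=(2.0, 0.0)):
    """Grid-search the tag height that best explains the phase difference
    between two antenna positions (the spatial-domain InSAR idea)."""
    measured = (phase_high - phase_low) % (2 * math.pi)
    best_h, best_err = None, float("inf")
    for i in range(301):                          # candidate heights 0.00 .. 3.00 m
        h = i / 100.0
        tag = (xy[0], xy[1], h)
        predicted = (backscatter_phase(ant_high, tag) -
                     backscatter_phase(ant_low, tag)) % (2 * math.pi)
        err = circular_diff(predicted, measured)
        if err < best_err:
            best_h, best_err = h, err
    return best_h

# Synthetic check: generate phases for a tag at height 1.2 m, then recover it.
ant_low, ant_high = (0.0, 0.0, 1.0), (0.0, 0.0, 1.5)
true_tag = (2.0, 0.0, 1.2)
h = estimate_height(backscatter_phase(ant_low, true_tag),
                    backscatter_phase(ant_high, true_tag), ant_low, ant_high)
print(f"estimated tag height: {h:.2f} m")
```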

  • 228.
    Rademacher, Frans
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Larsson, Per
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Lundberg, Oskar
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Electronics.
    Praktisk konstruktion av 8-bitarsdator2019Independent thesis Basic level (degree of Bachelor of Fine Arts), 10 credits / 15 HE creditsStudent thesis
    Abstract [sv]

    An 8-bit computer is old technology in today's society. It can hardly compete with modern computers, which work faster and with larger numbers. Nevertheless, building an 8-bit computer part by part still gives great insight into how computers in general are constructed. With background knowledge of basic digital electronics, the individual modules can be understood, which in turn leads to an understanding of the computer as a whole. This project therefore revolved around constructing an 8-bit computer. After the end of the project, the computer is to remain available for use in the teaching of digital electronics. The 8-bit computer comprises several modules that can each be simulated in software and built separately, after which all modules could be assembled. The computer can easily be programmed to run different programs and can, with the help of so-called flags, jump in the program code in order to repeat code. The resulting computer has some potential for improvement but works well according to expectations. With strategic choices of cable colors and a large number of LEDs, the computer became easier to understand and examine.

    Download full text (pdf)
    fulltext
  • 229.
    Romeo, Luca
    et al.
    Univ Politecn Marche, Dept Informat Engn DII, Via Brecce Blanche 12, I-60131 Ancona, Italy;Fdn Ist Italiano Tecnol Genova, Dept Cognit Mot & Neurosci & Computat Stat & Mach, Genoa, Italy.
    Loncarski, Jelena
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Electricity.
    Paolanti, Marina
    Univ Politecn Marche, Dept Informat Engn DII, Via Brecce Blanche 12, I-60131 Ancona, Italy.
    Bocchini, Gianluca
    Digital Prod Specialist Xelexia Srl, Pesaro, Italy.
    Mancini, Adriano
    Univ Politecn Marche, Dept Informat Engn DII, Via Brecce Blanche 12, I-60131 Ancona, Italy.
    Frontoni, Emanuele
    Univ Politecn Marche, Dept Informat Engn DII, Via Brecce Blanche 12, I-60131 Ancona, Italy.
    Machine learning-based design support system for the prediction of heterogeneous machine parameters in industry 4.02020In: Expert systems with applications, ISSN 0957-4174, E-ISSN 1873-6793, Vol. 140, article id 112869Article in journal (Refereed)
    Abstract [en]

    In the engineering practice it frequently occurs that designers and final or intermediate users have to roughly estimate some basic performance or specification data on the basis of the input data available at the moment, which can be time-consuming. There is a need for a tool that fills this gap in the optimization problems of engineering design processes by making use of the advances in the artificial intelligence field. This paper aims to fill this gap by introducing an innovative Design Support System (DesSS), originating from the Decision Support System, for the prediction and estimation of machine specification data such as machine geometry and machine design on the basis of heterogeneous input parameters. As the main core of the developed DesSS, we introduce different machine learning (ML) approaches based on Decision/Regression Trees, k-Nearest Neighbors, and Neighborhood Component Feature Selection. Experimental results obtained on a real use case, using two different real datasets, demonstrate the reliability and effectiveness of the proposed approach. The innovative machine-learning-based DesSS, meant to support design choices, can bring various benefits such as easier decision-making, conservation of the company's knowledge, savings in man-hours, and higher computational speed and accuracy.
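    As a rough sketch of the model family involved (not the paper's system or datasets), a k-Nearest-Neighbors model and a regression tree can be trained to map a handful of heterogeneous machine parameters to one design quantity. All feature names, values and the target below are invented for illustration.

```python
# Minimal sketch of DesSS-style prediction with off-the-shelf models.
# The features, targets and tiny dataset are synthetic placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Assumed inputs: [rated power (kW), rated speed (rpm), cost class, market code]
X = np.array([[ 10, 1500, 1, 0],
              [ 25, 3000, 2, 1],
              [ 50, 1500, 2, 0],
              [ 75, 1000, 3, 1],
              [120, 3000, 3, 0]], dtype=float)
# Assumed target: one geometry parameter, e.g. a stator outer diameter in mm
y = np.array([160, 210, 280, 340, 400], dtype=float)

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

query = np.array([[60, 1500, 2, 0]], dtype=float)   # a new machine specification
print("kNN estimate :", knn.predict(query)[0])
print("tree estimate:", tree.predict(query)[0])
```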

  • 230.
    Romeo, Luca
    et al.
    Univ Politecn Marche, Dept Informat Engn, Ancona, Italy;Fdn Ist Italiano Tecnol Genova, Cognit Mot & Neurosci & Computat Stat & Machine L, Genoa, Italy.
    Paolanti, Marina
    Univ Politecn Marche, Dept Informat Engn, Ancona, Italy.
    Bocchini, Gianluca
    Xelexia Srl, Pesaro, Italy.
    Loncarski, Jelena
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Electricity.
    Frontoni, Emanuele
    Univ Politecn Marche, Dept Informat Engn, Ancona, Italy.
    An Innovative Design Support System for Industry 4.0 Based on Machine Learning Approaches2018In: 2018 5TH INTERNATIONAL SYMPOSIUM ON ENVIRONMENT-FRIENDLY ENERGIES AND APPLICATIONS (EFEA) / [ed] Bruzzese, C Santini, E Digennaro, S, IEEE , 2018Conference paper (Refereed)
    Abstract [en]

    Electric machines together with power electronic converters are the major components in industrial and automotive applications. A frequent situation in engineering practice is that designers and final or intermediate users have to roughly estimate some basic performance data, specification data or other metrics related to their specific task on the basis of the few data available at a particular instant or at the time of use. This paper addresses this problem in the Industry 4.0 scenario by introducing an innovative Design Support System (DesSS), originated from the Decision Support System (DSS), for the prediction and estimation of machine specification data such as machine geometry and machine design on the basis of other heterogeneous parameters (i.e. motor performance, field of application, geographic market, and range of cost). For the development of the DesSS, different machine learning techniques were compared, such as Decision/Regression Trees (DT/RT), Nearest Neighbors (NN), and Neighborhood Component Feature Selection (NCFS). Experimental results obtained on the real use case demonstrated the appropriateness of machine learning approaches as the main core of a DesSS used for the estimation of machine parameters. In particular, the results show high reliability in terms of accuracy and macro-F1 score of 1-NN+NCFS and RT for solving the classification and regression tasks, respectively. This approach can viably replace the model-based tools used for parameter prediction, being more accurate and computationally faster.

  • 231.
    Ros, Alberto
    et al.
    Univ Murcia, Dept Comp Engn, E-30001 Murcia, Spain.
    Davari, Mahdad
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies2015In: Proc. 21st International Symposium on High Performance Computer Architecture, IEEE Computer Society Digital Library, 2015, p. 186-197Conference paper (Refereed)
    Abstract [en]

    Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design pattern, keeping the whole hierarchy coherent requires more effort and consideration. The reason is that, in hierarchical coherence, even basic operations must be recursive. As a consequence, intermediate-level caches behave both as directories and as leaf caches. This leads to an explosion of states, protocol races, and protocol complexity. While there have been previous efforts to extend directory-based coherence to hierarchical designs, their increased complexity and verification cost is a serious impediment to their adoption. We aim to address these concerns by encapsulating all hierarchical complexity in a simple function: that of determining when a data block is shared entirely within a cluster (a sub-tree of the hierarchy) and is private from the outside. This allows us to eliminate complex recursive operations that span the hierarchy and instead employ simple coherence mechanisms such as self-invalidation and write-through, now restricted to operate within the cluster where a data block is shared. We examine two inclusivity options and discuss the relation of our approach to the recently proposed Hierarchical-Race-Free (HRF) memory models. Finally, comparisons to hierarchical directory-based MOESI, VIPS-M, and TokenCMP protocols show that, despite its simplicity, our approach results in competitive performance and decreased network traffic.

  • 232. Ros, Alberto
    et al.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Non-Speculative Store Coalescing in Total Store Order2018In: Proc. 45th International Symposium on Computer Architecture, IEEE, 2018, p. 221-234Conference paper (Refereed)
    Abstract [en]

    We present a non-speculative solution for a coalescing store buffer in total store order (TSO) consistency. Coalescing violates TSO with respect to both conflicting loads and conflicting stores, if partial state is exposed to the memory system. Proposed solutions for coalescing in TSO resort to speculation-and-rollback or centralized arbitration to guarantee atomicity for the set of stores whose order is affected by coalescing. These solutions can suffer from scalability, complexity, resource-conflict deadlock, and livelock problems. A non-speculative solution that writes out coalesced cachelines, one at a time, over a typical directory-based MESI coherence layer, has the potential to transcend these problems if it can guarantee absence of deadlock in a practical way. There are two major problems for a non-speculative coalescing store buffer: i) how to present to the memory system a group of coalesced writes as atomic, and ii) how to not deadlock while attempting to do so. For this, we introduce a new lexicographical order. Relying on this order, conflicting atomic groups of coalesced writes can be individually performed per cache block, without speculation, rollback, or replay, and without deadlock or livelock, yet appear atomic to conflicting parties and preserve TSO. One of our major contributions is to show that lexicographical orders based on a small part of the physical address (sub-address order) are deadlock-free throughout the system when taking into account resource-conflict deadlocks. Our approach exceeds the performance and energy benefits of two baseline TSO store buffers and matches the coalescing (and energy savings) of a release-consistency store buffer, at comparable cost.
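    The deadlock-freedom argument rests on every core draining conflicting coalesced blocks in one global order derived from a few physical-address bits. The sketch below illustrates only that ordering idea, outside of any coherence machinery: a group of coalesced blocks is written out sorted by a sub-address key, the classic way to rule out circular waits. The block size and the number of key bits are assumed example values, not the paper's parameters.

```python
# Illustrative sketch of the sub-address (lexicographical) ordering idea only;
# the coherence protocol and store-buffer hardware around it are omitted.
BLOCK_BITS = 6          # 64-byte cache blocks (assumed)
SUB_ADDR_BITS = 4       # address bits forming the lexicographical key (assumed)

def sub_address_key(block_addr):
    """A key built from a small part of the physical address."""
    return (block_addr >> BLOCK_BITS) & ((1 << SUB_ADDR_BITS) - 1)

def write_out_order(coalesced_blocks):
    """Order in which a coalesced group is drained. Every agent using the same
    key acquires conflicting blocks in the same global order, which rules out
    circular waits; ties are broken by full address to keep the order total."""
    return sorted(coalesced_blocks, key=lambda addr: (sub_address_key(addr), addr))

group = [0x7f3a40, 0x10080, 0x7f3a80, 0x200c0]
print([hex(a) for a in write_out_order(group)])
```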

    Download full text (pdf)
    fulltext
  • 233.
    Ros, Alberto
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    The Superfluous Load Queue2018In: 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, 2018, p. 95-107Conference paper (Refereed)
    Abstract [en]

    In an out-of-order core, the load queue (LQ), the store queue (SQ), and the store buffer (SB) are responsible for ensuring: i) correct forwarding of stores to loads and ii) correct ordering among loads (with respect to external stores). The first requirement safeguards the sequential semantics of program execution and applies to both serial and parallel code; the second requirement safeguards the semantics of coherence and consistency (e.g., TSO). In particular, loads search the SQ/SB for the latest value that may have been produced by a store, and stores and invalidations search the LQ to find speculative loads in case they violate uniprocessor or multiprocessor ordering. To meet timing constraints the LQ and SQ/SB system is composed of CAM structures that are frequently searched. This results in high complexity, cost, and significant difficulty to scale, but is the current state of the art. Prior research demonstrated the feasibility of a non-associative LQ by replaying loads at commit. There is a steep cost however: a significant increase in L1 accesses and contention for L1 ports. This is because prior work assumes Sequential Consistency and completely ignores the existence of a SB in the system. In contrast, we intentionally delay stores in the SB to achieve a total management of stores and loads in a core, while still supporting TSO. Our main result is that we eliminate the LQ without burdening the L1 with extra accesses. Store forwarding is achieved by delaying our own stores until speculatively issued loads are validated on commit, entirely in-core; TSO load -> load ordering is preserved by delaying remote external stores in their SB until our own speculative reordered loads commit. While the latter is inspired by recent work on non-speculative load reordering, our contribution here is to show that this can be accomplished without having a load queue. Eliminating the LQ results in both energy savings and performance improvement from the elimination of LQ-induced stalls.

    Download full text (pdf)
    fulltext
  • 234. Rostampour, Vahab
    et al.
    Ferrari, Riccardo
    Teixeira, André M.H.
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Signals and Systems Group.
    Keviczky, Tamás
    Differentially-Private Distributed Fault Diagnosis for Large-Scale Nonlinear Uncertain Systems2018In: IFAC-PapersOnLine, ISSN 2405-8963, Vol. 51, no 24, p. 975-982Article in journal (Refereed)
    Abstract [en]

    Distributed fault diagnosis has been proposed as an effective technique for monitoring large-scale, nonlinear and uncertain systems. It is based on decomposing the large-scale system into a number of interconnected subsystems, each one monitored by a dedicated Local Fault Detector (LFD). In order to successfully account for the subsystem interconnections, neighboring LFDs are thus required to communicate to each other some of the measurements from their subsystems. However, such communication may expose private information of a given subsystem, such as its local input. To avoid this problem, we propose here to use differential privacy to pre-process data before transmission.
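    A minimal sketch of that pre-processing step, assuming the standard Laplace mechanism is used (the actual mechanism, sensitivity and privacy budget in the paper may differ): before a local fault detector shares a measurement with a neighboring LFD, it perturbs the value with noise scaled to sensitivity/epsilon.

```python
# Minimal sketch of differentially private pre-processing via the Laplace
# mechanism. Sensitivity and epsilon are placeholder values, not the paper's.
import numpy as np

def privatize(value, sensitivity=1.0, epsilon=0.5, rng=np.random.default_rng(0)):
    """Laplace mechanism: add noise with scale = sensitivity / epsilon."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

local_input = 3.7                 # e.g. a subsystem's local control input
print(f"value sent to the neighboring LFD: {privatize(local_input):.3f}")
```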

  • 235.
    Sakalis, Christos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Själander, Magnus
    Norwegian University of Science and Technology.
    Evaluating the Potential Applications of Quaternary Logic for Approximate Computing2019In: ACM Journal on Emerging Technologies in Computing Systems (JETC), ISSN 1550-4832, Vol. 16, no 1, article id 5Article in journal (Refereed)
    Abstract [en]

    There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today’s complementary metal--oxide--semiconductor technologies. A common feature among the investigated technologies is that of multi-level devices, particularly the possibility of implementing quaternary logic gates and memory cells. However, for such multi-level devices to be used reliably, an increase in energy dissipation and operation time is required. Building on the principle of approximate computing, we present a set of combinational logic circuits and memory based on multi-level logic gates in which we can trade reliability against energy efficiency. Keeping the energy and timing constraints constant, important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We analyze the behavior of the logic circuits when exposed to transient errors caused as a side effect of this encoding. We also evaluate the potential benefit of the logic circuits and memory by embedding them in a conventional computer system on which we execute jpeg, sobel, and blackscholes approximately. We demonstrate that blackscholes is not suitable for such a system and explain why. However, we also achieve dynamic energy reductions of 10% and 13% for jpeg and sobel, respectively, and improve execution time by 38% for sobel, while maintaining adequate output quality.

  • 236.
    Sakalis, Christos
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Ros, Alberto
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Jimborean, Alexandra
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computing Science. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Understanding Selective Delay as a Method for Efficient Secure Speculative Execution. In: Article in journal (Refereed)
    Abstract [en]

    Since the introduction of Meltdown and Spectre, the academic and industry research communities have been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need to be general in nature, covering a wide array of system components, while at the same time keeping the performance, energy, area, and implementation complexity costs at a minimum. One such solution is our own delay-on-miss, which efficiently protects the memory hierarchy by i) selectively delaying speculative load instructions and ii) utilizing value prediction as an invisible form of speculation. In this work we dive deeper into delay-on-miss, offering insights into why and how it affects the performance of the system. We also reevaluate value prediction as an invisible form of speculation. Specifically, we focus on the implications that delaying memory loads has in the memory level parallelism of the system and how this affects the value predictor and the overall performance of the system. We present new, updated results but more importantly, we also offer deeper insight into why delay-on-miss works so well and what this means for the future of secure speculative execution.

  • 237.
    Salles, Arleen
    et al.
    Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Public Health and Caring Sciences, Centre for Research Ethics and Bioethics. Centro de Investigaciones Filosoficas (CIF),F), Buenos Aires, Argentina.
    Evers, Kathinka
    Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Public Health and Caring Sciences, Centre for Research Ethics and Bioethics.
    Farisco, Michele
    Uppsala University, Disciplinary Domain of Medicine and Pharmacy, Faculty of Medicine, Department of Public Health and Caring Sciences, Centre for Research Ethics and Bioethics. Biogem, Biology and Molecular Genetics Institute, Ariano Irpino, Camporeale, Italy.
    Anthropomorphism in AI2020In: AJOB Neuroscience, ISSN 2150-7740, E-ISSN 2150-7759, Vol. 11, no 2, p. 88-95Article in journal (Refereed)
    Abstract [en]

    AI research is growing rapidly, raising various ethical issues related to safety, risks, and other effects widely discussed in the literature. We believe that in order to adequately address those issues and engage in a productive normative discussion it is necessary to examine key concepts and categories. One such category is anthropomorphism. It is a well-known fact that AI’s functionalities and innovations are often anthropomorphized (i.e., described and conceived as characterized by human traits). The general public’s anthropomorphic attitudes and some of their ethical consequences (particularly in the context of social robots and their interaction with humans) have been widely discussed in the literature. However, how anthropomorphism permeates AI research itself (i.e., the very language of computer scientists, designers, and programmers), and what the epistemological and ethical consequences of this might be, have received less attention. In this paper we explore this issue. We first set the methodological/theoretical stage, making a distinction between a normative and a conceptual approach to the issues. Next, after a brief analysis of anthropomorphism and its manifestations in the public, we explore its presence within AI research with a particular focus on brain-inspired AI. Finally, on the basis of our analysis, we identify some potential epistemological and ethical consequences of the use of anthropomorphic language and discourse within the AI research community, thus reinforcing the need to complement the practical analysis with a conceptual one.

    Download full text (pdf)
    fulltext
  • 238.
    Sanchez, Carlos
    et al.
    Florida State Univ, Tallahassee, FL 32306 USA.
    Gavin, Peter
    Florida State Univ, Tallahassee, FL 32306 USA.
    Moreau, Daniel
    Chalmers, Gothenburg, Sweden.
    Själander, Magnus
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. NTNU, Trondheim, Norway.
    Whalley, David
    Florida State Univ, Tallahassee, FL 32306 USA.
    Larsson-Edefors, Per
    Chalmers, Gothenburg, Sweden.
    McKee, Sally A.
    Chalmers, Gothenburg, Sweden.
    Redesigning a tagless access buffer to require minimal ISA changes2016In: Proc. 19th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2016, article id 19Conference paper (Refereed)
  • 239.
    Sandberg, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Efficient techniques for predicting cache sharing and throughput2012In: Proc. 21st International Conference on Parallel Architectures and Compilation Techniques, New York: ACM Press, 2012, p. 305-314Conference paper (Refereed)
    Abstract [en]

    This work addresses the modeling of shared cache contention in multicore systems and its impact on throughput and bandwidth. We develop two simple and fast cache sharing models for accurately predicting shared cache allocations for random and LRU caches.

    To accomplish this we use low-overhead input data that captures the behavior of applications running on real hardware as a function of their shared cache allocation. This data enables us to determine how much and how aggressively data is reused by an application depending on how much shared cache it receives. From this we can model how applications compete for cache space, their aggregate performance (throughput), and bandwidth.

    We evaluate our models for two- and four-application workloads in simulation and on modern hardware. On a four-core machine, we demonstrate an average relative fetch ratio error of 6.7% for groups of four applications. We are able to predict workload bandwidth with an average relative error of less than 5.2% and throughput with an average error of less than 1.8%. The model can predict cache size with an average error of 1.3% compared to simulation.

    Download full text (pdf)
    pact2012_sharing.pdf
  • 240.
    Sandberg, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Modeling performance variation due to cache sharing2013In: Proc. 19th IEEE International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2013, p. 155-166Conference paper (Refereed)
    Abstract [en]

    Shared cache contention can cause significant variability in the performance of co-running applications from run to run. This variability arises from different overlappings of the applications' phases, which can be the result of offsets in application start times or other delays in the system. Understanding this variability is important for generating an accurate view of the expected impact of cache contention. However, variability effects are typically ignored due to the high overhead of modeling or simulating the many executions needed to expose them.

    This paper introduces a method for efficiently investigating the performance variability due to cache contention. Our method relies on input data captured from native execution of applications running in isolation and a fast, phase-aware, cache sharing performance model. This allows us to assess the performance interactions and bandwidth demands of co-running applications by quickly evaluating hundreds of overlappings.

    We evaluate our method on a contemporary multicore machine and show that performance and bandwidth demands can vary significantly across runs of the same set of co-running applications. We show that our method can predict application slowdown with an average relative error of 0.41% (maximum 1.8%) as well as bandwidth consumption. Using our method, we can estimate an application pair's performance variation 213x faster, on average, than native execution.

    Download full text (pdf)
    fulltext
  • 241.
    Seipel, Stefan
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Visual Information and Interaction. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computerized Image Analysis and Human-Computer Interaction.
    Lingfors, David
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Physics.
    Widén, Joakim
    Uppsala University, Disciplinary Domain of Science and Technology, Technology, Department of Engineering Sciences, Solid State Physics.
    Dual-domain visual exploration of urban solar potential2013In: Proc. Eurographics Workshop on Urban Data Modelling and Visualisation, 2013Conference paper (Other academic)
  • 242.
    Sembrant, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hiding and Reducing Memory Latency: Energy-Efficient Pipeline and Memory System Techniques2016Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Memory accesses in modern processors are both far slower and vastly more energy-expensive than the actual computations. To improve performance, processors spend a significant amount of energy and resources trying to hide and reduce the memory latency. To hide the latency, processors use out-of-order execution to overlap memory accesses with independent work and aggressive speculative instruction scheduling to execute dependent instructions back-to-back. To reduce the latency, processors use several levels of caching that keep frequently used data closer to the processor. However, these optimizations are not for free. Out-of-order execution requires expensive processor resources, speculative scheduling must re-execute instructions on incorrect speculations, and multi-level caching requires extra energy and latency to search the cache hierarchy. This thesis investigates several energy-efficient techniques for: 1) hiding the latency in the processor pipeline, and 2) reducing the latency in the memory hierarchy.

    Much of the inefficiency of hiding latency in the processor comes from two sources. First, processors need several large and expensive structures to do out-of-order execution (instruction queue, register file, etc.). These resources are typically allocated in program order, effectively giving all instructions equal priority. To reduce the size of these expensive resources without hurting performance, we propose Long Term Parking (LTP). LTP parks non-critical instructions before they allocate resources, thereby making room for critical memory-accessing instructions to continue and expose more memory-level parallelism. This enables us to save energy by shrinking the resource sizes without hurting performance. Second, when a load's data returns, the load's dependent instructions need to be scheduled and executed. To execute the dependent instructions back-to-back, the processor speculatively schedules instructions before it knows whether the input data will be available at execution time. To save energy, we investigate different scheduling techniques that reduce the number of re-executions due to misspeculation.

    The inefficiencies of traditional memory hierarchies come from the need to do level-by-level searches to locate data. The search starts at the L1 cache, then proceeds level by level until the data is found, or determined not to be in any cache, at which point the processor has to fetch the data from main memory. This wastes time and energy for every level that is searched. To reduce the latency, we propose tracking the location of the data directly in a separate metadata hierarchy. This allows us to directly access the data without needing to search. The processor simply queries the metadata hierarchy for the location information about where the data is stored. Separating metadata into its own hierarchy brings a wide range of additional benefits, including flexibility in how we place data storages in the hierarchy, the ability to intelligently store data in the hierarchy, direct access to remote cores, and many other data-oriented optimizations that can leverage our precise knowledge of where data are located.

    List of papers
    1. Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors
    2015 (English)In: Proc. 48th International Symposium on Microarchitecture, 2015Conference paper, Published paper (Refereed)
    Abstract [en]

    Modern processors employ large structures (IQ, LSQ, register file, etc.) to expose instruction-level parallelism (ILP) and memory-level parallelism (MLP). These resources are typically allocated to instructions in program order. This wastes resources by allocating resources to instructions that are not yet ready to be executed and by eagerly allocating resources to instructions that are not part of the application’s critical path.

    This work explores the possibility of allocating pipeline resources only when needed to expose MLP, and thereby enabling a processor design with significantly smaller structures, without sacrificing performance. First we identify the classes of instructions that should not reserve resources in program order and evaluate the potential performance gains we could achieve by delaying their allocations. We then use this information to “park” such instructions in a simpler, and therefore more efficient, Long Term Parking (LTP) structure. The LTP stores instructions until they are ready to execute, without allocating pipeline resources, and thereby keeps the pipeline available for instructions that can generate further MLP.

    LTP can accurately and rapidly identify which instructions to park, park them before they execute, wake them when needed to preserve performance, and do so using a simple queue instead of a complex IQ. We show that even a very simple queue-based LTP design allows us to significantly reduce IQ (64 →32) and register file (128→96) sizes while retaining MLP performance and improving energy efficiency.

    National Category
    Computer Engineering
    Identifiers
    urn:nbn:se:uu:diva-272468 (URN)
    Conference
    MICRO 2015, December 5–9, Waikiki, HI
    Projects
    UPMARC, UART
    Available from: 2016-01-14 Created: 2016-01-14 Last updated: 2018-01-10
    2. Cost-effective speculative scheduling in high performance processors
    2015 (English)In: Proc. 42nd International Symposium on Computer Architecture, New York: ACM Press, 2015, p. 247-259Conference paper, Published paper (Refereed)
    Abstract [en]

    To maximize performance, out-of-order execution processors sometimes issue instructions without having the guarantee that operands will be available in time; e.g. loads are typically assumed to hit in the L1 cache and dependent instructions are issued accordingly. This form of speculation - that we refer to as speculative scheduling - has been used for two decades in real processors, but has received little attention from the research community. In particular, as pipeline depth grows, and the distance between the Issue and the Execute stages increases, it becomes critical to issue instructions dependent on variable-latency instructions as soon as possible rather than wait for the actual cycle at which the result becomes available. Unfortunately, due to the uncertain nature of speculative scheduling, the scheduler may wrongly issue an instruction that will not have its source(s) available on the bypass network when it reaches the Execute stage. In that event, the instruction is canceled and replayed, potentially impairing performance and increasing energy consumption. In this work, we do not present a new replay mechanism. Rather, we focus on ways to reduce the number of replays that are agnostic of the replay scheme. First, we propose an easily implementable, low-cost solution to reduce the number of replays caused by L1 bank conflicts. Schedule shifting always assumes that, given a dual-load issue capacity, the second load issued in a given cycle will be delayed because of a bank conflict. Its dependents are thus always issued with the corresponding delay. Second, we also improve on existing L1 hit/miss prediction schemes by taking into account instruction criticality. That is, for some criterion of criticality and for loads whose hit/miss behavior is hard to predict, we show that it is more cost-effective to stall dependents if the load is not predicted critical.

    Place, publisher, year, edition, pages
    New York: ACM Press, 2015
    National Category
    Computer Systems
    Identifiers
    urn:nbn:se:uu:diva-272467 (URN)10.1145/2749469.2749470 (DOI)000380455700020 ()9781450334020 (ISBN)
    Conference
    ISCA 2015, June 13–17, Portland, OR
    Projects
    UPMARC, UART
    Available from: 2015-06-13 Created: 2016-01-14 Last updated: 2016-12-05Bibliographically approved
    3. TLC: A tag-less cache for reducing dynamic first level cache energy
    2013 (English)In: Proceedings of the 46th International Symposium on Microarchitecture, New York: ACM Press, 2013, p. 49-61Conference paper, Published paper (Refereed)
    Abstract [en]

    First level caches are performance-critical and are therefore optimized for speed. To do so, modern processors reduce the miss ratio by using set-associative caches and optimize latency by reading all ways in parallel with the TLB and tag lookup. However, this wastes energy since only data from one way is actually used.

    To reduce energy, phased-caches and way-prediction techniques have been proposed wherein only data of the matching/predicted way is read. These optimizations increase latency and complexity, making them less attractive for first level caches.

    Instead of adding new functionality on top of a traditional cache, we propose a new cache design that adds way index information to the TLB. This allows us to: 1) eliminate extra data array reads (by reading the right way directly), 2) avoid tag comparisons (by eliminating the tag array), 3) filter out misses (by checking the TLB), and 4) amortize the TLB lookup energy (by integrating it with the way information). In addition, the new cache can directly replace existing caches without any modification to the processor core or software.

    This new Tag-Less Cache (TLC) reduces the dynamic energy for a 32 kB, 8-way cache by 60% compared to a VIPT cache without affecting performance.
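    A toy model of the lookup path may help picture this: if each TLB entry carries a per-line way index for its page, a hit reads exactly one data-array way and a miss is known from the TLB alone, with no tag array involved. The sizes and the dictionary-based TLB below are assumptions made for illustration, not the paper's implementation.

```python
# Toy model of the tag-less lookup idea; not the hardware design itself.
PAGE_SIZE, LINE_SIZE, WAYS = 4096, 64, 8          # assumed sizes
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE

class TLBEntry:
    def __init__(self, phys_page):
        self.phys_page = phys_page
        self.way = [None] * LINES_PER_PAGE        # way index per line; None = not cached

def l1_read(tlb, vaddr, data_array):
    """Hit: read exactly one way, chosen by the TLB. Miss: known from the TLB
    alone, without touching any tag array (there is none in this design)."""
    page, offset = vaddr // PAGE_SIZE, vaddr % PAGE_SIZE
    line = offset // LINE_SIZE
    entry = tlb.get(page)
    if entry is None or entry.way[line] is None:
        return None                               # miss
    return data_array[line][entry.way[line]]      # simplistic set indexing for the toy

data_array = [[f"set{s}-way{w}" for w in range(WAYS)] for s in range(LINES_PER_PAGE)]
tlb = {0x12345: TLBEntry(phys_page=0x678)}
tlb[0x12345].way[3] = 5                           # line 3 of that page lives in way 5
print(l1_read(tlb, 0x12345 * PAGE_SIZE + 3 * LINE_SIZE + 8, data_array))   # hit
print(l1_read(tlb, 0x12345 * PAGE_SIZE + 7 * LINE_SIZE, data_array))       # miss -> None
```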

    Place, publisher, year, edition, pages
    New York: ACM Press, 2013
    National Category
    Computer Engineering Computer Systems
    Identifiers
    urn:nbn:se:uu:diva-213236 (URN)10.1145/2540708.2540714 (DOI)978-1-4503-2638-4 (ISBN)
    Conference
    MICRO-46; December 7-11, 2013; Davis, CA, USA
    Projects
    UPMARC, CoDeR-MP
    Available from: 2013-12-07 Created: 2013-12-19 Last updated: 2018-01-11Bibliographically approved
    4. The Direct-to-Data (D2D) Cache: Navigating the cache hierarchy with a single lookup
    2014 (English)In: Proc. 41st International Symposium on Computer Architecture, Piscataway, NJ: IEEE Press, 2014, p. 133-144Conference paper, Published paper (Refereed)
    Abstract [en]

    Modern processors optimize for cache energy and performance by employing multiple levels of caching that address bandwidth, low-latency and high-capacity. A request typically traverses the cache hierarchy, level by level, until the data is found, thereby wasting time and energy in each level. In this paper, we present the Direct-to-Data (D2D) cache that locates data across the entire cache hierarchy with a single lookup.

    To navigate the cache hierarchy, D2D extends the TLB with per cache-line location information that indicates in which cache and way the cache line is located. This allows the D2D cache to: 1) skip levels in the hierarchy (by accessing the right cache level directly), 2) eliminate extra data array reads (by reading the right way directly), 3) avoid tag comparisons (by eliminating the tag arrays), and 4) go directly to DRAM on cache misses (by checking the TLB). This reduces the L2 latency by 40% and saves 5-17% of the total cache hierarchy energy.

    D2D's lower L2 latency directly improves L2-sensitive applications' performance by 5-14%. More significantly, we can take advantage of the L2 latency reduction to optimize other parts of the microarchitecture. For example, we can reduce the ROB size for the L2-bound applications by 25%, or we can reduce the L1 cache size, delivering an overall 21% energy savings across all benchmarks, without hurting performance.
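    The sketch below illustrates the single-lookup idea in hypothetical Python: the TLB stores, per cache line, the location of that line (cache level plus way), so one TLB lookup decides whether to read L1, read L2 directly while skipping L1, or go straight to DRAM. The encodings, field names, and driver code are illustrative assumptions, not the paper's design.

```python
# Sketch of a D2D-style access: the TLB tells us where each line lives.

from enum import Enum

class Level(Enum):
    L1 = 1
    L2 = 2
    MEM = 3          # not cached anywhere: go directly to DRAM

LINE_SIZE, PAGE_SIZE = 64, 4096
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE

class TLBEntry:
    def __init__(self, ppn):
        self.ppn = ppn
        # Per-line location information: (level, way). Default: only in memory.
        self.loc = [(Level.MEM, None)] * LINES_PER_PAGE

def d2d_access(tlb, caches, dram, vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = tlb[vpn]                                  # assume the TLB hits, for brevity
    level, way = entry.loc[offset // LINE_SIZE]
    if level is Level.MEM:
        return dram[vaddr // LINE_SIZE]               # miss known up front: no L1/L2 probing
    set_index = (vaddr // LINE_SIZE) % caches[level]["sets"]
    return caches[level]["data"][(set_index, way)]    # one read, right level, right way

# Tiny usage example.
tlb = {0: TLBEntry(ppn=7)}
tlb[0].loc[0] = (Level.L2, 5)                         # line 0 lives in L2, way 5
caches = {
    Level.L1: {"sets": 64,  "data": {}},
    Level.L2: {"sets": 512, "data": {(0, 5): b"L2 line"}},
}
dram = {2: b"memory line"}
print(d2d_access(tlb, caches, dram, 0x00))            # served from L2 with a single lookup
print(d2d_access(tlb, caches, dram, 0x80))            # goes directly to DRAM
```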

    Place, publisher, year, edition, pages
    Piscataway, NJ: IEEE Press, 2014
    National Category
    Computer Engineering, Computer Sciences
    Identifiers
    urn:nbn:se:uu:diva-235362 (URN), 10.1145/2678373.2665694 (DOI), 000343652800012 (), 978-1-4799-4394-4 (ISBN)
    Conference
    ISCA 2014, June 14–18, Minneapolis, MN
    Projects
    UPMARC, CoDeR-MP
    Available from: 2014-06-14 Created: 2014-10-31 Last updated: 2018-01-11. Bibliographically approved
    5. A split cache hierarchy for enabling data-oriented optimizations
    2017 (English)In: Proc. 23rd International Symposium on High Performance Computer Architecture, IEEE Computer Society, 2017, p. 133-144Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    IEEE Computer Society, 2017
    National Category
    Computer Engineering
    Identifiers
    urn:nbn:se:uu:diva-306368 (URN), 10.1109/HPCA.2017.25 (DOI), 000403330300012 (), 978-1-5090-4985-1 (ISBN)
    Conference
    HPCA 2017, February 4–8, Austin, TX
    Projects
    UPMARC
    Available from: 2017-05-08 Created: 2016-10-27 Last updated: 2019-03-08
  • 243.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Phase Behavior in Serial and Parallel Applications2012In: International Symposium on Workload Characterization (IISWC'12), IEEE Computer Society, 2012Conference paper (Refereed)
  • 244.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Phase Guided Profiling for Fast Cache Modeling2012In: International Symposium on Code Generation and Optimization (CGO'12), ACM Press, 2012, p. 175-185Conference paper (Refereed)
    Abstract [en]

    Statistical cache models are powerful tools for understanding application behavior as a function of cache allocation. However, previous techniques have modeled only the average application behavior, which hides the effect of program variations over time. Without detailed time-based information, transient behavior, such as exceeding bandwidth or cache capacity, may be missed. Yet these events, while short, often play a disproportionate role and are critical to understanding program behavior.

    In this work we extend earlier techniques to incorporate program phase information when collecting runtime profiling data. This allows us to model an application's cache miss ratio as a function of its cache allocation over time. To reduce overhead and improve accuracy we use online phase detection and phase-guided profiling. The phase-guided profiling reduces overhead by more intelligently selecting portions of the application to sample, while accuracy is improved by combining samples from different instances of the same phase.

    The result is a new technique that accurately models the time-varying behavior of an application's miss ratio as a function of its cache allocation on modern hardware. By leveraging phase-guided profiling, this work both improves on the accuracy of previous techniques and reduces the overhead.
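    To make the pooling idea concrete, here is a deliberately simplistic, hypothetical Python sketch: a phase id is attached to each execution window, only a few windows per phase are actually profiled, samples from different instances of the same phase are combined, and the time-varying miss ratio is reconstructed by giving each window its phase's pooled estimate. The phase ids, sampling rule, and miss-ratio model are stand-ins for illustration, not the paper's technique.

```python
# Sketch of phase-guided profiling for time-varying miss-ratio modeling.

from collections import defaultdict

SAMPLES_PER_PHASE = 2   # profile at most this many windows per phase (reduces overhead)

def profile(windows):
    """windows: list of (phase_id, measured_miss_ratio_if_sampled)."""
    sampled = defaultdict(list)
    for phase, miss_ratio in windows:
        if len(sampled[phase]) < SAMPLES_PER_PHASE:      # phase-guided sample selection
            sampled[phase].append(miss_ratio)            # costly profiling happens here
    # Combine samples from different instances of the same phase (improves accuracy).
    return {p: sum(v) / len(v) for p, v in sampled.items()}

def reconstruct(windows, per_phase_model):
    """Time-varying miss ratio: each window inherits its phase's pooled estimate."""
    return [per_phase_model[phase] for phase, _ in windows]

trace = [("A", 0.02), ("A", 0.03), ("B", 0.20), ("A", 0.02), ("B", 0.22), ("B", 0.19)]
model = profile(trace)
print(model)                         # e.g. {'A': 0.025, 'B': ~0.21}
print(reconstruct(trace, model))     # per-window estimates over time
```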

  • 245.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Carlson, Trevor E.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    POSTER: Putting the G back into GPU/CPU Systems Research2017In: 2017 26TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT), 2017, p. 130-131Conference paper (Refereed)
    Abstract [en]

    Modern SoCs contain several CPU cores and many GPU cores to execute both general-purpose and highly parallel graphics workloads. In many SoCs, more area is dedicated to graphics than to general-purpose compute. Despite this, the micro-architecture research community primarily focuses on GPGPU and CPU-only research, and not on graphics (the primary workload for many SoCs). The main reason for this is the lack of efficient tools and simulators for modern graphics applications. This work focuses on the GPU's memory traffic generated by graphics. We describe a new graphics tracing framework and use it both to study graphics applications' memory behavior and to examine how CPUs and GPUs affect system performance. Our results show that graphics applications exhibit a wide range of memory behavior between applications and across time, and that they slow down co-running SPEC applications by 59% on average.

  • 246.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Eklöv, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Efficient software-based online phase classification2011In: International Symposium on Workload Characterization (IISWC'11), IEEE Computer Society, 2011, p. 104-115Conference paper (Refereed)
    Abstract [en]

    Many programs exhibit execution phases with time-varying behavior. Phase detection has been used extensively to find short, representative simulation points that quickly yield representative simulation results for long-running applications. Several hardware-assisted phase detection schemes have also been proposed to guide various forms of optimizations and hardware configurations. This paper explores the feasibility of low-overhead phase detection at runtime based entirely on existing features found in modern processors. If successful, such a technology would be useful for cache management, frequency adjustments, runtime scheduling and profiling techniques. The paper evaluates several existing and new alternatives for efficient runtime data collection and online phase detection. ScarPhase (Sample-based Classification and Analysis for Runtime Phases), a new online phase detection library, is presented. It makes extensive use of the new hardware counter features, introduces a new phase classification heuristic and suggests a way to dynamically adjust the sample rate. ScarPhase exhibits runtime overhead below 2%.
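    For intuition only, here is a hypothetical Python sketch of online phase classification in this spirit: each execution window is summarized by a sampled code-signature vector (e.g. built from hardware-counter samples of executed addresses) and windows are clustered online with a simple leader-follower heuristic. The similarity metric, threshold, and signatures are illustrative assumptions, not ScarPhase's actual heuristic.

```python
# Sketch of online, sample-based phase classification.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

class OnlinePhaseClassifier:
    def __init__(self, threshold=0.4):
        self.threshold = threshold
        self.leaders = []            # one representative signature per known phase

    def classify(self, signature):
        total = sum(signature) or 1
        sig = [x / total for x in signature]          # normalize the sampled vector
        for phase_id, leader in enumerate(self.leaders):
            if manhattan(sig, leader) < self.threshold:
                return phase_id                       # matches an existing phase
        self.leaders.append(sig)                      # new phase discovered online
        return len(self.leaders) - 1

clf = OnlinePhaseClassifier()
windows = [[90, 5, 5], [88, 7, 5], [10, 80, 10], [12, 78, 10], [91, 4, 5]]
print([clf.classify(w) for w in windows])             # e.g. [0, 0, 1, 1, 0]
```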

  • 247.
    Sembrant, Andreas
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Hagersten, Erik
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Black-Schaffer, David
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    TLC: A tag-less cache for reducing dynamic first level cache energy2013In: Proceedings of the 46th International Symposium on Microarchitecture, New York: ACM Press, 2013, p. 49-61Conference paper (Refereed)
    Abstract [en]

    First level caches are performance-critical and are therefore optimized for speed. To do so, modern processors reduce the miss ratio by using set-associative caches and optimize latency by reading all ways in parallel with the TLB and tag lookup. However, this wastes energy since only data from one way is actually used.

    To reduce energy, phased-caches and way-prediction techniques have been proposed wherein only data of the matching/predicted way is read. These optimizations increase latency and complexity, making them less attractive for first level caches.

    Instead of adding new functionality on top of a traditional cache, we propose a new cache design that adds way index information to the TLB. This allows us to: 1) eliminate extra data array reads (by reading the right way directly), 2) avoid tag comparisons (by eliminating the tag array), 3) filter out misses (by checking the TLB), and 4) amortize the TLB lookup energy (by integrating it with the way information). In addition, the new cache can directly replace existing caches without any modification to the processor core or software.

    This new Tag-Less Cache (TLC) reduces the dynamic energy for a 32 kB, 8-way cache by 60% compared to a VIPT cache without affecting performance.

  • 248.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Borgström, Gustaf
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Improving Error-Resilience of Emerging Multi-Value TechnologiesManuscript (preprint) (Other academic)
    Abstract [en]

    There exist extensive ongoing research efforts on emerging technologies that have the potential to become an alternative to today's CMOS technologies. A common feature among the investigated technologies is that of multi-value devices and the possibility of implementing quaternary logic and memory. However, multi-value devices tend to be more sensitive to interferences and, thus, have reduced error resilience. We present an architecture based on multi-value devices where we can trade energy efficiency against error resilience. Important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We show for eight benchmarks an energy reduction of 32% and 36% for the register file and level-one data cache, respectively, and for the two integer benchmarks, an energy reduction for arithmetic operations of 13% and 23%. We also show that for a quaternary technology to be viable it needs to have a raw bit error rate of one error in 100 million or better.
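    The binary-versus-quaternary trade-off can be illustrated with a small, hypothetical Python sketch: the same multi-value cells can each hold one bit (the robust encoding used for important data) or one quaternary digit, i.e. two bits (the denser encoding used for error-tolerant data). The cell counts and the naive digit encoding are illustrative assumptions, not the paper's architecture.

```python
# Sketch of encoding the same value with binary cells vs. quaternary cells.

def encode(value, bits, base):
    """Split `value` into base-`base` digits, least significant first (base must be 2 or 4)."""
    digits_needed = -(-bits // (base.bit_length() - 1))   # bits per cell = log2(base)
    digits = []
    for _ in range(digits_needed):
        digits.append(value % base)
        value //= base
    return digits

def decode(digits, base):
    return sum(d * base**i for i, d in enumerate(digits))

value = 0b1011_0110
robust = encode(value, bits=8, base=2)    # important data: 8 binary cells
dense  = encode(value, bits=8, base=4)    # error-tolerant data: only 4 quaternary cells
print(robust, decode(robust, 2))          # 8 cells, round-trips to 182
print(dense,  decode(dense, 4))           # 4 cells, round-trips to 182
```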

  • 249.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Borgström, Gustaf
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Klymenko, Mykhailo V.
    Remacle, Françoise
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication.
    Techniques for modulating error resilience in emerging multi-value technologies2016In: Proc. 13th International Conference on Computing Frontiers, New York: ACM Press, 2016, p. 55-63Conference paper (Refereed)
  • 250.
    Själander, Magnus
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Martonosi, Margaret
    Kaxiras, Stefanos
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems.
    Power-Efficient Computer Architectures: Recent Advances2014Book (Refereed)