Publications from Uppsala University
Wallin, Dan
Publications (7 of 7)
Johansson, H., Wallin, D. & Holmgren, S. (2006). Analyzing advanced PDE solvers through simulation. In: Applied Parallel Computing: State of the Art in Scientific Computing (pp. 893-900). Berlin: Springer-Verlag
2006 (English). In: Applied Parallel Computing: State of the Art in Scientific Computing, Berlin: Springer-Verlag, 2006, p. 893-900. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Berlin: Springer-Verlag, 2006
Series
Lecture Notes in Computer Science ; 3732
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-80673 (URN); 10.1007/11558958_108 (DOI); 000237003200108 ()
Available from: 2008-03-07. Created: 2008-03-07. Last updated: 2018-01-13. Bibliographically approved
Wallin, D., Löf, H., Hagersten, E. & Holmgren, S. (2006). Multigrid and Gauss-Seidel smoothers revisited: Parallelization on chip multiprocessors. In: Proc. 20th ACM International Conference on Supercomputing (pp. 145-155). New York: ACM Press
2006 (English). In: Proc. 20th ACM International Conference on Supercomputing, New York: ACM Press, 2006, p. 145-155. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2006
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-19810 (URN); 10.1145/1183401.1183423 (DOI); 1-59593-282-8 (ISBN)
Available from: 2008-02-08. Created: 2008-02-08. Last updated: 2018-01-12. Bibliographically approved
Wallin, D. & Hagersten, E. (2004). Bundling: Reducing the Overhead of Multiprocessor Prefetchers. In: 18th International Parallel and Distributed Processing Symposium (IPDPS 2004).
2004 (English). In: 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004. Conference paper, Published paper (Refereed)
Abstract [en]

Prefetching has proven to be a useful technique for reducing cache misses in multiprocessors, at the cost of increased coherence traffic. This is especially troublesome for snoop-based systems, where the available coherence bandwidth is often the scalability bottleneck. The bundling technique presented in this paper reduces the overhead caused by prefetching in two ways: piggybacking prefetches with normal requests, and requiring only one device to perform the snoop lookup for each prefetch transaction. This can reduce both the address bandwidth and the number of snoop lookups compared with a nonprefetching system. We describe bundling implementations for two important transaction types: reads and upgrades. While bundling could reduce the overhead of most existing prefetch schemes, the evaluation of bundling performed in this paper has been limited to two of them: sequential prefetching and Dahlgren's adaptive sequential prefetching. Both schemes have their snoop bandwidth halved for all commercial and scientific benchmarks in the study. The combined effect of bundling applied to these prefetch schemes lowers the cache miss rate, the address bandwidth and the snoop bandwidth, compared with a system with no prefetching, for all applications. Bundling will not reduce the data bandwidth introduced by a prefetch scheme. However, we argue that the data bandwidth is more easily scaled than the snoop bandwidth for snoop-based coherence systems.
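
The two savings named in the abstract can be illustrated with a toy counting model. This is only a sketch under stated assumptions, not the protocol from the paper: it assumes every demand miss issues a fixed number of sequential prefetches, that an unbundled prefetch is a separate broadcast snooped by every other device, and that a bundled prefetch rides on the demand request and is looked up by exactly one device. The names `Traffic`, `unbundled_traffic`, `bundled_traffic`, and all parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traffic:
    """Counts for a toy snoop-based system (hypothetical model)."""
    address_transactions: int
    snoop_lookups: int


def unbundled_traffic(demand_misses: int, prefetch_degree: int, devices: int) -> Traffic:
    """Assumption: each demand miss and each prefetch is its own broadcast."""
    transactions = demand_misses * (1 + prefetch_degree)
    # Every device other than the requester snoops every transaction.
    lookups = transactions * (devices - 1)
    return Traffic(transactions, lookups)


def bundled_traffic(demand_misses: int, prefetch_degree: int, devices: int) -> Traffic:
    """Assumption: prefetches piggyback on the demand request, and only
    one device performs the snoop lookup for each prefetched line."""
    transactions = demand_misses
    # The demand line is snooped by all other devices; each prefetched
    # line costs a single lookup.
    lookups = demand_misses * ((devices - 1) + prefetch_degree)
    return Traffic(transactions, lookups)


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements from the paper.
    print(unbundled_traffic(demand_misses=100, prefetch_degree=1, devices=4))
    print(bundled_traffic(demand_misses=100, prefetch_degree=1, devices=4))
```

In this model the data traffic is the same in both cases, since the prefetched lines are still transferred; only the address transactions and snoop lookups differ.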

Available as PDF (693 kB)

Identifiers
urn:nbn:se:uu:diva-72530 (URN)
Available from: 2005-05-25 Created: 2005-05-25
Wallin, D., Johansson, H. & Holmgren, S. (2004). Cache memory behavior of advanced PDE solvers. In: Parallel Computing: Software Technology, Algorithms, Architectures and Applications (pp. 475-482). Amsterdam, The Netherlands: Elsevier
2004 (English). In: Parallel Computing: Software Technology, Algorithms, Architectures and Applications, Amsterdam, The Netherlands: Elsevier, 2004, p. 475-482. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Amsterdam, The Netherlands: Elsevier, 2004
Series
Advances in Parallel Computing ; 13
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-67857 (URN); 0-444-51689-1 (ISBN)
Available from: 2006-05-17. Created: 2006-05-17. Last updated: 2018-01-10. Bibliographically approved
Wallin, D. (2003). Exploiting data locality in adaptive architectures. (Licentiate dissertation). Uppsala University
2003 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Processor speed improves much faster than memory access time, which makes memory accesses expensive. To address this problem, cache hierarchies are introduced to serve the processor with data. However, the effectiveness of caches depends on the amount of locality in the application's memory access pattern. The behavior of various programs differs greatly in terms of cache miss characteristics, access patterns and communication intensity. Therefore, a computer built for many different computational tasks potentially benefits from dynamically adapting to the varying needs of the applications.

This thesis shows that a cc-NUMA multiprocessor with data migration and replication optimizations efficiently exploits the temporal locality of algorithms. The performance of the self-optimizing system is similar to a system with a perfect initial thread and data placement.

Data locality optimizations do not come for free. Large cache line coherence protocols improve spatial locality but yield increases in false sharing misses for many applications. Prefetching techniques that reduce cache misses often lead to increased address and data traffic. Several techniques introduced in this thesis efficiently avoid these drawbacks. The bundling technique reduces the coherence traffic in multiprocessor prefetchers. This is especially important in snoop-based systems, where the coherence bandwidth is a scarce resource. Bundled prefetchers manage to reduce both the cache miss rate and the coherence traffic compared with non-prefetching protocols. The most efficient bundled prefetching protocol studied lowers the cache misses by 27 percent and the address snoops by 24 percent relative to a non-prefetching protocol, on average over all examined applications. Another proposed technique, capacity prefetching, avoids false sharing misses by distinguishing cache lines involved in communication from non-communicating cache lines at run-time.

Place, publisher, year, edition, pages
Uppsala University, 2003
Series
Information technology licentiate theses: Licentiate theses from the Department of Information Technology, ISSN 1404-5117 ; 2003-010
National Category
Computer Engineering
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-86160 (URN)
Available from: 2003-09-26. Created: 2006-12-27. Last updated: 2018-01-13. Bibliographically approved
Holmgren, S., Nordén, M., Rantakokko, J. & Wallin, D. (2002). Performance of PDE solvers on a self-optimizing NUMA architecture. Parallel Algorithms and Applications, 17, 285-299
2002 (English). In: Parallel Algorithms and Applications, ISSN 1063-7192, E-ISSN 1029-032X, Vol. 17, p. 285-299. Article in journal (Refereed), Published
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-66909 (URN); 10.1080/01495730208941445 (DOI)
Available from: 2006-05-22. Created: 2006-05-22. Last updated: 2018-01-10. Bibliographically approved
Holmgren, S. & Wallin, D. (2001). Performance of high-accuracy PDE solvers on a self-optimizing NUMA architecture. In: Euro-Par 2001: Parallel Processing (pp. 602-610). Berlin: Springer-Verlag
2001 (English). In: Euro-Par 2001: Parallel Processing, Berlin: Springer-Verlag, 2001, p. 602-610. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Berlin: Springer-Verlag, 2001
Series
Lecture Notes in Computer Science ; 2150
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:uu:diva-40590 (URN); 10.1007/3-540-44681-8_86 (DOI)
Available from: 2008-03-13. Created: 2008-03-13. Last updated: 2018-01-11. Bibliographically approved