Manycore and Accelerator-based High-performance Scientific Computing


(Wen-mei Hwu, University of Illinois Urbana-Champaign, IL)

January 24 – January 25, 2011

Seating for the tutorial is limited, so please register early.


The tutorial is designed for participants with basic CUDA programming knowledge. It will cover the following topics, with hands-on exercises on GPU clusters:

Computational thinking

In this section, we discuss the particular limitations and constraints of many-core hardware, and which computational patterns are desirable or undesirable from an architectural point of view. Among the main obstacles to performance, we discuss conflicts over critical resources leading to serialization, load imbalance, and memory bandwidth bottlenecks.

With these constraints in view, we aim to discover how computational thinking enables one to transform the structure of a problem: identifying the inherently serial parts, finding the parts that are amenable to parallel execution, and applying methods and compromises for transforming the former into the latter. We observe that only a modest number of fundamental algorithmic strategies appear in successful GPU programs.

Parallelism transformations for performance

Often domain problems have inherent parallelism that needs to be recognized.  The most efficient implementation that exploits the problem's parallelism may be non-intuitive.  For example, two alternative thread arrangements that appear in electrostatics calculations have, respectively, scatter and gather memory access behavior.  The first is more intuitive, but the second is much more efficient on the GPU architecture.

Avoidance of resource oversubscription

The GPU architecture is characterized by memory access bandwidth that, although fast, is often limiting in comparison to compute throughput.  Thus, achieving performance critically depends on finding ways to reduce and regularize global memory accesses.  Three important algorithmic strategies for conserving bandwidth are "register/memory tiling", "layout transformation", and "thread coarsening".  These come at the cost of increased on-chip memory usage, which is also a limited resource.  We will discuss a variety of examples from PDE solvers, linear algebra, and convolution.

Dealing with data efficiently

Regular computation and regular data access are the ideal combination for GPUs.  Domain problems, however, present data to the programmer in many forms: often it is sparse, sometimes non-uniform, and sometimes dynamic.  Specific techniques such as data binning, compaction, and queuing are effective on GPU architectures. We will present examples from medical imaging and graph problems to illustrate these techniques.


Room 254 Sutardja Dai Hall

Monday January 24, 2011

Hands-on Manual

9:00 – 10:15 Lecture 1 – Introduction and computational thinking for manycore computing
10:15 – 10:45 Break
10:45 – 12:00 Lecture 2 – Parallelism Scalability Transformations
12:00 – 1:30 Lunch break and discussions
1:30 – 2:45 Lecture 3 – Blocking/Tiling for Locality
2:45 – 3:00 Break
3:00 – 4:15 Lecture 4 – Thread Coarsening and More on Tiling/Blocking
4:15 – 4:30 Break
4:30 – 5:30 Guest Lecture – James Demmel, Communication-avoiding Algorithms, GPU Implementation
5:30 – 6:30 Hands-on Lab 1 – Introduction and 7-point Stencil Code Optimization
Open Lab in the evening for students to finish their assignments
Tuesday January 25, 2011
9:00 – 10:15 Lecture 5 – Advanced Data Optimizations
10:15 – 10:45 Break
10:45 – 12:00 Lecture 6 – Input Binning
12:00 – 1:30 Lunch break and discussions
1:30 – 2:45 Lecture 7 – Input Compaction
2:45 – 3:00 Break
3:00 – 4:15 Lecture 8 – Privatization and Further Studies
4:15 – 4:30 Break
4:30 – 5:30 Guest Lecture – David Kirk, Fermi and Important Architecture Trends
5:30 – 6:30 Hands-on Lab 2 – Continued Optimization of 7-point Stencil Code
Open Lab in the evening for students to finish their assignments