### Berkeley Winter School

### Advanced Algorithmic Techniques for GPUs

# Lecture 5: Advanced Data **Optimizations**

W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2010

### **Objective**

- Apply tiling, thread coarsening, and data layout transformations to one kernel
- Understand the practical use of these techniques







Accessed by T1

Accessed by T2

Accessed by T3





 $\frac{1}{2}$ 

Accessed by T4

• Only four elements of A and four elements of B is needed to calculate one step for a 16-element tile of C







- The C tile does not need to be square
- This is a 4X2 tile
	- 4 elements of A and 2 elements of B are needed for each step







• Step 2…







- At each step
	- For 4X2 only 6 elements need to be loaded for all 8 threads to make progress
	- For 4X4, 8 elements for all 16 threads







```
But, how about the kernel we saw.
\overline{C} D. D. Halpertuid H. \overline{C}16. Pd[Row*Width+Col] = Pvalue;
  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    1. __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
2. ___shared____float Nds[TILE_WIDTH][TILE_WIDTH];
3. int bx = blockIdx.x; int by = blockIdx.y;
4. int tx = \text{threadIdx}.xi int ty = \text{threadIdx}.yiIdentify the row and column of the Pd element to work on
5. int Row = by \text{Each thread loads}6. int Col = bx 1 element of A and
7. float Pvalue
  Loop over the \frac{1}{2} CICINCIIL UI D \frac{1}{2} to compute the Pd element
8. for (int m = 0 \sqrt{ width/TILE WIDT
// Coolaborative \sqrt{\mathcal{A}}ing of Md and Nd ti
9. Mds[tx][\sqrt{\frac{1}{2}}] = Md[Row*Width + m*TI]
10. Nds[tx][ty] = Nd[(m*TILE_WIDTH +
11. syncthreads();
12. for (int k = 0; k < TILE_WIDTH; ++k13. Pvalue += Mds[tx][k] * Nds[k][\pm \sqrt{7};
14. __synchthreads();
15.}
                                          Each thread calculates 
                                          TILE_WIDTH steps of a 
                                          C element
                 1 element of B
```
}

### In the kernel of the previous slide

- T^2 elements of A and T^2 element of B are loaded to calculate T steps for T^2 elements of C
- According to our analysis, we can use much smaller amount of shared memory by
	- Loading T element of A and T element of B to claculate 1 step for T^2 elements of C
	- Or loading TA elements of A and TB elements of B to calculate 1 step for TA\*TB elements of C (rectangular matrix)
	- So, why didn't we do so?

## Synchronization Overhead

- We need to call synchthreads() in the inner loop of each thread. In each iteration
	- only a subset of threads load A and B elements (divergence)
	- Call \_\_synchthreads()
	- All threads calculate one step of the inner product
	- Call \_\_synchtrheads()
	- Go to the next iteration
- Even though \_synchtrheads() is a very efficient function, such intensive use is still going to hurt

### A somewhat different approach Optimization 1: thread coarsening

- Have each thread to calculate a horizontal subset of C elements
- Data loaded in A can be reused through registers



### Optimization 2: Shared memory tiling

• Multiple threads collaborate to load TB B elements into shared memory



©Wen-mei W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2010

B C T1-T4 cooperatively load 4 values from B, b1~b4 into shared memory so T1-T4 can all use them Intermediate results computed by T1; stored in registers  $\overline{C}$  $\overline{C}$  $\overline{C}$ 

### In one iteration, each thread

- loads one A element into register, accesses TB B elements from shared memory
	- Calculates one step for 1\*TB C elements
	- $-$  TB  $\sim$ 16 in practice





B

T1-T4 cooperatively load 4 values from B, b1~b4 into shared memory so T1-T4 can all use them

Intermediate results computed by T1; stored in registers

### In one iteration, each block

- Loads TA A elements into registers, loads TB B elements into shared memory
	- TA is number of threads in thread block (64 or more in practice)
	- TB is number of threads folded into one thread in thread coarsening (16 or more in practice)
- However, loading of B will involve only a subset of threads (divergence)

©Wen-mei W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2010

### A more balanced approach, in each iteration

- All threads in a block collaborate to load a TBxK tile of B into shared memory
	- $-$  K is set so that TA = TB\*K
	- Every thread loads one B element, no divergence
- Each thread loads K A elements into registers
- Each thread calculates K steps for TB C elements

©Wen-mei W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2010

## **Summary**

- Each block has TA threads
- Each thread coarsened by TB times
- Each thread loads
	- One B element
	- K =TA/TB A elements
	- To calculate K steps of TB C elements









## For a toy example

- Each block has 8 threads
- Each thread coarsened by 4 times
- Each thread loads
	- One B element
	- $-$  8/4=2 A elements
	- To calculate 2 steps of 4 C elements



#### TB

C

C

 $\overline{C}$ 

 $\overline{C}$ 

C

 $\overline{C}$ 

 $\overline{C}$ 

 $\overline{C}$ 

B

## For GTX280 (Volkov & Demmel )

- Each block has 64 threads
- Each thread coarsened by 16 times
- Each thread loads
	- One B element
	- $-64/16=4$  A elements
	- To calculate 4 steps of 16 C elements



#### TB

C

C

 $\overline{C}$ 

 $\overline{C}$ 

C

 $\overline{C}$ 

 $\overline{C}$ 

 $\overline{C}$ 

B

## A Comparative Analysis

- Tiled MM introduced earlier:
	- Each thread block computes 32x32=1024 results
	- Use 9 KB on-chip memory (register + shared memory)
- Register tiled version of sgemm:
	- Each thread block computes 64x16=1024 results
	- Use only 4 ¼ KB on-chip memory
		- Similar degree of reuse;  $\sim$  2X more efficient than tiled MM



Berkeley, January 24-25, 2010

# Data Layout – For C (row major)

B

 $\overline{C}$ 

TB

C

C

 $\overline{C}$ 

C

 $\overline{C}$ 

 $\overline{C}$ 

 $\overline{C}$ 

- Loading B into shared memory is easily coalesced with the 16X4 tile
- Loading A into registers in not coalesced

– Transpose A for coalescing



©Wen-mei W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2010

## Data Layout for FORTRAN

- Column major layout
- A accesses are coalesced
- B needs to be transposed
- C may need to be transposed







© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois

## A very small (8x2 bit) DRAM Bank



### DRAM core arrays are slow.

- Reading from a cell in the core array is a very slow process
	- DDR: Core speed  $=$  1/2 interface speed
	- $–$  DDR2/GDDR3: Core speed  $=$  ¼ interface speed
	- $–$  DDR3/GDDR4: Core speed  $=$   $\frac{1}{8}$  interface speed
	- … likely to be worse in the future



### DRAM Bursting.

- For DDR{2,3} SDRAM cores clocked at 1/N speed of the interface:
	- $-$  Load (N  $\times$  interface width) of DRAM bits from the same row at once to an internal buffer, then transfer in N steps at interface speed
	- DDR2/GDDR3: buffer width  $= 4X$  interface width



### DRAM Bursting



### DRAM Bursting







©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana,

## DRAM Bursting for the 8x2 Bank



## First-order Look at the GPU off-chip memory subsystem

• nVidia GTX280 GPU:

 $-$  Peak global memory bandwidth  $= 141.7GB/s$ 

- Global memory (GDDR3) interface @ 1.1GHz
	- $-$  (Core speed @ 276Mhz)
	- For a typical 64-bit interface, we can sustain only about 17.6 GB/s (Recall DDR - 2 transfers per clock)
	- We need a lot more bandwith (141.7 GB/s) thus 8 memory channels

## Multiple Memory Channels

- Divide the memory address space into N parts
	- N is number of memory channels
	- Assign each portion to a channel



Illinois, August 2-5, 2010

### Memory Controller Organization of a Many-Core Processor

- GTX280: 30 Stream Multiprocessors (SM) connected to 8-channel DRAM controllers through interconnect
	- DRAM controllers are interleaved
	- Within DRAM controllers (channels), DRAM banks are interleaved for incoming memory requests
	- We approximate its DRAM channel/bank interleaving scheme through micro-benchmarking

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

## Back to the Big Picture

- Each global memory access is made to a memory location with an address
	- Some bits will determine the memory channel used
	- Some bits will determine the DRAM bank used
	- Some bits will determine the position within a burst



• When adjacent threads in a warp access words in a burst, the accesses are coalesced.

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

## **ANY MORE QUESTIONS?**

©Wen-mei W. Hwu and David Kirk/NVIDIA, Berkeley, January 24-25, 2010