Introduction
Parallel processing/computing is a topic with a long history. Even in the CPU era, Intel worked hard to squeeze performance out of single-core CPUs by driving operating frequencies higher, while AMD, behind in process technology, took a different path and used multiple cores to boost CPU performance.
Still, the multi-core CPUs of that time had only 2/4/8 cores, because total performance did not scale proportionally with core count.
The GPU was the first device to exploit massive parallelism so thoroughly that computation power kept multiplying, even after transistor shrinking could no longer keep up with Moore's law (i.e. doubling the transistor count every 18 months). With core counts routinely in the hundreds or thousands, the GPU is a savior of the post-Moore's-law era.
There is no free lunch: the demand for highly parallel computation also limits the range of GPU applications. Fortunately, several very popular domains happen to fit, including computer graphics and games, computer vision, blockchain proof of work, and machine learning/deep learning AI.
This article focuses on GPUs for computer graphics.
GPU for 3D “G”raphics and “G”ames
Key points:
- A 3D object is composed of countless triangle vertices, called vertexes. The vertex shader is programmable and performs the 3D operations translation, scale, and rotation, using matrix multiplication (or GEMM, General Matrix Multiplication) to control the 3D shape and viewing angle.
- Rasterization: converts the projected triangles into discrete pixel fragments on the 2D screen.
- Fragment shader (or pixel shader): works per pixel; it is programmed to produce the color and texture presented on the 2D display, ultimately deciding the color of every pixel.
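As a minimal illustration of the vertex-shader math above, the sketch below applies a rotation followed by a translation to a single vertex using 4×4 homogeneous matrices. This is pure Python and illustrative only; a real vertex shader runs the same multiply on the GPU, once per vertex.

```python
import math

def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def translate(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def rotate_z(theta):
    """4x4 rotation about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

# Vertex (1, 0, 0) in homogeneous coordinates, as a column vector.
vertex = [[1.0], [0.0], [0.0], [1.0]]

# Rotate 90 degrees about z, then translate by (2, 0, 0): M = T * R.
M = matmul(translate(2, 0, 0), rotate_z(math.pi / 2))
out = matmul(M, vertex)   # vertex lands at roughly (2, 1, 0)
```

Composing the transforms into one matrix `M` first is exactly why GEMM dominates vertex processing: a single matrix multiply per vertex covers translation, scale, and rotation at once.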

Why can a GPU compute in a highly parallel way?
- The vertexes in the vertex shader and the pixels in the fragment shader can all be processed in parallel.

- The vertex shader and the fragment shader can be merged into a unified shader, running all shader computation on SCs (Shader Cores).

There is another key difference between CPU and GPU:
CPUs are differentiated by their proprietary instruction sets, the de facto standards.
GPUs are standardized by the following software APIs.
GPU API
- OpenCL
- GLES 3.0 / DirectX => unified shaders
- Vulkan: driver overhead is reduced significantly; multi-core CPUs are supported
Therefore, even though each GPU has a proprietary ISA, as long as it supports the above software APIs, applications remain hardware independent.
From CPU to GPU – Road to Parallel Computing and SIMD/SIMT
Highly parallel computation requires both hardware and software; neither alone is enough. Let us first look at how the GPU hardware architecture evolved from the CPU architecture. [@warburtonIntroGPU2017] [@warburtonIntroductionGraphics2017]
CPU hardware architecture and software thread
A modern CPU architecture is shown below: the orange part is control flow (instruction fetch, execute, branch, etc.), green is data flow, yellow is instruction memory, and gray is data memory. A program is fetched and decoded sequentially from instruction memory; following its instructions, data moves between data memory and the ALU to complete the computation.
Such a piece of hardware is called a "core", and the software executing inside a core is vividly called a "thread". Hardware cores and software threads do not always map one-to-one. For example, multi-threading or hyper-threading runs two or more threads on a single core, time-sliced or simultaneously; this is commonly called simultaneous multithreading (SMT).
Another approach is multi-core: replicate one core into many cores. For example, a 10th Gen Intel Core i9 contains 10 cores and 20 threads (2 threads per core), with clock speeds up to 5.3 GHz. This was the SOTA (state of the art) CPU in 2020.
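The core-versus-thread distinction is visible from software with nothing but the standard library. The sketch below (illustrative, pure Python) spawns more software threads than many machines have cores, and the OS schedules them onto whatever hardware parallelism exists:

```python
import threading

results = [0] * 8

def worker(i):
    # Every software thread runs the same function on its own data;
    # the OS maps the threads onto the available hardware cores.
    results[i] = i * i

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```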

GPU hardware architecture
Idea 1: The first step of the GPU hardware architecture is to slim down the single core: remove the out-of-order, branch prediction, and pre-fetch logic. The rationale is that computer graphics processes data in streams, with essentially no out-of-order or branch operations.

Idea 2: Replicate the core, and take advantage of the fact that GPU rendering instruction streams are typically very similar.

This enables the SIMD ("single instruction multiple data") model: share the cost of one instruction stream across many ALUs. In other words, one core can execute many threads at the same time; to distinguish these from CPU threads, Nvidia calls this model SIMT (Single Instruction, Multiple Threads).

Massive parallelism is then achieved by cloning cores.

Let us look at a complete example: an Nvidia Maxwell architecture GPU.
There are 16 Maxwell cores (middle figure), and each core contains 4 SIMD clusters (right figure). Each cluster contains 32 ALUs, so 16×4=64 instruction streams can be processed simultaneously.
Data streams run at ~56 GFLOP/s, with a peak of 4.6 TFLOP/s (single precision).
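A toy model of the SIMD idea in Python: one decoded operation (`op`) is shared across all 32 data lanes of a cluster, mimicking a single instruction stream driving many ALUs. Illustrative only; real hardware does this in lockstep, in silicon.

```python
import operator

def simd_apply(op, lanes_a, lanes_b):
    """Apply ONE operation across every data lane, SIMD-style."""
    assert len(lanes_a) == len(lanes_b)
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

a = list(range(32))    # 32 data elements, one per ALU in a cluster
b = [10] * 32
result = simd_apply(operator.add, a, b)   # one instruction, 32 lanes
```

The cost of fetching and decoding the instruction is paid once and amortized over all 32 lanes, which is precisely the saving Idea 1 and Idea 2 aim for.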

Difference between CPU and GPU
- Each CPU core executes scalar OR vector operations. Each GPU core ONLY executes vector instructions.
- CPU: SIMD parallelism only through dedicated vector execution units.
- GPU: SIMD parallel execution for all operations.
- A GPU core has far more local registers than a CPU core, for computation-intensive operations. The reasons:
  - A GPU core contains a large number of parallel ALUs, which need many local registers for parallel processing.
  - Each GPU stream has a very deep computation pipeline; its data (input/intermediate/output) is best read and written in local registers, avoiding the performance penalty of going through buffer memory.
  - For example, an Intel Skylake core has 180 integer registers and 168 floating-point registers; an Nvidia Maxwell core has 16,000 floating-point registers.
- GPU cores are engineered to switch quickly between threads to hide stalls, again because GPUs have very deep pipelines and must minimize the pipeline switch/stall penalty.
- GPU summary: multiple cores, where each core:
  - Has one or more wide SIMD vector units.
  - Executes one instruction stream per wide SIMD vector unit.
  - Has a pool of shared memory.
  - Shares a register file among all the ALUs.
  - Switches quickly between thread blocks to hide memory latency.
GPU Programming With CUDA Threading
The GPU hardware architecture described above is built for massive parallelism. Unsurprisingly, software programming also needs a special model to fully exploit the GPU's enormous computing power.
- Nvidia's CUDA (Compute Unified Device Architecture) programming model, introduced in 2007, flourished on the back of Nvidia's dominance in PC graphics cards/games and ML/DL.
- Microsoft's earlier DirectX (2003) is the industry standard for PC games (with or without a discrete graphics card) and Xbox.
- The industry standards for Linux/Android/mobile phones include OpenCL (2009), OpenGL ES/GLES (2003), and Vulkan (2016).
Take CUDA as an example. CUDA coordinates the Intel CPU on the motherboard with the Nvidia GPU on the graphics card, so it splits work into a HOST (CPU) part and a DEVICE (GPU) part:
- DEVICE and HOST are asynchronous. Operations are queued on the DEVICE.
- The DEVICE has its own DDR memory (e.g. 11 GB on a GTX 1080 Ti).
The programmer explicitly moves data between HOST and DEVICE:
- cudaMalloc: C-style malloc on the DDR of the graphics card.
- cudaMemcpy: copy data from a HOST array to a DEVICE array.
- Queue kernel tasks (launched from the HOST) to run on the DEVICE.
- cudaMemcpy: copy data from the DEVICE array back to the HOST array.
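A real CUDA program needs an Nvidia GPU, but the HOST/DEVICE data-movement pattern above can be mocked in pure Python. The names below mirror the CUDA runtime calls but are illustrative stand-ins, not the real API:

```python
# A pure-Python mock of the CUDA HOST/DEVICE workflow.
device_mem = {}   # stands in for the graphics card's DDR

def cuda_malloc(name, n):
    device_mem[name] = [0.0] * n          # ~ cudaMalloc

def cuda_memcpy_h2d(name, host_array):
    device_mem[name] = list(host_array)   # ~ cudaMemcpy HostToDevice

def launch_kernel(dst, src):
    # Kernel queued on the DEVICE: square every element "in parallel".
    device_mem[dst] = [x * x for x in device_mem[src]]

def cuda_memcpy_d2h(name):
    return list(device_mem[name])         # ~ cudaMemcpy DeviceToHost

host_in = [1.0, 2.0, 3.0]
cuda_malloc("d_in", 3)
cuda_malloc("d_out", 3)
cuda_memcpy_h2d("d_in", host_in)
launch_kernel("d_out", "d_in")
host_out = cuda_memcpy_d2h("d_out")
```

The point of the mock is the shape of the workflow: allocate on the DEVICE, copy in, queue the kernel, copy out; the HOST never touches DEVICE memory directly.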

Portable API

OpenMP (Open Multi-Processing) and OpenACC (focused on parallel computing) are not GPU-centric. CUDA and OpenCL, along with GLES and Vulkan, are the GPU-focused APIs.
This shows the importance of OpenCL; even vision DSPs have started to support OpenCL.
Why OpenCL?
OpenCL is a programming framework for heterogeneous compute resources.

Comparison between serial/CUDA/OpenCL models

Link to High Level Programming Language
CUDA and OpenCL are great tools, but an average engineer needs considerable training to master parallel vector C/C++ techniques. Can we instead write in a high-level language such as Python/Julia, or in ordinary C/C++, and have it converted into CUDA or OpenCL? The answer is yes.
For example, in ML/DL/AI applications, high-level frameworks such as PyTorch and TensorFlow sit on top of Python and can invoke CUDA directly!
PyTorch invokes CUDA explicitly. For TensorFlow, you pip install tensorflow (CPU) or pip install tensorflow-gpu (CUDA), and the program invokes CUDA implicitly.
# pytorch: explicitly move the tensor onto the GPU
a = torch.tensor([1., 2.]).cuda()
# tensorflow: CUDA is used implicitly when tensorflow-gpu is installed
a = tf.constant([2., 3., 4.])
A recent trend is the IR (Intermediate Representation): first compile/parse the high-level language into an IR, then lower it to different low-level targets such as CUDA or OpenCL. OCCA is one example; Google's MLIR is another.


GPU Benchmark
GPU benchmarks fall into micro-benchmarks and frame-based benchmarks.
Micro-benchmarks focus on the GPU's peak compute performance (GFLOP/s) and memory-transfer performance (latency and bandwidth) under variable operational intensity. The well-known roofline model is a representative micro-benchmark: the flat roof represents the peak-performance limit, and the slope of the eave represents the memory-bandwidth limit. [@konstantinidisPracticalPerformance2015] and [@konstantinidisEkondisMixbench2020]
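The roofline model reduces to a one-line formula: attainable performance is the minimum of the compute roof and bandwidth times operational intensity. The sketch below uses illustrative numbers for `peak` and `bw`, not measurements of any particular GPU:

```python
def roofline(intensity, peak_gflops, bw_gb_s):
    """Attainable GFLOP/s at a given operational intensity (FLOP/byte).

    Below the ridge point the kernel is memory bound (the slanted eave,
    slope = bandwidth); above it, compute bound (the flat roof).
    """
    return min(peak_gflops, intensity * bw_gb_s)

peak, bw = 4600.0, 224.0      # illustrative: ~4.6 TFLOP/s peak, 224 GB/s
ridge = peak / bw             # intensity where the eave meets the roof

low = roofline(1.0, peak, bw)     # memory bound: 224 GFLOP/s
high = roofline(100.0, peak, bw)  # compute bound: capped at 4600 GFLOP/s
```

A kernel's position relative to the ridge point tells you whether buying more bandwidth or more ALUs would help, which is exactly what micro-benchmarks like mixbench measure by sweeping the operational intensity.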

Frame-based benchmarks, as the name suggests, measure the GPU's performance rendering complete frames. They are closer to a game player's experience; an example is GFXBench 3.0 Manhattan.
Thermal Bound and Memory Bound
A famous quote: "Arithmetic is cheap, bandwidth is money, latency is physics."
- For a GPU, arithmetic corresponds to the ALUs' peak performance, which can grow by adding more cores and leveraging (a discounted) Moore's law. But thermal dissipation is now the big problem, and it is a physics limitation, barring a breakthrough in voltage scaling or cooling technology.
- Bandwidth really is a problem money can solve; it is roughly (DDR speed) × (DDR port number). Nvidia can be viewed as a company selling expensive GDDR memory. However, data transfer over DDR is also a very power-hungry operation, so it too eventually becomes thermal bound.
- Latency matters enormously to a CPU, at the ~ns range. For a GPU it depends on the application, with one frame (e.g. 4M pixels) as the usual unit: camera (hundreds of ms) and video (16 ms) applications are generally fine, but gaming/AR/automotive applications demand latency below 10 ms. Latency is determined by clock speed (i.e. transistor speed) and physical trace length, hence limited by physics.
- A GPU is therefore bounded either by thermals (physics) or by memory (money).
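The frame-latency figures above follow directly from display refresh rates; a quick sanity check (the refresh rates are illustrative):

```python
def frame_budget_ms(fps):
    """Time budget to render one frame at a given refresh rate."""
    return 1000.0 / fps

video_60hz = frame_budget_ms(60)      # ~16.7 ms, the "16 ms" video figure
gaming_120hz = frame_budget_ms(120)   # ~8.3 ms, inside the <10 ms target
```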
Reference
Konstantinidis, Elias. (2015) 2020. ekondis/mixbench. https://github.com/ekondis/mixbench.
Konstantinidis, Elias, and Yiannis Cotronis. 2015. “A Practical Performance Model for Compute and Memory Bound GPU Kernels.” In 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 651–58. https://doi.org/10.1109/PDP.2015.51.
Warburton, Tim. 2017a. “An Introduction to Graphics Processing Unit Architecture and Programming Models.” Numerical Methods, August, 246. https://extremecomputingtraining.anl.gov//files/2017/08/ATPESC_2017_Track-2_4_8-3_830am_Warburton-GPU.pdf.
———, dir. 2017b. An Intro to GPU Architecture and Programming Models I Tim Warburton, Virginia Tech. https://www.youtube.com/watch?v=lGmPy8xpT4E.