Edge AI – Bitwidth Reduction – ALU

Edge AI Challenges:

Reduce the storage and runtime memory (MB)
Reduce the memory bandwidth (GB/sec)
Increase the throughput (frame/sec)
Reduce the power/energy (mW/pJ)
Reduce the latency

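If pruning for sparsity is the strategic air force that surveys and strikes targets from above, then bitwidth reduction and quantization are the tactical army that fights the ground battle to take the target. The MAC has shrunk from the original 32-bit floating point (FP32) to 16-bit floating or fixed point, and then to today's mainstream 8-bit fixed point (or its variants).

The trend continues. Much research now targets even smaller bit-widths, such as 4-bit, or even ternary (3-level) or binary (1-bit) networks [@wanTBNConvolutional2018].

As of 2020, neural networks below 8-bit are mostly limited to specific applications rather than general platforms. The benefits of low bit-width are obvious: storage and runtime memory and memory bandwidth shrink (4X reduction for 32b vs. 8b), and computation energy/power drops sharply (15X-30X reduction for 32b FP vs. 8b fixed point).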
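As a rough illustration of the storage side of that saving, the sketch below computes weight storage at several bit widths. The parameter count (25 million) and the helper name `weight_storage_mb` are assumptions made up for this example; they are not figures from any model discussed here.

```python
def weight_storage_mb(num_params: int, bits: int) -> float:
    """Storage for num_params weights at the given bit width, in MB (10^6 bytes)."""
    return num_params * bits / 8 / 1e6

NUM_PARAMS = 25_000_000  # hypothetical model size, chosen only for illustration

for bits in (32, 16, 8, 4):
    size = weight_storage_mb(NUM_PARAMS, bits)
    ratio = weight_storage_mb(NUM_PARAMS, 32) / size
    print(f"{bits:2d}-bit: {size:6.1f} MB  ({ratio:.0f}x smaller than 32-bit)")
```

The same ratio applies to memory bandwidth when every weight is read once per inference; the energy saving quoted above is a separate effect and is not modeled here.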

Several concepts are easy to confuse, for example bitwidth reduction, quantization, and compression. This article mainly covers bitwidth reduction plus basic quantization.

Quantization generally means constraining a continuous analog signal/value to a set of discrete values. A few points to note:
1. The set of discrete values can be linear or nonlinear, uniform or non-uniform. Linear and uniform is the most common, e.g. binary encoding with $2^b$ values where $b$ is the bitwidth. Linear but non-uniform is like Song Han's weight sharing. For nonlinear or non-uniform, see e.g. [@polinoModelCompression2018].
2. Because values are grouped or constrained to discrete levels, there is quantization noise or error.
Bitwidth covers both floating-point and fixed-point representations. Fixed-point bitwidth refers to the "binary encoding" of these quantized discrete values. From an information-theory standpoint, the most compact encoding depends on the distribution of the discrete values (e.g. Huffman encoding), but that makes the subsequent MAC computation very complicated. So plain binary encoding ( $\log_2 M$ bits for $M$ levels) is the most common choice. There are also low-bit floating-point representations. Bitwidth reduction is therefore not exactly the same thing as quantization, as the figure below shows.
Compression is the more general concept. Quantization is one way to achieve compression; sparsity is another.
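To make the "linear and uniform" case and its quantization error concrete, here is a minimal sketch of a uniform quantizer with $2^b$ levels. The function name `uniform_quantize`, the input range [-1, 1], and the random test data are assumptions for illustration only.

```python
import numpy as np

def uniform_quantize(x, bits, x_min, x_max):
    """Map x onto 2**bits evenly spaced levels spanning [x_min, x_max]."""
    levels = 2 ** bits
    step = (x_max - x_min) / (levels - 1)
    codes = np.clip(np.round((x - x_min) / step), 0, levels - 1).astype(np.int64)
    return codes, x_min + codes * step   # integer codes and their real values

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)   # stand-in for a continuous signal

for bits in (8, 4, 2):
    codes, x_hat = uniform_quantize(x, bits, -1.0, 1.0)
    step = 2.0 / (2 ** bits - 1)
    max_err = np.max(np.abs(x - x_hat))
    print(f"{bits}-bit: {2 ** bits} levels, max |error| = {max_err:.4f} (step/2 = {step / 2:.4f})")
```

For inputs inside the range, the error never exceeds half a step, and every bit removed roughly doubles the step.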

Bitwidth Reduction for Training and Inference

Bit-width Reduction Motivation and Trade-off

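The feasibility of, and the drive for, bit-width reduction may also come from brain research. The brain can be seen as an analog computing unit: it has no high-precision digital floating-point arithmetic, yet it does surprisingly well at both learning and inference. This kind of analog computation is close to low-bitwidth ( $\le$ 8-bit) fixed-point operation. Still, a neural network is not a human brain. Training generally uses a 32b FP GPU, and inference also has to replicate the 32b FP operations to reach the expected (golden) accuracy.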

Floating Point (FP) vs. Fixed Point

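IEEE defines the floating-point representation standard. A floating-point number consists of a mantissa (which sets the precision), an exponent (which sets the dynamic range, DR), and a sign bit.

The actual value is $(-1)^s (1+m) \times 2^{(e-bias)}$ , where m is the mantissa, e the exponent, and s the sign. The bias sets the position of the binary point; it is usually chosen so that at e=0 the actual value falls within +/-1.
The IEEE 754 standard defines the floating-point formats as follows: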

Half Precision (16 bit): 1 sign bit, 5 bit exponent, and 10 bit mantissa
Single Precision (32 bit): 1 sign bit, 8 bit exponent, and 23 bit mantissa
Double Precision (64 bit): 1 sign bit, 11 bit exponent, and 52 bit mantissa
Quadruple Precision (128 bit): 1 sign bit, 15 bit exponent, and 112 bit mantissa
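The half-precision layout can be checked directly. The sketch below splits a 16-bit float into its three fields and rebuilds the value with the formula above. The helper names are made up for this example, and the rebuild step only covers normal numbers (it ignores zeros, subnormals, infinities, and NaN). The bias of 15 is the IEEE 754 value for half precision.

```python
import struct

def decode_fp16(value: float):
    """Split an IEEE 754 half-precision number into (sign, exponent, mantissa) bit fields."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    sign = raw >> 15
    exponent = (raw >> 10) & 0x1F   # 5 exponent bits
    mantissa = raw & 0x3FF          # 10 mantissa bits
    return sign, exponent, mantissa

def rebuild_fp16(sign: int, exponent: int, mantissa: int) -> float:
    """Apply (-1)^s (1+m) 2^(e-bias) with bias = 15; valid for normal numbers only."""
    return (-1) ** sign * (1 + mantissa / 2 ** 10) * 2.0 ** (exponent - 15)

s, e, m = decode_fp16(-6.5)
print(s, e, m, rebuild_fp16(s, e, m))   # prints: 1 17 640 -6.5
```

Here -6.5 is -1.625 x 2^2, so the stored exponent is 2 + 15 = 17 and the mantissa field is 0.625 x 1024 = 640.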

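The biggest advantage of floating point is that it decouples precision from dynamic range. The drawbacks are just as clear: it needs many bits (mantissa + exponent, starting at 16-bit half precision) and a relatively large implementation area and power.

Fixed point is the opposite. Its biggest advantage is that the bit count can keep shrinking (16b -> 12b -> 8b -> 6b -> 4b -> 2b -> 1b), which brings the memory/area/power benefits above. The problems to overcome are (1) precision and (2) dynamic range.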
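The tension between precision and dynamic range shows up as soon as a fixed word length is split between integer and fractional bits. The sketch below is a minimal signed fixed-point converter with saturation. The function names and the two 8-bit splits are assumptions for illustration.

```python
def to_fixed(x: float, total_bits: int, frac_bits: int) -> int:
    """Quantize x to a signed fixed-point code with frac_bits fractional bits, saturating on overflow."""
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = round(x * (1 << frac_bits))
    return max(q_min, min(q_max, q))

def from_fixed(q: int, frac_bits: int) -> float:
    """Convert a fixed-point code back to its real value."""
    return q / (1 << frac_bits)

# Same 8-bit word, two hypothetical splits between integer and fractional bits.
for frac_bits in (6, 2):
    for x in (0.3, 20.0):
        q = to_fixed(x, 8, frac_bits)
        print(f"frac_bits={frac_bits}: {x} -> code {q} -> {from_fixed(q, frac_bits)}")
```

With 6 fractional bits, 0.3 is represented finely (0.296875) but 20.0 saturates just below 2. With 2 fractional bits, 20.0 fits exactly but 0.3 collapses to 0.25. A single fixed format cannot serve both values well, which is what per-layer scaling later addresses.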

Training vs. Inference

Depending on training or inference, and floating point (FP) or fixed point, there are four cases, as in the table below.

	Training	Inference
Floating point	GPU/TPU	CPU/GPU/TPU
Fixed point	DSP/NPU	(Edge AI) ASIC

Training needs both forward and backward path operations; inference has only the forward path. Training therefore needs high precision and dynamic range, while inference can relax both.

(Row 1) Floating point for training and inference generally serves as the golden reference. Most servers and data centers train in 32-bit and run inference in 16-bit or even 12-bit floating point. The floating-point training/inference accuracy or top1/top5 error is generally golden or close to golden.
(Row 2) Fixed point for training and inference is basically a trade-off between (1) the precision and dynamic range of the bitwidth and (2) top1/top5 error or accuracy.

Why consider fixed point for training at all? Mainly for power. In some applications (e.g. edge AI) that do on-line training under a power constraint, fixed-point training becomes the necessary choice.

Focus on fixed point inference.

This article focuses on fixed point for inference, which is the mainstream for edge AI.
The problems fixed point has to solve are (1) the precision and dynamic range of the bitwidth and (2) top1/top5 error or accuracy.

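The figure below shows the weight and activation (Omap, which is the Imap of the next layer) distributions of a simple 7-layer (including 1 FC) neural network on CIFAR-10.

A few observations:

The dynamic range of the weight and activation distributions differs greatly from layer to layer.
Weight distributions are symmetric about zero. Activation distributions are neither zero-centered nor symmetric.

The sensible approach is not a single fixed point representation but layer-based scaling that adjusts the dynamic range and brings every layer into a consistent range (similar to batch norm; can the two be merged?) before quantizing. For asymmetric activations, a non-zero zero_point can be introduced, which can gain 1 bit of resolution. This can be summarized as follows [@tensorflowTensorFlowLite]: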

real_value = (int8_value - zero_point) * scale
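A minimal NumPy sketch of this mapping follows. It is not the TensorFlow Lite implementation; the function names and the min/max rule for choosing `scale` and `zero_point` are assumptions for illustration, and the `int8` cast assumes at most 8 bits and a non-empty range (`x_max > x_min`).

```python
import numpy as np

def choose_qparams(x_min: float, x_max: float, num_bits: int = 8):
    """Pick scale and zero_point so that [x_min, x_max] maps onto the signed integer range."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep real 0.0 exactly representable
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    """Real values -> integer codes (stored as int8, so num_bits must be <= 8)."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, q_min, q_max).astype(np.int8)

def dequantize(q, scale, zero_point):
    """real_value = (int8_value - zero_point) * scale"""
    return (q.astype(np.int32) - zero_point) * scale

x = np.array([0.0, 1.5, 3.0, 6.0])   # made-up non-negative activations
scale, zero_point = choose_qparams(float(x.min()), float(x.max()))
q = quantize(x, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

For this all-positive input the zero_point lands at -128, so the whole int8 range covers [0, 6]. A symmetric scheme (zero_point = 0) would leave the negative codes unused, which is where the extra bit of resolution comes from.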

Dynamic range: mostly handled by layer-based adaptive scaling for both weights and activations, which is the scale in the formula above.
Precision: the idea is to quantize the golden (FP32) model with minimal quantization noise, where minimal quantization noise implies the smallest overall accuracy drop. The simple but sub-optimal option is symmetric uniform quantization (zero_point = 0 in the formula above), which suits weights. One step further is asymmetric uniform quantization (zero_point $\ne$ 0), which suits activations.
Minimize the runtime cost of asymmetric quantization: A is an $m \times n$ matrix of asymmetrically quantized activations; B is an $n \times p$ matrix of quantized weights. Consider $A \bullet B$ , where $a_j$ is row $j$ of A and $b_k$ is column $k$ of B (scale factors omitted): $$\begin{aligned} a_{j} \cdot b_{k} &= \sum_{i=1}^{n} a_{j}^{(i)} b_{k}^{(i)} = \sum_{i=1}^{n}\left(q_{a}^{(i)}-z_{a}\right)\left(q_{b}^{(i)}-z_{b}\right) \\ &= \sum_{i=1}^{n} q_{a}^{(i)} q_{b}^{(i)}-\sum_{i=1}^{n} q_{a}^{(i)} z_{b}-\sum_{i=1}^{n} q_{b}^{(i)} z_{a}+\sum_{i=1}^{n} z_{a} z_{b} \end{aligned}$$
The third and fourth terms can be pre-computed as constants. The second term is 0 because $z_b = 0$ is forced. So only the first term has to be computed at runtime, the same as an ordinary symmetric-quantization multiplication.
The best quantization scheme is nonlinear quantization based on the (information-theoretic) density distribution (the red dots in the figure below). The hole in the middle, where the density of weights near zero is 0, comes from pruning [@hanDeepCompression2016].
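Returning to the zero-point expansion of $A \bullet B$ above, the identity can be checked numerically. The sketch below evaluates the four terms separately and compares them with the direct product. The matrix sizes, zero-point values, and the function name `expanded_matmul` are assumptions for illustration.

```python
import numpy as np

def expanded_matmul(q_a, q_b, z_a, z_b):
    """Evaluate (q_a - z_a) @ (q_b - z_b) through the four-term expansion."""
    n = q_a.shape[1]
    term1 = q_a @ q_b                                # runtime integer MACs
    term2 = -z_b * q_a.sum(axis=1, keepdims=True)    # depends on activations; 0 when z_b = 0
    term3 = -z_a * q_b.sum(axis=0, keepdims=True)    # weights only: can be precomputed
    term4 = n * z_a * z_b                            # constant
    return term1 + term2 + term3 + term4

rng = np.random.default_rng(0)
q_a = rng.integers(-128, 128, size=(4, 16)).astype(np.int32)   # quantized activations
q_b = rng.integers(-128, 128, size=(16, 3)).astype(np.int32)   # quantized weights

for z_a, z_b in ((5, 0), (5, -3)):
    direct = (q_a - z_a) @ (q_b - z_b)
    print(z_a, z_b, np.array_equal(direct, expanded_matmul(q_a, q_b, z_a, z_b)))
```

With `z_b = 0` both `term2` and `term4` vanish, and `term3` is a per-output-column constant that can be folded into the bias offline, leaving only `q_a @ q_b` for the runtime.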

Nonlinear quantization makes the MAC expensive at runtime and may require table look-up. The simpler option is still asymmetric uniform quantization with pruning and clipping, as in the figure below [@jungLearningQuantize2019].
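To see both the benefit and the look-up cost of density-based quantization, the sketch below compares 3-bit uniform levels with a quantile codebook (one level at the center of each equal-probability bin). The quantile rule is just one simple density-aware choice picked for this example, not the method of the cited papers, and the Gaussian data stands in for a bell-shaped weight distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=20_000)   # stand-in for a bell-shaped weight distribution
num_levels = 2 ** 3                     # 3-bit codes

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def encode(x, levels):
    """Index of the nearest level for each value (this is the stored low-bit code)."""
    return np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)

# Uniform: evenly spaced levels over [min, max].
uniform_levels = np.linspace(w.min(), w.max(), num_levels)

# Nonlinear: levels follow the density, one per equal-probability bin.
probs = (np.arange(num_levels) + 0.5) / num_levels
codebook = np.quantile(w, probs)

for name, levels in (("uniform", uniform_levels), ("quantile codebook", codebook)):
    idx = encode(w, levels)
    w_hat = levels[idx]                 # decoding is a table look-up
    print(f"{name}: MSE = {mse(w, w_hat):.4f}")
```

The codebook gives a lower error at the same bit width, but the codes are no longer proportional to the values, so each operand has to go through the table before it can enter a multiplier.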

How to Do Bitwidth Reduction

My classification is as follows:

1. Open loop: direct quantization/truncation of the weights, Imap, and Omap of the golden model (FP32). Very simple, but it only works at higher bit widths (e.g. $\ge$ 12-bit) if the accuracy drop is to stay negligible (e.g. $\le$ 1%).
2. (1) + adaptive layer scaling: much like 1, but every layer has its own scaling factor. How is the scaling factor determined? Usually an image subset is used to pre-determine it. Adaptive layer scaling solves the dynamic range problem and works for 8-10 bit widths. This is called post-training quantization and is currently the most common approach.
3. Closed loop: (2) + retrain. Retraining takes the weights/Imap/Omap from (2) as initial values and trains again to fine-tune the weights. It works for 6-8 bit widths. This is called quantization-aware training.
4. (3) + advanced quantization: for example asymmetric quantization (Google), non-linear quantization (Korea PCK?), weight-sharing quantization (Song Han), pruning + quantization [@jungLearningQuantize2019]. These advanced quantization methods basically all need retraining. They work for 4-8 bit widths [@polinoModelCompression2018].
5. Even fewer bits ( < 4-bit) are generally used for specific applications and also need special techniques.
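The step from item 1 to item 2 can be sketched in a few lines: estimate one scale per layer from a small calibration subset, and compare it with a single scale shared by all layers. Everything here is made up for illustration, including the two "layers" with a 100x range difference, the 500-sample calibration subset, and the max-magnitude rule for the scale.

```python
import numpy as np

def symmetric_scale(calib, num_bits=8):
    """Scale that maps the largest calibration magnitude to the top integer code."""
    return float(np.max(np.abs(calib))) / (2 ** (num_bits - 1) - 1)

def fake_quantize(x, scale, num_bits=8):
    """Quantize then dequantize, so the error can be measured in real units."""
    q_max = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -q_max, q_max) * scale

rng = np.random.default_rng(0)
# Hypothetical layers whose value ranges differ by 100x.
layers = {"layer_a": rng.normal(0, 0.05, 5000), "layer_b": rng.normal(0, 5.0, 5000)}
calib = {name: x[:500] for name, x in layers.items()}   # small calibration subset

global_scale = symmetric_scale(np.concatenate(list(calib.values())))
for name, x in layers.items():
    per_layer = symmetric_scale(calib[name])
    err_global = np.mean((x - fake_quantize(x, global_scale)) ** 2)
    err_layer = np.mean((x - fake_quantize(x, per_layer)) ** 2)
    print(f"{name}: MSE global scale = {err_global:.2e}, per-layer scale = {err_layer:.2e}")
```

With one shared scale, the small-range layer is rounded almost entirely to zero. Per-layer scales give each layer the full 8-bit code range. No labels and no backpropagation are involved, which is what makes this a post-training method.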

Below is TensorFlow's classification.

The first two items correspond to 1; the third to 2 (post-training integer quantization with unlabelled samples); the fourth to 3 (quantization-aware training with labelled data).

The classification in [@nagelDataFreeQuantization2019]:

Level 1: No data and no backpropagation required. Method works for any model. As simple as an API call that only looks at the model definition and weights.
Level 2: Requires data but no backpropagation. Works for any model. The data is used e.g. to re-calibrate batch normalization statistics [27] or to compute layer-wise loss functions to improve quantization performance. However, no fine-tuning pipeline is required.
Level 3: Requires data and backpropagation. Works for any model. Models can be quantized but need fine-tuning to reach acceptable performance. Often requires hyperparameter tuning for optimal performance. These methods require a full training pipeline (e.g. [16, 35]).
Level 4: Requires data and backpropagation. Only works for specific models. In this case, the network architecture needs non-trivial reworking, and/or the architecture needs to be trained from scratch with quantization in mind (e.g. [4, 31, 21]). Takes significant extra training-time and hyperparameter tuning to work.

Bit-width Reduction Techniques

Q&A

Q: Can bitwidth reduction be combined with other techniques such as pruning for sparsity? Yes. However, the sparsity ratio may drop after bitwidth reduction + retrain (reference?)

Reference

Han, Song, Huizi Mao, and William J. Dally. 2016. “Deep Compression:
Compressing Deep Neural Networks with Pruning, Trained Quantization and
Huffman Coding,” February. http://arxiv.org/abs/1510.00149.

Jung, Sangil, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han,
Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. 2019. “Learning to
Quantize Deep Networks by Optimizing Quantization Intervals with Task
Loss.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 4345–54. Long Beach, CA, USA: IEEE.
https://doi.org/10.1109/CVPR.2019.00448.

Wan, Diwen, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng
Tao Shen. 2018. “TBN: Convolutional Neural Network with Ternary Inputs
and Binary Weights.” In Computer Vision – ECCV 2018, edited by Vittorio
Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss,
11206:322–39. Cham: Springer International Publishing.
https://doi.org/10.1007/978-3-030-01216-8_20.
