Math ML – Multiclass Classification: Softmax, SVM, IVM

Multi-class classification (MCC) is a long-standing problem, and the standard approach is softmax classification. In CNN classification networks, the softmax classifier plays the final role of feature classification. In attention-based transformers and BERT, it plays an even more central role.

In addition, for special applications such as face recognition, various softmax loss functions have made a lot of progress in recent years.
It is worth digging further into the underlying mathematical principles and their variants, ideally with a unified theory-to-applications correspondence!

  • Linearly separable datasets, especially binary datasets: SVM has the advantages of (i) maximum margin; (ii) efficient computation for both training (convex optimization) and inference; (iii) intuitive explainability (support vectors). Logistic regression or softmax has no advantage here and in fact needs regularization to avoid overfitting.
  • Simple nonlinearly separable datasets: kernel SVM has the same advantages as above.
  • Mostly linearly/nonlinearly separable data with some mixed points along the boundary: SVM or kernel SVM still has the advantage.
  • When a certain fraction of the data is mixed and shows a clear trend, especially in multi-class scenarios, the probabilistic approaches (logistic regression or softmax) have the advantage. This is evident in deep learning and transformers.

Structure of this article: first, logistic regression for binary classification, linking minimizing the cross-entropy loss to maximizing the log-likelihood; then, re-derive logistic regression with the softmax function and generalize it to multi-class classification.

Softmax con: too many parameters compared with a convolution layer (W×H ~ 1000×1000 = 1M parameters for a CNN) and (? × ? for a transformer).

| Type | Softmax | SVM | IVM |
| --- | --- | --- | --- |
| Form | exp(w·x) / Σ exp(·) | – | – |
| Probability | Yes | No | Yes |
| Rationale | Max. likelihood / similarity | Max. margin | ? |
| Loss function | Cross-entropy loss (above) | Hinge loss | ? |
| Application | CNN, transformer classification | ? | ? |
| Output, y | 0 or 1 (binary) | +1 or −1 | 0 to 1?? |
| w, b training | w_1 ... w_C | support vectors | import vectors |
| Binary inference (compute) | exp computation | linear computation | exp computation? |
| Storage | N × M | support vectors × N × M? | exp comp? |
| Binary inference (method) | boundary check | boundary check | import vectors? |
| Multi-class (C) inference | C probabilities | C(C−1)/2 binary classifiers | ? |
| Differentiable | Yes | No | Yes |

There are lots of new softmax variants!
https://towardsdatascience.com/additive-margin-softmax-loss-am-softmax-912e11ce1c6b

Loss function of LR and Softmax
[@houLogisticSoftmax2015]

Logistic Regression for Binary Classification

The figure below shows a binary classifier for N data points of dimension D with a linear decision boundary (D = 2 in the figure).

Inference: Logistic Regression by Decision Boundary (Not Generalizable to Multi-Classes!)

The basic idea is to use the sigmoid function h(x) to linearly classify the two possible outcomes y_i \in \{0, 1\} given x_i, i = 1, \ldots, N:

$latex y_i=\left\{\begin{array}{ll}
1 & 1 > h(x_i) > 0.5 \\
0 & 0 < h(x_i) < 0.5
\end{array}\right. $

where h(x) = \frac{1}{1+e^{-(w\cdot x +b)}} and 1-h(x) = \frac{e^{-(w\cdot x +b)}}{1+e^{-(w\cdot x +b)}} = \frac{1}{1+e^{+(w\cdot x +b)}}
The decision boundary, h(x) = 0.5, corresponds to the linear function w\cdot x + b = 0, just like the decision boundary of an SVM.
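A minimal sketch (assuming NumPy and hypothetical parameters w, b) of boundary-check inference: predicting 1 when h(x) > 0.5 is exactly the same as checking on which side of w·x + b = 0 the point falls, so the sigmoid never even needs to be evaluated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_by_boundary(X, w, b):
    # h(x) > 0.5  <=>  w.x + b > 0, so only the sign of the linear score matters
    return (X @ w + b > 0).astype(int)

w, b = np.array([1.0, -2.0]), 0.5              # hypothetical parameters (D = 2)
X = np.array([[3.0, 1.0], [0.0, 2.0]])         # two test points
print(predict_by_boundary(X, w, b))            # [1 0]
print((sigmoid(X @ w + b) > 0.5).astype(int))  # [1 0], identical
```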

Inference: Logistic Regression by Maximal Probability (Generalizable to Multi-Classes!)

We only need to modify the original idea slightly, with y_i \in \{0, 1\}:
$latex y_i=\left\{\begin{array}{ll}
1 & h_1(x_i) > h_0(x_i) \\
0 & h_0(x_i) > h_1(x_i)
\end{array}\right. $

where h_1(x) = h(x) = \frac{1}{1+e^{-(w\cdot x +b)}} and h_0(x) = 1-h(x) = \frac{e^{-(w\cdot x +b)}}{1+e^{-(w\cdot x +b)}} = \frac{1}{1+e^{+(w\cdot x +b)}}
Here only the notion of maximum probability is used; there is no notion of a decision boundary.
For binary classification the two views are equivalent, but the underlying logic is completely different! This is discussed in depth below.

For multi-class classification, maximum-probability classification is simple and clear.
Take 3-class classification as an example; the corresponding output (e.g., a one-hot vector) is
$latex y_i=\left\{\begin{array}{ll}
[1, 0, 0] & h_1(x_i) > h_2(x_i), h_3(x_i) \\
[0, 1, 0] & h_2(x_i) > h_1(x_i), h_3(x_i) \\
[0, 0, 1] & h_3(x_i) > h_1(x_i), h_2(x_i)
\end{array}\right. $
where h_1(x)+h_2(x)+h_3(x)=1 and 0 \le h_i(x) \le 1
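A minimal sketch of maximum-probability inference for this 3-class case: given the class probabilities h_1(x), ..., h_C(x) (non-negative and summing to 1), the predicted one-hot label is simply the argmax.

```python
import numpy as np

def predict_one_hot(h):
    """h = [h_1(x), ..., h_C(x)], non-negative and summing to 1."""
    y = np.zeros_like(h)
    y[np.argmax(h)] = 1.0          # pick the class with maximum probability
    return y

print(predict_one_hot(np.array([0.2, 0.5, 0.3])))   # [0. 1. 0.]
```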

Training/Learning: How to Obtain w and b

Training in machine learning always starts by formulating a loss function to be minimized, i.e., \min \mathcal{L}(w,b); solving this optimization problem gives w, b, or other parameters. Only a few simple optimization problems have closed-form solutions (e.g., linear regression). Most require numerical algorithms, which is exactly what training or learning is.

Maximum Likelihood for the Probabilistic Interpretation of ML

Minimum loss and maximum likelihood are very closely related!
We first review maximum likelihood and then connect it to minimum loss.

Let L(\theta) be a likelihood function, which is basically a probability function P(x\mid\theta). We solve \max L(\theta) = \max P(x\mid\theta), and \theta^* = \text{argmax} L(\theta)

A simple example: n coins with k heads and n−k tails. L(\theta) = P(k \text{ heads}, (n-k) \text{ tails} \mid \theta) = \frac{n!}{k! (n-k)!} {\theta}^k (1-\theta)^{n-k}
\max L(\theta) \Rightarrow \frac{d L(\theta^*)}{d \theta} = 0 \Rightarrow \theta^* = \text{argmax} L(\theta)

k \theta^{k-1} (1-\theta)^{n-k} - (n-k)\theta^k (1-\theta)^{n-k-1} = 0
\to k (1-\theta) = (n-k) \theta \to \theta^* = \frac{k}{n}
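A quick numerical check of \theta^* = k/n, as a sketch with hypothetical counts n = 10, k = 7: evaluating the likelihood on a grid of θ values, the maximizer lands at 0.7.

```python
import numpy as np
from math import comb

n, k = 10, 7                                       # hypothetical coin-flip counts
theta = np.linspace(0.001, 0.999, 999)             # grid of candidate parameters
likelihood = comb(n, k) * theta**k * (1 - theta)**(n - k)
print(theta[np.argmax(likelihood)])                # ~0.7, matching theta* = k/n
```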

Logistic Regression's Maximum Likelihood Interpretation [@roelantsSoftmaxClassification2019]

For an ML problem that has a probabilistic interpretation, can we use the maximum likelihood approach to solve for the parameters? Yes, of course!

Take logistic regression as an example and assume there is an underlying probability distribution governed by the parameters. Let \theta = (w, b) and let L(\theta \mid x, y) be the likelihood function, which can be viewed as the joint probability distribution P(x, y \mid \theta) = P(y \mid x, \theta) P(x\mid\theta). Since P(x\mid\theta) is essentially independent of \theta, L(\theta \mid x, y) can be simplified to P(y \mid x, \theta), i.e., L(\theta \mid x, y) = P(y \mid x, \theta):
P(y=1 \mid x, \theta) = \frac{1}{1+e^{-\theta\cdot x}}
P(y=0 \mid x, \theta) = \frac{1}{1+e^{+\theta\cdot x}}

Since each data point takes exactly one of the two mutually exclusive labels y=0 or y=1, for a given data point (x_i, y_i)
L(\theta) = P(y=y_i \mid x_i, \theta) = \prod_c P(y_c \mid x, \theta) = \left(\frac{1}{1+e^{-\theta\cdot x_i}}\right)^{y_i} \left(\frac{1}{1+e^{+\theta\cdot x_i}}\right)^{1-y_i} where y_i \in \{0, 1\}

In practice we never determine the parameters from a single point. Assume the n data points are mutually independent. The likelihood function is
L(\theta) = \prod_{i=1}^n P(y=y_i \mid x_i, \theta) = \prod_{i=1}^n \prod_c P(y_c \mid x, \theta) = \prod_{i=1}^n \left(\frac{1}{1+e^{-\theta\cdot x_i}}\right)^{y_i} \left(\frac{1}{1+e^{+\theta\cdot x_i}}\right)^{1-y_i}

The usual maximum likelihood: \max L(\theta), and \theta^* = \text{argmax} L(\theta).
In practice, it is more common to minimize the negative log-likelihood, \min \{-\log L(\theta)\},
because log is a monotonic function. Here \log = \log_e = \ln, which can be viewed as canceling the exponential terms.
\theta^* = \text{argmin} \{-\log L(\theta)\} = \text{argmax} L(\theta)

-\log L(\theta) = - \sum_{i=1}^n \log P(y=y_i \mid x_i, \theta) \\ = \sum_{i=1}^n \{ y_i \log(1+e^{-\theta\cdot x_i}) + (1-y_i) \log(1+e^{+\theta\cdot x_i}) \} \\ = \sum_{i=1}^n \{ -y_i \log h(x_i) - (1-y_i)\log(1-h(x_i)) \}

where h(x) is the sigmoid function

Logistic Regression Cross-Entropy Loss

The standard cross-entropy loss function of logistic regression (LR) or softmax is [@brownleeGentleIntroduction2019]:
L = -y \log(h(x)) - (1-y) \log(1-h(x))

Comparing the two, the cross-entropy loss is exactly -log(likelihood): minimizing the cross-entropy loss = maximizing the log-likelihood = maximizing the likelihood!
For people familiar with the notion of entropy, entropy or cross-entropy may feel more intuitive or fundamental. For me, linking the cross-entropy loss to the negative log-likelihood is the easier way to understand it.

The expression above can be extended to a dataset-level loss function and generalized to multi-class classification.
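A minimal sketch (NumPy, hypothetical toy data and parameters) confirming that the summed cross-entropy loss is literally the negative log-likelihood of the Bernoulli model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))              # hypothetical data: N = 8, D = 2
y = rng.integers(0, 2, size=8)           # binary labels
w, b = np.array([0.5, -1.0]), 0.1        # arbitrary parameters
h = sigmoid(X @ w + b)

# cross-entropy loss summed over the data points
cross_entropy = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# negative log of the Bernoulli likelihood  prod_i h^y (1-h)^(1-y)
neg_log_likelihood = -np.log(np.prod(h**y * (1 - h)**(1 - y)))

print(np.isclose(cross_entropy, neg_log_likelihood))   # True
```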

Generalize to Multi-class Classification Using Softmax [@houLogisticSoftmax2015]

How do we generalize logistic regression from binary to multi-class classification? We first re-derive binary classification with softmax (equivalent to logistic regression) and define its loss function, then generalize to multi-class classification.

Softmax Function

What is softmax? Let us first look at what a hard maximum is. Hardmax is an indicator vector of the original vector's maximum position: the entry corresponding to the maximum is 1 and all other entries are 0.
For example, A = [2, 4, 6, 1, 7], hardmax(A) = [0, 0, 0, 0, 1].

Hardmax has several drawbacks:

  • If there are two or more maxima, the result is not unique. For example, A = [2, 4, 6, 6], hardmax(A) = [0, 0, 0, 1] or [0, 0, 1, 0].
  • Most of the information is lost; from hardmax(A) we cannot recover the original information except the position of the maximum.
  • It is not differentiable, so we cannot back-propagate through it.

Softmax was therefore introduced:

  • Step 1: raise all the values of A to exponents. Step 2: normalize them so they sum to 1 (a short code sketch follows this list).
    \sigma(\mathbf{z})_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}} \text { for } j=1, \cdots, K

  • A = [2, 4, 6, 1, 7], softmax(A) = [0.005, 0.035, 0.258, 0.002, 0.7]. The largest value "7" takes 70%, "6" takes 26%, and the remaining values add up to less than 5%. Essentially all of the information is compressed into values between 0 and 1, which can be viewed as probabilities, and the function is differentiable.

  • The probabilistic nature of softmax matches maximum log-likelihood = minimum cross-entropy loss exactly.
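A minimal NumPy sketch of the two steps above (exponentiate, then normalize), with the usual max-subtraction trick so that large inputs do not overflow, and hardmax shown for contrast:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtracting the max does not change the result but avoids overflow
    return e / e.sum()

def hardmax(z):
    y = np.zeros_like(z, dtype=float)
    y[np.argmax(z)] = 1.0          # indicator of the maximum position
    return y

A = np.array([2.0, 4.0, 6.0, 1.0, 7.0])
print(hardmax(A))                  # [0. 0. 0. 0. 1.]
print(softmax(A).round(3))         # ~[0.005 0.035 0.258 0.002 0.701]
```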

Binary Classification

Assume a binary labelled dataset as below (y_i = 0 or 1 given x_i). Without loss of generality we can assume the data have zero mean. The next step is to find the corresponding w_0 and w_1. The idea is simple: w_0 should have maximum "similarity" with the y = 0 data (the "o" in the figure below), and w_1 should have maximum "similarity" with the y = 1 data (the "x" in the figure below).

How do we get maximum similarity? The simplest similarity measure is the inner product, \langle w_0, x\rangle = {w_0 \cdot x} or \langle w_1, x\rangle = {w_1 \cdot x}. A straightforward idea is
\max_{w_0, w_1}\{\sum_{\{i: y_i = 0\}}\langle w_0, x_{i}\rangle + \sum_{\{i: y_i = 1\}}\langle w_1, x_{i}\rangle \}
= \max_{w_0, w_1}\{\sum_i (1-y_i)\langle w_0, x_{i}\rangle + \sum_i y_i\langle w_1, x_{i}\rangle \} where y_i \in \{0, 1\}
= \max_{w_0, w_1} \sum_i [(1-y_i)  (w_0 \cdot x_{i}) + y_i  (w_1 \cdot x_{i}) ] where y_i \in \{0, 1\}

The problem with the expression above is that inner products can be positive or negative, and the two signs cancel each other. We can improve it by exponentiating so that all values become positive, and then normalizing the sum to 1 so that nothing blows up. This is exactly the softmax function: all values lie between 0 and 1, and it replaces the raw inner-product similarity!
\max_{w_0, w_1} \sum_i [(1-y_i)\frac{e^{ w_0\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} + y_i \frac{e^{ w_1\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} ]
= \max_{w_0, w_1} \sum_i [(1-y_i) P(y_i=0 \mid x_i, w_0) + y_i P(y_i=1 \mid x_i, w_1) ]

A further refinement: since the softmax outputs can be viewed as probabilities, we can use the log-probability, i.e., -\log P (natural log), to convert to an entropy. Because of the multiplication by (−1), maximum similarity becomes minimum (cross-) entropy loss. Why use entropy instead of probability? On one hand, entropy carries more "physical" meaning, e.g., minimum cross-entropy loss, the maximum entropy principle, the maximum energy principle. On the other hand, the cross-entropy loss punishes small probabilities: -\log P \to +\infty when P \to 0 (see the earlier figure).

\min_{w_0, w_1} \sum_i [-(1-y_i)\log \frac{e^{ w_0\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} - y_i \log \frac{e^{ w_1\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} ]

In the special case of binary classification, w_0 and w_1 can be reduced to w=w_1-w_0.
\min_{w} \sum_i [-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}}} ]

Sanity check: w = w_1 - w_0 has high similarity with w_1, i.e., high similarity with \{x_i : y_i = 1\}, so w \cdot x_i = +C \gg 0. The first term above is then 0 (y_i=1) and we only need the second term: -y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}}} = \log(1+e^{-C}) \sim e^{-C} \sim 0
Conversely, w has strong negative similarity with \{x_i : y_i = 0\}, so w \cdot x_i = -C \ll 0. The second term above is then 0 and we only need the first term: -(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}}} = \log(1+e^{-C})\sim e^{-C} \sim 0

In summary, maximum similarity \Rightarrow minimum cross-entropy loss.

The expression above, \min_{w} \sum_i [-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}}} ]
= \min_{w} \sum_i [-(1-y_i)\log (1-h(x_i)) - y_i \log h(x_i) ], where h(x) is the sigmoid function without a bias, because we assumed at the start that the data have zero mean. Adding a bias, i.e., w\cdot x \to w\cdot x + b, and substituting it into the expression above gives the logistic regression loss function:
\min_{w} \sum_i [-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}+b}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}-b}} ]

Another key point is regularization: the larger \|w\| in the expression above, the larger +C, and the smaller the cross-entropy loss.

Take the 1-D logistic regression in the figure below as an example, where w corresponds to the slope of the transition region. Suppose y=0 ('o' in the figure) and y=1 ('x' in the figure) are totally separable. Both the red line (w=2) and the blue line (w=1) fit the data well, but w=2 gives a smaller cross-entropy loss. Without regularization, the larger w (e.g., \to \infty), the smaller the cross-entropy loss, which is clearly not the desired result. We usually add an L1 or L2 regularization term to the loss function to limit the size of w. The complete loss function is as follows.
\min_{w} \sum_i \left[-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}+b}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}-b}} \right] + \lambda \| w\|^2

Alternatively, first average the cross-entropy loss over the N data points and then add the regularization term:
\min_{w} \frac{1}{N}\sum_i \left[-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}+b}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}-b}} \right] + \lambda \| w\|^2

In general, the y=0 and y=1 data are not totally separable, and the cross-entropy loss itself then constrains w to an optimal range.
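A minimal gradient-descent training sketch for the regularized loss above, assuming NumPy and a hypothetical separable 1-D toy dataset; the L2 term is what keeps \|w\| from growing without bound when the classes are perfectly separable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lam=0.01, lr=0.1, steps=2000):
    """Minimize (1/N) * cross-entropy + lam * ||w||^2 by plain gradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(steps):
        h = sigmoid(X @ w + b)
        grad_w = X.T @ (h - y) / N + 2 * lam * w   # gradient of averaged loss + L2 term
        grad_b = np.mean(h - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# hypothetical separable data: 'o' (y=0) on the left, 'x' (y=1) on the right
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w, b = train_logistic(X, y)
print(w, b)
print((sigmoid(X @ w + b) > 0.5).astype(int))      # recovers the labels [0 0 0 1 1 1]
```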

Multi-Class Classification

Generalizing from binary to multi-class classification is straightforward. The 3-class loss function is:

\sum_i \left[-y^0_i\log \frac{e^{ w_0\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}} + e^{ w_2\cdot x_{i}}} - y^1_i \log \frac{e^{ w_1\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}  + e^{ w_2\cdot x_{i}}} - y^2_i \log \frac{e^{ w_2\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}} + e^{ w_2\cdot x_{i}}}\right]

where [y^0_i, y^1_i, y^2_i] is the label of x_i, a one-hot vector, i.e., [1, 0, 0], [0, 1, 0], or [0, 0, 1].

Next, generalize to C classes using the notation of [@roelantsSoftmaxClassification2019]:
\sum_i \sum_{c=1}^C \left[-y^c_i\log \frac{e^{ w_c\cdot x_{i}}}{ \sum_{j=1}^{C} e^{ w_j\cdot x_{i}}} \right]

Because y^c_i is one-hot over the C classes, y_i can be embedded directly into the softmax formula [@rashadAdditiveMargin2020]:
\sum_i -\log \left( \frac{e^{ w_{y_i}\cdot x_{i}}}{ \sum_j e^{ w_j\cdot x_{i}} } \right)

Averaging the expression above over the N data points and adding regularization gives the softmax multi-class classification formulation below. It is widely used for feature classification in deep learning and in the attention networks of transformers.
\min_{w_1, ..., w_C} \frac{1}{N} \sum_i -\log \left( \frac{e^{ w_{y_i}\cdot x_{i}}}{ \sum_j e^{ w_j\cdot x_{i}} } \right) + \lambda {\sum_c \|w_c\|^2}
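A minimal NumPy sketch of this formulation, assuming hypothetical class weight rows w_c stacked into a matrix W and integer class labels y_i:

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))   # row-wise stability shift
    return E / E.sum(axis=1, keepdims=True)

def softmax_loss(W, X, y, lam=0.01):
    """(1/N) * sum_i -log softmax(W x_i)[y_i] + lam * sum_c ||w_c||^2"""
    N = X.shape[0]
    P = softmax_rows(X @ W.T)                      # P[i, c] = e^{w_c.x_i} / sum_j e^{w_j.x_i}
    cross_entropy = -np.mean(np.log(P[np.arange(N), y]))
    return cross_entropy + lam * np.sum(W**2)

# hypothetical 3-class toy example
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                        # N = 6 points, D = 2 features
y = np.array([0, 1, 2, 0, 1, 2])                   # integer labels in place of one-hot vectors
W = rng.normal(size=(3, 2))                        # one row w_c per class
print(softmax_loss(W, X, y))
```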

Likelihood to Cross-Entropy Function

A few key points:

  • The logic of softmax classification is maximum similarity between w_i and the data of class y_i (equivalent to maximum (log-) likelihood = minimum cross-entropy loss), plus regularization. The maximum-similarity logic generalizes naturally to multi-class classification!
  • SVM finds the maximum-margin decision boundary (equivalent to regularization) and minimizes the hinge loss. SVM seems better suited to binary classification, where the decision boundary is easier to define and understand. For multi-class classification it must be decomposed into multiple binary classifications.

  • let g_0(x) = e^{w_0 \cdot x} and g_1(x) = e^{w_1 \cdot x}

  • Normalize the sum: h_0(x) = \frac{g_0(x)}{g_0(x)+g_1(x)} and h_1(x) = \frac{g_1(x)}{g_0(x)+g_1(x)}. h_0 and h_1 are the softmax of w_0 \cdot x and w_1 \cdot x.

  • h_0(x) = \frac{e^{w_0 \cdot x}}{e^{w_0 \cdot x}+ e^{w_1 \cdot x}} = \frac{1}{1+ e^{(w_1 - w_0 )\cdot x}} = \frac{1}{1+ e^{-w \cdot x}} where w = w_0 - w_1

  • 當然, h_1(x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x}+ e^{w_1 \cdot x}} = \frac{1}{1+ e^{(w_0-w_1)\cdot x}} = \frac{1}{1+ e^{+w \cdot x}} = 1 - h_0(x)

  • Adding a bias, w\cdot x \to w \cdot x + b, does not change the conclusions above and yields logistic regression. But can we still include this bias in softmax?

  • If w_1\cdot x > w_0\cdot x \Rightarrow g_1(x) > g_0(x) \Rightarrow h_1(x) > h_0(x) \Rightarrow predict y_1. Here we use the fact that exp and log are monotonic functions. But how do we define the loss function? Let y_1 = 1 = y and y_0 = 0 = 1 - y.

  • What is the loss function? Given x, if you choose y_1 and P(y_1 \mid x) = 1, the choice is correct and the loss is 0; if P(y_1 \mid x) = 0.01, the choice is wrong and the loss should be a big positive number. The magic function that achieves this is -\log P(y_1 \mid x): a wrong decision makes -y \log h_1(x) large.

  • Use the probabilities P(y=0 \mid x) and P(y=1 \mid x),

  • and the loss function is -\log P(\cdot).
    This gives the Logistic Regression (LR) loss function above:
    L = -y \log(h(x)) - (1-y) \log(1-h(x))

Log Base: \log_e(\ln) or \log_2 or \log_{10} ?

The logistic regression and softmax loss functions use the natural log (ln), because of the exp() terms.
The classification cross-entropy loss can use log_2, so that the result is measured in bits.
If the natural log is used instead, the unit is the nat.
Basically log_10 is not used in ML.

Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:

H(P, Q) = -\sum_{x \in X} P(x) \log Q(x)
Where P(x) is the probability of the event x in P, Q(x) is the probability of event x in Q and log is the base-2 logarithm, meaning that the results are in bits. If the base-e or natural logarithm is used instead, the result will have the units called nats.
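A small sketch with hypothetical distributions P and Q showing that the only difference between the two conventions is the log base: the value in nats divided by ln 2 gives the value in bits.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])        # hypothetical "true" distribution
Q = np.array([0.5, 0.3, 0.2])        # hypothetical model distribution

H_bits = -np.sum(P * np.log2(Q))     # cross-entropy in bits (base-2 log)
H_nats = -np.sum(P * np.log(Q))      # cross-entropy in nats (natural log)
print(H_bits, H_nats, np.isclose(H_bits, H_nats / np.log(2)))   # ... True
```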

!!!! The spirit of softmax is to find the similarity between the w vectors and the data, not a decision boundary. That is why it generalizes to multi-class classification !!!

!!! SVM finds a decision boundary. That works fine for binary classification, but multi-class classification becomes awkward.

