Math ML – Multiclass Classification: Softmax, SVM, IVM

Multi-class classification (MCC) is a long-standing problem, and the standard approach is softmax classification. In CNN classification networks, the softmax classifier plays the final role of feature classification. In attention-based transformers and BERT, it plays an even more central role.

In addition, for special applications such as face recognition, various softmax loss functions have made a lot of progress in recent years.
It is worth digging further into the underlying mathematical principles and their variants, ideally with a unified theory-to-applications correspondence!

  • Linearly separable datasets, especially binary datasets: SVM has the advantages of (i) maximum margin; (ii) efficient computation for both training (convex optimization) and inference; (iii) intuitive explainability (support vectors). Logistic regression or softmax has no advantage here and in fact needs regularization to avoid overfitting.
  • Simple nonlinearly separable datasets: kernel SVM has the same advantages as above.
  • Mostly linearly/nonlinearly separable data with some mixed points along the boundary: SVM or kernel SVM still has the advantage.
  • When a certain fraction of the data is mixed and shows a clear trend, especially in multi-class scenarios, the probabilistic approaches (logistic regression or softmax) have the advantage. This is evident in deep learning and transformers.

Structure of this article: first, logistic regression for binary classification, linking minimizing the cross-entropy loss to maximizing the log-likelihood; then, re-derive logistic regression with the softmax function and generalize it to multi-class classification.

Softmax con: too many parameters compared with a convolution layer (W×H ~ 1000×1000 = 1M parameters for a CNN) and (? × ? for a transformer).

| Type | Softmax | SVM | IVM |
| --- | --- | --- | --- |
| Form | exp(w·x) / Σ exp(·) | – | – |
| Probability | Yes | No | Yes |
| Rationale | Max. likelihood / similarity | Max. margin | ? |
| Loss function | Cross-entropy loss (above) | Hinge loss | ? |
| Application | CNN, transformer classification | ? | ? |
| Output, y | 0 or 1 (binary) | +1 or −1 | 0 to 1?? |
| w, b training | w_1 ... w_C | support vectors | import vectors |
| Binary inference (compute) | exp computation | linear computation | exp computation? |
| Storage | N × M | support vectors × N × M? | exp comp? |
| Binary inference (method) | boundary check | boundary check | import vectors? |
| Multi-class (C) inference | C probabilities | C(C−1)/2 binary classifiers | ? |
| Differentiable | Yes | No | Yes |

There are lots of new softmax variants!
https://towardsdatascience.com/additive-margin-softmax-loss-am-softmax-912e11ce1c6b

Loss function of LR and Softmax
[@houLogisticSoftmax2015]

Logistic Regression for Binary Classification

The figure below shows a binary classifier for N data points of dimension D with a linear decision boundary (D = 2 in the figure).

Inference: Logistic Regression by Decision Boundary (Not Generalizable to Multi-Classes!)

The basic idea is to use the sigmoid function h(x) to linearly classify the two possible outcomes y_i \in \{0, 1\} given x_i, i = 1, \ldots, N:

$latex y_i=\left\{\begin{array}{ll}
1 & 1 > h(x_i) > 0.5 \\
0 & 0 < h(x_i) < 0.5
\end{array}\right. $

where h(x) = \frac{1}{1+e^{-(w\cdot x +b)}} and 1-h(x) = \frac{e^{-(w\cdot x +b)}}{1+e^{-(w\cdot x +b)}} = \frac{1}{1+e^{+(w\cdot x +b)}}
The decision boundary, h(x) = 0.5, corresponds to the linear function w\cdot x + b = 0, just like the decision boundary of an SVM.
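A minimal sketch (assuming NumPy and hypothetical parameters w, b) of boundary-check inference: predicting 1 when h(x) > 0.5 is exactly the same as checking on which side of w·x + b = 0 the point falls, so the sigmoid never even needs to be evaluated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_by_boundary(X, w, b):
    # h(x) > 0.5  <=>  w.x + b > 0, so only the sign of the linear score matters
    return (X @ w + b > 0).astype(int)

w, b = np.array([1.0, -2.0]), 0.5              # hypothetical parameters (D = 2)
X = np.array([[3.0, 1.0], [0.0, 2.0]])         # two test points
print(predict_by_boundary(X, w, b))            # [1 0]
print((sigmoid(X @ w + b) > 0.5).astype(int))  # [1 0], identical
```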

Inference: Logistic Regression by Maximal Probability (Generalizable to Multi-Classes!)

We only need to modify the original idea slightly, with y_i \in \{0, 1\}:
$latex y_i=\left\{\begin{array}{ll}
1 & h_1(x_i) > h_0(x_i) \\
0 & h_0(x_i) > h_1(x_i)
\end{array}\right. $

where h_1(x) = h(x) = \frac{1}{1+e^{-(w\cdot x +b)}} and h_0(x) = 1-h(x) = \frac{e^{-(w\cdot x +b)}}{1+e^{-(w\cdot x +b)}} = \frac{1}{1+e^{+(w\cdot x +b)}}
Here only the notion of maximum probability is used; there is no notion of a decision boundary.
For binary classification the two views are equivalent, but the underlying logic is completely different! This is discussed in depth below.

For multi-class classification, maximum-probability classification is simple and clear.
Take 3-class classification as an example; the corresponding output (e.g., a one-hot vector) is
$latex y_i=\left\{\begin{array}{ll}
[1, 0, 0] & h_1(x_i) > h_2(x_i), h_3(x_i) \\
[0, 1, 0] & h_2(x_i) > h_1(x_i), h_3(x_i) \\
[0, 0, 1] & h_3(x_i) > h_1(x_i), h_2(x_i)
\end{array}\right. $
where h_1(x)+h_2(x)+h_3(x)=1 and 0 \le h_i(x) \le 1
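A minimal sketch of maximum-probability inference for this 3-class case: given the class probabilities h_1(x), ..., h_C(x) (non-negative and summing to 1), the predicted one-hot label is simply the argmax.

```python
import numpy as np

def predict_one_hot(h):
    """h = [h_1(x), ..., h_C(x)], non-negative and summing to 1."""
    y = np.zeros_like(h)
    y[np.argmax(h)] = 1.0          # pick the class with maximum probability
    return y

print(predict_one_hot(np.array([0.2, 0.5, 0.3])))   # [0. 1. 0.]
```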

Training/Learning: How to Obtain w and b

Training in machine learning always starts by formulating a loss function to be minimized, i.e., \min \mathcal{L}(w,b); solving this optimization problem gives w, b, or other parameters. Only a few simple optimization problems have closed-form solutions (e.g., linear regression). Most require numerical algorithms, which is exactly what training or learning is.

Maximum Likelihood for the Probabilistic Interpretation of ML

Minimum loss and maximum likelihood are very closely related!
We first review maximum likelihood and then connect it to minimum loss.

Let L(\theta) be a likelihood function, which is basically a probability function P(x\mid\theta). We solve \max L(\theta) = \max P(x\mid\theta), and \theta^* = \text{argmax} L(\theta)

A simple example: n coins with k heads and n−k tails. L(\theta) = P(k \text{ heads}, (n-k) \text{ tails} \mid \theta) = \frac{n!}{k! (n-k)!} {\theta}^k (1-\theta)^{n-k}
\max L(\theta) \Rightarrow \frac{d L(\theta^*)}{d \theta} = 0 \Rightarrow \theta^* = \text{argmax} L(\theta)

k \theta^{k-1} (1-\theta)^{n-k} - (n-k)\theta^k (1-\theta)^{n-k-1} = 0
\to k (1-\theta) = (n-k) \theta \to \theta^* = \frac{k}{n}
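A quick numerical check of \theta^* = k/n, as a sketch with hypothetical counts n = 10, k = 7: evaluating the likelihood on a grid of θ values, the maximizer lands at 0.7.

```python
import numpy as np
from math import comb

n, k = 10, 7                                       # hypothetical coin-flip counts
theta = np.linspace(0.001, 0.999, 999)             # grid of candidate parameters
likelihood = comb(n, k) * theta**k * (1 - theta)**(n - k)
print(theta[np.argmax(likelihood)])                # ~0.7, matching theta* = k/n
```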

Logistic Regression's Maximum Likelihood Interpretation [@roelantsSoftmaxClassification2019]

For an ML problem that has a probabilistic interpretation, can we use the maximum likelihood approach to solve for the parameters? Yes, of course!

Take logistic regression as an example and assume there is an underlying probability distribution governed by the parameters. Let \theta = (w, b) and let L(\theta \mid x, y) be the likelihood function, which can be viewed as the joint probability distribution P(x, y \mid \theta) = P(y \mid x, \theta) P(x\mid\theta). Since P(x\mid\theta) is essentially independent of \theta, L(\theta \mid x, y) can be simplified to P(y \mid x, \theta), i.e., L(\theta \mid x, y) = P(y \mid x, \theta):
P(y=1 \mid x, \theta) = \frac{1}{1+e^{-\theta\cdot x}}
P(y=0 \mid x, \theta) = \frac{1}{1+e^{+\theta\cdot x}}

Since each data point takes exactly one of the two mutually exclusive labels y=0 or y=1, for a given data point (x_i, y_i)
L(\theta) = P(y=y_i \mid x_i, \theta) = \prod_c P(y_c \mid x, \theta) = \left(\frac{1}{1+e^{-\theta\cdot x_i}}\right)^{y_i} \left(\frac{1}{1+e^{+\theta\cdot x_i}}\right)^{1-y_i} where y_i \in \{0, 1\}

In practice we never determine the parameters from a single point. Assume the n data points are mutually independent. The likelihood function is
L(\theta) = \prod_{i=1}^n P(y=y_i \mid x_i, \theta) = \prod_{i=1}^n \prod_c P(y_c \mid x, \theta) = \prod_{i=1}^n \left(\frac{1}{1+e^{-\theta\cdot x_i}}\right)^{y_i} \left(\frac{1}{1+e^{+\theta\cdot x_i}}\right)^{1-y_i}

The usual maximum likelihood: \max L(\theta), and \theta^* = \text{argmax} L(\theta).
In practice, it is more common to minimize the negative log-likelihood, \min \{-\log L(\theta)\},
because log is a monotonic function. Here \log = \log_e = \ln, which can be viewed as canceling the exponential terms.
\theta^* = \text{argmin} \{-\log L(\theta)\} = \text{argmax} L(\theta)

-\log L(\theta) = - \sum_{i=1}^n \log P(y=y_i \mid x_i, \theta) \\ = \sum_{i=1}^n \{ y_i \log(1+e^{-\theta\cdot x_i}) + (1-y_i) \log(1+e^{+\theta\cdot x_i}) \} \\ = \sum_{i=1}^n \{ -y_i \log h(x_i) - (1-y_i)\log(1-h(x_i)) \}

where h(x) is the sigmoid function

Logistic Regression Cross-Entropy Loss

The standard cross-entropy loss function of logistic regression (LR) or softmax is [@brownleeGentleIntroduction2019]:
L = -y \log(h(x)) - (1-y) \log(1-h(x))

Comparing the two, the cross-entropy loss is exactly -log(likelihood): minimizing the cross-entropy loss = maximizing the log-likelihood = maximizing the likelihood!
For people familiar with the notion of entropy, entropy or cross-entropy may feel more intuitive or fundamental. For me, linking the cross-entropy loss to the negative log-likelihood is the easier way to understand it.

The expression above can be extended to a dataset-level loss function and generalized to multi-class classification.
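A minimal sketch (NumPy, hypothetical toy data and parameters) confirming that the summed cross-entropy loss is literally the negative log-likelihood of the Bernoulli model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))              # hypothetical data: N = 8, D = 2
y = rng.integers(0, 2, size=8)           # binary labels
w, b = np.array([0.5, -1.0]), 0.1        # arbitrary parameters
h = sigmoid(X @ w + b)

# cross-entropy loss summed over the data points
cross_entropy = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# negative log of the Bernoulli likelihood  prod_i h^y (1-h)^(1-y)
neg_log_likelihood = -np.log(np.prod(h**y * (1 - h)**(1 - y)))

print(np.isclose(cross_entropy, neg_log_likelihood))   # True
```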

Generalize to Multi-class Classification Using Softmax [@houLogisticSoftmax2015]

How do we generalize logistic regression from binary to multi-class classification? We first re-derive binary classification with softmax (equivalent to logistic regression) and define its loss function, then generalize to multi-class classification.

Softmax Function

What is softmax? Let us first look at what a hard maximum is. Hardmax is an indicator vector of the original vector's maximum position: the entry corresponding to the maximum is 1 and all other entries are 0.
For example, A = [2, 4, 6, 1, 7], hardmax(A) = [0, 0, 0, 0, 1].

Hardmax has several drawbacks:

  • If there are two or more maxima, the result is not unique. For example, A = [2, 4, 6, 6], hardmax(A) = [0, 0, 0, 1] or [0, 0, 1, 0].
  • Most of the information is lost; from hardmax(A) we cannot recover the original information except the position of the maximum.
  • It is not differentiable, so we cannot back-propagate through it.

Softmax was therefore introduced:

  • Step 1: raise all the values of A to exponents. Step 2: normalize them so they sum to 1 (a short code sketch follows this list).
    \sigma(\mathbf{z})_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}} \text { for } j=1, \cdots, K

  • A = [2, 4, 6, 1, 7], softmax(A) = [0.005, 0.035, 0.258, 0.002, 0.7]. The largest value "7" takes 70%, "6" takes 26%, and the remaining values add up to less than 5%. Essentially all of the information is compressed into values between 0 and 1, which can be viewed as probabilities, and the function is differentiable.

  • The probabilistic nature of softmax matches maximum log-likelihood = minimum cross-entropy loss exactly.
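A minimal NumPy sketch of the two steps above (exponentiate, then normalize), with the usual max-subtraction trick so that large inputs do not overflow, and hardmax shown for contrast:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtracting the max does not change the result but avoids overflow
    return e / e.sum()

def hardmax(z):
    y = np.zeros_like(z, dtype=float)
    y[np.argmax(z)] = 1.0          # indicator of the maximum position
    return y

A = np.array([2.0, 4.0, 6.0, 1.0, 7.0])
print(hardmax(A))                  # [0. 0. 0. 0. 1.]
print(softmax(A).round(3))         # ~[0.005 0.035 0.258 0.002 0.701]
```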

Binary Classification

Assume a binary labelled dataset as below (y_i = 0 or 1 given x_i). Without loss of generality we can assume the data have zero mean. The next step is to find the corresponding w_0 and w_1. The idea is simple: w_0 should have maximum "similarity" with the y = 0 data (the "o" in the figure below), and w_1 should have maximum "similarity" with the y = 1 data (the "x" in the figure below).

How do we get maximum similarity? The simplest similarity measure is the inner product, \langle w_0, x\rangle = {w_0 \cdot x} or \langle w_1, x\rangle = {w_1 \cdot x}. A straightforward idea is
\max_{w_0, w_1}\{\sum_{\{i: y_i = 0\}}\langle w_0, x_{i}\rangle + \sum_{\{i: y_i = 1\}}\langle w_1, x_{i}\rangle \}
= \max_{w_0, w_1}\{\sum_i (1-y_i)\langle w_0, x_{i}\rangle + \sum_i y_i\langle w_1, x_{i}\rangle \} where y_i \in \{0, 1\}
= \max_{w_0, w_1} \sum_i [(1-y_i)  (w_0 \cdot x_{i}) + y_i  (w_1 \cdot x_{i}) ] where y_i \in \{0, 1\}

The problem with the expression above is that inner products can be positive or negative, and the two signs cancel each other. We can improve it by exponentiating so that all values become positive, and then normalizing the sum to 1 so that nothing blows up. This is exactly the softmax function: all values lie between 0 and 1, and it replaces the raw inner-product similarity!
\max_{w_0, w_1} \sum_i [(1-y_i)\frac{e^{ w_0\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} + y_i \frac{e^{ w_1\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} ]
= \max_{w_0, w_1} \sum_i [(1-y_i) P(y_i=0 \mid x_i, w_0) + y_i P(y_i=1 \mid x_i, w_1) ]

A further refinement: since the softmax outputs can be viewed as probabilities, we can use the log-probability, i.e., -\log P (natural log), to convert to an entropy. Because of the multiplication by (−1), maximum similarity becomes minimum (cross-) entropy loss. Why use entropy instead of probability? On one hand, entropy carries more "physical" meaning, e.g., minimum cross-entropy loss, the maximum entropy principle, the maximum energy principle. On the other hand, the cross-entropy loss punishes small probabilities: -\log P \to +\infty when P \to 0 (see the earlier figure).

\min_{w_0, w_1} \sum_i [-(1-y_i)\log \frac{e^{ w_0\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} - y_i \log \frac{e^{ w_1\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}} ]

In the special case of binary classification, w_0 and w_1 can be reduced to w=w_1-w_0.
\min_{w} \sum_i [-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}}} ]

Sanity check: w = w_1 - w_0 has high similarity with w_1, i.e., high similarity with \{x_i : y_i = 1\}, so w \cdot x_i = +C \gg 0. The first term above is then 0 (y_i=1) and we only need the second term: -y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}}} = \log(1+e^{-C}) \sim e^{-C} \sim 0
Conversely, w has strong negative similarity with \{x_i : y_i = 0\}, so w \cdot x_i = -C \ll 0. The second term above is then 0 and we only need the first term: -(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}}} = \log(1+e^{-C})\sim e^{-C} \sim 0

In summary, maximum similarity \Rightarrow minimum cross-entropy loss.

The expression above, \min_{w} \sum_i [-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}}} ]
= \min_{w} \sum_i [-(1-y_i)\log (1-h(x_i)) - y_i \log h(x_i) ], where h(x) is the sigmoid function without a bias, because we assumed at the start that the data have zero mean. Adding a bias, i.e., w\cdot x \to w\cdot x + b, and substituting it into the expression above gives the logistic regression loss function:
\min_{w} \sum_i [-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}+b}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}-b}} ]

Another key point is regularization: the larger \|w\| in the expression above, the larger +C, and the smaller the cross-entropy loss.

Take the 1-D logistic regression in the figure below as an example, where w corresponds to the slope of the transition region. Suppose y=0 ('o' in the figure) and y=1 ('x' in the figure) are totally separable. Both the red line (w=2) and the blue line (w=1) fit the data well, but w=2 gives a smaller cross-entropy loss. Without regularization, the larger w (e.g., \to \infty), the smaller the cross-entropy loss, which is clearly not the desired result. We usually add an L1 or L2 regularization term to the loss function to limit the size of w. The complete loss function is as follows.
\min_{w} \sum_i \left[-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}+b}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}-b}} \right] + \lambda \| w\|^2

Alternatively, first average the cross-entropy loss over the N data points and then add the regularization term:
\min_{w} \frac{1}{N}\sum_i \left[-(1-y_i)\log \frac{1}{ 1 + e^{ +w\cdot x_{i}+b}} - y_i \log \frac{1}{1 + e^{ -w\cdot x_{i}-b}} \right] + \lambda \| w\|^2

In general, the y=0 and y=1 data are not totally separable, and the cross-entropy loss itself then constrains w to an optimal range.
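A minimal gradient-descent training sketch for the regularized loss above, assuming NumPy and a hypothetical separable 1-D toy dataset; the L2 term is what keeps \|w\| from growing without bound when the classes are perfectly separable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lam=0.01, lr=0.1, steps=2000):
    """Minimize (1/N) * cross-entropy + lam * ||w||^2 by plain gradient descent."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(steps):
        h = sigmoid(X @ w + b)
        grad_w = X.T @ (h - y) / N + 2 * lam * w   # gradient of averaged loss + L2 term
        grad_b = np.mean(h - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# hypothetical separable data: 'o' (y=0) on the left, 'x' (y=1) on the right
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w, b = train_logistic(X, y)
print(w, b)
print((sigmoid(X @ w + b) > 0.5).astype(int))      # recovers the labels [0 0 0 1 1 1]
```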

Multi-Class Classification

Generalizing from binary to multi-class classification is straightforward. The 3-class loss function is:

\sum_i \left[-y^0_i\log \frac{e^{ w_0\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}} + e^{ w_2\cdot x_{i}}} - y^1_i \log \frac{e^{ w_1\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}}  + e^{ w_2\cdot x_{i}}} - y^2_i \log \frac{e^{ w_2\cdot x_{i}}}{ e^{ w_0\cdot x_{i}} + e^{ w_1\cdot x_{i}} + e^{ w_2\cdot x_{i}}}\right]

where [y^0_i, y^1_i, y^2_i] is the label of x_i, a one-hot vector, i.e., [1, 0, 0], [0, 1, 0], or [0, 0, 1].

Next, generalize to C classes using the notation of [@roelantsSoftmaxClassification2019]:
\sum_i \sum_{c=1}^C \left[-y^c_i\log \frac{e^{ w_c\cdot x_{i}}}{ \sum_{j=1}^{C} e^{ w_j\cdot x_{i}}} \right]

Because y^c_i is one-hot over the C classes, y_i can be embedded directly into the softmax formula [@rashadAdditiveMargin2020]:
\sum_i -\log \left( \frac{e^{ w_{y_i}\cdot x_{i}}}{ \sum_j e^{ w_j\cdot x_{i}} } \right)

Averaging the expression above over the N data points and adding regularization gives the softmax multi-class classification formulation below. It is widely used for feature classification in deep learning and in the attention networks of transformers.
\min_{w_1, ..., w_C} \frac{1}{N} \sum_i -\log \left( \frac{e^{ w_{y_i}\cdot x_{i}}}{ \sum_j e^{ w_j\cdot x_{i}} } \right) + \lambda {\sum_c \|w_c\|^2}
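A minimal NumPy sketch of this formulation, assuming hypothetical class weight rows w_c stacked into a matrix W and integer class labels y_i:

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))   # row-wise stability shift
    return E / E.sum(axis=1, keepdims=True)

def softmax_loss(W, X, y, lam=0.01):
    """(1/N) * sum_i -log softmax(W x_i)[y_i] + lam * sum_c ||w_c||^2"""
    N = X.shape[0]
    P = softmax_rows(X @ W.T)                      # P[i, c] = e^{w_c.x_i} / sum_j e^{w_j.x_i}
    cross_entropy = -np.mean(np.log(P[np.arange(N), y]))
    return cross_entropy + lam * np.sum(W**2)

# hypothetical 3-class toy example
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                        # N = 6 points, D = 2 features
y = np.array([0, 1, 2, 0, 1, 2])                   # integer labels in place of one-hot vectors
W = rng.normal(size=(3, 2))                        # one row w_c per class
print(softmax_loss(W, X, y))
```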

Likelihood to Cross-Entropy Function

A few key points:

  • The logic of softmax classification is maximum similarity between w_i and the data of class y_i (equivalent to maximum (log-) likelihood = minimum cross-entropy loss), plus regularization. The maximum-similarity logic generalizes naturally to multi-class classification!
  • SVM finds the maximum-margin decision boundary (equivalent to regularization) and minimizes the hinge loss. SVM seems better suited to binary classification, where the decision boundary is easier to define and understand. For multi-class classification it must be decomposed into multiple binary classifications.

  • let g_0(x) = e^{w_0 \cdot x} and g_1(x) = e^{w_1 \cdot x}

  • Normalize the sum: h_0(x) = \frac{g_0(x)}{g_0(x)+g_1(x)} and h_1(x) = \frac{g_1(x)}{g_0(x)+g_1(x)}. h_0 and h_1 are the softmax of w_0 \cdot x and w_1 \cdot x.

  • h_0(x) = \frac{e^{w_0 \cdot x}}{e^{w_0 \cdot x}+ e^{w_1 \cdot x}} = \frac{1}{1+ e^{(w_1 - w_0 )\cdot x}} = \frac{1}{1+ e^{-w \cdot x}} where w = w_0 - w_1

  • 當然, h_1(x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x}+ e^{w_1 \cdot x}} = \frac{1}{1+ e^{(w_0-w_1)\cdot x}} = \frac{1}{1+ e^{+w \cdot x}} = 1 - h_0(x)

  • Adding a bias, w\cdot x \to w \cdot x + b, does not change the conclusions above and yields logistic regression. But can we still include this bias in softmax?

  • If w_1\cdot x > w_0\cdot x \Rightarrow g_1(x) > g_0(x) \Rightarrow h_1(x) > h_0(x) \Rightarrow predict y_1. Here we use the fact that exp and log are monotonic functions. But how do we define the loss function? Let y_1 = 1 = y and y_0 = 0 = 1 - y.

  • What is the loss function? Given x, if you choose y_1 and P(y_1 \mid x) = 1, the choice is correct and the loss is 0; if P(y_1 \mid x) = 0.01, the choice is wrong and the loss should be a big positive number. The magic function that achieves this is -\log P(y_1 \mid x): a wrong decision makes -y \log h_1(x) large.

  • Use the probabilities P(y=0 \mid x) and P(y=1 \mid x),

  • and the loss function is -\log P(\cdot).
    This gives the Logistic Regression (LR) loss function above:
    L = -y \log(h(x)) - (1-y) \log(1-h(x))

Log Base: \log_e(\ln) or \log_2 or \log_{10} ?

The logistic regression and softmax loss functions use the natural log (ln), because of the exp() terms.
The classification cross-entropy loss can use log_2, so that the result is measured in bits.
If the natural log is used instead, the unit is the nat.
Basically log_10 is not used in ML.

Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:

H(P, Q) = -\sum_{x \in X} P(x) \log Q(x)
Where P(x) is the probability of the event x in P, Q(x) is the probability of event x in Q and log is the base-2 logarithm, meaning that the results are in bits. If the base-e or natural logarithm is used instead, the result will have the units called nats.
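A small sketch with hypothetical distributions P and Q showing that the only difference between the two conventions is the log base: the value in nats divided by ln 2 gives the value in bits.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])        # hypothetical "true" distribution
Q = np.array([0.5, 0.3, 0.2])        # hypothetical model distribution

H_bits = -np.sum(P * np.log2(Q))     # cross-entropy in bits (base-2 log)
H_nats = -np.sum(P * np.log(Q))      # cross-entropy in nats (natural log)
print(H_bits, H_nats, np.isclose(H_bits, H_nats / np.log(2)))   # ... True
```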

!!!! The spirit of softmax is to find the similarity between the w vectors and the data, not a decision boundary. That is why it generalizes to multi-class classification !!!

!!! SVM finds a decision boundary. That works fine for binary classification, but multi-class classification becomes awkward.

