Featured

My First Blog Post

Be yourself; Everyone else is already taken.
— Oscar Wilde.

This is the first post on my new blog. I’m just getting this new blog going, so stay tuned for more. Subscribe below to get notified when I post new updates.

Move Math Article to Github.io

我寫 blog 目的是把自己讀過或做過的工程，物理，資訊，語言做一些紀錄，間或做一些整理便於之後快速查詢。最近 (2021) 看到蘇劍林似乎很早就在經營科學空間 https://spaces.ac.cn 做的非常好。其中一些文章對我很有幫助。

我也想把之前或之後的紀錄和心得能夠更好的呈現，主要還是為了自己的整理和查詢。當然如果能有助於他人更好。因此找到一個適合呈現 math equation, figure, 並且可以長時間存在的 blog 平台就很重要。

Blogger (Google) 和 WordPress (Open) 對於 math equation 很多問題。 Medium 我沒試過，但我不喜歡付費才能閱讀。最近 (2020) 試了一下 Jekyll (markdown to html) + Next theme + Github.io (Blog platform, now Microsoft) 似乎是不錯的組合。產生的 math equation 最接近我想要的呈現。因此我會把一些文章逐步搬到 Github.io: https://allenlu2009.github.io 特此為誌。

Math AI: Maximum Likelihood Estimation (MLE) Evolve To EM Algorithm For Incomplete/Hidden Data

Math AI: Maximum Likelihood Estimation (MLE) Evolve To EM Algorithm For Incomplete/Hidden Data To Variational Bayesian Inference

Major Reference

[@poczosCllusteringEM2015]
[@matasExpectationMaximization2018] good reference
[@choyExpectationMaximization2017]
[@tzikasVariationalApproximation2008] excellent introductory paper

Maximum Likelihood Estimation 和應用

Maximum likelihood estimation (MLE) 最大概似估計是一種估計模型參數的方法。適用時機在於手邊有模型，但是模型參數有無限多種，透過真實觀察到的樣本資訊，想辦法導出最有可能產生這些樣本結果的模型參數，也就是挑選使其概似性(Likelihood)最高的一組模型參數，這系列找參數的過程稱為最大概似估計法。

Bernoulli distribution：投擲硬幣正面的機率 $\theta$ , 反面的機率 $1-\theta$ . 連續投擲的正面/反面的次數分別是 H/T. Likelihood function 為

$f(\theta, H, T)=\theta^{H}(1-\theta)^{T}$

MLE 在無限個 $\theta$ 中，找到一個使概似性最大的 $\theta$ , i.e. $\widehat{\theta}_{\mathrm{MLE}} =\arg \max _{\theta} {\theta^{H}(1-\theta)^{T}}$

只要 likelihood function 一次微分，可以得到

$\widehat{\theta}_{M L E}=\frac{H}{T+H}$

就是平均值，推導出來的模型參數符合直覺。

Normal distribution：假設 mean unknown, variance known, 我們可以用 maximum log-likelihood function

$\underset{\mu}{\operatorname{argmax}} f\left(x_{1}, \ldots, x_{n}\right) \Rightarrow \underset{\mu}{\operatorname{argmax}} \log f\left(x_{1}, \ldots, x_{n}\right)$

$latex \begin{aligned}
&\frac{\mathrm{d}}{d \mu}\left(\sum_{i=1}^{n}-\frac{\left(x_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right)=\sum_{\mathrm{i}=1}^{\mathrm{n}} \frac{\left(x_{i}-\hat{\mu}\right)}{\sigma^{2}}=\sum_{i=1}^{n} x_{i}-n \hat{\mathrm{u}}=0 \\
&\hat{\mu}=\overline{\mathrm{X}}=\frac{\sum_{i=1}^{n} x_{i}}{n}
\end{aligned} $

微分的結果告訴我們，樣本的平均值，其實就是母體平均值 $\mu$ 最好的估計！又是一個相當符合直覺的答案，似乎 MLE 只是用來驗證直覺的工具。

這是一個錯覺，常見的 distribution (e.g. Bernoulli, normal distribution) 都是 exponential families. 可以證明 maximum log-likelihood functions of exponential families 都是 concave function, 沒有 local minimum. 非常容易用數值方法找到最佳解，而且大多有 analytical solution.

但只要 distribution function 更複雜一點，例如兩個 normal distribution weighted sum to 1, MLE 就非常難解。這是 EM algorithm 為什麼非常有用。

Maximum Likelihood Estimation (MLE) vs. EM

EM seems to be a special case for MLE?
- Yes! If the problem can be formulated as MLE of incomplete/hidden data. Then EM algorithm 的 E-step is guessing incomplete/hidden data; M-step 就對應 MLE with some modification (見本文後段)。
Sometimes MLE is very hard to solve because of too many parameters to estimation (mean, variance, initial distribution in GMM; or initial distribution, transition probability, emission probability in HMM); or missing data or missing label!!
- Too many parameters can be seen as incomplete/hidden data. see the following GMM example. It can be regards as missing label.
In the above case, EM is a specialized MLE with iterative approach to solve the MLE.
- Yes. Specifically, the M step is MLE.
EM ~~ iterative MLE (maybe local minimum)
EM is more general than MLE because it can apply to parameters and missing data?
- The same thing, unknown parameters can be regarded as missing information/data. In the M step, we guess the missing data. In the E step, we use MLE to compute it, and so on.

Toy Example [@matasExpectationMaximization2018]

前提摘要

一個簡單例子觀察 temperature and amount of snow (溫度和雪量, both are binary input) 的 joint probability depending on two "scalar factors" $a$ and $b$ as $p(t, s | a, b)$

	$s_0$	$s_1$
$t_0$	$a$	$5a$
$t_1$	$3b$	$b$

注意 $a$ and $b$ are parameters, 不是 conditional probability.
另外因為機率和為 1 做為一個 constraint: $6a + 4b = 1$

問題一: MLE

一個 ski-center 觀察 $N$ 天的溫度和雪量得到以下的統計， $N_{ij} \in \mathbf{I}$ , 如何估計 $a$ and $b$ ?

	$s_0$	$s_1$
$t_0$	$N_{00}$	$N_{01}$
$t_1$	$N_{10}$	$N_{11}$

Likelihood function (就是 joint pdf of $N$ repeat experiments)

$P(\mathcal{T} \mid a, b)= C a^{N_{00}}(5 a)^{N_{01}}(3 b)^{N_{10}}(b)^{N_{11}}$

where $C = (\Sigma N_{ij})! / \Pi (N_{ij}!)$ 是 MLE 無關的常數

問題改成 maximum log-likelihood with constraint and $C' = \ln C$

$L(a, b, \lambda) = C' + N_{00} \ln a+N_{01} \ln 5 a+N_{10} \ln 3 b+N_{11} \ln b+\lambda(6 a+4 b-1)$

$latex \begin{gathered}
\frac{\partial L}{\partial a}=N_{00} \frac{1}{a}+N_{01} \frac{1}{a}+6 \lambda=0 \\
\frac{\partial L}{\partial b}=N_{10} \frac{1}{b}+N_{11} \frac{1}{b}+4 \lambda=0 \\
\frac{\partial L}{\partial \lambda}=6 a+4 b – 1 = 0
\end{gathered} $

上述方程式的解為
$a=\frac{N_{00}+N_{01}}{6 N} \quad b=\frac{N_{10}+N_{11}}{4 N} \quad \lambda = -(N_{00}+N_{01}+N_{10}+N_{11})=-N$

結果很直觀。其實就是利用大數法則： $a\cdot N \sim N_{00}; 5a\cdot N\sim N_{01}; 3b\cdot N\sim N_{10}; b\cdot N\sim N_{11}$
再來大數法則 (a+5a)N~N00+N01; (3b+b)N~N10+N11 => a = .. ; b = …

問題二 incomplete/hidden Data

假設我們無法觀察到完整的"溫度和雪量“；而是“溫度或雪量”，有時“溫度”，有時“雪量”，但不是同時。對應的不是 joint pdf, 而是 marginal pdf 如下：

觀察如下：

The Lagrangian (log-likelihood with constraint)

$L(a, b, \lambda)=T_{0} \ln 6 a+T_{1} \ln 4 b+S_{0} \ln (a+3 b)+S_{1} \ln (5 a+b)+\lambda(6 a+4 b-1)$

此時的方程式比起之前複雜的多，不一定有 close-form solution:

$latex \begin{gathered}
\frac{\partial L}{\partial a}=\frac{T_{0}}{a}+\frac{S_{0}}{a+3 b}+\frac{5 S_{1}}{5 a+b}+6 \lambda=0 \\
\frac{\partial L}{\partial b}=\frac{T_{1}}{b}+\frac{3 S_{0}}{a+3 b}+\frac{S_{1}}{5 a+b}+4 \lambda=0 \\
6 a+4 b=1
\end{gathered} $

如果用大數法則：

$6a \cdot(T_0+T_1) \sim T0; \, 4b\cdot(T_0+T_1) \sim T_1$
$(a+3b) \cdot (S_0+S_1)\sim S_0; \, (5a+b)\cdot(S_0+S_1) \sim S_1$
注意不論 1. or 2. 都滿足 $6a+4b = 1$ constraint, 可以用來估計 $a$ and $b$ .
問題是我們要用那一組 $(a, b)$ ? 單獨用一組都會損失一些 information, 應該要 combine 1 and 2 的 information, how?

思路一 平均 (a, b) from 1 and 2. 但這不是好的策略，因為平均 (a,b) 不一定滿足 constraint. 在這個 case 因為 linear constraint, 所以平均 (a,b) 仍然滿足 constraint. 但對於更複雜 constraint, 平均並非好的方法。

更重要的是平均並無法代表 maximum likelihood in the above equation. 我們的目標是 maximum likelihood, 平均 (a, b) 完全無法保證會得到更好的 likelihood value!

或者把 (a,b) from 1 or 2 代入上述 likelihood function 取大值。顯然這也不是最好的策略。因為一半的資訊被捨棄了。

思路二 比較好的方法是想辦法用迭代法解微分後的 Lagrange multiplier 聯立方程式。 (a, b) from 1. or 2. 只作為 initial solution, 想辦法從聯立方程式找出 iterative formula. 這似乎是對的方向，問題是 Lagrange multiplier optimization 是解聯立(level 1)微分方程式。不一定有 close form as in this example. 同時也無法保證收斂。另外如何找出 iterative formula 似乎是 case-by-case, 沒有一致的方式。
=> iterative solution is one of the key, but NOT on Lagrange multiplier (level 1)

思路三 既然是 missing data, 我們是否可以假設 $(a, b) \to$ fill missing data $\to$ update $(a, b) \to$ update missing data $\cdots$ 具體做法
$N_{00} = T_0 \cdot \frac{1}{6} + S_0 \cdot \frac{a}{a+3b}$
$N_{01} = T_0 \cdot \frac{5}{6} + S_1 \cdot \frac{5a}{5a+b}$
$N_{10} = T_1 \cdot \frac{3}{4} + S_0 \cdot \frac{3b}{a+3b}$
$N_{11} = T_1 \cdot \frac{1}{4} + S_1 \cdot \frac{b}{5a+b}$

有了 $N_{00},N_{01},N_{10},N_{11}$ 可以重新估計 $(a, b)$ using joint pdf

$a'=\frac{N_{00}+N_{01}}{6 N} \quad b'=\frac{N_{10}+N_{11}}{4 N}$

Q: 如何證明這個方法是最佳或是對應 complete data MLE or incomplete/hidden data MLE? 甚至會收斂？

EM algorithm 邏輯

前提摘要

GMM 特例：Estimate Means of Two Gaussian Distributions (known variance and ratio; unknown means)

We measure lengths of vehicles. The observation space is two-dimensional, with $x$ capturing vehicle type (binary) and $y$ capturing length (Gaussian).

$p(x, y)$ $x\in$ {car, truck}, $y \in \mathbb{R}$

$p(\text {car}, y)=\pi_{\mathrm{c}} \mathcal{N}\left(y \mid \mu_{\mathrm{c}}, \sigma_{\mathrm{c}}=1\right)=\kappa_{\mathrm{c}} \exp \left\{-\frac{1}{2}\left(y-\mu_{\mathrm{c}}\right)^{2}\right\},\left(\kappa_{\mathrm{c}}=\frac{\pi_{\mathrm{c}}}{\sqrt{2 \pi}}\right)$

$p(\text {truck,} y)=\pi_{\mathrm{t}} \mathcal{N}\left(y \mid \mu_{\mathrm{t}}, \sigma_{\mathrm{t}}=2\right)=\kappa_{\mathrm{t}} \exp \left\{-\frac{1}{8}\left(y-\mu_{\mathrm{t}}\right)^{2}\right\},\left(\kappa_{\mathrm{t}}=\frac{\pi_{\mathrm{t}}}{\sqrt{8 \pi}}\right)$

where $\pi_c + \pi_t = 1$

問題三 Complete Data (Easy case)

$T=\{\underbrace{\left(\operatorname{car}, y_{1}^{(c)}\right),\left(\operatorname{car}, y_{2}^{(c)}\right), \ldots,\left(\operatorname{car}, y_{C}^{(c)}\right)}_{C \text { car observations }}, \underbrace{\left(\text {truck}, y_{1}^{(\mathrm{t})}\right),\left(\text {truck}, y_{2}^{(\mathrm{t})}\right), \ldots,\left(\text {truck}, y_{T}^{(\mathrm{t})}\right)}_{T \text { truck observations }}\}$

Log-likelihood

$\ell(\mathcal{T})=\sum_{i=1}^{N} \ln p\left(x_{i}, y_{i} \mid \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)=C \ln \kappa_{\mathrm{c}}-\frac{1}{2} \sum_{i=1}^{C}\left(y_{i}^{(c)}-\mu_{\mathrm{c}}\right)^{2}+T \ln \kappa_{\mathrm{t}}-\frac{1}{8} \sum_{i=1}^{T}\left(y_{i}^{(\mathrm{t})}-\mu_{\mathrm{t}}\right)^{2}$

很容易用 MLE 估計 $\mu_1, \mu_2$

$\frac{\partial \ell(\mathcal{T})}{\partial \mu_{\mathrm{c}}}=\sum_{i=1}^{C}\left(y_{i}^{(\mathrm{c})}-\mu_{\mathrm{c}}\right)=0 \quad \Rightarrow \quad \mu_{\mathrm{c}}=\frac{1}{C} \sum_{i=1}^{C} y_{i}^{(c)}$

$\frac{\partial \ell(\mathcal{T})}{\partial \mu_{\mathrm{t}}}=\frac{1}{4} \sum_{i=1}^{T}\left(y_{i}^{(\mathrm{t})}-\mu_{\mathrm{t}}\right)=0 \quad \Rightarrow \quad \mu_{\mathrm{t}}=\frac{1}{T} \sum_{i=1}^{T} y_{i}^{(\mathrm{t})}$

直觀上很容易理解。如果 observations 已經分組，求 mean 只要做 sample 的平均即可。

以這個例子，ratio $\pi_c, \pi_t$ 不論已知或未知，都不影響結果。

問題四 incomplete/hidden Data

$latex \mathcal{T}=\{\left(\operatorname{car}, y_{1}^{(c)}\right), \ldots,\left(\operatorname{car}, y_{C}^{(c)}\right),\left(\text {truck}, y_{1}^{(\mathrm{t})}\right), \ldots,\left(\text {truck}, y_{T}^{(\mathrm{t})}\right), \underbrace{\left(\bullet, y_{1}^{\bullet}\right), \ldots,\left(\bullet, y_{M}^{\bullet}\right)}_{\begin{array}{l}
\text { data with uknown } \\
\text { vehicle type }
\end{array}}\} $

$p\left(y^{\bullet}\right)=p\left(\text {car}, y^{\bullet}\right)+p\left(\text {truck}, y^{\bullet}\right)$

Log-likelihood

$latex \ell(\mathcal{T})=\sum_{i=1}^{N} \ln p\left(x_{i}, y_{i} \mid \mu_{c}, \mu_{\mathrm{t}}\right)=\overbrace{C \ln \kappa_{\mathrm{c}}-\frac{1}{2} \sum_{i=1}^{C}\left(y_{i}^{(c)}-\mu_{\mathrm{c}}\right)^{2}+T \ln \kappa_{\mathrm{t}}-\frac{1}{8} \sum_{i=1}^{T}\left(y_{i}^{(\mathrm{t})}-\mu_{\mathrm{t}}\right)^{2}}^{\text {same term as before }} \\
+\sum_{i=1}^{M} \ln \left(\kappa_{\mathrm{c}} \exp \left\{-\frac{1}{2}\left(y_{i}^{\bullet}-\mu_{\mathrm{c}}\right)^{2}\right\}+\kappa_{\mathrm{t}} \exp \left\{-\frac{1}{8}\left(y_{i}^{\bullet}-\mu_{\mathrm{t}}\right)^{2}\right\}\right) $

不用微分也知道非常難解 MLE. 我們必須用另外的方法，就是 EM 算法。
不過我們還是微分一下，得到更多的 insights.

$latex \begin{aligned}
0=\frac{\partial \ell(\mathcal{T})}{\partial \mu_{\mathrm{c}}} &=\sum_{i=1}^{C}\left(y_{\mathrm{c}}^{(\mathrm{c})}-\mu_{\mathrm{c}}\right) \\
&+ \sum_{i=1}^{M} \overbrace{\frac{\kappa_{\mathrm{c}} \exp \left\{-\frac{1}{2}\left(y_{i}^{\bullet}-\mu_{\mathrm{c}}\right)^{2}\right\}}{\kappa_{\mathrm{c}} \exp \left\{-\frac{1}{2}\left(y_{i}^{\bullet}-\mu_{\mathrm{c}}\right)^{2}\right\}+\kappa_{\mathrm{t}} \exp \left\{-\frac{1}{8}\left(y_{i}^{\bullet}-\mu_{\mathrm{t}}\right)^{2}\right\}}}^{p\left(\operatorname{car} \mid y_{i}^{\bullet}, \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)}\left(y_{i}^{\bullet}-\mu_{\mathrm{c}}\right)
\end{aligned} $

$0=4 \frac{\partial \ell(\mathcal{T})}{\partial \mu_{\mathrm{t}}}=\sum_{i=1}^{T}\left(y_{i}^{(\mathrm{t})}-\mu_{\mathrm{t}}\right)+\sum_{i=1}^{M} p\left(\text {truck} \mid y_{i}^{\bullet}, \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)\left(y_{i}^{\bullet}-\mu_{\mathrm{t}}\right)$

上兩式非常有物理意義。基本是 easy case 的延伸：已知分類的平均值，加上未知分類的機率平均值。一個簡單的方法是只取前面已知的部分平均，不過這不是最佳，因為丟失部分的資訊。

Missing Values, EM Approach

重新 summarize optimality conditions

$\sum_{i=1}^{C}\left(y_{i}^{(c)}-\mu_{c}\right)+\sum_{i=1}^{M} p\left(\operatorname{car} \mid y_{i}^{\bullet}, \mu_{c}, \mu_{\mathrm{t}}\right)\left(y_{i}^{\bullet}-\mu_{\mathrm{c}}\right)=0$

$\sum_{i=1}^{T}\left(y_{i}^{(\mathrm{t})}-\mu_{\mathrm{t}}\right)+\sum_{i=1}^{M} p\left(\text {truck } \mid y_{i}^{\bullet}, \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)\left(y_{i}^{\bullet}-\mu_{\mathrm{t}}\right)=0$

如果 $p(\text {truck} \mid y_{i}^{\bullet}, \mu_c, \mu_t)$ 和 $p(\text {car} \mid y_{i}^{\bullet}, \mu_c, \mu_t)$ 已知，上式非常容易解 $\mu_c$ and $\mu_t$ 。實際這是一個雞生蛋、蛋生雞的問題，因為這兩個機率又和 $\mu_c$ and $\mu_t$ 相關。

EM algorithm 剛好用來打破這個迴圈。

Let $z_i \,(i=1, 2, \cdots, M), z_i \in \text{\{car, truck\}}$ denote the missing data. Define $q\left(z_{i}\right)=p\left(z_{i} \mid y_{i}^{\bullet}, \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)$
上述 optimality equations 可以得到

$\mu_{\mathrm{c}}=\frac{\sum_{i=1}^{C} y_{i}^{(\mathrm{c})}+\sum_{i=1}^{M} q\left(z_{i}=\mathrm{car}\right) y_{i}^{\bullet}}{C+\sum_{i=1}^{M} q\left(z_{i}=\mathrm{car}\right)}$

$\mu_{\mathrm{t}}=\frac{\sum_{i=1}^{T} y_{i}^{(\mathrm{t})}+\sum_{i=1}^{M} q\left(z_{i}=\text { truck }\right) y_{i}^{\bullet}}{T+\sum_{i=1}^{M} q\left(z_{i}=\text { truck }\right)}$

EM Algorithm 可以用以下四步驟表示

Initialize $\mu_c$ , $\mu_t$
Compute $q\left(z_{i}\right)=p\left(z_{i} \mid y_{i}^{\bullet}, \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)$ for $i = 1, 2, \cdots, M$
Recompute $\mu_c$ , $\mu_t$ according to the above equations.
If termination condition is met, finish. Otherwise, goto 2.

上述步驟 2 稱為 Expectation (E) Step, 步驟 3 稱為 Maximization (M) Step. 統稱為 EM algorithm.

Q. Why Step 2 稱為 Expectation? not clear. Maximization 比較容易理解，因為 optimality condition 就是 maximization (微分為 0).

In summary, EM algorithm 的一個關鍵點是：讓 incomplete/hidden data 變成 complete (Expectation?). 有了完整的 data, 就容易用 MLE 找到 maximal likelihood estimation ( $\mu_c$ and $\mu_t$ in this case).

Clustering: Soft Assignment Vs. Hard Assignment (K-means)

EM Algorithm Derivation

EM algorithm 如果只是 heuristic algorithm, 可能有用度減半。這裡討論數學上的 justification. 先定義 terminologies

$\mathbf{o}$ : all observed values
$\mathbf{z}$ : all unobserved values (i.e. hidden variable)
$\mathbf{\theta}$ : model parameters to be estimated

目標：Find $\theta^*$ using the Maximum Likelihood approach

$\boldsymbol{\theta}^{*}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \ell(\boldsymbol{\theta})=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \ln p(\mathbf{o} \mid \boldsymbol{\theta})$

思路：假設解下列完整 data 很容易解 (例如上面的 exmple xx and xx)

$\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})$

我們的想法是把上上式改寫成上式，再加以優化
$latex \begin{aligned}
\ln p(\mathbf{o} \mid \boldsymbol{\theta}) &=\ln \sum_{\mathbf{z}} p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}) \\
&=\ln \sum_{\mathbf{z}} q(\mathbf{z}) \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}
\end{aligned} $

這裡引入 $q(\mathbf{z})$ distribution with $\sum_{\mathbf{z}} q(\mathbf{z})=1$ . 上式利用 Jensen's inequality 可以導出
$\geq \sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}$

Log-Likelihood with Hidden Variable Lower Bound

我們定義 $\ln p(\mathbf{o} \mid \boldsymbol{\theta})$ 的 lower bound $\mathcal{L}(q, \boldsymbol{\theta})$ , for any distribution $q$ .

$\mathcal{L}(q, \boldsymbol{\theta})=\sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}$

這已經非常接近思路！

Maximizing $\mathcal{L}(q, \boldsymbol{\theta})$ by choosing $q(z)$ 就可以 push the log likelihood $\ln p(\mathbf{o} \mid \boldsymbol{\theta})$ upwards.

反過來我們可以計算和 lower bound 之間的 gap.

$latex \begin{aligned}
\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\mathcal{L}(q, \boldsymbol{\theta}) &=\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} \\
&=\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\sum_{\mathbf{z}} q(\mathbf{z})\{\ln \underbrace{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}_{p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) p(\mathbf{o} \mid \boldsymbol{\theta})}-\ln q(\mathbf{z})\} \\
&=\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\sum_{\mathbf{z}} q(\mathbf{z})\{\ln p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})+\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\ln q(\mathbf{z})\} \\
&=\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\underbrace{\sum_{\mathbf{z}} q(\mathbf{z})}_{1} \ln p(\mathbf{o} \mid \boldsymbol{\theta})-\sum_{\mathbf{z}} q(\mathbf{z})\{\ln p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})-\ln q(\mathbf{z})\} \\
&=-\sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})}{q(\mathbf{z})} \\
&= D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) ) \ge 0
\end{aligned} $

這個 gap 就是 KL divergence between $q(\mathbf{z})$ and $p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})$ , 永遠大於 0.
如果能找到 $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})$ ，就可以讓 gap 變成 0.

EM Algorithm Push the Lower Bound Upwards

Log likelihood function 可以分為兩個部分： lower bound + gap
$\ln p(\mathbf{o} \mid \boldsymbol{\theta})=\mathcal{L}(q, \boldsymbol{\theta})+D_{\mathrm{KL}}(q \| p)$

從 Jensen's inequality 得到 $\mathcal{L}(q, \boldsymbol{\theta})$ 是 lower bound. 同樣從 KL divergence $\ge$ 0 再度驗證。

如果 $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})$ , the bound is tight.

Example 1: Hidden variable $z$ does NOT provide any information of $o$

假設 $\mathbf{o}$ 和 $\mathbf{z}$ 完全無關， $p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) = p(\mathbf{z} \mid \boldsymbol{\theta})$ . We can make $q(\mathbf{z}) = p(\mathbf{z} \mid \boldsymbol{\theta})$
such that $D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})) = 0$ , 也就是 gap = 0.

Lower bound 就變成原來的 log-likelihood function.
$latex \begin{aligned}
\mathcal{L}(q, \boldsymbol{\theta}) &= \sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}\\
&= \sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o} \mid \boldsymbol{\theta}) p(\mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} \\
&= \sum_{\mathbf{z}} q(\mathbf{z}) \ln p(\mathbf{o} \mid \boldsymbol{\theta})\\
&= \ln p(\mathbf{o} \mid \boldsymbol{\theta})
\end{aligned} $

Example 2: $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})$

$latex \begin{aligned}
\mathcal{L}(q, \boldsymbol{\theta}) &= \sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}\\
&= \sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) p(\mathbf{o} \mid \boldsymbol{\theta})}{q(\mathbf{z})} \\
&= \sum_{\mathbf{z}} q(\mathbf{z}) \ln p(\mathbf{o} \mid \boldsymbol{\theta})\\
&= \ln p(\mathbf{o} \mid \boldsymbol{\theta})
\end{aligned} $

What is the mutual information of $o$ and $z$ in this case?

EM 具體步驟

Lower bound $\mathcal{L}(q, \boldsymbol{\theta})$

$\mathcal{L}(q, \boldsymbol{\theta})=\sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})}$

Lower bound 其實可以視為 MLE of complete data，對應 EM algorithm 的 M step.
Gap 可以視為從 observables 推論出 unobservables, i.e. incomplete/hidden data, 對應 EM algorithm 的 E step.

Lower bound 包含兩個部分： (1) log-likelihood of complete data (both $o$ and $z$ ) MLE for $\theta$ estimation; (2) $q$ distribution.

這兩個部分剛好對應 EM algorithm 的 E-step and M-step.

Initialize $\boldsymbol{\theta}=\boldsymbol{\theta}^{(0)}$
E-step (Expectation): $q^{(t+1)}=\underset{q}{\operatorname{argmax}} \mathcal{L}\left(q, \boldsymbol{\theta}^{(t)}\right)$
M-step (Maximization): $\boldsymbol{\theta}^{(t+1)}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \mathcal{L}\left(q^{(t+1)}, \boldsymbol{\theta}\right)$

M-step: $q^{(t+1)}$ is fixed

我們先看 M-step, 因為這和 MLE estimate $\theta$ 非常相似。

$\theta^{(t+1)}=\underset{\theta}{\operatorname{argmax}} \mathcal{L}\left(q^{(t+1)}, \theta\right)$

$latex \begin{aligned}
\mathcal{L}\left(q^{(t+1)}, \boldsymbol{\theta}\right) &=\sum_{\mathbf{z}} q^{(t+1)}(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q^{(t+1)}(\mathbf{z})} \\
&=\sum_{\mathbf{z}} q^{(t+1)}(\mathbf{z}) \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})-\underbrace{\sum_{\mathbf{z}} q^{(t+1)}(\mathbf{z}) \ln q^{(t+1)}(\mathbf{z})}_{\text {const. }}
\end{aligned} $

$\boldsymbol{\theta}^{(t+1)}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \sum_{\mathbf{z}} q^{(t+1)}(\mathbf{z}) \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}^{(t)})$

注意 M-Step 和完整 data 的 MLE 思路如下非常接近，只加了對 $q(z)$ 的 weighted sum.
$\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})$

上式微分等於 0 就可以解 $\theta^{t+1}$ 。上面問題四以及問題二就是很好的例子。

另一個常見的寫法 $\boldsymbol{\theta}^{(t+1)}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} E_{q(z)} \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}^{(t)})$

注意 M-Step 是 maximize lower bound, 並不等於 maximize 不完整 data 的 MLE，因為還差了一個 gap function (i.e. KL divergence), which is also $\theta$ dependent.

E-step: $\boldsymbol{\theta}^{(t)}$ is fixed

$q^{(t+1)}=\underset{q}{\operatorname{argmax}} \mathcal{L}\left(q, \boldsymbol{\theta}^{(t)}\right)$

$\mathcal{L}\left(q, \boldsymbol{\theta}^{(t)}\right)=\underbrace{\ln p\left(\mathbf{o} \mid \boldsymbol{\theta}^{(t)}\right)}_{\text {const. }}-D_{\mathrm{KL}}(q \| p)$

以上 KL divergence 大於等於 0，所以 maximize lower bound 就要讓要選擇 $q(z)$ 儘量縮小 gap (i.e. KL divergence) 到 0. 前面提到等於 0 的條件就是
$q^{(t+1)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}^{(t)})$

同樣 E-Step 深具物理意義，就是猜 incomplete/hidden data distribution based on 已知的 observables 和 iterative $\theta$ .

例如問題四 E-Step 就是計算 $q\left(z_{i}\right)=p\left(z_{i} \mid y_{i}^{\bullet}, \mu_{\mathrm{c}}, \mu_{\mathrm{t}}\right)$ for $i = 1, 2, \cdots, M$

Conditional Vs. Joint Distribution

我們可以把 conditional distribution 改成 joint distribution 如下。兩者都可以用來解 E-Step.

$p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}^{(t)}) = p(\mathbf{z}, \mathbf{o} \mid \boldsymbol{\theta}^{(t)}) / p(\mathbf{o} \mid \boldsymbol{\theta}^{(t)})$

結合 E-Step and M-Step 的另一種寫法 (這是 EM 的精髓，假設可以有一個 close from Q function, e.g. GMM)

In summary: M-Step maximize the lower bound; E-Step close the gap
E-Step
$q^{(t+1)}(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}^{(t)})$

M-Step
$\boldsymbol{\theta}^{(t+1)}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} E_{q(z)} \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}^{(t)})$

Define new E-Step (plug E-Step to M-Step)
$latex \begin{aligned}
Q(\theta^{t+1} | \theta^{t}) &= \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}^{(t)}) \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}^{t+1}) \\
&= \int d \mathbf{z} \, p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}^{(t)}) \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}^{t+1})
\end{aligned} $

new M Step
$\boldsymbol{\theta}^{(t+1)}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} Q(\theta^{t+1} | \theta^{t})$

Free Energy Interpretation [@poczosCllusteringEM2015]

搞 machine learning 很多是物理學家 (e.g. Max Welling), 習慣用物理觀念套用於 machine learning. 常見的例子是 training 的 momentum method. 另一個是 energy/entropy loss function. 此處我們看的是類似 energy loss function.

我們從 gap 開始

$\ln p(\mathbf{o} \mid \boldsymbol{\theta})-\mathcal{L}(q, \boldsymbol{\theta}) = D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) ) \ge 0$

$latex \begin{aligned}
\ln p(\mathbf{o} \mid \boldsymbol{\theta}) &= \mathcal{L}(q, \boldsymbol{\theta}) + D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) ) \\
&= \sum_{\mathbf{z}} q(\mathbf{z}) \ln \frac{p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta})}{q(\mathbf{z})} + D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) ) \\
&= \sum_{\mathbf{z}} q(\mathbf{z}) \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}) + \sum_{\mathbf{z}} -q(\mathbf{z}) \ln {q(\mathbf{z})}+ D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) ) \\
&= E_{q(z)} \ln p(\mathbf{o}, \mathbf{z} \mid \boldsymbol{\theta}) + H(q) + D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta}) ) \\
\end{aligned} $

where H(q) is the entropy of q, 第一項是負的，第二項和第三項是正的。
我們用一個例子來驗證
q = {0 or 1} with 50% chance, =>
H(q) = 1 (bit) or ln (?) > 0
Eq(z) ln p(o, z) = -(0.5 (o-u1)² + 0.5 (o-u2)² ) / sqrt(2pi) < 0

此處我們 switch to [@poczosCllusteringEM2015] notation.

Observed data: $D = \{x_1, \cdots, x_n\}$
Unobserved/hidden variable: $z = \{z_1, \cdots, z_n\}$
Parameter: $\theta = [\mu_1, \cdots, \mu_K, \pi_1, \cdots, \pi_K, \Sigma_1, \cdots, \Sigma_K]$
Goal: $\boldsymbol{\theta}^{*}=\underset{\boldsymbol{\theta}}{\operatorname{argmax}} \ln p(D \mid \theta)$

重寫上式：

$latex \begin{aligned}
\ln p(D \mid \boldsymbol{\theta}^t) &= \sum_{\mathbf{z}} q(\mathbf{z}) \ln p(D, \mathbf{z} \mid \boldsymbol{\theta}^t) + \sum_{\mathbf{z}} -q(\mathbf{z}) \ln {q(\mathbf{z})}+ D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid D, \boldsymbol{\theta}^t) ) \\
&= E_{q(z)} \ln p(D, \mathbf{z} \mid \boldsymbol{\theta}) + H(q) + D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid D, \boldsymbol{\theta}) ) \\
&= F_{\theta^t} (q(\cdot), D) + D_{\mathrm{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid D, \boldsymbol{\theta}) )
\end{aligned} $

$F_{\theta^t} (q(\cdot), D)$ 稱為 free energy, 包含 joint distribution expectation 和 self-entropy. 或是稱為 ELBO (Evidence lower bound)

如果 $p(z\mid x; \theta)$ is analytically available (e.g. GMM, this is just a softmax!). The ELBO becomes a Q(theta, theta^old) function + H(q)

The EM algorithm can be summzied as argmax Q!!

E-step: $p(z\mid x)$
M-step; argmax …

We can prove

log likelihood is always increasing!

Use multiple, randomized initialization in practice.

Variational Expectation Maximization

EM algorithm 一個問題是對於複雜的問題沒有 close from p(z|x), then toast!

Appendix

問題二的 Conditional Vs. Joint Distribution 解法

我們之前的 E-Step 是猜 joint distribution, $p(t, s | a, b)$ .

	$s_0$	$s_1$
$t_0$	a	5a
$t_1$	3b	b

如果用上述的 conditional distribution 可以細膩的看每一個 data.

對於所有 $(\bullet, s_0)$
$latex q(t \mid s_0, a, b)=\left\{\begin{array}{l}
q\left(t_{0}\right)=p\left(t_{0} \mid s_{0}, a, b\right)=\frac{a}{a+3 b} \\
q\left(t_{1}\right)=p\left(t_{1} \mid s_{0}, a, b\right)=\frac{3 b}{a+3 b}
\end{array}\right. $
對於所有 $(\bullet, s_1)$
$latex q(t \mid s_1, a, b)=\left\{\begin{array}{l}
q\left(t_{0}\right)=p\left(t_{0} \mid s_{1}, a, b\right)=\frac{5a}{5 a+ b} \\
q\left(t_{1}\right)=p\left(t_{1} \mid s_{1}, a, b\right)=\frac{b}{5 a+ b}
\end{array}\right. $
對於所有 $(t_0, \bullet)$
$latex q(s \mid t_0, a, b)=\left\{\begin{array}{l}
q\left(s_{0}\right)=p\left(s_{0} \mid t_{0}, a, b\right)=\frac{1}{6} \\
q\left(s_{1}\right)=p\left(s_{1} \mid t_{0}, a, b\right)=\frac{5}{6}
\end{array}\right. $
對於所有 $(t_1, \bullet)$
$latex q(s \mid t_1, a, b)=\left\{\begin{array}{l}
q\left(s_{0}\right)=p\left(s_{0} \mid t_{1}, a, b\right)=\frac{3}{4} \\
q\left(s_{1}\right)=p\left(s_{1} \mid t_{1}, a, b\right)=\frac{1}{4}
\end{array}\right. $

再來是問題二的 M-Step

最後再把所有 dataset 的 weighted sum $(t_i, s_j)$ 統計出來，例如
$S_0$ 個 $(\bullet, s_0) \to \frac{a}{a+3b}S_0$ 個 $(t_0, s_0)$ 和 $\frac{3b}{a+3b}S_0$ 個 $(t_1, s_0)$
$S_1$ 個 $(\bullet, s_1) \to \frac{5a}{5a+b}S_1$ 個 $(t_0, s_1)$ 和 $\frac{b}{5a+b}S_1$ 個 $(t_1, s_1)$
$T_0$ 個 $(t_0, \bullet) \to \frac{1}{6}T_0$ 個 $(t_0, s_0)$ 和 $\frac{5}{6}T_0$ 個 $(t_0, s_1)$
$T_1$ 個 $(t_1, \bullet) \to \frac{3}{4}T_1$ 個 $(t_1, s_0)$ 和 $\frac{1}{4}T_1$ 個 $(t_1, s_1)$

$(t_0, s_0)$ 個數 $\to N_{00} = \frac{1}{6}T_0+\frac{a}{a+3b}S_0$
$(t_0, s_1)$ 個數 $\to N_{01} = \frac{5}{6}T_0+\frac{5a}{5a+b}S_1$
$(t_1, s_0)$ 個數 $\to N_{10} = \frac{3}{4}T_1+\frac{3b}{a+3b}S_0$
$(t_1, s_1)$ 個數 $\to N_{11} = \frac{1}{4}T_1+\frac{b}{5a+b}S_1$

因此可以使用完整 data 的 MLE estimation:
$a'=\frac{N_{00}+N_{01}}{6 N} \quad b'=\frac{N_{10}+N_{11}}{4 N}$

To Do Next

Go through GMM example.

下一步 go through HMM model or simplest z -> o graph model.

Math AI: Unsupervised Learning – Similarity/Clustering: K-means and GMM

Major Reference

[@poczosCllusteringEM2015]

Unsupervised Learning

Unsupervised learning (UL) 簡單說就是 learning without label. 常見的 UL 包含降維 (e.g. PCA) 或是 clustering (e.g. K-means). UL 比起 supervised learning (SL) 更難以捉摸。

一個常引述的原因是 UL 的結果比較主觀，因為沒有 label 也就沒有標準答案。但這並不意味沒有 loss function. 當然我們需要 loss function 才能 optimization. 但是

Loss function 可能包含 hyper parameter, 這些 hyper parameter 只能 try-and-error 得到。例如 PCA dimension reduction 到底要 reduce 多少 dimension, 或是 K-means 的 K categories 到底 K 多少比較適合？這些 hyper parameter 有些主觀。
Loss function 本身可能就很主觀 because of scaling, nonlinearity, etc.
如果有 downstream task (e.g. 分類), loss function 會由 downstream task 定義，UL 沒有自己的 loss function 而是 embedded 在 downstream task. 也很難分離判斷不同 UL 的效果。

Similarity

本文聚焦在 clustering, 目標是

High intra-class similarity
Low inter-class similarity
兩種基本常見算法：K-means and GMM (Gaussian Mixture Model).

一個問題是如何定義 "similarity"？常見幾種做法：

Cosine/angular similarity: $sim(A, B) = \cos(\theta) = \frac{\vec{A}\cdot \vec{B}}{\|A\| \|B\|}.\quad$ Cosine similarity 愈大，代表 $\cos\theta$ 愈大， $\theta$ 愈接近 0，代表角度愈小，相似度愈好。
Distance similarity: $\|\vec{A} - \vec{B}\|$ or $\|\vec{A} - \vec{B}\|^2 = \|A\|^2 + \|B\|^2 - 2 \|A\| \|B\| \cos(\theta)$ 。注意 distance similarity 和 cosine similarity 相反(差了一個負號)，distance similarity 愈小，代表愈相似。
Exponential similarity: 在很多情況，similarity 需要比 cosine, L1, or L2 函數更 nonlinear 的函數。例如在 downstream task 分類問題，不同類別（如車子和行人）可能天差地遠。因此可以把原來的 cosine, L1, or L2 函數結合 exponential 函數產生:
- Exponential cosine/angular similarity: $e^{\vec{A}\cdot \vec{B}}$ 或是 $e^{\vec{A}\cdot \vec{B}/|A||B|}.$ 這和 softmax function 非常相像。特別適合結合 (deep) neural network 用於分類問題。
- Exponential distance similarity: $e^{\|\vec{A} - \vec{B}\|}$ or $e^{\|\vec{A} - \vec{B}\|^2}.$ 這似乎特別適合 (shallow) machine learning 的 kernel function (e.g. kernel-SVM, kernel-PCA), 同樣大多用於分類問題。
- Exponential similarity 還有一個非常大的好處：就是永遠是正值，同時積分或求和都會收斂，很自然可以轉換為機率，成為 deterministic and probabilistic interpretation 的橋樑，e.g. softmax or Gaussian mixed model; as opposed to cosine function (可能有負值), L1/L2 (積分不收斂)。

直覺上 distance similarity 似乎比 angular similarity 更精確。實務上還是要看 downstream task. 例如在 face recognition task, angular similarity (with margin) 是主流的做法。

本文討論的 clustering 聚焦在 distance similarity (K-means) and exponential distance similarity (GMM).

K-means

K-means 問題的陳述非常簡單: Given a set of observations ( $x_1, x_2, \cdots ,x_n$ ), where $x_i \in \mathbb{R}^d$ . 把 $n$ observations 分 $K$ 群 $K \le n$ 讓以下的 cost function 最小化, 或是更準確的說 intra-class distance similarity 最大化。

$\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{K} \sum_{\mathbf{x}_{j} \in S_{i}}\left\|\mathbf{x}_{j}-\boldsymbol{\mu}_{i}\right\|^{2}$

where $\mu_i$ is the mean of points in set $S_i$ .

上式的重點 K-means 是 minimize similarity distance, 就是 High intra-class similarity, 不管 Low inter-class similarity.

另外一派注重 low inter-class similarity (or maximum margin similarity), i.e. angular similarity with margin 此處不表，可以參考別文。

K-means 算法

K-means cost function（如下式）沒有 close form solution. 因為 $\mu_i = f(x_j) \,\text{with}\, x_j \in S_i$ 很容易。但 $x_j \in S_i$ 又會 depends on $\mu_i$ . 這是一個雞生蛋，蛋生雞問題。最好的解法就是 iteration: 先固定 $\mu_i$ , assign $x_j \in S_i$ . 再來固定 $x_j$ , re-estimate $\mu_i$ .

$\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{K} \sum_{\mathbf{x}_{j} \in S_{i}}\left\|\mathbf{x}_{j}-\boldsymbol{\mu}_{i}\right\|^{2}$

後面會提到，這是 EM algorithm 的特例。對於每一個 $x_j$ 有可能屬於 $S_i$ group 的機率是 $p_i(x_j) = p_{ij} = 0 \,\text{or}\,1$ 同時對 $i$ 的機率和為 1. 這稱為 hard assignment, 因為 $x_j$ 只能被 assign 到唯一的 $S_i$ .

K-means example, K=3

下圖是一個 $K=3$ 的例子。把 $n$ 個 observations 分為三群。
300

Step 1: initialization. 先隨便選擇 3 個 $\mu_1, \mu_2, \mu_3$ .

Step 2: (hard) assign each $x_j$ to $K_1, K_2, K_3$ . 其實就是在 minimize cost function by fixing $\mu_i$ but changing $x_j \in S_i$ .
具體做法有兩種: (1)計算 $x_j$ 到 $\mu_i$ 的距離，取最近的 $\mu_i$ assign $S_i$ ; (2) 找到 decision boundary，boundary 之內歸為 $S_i$ ，如下圖。

Step 3: Re-estimate the cluster centers $\mu_1, \mu_2, \mu_3$ . 其實也是在 minimize the cost function with fixed $x_i$ but changing $\mu_i$

Repeat Step 2 (re-assignment) and step 3 (re-estimate) to minimize the cost function continuously. **Stop when the re-assignment do NOT change anymore.

K-means algorithm summary

Input

Data + Desired number of clusters, $K$

Initialize

the $K$ cluster centers (randomly if necessary)

Iterate

Decide the class memberships of the n objects by assigning them to the nearest cluster centers
Re-estimate the $K$ cluster centers (aka the centroid or mean), by assuming the memberships found above are correct.

Termination

If none of the $n$ objects changed membership in the last iteration, exit.
Otherwise go to 1.

注意以上的算法並沒有保證一定會找到 global minima! 如果 initial guess 不理想，很容易陷入 local minima. 因此需要 randomize initial guess many times and find a better (local) minima.

Potential Function Interpretation and EM Algorithm

我們可以用物理的觀念來解釋，定義 potential function $F(\mu, C)$

$\min _{\mu, C} F(\mu, C)=\min _{\mu, C} \sum_{j=1}^{n}\left\|\mu_{C(j)}-x_{j}\right\|^{2}$

where $\mu=\left(\mu_{1}, \ldots, \mu_{K}\right)$ and $C=(C(1), \ldots, C(n))$
and $C(j) \leftarrow \arg \min _{i}\left\|\mu_{i}-x_{j}\right\|^{2}$ is the classification function.

K-means algorithm (coordinate descent on F)
(1) Expectation step (re-assignment): fix $\mu$ , Optimize $C$
(2) Maximization step (re-center): fix $C$ , Optimize $\mu$

Gaussian Mixture Model (GMM)

GMM 可以視為 K-means 的進化版。 K-means 要求每個 data point $x_j$ 只能屬於唯一的 group/cluster $S_i$ . 對於聚成團清楚分開的 data set, K-means 是很好的 clustering algorithm. 但對於重疊性高的 data set, 例如下圖位於 $\mu_1$ 和 $\mu_3$ 或是 $\mu_2$ 和 $\mu_3$ 重疊區域的 data, 硬要指定屬於某一個 group 有可能造成 downstream task (e.g. 分類、迴歸) 的誤判。從 information theory 角度就是 information loss.

Hard Assignment Vs Soft Assignment

K-means: hard assignment: 每一個 $x_j$ where $j\in 1,\cdots,n$ 僅能屬於一個 cluster
$\theta_{i} \in\left\{\theta_{1}, \ldots, \theta_{K}\right\}$

GMM: soft assignment: 每一個 $x_j$ where $j\in 1,\cdots,n$ 屬於任何 cluster with probability
$\left(\pi_{1}, \ldots, \pi_{K}\right), \pi_{i} \geq 0, \sum_{i=1}^{K} \pi_{i}=1$

~~如果限制 $\pi_i = 0 \,\,\text{or}\,\, 1$ , Soft assignment 就變成 hard assignment, GMM 就變成 K-means.~~
$\pi_i$ 代表 $i^{th}$ group 的機率分布或比例，和 $x_j$ 屬於 $i_th$ group 的機率似乎不同？我們用以下 hidden model 解釋。

Probability with Exponential Distance Similarity

如何決定 probability? 此時 exponential similarity 就派上用場。GMM 使用 Gaussian distribution, 對應 exponential (L2) distance similarity.

Hidden Variable Model

Gaussian mixture models (GMMs) 可以用以下的 graph model 代表。 data point 都是一個 $(x_i, z_i)$ , $z_i$ is unobservable, the joint distribution $p(x, z) = p(x|z)p(z)$ is a directed model, where $p(z=k) = \pi_k$

$p(\mathbf{x} \mid \mathbf{z}=k)=\mathcal{N}\left(\mathbf{x} ; \mu_{k}, \Sigma_{k}\right)$

我們會 given 一堆 $x_i$ and K categories.

$p(x)=\sum_{i=1}^{K} p(x \mid z=i) P(z=i)$
observed data = summation of mixture component (Gaussian) * mixture proportion ( $\pi_i$ )

$\pi_{i}=P(z=i)$ => 注意 $\pi_i$ 不是個別 $x_j$ 屬於 $i^{th}$ group 的機率，i.e. $P(z=i | x_j )$ . 後面會看到 $P(z=i | x_j )$ 在 EM 扮演重要角色。

GMM

先考慮一個假的 GMM: assume $\theta=\left[\mu_{1}, \ldots, \mu_{K}, \Sigma_{1}, \ldots, \Sigma_{K}, \pi_{1}, \ldots, \pi_{K}\right]$ 為已知。

考慮 clustering based on posteriors (i over j):
$\log \frac{P(z=i \mid x)}{P(z=j \mid x)}$
$=\log \frac{p(x \mid z=i) P(z=i) / p(x)}{p(x \mid z=j) P(z=j) / p(x)}$
$=\log \frac{p(x \mid z=i) \pi_{i}}{p(x \mid z=j) \pi_{j}}=\log \frac{\pi_{i} \frac{1}{\sqrt{\left|2 \pi \Sigma_{i}\right|}} \exp \left[-\frac{1}{2}\left(x-\mu_{i}\right)^{T} \Sigma_{i}^{-1}\left(x-\mu_{i}\right)\right]}{\pi_{j} \frac{1}{\sqrt{\left|2 \pi \Sigma_{j}\right|}} \exp \left[-\frac{1}{2}\left(x-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x-\mu_{j}\right)\right]}$
$=x^{T} W x+w^{T} x+c$

這代表 ”Quadratic Decision Boundary", 就是拋物線 boundary 如下圖。

如果所有 $\Sigma_i = \sigma^2$ ，拋物線就變成直線。變成和 K-means boundary 一樣。如果再加上 hard assignment, 就真的變成和 K-mean 一樣。注意 $\pi_i$ 只影響 offset, 不影響是拋物線或是直線。

GMM 並沒有真的 boundary, 也不是 hard assignment, 此處只是 for illustration purpose.

GMM MLE Estimation

現實世界 $\theta=\left[\mu_{1}, \ldots, \mu_{K}, \Sigma_{1}, \ldots, \Sigma_{K}, \pi_{1}, \ldots, \pi_{K}\right]$ 都是未知。

GMM 問題其實是一個 maximum likelihood problem to estimate $\theta$ .

$latex \begin{aligned}
\arg \max _{\theta} \prod_{j=1}^{n} P\left(x_{j} \mid \theta\right) &=\arg \max _{\theta} \prod_{j=1}^{n} \sum_{i=1}^{K} P\left(z=i, x_{j} \mid \theta\right) \\
&=\arg \max _{\theta} \prod_{j=1}^{n} \sum_{i=1}^{K} P\left(z=i \mid \theta\right) p\left(x_{j}\mid z=i , \theta\right) \\
&=\arg \max _{\theta} \prod_{j=1}^{n} \sum_{i=1}^{K} \pi_{i} \frac{1}{\sqrt{\left|2 \pi \Sigma_{i}\right|}} \exp \left[-\frac{1}{2}\left(x_{j}-\mu_{i}\right)^{T} \Sigma_{i}^{-1}\left(x_{j}-\mu_{i}\right)\right]
\end{aligned} $

上述方程式非常複雜，有幾種可能的解法

所有變數的偏微分為零。解所有變數 -> Nonlinear, non-analytically solvable.
Use gradient descent. -> doable, but often slow.
Use expectation maximization (EM) algorithm.

EM Algorithm

EM 其實和原來 K-means 的 iterative solution 有異曲同工之妙。

K-means: (1) fix $\mu_i$ , change $x_j$ assignment to minimize potential function; then (2) fix $x_j$ assignment, move $\mu_i$ to minimize potential function.

In EM, 我們基本上 reuse 以上的 recipe 稍加修改，再取一個 fancy 名字：E-optimizaiton and M-optimization for EM algorithm, 就成為 10 大最有影響力的演算法。

修改的方法：(1) fix $\mu_i$ and $\Sigma_i$ , re-assign $x_j$ probability, i.e. $\pi_i$ to maximize likelihood; (2) fix $x_j$ probability $\pi_i$ , move $\mu_i$ and $\Sigma_i$ to maximize likelihood.

Step (1) 稱為 E-step (Expectation); (2) 稱為 M-step (Maximization).

特例一：所有 Gaussian Distribution $\sigma^2$ 相同且已知。另外 $\pi_i$ 為已知！

假設 unknowns are: $\theta = [\mu_1, \mu_2, \cdots, \mu_K]$ , 我們看一下 posterior likelihood
$latex \begin{aligned}
p\left(x_{1}, \ldots, x_{n} \mid \mu_{1}, \ldots \mu_{K}\right)&=\prod_{j=1}^{n} p\left(x_{j} \mid \mu_{1}, \ldots, \mu_{K}\right) \quad \text { Independent data }\\
&=\prod_{j=1}^{n} \sum_{i=1}^{K} p\left(x_{j}, z=i \mid \mu_{1}, \ldots, \mu_{K}\right) \quad \text { Marginalize over class }\\
&=\prod_{j=1}^{n} \sum_{i=1}^{K} p\left(x_{j} \mid z=i, \mu_{1}, \ldots, \mu_{K}\right) p\left(z=i\right) \\
&\propto \prod_{j=1}^{n} \sum_{i=1}^{K} \exp \left(-\frac{1}{2 \sigma^{2}}\left\|x_{j}-\mu_{i}\right\|^{2}\right) \pi_{i} \Rightarrow \text { learn } \mu_{1}, \mu_{2}, \ldots \mu_{\mathrm{K}}
\end{aligned} $

比較一下 K-means 的 object function
$\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{K} \sum_{\mathbf{x}_{j} \in S_{i}}\left\|\mathbf{x}_{j}-\boldsymbol{\mu}_{i}\right\|^{2}$

如果把上上式 GMM 的 soft assignment 限制成 hard assignment, 再取 log on the likelihood function and re-grouping. GMM 和 K-means 似乎完全等價！(to be proved)

接下來我們示範用 iterative method (EM algorithm) 解問題。

Step 1: initialization, 可以隨機選擇 $\theta^0 = [\mu^0_1, \mu^0_2, \cdots, \mu^0_K]$ ( $\sigma^2$ 為已知). => same as K-means

通常隨機的結果不好，我們可以先用 K-Means to get a good initialization.

Step 2: E-Step (soft) assign each $x_j$ to $\mu_i$

其實就是在 minimize cost function by fixing $\mu_i$ but changing $x_j$ weight, 其實就是 $P(z=i \mid x_j, \theta^{t-1})$

做法很直觀，因為所有 Gaussian distribution variance 都一樣，距離就是唯一的決定因素。基本就是每一個 $x_j$ 計算和 $\mu_i$ 的距離。代入 Gaussian distribution, 最後 normalized to 1.

$latex \begin{aligned}
P\left(z=i \mid x_{j}, \theta^{t-1}\right) &=P\left(z=i \mid x_{j}, \mu_{1}^{t-1}, \ldots, \mu_{K}^{t-1}\right) \\
& \propto P\left(x_{j} \mid z=i, \mu_{1}^{t-1}, \ldots, \mu_{K}^{t-1}\right) P\left(z=i\right) \\
& \propto \exp \left(-\frac{1}{2 \sigma^{2}}\left\|x_{j}-\mu_{i}^{t-1}\right\|^{2}\right) \pi_{i} \\
=& \frac{\exp \left(-\frac{1}{2 \sigma^{2}}\left\|x_{j}-\mu_{i}^{t-1}\right\|^{2}\right) \pi_{i}}{\sum_{i=1}^{K} \exp \left(-\frac{1}{2 \sigma^{2}}\left\|x_{j}-\mu_{i}^{t-1}\right\|^{2}\right) \pi_{i}}
\end{aligned} $

Equivalent to assigning clusters to each data point in K-means in a soft way

Step 3: M-Step re-estimate the cluster centers/means, i.e. update $\mu^{t-1}_i \to \mu^{t}_i$

其實也是在 minimize the cost function with fixed $x_j$ but changing $\mu_i$ . 直觀的想法就是每一點 $x_j$ 乘上 weight 最後做 average. 問題是 weight 是什麼？

直觀說就是機率 $p(z_i | x_j)$ 愈大，weight 應該愈高。
$\mu_{i}^{t}=\sum_{j=1}^{n} w_{j} x_{j} \text { where } w_{j}=\frac{P\left(z_{j}=i \mid x_{j}, \theta^{t-1}\right)}{\sum_{l=1}^{n} P\left(z_{l}=i \mid x_{l}, \theta^{t-1}\right)}$

檢查一下，如果是 hard assignment, $p(z_i | x_j)$ = 1 or 0. 上式就代表屬於 $i^{th}$ group/clustering 的平均值。這和 K-means 的 re-estimate 一致。

實務上要計算 $p(z_i | x_j)$ 還要引入一個輔助函數 $Q$
$latex
Q\left(\theta, \theta^{t}\right)=E_{Z}\left[\log \left(p\left(X, Z \mid \theta^{t}\right)\right)\right]
$
Maximize Q function 可以證明同時也 maximize likelihood function.
$latex
\theta^{t+1}=\arg \max _{\theta}\left[Q\left(\theta, \theta^{t}\right)\right]
$

此處不涉入細節，可以參考 [@alliliShortTtutorial2010].

General GMM: 除了 $\mu_i$ 需要迭代， $\Sigma_i$ and $\pi_i$ 也需要迭代

E-step

$R_{i, j}^{t-1}=P\left(z_{j}=i \mid x_{j}, \theta^{t-1}\right)=\frac{\mathcal{N}\left(x_{j} \mid \mu_{i}^{t-1}, \Sigma_{i}^{t-1}\right) \pi_{i}^{t-1}}{\sum_{i=1}^{K} \mathcal{N}\left(x_{j} \mid \mu_{i}^{t-1}, \Sigma_{i}^{t-1}\right) \pi_{i}^{t-1}}$

M-step
$\mu_{i}^{t}=\sum_{j=1}^{n} w_{j} x_{j} \quad \text { where } w_{j}=\frac{R_{i, j}^{t-1}}{\sum_{j=1}^{n} R_{i, j}^{t-1}}$

$\Sigma_{i}^{t}=\sum_{j=1}^{n} w_{j}\left(x_{j}-\mu_{i}^{t}\right)^{T}\left(x_{j}-\mu_{i}^{t}\right)$

$\pi_{i}^{t}=\frac{1}{n} \sum_{j=1}^{n} R_{i, j}^{t-1}$

Spherical EM, same variance GMM

這是 special case.

Hidden Markov Model (HMM) Using EM Algorithm

Not finding the path, but find the initial distribution $\pi$ and $A, B$ matrix.

Math AI – Variational Autoencoder (VAE) 變分自編碼器

機器學習的起手式是 MNIST 的手寫數字辨識率，這是 supervised learning. 使用的 Neural network 是 discriminative model, 可以是 MLP (multi-layer perceptron) 或是 CNN.

另一個機器學習的分支是 MNIST 影像 dimension reduction/compression, reconstruction, denoise, 這是 unsupervised learning, 也稱為 self-supervised learning. 使用的 neural network (decoder) 是 generative model, 可以是 autoencoder (AE) 以及變形 (sparse AE, convolution AE, variational AE, etc.) 或是更 fancy 的 GAN, 如下圖。
AE 以及變形在 training 都用 encoder-decoder 架構，self-supervised training.

本文主要聚焦在 VAE. VAE 其實和 AE 有很大的差別。AE 可以視為 (nonlinear) dimension reduction 如下圖，本質上很 PCA 非常類似。Hinton 發表在 Science 的 autoencoder paper 可說是開山之作[@hintonReducingDimensionality2006]. AE paper 引入的 encoder-decoder 架構，至今如 transformer 持續使用中。Encoder-decoder 的重要性在機器學習無所不在。大概只有 GAN 的 discriminator-generator 才創造另一種新的架構，普遍性還是比不上 encoder-decoder。

下圖是 MNIST 經過 AE 降維後再影射到 2D 空間的(樣本)分佈。這個分佈可以幫助判斷降維 feature extraction 的好壞。

原則上不同數字（0-9）對應的分佈愈集中代表 intra-class 的 variation 愈小，降維的 feature extraction 有抓到同一類別 “dominating feature". Inter-class 分佈愈分開，降維的 feature extraction 有抓到不同類別的 "differentiated feature"。

這是分類網絡最大的挑戰，maximize inter-class distribution distance and minimize intra-class distribution variation. AE 降維網絡 (encoder) 提供了這個工具。之後的各種變形 (CNN AE, sparse AE, etc.) 都是 AE 延伸。

AE 一個大問題：我們無法控制這個樣本分佈。為什麼這是一個大問題？因為無法控制樣本分佈就無法用於 generative model! 試試用 AE generate 一個數字或是人臉。基本無從下手。

Why VAE?

簡單來說就是把 AE 無法控制的樣本分佈 $z$ (下圖中)，變成可以控制的樣本分佈 $q_{\phi}(\mathbf{z} \mid \mathbf{x}) \sim N(0, I)$ (下下圖中左). 注意 VAE 想要控制的樣本分佈是指整體的樣本分佈，不是個別類別的樣本分佈。實務上 VAE 常用於 unsupervised learning, 只有整體的樣本分佈，沒有個別的樣本分佈。

VAE 雖然繼承了 AE encoder-decoder 架構，以及 unsupervised/self-supervised training procedure, 但和 AE 卻有本質的差異。簡單來說
* 最基本/重要：VAE 是 probabilistic encoder/decoder, AE 則是 deterministic encoder/decoder. VAE decoder 是 true generative model, 可以用在 image generation 重要的功能。
* Inference: AE 一般使用 encoder 降維，或是 encoder-decoder denoise image. VAE 一般使用 decoder for image generation.
* z 是 Gaussian distribution with mean/variance.

How VAE work?

VAE training 部分是最難理解的部分。因為會加入 “random sample”, $\epsilon,$ from Gaussian distribution. 這和 AE “deterministic” 降維再升維的觀念似乎完全不同。

1. "Robust" AE

我們可以直觀理解 by setting $\epsilon = 0.$ . 此時 $z = \mu$ 就變成 deterministic 降維，和一般的 AE 相同。因此 $\epsilon$ sampled from Gaussian distribution 可以視為 additive noise, 目的是加強這個 AE 的 robustness. 從這個角度，Gaussian distribution 只是配角，也就是 ”noise".

2. Probability Distribution Conversion (分布變換)

另一個角度，VAE (encoder) 可以想成分布變換。接下來詳細說明。在這個 picture 中 $z \sim N(\mu, \sigma)$ 經過 decoder map to $x'$ distribution close to input $x$ distribution. 從這個角度，Gaussian distribution 反而是主角。

目前主流的觀點是 (2)，因為應用廣泛。問題是如何在 training 時達成這種分布變換？

用下圖表示我們想要達成的目的。

訓練出一個生成器 (generator, i.e. decoder) 可以把 normal distribution 的樣本 $Z_i$ (sample) 變換成 $\hat{X}_i$ , sample distribution 和原來的 input $X_i$ sample distribution 相等。
通常我們會拿VAE跟GAN比較，的確，它們兩個的目標基本是一致的——希望構建一個從隱變量 $Z$ 生成目標數據 $X$ 的模型，但是實現上有所不同。更準確地講，它們是假設了 $Z$ 服從某些常見的分布（比如正態分布或均勻分布），然後希望訓練一個模型 $X=g(Z)$ ，這個模型能夠將原來的機率分布映射到訓練集的機率分布，也就是說，它們的目的都是進行分布之間的變換。
$Z$ 的維度遠小於 $\hat{X}$ or $X$ ，因此 generator (or decoder) 同時執行升維和分布變換。

如何 Train Generator?

重點是如何定義 loss function.

假設 $Z$ 服從標準的正態分布，那麼我就可以從中採樣得到若干個 $Z_1,Z_2,...,Z_n$ ，然後對它做變換得到 $\hat{X}_1=g(Z_1),\hat{X}_2=g(Z_2),...,\hat{X}_n=g(Z_n).$ 我們怎麼判斷這個通過 $g$ 構造出來的數據集，它的分布跟我們目標的數據集分布是不是一樣？有讀者說不是有KL散度嗎？當然不行，因為KL散度是根據兩個機率分布的表達式來算它們的相似度的，然而目前我們並不知道它們的機率分布的表達式，我們只有一批從構造的分布採樣而來的數據 { $\hat{X_1},\hat{X_2},...,\hat{X_n}$ }，還有一批從真實的分布採樣而來的數據 { $X_1,X_2,...,X_n$ }（也就是我們希望生成的訓練集）。我們只有樣本本身，沒有分布表達式，當然也就沒有方法算KL散度。

雖然遇到困難，但還是要想辦法解決的。GAN的思路很直接粗獷：既然沒有合適的度量(loss function?)，那我乾脆把這個度量也用神經網絡訓練出來吧。就這樣，WGAN就誕生了，詳細過程請參考《互懟的藝術：從零直達WGAN-GP》。而VAE則使用了一個精緻迂迴的技巧。[@suVariationalAutoencoder2018]

VAE Encoder-Decoder Training 傳統理解和誤解

首先我們有一批數據樣本 { $X_1,X_2,...,X_n$ }，其整體用 $X$ 來描述，我們本想根據 { $X_1,X_2,...,X_n$ } 得到X的分布 $p(X)$ ，如果能得到的話，那我直接根據 $p(X)$ 來採樣，就可以得到所有可能的 $X$ 了（包括{{ $X_1,X_2,...,X_n$ } 以外的），這是一個終極理想的生成模型了。當然，這個理想很難實現，於是我們將分布改一改

$p(X)=\sum_{Z} p(X \mid Z) p(Z)$

此時 $p(X|Z)$ 就描述了一個由 $Z$ 來生成 $X$ 的模型，而我們假設 $Z$ 服從標準正態分布，也就是 $p(Z)=N(0,I)$ 。如果這個理想能實現，那麼我們就可以先從標準正態分布中採樣一個 $Z$ ，然後根據 $Z$ 來算一個 $X$ ，也是一個很棒的生成模型。接下來就是結合自編碼器來實現重構，保證有效信息沒有丟失，再加上一系列的推導，最後把模型實現。框架的示意圖如下：

問題在哪裡？在 training!
如果如上圖的話，我們其實完全不清楚：究竟經過重新採樣出來的 $Z_k$ ，是不是還對應著原來的 $X_k$ ，所以我們如果直接最小化 loss function as $D(\hat{X}_k,X_k)^2$ （這裡 $D$ 代表某種距離函數）是很不科學的，而事實上你看代碼也會發現根本不是這樣實現的。也就是說，很多教程說了一大通頭頭是道的話，然後寫代碼時卻不是按照所寫的文字來寫，可是他們也不覺得這樣會有矛盾～

VAE 正確理解？

其實，在整個VAE模型中，我們並沒有去使用 $p(Z)$ （隱變量空間的分布）是正態分布的假設，我們用的是假設 $p(Z|X)$ （後驗分布）是正態分布！

具體來說，給定一個真實樣本 $X_k$ ，我們假設存在一個專屬於 $X_k$ 的分布 $p(Z|X_k)$ （學名叫後驗分布），並進一步假設這個分布是（獨立的、多元的）正態分布。為什麼要強調「專屬」呢？因為我們後面要訓練一個生成器 $X=g(Z)$ ，希望能夠把從分布 $p(Z|X_k)$ 採樣出來的一個 $Z_k$ 還原為 $X_k$ 。如果假設 $p(Z)$ 是正態分布，然後從 $p(Z)$ 中採樣一個 $Z$ ，那麼我們怎麼知道這個 $Z$ 對應於哪個真實的 $X$ 呢？現在 $p(Z|Xk)$ 專屬於 $X_k$ ，我們有理由說從這個分布採樣出來的 $Z$ 應該要還原到 $X_k$ 中去。如下圖：

上圖和上上圖差別在哪裡？就是均值 (mean) 和方差 (variance）在 after training finished, i.e. inference 是一個固定值（和 $X_i$ 無關，上上圖）或是隨 input $X_i$ 而改變 (和 $X_i$ 有關，上圖)。可以參考 Appendix A 的 code snippet.

事實上，在論文《Auto-Encoding Variational Bayes》的應用部分，也特別強調了這一點。注意下式中的 $\mu^{(i)}$ 和 ${\sigma}^{2(i)}$ 都有 superscript $i$ .

In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:
$\log q_{\phi}\left(z \mid \boldsymbol{x}^{(i)}\right)=\log N\left(z ; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \boldsymbol{I}\right)$

論文中的上式是實現整個模型的關鍵，不知道為什麼很多教程在介紹VAE時都沒有把它凸顯出來。儘管論文也提到 $p(Z)$ 是標準正態分布，然而那其實並不是本質重要的。

順便說明為什麼 $p(Z)$ 是標準正態分布 (normal distribution, Gaussian distribution). 如下式 $p(Z)$ 是一堆正態分布 $p(Z \mid X_i)$ , 各有不同的 $\mu^{(i)}$ 和 ${\sigma}^{2(i)}$ )，乘上 $X_i$ 出現的機率 (一個常數)， $p(X_i)$ 的 weighted sum.

$p(Z)=\sum_{X_i} p(Z \mid X_i) p(X_i)$

我們知道一堆正態分布的 weighted sum 還是正態分布，所以 $p(Z)$ 仍然是正態分布。其實這就是上上圖的 picture. 並非完全不正確，但是各 $X$ $\mu^{(i)}$ 和 ${\sigma}^{2(i)}$ 的 information 被平均掉了。上圖則是保持各自 $\mu^{(i)}$ 和 ${\sigma}^{2(i)}$ ，

VAE Details

回到本文，這時候每一個 $X_k$ 都配上了一個專屬的正態分布，才方便後面的生成器做還原，如上圖。但這樣有多少個 $X$ 就有多少個正態分布了。我們知道正態分布有兩組參數：均值 $\mu$ 和方差 $\sigma^2$ （多元的話，它們都是向量)，如下圖。

那我怎麼找出專屬於 $X_k$ 的正態分布 $p(Z|X_k)$ 的均值和方差呢？好像並沒有什麼直接的思路。那好吧，那我就用神經網絡來擬合出來吧！這就是神經網絡時代的哲學：難算的我們都用神經網絡來擬合，在WGAN那裡我們已經體驗過一次了，現在再次體驗到了。

VAE Encoder

於是我們構建兩個神經網絡 $\mu_k =f_1(X_k), \log \sigma^2_k = f_2(X_k)$ 來算它們了，相當于 AE encoder! AE/VAE encoder 主要的作用是(1)降維；(2)產生 input $x$ dependent latent variables.

比較 AE encoder 和 VAE encoder

兩者都會降維。
AE encoder 是 deterministic; VAE encoder 觀念上是 probabilistic，實際 implementation 是 deterministic 和 AE encoder 一樣。只有在 encoder-decoder training 和 decoder generator 時才有 Gaussian random variable.
AE encoder 只產生一個 input $X_k$ dependent latent variable $Z_k$ ; VAE 產生兩個 input $X_k$ dependent latent variables， $\mu_k$ and $\log \sigma^2_k$ .

我們選擇擬合 $\log \sigma^2_k$ 而不是直接擬合 $\sigma^2_k$ ，是因為 $\sigma^2_k$ 總是非負的，需要加激活函數處理，而擬合 $\log \sigma^2_k$ 不需要加激活函數，因為它可正可負。到這裡，我能知道專屬於 $X_k$ 的均值和方差了，也就知道它的正態分布長什麼樣了，然後從這個專屬分布中採樣一個 $Z_k$ 出來，然後經過一個生成器得到 $\hat{X}_k=g(Z_k)$ ，現在我們可以放心地最小化 $D(\hat{X}_k,X_k)^2$ ，因為 $Z_k$ 是從專屬 $X_k$ 的分布中採樣出來的，這個生成器應該要把開始的 $X_k$ 還原回來，如上圖。

VAE Training

讓我們來思考一下，根據上圖的訓練過程，最終會得到什麼結果。

首先，我們希望重構 $X$ ，也就是最小化 $D(\hat{X}_k,X_k)^2$ ，但是這個重構過程受到噪聲的影響，因為 $Z_k = \mu_k + \sigma_k \cdot \epsilon$ ( $\epsilon$ 是 Gaussian random variable) 是通過重新採樣過，不是直接由encoder算出來的。顯然噪聲會增加重構的難度，不過好在這個噪聲強度（也就是方差）通過一個神經網絡算出來的，所以最終模型為了重構得更好，肯定會想盡辦法讓方差為0。而方差為0的話，也就沒有隨機性了，所以不管怎麼採樣其實都只是得到確定的結果（也就是均值），只擬合一個當然比擬合多個要容易，而均值是通過另外一個神經網絡算出來的。

說白了，模型會慢慢退化成普通的AutoEncoder，噪聲不再起作用。

這樣不就白費力氣了嗎？說好的生成模型呢？

別急別急，其實VAE還讓所有的 $p(Z|X)$ 都向標準正態分布看齊，這樣就防止了噪聲為零，同時保證了模型具有生成能力。怎麼理解「保證了生成能力」呢？如果所有的 $p(Z|X)$ 都很接近標準正態分布 $N(0,I)$ , 那麼根據定義
$p(Z)=\sum_{X} p(Z \mid X) p(X)=\sum_{X} \mathcal{N}(0, I) p(X)=\mathcal{N}(0, I) \sum_{X} p(X)=\mathcal{N}(0, I)$
這樣我們就能達到我們的先驗假設： $p(Z)$ 是標準正態分布。然後我們就可以放心地從 $N(0,I)$ 中採樣來生成圖像了。

Really? 這代表 $\mu_k \sim 0$ and $\sigma_k \sim 1$ ? => 這應該只是一個 constraint, 避免 $\sigma_k \to 0$ 以及 $\mu_k$ go out of bound.

How? By define the loss function!

"Variational" Latent Loss

那怎麼讓所有的 $p(Z|X)$ 都向 $N(0,I)$ 看齊呢？如果沒有外部知識的話，其實最直接的方法應該是在重構誤差(reconstruction loss)的基礎上中加入額外的 loss：
$\mathcal{L}_{\mu}=\left\|f_{1}\left(X_{k}\right)\right\|^{2} \quad\text{and}\quad \mathcal{L}_{\sigma^{2}}=\left\|f_{2}\left(X_{k}\right)\right\|^{2}$

因為它們分別代表了均值 $\mu_k$ 和方差的對數 $\log \sigma^2_k$ ，達到 $N(0,I)$ 就是希望二者盡量接近於0了。不過，這又會面臨著這兩個損失的比例要怎麼選取的問題，選取得不好，生成的圖像會比較模糊。所以，原論文直接算了一般（各分量獨立的）正態分布與標準正態分布的KL散度 $KL(N(\mu,\sigma^2)\|N(0,I))$ 作為這個額外的loss，計算結果為 (Appendix B)

$\mathcal{L}_{\mu, \sigma^{2}}=\frac{1}{2} \sum_{i=1}^{d}\left(\mu_{(i)}^{2}+\sigma_{(i)}^{2}-\log \sigma_{(i)}^{2}-1\right)$

這裡的 $d$ 是隱變量 $Z$ 的維度，而 $\mu_{(i)}$ 和 $\sigma^2_{(i)}$ 分別代表一般正態分布的均值向量和方差向量的第 $i$ 個分量。直接用這個式子做補充loss，就不用考慮均值損失和方差損失的相對比例問題了。

這個 loss 也稱為 latent loss 代表隱變數 ( $\mu_i$ and $\sigma^2_i$ ) 和 $N(0, I)$ 差距。
VAE 的 "variation, 變分" 在數學上是一個函數積分（或是其他 operations）的極值，變數不是一個值，而是一個函數。KL divergence 就是以函數為變數的積分 (Appendix B)，因此 latent loss minimization 就相當 variation, 沒有其他神奇之處。

顯然，latent loss 也可以分兩部分理解：
$latex \begin{array}{l}
\mathcal{L}_{\mu, \sigma^{2}}=\mathcal{L}_{\mu}+\mathcal{L}_{\sigma^{2}} \\
\mathcal{L}_{\mu}=\frac{1}{2} \sum_{i=1}^{d} \mu_{(i)}^{2}=\frac{1}{2}\left\|f_{1}(X)\right\|^{2} \\
\mathcal{L}_{\sigma^{2}}=\frac{1}{2} \sum_{i=1}^{d}\left(\sigma_{(i)}^{2}-\log \sigma_{(i)}^{2}-1\right)
\end{array} $

$\mathcal{L}_{\mu}$ 是 parabolic dependency on $\mu_i.$ The minimum occurs when $\mu_i=0.$ $\mathcal{L}_{\mu}$ 的作用是 drive $\mu_i\to 0$ .

$\mathcal{L}_{\sigma^{2}}$ 對於 $\sigma^2_i$ 的關係如下圖。The minimum occurs when $\sigma^2_i = 1.$ $\mathcal{L}_{\sigma^{2}}$ 的作用是 drive $\sigma^2_i \to 1.$

注意任一 $\sigma^2_i \to 0$ 會導致 $\mathcal{L}_{\sigma^{2}}, \mathcal{L}_{\mu, \sigma^{2}} \to \infty$ . 因此 latent loss, $\mathcal{L}_{\mu, \sigma^{2}},$ 會阻止 VAE 從 probabilistic 變成 deterministic ( $\sigma^2_i \to 0$ ).

除了 latent loss, XAE (AE, VAE, SAE, CAE) 的基本 loss term 是重構誤差(reconstruction loss)。所以 VAE training 完整的 loss 包含 latent loss 加上 reconstruction loss, 見 Appendix A code.

Re-parmeterization Trick (for Gradient Back Propagation)

最後是實現模型的一個技巧，reparameterization trick。其實很簡單，就是我們要從 $p(Z|X_k)$ 中採樣一個 $Z_k$ 出來，儘管我們知道了 $p(Z|X_k)$ 是正態分布，但是均值方差都是靠模型算出來的，我們要靠這個過程反過來優化均值方差的模型，但是「採樣」這個操作是不可微分的，而採樣的結果是可微分的。我們利用
$\begin{aligned} & \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(z-\mu)^{2}}{2 \sigma^{2}}\right) d z \\=& \frac{1}{\sqrt{2 \pi}} \exp \left[-\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^{2}\right] d\left(\frac{z-\mu}{\sigma}\right) \end{aligned}$

這說明 $(z-\mu)/\sigma=\varepsilon$ 是服從均值為0、方差為1的標準正態分布的，要同時把 $dz$ 考慮進去，是因為乘上 $dz$ 才算是概率，去掉dz是概率密度而不是概率。

這時候我們得到：從 $N(\mu,\sigma^2)$ 中採樣一個 $Z$ ，相當於從 $N(0,I)$ 中採樣一個 $\varepsilon$ ，然後讓 $Z=\mu+\varepsilon \times\sigma$ 。如下圖：

於是，我們將從 $N(\mu,\sigma^2)$ 採樣變成了從 $N(0,I)$ 中採樣，然後通過參數變換得到從 $N(\mu,\sigma^2)$ 中採樣的結果。這樣一來，「採樣」這個操作就不用參與梯度下降了，改為採樣的結果參與，使得整個模型可訓練了。

Training Forward and Backward Paths

Forward path

Encoder: $\mu_k =f_1(X_k), \log \sigma^2_k = f_2(X_k)$ 兩者都是 $d$ dimension vector/tensor.
Sample an $\varepsilon$ from $N(0, I)$ , compute $Z_k = \mu_k+\varepsilon \times \sigma_k$ . $\varepsilon$ 的 dimension?
使用 $Z_k$ in decoder to reconstruct $X'_k$

Compute loss term 包含 reconstruction loss and latent loss

$\mathcal{L} = |X_k - X'_k|^2 + \mathcal{L}_{\mu, \sigma^2}$

Backward path (自動微分)

Back propagation 先 update decoder weights.
Back propagation update 到 $Z_k$ , given the fixed sample $\varepsilon$ and based on $Z_k = \mu_k+\varepsilon\times\sigma_k$ , 可以 update $\mu_k$ and $\sigma_k$ .
繼續 back propagation update encoder.

VAE 的本質

以下 from [@suVariationalAutoencoder2018], 很好總結 VAE 的本質。

VAE的本質是什麼？VAE雖然也稱是AE（AutoEncoder）的一種，但它的做法（或者說它對網絡的詮釋）是別具一格的。在VAE中，它的Encoder有兩個，一個用來計算均值，一個用來計算方差，這已經讓人意外了：Encoder不是用來Encode的，是用來算均值和方差的，這真是大新聞了，還有均值和方差不都是統計量嗎，怎麼是用神經網絡來算的？

事實上，VAE從讓普通人望而生畏的變分和貝葉斯理論出發，最後落地到一個具體的模型中，雖然走了比較長的一段路，但最終的模型其實是很接地氣的：它本質上就是在我們常規的自編碼器的基礎上，對encoder的結果（在VAE中對應著計算均值的網絡）加上了「高斯噪聲」，使得結果decoder能夠對噪聲有 robustness；而那個額外的KL loss（目的是讓均值為0，方差為1），事實上就是相當於對encoder的一個正則項，希望encoder出來的東西均有零均值。

那另外一個encoder（對應著計算方差的網絡）的作用呢？它是用來動態調節噪聲的強度的。直覺上來想，當decoder還沒有訓練好時（重構誤差遠大於KL loss），就會適當降低噪聲（KL loss增加），使得擬合起來容易一些（重構誤差開始下降）；反之，如果decoder訓練得還不錯時（重構誤差小於KL loss），這時候噪聲就會增加（KL loss減少），使得擬合更加困難了（重構誤差又開始增加），這時候decoder就要想辦法提高它的生成能力了。

說白了，重構的過程是希望沒噪聲的，而KL loss則希望有高斯噪聲的，兩者是對立的。所以，VAE跟GAN一樣，內部其實是包含了一個對抗的過程，只不過它們兩者是混合起來，共同進化的。從這個角度看，VAE的思想似乎還高明一些，因為在GAN中，造假者在進化時，鑒別者是安然不動的，反之亦然。當然，這只是一個側面，不能說明VAE就比GAN好。GAN真正高明的地方是：它連度量都直接訓練出來了，而且這個度量往往比我們人工想的要好（然而GAN本身也有各種問題）。

Appendix A: VAE Coding Example [@durrVAETutorial2016]

https://ithelp.ithome.com.tw/articles/10188453
Variational Autoencoder 直接從 Github 上找實作來試試看，並用 MNIST 的資料來重現。

Encoder

在 encoder 一開始的地方和 Autoencoder 一樣，循序的降維．這裡定義了維度從 784 維到 500 維到 200(501?) 維再到 2 維．

n_z = 2 #Dimension of the latent space
# Input
x = tf.placeholder("float", shape=[None, 28*28]) #Batchsize x Number of Pixels
y_ = tf.placeholder("float", shape=[None, 10])   #Batchsize x 10 (one hot encoded)

# First hidden layer
W_fc1 = weights([784, 500])
b_fc1 = bias([500])
h_1   = tf.nn.softplus(tf.matmul(x, W_fc1) + b_fc1)

# Second hidden layer 
W_fc2 = weights([500, 501]) #501 and not 500 to spot errors
b_fc2 = bias([501])
h_2   = tf.nn.softplus(tf.matmul(h_1, W_fc2) + b_fc2)

接下來就是有趣的地方了，再進入 2 維前它把前面經過低維權重輸出的向量複製兩份，來建立 Gaussian，而且一份做成 mean 一份做成 stddev …
再來把它轉成二維的 code layer

# Parameters for the Gaussian
z_mean = tf.add(tf.matmul(h_2, weights([501, n_z])), bias([n_z]))
z_log_sigma_sq = tf.add(tf.matmul(h_2, weights([501, n_z])), bias([n_z]))

Decoder

Samples from a Gaussian using the given mean and the std. The sampling is done by adding a random number ensuring that backpropagation works fine.

batch_size = 64
eps = tf.random_normal((batch_size, n_z), 0, 1, dtype=tf.float32) # Adding a random number
z = tf.add(z_mean, tf.mul(tf.sqrt(tf.exp(z_log_sigma_sq)), eps))  # The sampled z

W_fc1_g = weights([n_z, 500])
b_fc1_g = bias([500])
h_1_g   = tf.nn.softplus(tf.matmul(z, W_fc1_g) + b_fc1_g)

W_fc2_g = weights([500, 501])
b_fc2_g = bias([501])
h_2_g   = tf.nn.softplus(tf.matmul(h_1_g, W_fc2_g) + b_fc2_g)

x_reconstr_mean = tf.nn.sigmoid(tf.add(tf.matmul(h_2_g,  weights([501, 784])), bias([784])))

Loss function

reconstr_loss = -tf.reduce_sum(x * tf.log(1e-10 + x_reconstr_mean) + (1-x) * tf.log(1e-10 + 1 - x_reconstr_mean), 1)
latent_loss = -0.5 * tf.reduce_sum(1 + z_log_sigma_sq - tf.square(z_mean) - tf.exp(z_log_sigma_sq), 1)
cost = tf.reduce_mean(reconstr_loss + latent_loss)   # average over batch

# Use ADAM optimizer
optimizer =  tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

Appendix B: KL Divergence

由於我們考慮的是各分量獨立的多元正態分布，因此只需要推導一元正態分布的情形即可，根據定義我們可以寫出
$latex \begin{aligned}
& K L\left(N\left(\mu, \sigma^{2}\right) \| N(0,1)\right) \\
=& \int \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^\frac{-(x-\mu)^{2}}{2 \sigma^{2}}\left(\log \frac{e^{-(x-\mu)^{2} / 2 \sigma^{2}} / \sqrt{2 \pi \sigma^{2}}}{e^{-x^{2} / 2} / \sqrt{2 \pi}}\right) d x\\
=& \int \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^\frac{-(x-\mu)^{2}}{2 \sigma^{2}} \log \left\{\frac{1}{\sqrt{\sigma^{2}}} \exp \left\{\frac{1}{2}\left[x^{2}-(x-\mu)^{2} / \sigma^{2}\right]\right\}\right\} d x \\
=& \frac{1}{2} \int \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^\frac{-(x-\mu)^{2}}{2 \sigma^{2}} \left[ -\log \sigma^{2}+x^{2}-(x-\mu)^{2} / \sigma^{2}\right] d x
\end{aligned} $

整個結果分為三項積分，第一項實際上就是 $-\log \sigma^2$ 乘以概率密度的積分（也就是1），所以結果是 $-\log \sigma^2$ ；第二項實際是正態分布的二階矩，熟悉正態分布的朋友應該都清楚正態分布的二階矩為 $\mu^2+\sigma^2$ ；而根據定義，第三項實際上就是「-方差除以方差=-1」。所以總結果就是
$K L\left(N\left(\mu, \sigma^{2}\right) \| N(0,1)\right)=\frac{1}{2}\left(-\log \sigma^{2}+\mu^{2}+\sigma^{2}-1\right)$

Reference

Durr, Oliver. 2016. “VAE Tutorial.” GitHub. November 25, 2016. https://github.com/oduerr/dl_tutorial.

Hinton, G. E., and R. R. Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural Networks.” Science 313 (5786):504–7. https://doi.org/10.1126/science.1127647.

Su, Jianlin. 2018. “Variational Autoencoder 1：原来是这么一回事 – 科学空间|Scientific Spaces.” March 18, 2018.
https://kexue.fm/archives/5253.

原理 → 定律 → 現象：最小作用量用原理

物理學最早從現象的觀察，歸納出定律(law) – 例如光學 Snell's law，牛頓力學定律，萬有引力定律，電磁學的庫倫定律、安培定律、法拉第定律、Lawrence 定律、最終整合成 Maxwell's equation 的電磁學定律。再由這些定律找出更基本，更統一且具有美感的原理。

原理 → 定律 → 現象

一般認為物理學的兩大基本原理是：最小作用原理和對稱性原理。 可以從這兩大基本原理推論出定律。再由定律解釋各種物理現象，如下圖。（相對性原理是對稱性原理的一種？）

具體的操作（至少邏輯上）

Step 1: 找到或是猜到這個系統的 Lagrangian, L. "作用 (action)"一般是 Lagrangian 的時間積分。以古典力學為例
$\mathcal{S}\left[\mathbf{q}, t_{1}, t_{2}\right]=\int_{t_{1}}^{t_{2}} L(\mathbf{q}(t), \dot{\mathbf{q}}(t), t) d t$
where $L = T - V$ , $T$ 是動能， $V$ 是位能。

Step 2A: 使用最小作用原理，用變分法 (calculus of variation) 於 action, 就是用 Euler-Lagrange equation, 得到“自然定律一”，一般是運動方程式。例如應用於古典力學可以得到牛頓力學定律 [@sufeifanDerivationClassic2019]。應用於電磁學可以得到 EM Wave equation. 顯然最小作用原理和 scaling/coupling constant 無關（最小值乘常數或加常數還是最小值），因此需要確認常數和定律的 consistency.

Step 2B: 使用對稱性原理，用 Noether Theorem 找到對應的守恆量，”自然定律二“。例如 time symmetry 對應能量守恆定律，space symmetry 對應動量守恆定律，rotation symmetry 對應角動量守恆定律。電磁場的 gauge symmetry 對應電荷守恆定律。更高級的場論（field theory) 的對稱性對應完整的理論，例如 U(1) 對稱性對應電磁場理論。U(1) x U(1) 對應電弱場理論。

Step 3: 自然定律還要加上 (time) initial condition 和 (space) boundary condition 才能得到或解釋自然現象。例如自由落體和行星運動都遵守牛頓萬有引力定律，但因為 initial condition 而有不同的運動。同樣微波爐和手機都遵守 Maxwell equations, 但因為 boundary condition 而有不同的作用。

最小作用原理幾何化

最小作用量用原理貫穿物理學的發展。一個問題是最小作用原理 (least action principle, LAP or PLA) 比較抽象。一種粗略的說法：作用的抽象觀念（或單位）是 (energy)x(time), 代表大自然不偏愛能量或時間，而是兩者的乘積。但為什麼古典力學 action 積分中的 Lagrangian = (動能-位能)？這並不直觀。再加上作用是對時間的積分。人是視覺的動物，對於空間幾何的掌握高於時間運動。最好能把最小作用原理轉換為幾何原理詮釋，類似最短距離路徑或是最短時間路徑，比較能抓住其中物理意義。

順便一提，希臘的芝諾 (Zino) 和他的老師否定運動存在，所以才提出芝諾悖論：「阿奇里斯追不上烏龜」和「飛矢不動」等等，間接暗示人對時間和運動的主觀性。這也導致後來馬赫、愛因斯坦對於“時空”開創性的看法。

我之前花了一些時間把一些定律和幾何原理連結。其實這也是相對論的基本思想。時空相對性原理 (i.e. Lorentz transform) 就是"時空坐標系“的微分幾何的不變量 (ds). 廣義相對論方程式就是彎曲時空最短路徑原理（geodesic, 測地線)。同樣幾何光學可以轉換成最短時間原理。但只能片段串在一起，沒有貫穿所有理論。

注意，最短時間原理雖然包含時間，但這裡的時間只是 derived parameter, 距離除以速度（或是 momentum, p, 假設質量=1), i.e. $t = l/v \sim q/p$ ，而不是 explicit variable (自變數)。
最短時間原理是找 $\min \int (1/v)dl = \min \int (1/p) dq$ 的極值，並且 $p(q)$ or $v(x)$ 都是位置函數，和時間無關，所以還是一個清晰（且直觀）的幾何原理。

古典力學的最小作用原理或是動力學定律反而比較困難找到幾何原理，因為在 (x(t), t) ”慣性座標系” t 是無所不在的自變數。必須轉成 (p(q), q) “phase space”, t 被隱藏在 p 內，才會被踢出幾何原理。用微分幾何的術語，才能定義微分幾何不變量 ds 和測地線 (geodesic).

Maupertuis (其實應該是 Euler) 原理可以視為幾何化最小作用原理。在歷史上是較早的最小作用原理。不過有一些限制（固定初速和能量守恆，位能只和位置相關），而且不含時間不方便用於解運動方程式。但比較直觀，容易理解，後面會詳細介紹。[@wikiMaupertuisPrinciple2020] and [@fowlerMaupertuisPrinciple2009]

古典力學的最小作用原理 recap

假設考慮粒子在某個平面中從某個初始點和時間 $(x_1, y_1, t_1)$ 到某個最終點和時間 $(x_2, y_2, t_2)$ 的運動。假設其勢能是位置 $V(x, y)$ 的函數。例如，想像一下將高爾夫球在相當顛簸的果嶺上推進洞，但還要求球從被擊中到掉入洞中需要一定的時間，例如兩秒鐘，也就是 $t_2-t_1=7$ 。

一般的最小作用原理可以得出 $(x(t), y(t))$ 參數化的路徑，也就是：
$\mathcal{S}\left[\mathbf{q}, t_{1}, t_{2}\right]=\int_{t_{1}}^{t_{2}} L(\mathbf{q}(t), \dot{\mathbf{q}}(t), t) d t \quad\text{and}\quad \delta{\mathcal{S}} = 0$
限制 time and space boundary condition $(x_1, y_1, t_1)$ and $(x_2, y_2, t_2)$ . 但是不限制（初始）速度，也不要求能量守恆。雖然最後還是能量守恆。
$\Rightarrow$ 解 Euler-Lagrange equation $\Rightarrow \mathbf{q}(t) \Rightarrow \dot{\mathbf{q}}(t)$ , 自然會得到初始速度和終端速度。

例一：1D 自由運動

我們用最簡單的 1D 直線自由運動為例，說明什麼是最小作用原理。
自由運動代表只有動能 $T$ ，位能 $V=0$ 。這是一個 trivial case，就是等速運動。不過推導過程可以得到最小作用原理的 insight。

最小作用原理：固定初始點和時間 $(x_1, t_1)$ 以及最終點和時間 $(x_2, t_2)$ 的運動，也就是固定下圖一的左下點和右上點。在所有可能的路徑函數（不必是遞增或是單調，可以上下自由移動），計算對應的 action, $S$ , 找到 $S$ 的最小值。

注意最小作用原理並沒有假設能量守恆。如果用上能量守恆，這個例子就不用算了。能量守恆是由對稱性原理（T-invariant）得到。

$S = \int_{t_1}^{t_2} L dt = \int_{t_1}^{t_2} (T-V) dt = \int_{t_1}^{t_2} T dt \\\\ \sim \Sigma_{1}^{7} T_i \Delta t \sim (v_1^2 + v_2^2 + ... + v_7^2)$

此處假設 $\Delta t = 1$ and $M=2$ . 我們的目的是在滿足 boundary condition $(x_1, t_1, x_2, t_2)$ 的條件下，找 $S$ 的最小值。對應下圖三的面積積分。

1D 直線運動的 boundary condition 非常簡單，就是
$\int_{t_1}^{t_2} v dt = \int_{t_1}^{t_2} \dot{x} dt = \Sigma_{1}^{7} v_i \Delta t = (v_1 + v_2 + ... + v_7) = (x_2 - x_1)$

對應下圖二的面積積分是一個定值。

使用 Cauchy-Schwarz 不等式
$( v_1^2 + v_2^2 + ... + v_7^2 ) (1^2 + 1^2 + ... + 1^2) \ge (v_1 + v_2 + ... + v_7)^2 = (x_2 - x_1)^2$

等號成立時 $v_1 = v_2 = ... = v_7 = (x_2 - x_1)/7$ ，也就是等速直線運動。

例二：1D 一階位能場運動

加上均勻位能條件： $V(x) = -g x$
同樣 boundary condition $(x_1, t_1, x_2, t_2)$ 的條件，就是
$\int_{t_1}^{t_2} v dt = \int_{t_1}^{t_2} \dot{x} dt = \Sigma_{1}^{7} v_i \Delta t = (v_1 + v_2 + ... + v_7) = (x_2 - x_1)$

因此 action S
$S = \int_{t_1}^{t_2} L dt = \int_{t_1}^{t_2} (T-V) dt = \int_{t_1}^{t_2} T dt - \int_{t_1}^{t_2} V dt \\\\ \sim \Sigma_{1}^{7} T_i \Delta t + \Sigma_{1}^{7} g x_i \Delta t \\\\ \sim ( v_1^2 + v_2^2 + ... + v_7^2 ) + g(x_1 + x_2 + ... + x_7)$

$x_2 = x_1 + v_1 \Delta t = x_1 + v_1$
$x_3 = x_2 + v_2 \Delta t = x_1 + v_1 + v_2$
$x_4 = x_3 + v_3 \Delta t = x_1 + v_1 + v_2 + v_3$
…
$x_7 = x_6 + v_6 \Delta t = x_1 + v_1 + v_2 + v_3 + ... + v_6$

所以 action $S$ 可以進一步寫成
$S \sim ( v_1^2 + v_2^2 + ... + v_7^2 ) + g(x_1 + x_2 + ... + x_7) \\\\ = ( v_1^2 + v_2^2 + ... + v_7^2 )+ g (7 x_1 + 6 v_1 + 5 v_2 + ... + v_6) \\\\ = (v_1+3g)^2 + (v_2+2.5g)^2 + (v_3+2g)^2 + ... + (v_6+0.5g)^2 + (v_7)^2 + c$

同樣用 Cauchy-Schwarz 不等式
$( (v_1+3g)^2 + (v_2+2.5g)^2 + ... + v_7^2 ) (1^2 + 1^2 + ... + 1^2) \ge (v_1 + v_2 + ... + v_7 + c')^2 \\\\ = (x_2 - x_1 + c')^2$

等號成立時 $v_1+3g = v_2+2.5g = ... = v_7$ .
$v_2 = v_1 + 0.5g$
$v_3 = v_1 + 1g$
$v_4 = v_1 + 1.5g$
$v_5 = v_1 + 2g$
$v_6 = v_1 + 2.5g$
$v_7 = v_1 + 3g$
如預期，這是等加速度運動！

雖然我們用 Cauchy-Schwarz 不等式找出自由運動和一階位能場的解，但並沒有什麼物理意義。
二階以及更複雜的位能函數很難用 Cauchi-Schwarz 不等式或是基本的代數求解，需要用到變分法求解。不過我們後面可以用幾何化的最小作用原理，結合(微分)幾何提供一些 insight.

如何消除時間？幾何化“最小作用原理”

接下來我們尋找另一種最小作用原理。這次不固定時間，但是固定了空間的起點和終點，但允許最終時間可能有所變化。也就是 ( $q_1,t_1)$ to $(q_2, \text{any}\, t)$ , 以及給定的起始能量 $E = T + V$ where $E$ 是總能量，包含動能和位能。

實際上我們也假設整個 path 能量守恆。因此沒有時間的 explicit dependency. 一般可以用 time-independent Hamiltonian function $H(x, y, p_x, p_y) = E$ 表示。
因為 $E = p^2/2M + V$ , 而且 $V$ 是已知，因此 $p(x, y) = 2 M \sqrt{E-V}$ 可以視為一個 independent variable. 使用 $(\vec{p}, \vec{q})$ 取代 $(x, y, t)$ , where $\vec{p} = [p_x, p_y]$ and $\vec{q} = [x, y]$ , and $p_x = p_x (\vec{q}) = p_x (x, y)$ , and $p_y = p_y (\vec{q}) = p_y (x, y)$ . $t$ 是隱變數。

這意味我們必須限制在具有固定能量 $E$ 的所有路徑中，改變路徑（variation）to minimize action, 但不限制時間或路徑。回到高爾夫球的例子。在顛簸的果嶺中，我們給球一個固定的初始速率 $v_0$ ，但可以有不同的初始方向讓它進洞。

如果果嶺是平面，所有初速 $v_0$ 也會是定速 $v_0$ 的所有路徑。此時 stationary path 可以是“最短距離路徑”或是“最短時間路徑”，都對應是直線。如果果嶺只有高低兩種位能，對應兩種不同速度：高位能對應低速，低位能對應高速。這時 stationary path 不會是“最短路徑路徑”，比較類似幾何光學的折射路徑。但實際上和幾何光學的 Fermat 原理 (最短時間路徑) 還是不同，後面會說明。

Q. 如何定義新的作用，有比較明確的物理或幾何意義？ Abbreviated action, no time dependency given initial speed. 和原來的 action 不同，更有幾何意義。
Q. 如何定義新的最小作用原理？或者是否還有最小作用原理？ Yes, 就是 Maupertuis 原理.

Abbreviated Action

我們先定義 abbreviated action 如下。假設固定初速以及能量守恆，下式明顯只和路徑有關，和 t 無關，是一種幾何表示。
$\mathcal{S}_{0}=\int_{q_0}^{q_1} \mathbf{p} \cdot d \mathbf{q}=\int_{q_0}^{q_1} \Sigma_i p_{i} d q_{i}$

先檢查 $\mathcal{S}_{0}$ 單位： $p \cdot dq = mv \cdot dq$ => kg m² / sec.
這和一般 action: energy * time = mv² sec = kg (m/sec)² sec = kg m² /sec 一致。

再來 $\mathcal{S}_{0}$ 的積分內，如果 p (or v) 是常數，變成 $dq$ 的積分，也就是距離長度。 $\mathcal{S}_{0}$ 的“最小作用”等價最短距離路徑。

如果是兩個速度(對應兩個位能)，積分包含兩區。 $\mathcal{S}_{0}$ 的“最小作用”不會是直線，而是兩段折線，類似幾何光學的折射。

Abbreviated Action Vs. Action

先定義總能量 function (i.e. Hamiltonian): $H(x, y, p_x, p_y) = E$ 守恆
$E=\sum_{i=1}^{n} \dot{q}_{i} \frac{\partial L}{\partial \dot{q}_{i}}-L \\\\ L = \sum_{i=1}^{n} \dot{q}_{i} \frac{\partial L}{\partial \dot{q}_{i}} - E$

把 Lagrangian 代入 action:
$\mathcal{S} |_{q_0, t_0}^{q_1, t_1} = \int_{t_0}^{t_1} L dt = \int_{t_0}^{t_1} \sum_{i=1}^{n} \dot{q}_{i} \frac{\partial L}{\partial \dot{q}_{i}} dt - E t|_{t_0}^{t_1} \\\\ = \int_{q_0}^{q_1} \sum_{i=1}^{n} \frac{\partial L}{\partial \dot{q}_{i}} d q_i - E t|_{t_0}^{t_1}$

假設 $L= T - V = \frac{1}{2} M \sum_{i=1}^{n} \dot{q}_{i}^{2} - V(q_1, ... q_n)$ . 此處假設 $V$ 位能函數和速度無關¹，上式可以化簡為

$\mathcal{S} |_{q_0, t_0}^{q_1, t_1} = \int_{q_0}^{q_1} \sum_{i=1}^{n} M \dot{q}_i d q_i - E t|_{t_0}^{t_1} \\\\ = \int_{q_0}^{q_1} \sum_{i=1}^{n} p_i d q_i - E t|_{t_0}^{t_1} \\\\ = \mathcal{S}_{0}|_{q_0}^{q_1} - E t|_{t_0}^{t_1}$

也就是說，一般的 action $\mathcal{S}$ 是 Lagrangian 對時間積分 (from $t_0$ to $t_1$ )，顯然和時間相關( $t = t_1 - t_0$ )，甚至和時間有比例關係。但我們可以把 $\mathcal{S}$ 拆解成時間無關，但和路徑相關的 abbreviated action $\mathcal{S}_0$ 減去時間相關的 $E t.\,$ Action 和 abbreviated action 單位都是 (能量)*(時間).

這個分解和波動方程式 (e.g. Maxwell wave function or Schrodinger wave) 的 separation of variables 把 time and spatial 分開有異曲同工之處。不過電磁波動方程式的 separation of variables 是乘法 $\exp(kx-\omega t) = \exp(kx) \exp(-\omega t)$ ，此處則是加法。
如果看量子力學波動方程式更明顯： $\exp\{(i / \hbar)(p x-E t)\}$ 這其實是古典力學到量子力學的連結關鍵。[@fowlerHamiltonPrinciple2010] 另外也是 Feynman's path integral 的基礎
$A=\sum_{\text {paths }} \exp (i S / \hbar)$

我們用一個簡單的例子確認上上式的正確性：Free particle (V=0) in the space. 假設能量守恆, i.e. $L = T - V = T = T + V = E$ 是一個常數 for the entire motion. 以及直線是最小作用的結果路徑。
$\mathcal{S} |_{q_0, t_0}^{q_1, t_1} = \int_{t_0}^{t_1} L dt = \int_{t_0}^{t_1} E dt = E (t_1 - t_0) = \frac{1}{2} M v^2 (t_1 - t_0) = \frac{M}{2} \frac{ (q_1 - q_0)^2}{t_1 - t_0}$

$\mathcal{S}_{0} |_{q_0}^{q_1} = \int_{q_0}^{q_1} p \cdot dq = Mv(q_1 - q_0) = M \frac{ (q_1 - q_0)^2}{t_1 - t_0}$

$E t|_{t_0}^{t_1} = E (t_1 - t_0) = \frac{M}{2} \frac{ (q_1 - q_0)^2}{t_1 - t_0}$

可以簡單驗證 (sanity check): $\mathcal{S} |_{q_0, t_0}^{q_1, t_1} = \mathcal{S}_{0}|_{q_0}^{q_1} - E t|_{t_0}^{t_1}$

Q: 為什麼上式是 $-E t$ 不是 $+E t$ ？
A: 同樣用 free particle 為例。Fixed $q_0, q_1, q_1-q_0=Q$ , 把 $t_0-t_1 =T$ 增加 $dT$ , 最小作用原理會要從 $v = Q / (T+dT)$ 變成 $v \approx Q/T (1-dT/T)$ decrease by $(1-dT/T)$ , 因此會有 negative sign?
其實 $S_0$ 雖然外顯時間無關。但這是假設 p 不變，如果 p 改變， $S_0$ 還是會跟著改變。 (NOT good answer, to be modified!)

Maupertuis 原理

[@fowlerMaupertuisPrinciple2009] 推導如何從最小作用原理得到 Maupertuis 原理, 不過要再推敲一下。

從最小作用原理出發：在固定 $q_0, q_1, t_0, t_1 \Rightarrow \delta \mathcal{S} |_{q_0, t_0}^{q_1, t_1} = 0$
以及 $\delta q_i = 0$ (why?) leaving

$\delta \mathcal{S} |_{q_0}^{q_1} = \sum_{i} p_{i} \delta q_{i}-H \delta t=-E \delta t$

再用 action 和 abbreviated action 的關係式
$\mathcal{S} |_{q_0, t_0}^{q_1, t_1} = \mathcal{S}_{0}|_{q_0}^{q_1} - E t|_{t_0}^{t_1} \\\\ \Rightarrow \mathcal{S}_{0}|_{q_0}^{q_1} = \mathcal{S} |_{q_0, t_0}^{q_1, t_1} + E t|_{t_0}^{t_1} \\\\ \Rightarrow \mathcal{S}_{0}|_{q_0}^{q_1} = \mathcal{S} |_{q_0, t_0}^{q_1, t} + E t|_{t_0}^{t}$

在不固定 $t$ , 但固定 $E$ , i.e. 能量守恆包含固定初始能量(或初速)的情況下，代入 $\delta \mathcal{S} |_{q_0}^{q_1} = -E \delta t$ .

可以得到 Maupertuis 原理.
$\delta \mathcal{S}_0 |_{q_0}^{q_1} = 0$
或是
$\delta \mathcal{S}_{0}=\delta \int_{q_0}^{q_1} \mathbf{p} \cdot d \mathbf{q} = 0$

Maupertuis 原理: 最短距離，最短時間，最小“作用”？

Maupertuis 原理去除時間的相關性(i.e. p is p(q), not p(q,t) assuming initial velocity is fixed)，變成空間原理, 也就是幾何的原理。是否有直觀物理意義？ 3D space 的 free particle (V=0) 最小XX原理一定是直線，無法區分。我們用次簡單的 two potential regions ( $V_L$ and $V_H$ ) on 2D 平面。 $q_0$ 和 $q_1$ 代表起點和終點，位於 low and high potential regions respectively，如下圖。來探討 Maupertuis 原理的物理意義。

Level 1 幾何詮釋：Phase space (p, q) 的曲線

由於能量守恆，high potential 對應 low speed, and vice versa.
橘線是最短距離，當然不是 Maupertuis 原理的路線。只是做為 baseline.

藍線是最短時間，對應的是 $\min \{ \frac{L_0}{v_H} + \frac{L_1}{v_L} \}$ , where $L_0$ and $L_1$ 是藍線在 low potential region 和 high potential region 的長度。因為在 low potential region 的速度大於 high potential region, i.e. $v_H > v_L$ . 藍線直覺上在橘線的左邊，所以粒子可以高速運動距離比較長，雖然整體的距離比較長，整體時間反而比較短。最短時間原理是 Fermat 原理, 幾何光學的基礎，其路徑積分如下。明顯和 Maupertuis 原理不同。
$\mathcal{T}=\int_{q_0}^{q_1} \frac{ds}{v}$

紫線則是最小“作用”，此處的“作用”不是 Lagrange 或是 Hamilton action, 而是 abbreviated action,
$\mathcal{S}_{0}=\int \mathbf{p} \cdot d \mathbf{q}=\int \Sigma p_{i} d q_{i} = M \int_{q_0}^{q_1} v ds.$
這和最短時間原理的邏輯剛好相反。在高速運動的距離比較短，減少整體“作用”的積分。我們可以檢查紫線是否符合古典力學。

應該反過來，用微積分找出商數公式 (appendix)。得到結論是水平方向速度是定值。
我們直接用牛頓力學無外力時的動量守恆說明。牛頓力學的"力"可以視為 potential gradient. 下圖可以看出“力”只存在垂直方向，水平方向沒有 potential gradient, 因此沒有“力”。也因此水平方向的動量守恆。
$v_H \sin(\theta_i) = v_L \sin(\theta_r) \Rightarrow \theta_i < \theta_r \Rightarrow\text{Least Action}$

這和幾何光學的 Snell's Law，或是最短時間原理，數學上剛好相反。
$v_L \sin(\theta_i) = v_H \sin(\theta_r) \Rightarrow \theta_i > \theta_r \Rightarrow\text{Least Time, Snell's Law}$

我們可以整理結論。高位能 → 低速 → 長距離；低位能 → 高速 → 短距離

重新定義 refractive index*

一個簡單的做法是重新定義幾何光學的 refractive index, n. 就可以 reuse 幾何光學的結果。 From Snell's law and the least "action", we can define refractive index n, either n(v) or n(1/v). There is nothing new, but we can reuse geometric optics results to motion.

例一：1D 一階位能場 (p, q)

注意現在沒有 t
1D 一階位能場 ( $V(q) = -g q$ ) 對應自由落體。
能量守恆要求: $\frac{p^2}{2M} + V(q) = \frac{p^2}{2M} - g q = C$
Let $M=1/2$ and $g=1 \Rightarrow p^2 - q = C$

例二：1D 二階位能場 (p, q)

2D 二階位能場 ( $V(q) = k q^2$ ) 對應自由落體。
能量守恆要求: $\frac{p^2}{2M} + V(q) = \frac{p^2}{2M} + k q^2 = C$
Let $M=1/2$ and $k=1 \Rightarrow p^2 + q^2 = C$

1D 的軌跡都是直線運動,(p,q)形成二維平面。給定位能函數和初速，只要用能量守恆就可以得到每一點的動能和動量 p, 用不到 Maupertuis 原理。
2D 的軌跡, (p,q) 形成四維空間。

再來是 2D 平面，如果是 uniform x-direction linear field => 拋物線。如果是重力場 (環形場), 則是圓形或是橢圓形。=> how to derive it from Maupertuis 原理？

Use 2 particles 4-dimension example? or generalized coordinates
$1/2 M_1 p_1^2 + 1/2 M_2 p_2^2 + k M_1 M_2 f(|q_1 - q_2|)$

Level 2: 微分幾何化，向愛因斯坦致敬(Jocobi form) [@fowlerMaupertuisPrinciple2010]

如果空間不存在位能場 ( $V=0$ ), 在此無外力的自由運動軌跡是直線。Maupertuis 原理加上能量守恆 (T=constant) 等價於平直歐氏空間兩點之間的最短距離，如預期答案是直線。我們換一個說法: 無位能場歐氏空間的 geodesic 是直線。基本上這是牛頓第一定律。

當空間存在位能場，古典的看法是位能場的梯度代表外力，這個外力會改變運動的軌跡，不再是直線。而是最小作用原理或是 Maupertuis 原理決定的曲線，如前所述。

另一種不同的看法，來自愛因斯坦的廣義相對論，質量（或能量，質能互換）造成空間彎曲，運動物體的軌跡其實還是“彎曲空間”的最短距離, geodesic. 如果加上時空座標，就變成一個存粹的（黎曼）幾何問題。此處我們不討論廣義相對論，只是借用彎曲空間的觀念致敬愛因斯坦。

位能場和初始能量決定空間的彎曲程度。
- 位能場的影響很直觀：位能梯度代表力，力會改變運動軌跡。此處我們說改變空間曲率。位能梯度越大，對應空間曲率越大。
- 初始能量也會影響運動軌跡。能量越大，軌跡越直線，空間曲率越小。
微分幾何就是處理彎曲空間的工具。不再把空間視為平直的歐氏空間，先定義 metric $g_{uv}$ ；再定義 Christopher symbol, $\Lambda$ , metric 的一階導數；再定義 curvature, 基本是 metric 的二階導數。
Maupertuis 原理，可以視為在這個彎曲空間的最短路徑，即 geodesic. 這是廣義相對論在低能量下的近似解。廣義相對論在時空座標系本身就是彎曲時空的 geodesic. 反而在低能量時要把時間剝離出來，轉成 Maupertuis 原理，對應彎曲空間的 geodesic.

我們用 Landau 方法消除 $dt$ .

$T=\frac{1}{2} \sum a_{i k}(q) \dot{q}_{i} \dot{q}_{k}$
$p_{i} = \sum a_{i k}(q) \dot{q}_{k}$
$\mathcal{S}_{0} = \int \sum_{i} p_{i} d q_{i}=\int \sum_{i} a_{i k} d q_{i} d q_{k} / d t$

$T=E-V(q)=\frac{1}{2} \sum a_{i k}(q) \dot{q}_{i} \dot{q}_{k} = \frac{\sum a_{i k}(q) d q_{i} d q_{k}}{2(d t)^{2}}$

以下這步是關鍵，也是從 $(p, q)$ space 和 $(q, \dot{q}, t)$ space 的轉換公式。
$d t=\sqrt{\sum \frac{a_{i k} d q_{i} d q_{k}}{2(E-V)}}$
$\mathcal{S}_{0} =\int \sum_{i} p_{i} d q_{i}=\int \sum_{i} a_{i k} d q_{i} d q_{k} / d t=\int \sqrt{\left[2(E-V) \sum a_{i k} d q_{i} d q_{k}\right]} = \int ds$

也就是說： $ds = \sum_{i} p_{i} d q_{i} = \mathbf{p}\cdot \mathbf{dq},\,$ where ds 是微分幾何中的 differential arc length. 微分幾何第一式 (first fundamental form):
$latex \begin{aligned}
d s^{2} = 2(E-V) \sum_{j=1}^{n} \sum_{k=1}^{n} a_{j k} d q_{j} d q_{k} &=\sum_{j=1}^{n} \sum_{k=1}^{n} g_{j k} d q_{j} d q_{k}
\end{aligned} $

由此得到 metric (度規), $g_{jk} = 2 (E-V(q)) a_{jk}.$ 也就是位能函數和初始能量帶入微分幾何。空間曲率和 metric 二階微分直接相關。Maupertuis 原理，這個幾何“最小作用”原理就有更清楚的物理意義：彎曲空間兩點的最短距離，也就是 geodesic (測地線)。

完整的 Geodesic equation 如下。 $\Gamma_{j k}^{i}$ 是 Christopher symbol, 由 metric 一階導數構成。

$\frac{d^{2}}{d s^{2}} q^{i}+\Gamma_{j k}^{i} \frac{d}{d s} q^{j} \frac{d}{d s} q^{k}=0$

Lengendre transform between Maupertuis and Least Action Principle

例一：2D 卡氏座標系, $V(x, y)$

的 arc 如下，定義了 metric $g_{xy} = \sqrt{E-V(x,y)}$
$d s^2 = (E-V(x,y)) (dx^2 + dy^2)$
E 是一個常數，E-V 就是動能。
How to derive a parabola from above?

例二：Keplerian System ( $V = -\frac{GM}{r}$ ) [@biesiadaEpicyclicOrbital2003]

$d s^{2}=2\left(E+\frac{GM}{r}\right)\left(d r^{2}+r^{2} d \theta^{2}\right) = f(r)^2 \left(d r^{2}+r^{2} d \theta^{2}\right)$ where $f(r) = \sqrt{2(E-V(r)}$

比較一下，相對論的 (3D) Schwarzschild metric 如下。
$d s^{2}=\left(1-\frac{2 G M}{c^{2} r}\right) c^{2} d t^{2}-\frac{1}{\left(1-\frac{2 G M}{c^{2} r}\right)} d r^{2}-r^{2}\left(d \theta^{2}+\sin ^{2} \theta d \varphi^{2}\right)$
在極限 ( $v \sim f(r) \le c$ ) 接近 2D 非相對論的軌跡。

完整的 geodesic equation:
$\frac{d^{2} x^{\mu}}{d s^{2}}+\Gamma_{\alpha \beta}^{\mu} \frac{d x^{\alpha}}{d s} \frac{d x^{\beta}}{d s}=0$

In this case the geodesic deviation equation (for an orthogonal Jacobi field ξ of geodesic deviation $g(ξ, u) = 0$ ) reads:
$\frac{d^{2}}{d s^{2}} \xi^{i}+K_{G} \xi^{i}=0$ where $K_G$ is Gaussian curvature of respective Maupertuis-Jacobi manifold, and
$K_{G}=-\frac{1}{f(r)^{2} r} \frac{d}{d r}\left(\frac{1}{f(r)} \frac{d(r f(r))}{d r}\right)=-\frac{E G M}{4(r E+G M)^{3}}$
因為 $T = E - V(r) = E + GM/r > 0$ , 上式的分母大於 0. 因此分子 total energy $E$ 決定 $K_G$ 為正或負。 $E > 0 \rightarrow K_G < 0$ 對應雙曲線，例如一次性彗星。相反 $E < 0 \rightarrow K_G > 0$ 對應圓形或橢圓 bound motion. How to solve the geodesic equation?

$latex \forall \quad E>-\frac{GM}{r} \quad\left\{\begin{array}{llll}
E<0 & \Rightarrow & K_{G}>0 & ; & \text { ellipse } \\
E=0 & \Rightarrow & K_{G}=0 & ; & \text { parabola } \\
E>0 & \Rightarrow & K_{G}<0 \quad ; & & \text { hyperbola }
\end{array}\right. $

考慮圓形軌道半徑 $R$ and $E = \frac{V(R)}{2} = -\frac{GM}{2R} \Rightarrow K_G = \frac{1}{GMR}$ . 高斯曲率和半徑成反比符合預期。

科學空間：處處皆幾何 (力學幾何化)

偶然看到 [@suUnderstandRiemann2016] 也有一樣的想法，以及類似的文章！摘錄做為參考。另外科學空間有很多文章值得一看 (ML, NLP, etc.)。和我的興趣非常相近。

以下是重點：

黎曼度量提供了一個變分原理 → What？
那反過來，一個變分原理，能不能提供一個黎曼度量呢？眾所周知，不少學科的基礎原理都可以歸結為一個極值原理，而有了極值原理就不難導出變分原理（泛函極值），如物理中就有最小作用量原理、最小勢能原理，概率論中有最大熵原理，等等 → OK
如果有一個將變分原理導出黎曼度量的方法，那麼就可以用幾何的方式來描述它。幸運的是，對於二次型的變分原理，是可以做到的 → How?

我們看具體的 2D 例子。BTW, 必須是 2D 或以上。 1D 的幾何只有直線。這裡用 general 2D 系統（假設 M=1），並且位能只是位置函數。

$S=\int\left\{\frac{1}{2}\left[\left(\frac{d x}{d t}\right)^{2}+\left(\frac{d y}{d t}\right)^{2}\right]-U(x, y)\right\} d t$

這個變分帶入 Euler-Lagrange, 可以得到牛頓運動方程式
$\frac{d^{2} x}{d t^{2}}=-\frac{\partial U}{\partial x}, \quad \frac{d^{2} y}{d t^{2}}=-\frac{\partial U}{\partial y}$

由於是保守系統，那麼滿足能量守恆
$\frac{1}{2}\left[\left(\frac{d x}{d t}\right)^{2}+\left(\frac{d y}{d t}\right)^{2}\right]+U=E$

代入第一式
$S=\int\left[\left(\frac{d x}{d t}\right)^{2}+\left(\frac{d y}{d t}\right)^{2}-E\right] d t$

再來他的推導顯然不分 $S$ and $S_0$ , 結果如下：
$d s^{2}=(E-U)\left(d x^{2}+d y^{2}\right)$

也就是 Riemannian geometry metric is (E-V(x,y))

結論：
以上結果可以一般化，即如下作用量的保守系統
$S=\int\left[\frac{1}{2} g_{\mu \nu} \frac{d x^{\mu}}{d t} \frac{d x^{\nu}}{d t}-U(x)\right] d t$

的能量為E的運動曲線的形狀（相軌線），等價於黎曼度量
$d s^{2}=[E-U(\boldsymbol{x})] g_{\mu \nu} d x^{\mu} d x^{\nu}$
下的測地線，其推導過程是類似的。這樣我們就實現了將力學的問題幾何化，或者說將二次型的變分問題幾何化了。可能讓人意外的是，以上結果在1837年就由雅可比完成了。

上面結果告訴我們，廣義相對論不再是黎曼幾何在物理中的代名詞，即便沒有廣義相對論，物理中也有黎曼幾何。將力學幾何化，有助於我們將力學、場論等與幾何聯繫起來。黎曼幾何其實就是一套幾何的研究框架，只要能夠對應地轉換，就可以直接應用黎曼幾何的很多結論，並且可能導出更豐富、更全面的內容。

Why Not Maupertuis 原理取代最小作用原理？

Maupertuis advantage
Problem

Limited scope
Vector forms, hard to deal with. LAP is scalar base
Lack of dynamic, only trajectory. Need to re-calculate dynamic some way!
Cannot derive common Newton's laws!!
So we use LAP

Galileo Invariant

x' = x – ut => dx' = dx – u
y' = y => dy' = dy
dx'² + dy'² = dx² + dy² – 2u dx + u²
[dx' dy' 1] = [1 0 -u; 0 1 0][dx dy 1]
TBD

Appendix

Appendix A: Hamilton and Maupertuis 最小作用原理

我後來發現 Gray 有一篇非常好的文章 [@grayPrincipleLeast2009] 說明並解釋 Hamilton 最小作用原理，和 Maupertuis 最小作用原理！

現今最常使用的是 Hamilton 原理。 Hamilton action $S$ is defined as an integral along any actual or virtual (conceivable or trial) space-time trajectory $q(t)$ connecting two specified space-time events, initial event $A≡(q_A,t_A=0)$ and final event $B≡(q_B,t_B=T),$
$S=\int_{0}^{T} L(q, \dot{q}) d t$

有無數條運動軌跡滿足給定 $A$ 和 $B$ 限制 (fixed end-positions and travel time). 為了強調特定的限制在軌跡，Hamilton 原理可以描述如下
$(\delta S)_{T}=0$
where the constraint of fixed travel time $T$ is written explicitly, and the constraint of fixed end-positions $q_A$ and $q_B$ is left implicit. 接下來我們考慮其他的最小作用原理，但仍然維持 fixed $q_A$ and $q_B$ so we will always leave the constraint of fixed $q_A$ and $q_B$ implicit.

第二個最小作用原理是 Maupertuis action $W$ (就是上文的 abbreviated action, $S_0$ ), where
$W=\int_{q_{A}}^{q_{B}} p \cdot d q=\int_{0}^{T} 2 K d t$

where the first (time-independent) form is the general definition, with $p=\partial L / \partial \dot{q},$ the canonical momentum (equal to the ordinary momentum in many cases of interest), and $p \cdot dq$ stands for $p_1 dq_1 +p_2 dq_2 +…+p_f dq_f$ in general. The second (time-dependent) form for $W$ is valid for normal systems in which the kinetic energy $K$ is quadratic in the velocity components $\dot{q}_1,\dot{q}_2,…\dot{q}_f.$ 雖然看起來第二式 $K$ 比較簡單，但因為 time-dependent, 不如直接或是應該直接訴諸 Hamilton 原理。

Maupertuis 原理描述所有軌跡 (time-independent）with fixed end positions $q_A$ and $q_B$ and fixed energy $E=K+V$ , $W$ is stationary, 可以寫為
$(\delta W)_{E}=0$

注意 Maupertuis 原理 $E$ is fixed but $T$ is not. Hamilton 原理剛好相反， $T$ is fixed but $E$ is not.

Solution of the variational problem posed by Hamilton's principle yields the true trajectories $q(t).$

Solution of Maupertuis' variational equation using the time-dependent (second) form of $W$ also yields the true trajectories, whereas using the time-independent (first) form of W yields (in multi-dimensions) true orbits, i.e. spatial shape of the true paths. In the latter case, in two and more dimensions, the action W can be rewritten as an integral involving the arc length along the orbit (Jacobi's form), and the problem then resembles a geodesic or reciprocal iso-perimetric problem (Lanczos 1970).

In one dimension, for simplicity assume $q$ is an inertial Cartesian coordinate and note that since the momentum $p$ is a simple function of $q$ in the first form for $W$ due to the constraint of fixed energy $E=p^2 /2M+V(q)$ , the only freedom for variation is instantaneous momentum reversals, so that the principle is essentially uninformative, i.e., provides essentially no information beyond what is contained in the assumed conservation of energy. In one dimension, one can find the one or more true time trajectories $q(t)$ from energy conservation and the two end-position values. (和我想的一樣)

In all cases the solutions for the true trajectories and orbits can be obtained directly from the Hamilton and Maupertuis variational principles or from the solution of the corresponding Euler-Lagrange differential equations which are equivalent to the variational principles.

Hamilton's principle is applicable to both conservative systems and nonconservative systems where the Lagrangian $L(q, \dot{q} ,t)$ is explicitly time-dependent (e.g. due to a time-dependent potential $V(q,t)$ ), whereas Maupertuis' principle is restricted to conservative systems.

An appealing feature of the action principles is its brevity and elegance in expressing the laws of motion. They are valid for any choice of coordinates (i.e., they are covariant), and readily yield conservation laws from symmetries of the system. Action principles transcend classical particle and rigid body mechanics and extend naturally to other branches of physics such as continuum mechanics, relativistic mechanics, quantum mechanics, and field theory, and thus play a unifying role.

Relation of Hamilton and Maupertuis Principles (Legendre Transform)

For conservative (time-invariant) systems the Hamilton and Maupertuis principles are related by a Legendre transformation. Recall first that the Lagrangian $L(q,\dot{q})$ and Hamiltonian $H(q,p)$ are so-related, i.e.
$H(q, p)=p \dot{q}-L(q, \dot{q})$

上式兩邊積分 with respect to t along an arbitrary virtual or trial trajectory between two points $q_A$ and $q_B$ ,
$S = W - \bar{E} T$
where $\bar{E} \equiv \int_{0}^{T} H d t / T$ is the mean energy along the trajectory. 對於 conservative system, with $\bar{E} = E =$ constant.

$S = W - E T$

From the Legendre transformation between $S$ and $W$ , for conservative systems one can derive Hamilton's principle from Maupertuis' principle, and vice-versa. The two action principles are thus equivalent for conservative systems, and related by a Legendre transformation whereby one changes between energy and time as independent constraint parameters.

We again restrict the discussion to time-invariant (conservative) systems. If we vary the trial trajectory $q(t)$ , with no variation in end positions $q_A$ and $q_B$ but allowing a variation in end-time T, the corresponding variations $δS, δW, δ\bar{E}, δT$ for an arbitrary trial trajectory are seen to be related by
$\delta S+\bar{E} \delta T=\delta W-T \delta \bar{E}$

Next one can show (Gray et al. 1996a) that the two sides separately vanish for variations around a true trajectory. The left side then gives $δS+EδT=0$ , since $\bar{E}=E$ (a constant) on a true trajectory for conservative systems, which is called the unconstrained Hamiltonian principle. This can be written in the standard form for a variational relation with a relaxed constraint
$\delta S=\lambda \delta T$

where $λ$ is a constant Lagrange multiplier, here determined as $λ=−E$ (negative of energy of the true trajectory). If we constrain $T$ to be fixed for all trial trajectories, then $δT=0$ and we have $(δS)_T = 0$ , the usual Hamilton principle. If instead we constrain $S$ to be fixed we get $(δT)_S =0$ , the so-called reciprocal Hamilton principle.

The right side gives $δW − T δ\bar{E} =0$ , which is called the unconstrained Maupertuis principle, which can also be written in the standard form of a variational principle with a relaxed constraint, i.e. $δW=λδ\bar{E}$ where $λ=T$ (duration of true trajectory) is a constant Lagrange multiplier. If we constrain $\bar{E}$ to be fixed for the trial trajectories, we get $(δW)\bar{E}=0$ , which is a generalization of Maupertuis' principle; we see that the constraint of fixed energy can be relaxed to one of fixed mean energy. If instead we constrain W to be fixed, we get $(\delta \bar{E})_{W}=0$ which is called the reciprocal Maupertuis principle. In these various generalizations of Maupertuis' principle, conservation of energy is a consequence of the principle for time-invariant systems, whereas conservation of energy is an assumption of the original Maupertuis principle.

這對於一般位能場成立，但對於例如電磁場就不成立，因為電磁力和速度相關。 ↩

Edge AI: BERT for Text Classification I

Text classification 是 NLP 最基本的任務，應用廣泛。例如 sentiment analysis, spam filtering, news categorization, etc. 本文聚焦 fake news detection (classifying a news as REAL or FAKE) 任務。

此處我們 detect fake news using the state-of-the-art models, 也就是 BERT. 這個 tutorial 可以 extended to really any text classification task.

The Transformer is the basic building block of most current state-of-the-art architectures of NLP. Its primary advantage is its multi-head attention mechanisms which allow for an increase in performance and significantly more parallelization than previous competing models such as recurrent neural networks.

In this tutorial, we will use pre-trained BERT, one of the most popular transformer models, and fine-tune it on fake news detection. [@chengBERTText2020]

Problem Statement

我們用 Kaggle "REAL and FAKE news dataset" (https://www.kaggle.com/nopdev/real-and-fake-news-dataset).

Dataset shape 7796×4, 1st column ID, 2nd column Title, 3rd column Text, 4th column Label. or 6335×4?

Step 1: Preprocess Dataset

The raw data 如下：[@chengLSTMText2020]

The dataset contains an arbitrary index, title, text, and the corresponding label.

import pandas as pd
from sklearn.model_selection import train_test_split

raw_data_path = './Data/news.csv'
destination_folder = './Data'

train_test_ratio = 0.10
train_valid_ratio = 0.80

first_n_words = 200

def trim_string(x):
    x = x.split(maxsplit=first_n_words)
    x = ' '.join(x[:first_n_words])
    return x

For preprocessing, 使用 Pandas and Sklearn 分為 train/valid/test csv files.
我們使用 trim_string function which will be used to cut each sentence to the first first_n_words words (e.g. 200). Trimming the samples in a dataset is not necessary but it enables faster training for heavier models and is normally enough to predict the outcome.

# Read raw data
df_raw = pd.read_csv(raw_data_path)

# Prepare columns
df_raw['label'] = (df_raw['label'] == 'FAKE').astype('int')
df_raw['titletext'] = df_raw['title'] + ". " + df_raw['text']
df_raw = df_raw.reindex(columns=['label', 'title', 'text', 'titletext'])

# Drop rows with empty text
df_raw.drop( df_raw[df_raw.text.str.len() < 5].index, inplace=True)

# Trim text and titletext to first_n_words
df_raw['text'] = df_raw['text'].apply(trim_string)
df_raw['titletext'] = df_raw['titletext'].apply(trim_string) 

# Split according to label
df_real = df_raw[df_raw['label'] == 0]
df_fake = df_raw[df_raw['label'] == 1]

# Train-test split
df_real_full_train, df_real_test = train_test_split(df_real, train_size = train_test_ratio, random_state = 1)
df_fake_full_train, df_fake_test = train_test_split(df_fake, train_size = train_test_ratio, random_state = 1)

# Train-valid split
df_real_train, df_real_valid = train_test_split(df_real_full_train, train_size = train_valid_ratio, random_state = 1)
df_fake_train, df_fake_valid = train_test_split(df_fake_full_train, train_size = train_valid_ratio, random_state = 1)

# Concatenate splits of different labels
df_train = pd.concat([df_real_train, df_fake_train], ignore_index=True, sort=False)
df_valid = pd.concat([df_real_valid, df_fake_valid], ignore_index=True, sort=False)
df_test = pd.concat([df_real_test, df_fake_test], ignore_index=True, sort=False)

# Write preprocessed data
df_train.to_csv(destination_folder + '/train.csv', index=False)
df_valid.to_csv(destination_folder + '/valid.csv', index=False)
df_test.to_csv(destination_folder + '/test.csv', index=False)

Step 2: Libraries

We import Pytorch for model construction, torchText for loading data, matplotlib for plotting, and sklearn for evaluation.

Step 3: Load Dataset and Tokenizing

參考 [@chengBERTText2020]

tokenizer是一個將純文本轉換為編碼的過程，該過程不涉及將詞轉換成為詞向量，僅僅是對純文本進行分詞，並且添加[MASK]、[SEP]、[CLS] 等等標記，然後將這些詞轉換為字典索引。

如上例所示，[CLS] 一般會被放在輸入序列的最前面，而 zero padding 在之前的 Transformer 文章裡已經有非常詳細的介紹。[MASK] token 一般在 fine-tuning 或是 feature extraction 時不會用到，這邊只是為了展示預訓練階段的克漏字任務才使用的。

For the tokenizer, we use the “bert-base-uncased” version of BertTokenizer.

另外我們用 TorchText, we first create the Text Field and the Label Field. The Text Field will be used for containing the news articles and the Label is the true target. We limit each article to the first 128 tokens for BERT input. 利用 Field 中的 encoder.

最後 we create a TabularDataset from our dataset csv files using the two Fields to produce the train, validation, and test sets. Then we create Iterators to prepare them in batches.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Model parameter
MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

# Fields

label_field = Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)
text_field = Field(use_vocab=False, tokenize=tokenizer.encode, lower=False, include_lengths=False, batch_first=True,
                   fix_length=MAX_SEQ_LEN, pad_token=PAD_INDEX, unk_token=UNK_INDEX)
fields = [('label', label_field), ('title', text_field), ('text', text_field), ('titletext', text_field)]

# TabularDataset

train, valid, test = TabularDataset.splits(path=source_folder, train='train.csv', validation='valid.csv',
                                           test='test.csv', format='CSV', fields=fields, skip_header=True)

# Iterators

train_iter = BucketIterator(train, batch_size=16, sort_key=lambda x: len(x.text),
                            device=device, train=True, sort=True, sort_within_batch=True)
valid_iter = BucketIterator(valid, batch_size=16, sort_key=lambda x: len(x.text),
                            device=device, train=True, sort=True, sort_within_batch=True)
test_iter = Iterator(test, batch_size=16, device=device, train=False, shuffle=False, sort=False)

Step 4: Build Model

class BERT(nn.Module):

    def __init__(self):
        super(BERT, self).__init__()

        options_name = "bert-base-uncased"
        self.encoder = BertForSequenceClassification.from_pretrained(options_name)

    def forward(self, text, label):
        loss, text_fea = self.encoder(text, labels=label)[:2]

        return loss, text_fea

We are using the “bert-base-uncased” version of BERT, which is the smaller model trained on lower-cased English text (with 12-layer, 768-hidden, 12-heads, 110M parameters).

Step 5: Training

# Save and Load Functions

def save_checkpoint(save_path, model, valid_loss):

    if save_path == None:
        return
    
    state_dict = {'model_state_dict': model.state_dict(),
                  'valid_loss': valid_loss}
    
    torch.save(state_dict, save_path)
    print(f'Model saved to ==> {save_path}')

def load_checkpoint(load_path, model):
    
    if load_path==None:
        return
    
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')
    
    model.load_state_dict(state_dict['model_state_dict'])
    return state_dict['valid_loss']


def save_metrics(save_path, train_loss_list, valid_loss_list, global_steps_list):

    if save_path == None:
        return
    
    state_dict = {'train_loss_list': train_loss_list,
                  'valid_loss_list': valid_loss_list,
                  'global_steps_list': global_steps_list}
    
    torch.save(state_dict, save_path)
    print(f'Model saved to ==> {save_path}')


def load_metrics(load_path):

    if load_path==None:
        return
    
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')
    
    return state_dict['train_loss_list'], state_dict['valid_loss_list'], state_dict['global_steps_list']

We use Adam optimizer and a suitable learning rate to tune BERT for 5 epochs.
We use BinaryCrossEntropy as the loss function since fake news detection is a two-class problem. Make sure the output is passed through Sigmoid before calculating the loss between the target and itself.

Step 6: Evaluation

# Evaluation Function

def evaluate(model, test_loader):
    y_pred = []
    y_true = []

    model.eval()
    with torch.no_grad():
        for (labels, title, text, titletext), _ in test_loader:

                labels = labels.type(torch.LongTensor)           
                labels = labels.to(device)
                titletext = titletext.type(torch.LongTensor)  
                titletext = titletext.to(device)
                output = model(titletext, labels)

                _, output = output
                y_pred.extend(torch.argmax(output, 1).tolist())
                y_true.extend(labels.tolist())
    
    print('Classification Report:')
    print(classification_report(y_true, y_pred, labels=[1,0], digits=4))
    
    cm = confusion_matrix(y_true, y_pred, labels=[1,0])
    ax= plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax, cmap='Blues', fmt="d")

    ax.set_title('Confusion Matrix')

    ax.set_xlabel('Predicted Labels')
    ax.set_ylabel('True Labels')

    ax.xaxis.set_ticklabels(['FAKE', 'REAL'])
    ax.yaxis.set_ticklabels(['FAKE', 'REAL'])
    
best_model = BERT().to(device)

load_checkpoint(destination_folder + '/model.pt', best_model)

evaluate(best_model, test_iter)

After evaluating our model, we find that our model achieves an impressive accuracy of 96.99%! 比起同樣的 LSTM model 只有 77.5%!

Reference

Cheng, Raymond. 2020a. “BERT Text Classification Using Pytorch.” Medium.
July 22, 2020.
https://towardsdatascience.com/bert-text-classification-using-pytorch-723dfb8b6b5b.

———. 2020b. “LSTM Text Classification Using Pytorch.” Medium. July 22,
2020.
https://towardsdatascience.com/lstm-text-classification-using-pytorch-2c6c657f8fc0.

Edge AI: BERT for 分詞

中文自然語言處理的其中一個重要環節就是斷詞的處理。英文字詞基本相同，斷詞非常容易，直接用 space 就可以斷字或斷詞。中文斷詞在先天上就比較難處理，比如電腦要怎麼知道「全台大停電」要斷詞成「全台 / 大 / 停電」或是「全／台大 / 停電」呢？如果是英文「Power outage all over Taiwan」，就可以直接用空白斷成「Power / outage / all / over / Taiwan」，可見中文斷詞是一個大問題。因為斷詞有歧義的可能如上，所以後面用分詞來替代斷詞。英文也會有姓名，專有名詞，復合名詞需要分詞，不過比起中文分詞還是相對容易。

傳統中文分詞

中文分詞根據實現原理和特點，主要分為以下2個類別：

1、基於詞典分詞算法

也稱字符串匹配分詞算法。該算法是按照一定的策略將待匹配的字符串和一個已建立好的「充分大的」詞典中的詞進行匹配，若找到某個詞條，則說明匹配成功，識別了該詞。常見的基於詞典的分詞算法分為以下幾種：正向最大匹配法、逆向最大匹配法和雙向匹配分詞法等。

基於詞典的分詞算法是應用最廣泛、分詞速度最快的。很長一段時間內研究者都在對基於字符串匹配方法進行優化，比如最大長度設定、字符串存儲和查找方式以及對於詞表的組織結構，比如採用TRIE索引樹、Hash 索引等。

2、基於統計的機器學習算法

這類目前常用的是算法是HMM、CRF (Conditional Random Field)、SVM、深度學習等算法，比如 Stanford、Hanlp 分詞工具是基於 CRF 算法。以 CRF 為例，基本思路是對漢字進行標注訓練，不僅考慮了詞語出現的頻率，還考慮上下文，具備較好的學習能力，因此其對歧義詞和未登錄詞的識別都具有良好的效果。

例如 JIEBA 中文分詞所使用的演算法是基於 TRIE TREE 結構去生成句子中中文字所有可能成詞的情況，然後使用動態規劃（DYNAMIC PROGRAMMING）算法來找出最大機率的路徑，這個路徑就是基於詞頻的最大分詞結果。對於辨識新詞（字典詞庫中不存在的詞）則使用了 HMM 模型及 VITERBI 算法來辨識出來。基本上這樣就可以完成具有分詞功能的程式了。

分詞器的挑戰

目前中文分詞難點主要有三個：

分詞標準：比如人名，在哈工大的標準中姓和名是分開的，但在Hanlp中是合在一起的。這需要根據不同的需求制定不同的分詞標準。這一般不是問題。
歧義：對同一個待切分字符串存在多個分詞結果。歧義又分為組合型歧義、交集型歧義和真歧義三種類型。
- 組合型歧義：分詞是有不同的粒度的，指某個詞條中的一部分也可以切分為一個獨立的詞條。比如「中華人民共和國」，粗粒度的分詞就是「中華人民共和國」，細粒度的分詞可能是「中華/人民/共和國」
- 交集型歧義：在「鄭州天和服裝廠」中，「天和」是廠名，是一個專有詞，「和服」也是一個詞，它們共用了「和」字。
- 真歧義：本身的語法和語義都沒有問題, 即便採用人工切分也會產生同樣的歧義，只有通過上下文的語義環境才能給出正確的切分結果。例如：對於句子「美國會通過對台售武法案」，既可以切分成「美國/會/通過對台售武法案」，又可以切分成「美/國會/通過對台售武法案」。
新詞：也稱未被詞典收錄的詞，該問題的解決依賴於人們對分詞技術和漢語語言結構的進一步認識。

BERT 如何處理分詞

Bert並沒有使用分詞工具，是採用token級別進行輸入的，簡單來說就是字級別，具體參見tokenization.py 這個文件。[@leeBERTShiRuHeFenCiDe2019]

tokenization.py 就是預處理進行分詞的程序，主要有兩個分詞器：BasicTokenizer和WordpieceTokenizer，另外一個FullTokenizer是這兩個的結合：先進行BasicTokenizer得到一個分得比較粗的 token 列表，然後再對每個 token 進行一次 WordpieceTokenizer，得到最終的分詞結果。

BERT 源碼中 tokenization.py 就是預處理進行分詞的程序，主要有兩個分詞器：BasicTokenizer 和 WordpieceTokenizer，另外一個 FullTokenizer 是這兩個的結合：先進行 BasicTokenizer 得到一個分得比較粗的 token 列表，然後再對每個 token 進行一次 WordpieceTokenizer，得到最終的分詞結果。

為了能直觀看到每一步處理效果，我會用下面這個貫穿始終的例子來說明，該句修改自 Keras 的維基百科介紹：

example = "Keras是ONEIROS（Open-ended Neuro-Electronic Intelligent Robot Operating System，開放式神經電子智能機器人操作系統）項目研究工作的部分產物[3]，主要作者和維護者是Google工程師François Chollet。\r\n"

BasicTokenizer

BasicTokenizer（以下簡稱 BT）是一個初步的分詞器。對於一個待分詞字符串，流程大致就是轉成 unicode -> 去除各種奇怪字符 -> 處理中文 -> 空格分詞 -> 去除多餘字符和標點分詞 -> 再次空格分詞，結束。

大致流程就是這樣，還有很多細節，下面我依次說下。

轉成 unicode

轉成 unicode 這步對應於 convert_to_unicode(text) 函數，很好理解，就是將輸入轉成 unicode 字符串，如果你用的 Python 3 而且輸入是 str 類型，那麼這點無需擔心，輸入和輸出一樣；如果是 Python 3 而且輸入類型是 bytes，那麼該函數會使用 text.decode("utf-8", "ignore") 來轉成 unicode 類型。如果你用的是 Python 2，那麼請看 Sunsetting Python 2 support 和 Python 2.7 Countdown ，Just drop it。

經過這步後，example 和原來相同：

>>> example = convert_to_unicode(example)
>>> example
'Keras是ONEIROS（Open-ended Neuro-Electronic Intelligent Robot Operating System，開放式神經電子智能機器人操作系統）項目研究工作的部分產物[3]，主要作者和維護者是Google工程師François Chollet。\r\n'

去除各種奇怪字符

去除各種奇怪字符對應於 BT 類的 _clean_text(text) 方法，通過 Unicode 碼位（Unicode code point，以下碼位均指 Unicode 碼位）來去除各種不合法字符和多餘空格，包括：

碼位為 0 的 \x00，即空字符（Null character），或叫結束符，肉眼不可見，屬於控制字符，一般在字符串末尾。注意不是空格，空格的碼位是 32。
碼位為 0xfffd（十進制 65533）的 �，即替換字符)（REPLACEMENT CHARACTER），通常用來替換未知、無法識別或者無法表示的字符。
除 \t、\r 和 \n 以外的控制字符（Control character），即 Unicode 類別是 Cc 和 Cf 的字符。可以使用 unicodedata.category(c) 來查看 c 的 Unicode 類別。代碼中用 _is_control(char) 來判斷 char 是不是控制字符。
將所有空白字符轉換為一個空格，包括標準空格、\t、\r、\n 以及 Unicode 類別為 Zs 的字符。代碼中用 _is_whitespace(char) 來判斷 char 是不是空白字符。

經過這步後，example 中的 \r\n 被替換成兩個空格：

>>> example = _clean_text(example)
>>> example
'Keras是ONEIROS（Open-ended Neuro-Electronic Intelligent Robot Operating System，開放式神經電子智能機器人操作系統）項目研究工作的部分產物[3]，主要作者和維護者是Google工程師François Chollet。  '

處理中文

處理中文對應於 BT 類的 _tokenize_chinese_chars(text) 方法。對於 text 中的字符，首先判斷其是不是「中文字符」（關於中文字符的說明見下方引用塊說明），是的話在其前後加上一個空格，否則原樣輸出。那麼有一個問題，如何判斷一個字符是不是「中文」呢？

_is_chinese_char(cp) 方法，cp 就是剛才說的碼位，通過碼位來判斷，總共有 81520 個字，詳細的碼位範圍如下（都是閉區間）：

[0x4E00, 0x9FFF]：十進制 [19968, 40959]
[0x3400, 0x4DBF]：十進制 [13312, 19903]
[0x20000, 0x2A6DF]：十進制 [131072, 173791]
[0x2A700, 0x2B73F]：十進制 [173824, 177983]
[0x2B740, 0x2B81F]：十進制 [177984, 178207]
[0x2B820, 0x2CEAF]：十進制 [178208, 183983]
[0xF900, 0xFAFF]：十進制 [63744, 64255]
[0x2F800, 0x2FA1F]：十進制 [194560, 195103]

經過這步後，中文被按字分開，用空格分隔，但英文數字等仍然保持原狀：

>>> example = _tokenize_chinese_chars(example)
>>> example
'Keras 是 ONEIROS（Open-ended Neuro-Electronic Intelligent Robot Operating System， 开  放  式  神  经  电  子  智  能  机  器  人  操  作  系  统 ） 项  目  研  究  工  作  的  部  分  产  物 [3]， 主  要  作  者  和  维  护  者  是 Google 工  程  师 François Chollet。  '

空格分詞

空格分詞對應於 whitespace_tokenize(text) 函數。首先對 text 進行 strip() 操作，去掉兩邊多餘空白字符，然後如果剩下的是一個空字符串，則直接返回空列表，否則進行 split() 操作，得到最初的分詞結果 orig_tokens。

經過這步後，example 變成一個列表：

>>> example = whitespace_tokenize(example)
>>> example
['Keras', '是', 'ONEIROS（Open-ended', 'Neuro-Electronic', 'Intelligent', 'Robot', 'Operating', 'System，', '開', '放', '式', '神', '經', '電', '子', '智', '能', '機', '器', '人', '操', '作', '系', '統', '）', '項', '目', '研', '究', '工', '作', '的', '部', '分', '產', '物', '[3]，', '主', '要', '作', '者', '和', '維', '護', '者', '是', 'Google', '工', '程', '師', 'François', 'Chollet。']

去除多餘字符和標點分詞

接下來是針對 orig_tokens 的分詞結果進一步處理，代碼如下：

for token in orig_tokens:
      if self.do_lower_case:
        token = token.lower()
        token = self._run_strip_accents(token)
      split_tokens.extend(self._run_split_on_punc(token))

邏輯不複雜，先轉成小寫。這裡主要說下 _run_strip_accents 和 _run_split_on_punc。

_run_strip_accents(text) 方法用於去除 accents，即變音符號，那麼什麼是變音符號呢？像 Keras 作者 François Chollet 名字中些許奇怪的字符 ç、簡歷的英文 résumé 中的 é 和中文拼音聲調 á 等，這些都是變音符號 accents，維基百科中描述如下：

附加符號或稱變音符號（diacritic、diacritical mark、diacritical
point、diacritical sign），是指添加在字母上面的符號，以更改字母的
發音或者以區分拼寫相似詞語。例如漢語拼音字母"ü"上面的兩個小點，
或"á"、"à"字母上面的標調符。

_run_strip_accents(text) 方法就是要把這些 accents 去掉，例如 François Chollet 變成 Francois Chollet，résumé 變成 resume，á 變成 a。

_run_split_on_punc(text) 是標點分詞，按照標點符號分詞。

經過這步後，原先沒有被分開的字詞標點（例如 ONEIROS（Open-ended）、沒有去掉的變音符號（例如 ç）都被相應處理：

>>> example
['keras', '是', 'oneiros', '（', 'open', '-', 'ended', 'neuro', '-', 'electronic', 'intelligent', 'robot', 'operating', 'system', '，', '開', '放', '式', '神', '經', '電', '子', '智', '能', '機', '器', '人', '操', '作', '系', '統', '）', '項', '目', '研', '究', '工', '作', '的', '部', '分', '產', '物', '[', '3', ']', '，', '主', '要', '作', '者', '和', '維', '護', '者', '是', 'google', '工', '程', '師', 'francois', 'chollet', '。']

這就是 BT 最終的輸出了。

WordpieceTokenizer

WordpieceTokenizer（以下簡稱 WPT）是在 BT 結果的基礎上進行再一次切分，得到子詞（subword，以 ## 開頭），詞彙表就是在此時引入的。該類只有兩個方法：一個初始化方法 init(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200)，一個分詞方法 tokenize(self, text)。

對於中文來說，使不使用 WPT 都一樣，因為中文經過 BasicTokenizer 後已經變成一個字一個字了，沒法再「子」了。

init(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200)：vocab 就是詞彙表，collections.OrderedDict() 類型，由 load_vocab(vocab_file) 讀入，key 為詞彙，value 為對應索引，順序依照 vocab_file 中的順序。有一點需要注意的是，詞彙表中已包含所有可能的子詞。unk_token 為未登錄詞的標記，默認為 [UNK]。max_input_chars_per_word 為單個詞的最大長度，如果一個詞的長度超過這個最大長度，那麼直接將其設為 unk_token。

tokenize(self, text)：該方法就是主要的分詞方法了，大致分詞思路是按照從左到右的順序，將一個詞拆分成多個子詞，每個子詞盡可能長。按照源碼中的說法，該方法稱之為 greedy longest-match-first algorithm，貪婪最長優先匹配算法。

開始時首先將 text 轉成 unicode，並進行空格分詞，然後依次遍歷每個詞。為了能夠清楚直觀地理解遍歷流程，我特地製作了一個 GIF 來解釋，以 unaffable 為例：

將 BT 的結果輸入給 WPT，那麼 example 的最終分詞結果就是

['keras', '是', 'one', '##iros', '（', 'open', '-', 'ended', 'neu', '##ro', '-', 'electronic', 'intelligent', 'robot', 'operating', 'system', '，', '開', '放', '式', '神', '經', '電', '子', '智', '能', '機', '器', '人', '操', '作', '系', '統', '）', '項', '目', '研', '究', '工', '作', '的', '部', '分', '產', '物', '[', '3', ']', '，', '主', '要', '作', '者', '和', '維', '護', '者', '是', 'google', '工', '程', '師', 'franco', '##is', 'cho', '##llet', '。']

至此，BERT 分詞部分結束。

Reference

Lee, Alan. 2019. “BERT 是如何分词的.” October 16, 2019.
https://alanlee.fun/2019/10/16/bert-tokenizer/.

NLP Neural History Part2 – Attention

下圖整理簡單的 Neural history of NLP from [@ruderReviewRecent2018], [@suSeq2seqPay2018], and [@suSeq2seqPay2018a]

近年注意力模型 (Attention Model) 是深度學習領域最受矚目的新星，用來處理與序列相關的資料，特別是2017年 Google 提出 self attention 後，模型成效、複雜度又取得了更大的進展。然而，從 Attention model 讀到 self attention 時，遇到不少障礙，其中很大部分是後者在論文提出的概念，鮮少有文章解釋如何和前者做關聯。希望藉由這系列文，解釋在機器翻譯的領域中，是如何從Seq2seq演進至Attention model 再至 self attention，使讀者在理解Attention 機制不再這麼困難。Attention 的演進史如下圖，對應上圖的 Attention stage.

Stage 3 (2015-): Attention is All You Need

前文提到 Seq2Seq 包含 encoder 和 decoder 兩個 RNN 所構成。它的運作原理其實與人類的思維很相似，當我們看到一段話時，會先將這句話理解吸收，再根據我們理解的內容說出回覆，Sequence to Sequence 就是在模擬這個過程。

Encoder 就是負責將輸入序列消化、吸收成一個向量，我們通常把這個向量稱為 context vector，顧名思義，這個向量會囊括原序列的重要訊息。而 Decoder 則是根據 context vector 來生成文字。
甚至，我們能把眼光放得更遠些，Encoder 與 Decoder 不一定都只有 RNN，可以讓 CNN 與 RNN 一起當 Encoder，負責將影像編碼成向量，而 Decoder 根據這個向量來生成描述，這樣就能讓機器描述圖片中發生了什麼：

但是 Seq2Seq 模型其實有一個很大的缺陷，那就是我們把一個不限長的輸入，給編碼成一個固定長度的向量，換句話說，隨著我們的輸入越來越長，context/thought vector 的訊息損失就會越大。這方法顯然不太科學，就連我們在聽課時，都不免聽了後面就忘了前面，又怎麼能要求一個向量記住所有資訊呢？

What's Wrong with Context Vector?

Information overload and loss!
Not visualizable or explainable. We do not know what encoder information it capture!

2015 Attention Mechanism

2015 Luong et al. of Stanford 的文章在 Seq2Seq model 加上 attention model 解決長句子 context vector 訊息損失的問題。[@luongEffectiveApproaches2015] and [@alammarVisualizingNeural2018]

這種新的架構替輸入句的每個文字都創造一個context vector，而非僅僅替輸入句創造一個從最終的hidden state 得來的context vector. 舉例來說，如果一個輸入句有N個文字，就會產生N個context vector，好處是，每個context vector能夠被更有效的解碼。

Encoder-Decoder Attention: Context Vector is replaced by Attention Vector/Matrix => Attention Score => Score => Similarity

什麼是 attention? Attention 的直覺想法是當模型輸出某件事時，只會專注於與這件事相關的輸入，這很像在學習一種配對關係。目的在建立 decoder output (例如 translated word, query and answer) 和 specific input word(s) (or encoder hidden state) 的相關性。

在 Attention model 中，Encoder 和 Seq2Seq 概念一樣，一樣是從輸入句 < $x_1,x_2,… x_m$ > 產生 < $h_1,h_2,...,h_m$ > 的 hidden state，再計算目標句 < $y_1 … y_n$ >。換言之，就是將輸入句作為input而目標句作為output，所以差別就在於 context vector $c_{i}$ 是怎麼計算？
$c_{i}=\sum_{j=1}^{T_{x}} \alpha_{i j} h_{j}$
Context vector $c_{i}$ 是透過 attention score α 乘上 encoder hidden state 序列加權求和。

Attention score α 是 attention model中提出一個很重要的概念，可以用來衡量輸入句中的每個文字對目標句中的每個文字所帶來重要性的程度。由下面公式可知，attention score 藉由 score $e_{ij}$ 所計算得到, 所以先來看看 score $e_{ij}$ 是什麼： $i$ 代表目標句 the $i^{th}$ word, $j$ 是輸入句 words. Attention score $\alpha_{ij}$ 是 softmax function with respect to $j$ , 如下式：
$\alpha_{i j}=e^{e_{i j}} / \sum_{k=1}^{T_{x}} e^{e_{i k}}$

再來是 score: $e_{ij} = a(s_{i-1}, h_j)$ . $a$ 代表Alignment model會根據輸入字位置j和輸出字位置i這兩者的關聯程度，計算出一個score $e_{ij}$ .

換言之， $e_{i,j}$ 是衡量RNN decoder 中的hidden state $s_{i-1}$ 和輸入句中的第 $j$ 個文字hidden state $h_{j}$ 的關係所計算出的權重 — 那權重怎麼算呢？

[@luongEffectiveApproaches2015] 發表之後，接續的論文提出了global/local attention 和 soft/hard attention 的概念，而 score $e_{ij}$ 的計算方式類似 global和soft attention。細節不多說，下圖可以看到3種計算權重的方式，我們把剛才的公式做些改變，將score $e_{ij}$ 改寫成 $\text{score}(h_{t}, \bar{h}_{s})$ ， $h_{t}$ 代表 $s_{i-1}$ 而 $\bar{h}_{s}$ 代表 $h_{j}$ ，為了計算方便，我們採用內積計算權重，也就是用相似度來表示。

有了score $e_{ij}$ ，即可透過softmax算出attention score，context vector也可得到，在attention model中，context vector又稱為attention vector。我們可以將attention score列為矩陣，透過此矩陣可看到輸入端文字和輸出端文字間的對應關係，也就是論文當中提出align的概念。

我們用 attention vector/matrix 取代 context vector, 避免 context vector 的兩個缺點：(1) encoder information loss; => attention vector 解決這個問題。 (2) 我們不知道 context vector 到底學到什麼； => attention matrix 提供 visualization 結果。

Bi-directional RNN Encoder

回過來看 encoder, attention model 中的 encoder 用的是改良版 RNN：雙向 RNN(Bi-directional RNN) 如上上上圖，以往單向RNN的問題在於t時刻時，只能透過之前的訊息進行預測，但事實上，模型有時候可能也需要利用未來時刻的訊息進行預測，其運作模式為，一個hidden layer用來由左到右，另一個由右到左，透過雙向RNN，我們可以對詞語進行更好的預測。

舉例來說，”我喜歡蘋果，因為它很好吃”？和”我喜歡蘋果，因為他比安卓穩定”這兩個句子當中，如果只看”我喜歡蘋果”，你可能不知道蘋果指的是水果還是手機，但如果可以根據後面那句得到訊息，答案就很顯而易見，這就是雙向RNN運作的方式。

Encoder-Decoder Attention Model 缺點

Attention model 雖然解決了輸入句僅有一個 context vector 的缺點，但依舊存在不少問題。

context vector計算的是輸入句、目標句間的關聯，卻忽略了輸入句中文字間的關聯，和目標句中文字間的關聯性。後面提到的 self-attention address 這個問題。
不管是 Seq2seq 或是Attention model，其中使用的都是RNN，RNN的缺點就是無法平行化處理，導致模型訓練的時間很長，有些論文嘗試用CNN去解決這樣的問題，像是Facebook提出 Convolutional Seq2seq learning，但CNN實際上是透過大量的layer去解決局部訊息的問題。同樣也有計算的問題。

Encoder-Decoder Attention (Duplicate, to be fixed)

例如輸入是法文的 "Je suis etudiant", 輸出是英文的 “I am a student". 我們期待如下表 attention table, 越深色代表相關性越高。可以看到 "Je -> I", "suis -> am a", "etudiant -> a student". Attention 不僅可以幫助 explain/validate/debug neural network model (可視化 or explainable AI). 另外藉著移除不重要的 information, 有機會 "understand" the content. 這在 attention 用於圖像更明顯。例如 video caption, 對應 NLP article summary 應用。

具體做法是把 input sentence encoder 所有的 hidden states 而非只有 final state context vector (雖然這個 final state 包含所有之前 hidden state) 饋入 decoder. [@fuyuAttentionMechanism2020] 下表每一個 output word 基本對應 encoder hidden states weighted sum. Hidden state 佔比重越大，表示相關性 (attention) 越高。這種 attention 稱為 encoder-decoder attention.

Self-Attention and Word Embedding Vs. Contextual Word Embedding

在2017年，Google提出了一種叫做”The transformer”的模型，透過幾個新的觀念：(1) self attention; (2)multi-head attention; 擴展 attention 的概念去解決上述缺點; 並且 (3) 完全捨棄了RNN、CNN的架構。

除了 encoder-decoder attention 之外，還有一種更基本的 attention: self-attention. Self-attention 是捕捉同一句子或文章中任一 word (or token) 對其他 words (tokens) 的相關性。這聽起來像是 word embedding: 從幾個上下文 words (or token) $w(t\pm k)$ 推論 $w(t)$ 的 CBOW training. 或是反過來的 skip-gram, 如下圖。事實上 the encoder with self-attention 的 training (e.g. ELMo, BERT, GPT) 基本上和 CBOW 一樣！只是多加上前後句的一致性 training.

所以到底 self-attention encoder 和 word embedding 有什麼不同？可以參考下表：

	Self-attention encoder	Word embedding
Deeper	stacking encoders, transformer: 6	1 linear + softmax
Wider, Reception Field	entire sentence or article, transformer: 128	a few (<10) words
Multi-Head Attention	Yes, 1 word has multiple attentions. Contextual word embedding	1 word 1 attention, 1 fixed vector

Word embedding 的精神是 “word is defined by its neighbor!" CBOW 的 word embedding training 只用 single word (or token) 為單位，把 one-hot encoding sparse vector (~20,000 維) 轉換成 dense vector (~512 維 or less).

Contextual word embedding 的精神是 “word is defined by its context!" 更 specific “word is defined by its attended words!"

這個相關性不單是 local 在這個句子。而是用 contextual word embedding 來表示。

舉一些例子：

"The animal didn't cross the street because it was too tired." 其中 "it" attends 的是 "The animal", 因為只有 animal 會 tired. "it" 對應的 word embedding 比較相似 "animal".
"The animal didn't cross the street because it was too wide." 其中 "it" attends 的是 "The street", 因為只有 street 會 wide. "it" 對應的 word embedding 比較相似 "street".
"John bought a mobile phone made by Apple." 和 "Apple is a delicious fruit." 其中的 "Apple" attends to "mobile phone" or "fruit" 屬於不同的 categories, 因此 Apple 對應到不同的 word embeddings (vectors). 這是 contextual word embedding. 相反傳統的 word embedding, 一個 word 如 Apple 只會有一個 word embedding (vector).

Transformer Model

2017 來自 Google 的這篇文章："Attention Is All You Need" 建立一個新的里程碑。 [@vaswaniAttentionAll2017] 這篇文章要解決的是翻譯問題，比如從中文翻譯成英文。這篇文章完全放棄了以往經常採用的 RNN 和 CNN，而提出了一種新的網絡結構，即Transformer，其中包括 encoder 和 decoder (來自 Sequence-to-Sequence model!)。

下圖左是 Transformer model consists of 1 encoder and 1 decoder. 下圖右是完整的 transformer consists of 6 encoder + 6 decoder stacks.

再把 transformer 放大。下圖左半部是 encoder, 下圖右半部是 decoder. Decoder 的 output (shift right) 再饋入 decoder 的 input 產生下一個 output. 基本框架和 sequence-to-sequence 一樣。不過相似之處到此為止。接下來是 transformer 的獨特之處。 [@alammarIllustratedTransformer2018] 對 transformer 介紹非常仔細。

Query, Key, Value

Why Query, Key, Value?
簡單來說，因為沒有 RNN 的 encoder and decoder hidden vectors. 我們需要一個新的機制計算 attention vector/matrix.
在NLP的領域中，Key, Value通常就是指向同一個文字隱向量(word embedding vector)。

進入”The transformer”前，我們重新複習attention model，attention model是從輸入句< $X_1,X_2,X_3…X_m$ > 產生 $h_1,h_2,….h_m$ 的hidden state，透過attention score α 乘上input 的序列加權求和得到Context vector $c_{i}$ ，有了context vector和hidden state vector，便可計算目標句< $y_1…y_n$ >。換言之，就是將輸入句作為input而目標句作為output。

如果用另一種說法重新詮釋：
輸入句中的每個文字是由一系列成對的 <地址Key, 元素Value> 所構成，而目標中的每個文字是Query，那麼就可以用Key, Value, Query去重新解釋如何計算context vector，透過計算Query和各個Key的相似性，得到每個Key對應Value的權重係數，權重係數代表訊息的重要性，亦即attention score；Value則是對應的訊息，再對Value進行加權求和，得到最終的Attention/context vector。

筆者認為這概念非常創新，特別是從 attention model 到 ”The transformer” 間，鮮少有論文解釋這種想法是如何連結的，間接導致 ”attention is all you need” 這篇論文難以入門，有興趣可以參考key、value的起源論文 Key-Value Memory Networks for Directly Reading Documents。

有了 Key, Value, Query 的概念，我們可以將 attention model 中的 Decoder 公式重新改寫。

score $e_{ij}= \text{Similarity}(\text{Query}, \text{Key}_{i})$ ，上文有提到3種計算權重的方式，而我們選擇用內積。
有了 $\text{Similarity}(\text{Query}, \text{Key}_{i})$ ，便可以透過softmax算出 $\text{Softmax}(\text{sim}_{i})=a_{i}$ ，接著就可以透過attention score $a_{i}$ 乘上 $\text{Value}_{i}$ 的序列和加總所得 = Attention(Query, Source)，也就是context/attention vector。

Three kinds of Attention

“The transformer”在計算attention的方式有三種，1. encoder self attention，存在於encoder間. 2. decoder self attention，存在於decoder間，3. encoder-decoder attention, 這種attention算法和過去的attention model相似。其中 1 and 2 都是 Google paper 新的觀念。

三種 attention 對應的 query, key, value 如下。

Example

通過這樣一系列的計算，可以看到，現在每個詞的輸出向量z都包含了其他詞的信息，每個詞都不再是孤立的了。而且每個位置中，詞與詞的相關程度，可以通過softmax輸出的權重進行分析。如下圖所示，這是某一次計算的權重，其中線條顏色的深淺反映了權重的大小，可以看到it中權重最大的兩個詞是The和animal，表示it跟這兩個詞關聯最大。這就是attention的含義，輸出跟哪個詞關聯比較強，就放比較多的注意力在上面。上面我們把每一步計算都拆開了看，實際計算的時候，可以通過矩陣來計算。

Multi-Head Attention

講完了attention，再來講Multi-Head。對於同一組輸入embedding，我們可以並行做若干組上面的操作，例如，我們可以進行8組這樣的運算，每一組都有WQ，WK，WV矩陣，並且不同組的矩陣也不相同。這樣最終會計算出8組輸出，我們把8組的輸出連接起來，並且乘以矩陣WO做一次線性變換得到輸出，WO也是隨機初始化，通過訓練得到。這樣的好處，一是多個組可以並行計算，二是不同的組可以捕捉不同的子空間的信息。我們用同樣的例子說明如下圖。

如果 input 是 "The animal didn't cross the street because it was too tired." 這裡的 "it" attends "animal".
如果 input 是 "The animal didn't cross the street because it was too wide." 這裡的 "it" attends "street".
Multi-Head Attention (MHA) 可以捕捉不同字空間的信息，或是稱為 contextual word embedding. 根據不同的上下文 (context), 決定 attend 的對象。

Transformer Vs. RNN

同樣都是做NLP任務，我們來和RNN做個對比。

下圖是最基本的RNN結構，計算公式： $h_{i}=f\left(\left[W \cdot x_{i}, U \cdot h_{i-1},\right]+b\right)$
當計算隱向量 $h_4$ 時，用到了輸入 $x_4$ ，和上一步算出來的隱向量 $h_3$ ， $h_3$ 包含了前面所有節點的信息。 $h_4$ 中包含最多的信息是當前的輸入 $x_4$ ，越往前的輸入，隨著距離的增加，信息衰減得越多。對於每一個輸出隱向量 $h_i$ 都是如此，包含信息最多得是當前的輸入，隨著距離拉遠，包含前面輸入的信息越來越少。

但是Transformer這個結構就不存在這個問題，不管當前詞和其他詞的空間距離有多遠，包含其他詞的信息不取決於距離，而是取決於兩者的相關性，這是Transformer的第一個優勢。

第二個優勢在於，對於Transformer來說，在對當前詞進行計算的時候，不僅可以用到前面的詞，也可以用到後面的詞。而RNN只能用到前面的詞，這並不是個嚴重的問題，因為這可以通過雙向RNN來解決。

第三點，RNN是一個順序的結構，必須要一步一步地計算，只有計算出h1，才能計算h2，再計算h3，隱向量無法同時並行計算，導致RNN的計算效率不高，這是RNN的固有結構所造成的，之前有一些工作就是在研究如何對RNN的計算並行化。通過前文的介紹，可以看到Transformer不存在這個順序問題。

下圖 summarize 上述三點優勢。

通過這裡的比較，可以看到Transformer相對於RNN有巨大的優勢，因此 Google 認為 RNN之後會被Transformer取代: "Attention Is All You Need". 事實上 NLP 近年 (2017-) 的確被 attention-based model 如 transformer, BERT, GPT, etc. dominate.

但是 RNN 在 ASR (automatic speech recognition) 仍然廣泛使用。可能說話的循序的特性仍然和 RNN sequence model 相當吻合.

Pre-trained Language Model – BERT, ELMo, GPT

簡單來說：Transformer = Seq2Seq structure (encoder+decoder) + Attention (replace RNN with Attention). 其中 decoder 是 generative model, encoder 是 discriminative model (sort of). 我們可以分別用於不同應用。

首先 BERT = Transformer encoder
可以用 CBOW 的方式 training encoder,

另外雖然 RNN 逐漸被 Attention 取代，但仍然在 NLP 應用：
ELMo = Deep bi-directional LSTM model (encoder) [@petersDeepContextualized2018]
ELMo 雖然沒有 attention mechanism, 還是可以達到 contextual word embedding.

GPT = Transformer
GPT 則是 generative model, 包含 encoder + decoder, 其實就是很大的 transformer.
Why need pre-trained + fine-tuning?
How to do pre-trained and fine-tuing?
Refere to NTU Prof. Li's talk

Stage 3 Attention Summary

Transformer 是集大成者: (1) 從 sequence-to-sequence model 借用 encoder-decoder 架構，(2) 輸入借用 word embedding 把 words 轉換成等長而且有含義的 dense vector (word2vec). (3) Block 中的 shortcut/add/norm 來自 ResNet. (4) Feed-forward 來自 perceptron, 也就是 fully connected network.
Transformer 的創新之處在於採取 attention model, 來自 Multi-Head Attention (MHA). 捨棄 RNN or CNN model. 以及為了解決順序問題的 positional encoding.
最重要也是最關鍵的一點，就是 transformer model 可以用於 pre-train model! 廣泛用於 NLP 甚至其他如圖像的應用。

Reference

Alammar, Jay. 2018a. “Visualizing A Neural Machine Translation Model
(Mechanics of Seq2seq Models with Attention).” May 9, 2018.
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/.

———. 2018b. “The Illustrated Transformer.” June 27, 2018.
https://jalammar.github.io/illustrated-transformer/.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. “Effective Approaches to Attention-Based Neural Machine Translation.” September 20, 2015. http://arxiv.org/abs/1508.04025.

Ruder, Sebastian. n.d. “A Review of the Recent History of Natural Language Processing.” Sebastian Ruder. Accessed January 27, 2021.
https://ruder.io/a-review-of-the-recent-history-of-nlp/.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” December 5, 2017. http://arxiv.org/abs/1706.03762.

Wang, Zhengyang, Yaochen Xie, and Shuiwang Ji. 2020. “Global Voxel Transformer Networks for Augmented Microscopy.” November 23, 2020.
http://arxiv.org/abs/2008.02340.

Yang, Justin. 2017. “從零開始的 Sequence to Sequence.” 雷德麥的藏書閣. September 28, 2017.
http://zake7749.github.io/2017/09/28/Sequence-to-Sequence-tutorial/index.html.

Zhang, Bei. 2019. “Transformer and BERT in NLP.” 知乎专栏. May 27, 2019. https://zhuanlan.zhihu.com/p/53099098.

Edge AI: Object Detection History: 2-Pass Vs. 1-Pass; Anchor Vs. Anchor-less

sequenceDiagram
    participant Alice
    participant Bob
    Alice->John: Hello John, how are you?
    loop Healthcheck
        John->John: Fight against hypochondria
    end
    Note right of John: Rational thoughts <br/>prevail...
    John-->Alice: Great!
    John->Bob: How about you?
    Bob-->John: Jolly good!

graph LR
A[方形] -->B(圆角)
    B --> C{条件a}
    C -->|a=1| D[结果1]
    C -->|a=2| E[结果2]
    F[横向流程图]

Multi-Object Tracking 數據集: MOT15/16/17/19/20

多目標跟蹤 (MOT) 是監控重要的功能。雖然沒有 ImageNet 大量的 dataset, 但因為 MOT Challenge 的關係，MOT Challenge dataset 成為大家公認的 dataset 用於檢視 MOT 算法的效能。
MOT Challenge是多目標跟蹤領域最權威的國際測評平臺，由慕尼黑工業大學、阿德萊德大學、蘇黎世聯邦理工學院以及達姆施塔特工業大學聯合創辦。MOT Challenge 提供了非常準確的標註資料和全面的評估指標，用以評估跟蹤演算法、行人檢測器的效能。

這裡介紹這些 datasets.

MOT 數據集

數據集用的最多的是 MOTChallenge，專注於行人追蹤的: https://motchallenge.net/

MOT15 年的都是採集的老的數據集的視頻做的修正。參考論文：MOT Challenge 2015: Towards a Benchmark for Multi-Target Tracking『https://arxiv.org/abs/1504.01942』
MOT16 年的是全新的數據集 (7 video for training and 7 video for test)，相比於 15 年的行人密度更高、難度更大。特別注意這個 DPM 檢測器，效果非常的差，全是漏檢和誤檢。參考論文：MOT16: A Benchmark for Multi-Object Tracking：『https://arxiv.org/abs/1603.00831』
MOT17 年的視頻和 16 年一模一樣，只是提供了三個檢測器，相對來說更公平。也是現在論文的主流數據集。
MOT19 年的是針對特別擁擠情形的數據集，只有 CVPR19 比賽時才能提交。
MOT20 benchmark 包含8份新的影片序列，都是極具挑戰性的場景。這份資料集在 4th BMTT MOT Challenge Workshop，CVPR 2019上首次釋出 (same as MOT-19?)，平均每幀高達246個行人，相比之前的挑戰賽資料集增加了夜晚資料集，對現有SOTA的MOT演算法在解決極端稠密場景、演算法泛化性等方面提出了艱鉅挑戰。

MOT16/MOT17

針對 MOT16 數據集介紹一下，它與 MOT15 數據的部分標注信息可能存在差別，需要注意～

MOT16 數據集是在 2016 年提出來的用於衡量多目標跟蹤檢測和跟蹤方法標準的數據集，專門用於行人跟蹤。總共有 14 個視頻，訓練集和測試集各 7 個，這些視頻每個都不一樣，按照官網的說法，它們有些是固定攝像機進行拍攝的，有些是移動攝像機進行拍攝的，而且拍攝的角度各不一樣（低、中、高度進行拍攝），拍攝的條件不一樣，包括不同的天氣，白天或者夜晚等，還具有非常高的人群密度，總之，這個數據集非常具有挑戰性。
7 training sequence: {2, 4, 5, 9, 10, 11, 13} 如下表。

另一個 MOT16 7 training and test sequence summary.

MOT16 數據集使用的 baseline 檢測器是 DPM，這個檢測器在檢測「人」這個類別上具有較好的性能。我們看一些例子如下。這些視頻的主要信息如下：包括 FPS、分辨率、視頻時長、軌跡數、目標書、密度、靜止或者移動拍攝、低中高角度拍攝、拍攝的天氣條件等。

MOT16 數據的目錄結構如下所示：包含訓練集和測試集（各有 7 個視頻）
每個子文件夾（如 MOT16-01）代表一個視頻轉換後的數據集，包含幾個文件或者文件夾，其目錄結構與具體含義如下：

MOT16/
├── test
│   ├── MOT16-01
│   │   ├── det
│   │   │   └── det.txt
│   │   ├── img1
│   │   │   ├── 000001.jpg
│   │   │   ├── xxxxxx.jpg
│   │   │   └── 000450.jpg
│   │   └── seqinfo.ini
│   ├── MOT16-03
│   ├── MOT16-06
│   ├── MOT16-07
│   ├── MOT16-08
│   ├── MOT16-12
│   └── MOT16-14
│
└── train
    ├── MOT16-02
    │   ├── det
    │   │   └── det.txt
    │   ├── gt
    │   │   └── gt.txt
    │   ├── img1
    │   │   ├── 000001.jpg
    │   │   ├── xxxxxx.jpg
    │   │   └── 000600.jpg
    │   └── seqinfo.ini
    ├── MOT16-04
    ├── MOT16-05
    ├── MOT16-09
    ├── MOT16-10
    ├── MOT16-11
    └── MOT16-13

seqinfo.ini
文件內容如下，主要用於說明這個文件夾的一些信息，比如圖片所在文件夾 img1，幀率，視頻的長度，圖片的長和寬，圖片的後綴名。
det/det.txt
這個文件中存儲了圖片的檢測框的信息（這裡用 MOT16-05 文件來說明，該文件下 img1 文件下有 837 張圖片，代表視頻的每一幀）

從左到右分別代表的意義是

第 1 個值：frame: 第幾幀圖片
第 2 個值：id: 這個檢測框分配的 id，（由於暫時未定，所以均為 – 1）
第 3-6 個值：bbox (四位): 分別是左上角坐標（top, left）和寬 (width) 高 (height)
第 7 個值：conf：這個 bbox 包含物體的置信度，可以看到並不是傳統意義的 0-1，分數越高代表置信度越高
第 8、9、10 個值：MOT3D (x,y,z): 是在 MOT3D 中使用到的內容，這裡關心的是 MOT2D，所以都設置為 -1
可以看出以上內容主要提供的和目標檢測的信息沒有區別，所以也在一定程度上可以用於檢測器的訓練。

img1 文件夾
這裡面是將視頻的每一幀抽取出來後的圖片，圖片格式是 jpg，按照視頻流的順序進行命名，如：xxxxxx.jpg
gt/gt.txt 文件（train 訓練集才有）
從左到右代表的含義是：

第 1 個值：frame: 第幾幀圖片
第 2 個值：ID: 也就是軌跡的 ID
第 3-6 個值：bbox: 分別是左上角坐標 (top, left) 和寬（width）高（height）
第 7 個值：是否忽略：0 代表忽略 (A value of 0 means that this particularinstance is ignored in the evaluation, while a value of1 is used to mark it as active.)
第 8 個值：classes: 目標的類別個數（這裡是駕駛場景包括 12 個類別），7 代表的是靜止的人。第 8 個類代表錯檢，9-11 代表被遮擋的類別，12 代表反射，如下圖中的第二列，可以看到店面的玻璃反射了路人的背景。

第 9 個值：代表目標運動時被其他目標包含、覆蓋、邊緣裁剪的情況。(The last number shows the visibility ratio of each bounding box. This can be due to occlusion by anotherstatic or moving object, or due to image border cropping.)

MOT17

MOT17 baseline 提供了三種檢測結果: DPM, Faster-RCNN, and SDP，如果你分析基於不同檢測結果的結果，你會發現大多數 MOT tracker 在 Faster-RCNN 和 SDP 檢測上有著相似的性能。

MOT19/MOT20

MOT-20

KITTI 數據集

KITTI 數據集的是針對自動駕駛的數據集，有汽車也有行人，在 MOT 的論文里用的很少。

Math AI: Maximum Likelihood Estimation (MLE) Evolve To EM Algorithm For Incomplete/Hidden Data To Variational Bayesian Inference

Major Reference

Maximum Likelihood Estimation 和應用

Maximum Likelihood Estimation (MLE) vs. EM

Toy Example [@matasExpectationMaximization2018]

前提摘要

問題一: MLE

問題二 incomplete/hidden Data

EM algorithm 邏輯

前提摘要

GMM 特例：Estimate Means of Two Gaussian Distributions (known variance and ratio; unknown means)

問題三 Complete Data (Easy case)

問題四 incomplete/hidden Data

Missing Values, EM Approach

Clustering: Soft Assignment Vs. Hard Assignment (K-means)

EM Algorithm Derivation

Log-Likelihood with Hidden Variable Lower Bound

EM Algorithm Push the Lower Bound Upwards

Example 1: Hidden variable does NOT provide any information of

Example 2:

EM 具體步驟

M-step: is fixed

E-step: is fixed

Conditional Vs. Joint Distribution

結合 E-Step and M-Step 的另一種寫法 (這是 EM 的精髓，假設可以有一個 close from Q function, e.g. GMM)

Free Energy Interpretation [@poczosCllusteringEM2015]

Variational Expectation Maximization

Appendix

問題二的 Conditional Vs. Joint Distribution 解法

To Do Next

Major Reference

Unsupervised Learning

Similarity

K-means

K-means 算法

K-means example, K=3

K-means algorithm summary

Potential Function Interpretation and EM Algorithm

Gaussian Mixture Model (GMM)

Hard Assignment Vs Soft Assignment

Probability with Exponential Distance Similarity

Hidden Variable Model

GMM

GMM MLE Estimation

EM Algorithm

特例一： 所有 Gaussian Distribution 相同且已知。另外 為已知！

Step 1: initialization, 可以隨機選擇 ( 為已知). => same as K-means

Step 2: E-Step (soft) assign each to

Step 3: M-Step re-estimate the cluster centers/means, i.e. update

General GMM: 除了 需要迭代， and 也需要迭代

Spherical EM, same variance GMM

Hidden Markov Model (HMM) Using EM Algorithm

Why VAE?

How VAE work?

1. "Robust" AE

2. Probability Distribution Conversion (分布變換)

如何 Train Generator?

VAE Encoder-Decoder Training 傳統理解和誤解

VAE 正確理解？

VAE Details

VAE Encoder

VAE Training

"Variational" Latent Loss

Re-parmeterization Trick (for Gradient Back Propagation)

Training Forward and Backward Paths

VAE 的本質

Appendix A: VAE Coding Example [@durrVAETutorial2016]

Encoder

Decoder

Loss function

Appendix B: KL Divergence

Reference

原理 → 定律 → 現象

具體的操作（至少邏輯上）

最小作用原理幾何化

古典力學的最小作用原理 recap

例一：1D 自由運動

例二：1D 一階位能場運動

如何消除時間？幾何化“最小作用原理”

Abbreviated Action

Example 1: Hidden variable $z$ does NOT provide any information of $o$

Example 2: $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{o}, \boldsymbol{\theta})$

M-step: $q^{(t+1)}$ is fixed

E-step: $\boldsymbol{\theta}^{(t)}$ is fixed

特例一：所有 Gaussian Distribution $\sigma^2$ 相同且已知。另外 $\pi_i$ 為已知！

Step 1: initialization, 可以隨機選擇 $\theta^0 = [\mu^0_1, \mu^0_2, \cdots, \mu^0_K]$ ( $\sigma^2$ 為已知). => same as K-means

Step 2: E-Step (soft) assign each $x_j$ to $\mu_i$

Step 3: M-Step re-estimate the cluster centers/means, i.e. update $\mu^{t-1}_i \to \mu^{t}_i$

General GMM: 除了 $\mu_i$ 需要迭代， $\Sigma_i$ and $\pi_i$ 也需要迭代

例一：2D 卡氏座標系, $V(x, y)$

例二：Keplerian System ( $V = -\frac{GM}{r}$ ) [@biesiadaEpicyclicOrbital2003]