Hidden Markov Model: Part 4 Learning

The parameters of an HMM are fully specified by \lambda = (P, B, \pi_o): the transition matrix, the emission matrix, and the initial distribution.

The key point is how an HMM connects to practical applications and problems. There are three fundamental problems: [@jurafskyHIddenMarkov2019]

  1. The Likelihood Evaluation Problem: given the observation sequence O and the model parameters λ, compute the probability of the sequence, P(O|λ); this corresponds to the forward and backward algorithms.
  2. The Decoding Problem: given the observation sequence O and the model parameters λ, find the most likely hidden state sequence; this corresponds to the Viterbi algorithm.
  3. The Learning Problem: given the observation sequence O, adjust the model parameters \lambda^* = (P, B, \pi_o) to maximize P(O|λ); this corresponds to the Baum-Welch (B&W) algorithm (a special case of the EM algorithm).

The third class of HMM problem is the most fundamental and broadly applicable. Note that it adjusts the model parameters, which closely resembles training in machine learning or deep learning. Mathematically, \lambda^* = \arg\max_{\lambda} P(O\mid \lambda). In fact, the RNN model in deep learning is very close to the Kalman filter formulation (a Kalman filter can be viewed as a continuous-state HMM).

Introduction

An important question for an HMM is how to obtain the transition probability matrix P and the emission probability matrix B. Broadly there are two approaches: (1) prior knowledge; or (2) learning (training).

Prior

  1. Error Correction Codes for Convolutional Codes or Trellis Coded Modulation: the P matrix is essentially a prior, with p_{ij} = 0 or p_{ij} = 1/k. B is a normal distribution determined by the received signal-to-noise ratio. How do we estimate the received signal magnitude and the noise variance? Both can be estimated fairly accurately from the constellation diagram.

  2. A Kalman filter can be viewed as a continuous-state HMM. The P and B matrices are normal distributions: the means are determined by the prior-known state-space equation matrices A, B, C; the variances are determined by the state transition noise and the output observation noise variances. The question is how to tell transition noise from observation noise based on the observables. In general, observation noise is additive white noise at the output, i.e., its spectrum is spread evenly across low and high frequencies. State transition noise accumulates at the output, so it is dominated by low frequencies while its high-frequency components tend to average out, similar to 1/f^k noise (a toy simulation follows this list).

  3. The noise variance or probability distribution estimates above are essentially open-loop, one-time calculations/calibrations, and non-adaptive. This is different from the learning-based approach below, which can be adjusted adaptively.
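
As a toy illustration (my own hypothetical simulation, not part of the original discussion), accumulated state-transition noise seen at the output behaves like a random walk whose spectrum is concentrated at low frequencies, while observation noise stays white:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096)              # observation noise: additive white noise
x = np.cumsum(rng.normal(size=4096))   # transition noise at the output: accumulated (random walk)

psd_white = np.abs(np.fft.rfft(w)) ** 2
psd_walk  = np.abs(np.fft.rfft(x)) ** 2   # energy concentrated at low frequencies (~1/f^2)
```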

Learning

  1. A common example is speech recognition. The transition probabilities between syllables are the "key parameters", but they depend on language, dialect, accent, and application; they are complicated parameters for which an accurate prior is hard to obtain! The output emission probabilities are even harder to specify a priori under various noisy environments and different microphones. These are usually handled separately: first learn the transition probability distribution in a clean environment, then train the emission probabilities and fine-tune the transition probabilities in various noisy environments.

Baum-Welch Algorithm [@wikiBaumWelch2020]

Let \lambda = (P, B, \pi_o). The HMM learning problem is to find \lambda^* = \arg\max_{\lambda} P(O\mid \lambda), the maximum likelihood estimate of the parameters given a set of observations.

Forward Procedure

We use the recursion formula from Part 1 (a code sketch follows the list):

  • \alpha_{t}(j) = P(O_1,...,O_{t},X_{t}=j); conditioned on the model parameters, \alpha_{t}(j) = P(O_1,...,O_{t},X_{t}=j \mid \lambda)
  • Initial: \alpha_1(j) = \pi_j b_{j O_1}
  • Recursion: \alpha_{t+1}(j) = \sum_{i=1}^N b_{j O_{t+1}}p_{ij}\alpha_t(i) = b_{j O_{t+1}} \sum_{i=1}^N p_{ij}\alpha_t(i)
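
Below is a minimal NumPy sketch of the forward recursion (the function and variable names are my own; P, B, and pi follow the notation above, and the observations are assumed to be integer symbol indices):

```python
import numpy as np

def forward(P, B, pi, O):
    """Forward procedure.
    P:  (N, N) transition matrix, P[i, j] = p_ij
    B:  (N, M) emission matrix,   B[j, k] = b_jk
    pi: (N,)   initial distribution
    O:  (T,)   observation symbol indices
    Returns alpha with alpha[t, j] = P(O_1..O_{t+1}, X_{t+1}=j | lambda)
    (0-based t in code corresponds to t+1 in the text).
    """
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                  # alpha_1(j) = pi_j * b_{j O_1}
    for t in range(1, T):
        # alpha_{t+1}(j) = b_{j O_{t+1}} * sum_i p_ij * alpha_t(i)
        alpha[t] = B[:, O[t]] * (alpha[t - 1] @ P)
    return alpha
```

In practice the alphas are scaled (or computed in log space) at each step to avoid underflow on long sequences; the sketch omits this.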

Backward Procedure

We again use the recursion formula from Part 1 (with a code sketch after the list):

  • \beta_{t}(i) = P(O_{t+1},...,O_{T}\mid X_{t}=i); conditioned on the model parameters, \beta_{t}(i) = P(O_{t+1},...,O_{T}\mid X_{t}=i, \lambda)
  • Initial: \beta_T(i) = 1
  • Recursion: \beta_{t}(i) = \sum_{j=1}^N p_{ij}\, b_{j O_{t+1}}\, \beta_{t+1}(j)
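
A matching sketch of the backward recursion, with the same assumed array layout as the forward code above:

```python
def backward(P, B, O):
    """Backward procedure.
    Returns beta with beta[t, i] = P(O_{t+2}..O_T | X_{t+1}=i, lambda)
    (0-based t in code).
    """
    T, N = len(O), P.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                           # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j p_ij * b_{j O_{t+1}} * beta_{t+1}(j)
        beta[t] = P @ (B[:, O[t + 1]] * beta[t + 1])
    return beta
```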

Bi-directional

What we actually want to compute, given O_1, ..., O_t, ..., O_T, is the conditional probability of X_t:
\gamma_{i}(t)=P\left(X_{t}=i \mid O, \lambda\right)=\frac{P\left(X_{t}=i, O \mid \lambda\right)}{P(O \mid \lambda)}=\frac{\alpha_{t}(i) \beta_{t}(i)}{\sum_{j=1}^{N} \alpha_{t}(j) \beta_{t}(j)}

where
P(O_1, O_2, ..., O_t, O_{t+1}, ..., O_T, X_t=i)
= P(O_1, ..., O_t \mid X_t=i, O_{t+1}, ..., O_T)\, P(O_{t+1}, ..., O_T \mid X_t=i)\, P(X_t=i)
= P(O_1, ..., O_t \mid X_t=i)\, P(O_{t+1}, ..., O_T \mid X_t=i)\, P(X_t=i)
= P(O_1, ..., O_t, X_t=i)\, P(O_{t+1}, ..., O_T \mid X_t=i)
= \alpha_t(i)\, \beta_t(i)
(the second equality uses the conditional independence of past and future observations given X_t), and, for any t,
P(O \mid \lambda)=\sum_{j} \alpha_{t}(j) \beta_{t}(j)
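
Combining the two procedures, the state posteriors can be computed as in the sketch below, where gamma[t, i] corresponds to \gamma_i(t) and each row sum of alpha * beta gives P(O \mid \lambda):

```python
def state_posteriors(alpha, beta):
    """gamma[t, i] = P(X_t=i | O, lambda) = alpha_t(i) beta_t(i) / P(O | lambda)."""
    gamma = alpha * beta
    # Each row of alpha*beta sums to P(O | lambda); normalizing yields the posterior.
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```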

What we would really like is the transition probability p_{ij} = P(X_{t+1}=j \mid X_{t}=i) = \frac{P(X_{t}=i, X_{t+1}=j)}{P(X_{t}=i)}, which requires the following probability:
\xi_{i j}(t)=P\left(X_{t}=i, X_{t+1}=j \mid O, \lambda\right)=\frac{P\left(X_{t}=i, X_{t+1}=j, O \mid \lambda\right)}{P(O \mid \lambda)}
=\frac{\alpha_{t}(i) p_{i j} \beta_{t+1}(j) b_{j O_{t+1}}}{\sum_{k=1}^{N} \sum_{w=1}^{N} \alpha_{t}(k) p_{k w} \beta_{t+1}(w) b_{w O_{t+1}}}

Here we use \alpha_t and \beta_{t+1}, so the denominator P(O\mid \lambda) looks different from the earlier expression that used \alpha_t and \beta_t, but the two are in fact identical.
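
The pairwise posteriors \xi_{ij}(t) can be computed the same way; the sketch below builds the numerator \alpha_t(i)\, p_{ij}\, b_{j O_{t+1}}\, \beta_{t+1}(j) by broadcasting and then normalizes by P(O \mid \lambda) (assuming O is a NumPy integer array):

```python
def pair_posteriors(P, B, O, alpha, beta):
    """xi[t, i, j] = P(X_t=i, X_{t+1}=j | O, lambda), for t = 1..T-1."""
    # numerator[t, i, j] = alpha_t(i) * p_ij * b_{j O_{t+1}} * beta_{t+1}(j)
    xi = (alpha[:-1, :, None] * P[None, :, :]
          * (B[:, O[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)    # divide by P(O | lambda)
    return xi
```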

Update Parameters

  • \pi_i^* = \gamma_i(1)
  • p_{ij}^* = \frac{\sum_{t=1}^{T-1} \xi_{i j}(t)}{\sum_{t=1}^{T-1} \gamma_{i}(t)}
  • b_i^*(v_k) = \frac{\sum_{t=1}^{T} 1_{O_{t}=v_{k}} \gamma_{i}(t)}{\sum_{t=1}^{T} \gamma_{i}(t)}, where 1_{O_{t}=v_{k}} is the indicator that the observation at time t is the symbol v_k (see the sketch below).
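
A sketch of the corresponding re-estimation step, followed by one full Baum-Welch iteration on a toy model using the functions sketched above (all names are mine; M is the number of observation symbols):

```python
def reestimate(gamma, xi, O, M):
    """One Baum-Welch M-step: re-estimate pi, P, B from gamma and xi."""
    O = np.asarray(O)
    pi_new = gamma[0]                                         # pi_i^* = gamma_i(1)
    P_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # expected transitions / visits
    # b_i^*(v_k): expected emissions of symbol v_k from state i, normalized
    B_new = np.stack([gamma[O == k].sum(axis=0) for k in range(M)], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, P_new, B_new

# One EM iteration on a toy 2-state, 2-symbol model:
P  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
O  = np.array([0, 1, 0, 0, 1])

alpha = forward(P, B, pi, O)
beta  = backward(P, B, O)
gamma = state_posteriors(alpha, beta)
xi    = pair_posteriors(P, B, O, alpha, beta)
pi, P, B = reestimate(gamma, xi, O, M=2)   # repeat until P(O | lambda) converges
```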

Q&A

  • What is the difference between an HMM and an RNN, or between a Kalman filter and an RNN?
