[ML] Machine Learning Foundations: Lecture 9 Linear Regression

ML: Foundations
Course: Machine Learning Foundations
Topic: Lecture 9, Linear Regression

$$ \mathbf{x}=(x_0,x_1,x_2, \cdots , x_d) \\ y\approx \sum_{i=0}^{d}w_ix_i \\ h(\mathbf{x})=\mathbf{w^Tx} $$

Error Measure

$$ E_{in}(h_{\mathbf{w}})=\frac{1}{N}\sum_{n=1}^{N}(\underset{\mathbf{w^Tx_n}}{\underbrace{h(\mathbf{x_n})}}-y_n)^2 $$
$$ E_{out}(\mathbf{w})=\underset{(\mathbf{x},y)\sim P}{\mathbb{E}}(\mathbf{w^Tx}-y)^2 $$
$$ \begin{align*} E_{in}(\mathbf{w}) &= \frac{1}{N}\sum_{n=1}^{N}(\mathbf{w^Tx_n}-y_n)^2\\ &= \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x_n^Tw}-y_n)^2 \\ &= \frac{1}{N}\begin{Vmatrix} \mathbf{x_1^Tw}-y_1\\ \mathbf{x_2^Tw}-y_2\\ \vdots \\ \mathbf{x_N^Tw}-y_N \end{Vmatrix}^2\\ &= \frac{1}{N}\left \| \begin{bmatrix} \mathbf{x_1^T}\\ \mathbf{x_2^T}\\ \vdots \\ \mathbf{x_N^T} \end{bmatrix}\mathbf{w}-\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{bmatrix} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{\underset{\mathrm{N\times(d+1)\ }}{\underbrace{X}}\underset{\mathrm{(d+1)\times1}}{\underbrace{w}}-\underset{\mathrm{N\times1}}{\underbrace{y}}} \right \|^2 \end{align*} $$
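To make the matrix form concrete, here is a minimal numpy sketch (the function names and shapes are purely illustrative; \(\mathbf{X}\) is the \(N\times(d+1)\) data matrix):

```python
import numpy as np

def ein(w, X, y):
    """E_in(w) = (1/N) * ||X w - y||^2, the matrix form above."""
    return np.mean((X @ w - y) ** 2)

def ein_loop(w, X, y):
    """Equivalent per-example form: (1/N) * sum_n (w^T x_n - y_n)^2."""
    return sum((w @ x_n - y_n) ** 2 for x_n, y_n in zip(X, y)) / len(y)

# the two forms agree on random data
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(10, 4)), rng.normal(size=10), rng.normal(size=4)
assert np.isclose(ein(w, X, y), ein_loop(w, X, y))
```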

Minimizing \( E_{in} \):
$$ \begin{align*} \nabla E_{in}(\mathbf{w}) &= \nabla\left(\frac{1}{N}\left \| \mathbf{Xw-y} \right \|^2\right)\\ &= \nabla\left(\frac{1}{N}(\mathbf{w^TX^TXw-2w^TX^Ty+y^Ty})\right)\\ &= \frac{2}{N}(\mathbf{X^TXw-X^Ty}) \end{align*} $$ Setting the gradient to zero: $$ \begin{align*} \frac{2}{N}(\mathbf{X^TXw-X^Ty})&=0\\ \mathbf{X^TXw}&=\mathbf{X^Ty} \end{align*} $$
  • If \(\mathbf{X^TX}\) is invertible: \(\mathbf{w_{opt}}=(\mathbf{X^TX})^{-1}\mathbf{X^Ty}\), i.e. the pseudo-inverse is \(\mathbf{X^+}=(\mathbf{X^TX})^{-1}\mathbf{X^T}\).
  • If \(\mathbf{X^TX}\) is singular: define \(\mathbf{X^+}\) by other means (e.g. via SVD) and still take \(\mathbf{w_{opt}}=\mathbf{X^+}\mathbf{y}\).
In code, it is best to call an existing, well-tested pseudo-inverse routine rather than implementing one yourself.
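With numpy, for example, np.linalg.pinv (or the least-squares solver np.linalg.lstsq) serves this purpose; a minimal sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1 prepended
y = rng.normal(size=N)

w_pinv  = np.linalg.pinv(X) @ y                       # w_opt = X^+ y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]        # same minimizer of ||Xw - y||^2
assert np.allclose(w_pinv, w_lstsq)
```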
$$ \frac{\partial \mathbf{w^TAw}}{\partial \mathbf{w}}=2\mathbf{A}\mathbf{w} \quad (\text{for symmetric } \mathbf{A},\ \text{e.g. } \mathbf{A}=\mathbf{X^TX}) $$
Derivation, using the product rule:
$$ \begin{align*} f\left ( \mathbf{w}\right )&= \mathbf{w}^{T}\mathbf{A}\mathbf{w}=\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}w_{i}w_{j}\\ \frac{\partial f}{\partial w_{i}}&=\sum_{j=1}^{n}a_{ij}w_{j}+\sum_{j=1}^{n}a_{ji}w_{j}=2\sum_{k=1}^{n}a_{ik}w_{k} \quad (a_{ij}=a_{ji})\\ \frac{\partial f}{\partial \mathbf{w}}&=\left(\frac{\partial f}{\partial w_{1}},\frac{\partial f}{\partial w_{2}},\cdots ,\frac{\partial f}{\partial w_{n}}\right)^{T} \\ &=\left(2\sum_{k=1}^{n}a_{1k}w_{k},2\sum_{k=1}^{n}a_{2k}w_{k},\cdots ,2\sum_{k=1}^{n}a_{nk}w_{k}\right)^T\\ &=2\mathbf{A}(w_1,w_2,\cdots ,w_n)^{T}\\ &=2\mathbf{A}\mathbf{w} \end{align*} $$
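The resulting gradient formula \(\nabla E_{in}(\mathbf{w})=\frac{2}{N}(\mathbf{X^TXw-X^Ty})\) can be sanity-checked against finite differences; a small sketch on random data (all sizes and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

ein = lambda w: np.mean((X @ w - y) ** 2)             # E_in(w) = (1/N)||Xw - y||^2
grad = (2.0 / N) * (X.T @ X @ w - X.T @ y)            # analytic gradient

eps = 1e-6
num_grad = np.array([(ein(w + eps * e) - ein(w - eps * e)) / (2 * eps)
                     for e in np.eye(d + 1)])         # central differences
assert np.allclose(grad, num_grad, atol=1e-5)
```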

Linear Regression Algorithm

  1. Construct \(\mathbf{X}\) and \(\mathbf{y}\) from the data \(D\)
  2. Compute the pseudo-inverse \(\mathbf{X^+}\)
  3. Return \(\mathbf{w_{opt}}=\mathbf{X^+}\mathbf{y}\)
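A minimal sketch of these three steps in numpy (not official course code; the toy target and noise level are made up for illustration):

```python
import numpy as np

def linear_regression(X_raw, y):
    """Steps 1-3: add x_0 = 1, compute the pseudo-inverse, return w_opt = X^+ y."""
    X = np.hstack([np.ones((len(y), 1)), X_raw])     # step 1: construct X from D
    return np.linalg.pinv(X) @ y                     # steps 2-3: w_opt = X^+ y

# toy usage: recover a known linear target from noisy observations
rng = np.random.default_rng(2)
N, d = 200, 3
w_true = rng.normal(size=d + 1)
X_raw = rng.normal(size=(N, d))
y = np.hstack([np.ones((N, 1)), X_raw]) @ w_true + 0.1 * rng.normal(size=N)

w_opt = linear_regression(X_raw, y)
print(np.round(w_opt - w_true, 2))                   # entries should be close to 0
```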

Does this \(\mathbf{w_{opt}}\) give a good \(E_{out}\)?

One guarantee: like perceptrons, linear regression has a finite \(d_{vc}\) (the dichotomy argument still applies), so \(E_{out}\) can be kept close to \(E_{in}\).
Another view: the average of \(E_{in}\):
$$ \overline{E_{in}}=\underset{D \sim P^N}{\mathbb{E}}\left \{ E_{in}(\mathbf{w_{opt}}\text{ w.r.t. }D) \right \} = \text{noise level} \cdot \left ( 1-\frac{d+1}{N} \right ) $$
$$ \begin{align*} E_{in}(\mathbf{w_{opt}}) &= \frac{1}{N}\left \| \mathbf{y}-\underset{prediction}{\underbrace{\mathbf{\hat{y}}}} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{y-X}\underset{\mathbf{w_{opt}}}{\underbrace{\mathbf{X^+y}}} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{(I-XX^+)y} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{(I-H)y} \right \|^2 \end{align*} $$ where \(\mathbf{H}=\mathbf{XX^+}\) is called the hat matrix.
\(\mathbf{H}\) is the projection matrix onto the column space of \(\mathbf{X}\) (it maps \(\mathbf{y}\) to \(\mathbf{\hat{y}}=\mathbf{Hy}\)). Hence, writing \(y=f(\mathbf{x})+noise\), where \(f(\mathbf{x})\) is the ideal target function and is assumed to lie in the column space of \(\mathbf{X}\), the part of \(\mathbf{y}\) coming from \(f\) is removed entirely by \(\mathbf{I-H}\) and only the noise remains:
$$ E_{in}(\mathbf{w_{opt}}) = \frac{1}{N}\left \| \mathbf{y-\hat{y}} \right \|^2 = \frac{1}{N}\left \| (\mathbf{I-H})\,noise \right \|^2 $$ Averaging over the noise (independent components with mean 0 and variance \(\sigma^2\), the noise level), and using \(\underset{noise}{\mathbb{E}}\left[noise^T\mathbf{M}\,noise\right]=\sigma^2\,trace(\mathbf{M})\): $$ \begin{align*} \overline{E_{in}} &= \frac{1}{N}\underset{noise}{\mathbb{E}}\left[noise^T(\mathbf{I-H})^T(\mathbf{I-H})\,noise\right] \\ &= \frac{\sigma^2}{N}\,trace\left((\mathbf{I-H})^T(\mathbf{I-H})\right) \\ &= \frac{\sigma^2}{N}\,trace\left(\mathbf{I-H-H^T+H^TH}\right) \\ &= \frac{\sigma^2}{N}\,trace\left(\mathbf{I-H}\right) \qquad (\mathbf{H^T=H},\ \mathbf{H^TH=H}) \\ &= \frac{\sigma^2}{N}\,(N-(d+1)) \\ &= \sigma^2\left(1-\frac{d+1}{N}\right) \end{align*} $$ Here \(\mathbf{H}=\mathbf{X(X^TX)^{-1}X^T}\) is symmetric, and \(\mathbf{H^TH}=\mathbf{X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T}=\mathbf{H}\) (idempotent), so \((\mathbf{I-H})^T(\mathbf{I-H})=\mathbf{I-H}\). The trace of \(\mathbf{H}\) is $$ \begin{align*} trace(\mathbf{H}) &= trace(\mathbf{XX^+})\\ &= trace(\mathbf{X(X^TX)^{-1}}\mathbf{X^T}) \\ &= trace(\mathbf{(X^TX)^{-1}}\mathbf{X^TX}) \\ &= trace(\mathbf{I_{(d+1)\times(d+1)}})\\ &= d+1 \end{align*} $$ so \(trace(\mathbf{I-H})=N-(d+1)\).
$$ \overline{E_{out}}= \text{noise level} \cdot \left ( 1+\frac{d+1}{N} \right ) $$ Because training fits \(\mathbf{w_{opt}}\) to the sampled inputs \(\mathbf{X}\), the fit leans toward the training data; hence:
  • \(E_{in}\) benefits: roughly \(d+1\) dimensions' worth of the noise are fitted away, which is why \(\overline{E_{in}}\) sits below the noise level.
  • \(E_{out}\) pays for it: at worst, that same \(d+1\) dimensions' worth of fitted noise comes back as extra error on unseen data.
  • The expected generalization error is \(\overline{E_{out}}-\overline{E_{in}}=\frac{2(d+1)}{N}\sigma^2\), where \(\sigma^2\) is the noise level that both learning curves converge to as \(N\) grows.
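These average curves can be checked roughly by simulation: repeatedly sample training sets from a fixed linear target plus Gaussian noise, solve with the pseudo-inverse, and average \(E_{in}\) and an estimate of \(E_{out}\) (a sketch under those assumptions; \(d\), \(N\), \(\sigma\) and the trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, sigma, trials = 5, 40, 0.5, 2000
w_star = rng.normal(size=d + 1)                      # fixed "ideal" linear target

def sample(n):
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
    return X, X @ w_star + sigma * rng.normal(size=n)

X_test, y_test = sample(10_000)                      # large test set approximates E_out
ein_sum = eout_sum = 0.0
for _ in range(trials):
    X, y = sample(N)
    w = np.linalg.pinv(X) @ y                        # w_opt = X^+ y
    ein_sum  += np.mean((X @ w - y) ** 2)
    eout_sum += np.mean((X_test @ w - y_test) ** 2)

print(ein_sum / trials,  sigma**2 * (1 - (d + 1) / N))   # ~ sigma^2 * (1 - (d+1)/N)
print(eout_sum / trials, sigma**2 * (1 + (d + 1) / N))   # ~ sigma^2 * (1 + (d+1)/N)
```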

Properties of \(\mathbf{H}\)

  • \(\mathbf{\hat{y}}=\mathbf{Hy}\) is the projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\), and \((\mathbf{I-H})\mathbf{y}=\mathbf{y-\hat{y}}\) is the residual.
  • \(\mathbf{H}\) is symmetric and idempotent (\(\mathbf{H^T=H}\), \(\mathbf{H^2=H}\)), and so is \(\mathbf{I-H}\).
  • \(trace(\mathbf{H})=d+1\) and \(trace(\mathbf{I-H})=N-(d+1)\).
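These properties are easy to verify numerically on a random \(\mathbf{X}\) (a small sketch; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
H = X @ np.linalg.pinv(X)                            # hat matrix H = X X^+

assert np.allclose(H, H.T)                           # symmetric
assert np.allclose(H @ H, H)                         # idempotent
assert np.isclose(np.trace(H), d + 1)                # trace(H) = d + 1

y = rng.normal(size=N)
y_hat = X @ (np.linalg.pinv(X) @ y)                  # X w_opt
assert np.allclose(H @ y, y_hat)                     # H y is exactly the prediction y_hat
```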


Comparison of Linear Classification & Linear Regression

Linear Classification (minimizing \(E_{in}\) is NP-hard in general): $$ \begin{align*} y &\in \{-1,+1\}\\ h(\mathbf{x}) &= sign(\mathbf{w^Tx})\\ err(\hat{y},y) &= [\hat{y} \neq y] \end{align*} $$
Linear Regression (easy to solve analytically): $$ \begin{align*} y &\in \mathbb{R}\\ h(\mathbf{x}) &=\mathbf{w^Tx}\\ err(\hat{y},y) &= (\hat{y} - y)^2 \end{align*} $$

Error comparison
$$ \begin{align*} E_{out}(\mathbf{w}) &\overset{VC}{\leq} \text{classification } E_{in}(\mathbf{w})+\sqrt{\cdots }\\ &\leq \text{regression } E_{in}(\mathbf{w})+\sqrt{\cdots } \end{align*} $$
The squared (regression) error is a looser upper bound than the 0/1 (classification) error: for \(y \in \{-1,+1\}\), whenever \(sign(\mathbf{w^Tx}) \neq y\) we have \((\mathbf{w^Tx}-y)^2 \geq 1\), so \(err_{0/1} \leq err_{sqr}\). The bound is looser, but the problem becomes much easier to solve.
Hence linear regression can be used as:
  • A baseline classifier; in most cases it does not perform badly.
  • An initial \(\mathbf{w}\) for PLA/pocket, to speed up PLA/pocket (see the sketch below).
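A rough sketch of both uses on synthetic \(\pm1\) labels (the data and the plain PLA loop are illustrative only, not the full pocket algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.3, 1.0, -2.0]) + 0.2 * rng.normal(size=N))  # noisy +/-1 labels

# 1) linear regression weights used directly as a classifier
w_lin = np.linalg.pinv(X) @ y
print("0/1 error of sign(w_lin^T x):", np.mean(np.sign(X @ w_lin) != y))

# 2) w_lin as the starting point for PLA-style updates instead of w = 0
w = w_lin.copy()
for _ in range(1000):
    wrong = np.flatnonzero(np.sign(X @ w) != y)
    if wrong.size == 0:
        break                                        # no mistakes left
    n = rng.choice(wrong)
    w += y[n] * X[n]                                 # standard PLA correction
```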
