[ML] Machine Learning Foundations: Lecture 9 Linear Regression

ML: Foundations
Course: Machine Learning Foundations
Topic: Lecture 9, Linear Regression

$$ \mathbf{x}=(x_0,x_1,x_2, \cdots , x_d) \\ y\approx \sum_{i=0}^{d}w_ix_i \\ h(\mathbf{x})=\mathbf{w^Tx} $$

Error Measure

$$ E_{in}(h_{\mathbf{w}})=\frac{1}{N}\sum_{n=1}^{N}(\underset{\mathbf{w^Tx_n}}{\underbrace{h(\mathbf{x_n})}}-y_n)^2 $$
$$ E_{out}(\mathbf{w})=\underset{(\mathbf{x},y)\sim P}{\mathbb{E}}(\mathbf{w^Tx}-y)^2 $$
$$ \begin{align*} E_{in}(\mathbf{w}) &= \frac{1}{N}\sum_{n=1}^{N}(\mathbf{w^Tx_n}-y_n)^2\\ &= \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x_n^Tw}-y_n)^2 \\ &= \frac{1}{N}\begin{Vmatrix} \mathbf{x_1^Tw}-y_1\\ \mathbf{x_2^Tw}-y_2\\ \vdots \\ \mathbf{x_N^Tw}-y_N \end{Vmatrix}^2\\ &= \frac{1}{N}\left \| \begin{bmatrix} \mathbf{x_1^T}\\ \mathbf{x_2^T}\\ \vdots \\ \mathbf{x_N^T} \end{bmatrix}\mathbf{w}-\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{bmatrix} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{\underset{\mathrm{N\times(d+1)\ }}{\underbrace{X}}\underset{\mathrm{(d+1)\times1}}{\underbrace{w}}-\underset{\mathrm{N\times1}}{\underbrace{y}}} \right \|^2 \end{align*} $$
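To make the matrix form concrete, here is a minimal numpy sketch (the function names and shapes are purely illustrative; \(\mathbf{X}\) is the \(N\times(d+1)\) data matrix):

```python
import numpy as np

def ein(w, X, y):
    """E_in(w) = (1/N) * ||X w - y||^2, the matrix form above."""
    return np.mean((X @ w - y) ** 2)

def ein_loop(w, X, y):
    """Equivalent per-example form: (1/N) * sum_n (w^T x_n - y_n)^2."""
    return sum((w @ x_n - y_n) ** 2 for x_n, y_n in zip(X, y)) / len(y)

# the two forms agree on random data
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(10, 4)), rng.normal(size=10), rng.normal(size=4)
assert np.isclose(ein(w, X, y), ein_loop(w, X, y))
```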

Minimizing \( E_{in} \):
$$ \begin{align*} \nabla E_{in}(\mathbf{w}) &= \nabla\left(\frac{1}{N}\left \| \mathbf{Xw-y} \right \|^2\right)\\ &= \nabla\left(\frac{1}{N}(\mathbf{w^TX^TXw-2w^TX^Ty+y^Ty})\right)\\ &= \frac{2}{N}(\mathbf{X^TXw-X^Ty}) \end{align*} $$ Setting the gradient to zero: $$ \begin{align*} \frac{2}{N}(\mathbf{X^TXw-X^Ty})&=0\\ \mathbf{X^TXw}&=\mathbf{X^Ty} \end{align*} $$
  • If \(\mathbf{X^TX}\) is invertible: \(\mathbf{w_{opt}}=(\mathbf{X^TX})^{-1}\mathbf{X^Ty}\), i.e. the pseudo-inverse is \(\mathbf{X^+}=(\mathbf{X^TX})^{-1}\mathbf{X^T}\).
  • If \(\mathbf{X^TX}\) is singular: define \(\mathbf{X^+}\) by other means (e.g. via SVD) and still take \(\mathbf{w_{opt}}=\mathbf{X^+}\mathbf{y}\).
In code, it is best to call an existing, well-tested pseudo-inverse routine rather than implementing one yourself.
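With numpy, for example, np.linalg.pinv (or the least-squares solver np.linalg.lstsq) serves this purpose; a minimal sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1 prepended
y = rng.normal(size=N)

w_pinv  = np.linalg.pinv(X) @ y                       # w_opt = X^+ y
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]        # same minimizer of ||Xw - y||^2
assert np.allclose(w_pinv, w_lstsq)
```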
$$ \frac{\partial \mathbf{w^TAw}}{\partial \mathbf{w}}=2\mathbf{A}\mathbf{w} \quad (\text{for symmetric } \mathbf{A},\ \text{e.g. } \mathbf{A}=\mathbf{X^TX}) $$
Derivation, using the product rule:
$$ \begin{align*} f\left ( \mathbf{w}\right )&= \mathbf{w}^{T}\mathbf{A}\mathbf{w}=\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}w_{i}w_{j}\\ \frac{\partial f}{\partial w_{i}}&=\sum_{j=1}^{n}a_{ij}w_{j}+\sum_{j=1}^{n}a_{ji}w_{j}=2\sum_{k=1}^{n}a_{ik}w_{k} \quad (a_{ij}=a_{ji})\\ \frac{\partial f}{\partial \mathbf{w}}&=\left(\frac{\partial f}{\partial w_{1}},\frac{\partial f}{\partial w_{2}},\cdots ,\frac{\partial f}{\partial w_{n}}\right)^{T} \\ &=\left(2\sum_{k=1}^{n}a_{1k}w_{k},2\sum_{k=1}^{n}a_{2k}w_{k},\cdots ,2\sum_{k=1}^{n}a_{nk}w_{k}\right)^T\\ &=2\mathbf{A}(w_1,w_2,\cdots ,w_n)^{T}\\ &=2\mathbf{A}\mathbf{w} \end{align*} $$
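The resulting gradient formula \(\nabla E_{in}(\mathbf{w})=\frac{2}{N}(\mathbf{X^TXw-X^Ty})\) can be sanity-checked against finite differences; a small sketch on random data (all sizes and names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

ein = lambda w: np.mean((X @ w - y) ** 2)             # E_in(w) = (1/N)||Xw - y||^2
grad = (2.0 / N) * (X.T @ X @ w - X.T @ y)            # analytic gradient

eps = 1e-6
num_grad = np.array([(ein(w + eps * e) - ein(w - eps * e)) / (2 * eps)
                     for e in np.eye(d + 1)])         # central differences
assert np.allclose(grad, num_grad, atol=1e-5)
```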

Linear Regression Algorithm

  1. Construct \(\mathbf{X}\) and \(\mathbf{y}\) from the data \(D\)
  2. Compute the pseudo-inverse \(\mathbf{X^+}\)
  3. Return \(\mathbf{w_{opt}}=\mathbf{X^+}\mathbf{y}\)
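A minimal sketch of these three steps in numpy (not official course code; the toy target and noise level are made up for illustration):

```python
import numpy as np

def linear_regression(X_raw, y):
    """Steps 1-3: add x_0 = 1, compute the pseudo-inverse, return w_opt = X^+ y."""
    X = np.hstack([np.ones((len(y), 1)), X_raw])     # step 1: construct X from D
    return np.linalg.pinv(X) @ y                     # steps 2-3: w_opt = X^+ y

# toy usage: recover a known linear target from noisy observations
rng = np.random.default_rng(2)
N, d = 200, 3
w_true = rng.normal(size=d + 1)
X_raw = rng.normal(size=(N, d))
y = np.hstack([np.ones((N, 1)), X_raw]) @ w_true + 0.1 * rng.normal(size=N)

w_opt = linear_regression(X_raw, y)
print(np.round(w_opt - w_true, 2))                   # entries should be close to 0
```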

Does this \(\mathbf{w_{opt}}\) give a good \(E_{out}\)?

One guarantee: like perceptrons, linear regression has a finite \(d_{vc}\) (the dichotomy argument still applies), so \(E_{out}\) can be kept close to \(E_{in}\).
Another view: the average of \(E_{in}\):
$$ \overline{E_{in}}=\underset{D \sim P^N}{\mathbb{E}}\left \{ E_{in}(\mathbf{w_{opt}}\text{ w.r.t. }D) \right \} = \text{noise level} \cdot \left ( 1-\frac{d+1}{N} \right ) $$
$$ \begin{align*} E_{in}(\mathbf{w_{opt}}) &= \frac{1}{N}\left \| \mathbf{y}-\underset{prediction}{\underbrace{\mathbf{\hat{y}}}} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{y-X}\underset{\mathbf{w_{opt}}}{\underbrace{\mathbf{X^+y}}} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{(I-XX^+)y} \right \|^2\\ &= \frac{1}{N}\left \| \mathbf{(I-H)y} \right \|^2 \end{align*} $$ where \(\mathbf{H}=\mathbf{XX^+}\) is called the hat matrix.
\(\mathbf{H}\) is the projection matrix onto the column space of \(\mathbf{X}\) (it maps \(\mathbf{y}\) to \(\mathbf{\hat{y}}=\mathbf{Hy}\)). Hence, writing \(y=f(\mathbf{x})+noise\), where \(f(\mathbf{x})\) is the ideal target function and is assumed to lie in the column space of \(\mathbf{X}\), the part of \(\mathbf{y}\) coming from \(f\) is removed entirely by \(\mathbf{I-H}\) and only the noise remains:
$$ E_{in}(\mathbf{w_{opt}}) = \frac{1}{N}\left \| \mathbf{y-\hat{y}} \right \|^2 = \frac{1}{N}\left \| (\mathbf{I-H})\,noise \right \|^2 $$ Averaging over the noise (independent components with mean 0 and variance \(\sigma^2\), the noise level), and using \(\underset{noise}{\mathbb{E}}\left[noise^T\mathbf{M}\,noise\right]=\sigma^2\,trace(\mathbf{M})\): $$ \begin{align*} \overline{E_{in}} &= \frac{1}{N}\underset{noise}{\mathbb{E}}\left[noise^T(\mathbf{I-H})^T(\mathbf{I-H})\,noise\right] \\ &= \frac{\sigma^2}{N}\,trace\left((\mathbf{I-H})^T(\mathbf{I-H})\right) \\ &= \frac{\sigma^2}{N}\,trace\left(\mathbf{I-H-H^T+H^TH}\right) \\ &= \frac{\sigma^2}{N}\,trace\left(\mathbf{I-H}\right) \qquad (\mathbf{H^T=H},\ \mathbf{H^TH=H}) \\ &= \frac{\sigma^2}{N}\,(N-(d+1)) \\ &= \sigma^2\left(1-\frac{d+1}{N}\right) \end{align*} $$ Here \(\mathbf{H}=\mathbf{X(X^TX)^{-1}X^T}\) is symmetric, and \(\mathbf{H^TH}=\mathbf{X(X^TX)^{-1}X^TX(X^TX)^{-1}X^T}=\mathbf{H}\) (idempotent), so \((\mathbf{I-H})^T(\mathbf{I-H})=\mathbf{I-H}\). The trace of \(\mathbf{H}\) is $$ \begin{align*} trace(\mathbf{H}) &= trace(\mathbf{XX^+})\\ &= trace(\mathbf{X(X^TX)^{-1}}\mathbf{X^T}) \\ &= trace(\mathbf{(X^TX)^{-1}}\mathbf{X^TX}) \\ &= trace(\mathbf{I_{(d+1)\times(d+1)}})\\ &= d+1 \end{align*} $$ so \(trace(\mathbf{I-H})=N-(d+1)\).
$$ \overline{E_{out}}= \text{noise level} \cdot \left ( 1+\frac{d+1}{N} \right ) $$ Because training fits \(\mathbf{w_{opt}}\) to the sampled inputs \(\mathbf{X}\), the fit leans toward the training data; hence:
  • \(E_{in}\) benefits: roughly \(d+1\) dimensions' worth of the noise are fitted away, which is why \(\overline{E_{in}}\) sits below the noise level.
  • \(E_{out}\) pays for it: at worst, that same \(d+1\) dimensions' worth of fitted noise comes back as extra error on unseen data.
  • The expected generalization error is \(\overline{E_{out}}-\overline{E_{in}}=\frac{2(d+1)}{N}\sigma^2\), where \(\sigma^2\) is the noise level that both learning curves converge to as \(N\) grows.
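These average curves can be checked roughly by simulation: repeatedly sample training sets from a fixed linear target plus Gaussian noise, solve with the pseudo-inverse, and average \(E_{in}\) and an estimate of \(E_{out}\) (a sketch under those assumptions; \(d\), \(N\), \(\sigma\) and the trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, sigma, trials = 5, 40, 0.5, 2000
w_star = rng.normal(size=d + 1)                      # fixed "ideal" linear target

def sample(n):
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
    return X, X @ w_star + sigma * rng.normal(size=n)

X_test, y_test = sample(10_000)                      # large test set approximates E_out
ein_sum = eout_sum = 0.0
for _ in range(trials):
    X, y = sample(N)
    w = np.linalg.pinv(X) @ y                        # w_opt = X^+ y
    ein_sum  += np.mean((X @ w - y) ** 2)
    eout_sum += np.mean((X_test @ w - y_test) ** 2)

print(ein_sum / trials,  sigma**2 * (1 - (d + 1) / N))   # ~ sigma^2 * (1 - (d+1)/N)
print(eout_sum / trials, sigma**2 * (1 + (d + 1) / N))   # ~ sigma^2 * (1 + (d+1)/N)
```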

Properties of \(\mathbf{H}\)

  • \(\mathbf{\hat{y}}=\mathbf{Hy}\) is the projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\), and \((\mathbf{I-H})\mathbf{y}=\mathbf{y-\hat{y}}\) is the residual.
  • \(\mathbf{H}\) is symmetric and idempotent (\(\mathbf{H^T=H}\), \(\mathbf{H^2=H}\)), and so is \(\mathbf{I-H}\).
  • \(trace(\mathbf{H})=d+1\) and \(trace(\mathbf{I-H})=N-(d+1)\).
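These properties are easy to verify numerically on a random \(\mathbf{X}\) (a small sketch; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 30, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
H = X @ np.linalg.pinv(X)                            # hat matrix H = X X^+

assert np.allclose(H, H.T)                           # symmetric
assert np.allclose(H @ H, H)                         # idempotent
assert np.isclose(np.trace(H), d + 1)                # trace(H) = d + 1

y = rng.normal(size=N)
y_hat = X @ (np.linalg.pinv(X) @ y)                  # X w_opt
assert np.allclose(H @ y, y_hat)                     # H y is exactly the prediction y_hat
```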


Comparison of Linear Classification & Linear Regression

Linear Classification (minimizing \(E_{in}\) is NP-hard in general): $$ \begin{align*} y &\in \{-1,+1\}\\ h(\mathbf{x}) &= sign(\mathbf{w^Tx})\\ err(\hat{y},y) &= [\hat{y} \neq y] \end{align*} $$
Linear Regression (easy to solve analytically): $$ \begin{align*} y &\in \mathbb{R}\\ h(\mathbf{x}) &=\mathbf{w^Tx}\\ err(\hat{y},y) &= (\hat{y} - y)^2 \end{align*} $$

Error comparison
$$ \begin{align*} E_{out}(\mathbf{w}) &\overset{VC}{\leq} \text{classification } E_{in}(\mathbf{w})+\sqrt{\cdots }\\ &\leq \text{regression } E_{in}(\mathbf{w})+\sqrt{\cdots } \end{align*} $$
The squared (regression) error is a looser upper bound than the 0/1 (classification) error: for \(y \in \{-1,+1\}\), whenever \(sign(\mathbf{w^Tx}) \neq y\) we have \((\mathbf{w^Tx}-y)^2 \geq 1\), so \(err_{0/1} \leq err_{sqr}\). The bound is looser, but the problem becomes much easier to solve.
Hence linear regression can be used as:
  • A baseline classifier; in most cases it does not perform badly.
  • An initial \(\mathbf{w}\) for PLA/pocket, to speed up PLA/pocket (see the sketch below).
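A rough sketch of both uses on synthetic \(\pm1\) labels (the data and the plain PLA loop are illustrative only, not the full pocket algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = np.sign(X @ np.array([0.3, 1.0, -2.0]) + 0.2 * rng.normal(size=N))  # noisy +/-1 labels

# 1) linear regression weights used directly as a classifier
w_lin = np.linalg.pinv(X) @ y
print("0/1 error of sign(w_lin^T x):", np.mean(np.sign(X @ w_lin) != y))

# 2) w_lin as the starting point for PLA-style updates instead of w = 0
w = w_lin.copy()
for _ in range(1000):
    wrong = np.flatnonzero(np.sign(X @ w) != y)
    if wrong.size == 0:
        break                                        # no mistakes left
    n = rng.choice(wrong)
    w += y[n] * X[n]                                 # standard PLA correction
```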
