ML: Learning Fundamental Techniques
Package: scikit-learn
Course:
Machine Learning Techniques
Overview: Lecture 11, Gradient Boosted Decision Tree
Adaptive Boosted Decision Tree
- function AdaBoost-DTree(\(D\))
- \(\mathbf{u}^{(1)}=[\frac{1}{N},\frac{1}{N},\cdots ,\frac{1}{N}]\)
- for \(t=1,2,\cdots ,T\), where \(T\) is chosen by the user
- obtain \(g_t\) in one of two ways
- \(\mathrm{DTree} \) must be pruned
- either pass in \(\mathbf{u}^{(t)}\) directly:
\(g_t=\mathrm{DTree}(D,\mathbf{u}^{(t)})\)
- or let \(\tilde{D}_t\) be a sample drawn from \(D\) according to \(\mathbf{u}^{(t)}\):
\(g_t=\mathrm{DTree}(\tilde{D}_t)\)
- update \(\mathbf{u}^{(t)}\) to \(\mathbf{u}^{(t+1)}\) by
$$
\begin{align*}
(incorrect\ examples)\ [y_n\neq g_t(\mathbf{x}_n)] &: u_n^{(t+1)}
\leftarrow u_n^{(t)}\: \cdot\: \blacklozenge _t \\
(correct\ examples)\ [y_n= g_t(\mathbf{x}_n)] &: u_n^{(t+1)}
\leftarrow u_n^{(t)}\ /\ \blacklozenge _t \\
\blacklozenge _t =\sqrt{\frac{1-\epsilon _t}{\epsilon _t}}\ and\
\epsilon _t&=\frac{\sum_{n=1}^{N}u_n^{(t)}[y_n\neq
g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}}
\end{align*}
$$
- compute \(\alpha_t=\mathrm{ln}(\blacklozenge _t )\)
- return \(G(\mathbf{x})=sign \left (\sum_{t=1}^{T}\alpha_t g_t(\mathbf{x}) \right )\)
Because a decision tree is too complex a model, we cannot necessarily compute \(E_{in}^\mathbf{u}(h)=\frac{1}{N}\sum_{n=1}^{N}u_n\cdot err(y_n,h(\mathbf{x}_n))\) directly inside the tree-building algorithm.
So why not go back to the original idea: it was bootstrap sampling that gave rise to \(\mathbf{u}\) in the first place.
Treat the decision tree as a black box and only manipulate the data handed to it:
draw \(N{}'\) examples from \(D\) to form \(\tilde{D}_t\), with sampling proportions proportional to \(\mathbf{u}^{(t)}\).
Now suppose the decision tree is a fully grown tree and all the \(\mathbf{x}_n\) are distinct.
Then \(E_{in}(g_t) = 0 \Rightarrow E_{in}^\mathbf{u}(g_t)=0\),
and consequently \(\epsilon _t=0 \Rightarrow \alpha_t=\infty \).
This defeats the original purpose: we wanted many trees to decide the outcome together, yet we end up with a single tree dictating everything.
Therefore we need a pruned tree, trained on only part of the data.
For pruning methods, see
[ML] Machine Learning Techniques: Lecture 9, Decision Tree
or simply limit the height of the tree.
And sampling according to \(\mathbf{u}\) is effectively the same as training on only part of the data.
Note that the AdaBoost-Stump of
[ML] Machine Learning Techniques: Lecture 8, Adaptive Boosting
is in fact a special case of AdaBoost-DTree, since it is equivalent to restricting the tree to height \(H \le 1\).
Moreover, because a stump is so simple, \(E_{in}^\mathbf{u}(h)=\frac{1}{N}\sum_{n=1}^{N}u_n\cdot err(y_n,h(\mathbf{x}_n))\) can be computed directly.
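Below is a minimal sketch of AdaBoost-DTree along these lines, using scikit-learn's DecisionTreeClassifier as the pruned base learner (pruning approximated by max_depth) and resampling according to \(\mathbf{u}^{(t)}\); the function name, the clipping of \(\epsilon_t\), and the toy data are my own illustrative assumptions, not part of the original notes.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_dtree(X, y, T=10, max_depth=3, seed=0):
    """AdaBoost-DTree sketch: pruned trees trained on u-weighted resamples."""
    rng = np.random.default_rng(seed)
    N = len(y)
    u = np.full(N, 1.0 / N)                     # u^(1) = [1/N, ..., 1/N]
    trees, alphas = [], []
    for _ in range(T):
        # D~_t: draw N examples with probability proportional to u^(t)
        idx = rng.choice(N, size=N, p=u / u.sum())
        g = DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx])
        wrong = g.predict(X) != y
        # weighted error rate epsilon_t and scaling factor (the "diamond")
        eps = np.clip(u[wrong].sum() / u.sum(), 1e-12, 1 - 1e-12)  # avoid alpha = infinity
        diamond = np.sqrt((1 - eps) / eps)
        u = np.where(wrong, u * diamond, u / diamond)   # multiply incorrect, divide correct
        trees.append(g)
        alphas.append(np.log(diamond))          # alpha_t = ln(diamond_t)
    def G(Xq):                                  # G(x) = sign(sum_t alpha_t g_t(x))
        return np.sign(sum(a * g.predict(Xq) for a, g in zip(alphas, trees)))
    return G

# usage on toy data (labels must be +1/-1 for the sign-based vote)
X = np.random.default_rng(1).normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
G = adaboost_dtree(X, y, T=20)
print("training accuracy:", (G(X) == y).mean())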
Optimization View of AdaBoost
$$
\underset{\eta}{min}\ \underset{h}{min}\ \widehat{E}_{ADA} = \sum_{n=1}^{N}u_n^{(t+1)}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\left(\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)+\eta h(\mathbf{x}_n)\right)}
$$
- Get \(g_t\) by
$$
\underset{\color{Cyan}{h}}{min}\ \widehat{E}_{ADA}=\sum_{n=1}^{N}u_n^{(t)}e^{-y_n\eta \color{Cyan}{h(\mathbf{x}_n)}}
$$
the \(h\) that minimizes \(E_{in}^{\mathbf{u}^{(t)}}(h)\) is exactly \(g_t\)
- Substituting this \(g_t\), get \(\eta_t\) by
$$
\underset{\color{Red}{\eta_t}}{min}\ \widehat{E}_{ADA}=\sum_{n=1}^{N}u_n^{(t)}e^{-y_n\color{Red} {\eta_t} g_t(\mathbf{x}_n)}
$$
\(\eta_t=\mathrm{ln}\sqrt{\frac{1-\epsilon _t}{\epsilon _t}}=\alpha_t\)
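As a quick numeric check with assumed numbers: if the weighted error rate is \(\epsilon_t=0.1\), then
$$
\blacklozenge _t=\sqrt{\frac{1-0.1}{0.1}}=3,\qquad \eta_t=\alpha_t=\mathrm{ln}\,3\approx 1.1
$$
so a \(g_t\) with small weighted error gets a large vote, while \(\epsilon_t=0.5\) (random guessing) gives \(\blacklozenge _t=1\) and \(\alpha_t=0\).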
$$
\begin{align*}
u_n^{(t+1)}&=
\left\{\begin{matrix}
u_n^{(t)}\cdot \blacklozenge _t & if\ incorrect\\
u_n^{(t)}/ \blacklozenge _t& if\ correct
\end{matrix}\right.\\
&= u_n^{(t)}\cdot \blacklozenge _t^{-y_ng_t(\mathbf{x}_n)}\\
&= u_n^{(t)}\cdot e^{-y_n\alpha_t g_t(\mathbf{x}_n)}\\
\\
u_n^{(T+1)}&=u_n^{(1)}\cdot \prod_{t=1}^{T}e^{-y_n\alpha_t g_t(\mathbf{x}_n)}\\
&=\frac{1}{N}\cdot e^{-y_n\color{Orange} {\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)}}
\end{align*}
$$
$$
G(\mathbf{x}_n)=sign\left (\overset{\mathrm{voting\ score}}{\overbrace{\sum_{t=1}^{T}\underset{w_i}{\underbrace{\alpha_t}}\underset{\phi _i(\mathbf{x}_n)}{\underbrace{g_t(\mathbf{x}_n)}}}} \right )
$$
Recall the hard-margin SVM margin \(=\frac{y_n\cdot (\mathbf{w}^T\phi (\mathbf{x}_n)+b)}{\left \| \mathbf{w} \right \|}\);
see
[ML] Machine Learning Techniques: Lecture 1, Linear SVM
The \(\mathrm{voting\ score}\) is simply this expression with \(b=0\),
so \(y_n(\mathrm{voting\ score})\) = the unnormalized margin.
We want \(y_n(\mathrm{voting\ score})\) to be a large positive value,
which makes \(e^{-y_n(\mathrm{voting\ score})}\) a very small value,
and hence \(u_n^{(T+1)}\) is also very small.
AdaBoost keeps making \(\sum_{n=1}^{N}u_n^{(t)}\) smaller and smaller, which in turn amounts to shrinking the individual \(u_n^{(t)}\).
How can we show this?
First, the final \(u_n^{(T+1)}\) sum to
$$
\sum_{n=1}^{N}u_n^{(T+1)}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n(\mathrm{voting\ score})_n}
$$
So when the voting score classifies \(\mathbf{x}_n\) correctly and has large magnitude, \(-y_n\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)\) is a large negative number and the corresponding exponential term is tiny.
If we let \(s=\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)\),
then \(err_{0/1}(s,y)=[ys \le 0]\) is the classification error,
and the expression above defines \(err_{ADA}(s,y)=e^{-ys}\), the exponential error measure.
Plotting the two error curves (as in the lecture) shows that \(err_{ADA}\) is an upper bound on \(err_{0/1}\), so doing well on \(err_{ADA}\) effectively means doing well on \(err_{0/1}\);
see
[ML] Machine Learning Foundations: Lecture 11, Linear Models for Classification
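The upper-bound relation can be written in one line (a standard inequality, stated here for completeness):
$$
err_{0/1}(s,y)=[ys\le 0]\ \le\ e^{-ys}=err_{ADA}(s,y)
$$
since \(e^{-ys}\ge 1\) whenever \(ys\le 0\), and \(e^{-ys}>0\) always.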
From
[ML] Machine Learning Foundations: Lecture 10, Logistic Regression, we know gradient descent:
by the Taylor expansion, or equivalently \(f{}'(x)=\underset{dx \to 0}{lim}\frac{f(x+dx)-f(x)}{dx}\),
\(\underset{\left \| \mathbf{v} \right \|=1}{min}\ E_{in}(\mathbf{w}_t+\eta \mathbf{v})\approx E_{in}(\mathbf{w}_t)+\eta \mathbf{v}^T\triangledown E_{in}(\mathbf{w}_t)\)
Viewed from another angle, a vector is itself a kind of function: given a particular index, it returns the value stored there.
So the idea above can be extended from finding a good vector (direction) to finding a good function.
Then at round \(t\), \(g_t\) can be found via
$$
\begin{align*}
\underset{h}{min}\ \widehat{E}_{ADA} &= \sum_{n=1}^{N}u_n^{(t+1)}\\
&=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\left(\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)+\eta h(\mathbf{x}_n)\right)}\\
&=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}\cdot e^{-y_n\eta h(\mathbf{x}_n)}\\
&=\sum_{n=1}^{N}\frac{1}{N}e^{-y_n\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}\cdot e^{-y_n\eta h(\mathbf{x}_n)}\\
&=\sum_{n=1}^{N}u_n^{(t)}\cdot e^{-y_n\eta h(\mathbf{x}_n)}\\
[by\ Taylor] &\approx \sum_{n=1}^{N}u_n^{(t)}(1-y_n\eta h(\mathbf{x}_n))\\
&=\sum_{n=1}^{N}u_n^{(t)}-\sum_{n=1}^{N}u_n^{(t)}y_n\eta h(\mathbf{x}_n)\\
&=\sum_{n=1}^{N}u_n^{(t)}+\eta\sum_{n=1}^{N}u_n^{(t)}(-y_n h(\mathbf{x}_n))\\
\end{align*}
$$
What is actually being minimized is \(\sum_{n=1}^{N}u_n^{(t)}(-y_n h(\mathbf{x}_n))\), hence
$$
\begin{align*}
\sum_{n=1}^{N}u_n^{(t)}(-y_n h(\mathbf{x}_n))&=\sum_{n=1}^{N}u_n^{(t)}\left\{\begin{matrix}
-1 & if\ y_n=h(\mathbf{x}_n)\\
+1 & if\ y_n\ne h(\mathbf{x}_n)
\end{matrix}\right.\\
&=-\sum_{n=1}^{N}u_n^{(t)}+\sum_{n=1}^{N}u_n^{(t)}\left\{\begin{matrix}
0 & if\ y_n=h(\mathbf{x}_n)\\
2 & if\ y_n\ne h(\mathbf{x}_n)
\end{matrix}\right.\\
\left [\because E_{in}^{\mathbf{u}^{(t)}}(h)=\frac{1}{N}\sum_{n=1}^{N}u_n^{(t)}\cdot [y_n \ne h(\mathbf{x}_n)] \right ]&=-\sum_{n=1}^{N}u_n^{(t)}+2N \cdot E_{in}^{\mathbf{u}^{(t)}}(h)
\end{align*}
$$
Therefore, minimizing \(E_{in}^{\mathbf{u}^{(t)}}(h)\) yields \(g_t\).
How is this done? This is exactly what the A (the base algorithm) in AdaBoost does; see
[ML] Machine Learning Techniques: Lecture 8, Adaptive Boosting
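As a concrete illustration of that base-learner step, a brute-force decision stump that minimizes the weighted 0/1 error \(\sum_{n=1}^{N}u_n^{(t)}[y_n \ne h(\mathbf{x}_n)]\) over all thresholds of every feature might look like the sketch below (the function name and the exhaustive search are my own illustrative choices):

import numpy as np

def best_stump(X, y, u):
    """Return the (feature, threshold, direction) minimizing the u-weighted 0/1 error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        vals = np.sort(np.unique(X[:, j]))
        # candidate thresholds: below the minimum, midpoints, above the maximum
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2, [vals[-1] + 1.0]))
        for theta in thresholds:
            for s in (+1, -1):                      # direction of the stump
                pred = s * np.sign(X[:, j] - theta)
                pred[pred == 0] = s                 # break ties consistently
                err = u[pred != y].sum()            # weighted 0/1 error
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    return best, best_err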
Now that we have \(g_t\), we should also be able to solve for the largest step \(\eta\) at the current round; this approach is called
steepest descent for optimization
$$
\begin{align*}
\underset{\eta}{min}\ \widehat{E}_{ADA}&=\sum_{n=1}^{N}u_n^{(t)}\cdot e^{-y_n\eta g_t(\mathbf{x}_n)}\\
&= \sum_{n=1}^{N}u_n^{(t)}\cdot\left\{\begin{matrix}
e^{-\eta} & if\ y_n=g_t(\mathbf{x}_n)\\
e^{\eta} & if\ y_n \ne g_t(\mathbf{x}_n)
\end{matrix}\right.\\
&= \sum_{n=1}^{N}u_n^{(t)}\cdot \left ( e^{-\eta}[ y_n=g_t(\mathbf{x}_n)] + e^{\eta}[ y_n \ne g_t(\mathbf{x}_n)] \right )\\
&= \sum_{n=1}^{N}e^{-\eta}u_n^{(t)}[ y_n=g_t(\mathbf{x}_n)] + e^{\eta}u_n^{(t)}[ y_n \ne g_t(\mathbf{x}_n)]\\
&= e^{-\eta}\sum_{n=1}^{N}u_n^{(t)}[ y_n=g_t(\mathbf{x}_n)] + e^{\eta}\sum_{n=1}^{N}u_n^{(t)}[ y_n \ne g_t(\mathbf{x}_n)]\\
&= \left (\sum_{n=1}^{N}u_n^{(t)} \right )\cdot \left (e^{-\eta}\frac{\sum_{n=1}^{N}u_n^{(t)}[ y_n=g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} + e^{\eta}\frac{\sum_{n=1}^{N}u_n^{(t)}[ y_n \ne g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} \right )\\
\left [\because \epsilon _t=\frac{\sum_{n=1}^{N}u_n^{(t)}[y_n\neq g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} \right ] &= \left (\sum_{n=1}^{N}u_n^{(t)} \right )\cdot \left (e^{-\eta}\cdot(1-\epsilon_t) + e^{\eta}\cdot\epsilon_t \right )\\
\\
\frac{\partial \widehat{E}_{ADA}}{\partial \eta}&=0\\
\frac{\partial \widehat{E}_{ADA}}{\partial \eta}&=\left (\sum_{n=1}^{N}u_n^{(t)} \right )\cdot \left (-e^{-\eta}\cdot(1-\epsilon_t) + e^{\eta}\cdot\epsilon_t \right )=0\\
0&=-e^{-\eta}\cdot(1-\epsilon_t) + e^{\eta}\cdot\epsilon_t\\
e^{-\eta}\cdot(1-\epsilon_t)&=e^{\eta}\cdot\epsilon_t\\
(1-\epsilon_t)&=e^{2\eta}\cdot\epsilon_t\\
\frac{1-\epsilon_t}{\epsilon_t}&=e^{2\eta}\\
\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}&=e^{\eta}\\
\eta&=\mathrm{ln}\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\\
\end{align*}
$$
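A tiny numeric check of this closed form, comparing it against a brute-force grid search over \(\eta\) for a few assumed values of \(\epsilon_t\):

import numpy as np

for eps in (0.05, 0.2, 0.4):
    etas = np.linspace(0.0, 5.0, 200001)                   # fine grid over eta >= 0
    obj = np.exp(-etas) * (1 - eps) + np.exp(etas) * eps   # e^{-eta}(1-eps) + e^{eta} eps
    eta_grid = etas[np.argmin(obj)]                        # brute-force minimizer
    eta_closed = np.log(np.sqrt((1 - eps) / eps))          # ln sqrt((1-eps)/eps)
    print(f"eps={eps}: grid {eta_grid:.4f} vs closed form {eta_closed:.4f}")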
Gradient Boosted Decision Tree (GBDT)
- \(s_1=s_2=\cdots=s_N=0\)
- for \(t=1,2,\cdots,T\)
- \(g_t(\mathbf{x}_n)=A(\left \{(\mathbf{x}_n,y_n-s_n)\right \})\)
where \(A\) is a squared-error regression algorithm
- a sampled and pruned C&RT is a good choice for \(A\)
- \(\alpha_t=OneVarLinearRegression(\left \{(g_t(\mathbf{x}_n),y_n-s_n) \right \})= \frac{\sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)}{\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)}\)
- update \(s_n \leftarrow s_n+\alpha_tg_t(\mathbf{x}_n)\)
- return \(G(\mathbf{x})=\sum_{t=1}^{T}\alpha_t g_t(\mathbf{x})\)
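A minimal from-scratch sketch of the algorithm above, using scikit-learn's DecisionTreeRegressor as a pruned C&RT-style regressor (pruning approximated by max_depth); the function names and toy data are illustrative assumptions, and the library's own GradientBoostingRegressor is used in the code section at the end instead:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, T=100, max_depth=3):
    """GBDT sketch: fit each tree to the residual y_n - s_n, then solve for alpha_t."""
    s = np.zeros(len(y))                     # s_1 = ... = s_N = 0
    trees, alphas = [], []
    for _ in range(T):
        residual = y - s
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        gx = g.predict(X)
        # one-variable linear regression of the residual on g_t(x_n)
        alpha = np.dot(gx, residual) / np.dot(gx, gx)
        s = s + alpha * gx                   # s_n <- s_n + alpha_t g_t(x_n)
        trees.append(g)
        alphas.append(alpha)
    return trees, alphas

def gbdt_predict(trees, alphas, X):
    """G(x) = sum_t alpha_t g_t(x)"""
    return sum(a * g.predict(X) for a, g in zip(alphas, trees))

# usage on toy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
trees, alphas = gbdt_fit(X, y, T=50)
print("training MSE:", np.mean((gbdt_predict(trees, alphas, X) - y) ** 2))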
We already have \(\widehat{E}_{ADA}\):
$$
\underset{\eta}{min}\ \underset{h}{min}\ \widehat{E}_{ADA}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\left(\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)+\eta h(\mathbf{x}_n)\right)}
$$
This is for binary-classification \(h\); can it be extended to arbitrary \(h\)?
Simply replace \(err_{ADA}\), i.e. the exponential error measure, with any other err:
$$
\begin{align*}
&\underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err\left (\underset{s_n}{\underbrace{\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}}+\eta h(\mathbf{x}_n),y_n \right )\\
=&\underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err(s_n+\eta h(\mathbf{x}_n),y_n)\\
[by\ Taylor] \approx& \underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err(s_n,y_n)+\eta h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\
=& \underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err(s_n,y_n)+\frac{1}{N}\sum_{n=1}^{N}\eta h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\
=& \underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}\underset{constants}{\underbrace{err(s_n,y_n)}}+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\
=& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\
\end{align*}
$$
If we plug in \(err(s,y)=(s-y)^2\), i.e. the squared error:
$$
\begin{align*}
& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\
=& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\cdot 2(s_n-y_n)\\
\end{align*}
$$
Intuitively, taking \(h(\mathbf{x}_n)= -\infty \cdot (s_n-y_n)\) would give the minimum,
but that is odd: \(h(\mathbf{x}_n)\) should only determine the direction, and how large a step to take should be decided by \(\eta\).
So a constraint (a magnitude penalty) must be added to \(h(\mathbf{x}_n)\):
$$
\begin{align*}
& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (h(\mathbf{x}_n)\cdot 2(s_n-y_n)+\underbrace{(h(\mathbf{x}_n))^2} \right )\\
=& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (-(s_n-y_n)^2+(s_n-y_n)^2+h(\mathbf{x}_n)\cdot 2(s_n-y_n)+(h(\mathbf{x}_n))^2 \right )\\
=& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (-(s_n-y_n)^2+(s_n-y_n+h(\mathbf{x}_n))^2 \right )\\
=& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (\underset{constant}{\underbrace{-(s_n-y_n)^2}}+(h(\mathbf{x}_n)-(y_n-s_n))^2 \right )\\
=& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (constant+(h(\mathbf{x}_n)-(y_n-s_n))^2 \right )\\
\end{align*}
$$
When minimizing over \(h\), everything else is a constant, so it suffices to minimize \((h(\mathbf{x}_n)-(y_n-s_n))^2\).
This is just regression, with the data replaced by \(\left \{(\mathbf{x}_n,\underset{residual}{\underbrace{y_n-s_n}}) \right \}\).
Using this data we find the best \(h\), namely \(g_t\), and then substitute it back to solve for \(\eta\):
$$
\begin{align*}
&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}err\left (\underset{s_n}{\underbrace{\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}}+\eta g_t(\mathbf{x}_n),y_n \right )\\
\left [\because err(s,y)=(s-y)^2 \right ]=&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}(s_n+\eta g_t(\mathbf{x}_n)-y_n)^2\\
=&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}(-s_n-\eta g_t(\mathbf{x}_n)+y_n)^2\\
=&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))^2\\
\end{align*}
$$
This is a one-variable linear regression on \(\left \{ (g_t\mathrm{-transformed\ input},\ \mathrm{residual}) \right \}\);
just apply the closed-form solution, see
[ML] Machine Learning Foundations: Lecture 9, Linear Regression
or set the derivative to 0:
$$
\begin{align*}
0 &=\frac{\partial }{\partial \eta}\left (\frac{1}{N}\sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))^2 \right ) \\
0 &=\frac{\partial }{\partial \eta}\left (\sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))^2 \right ) \\
0&= \sum_{n=1}^{N}2\cdot ((y_n-s_n)-\eta g_t(\mathbf{x}_n))\cdot g_t(\mathbf{x}_n)\\
0&= \sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))\cdot g_t(\mathbf{x}_n)\\
0&= \sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)-\eta\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)\\
\eta\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)&= \sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)\\
\eta&= \frac{\sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)}{\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)}=\alpha_t\\
\end{align*}
$$
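This closed form can be cross-checked numerically against a no-intercept linear regression; the random numbers below are merely illustrative stand-ins for \(g_t(\mathbf{x}_n)\) and the residuals:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
gx = rng.normal(size=100)                           # stand-in for g_t(x_n)
residual = 0.7 * gx + 0.1 * rng.normal(size=100)    # stand-in for y_n - s_n

alpha_formula = np.dot(gx, residual) / np.dot(gx, gx)          # closed form above
alpha_sklearn = LinearRegression(fit_intercept=False).fit(
    gx.reshape(-1, 1), residual).coef_[0]                      # one-variable regression
print(alpha_formula, alpha_sklearn)                 # the two values agree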
Summary Comparison
- Aggregation Models
- blending (aggregating a set of diverse \(g_t\) obtained beforehand)
- uniform
- simple
- voting/averaging of \(g_t\)
- stable; the \(g_t\) can correct one another
- non-uniform
- linear model on \(g_t\)-transformed inputs
- powerful, but beware of overfitting
- conditional
- nonlinear model on \(g_t\)-transformed inputs
- powerful, but beware of overfitting
- learning (diverse \(g_t\) are obtained automatically while aggregating)
- Bagging
- diverse \(g_t\) from bootstrapping
- uniform vote by nothing
- extensions
- Random Forest
- commonly used in practice
- randomized bagging + strong DTree
- Boosting
- commonly used in practice
- AdaBoost
- diverse \(g_t\) from reweighting
- linear vote by steepest search
- extensions
- GradientBoost
- diverse \(g_t\) from residual fitting
- linear vote by steepest search
- extensions
- Gradient Boosted Decision Tree (GBDT)
- GradientBoost + weak DTree
- Decision Tree
- diverse \(g_t\) from data splitting
- conditional vote by branching
- Specialty
- cure underfitting
- \(G(x)\) strong
- aggregation works like a feature transform
- cure overfitting
- \(G(x)\) moderate
- aggregation works like regularization
- different aggregation schemes have different characteristics
Properly mixing aggregation methods (a.k.a. ensembles) can yield even better results.
Code
Even the \(\alpha_t\) that is optimal for the current stage is not necessarily good for \(E_{out}\).
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

# Load the Boston house-price data
# (note: load_boston was removed in scikit-learn 1.2; this example assumes an older version)
boston = datasets.load_boston()

# Shuffle the data
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)

# Split into training and test data
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

# G_t(x) = G_{t-1}(x) + learning_rate * alpha_t * g_t(x)
# learning_rate=1 keeps the original algorithm
lrs = [1, 0.1, 0.01]
plt.figure()
for index, lr in enumerate(lrs):
    # Build 500 trees of max depth 4; loss 'ls' means least-squares regression
    params = {'n_estimators': 500, 'max_depth': 4, 'learning_rate': lr, 'loss': 'ls'}
    clf = ensemble.GradientBoostingRegressor(**params)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    print("MSE: %.4f" % mse)

    # Compute the test loss at every boosting stage
    test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
    for i, y_pred in enumerate(clf.staged_predict(X_test)):
        test_score[i] = clf.loss_(y_test, y_pred)

    # Plot training/test loss versus boosting iterations
    plt.subplot(len(lrs), 1, index + 1)
    plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
             label='Training Set')
    plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
             label='Test Set')
    plt.legend(loc='right')
    plt.title("learning rate = {}".format(lr))
    plt.ylabel('Loss')
    plt.xlabel('Boosting Iterations')

plt.subplots_adjust(hspace=0.6)
plt.suptitle("Learning Rate")
plt.show()
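The plots compare training and test loss across boosting iterations for each learning rate; the pattern to look for is that shrinking the step (learning_rate < 1) slows the drop in training loss but typically gives a flatter, often lower test-loss curve, which illustrates the point above: the stage-wise optimal \(\alpha_t\) is not necessarily best for \(E_{out}\).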
References
如何通俗地解释泰勒公式? (How to explain Taylor's formula intuitively?)
sklearn.ensemble.GradientBoostingClassifier
sklearn.ensemble.GradientBoostingRegressor