[ML] Machine Learning Techniques: Lecture 11 Gradient Boosted Decision Tree

ML: learning the fundamental techniques
Package: scikit-learn
Course: Machine Learning Techniques (機器學習技法)
Overview: Lecture 11, Gradient Boosted Decision Tree

Adaptive Boosted Decision Tree

  • function AdaBoost-DTree(\(D\))
    • \(\mathbf{u}^{(1)}=[\frac{1}{N},\frac{1}{N},\cdots ,\frac{1}{N}]\)
    • for \(t=1,2,\cdots ,T\), where \(T\) is chosen by the user
      1. Obtain \(g_t\) in one of two ways
        • the \(\mathrm{DTree} \) must be pruned
        • pass in \(\mathbf{u}^{(t)}\):
          \(g_t=\mathrm{DTree}(D,\mathbf{u}^{(t)})\)
        • or let \(\tilde{D}_t\) be sampled from \(D\) according to \(\mathbf{u}^{(t)}\):
          \(g_t=\mathrm{DTree}(\tilde{D}_t)\)
      2. Update \(\mathbf{u}^{(t)}\) to \(\mathbf{u}^{(t+1)}\) by
        $$ \begin{align*} (incorrect\ examples)\ [y_n\neq g_t(\mathbf{x}_n)] &: u_n^{(t+1)} \leftarrow u_n^{(t)}\: \cdot\: \blacklozenge _t \\ (correct\ examples)\ [y_n= g_t(\mathbf{x}_n)] &: u_n^{(t+1)} \leftarrow u_n^{(t)}\ /\ \blacklozenge _t \\ \blacklozenge _t =\sqrt{\frac{1-\epsilon _t}{\epsilon _t}}\ and\ \epsilon _t&=\frac{\sum_{n=1}^{N}u_n^{(t)}[y_n\neq g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} \end{align*} $$
      3. Compute \(\alpha_t=\mathrm{ln}(\blacklozenge _t )\)
    • Return \(G(\mathbf{x})=sign \left (\sum_{t=1}^{T}\alpha_t g_t(\mathbf{x}) \right )\)
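As a quick numeric example (numbers made up for illustration): if \(\epsilon_t=0.25\), then
$$ \blacklozenge_t=\sqrt{\frac{1-0.25}{0.25}}=\sqrt{3}\approx 1.73,\qquad \alpha_t=\mathrm{ln}\sqrt{3}\approx 0.55 $$
so each misclassified example has its weight multiplied by about \(1.73\), each correctly classified example has its weight divided by the same factor, and this \(g_t\) gets vote weight \(\alpha_t\approx 0.55\) in \(G\).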
Because a decision tree is rather complex, it is not always possible to directly compute or optimize \(E_{in}^\mathbf{u}(h)=\frac{1}{N}\sum_{n=1}^{N}u_n\cdot err(y_n,h(\mathbf{x}_n))\) inside the tree algorithm.
So why not go back to the original idea: \(\mathbf{u}\) came from bootstrap sampling in the first place.
Treat the decision tree as a black box and only manipulate the data handed to it:
draw \(N{}'\) examples from \(D\) to form \(\tilde{D}_t\), with sampling proportions proportional to \(\mathbf{u}^{(t)}\).

Suppose the decision tree is a fully grown tree and all \(\mathbf{x}_n\) are distinct.
Then \(E_{in}(g_t) = 0 \Rightarrow E_{in}^\mathbf{u}(g_t)=0\),
and consequently \(\epsilon _t=0 \Rightarrow \alpha_t=\infty \).
This defeats the purpose: we wanted many trees to decide the result together, yet we end up with a single tree dictating everything.
Therefore we need a pruned tree, trained on only part of the data.
For pruning, see [ML] Machine Learning Techniques: Lecture 9 Decision Tree,
or simply limit the height of the tree.
And sampling according to \(\mathbf{u}\) is effectively training on only part of the data.

The AdaBoost-Stump of [ML] Machine Learning Techniques: Lecture 8 Adaptive Boosting
is really a special case of AdaBoost-DTree, since it is equivalent to restricting the tree height to \(H \le 1\);
and because a stump is so simple, \(E_{in}^\mathbf{u}(h)=\frac{1}{N}\sum_{n=1}^{N}u_n\cdot err(y_n,h(\mathbf{x}_n))\) can be computed directly.
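To make the procedure concrete, below is a minimal sketch of AdaBoost-DTree using the resampling option, with scikit-learn's DecisionTreeClassifier as the weak tree (depth-limited rather than formally pruned). It assumes binary labels in \(\{-1,+1\}\); the function names and parameter values are my own choices, not part of the course.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_dtree(X, y, T=10, max_depth=3, seed=0):
    # y is assumed to contain labels in {-1, +1}
    rng = np.random.default_rng(seed)
    N = len(y)
    u = np.full(N, 1.0 / N)                  # u^(1): uniform example weights
    trees, alphas = [], []
    for _ in range(T):
        # D~_t: resample N examples with probability proportional to u^(t)
        idx = rng.choice(N, size=N, p=u / u.sum())
        g = DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx])
        pred = g.predict(X)
        # weighted error epsilon_t on the full data
        eps = np.sum(u * (pred != y)) / np.sum(u)
        if eps <= 0 or eps >= 0.5:           # degenerate weak learner: stop early
            break
        diamond = np.sqrt((1 - eps) / eps)   # the scaling factor (the black diamond)
        u = np.where(pred != y, u * diamond, u / diamond)
        trees.append(g)
        alphas.append(np.log(diamond))       # alpha_t = ln(diamond_t)
    return trees, alphas

def adaboost_predict(trees, alphas, X):
    # G(x) = sign( sum_t alpha_t g_t(x) )
    votes = sum(a * t.predict(X) for t, a in zip(trees, alphas))
    return np.sign(votes)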

Optimization View of AdaBoost

$$ \underset{\eta}{min}\ \underset{h}{min}\ \widehat{E}_{ADA} = \sum_{n=1}^{N}u_n^{(t+1)}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\left (\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)+\eta h(\mathbf{x}_n) \right )} $$
  1. Get \(g_t\) by $$ \underset{\color{Cyan}{h}}{min}\ \widehat{E}_{ADA}=\sum_{n=1}^{N}u_n^{(t)}e^{-y_n\eta \color{Cyan}{h(\mathbf{x}_n)}} $$ the \(h\) that minimizes \(E_{in}^{\mathbf{u}^{(t)}}(h)\) is taken as \(g_t\)
  2. Substitute \(g_t\) and get \(\eta_t\) by $$ \underset{\color{Red}{\eta_t}}{min}\ \widehat{E}_{ADA}=\sum_{n=1}^{N}u_n^{(t)}e^{-y_n\color{Red} {\eta_t} g_t(\mathbf{x}_n)} $$ which gives \(\eta_t=\mathrm{ln}\sqrt{\frac{1-\epsilon _t}{\epsilon _t}}=\alpha_t\)
$$ \begin{align*} u_n^{(t+1)}&= \left\{\begin{matrix} u_n^{(t)}\cdot \blacklozenge _t & if\ incorrect\\ u_n^{(t)}/ \blacklozenge _t& if\ correct \end{matrix}\right.\\ &= u_n^{(t)}\cdot \blacklozenge _t^{-y_ng_t(\mathbf{x}_n)}\\ &= u_n^{(t)}\cdot e^{-y_n\alpha_t g_t(\mathbf{x}_n)}\\ \\ u_n^{(T+1)}&=u_n^{(1)}\cdot \prod_{t=1}^{T}e^{-y_n\alpha_t g_t(\mathbf{x}_n)}\\ &=\frac{1}{N}\cdot e^{-y_n\color{Orange} {\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)}} \end{align*} $$ $$ G(\mathbf{x}_n)=sign\left (\overset{\mathrm{voting\ score}}{\overbrace{\sum_{t=1}^{T}\underset{w_i}{\underbrace{\alpha_t}}\underset{\phi _i(\mathbf{x}_n)}{\underbrace{g_t(\mathbf{x}_n)}}}} \right ) $$ Compare this with the hard-margin SVM margin \(=\frac{y_n\cdot (\mathbf{w}^T\phi (\mathbf{x}_n)+b)}{\left \| \mathbf{w} \right \|}\).
See [ML] Machine Learning Techniques: Lecture 1 Linear SVM.

The \(\mathrm{voting\ score}\) is exactly such a weighted sum of transformed features, just with \(b=0\).
So \(y_n(\mathrm{voting\ score})\) is an unnormalized margin,
and we would like \(y_n(\mathrm{voting\ score})\) to be a large positive value,
so that \(e^{-y_n(\mathrm{voting\ score})}\) is a very small value,
and hence \(u_n^{(T+1)}\) is also very small.

In AdaBoost, \(\sum_{n=1}^{N}u_n^{(t)}\) keeps getting smaller, which in effect also shrinks the individual \(u_n^{(t)}\).
How can this be shown?
First, the final weights \(u_n^{(T+1)}\) satisfy
$$ \sum_{n=1}^{N}u_n^{(T+1)}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n(\mathrm{voting\ score})} $$ So when the voting score has the correct sign and a large magnitude, \(-y_n(\mathrm{voting\ score})\) is a large negative value and the corresponding term is tiny.
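One can also see the shrinking directly, one round at a time, from the update rule in the algorithm above:
$$ \begin{align*} \sum_{n=1}^{N}u_n^{(t+1)} &= \sum_{y_n=g_t(\mathbf{x}_n)}\frac{u_n^{(t)}}{\blacklozenge_t}+\sum_{y_n\ne g_t(\mathbf{x}_n)}u_n^{(t)}\cdot \blacklozenge_t\\ &=\left ( \sqrt{\frac{\epsilon_t}{1-\epsilon_t}}\cdot(1-\epsilon_t)+\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\cdot\epsilon_t \right )\sum_{n=1}^{N}u_n^{(t)}\\ &=2\sqrt{\epsilon_t(1-\epsilon_t)}\cdot \sum_{n=1}^{N}u_n^{(t)}\ \le\ \sum_{n=1}^{N}u_n^{(t)} \end{align*} $$
with a strict decrease whenever \(\epsilon_t<\frac{1}{2}\), i.e. whenever \(g_t\) is better than random guessing.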
Let \(s=\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n)\).
Then \(err_{0/1}(s,y)=[ys \le 0]\) is the classification error,
and from the expression above we can define \(err_{ADA}(s,y)=e^{-ys}\), the exponential error measure.
Plotting both errors against \(ys\) shows that doing well on \(err_{ADA}\) effectively also does well on \(err_{0/1}\) (see the bound below).
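This works because \(err_{ADA}\) upper-bounds \(err_{0/1}\) for \(y\in\{-1,+1\}\):
$$ err_{0/1}(s,y)=[ys\le 0]\ \le\ e^{-ys}=err_{ADA}(s,y) $$
since \(e^{-ys}\ge e^{0}=1\) whenever \(ys\le 0\), and \(e^{-ys}>0\) always; so pushing \(err_{ADA}\) down also pushes \(err_{0/1}\) down.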
See [ML] Machine Learning Foundations: Lecture 11 Linear Models for Classification.
From [ML] Machine Learning Foundations: Lecture 10 Logistic Regression we know gradient descent:
by the Taylor expansion, or \(f{}'(x)=\underset{dx \to 0}{lim}\frac{f(x+dx)-f(x)}{dx}\),
\(\underset{\left \| \mathbf{v} \right \|=1}{min}\ E_{in}(\mathbf{w}_t+\eta \mathbf{v})\approx E_{in}(\mathbf{w}_t)+\eta \mathbf{v}^T\triangledown E_{in}(\mathbf{w}_t)\)
Viewed from another angle, a vector is also a kind of function: given an index, it returns the corresponding value.
So we can extend the idea above from finding a good direction vector to finding a good function.
Then at round \(t\), \(g_t\) can be found via
$$ \begin{align*} \underset{h}{min}\ \widehat{E}_{ADA} &= \sum_{n=1}^{N}u_n^{(t+1)}\\ &=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\left (\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)+\eta h(\mathbf{x}_n) \right )}\\ &=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}\cdot e^{-y_n\eta h(\mathbf{x}_n)}\\ &=\sum_{n=1}^{N}\frac{1}{N}e^{-y_n\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}\cdot e^{-y_n\eta h(\mathbf{x}_n)}\\ &=\sum_{n=1}^{N}u_n^{(t)}\cdot e^{-y_n\eta h(\mathbf{x}_n)}\\ [by\ Taylor] &\approx \sum_{n=1}^{N}u_n^{(t)}(1-y_n\eta h(\mathbf{x}_n))\\ &=\sum_{n=1}^{N}u_n^{(t)}-\sum_{n=1}^{N}u_n^{(t)}y_n\eta h(\mathbf{x}_n)\\ &=\sum_{n=1}^{N}u_n^{(t)}+\eta\sum_{n=1}^{N}u_n^{(t)}(-y_n h(\mathbf{x}_n))\\ \end{align*} $$ In effect this minimizes \(\sum_{n=1}^{N}u_n^{(t)}(-y_n h(\mathbf{x}_n))\), and
$$ \begin{align*} \sum_{n=1}^{N}u_n^{(t)}(-y_n h(\mathbf{x}_n))&=\sum_{n=1}^{N}u_n^{(t)}\left\{\begin{matrix} -1 & if\ y_n=h(\mathbf{x}_n)\\ +1 & if\ y_n\ne h(\mathbf{x}_n) \end{matrix}\right.\\ &=-\sum_{n=1}^{N}u_n^{(t)}+\sum_{n=1}^{N}u_n^{(t)}\left\{\begin{matrix} 0 & if\ y_n=h(\mathbf{x}_n)\\ 2 & if\ y_n\ne h(\mathbf{x}_n) \end{matrix}\right.\\ \left [\because E_{in}^{\mathbf{u}^{(t)}}(h)=\frac{1}{N}\sum_{n=1}^{N}u_n^{(t)}\cdot [y_n \ne h(\mathbf{x}_n)] \right ]&=-\sum_{n=1}^{N}u_n^{(t)}+2N \cdot E_{in}^{\mathbf{u}^{(t)}}(h) \end{align*} $$ So minimizing \(E_{in}^{\mathbf{u}^{(t)}}(h)\) yields \(g_t\).
How is that done? This is exactly what the A (the base learning algorithm) in AdaBoost does; see [ML] Machine Learning Techniques: Lecture 8 Adaptive Boosting.
Having obtained \(g_t\), we can also solve for the best step size \(\eta\) along this direction; this approach is called steepest descent for optimization.
$$ \begin{align*} \underset{\eta}{min}\ \widehat{E}_{ADA}&=\sum_{n=1}^{N}u_n^{(t)}\cdot e^{-y_n\eta g_t(\mathbf{x}_n)}\\ &= \sum_{n=1}^{N}u_n^{(t)}\cdot\left\{\begin{matrix} e^{-\eta} & if\ y_n=g_t(\mathbf{x}_n)\\ e^{\eta} & if\ y_n \ne g_t(\mathbf{x}_n) \end{matrix}\right.\\ &= \sum_{n=1}^{N}u_n^{(t)}\cdot \left ( e^{-\eta}[ y_n=g_t(\mathbf{x}_n)] + e^{\eta}[ y_n \ne g_t(\mathbf{x}_n)] \right )\\ &= e^{-\eta}\sum_{n=1}^{N}u_n^{(t)}[ y_n=g_t(\mathbf{x}_n)] + e^{\eta}\sum_{n=1}^{N}u_n^{(t)}[ y_n \ne g_t(\mathbf{x}_n)]\\ &= \left (\sum_{n=1}^{N}u_n^{(t)} \right )\cdot \left (e^{-\eta}\frac{\sum_{n=1}^{N}u_n^{(t)}[ y_n=g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} + e^{\eta}\frac{\sum_{n=1}^{N}u_n^{(t)}[ y_n \ne g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} \right )\\ \left [\because \epsilon _t=\frac{\sum_{n=1}^{N}u_n^{(t)}[y_n\neq g_t(\mathbf{x}_n)]}{\sum_{n=1}^{N}u_n^{(t)}} \right ] &= \left (\sum_{n=1}^{N}u_n^{(t)} \right )\cdot \left (e^{-\eta}\cdot(1-\epsilon_t) + e^{\eta}\cdot\epsilon_t \right )\\ \\ \text{Setting}\ \frac{\partial \widehat{E}_{ADA}}{\partial \eta}&=\left (\sum_{n=1}^{N}u_n^{(t)} \right )\cdot \left (-e^{-\eta}\cdot(1-\epsilon_t) + e^{\eta}\cdot\epsilon_t \right )=0\\ e^{-\eta}\cdot(1-\epsilon_t)&=e^{\eta}\cdot\epsilon_t\\ \frac{1-\epsilon_t}{\epsilon_t}&=e^{2\eta}\\ \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}&=e^{\eta}\\ \eta&=\mathrm{ln}\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}=\alpha_t\\ \end{align*} $$
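As a sanity check (a toy example of my own, not from the lecture), the closed-form \(\eta\) can be compared against a brute-force search over the weighted exponential error:

import numpy as np

# A toy weighted sample: 20 examples, g_t is wrong on the first 6 of them.
N = 20
u = np.linspace(0.5, 1.5, N)           # made-up positive example weights u^(t)
y = np.ones(N)                         # labels in {-1, +1}
g = y.copy()
g[:6] = -1                             # g_t misclassifies the first 6 examples

eps = np.sum(u * (g != y)) / np.sum(u)
eta_closed = np.log(np.sqrt((1 - eps) / eps))   # the closed-form eta derived above

# Brute-force grid search over eta for sum_n u_n * exp(-y_n * eta * g_t(x_n))
etas = np.linspace(0.0, 3.0, 30001)
E = np.array([np.sum(u * np.exp(-y * eta * g)) for eta in etas])
eta_grid = etas[np.argmin(E)]

print(eta_closed, eta_grid)            # the two values agree up to grid resolution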

Gradient Boosted Decision Tree (GBDT)

  • \(s_1=s_2=\cdots=s_N=0\) 
  • for \(t=1,2,\cdots,T\)
    1. \(g_t(\mathbf{x}_n)=A(\left \{(\mathbf{x}_n,y_n-s_n)\right \})\)
      \(A\) is a squared-error regression algorithm
      • a sampled and pruned C&RT is a good choice for \(A\)
    2. \(\alpha_t=OneVarLinearRegression(\left \{(g_t(\mathbf{x}_n),y_n-s_n)  \right \})= \frac{\sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)}{\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)}\)
    3. Update \(s_n \leftarrow s_n+\alpha_tg_t(\mathbf{x}_n)\)
  • Return \(G(\mathbf{x})=\sum_{t=1}^{T}\alpha_t g_t(\mathbf{x})\)
We already have \(\widehat{E}_{ADA}\):
$$ \underset{\eta}{min}\ \underset{h}{min}\ \widehat{E}_{ADA}=\frac{1}{N}\sum_{n=1}^{N}e^{-y_n\left (\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)+\eta h(\mathbf{x}_n) \right )} $$ This is for a binary-classification \(h\); can it be extended to an arbitrary \(h\)?
Simply replace the exponential error measure in \(\widehat{E}_{ADA}\) with some other error measure \(err\):
$$ \begin{align*} &\underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err\left (\underset{s_n}{\underbrace{\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}}+\eta h(\mathbf{x}_n),y_n \right )\\ =&\underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err(s_n+\eta h(\mathbf{x}_n),y_n)\\ [by\ Taylor] \approx& \underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err(s_n,y_n)+\eta h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\ =& \underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}err(s_n,y_n)+\frac{1}{N}\sum_{n=1}^{N}\eta h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\ =& \underset{\eta}{min}\ \underset{h}{min}\ \frac{1}{N}\sum_{n=1}^{N}\underset{constants}{\underbrace{err(s_n,y_n)}}+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\ =& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\ \end{align*} $$ Now substitute the squared error \(err(s,y)=(s-y)^2\):
$$ \begin{align*} & \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\left.\begin{matrix}\frac{\partial err(s,y_n)}{\partial s}\end{matrix}\right|_{s=s_n}\\ =& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} h(\mathbf{x}_n)\cdot 2(s_n-y_n)\\ \end{align*} $$ Intuitively, taking \(h(\mathbf{x}_n)= -\infty \cdot (s_n-y_n)\) would drive this to its minimum,
but something is off: \(h(\mathbf{x}_n)\) should only determine the direction, and the step size should be determined by \(\eta\).
So a constraint on the magnitude of \(h(\mathbf{x}_n)\) must be added:
$$ \begin{align*} & \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (h(\mathbf{x}_n)\cdot 2(s_n-y_n)+\underset{penalty}{\underbrace{(h(\mathbf{x}_n))^2}} \right )\\ =& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (-(s_n-y_n)^2+(s_n-y_n)^2+h(\mathbf{x}_n)\cdot 2(s_n-y_n)+(h(\mathbf{x}_n))^2 \right )\\ =& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (-(s_n-y_n)^2+(s_n-y_n+h(\mathbf{x}_n))^2 \right )\\ =& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (\underset{constant}{\underbrace{-(s_n-y_n)^2}}+(h(\mathbf{x}_n)-(y_n-s_n))^2 \right )\\ =& \underset{\eta}{min}\ \underset{h}{min}\ constants+\frac{\eta}{N}\sum_{n=1}^{N} \left (constant+(h(\mathbf{x}_n)-(y_n-s_n))^2 \right )\\ \end{align*} $$ When minimizing over \(h\), everything else is constant, so it suffices to minimize \((h(\mathbf{x}_n)-(y_n-s_n))^2\).
This is just regression, with the data replaced by \(\left \{(\mathbf{x}_n,\underset{residual}{\underbrace{y_n-s_n}}) \right \}\).
Use this data to find the best \(h\), namely \(g_t\), then plug it back in to solve for \(\eta\):
$$ \begin{align*} &\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}err\left (\underset{s_n}{\underbrace{\sum_{\tau=1}^{t-1}\alpha_\tau g_\tau(\mathbf{x}_n)}}+\eta g_t(\mathbf{x}_n),y_n \right )\\ \left [\because err(s,y)=(s-y)^2 \right ]=&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}(s_n+\eta g_t(\mathbf{x}_n)-y_n)^2\\ =&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}(-s_n-\eta g_t(\mathbf{x}_n)+y_n)^2\\ =&\underset{\eta}{min}\ \frac{1}{N}\sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))^2\\ \end{align*} $$ This is simply a one-variable linear regression on \(\left \{ (g_t-transformed\ input,residual) \right \}\).
Just apply the closed-form solution (see [ML] Machine Learning Foundations: Lecture 9 Linear Regression),
or set the derivative to 0:
$$ \begin{align*} 0 &=\frac{\partial }{\partial \eta}\left (\frac{1}{N}\sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))^2 \right ) \\ 0 &=\frac{\partial }{\partial \eta}\left (\sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))^2 \right ) \\ 0&= \sum_{n=1}^{N}2\cdot ((y_n-s_n)-\eta g_t(\mathbf{x}_n))\cdot g_t(\mathbf{x}_n)\\ 0&= \sum_{n=1}^{N}((y_n-s_n)-\eta g_t(\mathbf{x}_n))\cdot g_t(\mathbf{x}_n)\\ 0&= \sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)-\eta\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)\\ \eta\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)&= \sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)\\ \eta&= \frac{\sum_{n=1}^{N}g_t(\mathbf{x}_n)(y_n-s_n)}{\sum_{n=1}^{N} g_t^2(\mathbf{x}_n)}=\alpha_t\\ \end{align*} $$
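Putting the pieces together, here is a minimal from-scratch GBDT sketch that follows the algorithm listed at the top of this section, using scikit-learn's DecisionTreeRegressor as the base regressor \(A\) (the sampling step is omitted for brevity); the function names are my own:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, T=100, max_depth=3):
    # s_1 = ... = s_N = 0
    s = np.zeros(len(y))
    trees, alphas = [], []
    for _ in range(T):
        residual = y - s
        # A: a squared-error regression tree fit on {(x_n, y_n - s_n)}
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        gx = g.predict(X)
        if not np.any(gx):                      # residuals already (near) perfectly fit
            break
        # alpha_t = sum g_t(x_n)(y_n - s_n) / sum g_t(x_n)^2
        alpha = np.dot(gx, residual) / np.dot(gx, gx)
        s += alpha * gx                         # s_n <- s_n + alpha_t g_t(x_n)
        trees.append(g)
        alphas.append(alpha)
    return trees, alphas

def gbdt_predict(trees, alphas, X):
    # G(x) = sum_t alpha_t g_t(x)
    return sum(a * t.predict(X) for t, a in zip(trees, alphas))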

Overall Comparison

  • Aggregation Models
    • blending (aggregation that requires diverse \(g_t\) obtained beforehand)
      • uniform
        • simple
        • voting/averaging of \(g_t\)
        • stable; the \(g_t\) can correct one another
      • non-uniform
        • linear model on \(g_t\)-transformed inputs 
        • powerful, but beware of overfitting
      • conditional 
        • nonlinear model on \(g_t\)-transformed inputs 
        • powerful, but beware of overfitting
    • learning (aggregation that obtains diverse \(g_t\) automatically during learning)
      • Bagging
        • diverse \(g_t\) from bootstrapping
        • uniform vote by nothing
        • extensions
          • Random Forest
            • commonly used in practice
            • randomized bagging + strong DTree
      • Boost 
        • commonly used in practice
        • AdaBoost
          • diverse \(g_t\) from reweighting
          • linear vote by steepest search 
          • extensions
            • AdaBoost-DTree
              •  AdaBoost + weak DTree
        • GradientBoost
          • diverse \(g_t\) from residual fitting
          • linear vote by steepest search
          • extensions
            • Gradient Boosted Decision Tree (GBDT)
              • GradientBoost + weak DTree
      • Decision Tree
        • diverse \(g_t\) from data splitting
        • conditional vote by branching
    • Specialty
      • cure underfitting
        • \(G(x)\) strong
        • aggregation acts like a feature transform
      • cure overfitting
        • \(G(x)\) moderate
        • aggregation acts like regularization
    • Different aggregation models have different specialties;
      mixing aggregation (a.k.a. ensemble) models appropriately can yield even better results

Code

Even though \(\alpha_t\) is optimal for its own stage, that is not necessarily good for \(E_{out}\); this is why the code below also tries smaller learning rates (shrinkage).
import numpy as np
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

# Load the Boston house-price dataset
# (note: load_boston, loss='ls', and clf.loss_ below follow the older scikit-learn API)
boston = datasets.load_boston()
# Shuffle the data
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)

# Split into training and test data
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

# G_t(x) = G_{t-1}(x) + learning_rate * alpha_t * g_t(x)
# learning_rate=1 keeps the original algorithm unchanged
lrs = [1, 0.1, 0.01]
plt.figure()
for index, lr in enumerate(lrs):
    # Build 500 trees of max depth 4; loss 'ls' means least squares regression
    params = {'n_estimators': 500, 'max_depth': 4, 'learning_rate': lr, 'loss': 'ls'}
    clf = ensemble.GradientBoostingRegressor(**params)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    print("MSE: %.4f" % mse)

    # Compute the test-set loss at each boosting stage
    test_score = np.zeros((params['n_estimators'],), dtype=np.float64)
    for i, y_pred in enumerate(clf.staged_predict(X_test)):
        test_score[i] = clf.loss_(y_test, y_pred)

    # Plot training vs. test loss for this learning rate
    plt.subplot(len(lrs), 1, index+1)
    plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
             label='Training Set')
    plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
             label='Test Set')    

    plt.legend(loc='right')
    plt.title("learning rate = {}".format(lr))
    plt.ylabel('Loss')

plt.xlabel('Boosting Iterations')
plt.subplots_adjust(hspace=0.6)
plt.suptitle("Learning Rate")
plt.show()

References

如何通俗地解释泰勒公式? (How to explain Taylor's formula intuitively?)
sklearn.ensemble.GradientBoostingClassifier
sklearn.ensemble.GradientBoostingRegressor
