[ML] 機器學習技法：第九講 Decision Tree

ML：基礎技法學習
Package：scikit-learn
課程：機器學習技法
簡介：第九講 Decision Tree

提要

優點

可輕易被解釋
簡單實現
有效率

缺點

理論支持薄弱
參數無理論保證，難以設計
無特定代表性的演算法，只有流不流行

基本算式

$$ G(\mathbf{x})=\sum_{c=1}^{C}[b(\mathbf{x})=c]\cdot G_c(\mathbf{x}) $$ $G(\mathbf{x})$：表示整顆樹
$b(\mathbf{x})$：表示分支的條件
$G_c(\mathbf{x})$：表示 c 條件下分支的子樹

基本演算法

function $DecisionTree(D=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$

if (符合中止條件)

回傳 $g_t(\mathbf{x})$

else

學習 $b(\mathbf{x})$
將 $D$ 分割成 C 個部分 $D_c=\{(\mathbf{x}_n,y_n):b(\mathbf{x}_n)=c\}$
遞迴 $G_c(x)= DecisionTree(D_c)$
回傳 $G(\mathbf{x})=\sum_{c=1}^{C}[b(\mathbf{x})=c]\cdot G_c(\mathbf{x})$

Classiﬁcation and Regression Tree (C&RT)

來自於 CART^TM of California Statistical Software

function $DecisionTree(D=\{(\mathbf{x}_n,y_n)\}_{n=1}^N$

if (無法再切割) => 故為 fully-grown tree

回傳 $g_t(\mathbf{x})=E_{in}-optimal\ constant$

else

學習 $b(\mathbf{x})$
$$ b(\mathbf{x})\underset{decision\ stumps\ h(\mathbf{x})}{=argmin}\sum_{c=1}^{2}|D_c\ with\ h|\cdot impurity(D_c\ with\ h) $$

classification 使用 Gini index
$$impurity(D)=1-\sum_{k=1}^{K}\left (\frac{\sum_{n=1}^{N}[y_n=k]}{N} \right )^2$$
regression
$$impurity(D)=\frac{1}{N}\sum_{n=1}^{N}(y_n-\bar{y})^2$$
$\bar{y}$ 為 $\{y_n\}$ 的平均

將 $D$ 分割成 2 個部分 $D_c=\{(\mathbf{x}_n,y_n):b(\mathbf{x}_n)=c\}$
遞迴 $G_c(x)= DecisionTree(D_c)$
回傳 $G(\mathbf{x})=\sum_{c=1}^{\color{Orange}{2}}[b(\mathbf{x})=c]\cdot G_c(\mathbf{x})$

關於 Decision Tree

Regularization by Pruning

當 $\mathbf{x_n}$ 都不一樣時，fully-grown tree 雖然可以 $E_{in}(G)=0$
但這也表示很容易 overfit，因為越接近 leaf 的資料 $D_c$ 越少，訓練出來的越有問題
regularized decision tree，通常也叫做 pruned decision tree
利用此式子，$\Omega (G)$ 為 leaf 的個數
$\underset{all\ possible\ G}{argmin}E_{in}(G)+\lambda \Omega (G) $
實際上無法考慮所有的 $G$，故實務上只針對訓練出來的 $G$ 做處理

$G^{(0)}$ = fully-grown tree
$G^{(i)}$ = $\underset{G}{argmin}\ E_{in}(G)$
$G^{(i−1)}$ 移除一個 leaf 的所有選擇中，最好 $E_{in}(G)$ 的 $G$
移除 leaf 的意義即是跟隔壁資料做合併，重新計算 $g_t(\mathbf{x})$
將全部的 $G^{(i)}$ 代入 regularizer decision tree，使用 Validation 得到合適的 $\lambda$
或者直接用此得到合適的 $G$

Branching on Categorical Features

numerical feature，像是血壓
$b(\mathbf{x})=[x_i \le \theta ]+1$，$\theta \in \mathbb{R}$
若換到 Categorical Features 也可輕易應用，像是症狀
$b(\mathbf{x})=[x_i \in S ]+1$，$S \subset {1,2,\cdots ,K}$
decision subset，某幾類走左邊，其餘走右邊，由 $S$ 決定

Missing Features by Surrogate Branch

可適用缺少資料的時候
訓練時，同時記錄替代的 feature，令其分類跟原本的最佳解差不多
替代的 $\ branch\ b_1(\mathbf{x}), b_2(\mathbf{x}),\cdots \approx best\ branch\ b(\mathbf{x})$

C&RT 特點

與 AdaBoost 比較

只針對特定區域做切割，並非是整區切割
較有效率，t=90 已完成，但 AdaBoost 仍在微調

特點

容易解釋
multiclass 實現簡單
categorical features 應用簡單
missing features 處理簡單
non-linear training (and testing) 有效率

與之對應

C4.5 演算法，在關鍵點的處理不同而已

比較

blending

適用已知 $g_t$，$g_t$ 來自不同的 $H$，只是單純合併

learning

適用未知 $g_t$，$g_t$ 來自同樣的 $H$，邊學邊決定

程式碼

Overfitting 需注意

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# 建立訓練資料
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# 每五個加入 noise
y[::5] += 3 * (0.5 - rng.rand(16))

# 訓練模型
estimators = [("max_depth=max", DecisionTreeRegressor(), "slategray"),
              ("max_depth=5", DecisionTreeRegressor(max_depth=5), "yellowgreen"),
              ("max_depth=2", DecisionTreeRegressor(max_depth=2), "cornflowerblue"),]
size = len(estimators)
# 預測資料
X_test = np.arange(0.0, 5.0, 0.01)[:, None]

# 畫圖
plt.figure()

for i, (label, clf, c) in enumerate(estimators):
    # 訓練
    clf.fit(X, y)
    # 預測
    yp = clf.predict(X_test)
    # 畫預測的線
    plt.plot(X_test, yp, label=label, linewidth=size-i, c=c)

# 畫出原始資料
plt.scatter(X, y, c="darkorange", label="data")
# 標示 x, y
plt.xlabel("data")
plt.ylabel("target")
# 設定 title
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

參考

1.10. Decision Trees
sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor

子風的知識庫

搜尋此網誌