[ML] 機器學習技法：第十講 Random Forest

ML：基礎技法學習
Package：scikit-learn
課程：機器學習技法
簡介：第十講 Random Forest

Random Forest Algorithm

RF = bagging + random-combination C&RT — randomness everywhere

function RandomForest($D$)

for $t=1,2,\cdots ,T$

利用 bootstrapping 從 $D$ 取出 size-N' 的 $\tilde{D}_t$
$g_t=$ $\mathrm{DTree}(\tilde{D}_t)$

function $\mathrm{DTree}(\tilde{D}_t)$

if(滿足終止條件)

回傳 $g_t$

else

學習 $b(\mathbf{\Phi (x)})$ 並用此切割 $D$ 為 $D_c$

$\mathbf{\Phi (x)}=\mathbf{P}\cdot \mathbf{x}$

$\mathbf{P}$ 為低維度投影矩陣
row i of $\mathbf{P}$ 是隨機挑選 feature 組合

$G_c=\mathrm{DTree}(D_c)$
回傳 $G(\mathbf{x})=\sum_{c=1}^{C}[b(\mathbf{x})=c]G_c(\mathbf{x})$

回傳 $G(\mathbf{x})=\mathrm{Uniform}(\{g_t\})$

優勢

容易平行化(bagging)
繼承 Decision Tree 的優點

容易解釋
multiclass 實現簡單
categorical features 應用簡單
missing features 處理簡單
non-linear training (and testing) 有效率

降低 fully-grown tree 的缺點

容易 overfit

Out-Of-Bag Estimate (OOB)

$$ \begin{align*} G_{m^*} &= RF_{m^*}(D)\\ m^* &= \underset{1\le m\le M}{argmin}\ E_m\\ E_m &= E_{oob}(RF_m(D)) \end{align*} $$ 用 $E_{oob}$ for self-validation—像是決定 $d{}''$ (多少個 features 做線性組合)
且無需再用全部的資料重新訓練一次，得到最後的結果

Feature Selection

優點

令後續的演算法更有效率
移除不必要的 features，排除雜訊
剩下的 features 也許可解釋

缺點

計算量過多，從 $N$ 取 $N'$ 個，共有 $C^N_{N'}$ 個組合
假如沒做好的話，會選到看似很好其實並不然的組合

容易 overfit
無法解釋 features 的關聯性

Linear Model $\mathbf{w}^T\mathbf{x}$
$$ importance(i)=|w_i| $$

RF，實務上相當常用
$$ importance(i) = E_{oob}(G)-E^{(p)}_{oob}(G) $$ $E^{(p)}_{oob}(G)$ 對於 $x_{n,i}$ 的 permutation 來自 OOB 資料

實際例子

左圖為只替 C&RT 加入 features 的線性組合，也就是 random combination
右圖為 RF，但其 $N'=\frac{N}{2}$，也就是 bootstrapping 只抽出一半的資料量
可看到 RF 更加的 smooth 與 large margin

左圖為 t=21 的 tree
右圖為 $G$，可看到 noise 的影響已被去除掉大部分

注意事項

因為 RF 本質是隨機的演算法，需要足夠的 trees，才能達到穩定
如何決定何謂足夠的 trees，看一下 $G$ 的穩定性
像是 $E_{oob}$ 每次的結果，增減 tree 是否會有變化

程式碼

使用 b(x) 時的 features 不一定越少越好

import matplotlib.pyplot as plt

from collections import OrderedDict
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 123

# 建立 500 筆資料，有 25 個 features，15 個重要 features，10 個雜訊 features
# 每個類別的資料個別分為兩大群，也就是分佈在兩個不同的位置
X, y = make_classification(n_samples=500, n_features=25,
                           n_clusters_per_class=2, n_informative=15,
                           random_state=RANDOM_STATE)

# 設定 max_features，每次做 b(x) 所看得 feature 數量
# 設定 warm_start=True，可重覆利用先前做過的 solution 加快速度
ensemble_clfs = [
    ("RandomForestClassifier, max_features='sqrt'",
        RandomForestClassifier(warm_start=True, oob_score=True,
                               max_features="sqrt",
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features='log2'",
        RandomForestClassifier(warm_start=True, max_features='log2',
                               oob_score=True,
                               random_state=RANDOM_STATE)),
    ("RandomForestClassifier, max_features=None",
        RandomForestClassifier(warm_start=True, max_features=None,
                               oob_score=True,
                               random_state=RANDOM_STATE))
]

# 建立一有序 dictionary for 記錄使用
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

# 設定 tree 的上下限數量
min_estimators = 15
max_estimators = 300

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1, 2):
        clf.set_params(n_estimators=i)
        clf.fit(X, y)

        # 記錄大小為 i 的 RF 的 oob error 
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators (n trees)")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()

feature selection 挑出的 features 可能比實際的多

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 123
IMPORTANT_FEATURE = 3

# 建立 1000 筆資料，有 25 個 features，3 個重要 features，22 個雜訊 features
# 每個類別的資料個別分為兩大群，也就是分佈在兩個不同的位置
X, y = make_classification(n_samples=1000, n_features=25,
                           n_clusters_per_class=2, n_informative=IMPORTANT_FEATURE,
                           random_state=RANDOM_STATE)

# 建立一個大小為 250 顆 tree 的 RF
forest = RandomForestClassifier(n_estimators=250,
                              random_state=RANDOM_STATE)

forest.fit(X, y)
importances = forest.feature_importances_
# 計算所有 tree 的 feature 標準差
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
# 由小到大排序並回傳對應的 index，再將之反向，也就是改為由大到小的排序
indices = np.argsort(importances)[::-1]

# 印出排序後的 feature 重要性
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# 畫圖
plt.figure()
plt.title("{} Feature importances in Real".format(IMPORTANT_FEATURE))
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

參考

sklearn.ensemble.RandomForestClassifier
sklearn.ensemble.RandomForestRegressor
Feature importances with forests of trees

子風的知識庫

搜尋此網誌