[ML] 機器學習技法：第七講 Blending and Bagging

ML：基礎技法學習
Package：scikit-learn
課程：機器學習技法
簡介：第七講 Blending and Bagging

Aggregation

select

挑出最好的
$G(\mathbf{x})=g_{t_*}(\mathbf{x})\ with\ t_*=argmin_{t \in \{1,2,\cdots ,T\}}E_{val}(g_t^-)$
採用 $E_{val}(g_t^-)$ 而不使用 $E_{in}(g_t)$ 在於會提高複雜度，也就是 $d_{vc}$，可參考此篇
只採用一個，可能並不符合 aggregation 的期望
因為若在一堆廢渣中挑選，怎麼挑還是廢渣
但 aggregation 希望的是集合廢渣的能力，造出一個還可以接受的 model

mix uniformly

採用所有人的意見，皆給於一票
$G(\mathbf{x})=sign(\sum_{t=1}^{T}1\cdot g_t(\mathbf{x}))$
$G(\mathbf{x})=\frac{1}{T}\sum_{t=1}^{T}1\cdot g_t(\mathbf{x})$

mix non-uniformly

採用所有人的意見，依其專業給於適當票數
$G(\mathbf{x})=sign(\sum_{t=1}^{T}\alpha_t\cdot g_t(\mathbf{x}))\ with\ \alpha \ge 0$
$G(\mathbf{x})=\frac{1}{T}\sum_{t=1}^{T}\alpha_t\cdot g_t(\mathbf{x})\ with\ \alpha \ge 0$
事實上，此 model 包含著

select $\alpha_t=\left [E_{val}(g_t^-)\ smallest \right ]$
mix uniformly $\alpha_t=1$

combine

採用所有人的意見，依不同條件及其專業給於適當票數
$G(\mathbf{x})=sign(\sum_{t=1}^{T}q_t(\mathbf{x})\cdot g_t(\mathbf{x}))\ with\ q_t(\mathbf{x}) \ge 0$
$G(\mathbf{x})=\frac{1}{T}\sum_{t=1}^{T}q_t(\mathbf{x})\cdot g_t(\mathbf{x})\ with\ q_t(\mathbf{x}) \ge 0$
事實上，此 model 包含著

mix non-uniformly $ q_t(\mathbf{x})=\alpha_t$

可能具備的效果

類似 feature transform，三個 $g_t(\mathbf{x})$ 組合出更好的解
類似 regularization，不同的 PLA 解合併出更中庸的解

Uniform Blending

blending (voting)

適用於已知 $g_t$ 的情況

Uniform

每個 $g_t$ 只給予一票
$G(\mathbf{x})=sign(\sum_{t=1}^{T}1\cdot g_t(\mathbf{x}))$
$G(\mathbf{x})=\frac{1}{T}\sum_{t=1}^{T}1\cdot g_t(\mathbf{x})$

適用情況

若 $g_t$ 皆一樣，合併後只會跟原本的 $g_t$ 表現一樣
$g_t$ 需具備多樣性，才可能得到更好的 model

classifier
借由多數 model 的正確性修正少數 mode 的錯誤性
Regression
有的高估，有的低估，合併可回到平均值，更加的中庸
$$ G(\mathbf{x})=\frac{1}{T}\sum_{t=1}^{T}1\cdot g_t(\mathbf{x})\\ \mathrm{avg}(E_{out}(g_t)) =\mathrm{avg}(\mathop{\mathbb{E}} (g_t-G)^2)+E_{out}(G) \ge E_{out}(G) $$

$$ \begin{align*} \mathrm{avg}((g_t(\mathbf{x})-f(\mathbf{x}))^2) &= \mathrm{avg}(g_t(\mathbf{x})^2-2g_t(\mathbf{x})f(\mathbf{x})+f(\mathbf{x})^2)\\ &= \mathrm{avg}(g_t(\mathbf{x})^2)-2\mathrm{avg}(g_t(\mathbf{x}))f(\mathbf{x})+f(\mathbf{x})^2\\ &= \mathrm{avg}(g_t(\mathbf{x})^2)-2G(\mathbf{x})f(\mathbf{x})+f(\mathbf{x})^2\\ &= \mathrm{avg}(g_t(\mathbf{x})^2)-G(\mathbf{x})^2+G(\mathbf{x})^2-2G(\mathbf{x})f(\mathbf{x})+f(\mathbf{x})^2\\ &= \mathrm{avg}(g_t(\mathbf{x})^2)-G(\mathbf{x})^2+(G(\mathbf{x})-f(\mathbf{x}))^2\\ &= \mathrm{avg}(g_t(\mathbf{x})^2)-2G(\mathbf{x})^2+G(\mathbf{x})^2+(G(\mathbf{x})-f(\mathbf{x}))^2\\ &= \mathrm{avg}(g_t(\mathbf{x})^2)-2\mathrm{avg}(g_t(\mathbf{x}))G(\mathbf{x})+G(\mathbf{x})^2+(G(\mathbf{x})-f(\mathbf{x}))^2\\ &= \mathrm{avg}(g_t(\mathbf{x})^2-2g_t(\mathbf{x})G(\mathbf{x})+G(\mathbf{x})^2)+(G(\mathbf{x})-f(\mathbf{x}))^2\\ &= \mathrm{avg}(g_t(\mathbf{x})-G(\mathbf{x}))^2+(G(\mathbf{x})-f(\mathbf{x}))^2\\ \mathrm{avg}(E_{out}(g_t))&=\mathrm{avg}(\mathop{\mathbb{E}} (g_t-G)^2)+E_{out}(G) \ge E_{out}(G) \end{align*} $$

考慮一虛擬的過程 for $t=1,2,\cdots ,T$

從某個分佈 $P^N$(i.i.d.) 挑出 $N$ 筆資料 $D_t$
從 $A(D_t)$ 得到 $g_t$
$$ \mathrm{avg}(E_{out}(g_t))=\mathrm{avg}(\mathop{\mathbb{E}} (g_t-\bar{g})^2)+E_{out}(\bar{g}) $$ 物理意義 bias variance decomposition
演算法的期望表現 = 共識的變異量 + 共識的表現

$$ \bar{g}=\underset{T \to \infty }{lim}G=\underset{T \to \infty }{lim}\frac{1}{T}\sum_{t=1}^{T}g_t=\underset{D}{\mathop{\mathbb{E}} }A(D) $$ 依此代入之前推導的式子
$$ \begin{align*} \mathrm{avg}(E_{out}(g_t))&=\mathrm{avg}(\mathop{\mathbb{E}} (g_t-G)^2)+E_{out}(G) \\ &=\mathrm{avg}(\mathop{\mathbb{E}} (g_t-\bar{g})^2)+E_{out}(\bar{g})\end{align*} $$

實際上平均的過程，就是在消除變異量

如何得到多樣性

來自不同的 hypothesises：$g_1\in H_1,g_2\in H_2,\cdots ,g_T\in H_T$
來自不同的參數： GD with $\eta =0.001,0.01,\cdots,10$
來自本身演算法的隨機性：PLA 不同的初始值，會有不同的結果
來自不同的 data：不同的 data 訓練出不同的 $g_v^-$

Linear and Any Blending

blending (voting)

適用於已知 $g_t$ 的情況

Linear

每個 $g_t$ 只給予一票
$G(\mathbf{x})=sign(\sum_{t=1}^{T}\alpha_t\cdot g_t(\mathbf{x}))\ with\ \alpha \ge 0$
$G(\mathbf{x})=\frac{1}{T}\sum_{t=1}^{T}\alpha_t\cdot g_t(\mathbf{x})\ with\ \alpha \ge 0$

如何求解

linear blending = LinModel + hypotheses as transform + constraints
$$ \underset{\alpha\ge0}{min}\ \frac{1}{N}\sum_{n=1}^{N}err\left ( y_n,\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n) \right ) $$

$$ \begin{align*} \underset{\alpha\ge0}{min}\ E_{in}(\boldsymbol\alpha)&= \underset{\alpha\ge0}{min}\ \frac{1}{N}\sum_{n=1}^{N}\left ( y_n-\sum_{t=1}^{T}\alpha_tg_t(\mathbf{x}_n) \right )^2 \end{align*} $$ 實際上就是一個 two-level learning (同 Platt's Model)，先用 $g_t$ 轉換，再求 Linear Regression

實務運用

實務上求解，會忽略 constraints $\alpha\ge0$，因為即使小於 0，也只是把 $g_t$ 反過來用
就像一個有 99% 錯誤性的 $g_t$，但反過來用時就變成 99% 正確性的 $g_t$
不可用 $E_{in}$，會大大提高複雜度，甚至大過 best of best $\ge d_{VC}(\bigcup_{t=1}^{T} H_t)$
可參考此篇
實務上 $g_t$ 其實是 $g_t^-$，選擇最小 $E_{train}$ 的 $g_t^-$，再利用 $E_{val}$ 挑選 $\boldsymbol\alpha$

演算法

從 $D_{train}$ 得到 $g^-_1,g^-_2,\cdots ,g^-_T$
將 $D_{val}$ 的 $((\mathbf{x}_n,y_n)$ 轉換為 $(\mathbf{z}_n=\Phi ^-(\mathbf{x}_n),y_n)$，$\Phi ^-(\mathbf{x})=(g^-_1(\mathbf{x}),\cdots,g^-_T(\mathbf{x}) )$
依不同需求

Linear Blending

使用 Linear Model$(\mathbf{z}_n,y_n)$，得到 $\boldsymbol\alpha$
重新用所有資料 $D$，得到 $\Phi (\mathbf{x})=(g_1(\mathbf{x}),\cdots,g_T(\mathbf{x}) )$
$G_{LINB}(\mathbf{x})=\underset{\mathbf{w}=\boldsymbol\alpha}{LinearModel}(\Phi (\mathbf{x}))$

Any Blending (Stacking)

使用 Any Model$(\mathbf{z}_n,y_n)$ 計算，得到 $\tilde{g}$
重新用所有資料 $D$，得到 $\Phi (\mathbf{x})=(g_1(\mathbf{x}),\cdots,g_T(\mathbf{x}) )$
$G_{ANYB}(\mathbf{x})=\tilde{g}(\Phi (\mathbf{x}))$
可達到 conditional blending
非常強大，務必小心 overfitting

Bagging

在 Uniform Blending 中，有個虛擬的過程，若想實現並得到 $\bar{g}$ 該如何做呢？

使用有限但很大的 T 數量
$g_t=A(\tilde{D}_t)\ from\ \tilde{D}_t\sim P^N$ 的 $\tilde{D}_t$ 只用 $D$ 產生

bootstrapping

從現有資料，取任意 N 筆，每取一次就放回去，可得 $\tilde{D}_t$
假設 N 筆資料，抽 N 次，抽到完全相同的機率為 $\frac{N!}{N^N}$

兩者對照如下，bagging 是一個簡單的 meta algorithm，適用於任意演算法
(meta algorithm 意指建立在其他 base algorithm $A$ 之上的演算法)

實際 Pocket 情況
每個 Pocket 跑一千輪，共有 25 條 Pocket
可看見產生出一條非線性的 model，分得比原本的好

若演算法對不同的資料越敏感，那麼合併起來可能會更好

程式碼

from itertools import product

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# 訓練資料
X = np.c_[(.4, -.7),
          (-1.5, -1),
          (-1.4, -.9),
          (-1.3, -1.2),
          (-1.1, -.2),
          (-1.2, -.4),
          (-.5, 1.2),
          (-1.5, 2.1),
          (1, 1),
          # --
          (1.3, .8),
          (1.2, .5),
          (.2, -2),
          (.5, -2.4),
          (.2, -2.3),
          (0, -2.7),
          (1.3, 2.1)].T
y = np.r_[[0] * 8 + [1] * 8]

# model
clf1 = linear_model.Perceptron()
clf2 = linear_model.LogisticRegression()
clf3 = SVC(kernel='rbf')
# voting hard 表示取多數決，若是 soft 則為平均
eclf = VotingClassifier(estimators=[('Pocket', clf1), ('LogisticRegression', clf2), ('svc', clf3)],
                        voting='hard')

clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
eclf.fit(X, y)

# 建立畫圖資料
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1),
                      np.arange(x2_min, x2_max, 0.1))

f, axarr = plt.subplots(2, 2, figsize=(10, 8))

# product([0, 1], [0, 1]) => (0, 0), (0, 1), (1, 0), (1, 1)
for idx, clf, title in zip(product([0, 1], [0, 1]), [clf1, clf2, clf3, eclf], ['Pocket', 'LogisticRegression', 'Kernel SVM', 'Hard Voting']):

    Y = clf.predict(np.c_[xx1.ravel(), xx2.ravel()])
    Y = Y.reshape(xx1.shape)
    
    # 畫出範圍
    axarr[idx[0], idx[1]].contourf(xx1, xx2, Y, alpha=0.5, cmap="winter")
    # 畫出訓練資料的點
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    # 設定 title
    axarr[idx[0], idx[1]].set_title(title)

plt.show()

Linear Blending，剛好跟上圖一樣，因 weight 皆相同

from itertools import product

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# 訓練資料
X = np.c_[(.4, -.7),
          (-1.5, -1),
          (-1.4, -.9),
          (-1.3, -1.2),
          (-1.1, -.2),
          (-1.2, -.4),
          (-.5, 1.2),
          (-1.5, 2.1),
          (1, 1),
          # --
          (1.3, .8),
          (1.2, .5),
          (.2, -2),
          (.5, -2.4),
          (.2, -2.3),
          (0, -2.7),
          (1.3, 2.1)].T
y = np.r_[[-1] * 8 + [1] * 8]

X_val = X[0:-1:2]
y_val = y[0:-1:2]
X_train = X[1:-1:2]
y_train = y[1:-1:2]

# model
clf1 = linear_model.Perceptron()
clf2 = linear_model.LogisticRegression()
clf3 = SVC(kernel='rbf')

# linear blending training
clf4 = linear_model.LinearRegression()
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
clf3.fit(X_train, y_train)
phi = np.c_[clf1.predict(X_val), clf2.predict(X_val), clf3.predict(X_val)]
clf4.fit(phi, y_val)
# 重新用全部資料訓練
clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)

# 建立畫圖資料
x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1),
                      np.arange(x2_min, x2_max, 0.1))
#xx1, xx2 = np.meshgrid(np.arange(-1, 0, 0.1),
#                      np.arange(-3, -2, 0.1))

f, axarr = plt.subplots(2, 2, figsize=(10, 8))
XX = np.c_[xx1.ravel(), xx2.ravel()]
# product([0, 1], [0, 1]) => (0, 0), (0, 1), (1, 0), (1, 1)
for idx, clf, title in zip(product([0, 1], [0, 1]), [clf1, clf2, clf3, clf4], ['Pocket', 'LogisticRegression', 'Kernel SVM', 'Linear Blending']): 
    if title == 'Linear Blending':
        # 轉換資料
        phi = np.c_[clf1.predict(XX), clf2.predict(XX), clf3.predict(XX)]
        # 預測資料
        Y = clf.predict(phi)
        # 將值轉換為 label
        Y[Y>0] = 1
        Y[Y<0] = -1
    else:
        Y = clf.predict(XX)
        
    Y = Y.reshape(xx1.shape)
    
    # 畫出範圍
    axarr[idx[0], idx[1]].contourf(xx1, xx2, Y, alpha=0.5, cmap="winter")
    # 畫出訓練資料的點
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    # 設定 title
    axarr[idx[0], idx[1]].set_title(title)

# 設定最上面的 title
f.suptitle('Linear Blending $\\alpha$: {}'.format(clf4.coef_))
plt.show()

錯誤的使用 bagging，不見得能降低變異量

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.ensemble import BaggingRegressor

# Settings
n_repeat = 50       # Number of iterations for computing expectations
n_train = 50        # 訓練資料的大小
n_test = 1000       # 測試資料的大小
noise = 1         # noise 的標準差
np.random.seed(0)   # 固定隨機的種子

estimators = [("Linear Regression", linear_model.LinearRegression()),
              ("Bagging(Linear Regression)", BaggingRegressor(linear_model.LinearRegression()))]

n_estimators = len(estimators)

# 目標函數
def f(x):
    x = x.ravel()

    return -x ** 2 + 1.5 * -(x - 2)


# 產生資料並加入 noise，並依 n_repeat 產生對應 y 的數目
def generate(n_samples, noise, n_repeat=1):
    # 產生 -5 ~ 5 的資料
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)

    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))

        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)

    X = X.reshape((n_samples, 1))

    return X, y


X_train = []
y_train = []

# 產生 n_repeat 筆的訓練資料
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)   
# X_train 50x1
# y_train 50x1

# 產生 n_repeat 筆的測試資料
X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)
# X_test 1000x1
# y_test 1000x50

fig = plt.figure()
summary = ""
for n, (name, estimator) in enumerate(estimators):
    # 初始預測值
    y_predict = np.zeros((n_test, n_repeat))

    for i in range(n_repeat):
        # 依不同的資料訓練
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)
    # y_predict 1000x50

    # Bias^2 + Variance + Noise decomposition of the mean squared error
    y_error = np.zeros(n_test)

    for i in range(n_repeat):
        for j in range(n_repeat):
            # 同樣的 X 而有 50 筆 y 
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2
    
    # 每個 row 的平均 error
    y_error /= (n_repeat*n_repeat)
    # y_test 的變異量，本身存在的 noise
    y_noise = np.var(y_test, axis=1)
    # y_predict 的偏差
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    # y_predict 的變異量
    y_var = np.var(y_predict, axis=1)

    summary += "{0}: {1:.4f} (error) = {2:.4f} (bias^2) + {3:.4f} (var) + {4:.4f} (noise)\n".format(name,
              np.mean(y_error),
              np.mean(y_bias),
              np.mean(y_var),
              np.mean(y_noise))

    # 畫圖
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")

    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label="$g(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)

    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label="$\\bar{g}(x)$")

    plt.xlim([-5, 5])
    plt.title(name)
    
    # 只在第一個顯示 legend  並設定字型大小
    if n == 0:
        plt.legend(loc="upper right", prop={"size": 11})

    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$"),
    plt.plot(X_test, y_var, "g", label="$variance(x)$"),
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")

    plt.xlim([-5, 5])
    plt.ylim([0, 350])
    
    # 只在第一個顯示 legend  並設定字型大小
    if n == 0:
        plt.legend(loc="upper right", prop={"size": 11})

fig.suptitle(summary, fontsize=12)
plt.show()

參考

Ensemble methods
sklearn.ensemble.VotingClassifier
sklearn.ensemble.BaggingClassifier
fukatani/stacked_generalization

子風的知識庫

搜尋此網誌

[ML] 機器學習技法：第七講 Blending and Bagging

Aggregation

Uniform Blending

Linear and Any Blending

Bagging

程式碼

參考

留言

張貼留言