[ML] 機器學習基石：第二講 Learning to Answer Yes/No

ML：基礎學習
課程：機器學習基石
簡介：第二講 Learning to Answer Yes/No

PLA (Perceptron Learning Algorithm)

選擇 $\mathbf{w_0}$，可設為 0
For $t = 0, 1, \cdots $

$(\mathbf{x_{n(t)}},y_{n(t)})$ 找到 $\mathbf{w_t}$ 的下一個錯誤 (可用依序 or 亂數 or 其他選擇點的方法)
修正錯誤
直到所有點皆無錯誤

基本定義

$\mathbf{x}=(x_1,x_2,\cdots ,x_d) \quad y=\{+1,-1\} \quad h \in H$
忽略 $y=0$
$$ \begin{align*} h(\mathbf{x})&=sign\left [ \left ( \sum_{i=1}^{d}w_ix_i \right )- threshold \right ]\\ &= sign\left [ \left ( \sum_{i=1}^{d}w_ix_i \right )+ \underset{w_0}{\underbrace{(-threshold)}}\cdot \underset{x_0}{\underbrace{(+1)}} \right ]\\ &= sign\left [ \left ( \sum_{i=1}^{d}w_ix_i \right )+ w_0\cdot x_0 \right ]\\ &= sign\left [ \sum_{i=0}^{d}w_ix_i\right ]\\ &= sign\left [\mathbf{w^Tx}\right ]\\ \end{align*} $$

更新原理

$$ \mathbf{w_{t+1}}\leftarrow \mathbf{w_t} + y_{n(t)}\mathbf{x_{n(t)}} $$

當 $y=+1$ 時，但 $ \mathbf{wx}$ 卻為負，則將 $ \mathbf{w}$ 靠近 $ \mathbf{x}$
當 $y=-1$ 時，但 $ \mathbf{wx}$ 卻為正，則將 $ \mathbf{w}$ 遠離 $ \mathbf{x}$

必要條件

資料需是線性可分，也就是可被分為各一半

線性可分的 $D$ $\Leftrightarrow $ 存在一完美 $\mathbf{w_f}$ 使得 $y_n = sign(\mathbf{w_f^T x_n})$ 證明

$$ \because \mathbf{w_f}\ 完美 \\ y_{n(t)}\mathbf{w_f^T x_{n(t)}}\geq \underset{n}{min}\ y_{n}\mathbf{w_f^T x_{n}}> 0 $$ $\mathbf{w_t}$ 是否越來越靠近 $\mathbf{w_f}$？
也就是 $\mathbf{w_f^T w_t}$ 隨著 $(\mathbf{x_{n(t)}},y_{n(t)})$ 更新，是否越來越大？
$$ \begin{align*} \mathbf{w_f^T w_{t+1}} &= \mathbf{w_f^T}(\mathbf{w_t}+y_{n(t)}\mathbf{x_{n(t)}})\\ &\geq \mathbf{w_f^T}\mathbf{w_t}+\underset{n}{min}\ y_{n}\mathbf{w_f^T x_{n}}\\ &> \mathbf{w_f^T}\mathbf{w_t}+0 \end{align*} $$ 以上證明，只得出一半，內積的變大也有可能是 $ \mathbf{w_t} $ 長度越來越長導致的
若 $ \mathbf{w_t} $ 只在犯錯才更新
$$ sign(\mathbf{w_t^T x_{n(t)}})\neq y_{n(t)} \Leftrightarrow y_{n(t)}\mathbf{w_t^T x_{n(t)}}\leq 0 $$ $$ \begin{align*} \left \| \mathbf{w_{t+1}} \right \|^2 &= \left \| \mathbf{w_t}+y_{n(t)}\mathbf{x_{n(t)}} \right \|^2\\ &= \| \mathbf{w_t}\|^2+2y_{n(t)}\mathbf{w_t^T x_{n(t)}}+\|y_{n(t)}\mathbf{x_{n(t)}} \|^2\\ &\leq \| \mathbf{w_t}\|^2+0+\|y_{n(t)}\mathbf{x_{n(t)}} \|^2\\ &\leq \| \mathbf{w_t}\|^2+\underset{n}{max}\ \|y_{n}\mathbf{x_{n}} \|^2\\ \end{align*} $$ $\|\mathbf{w_{t}}\|^2$ 會慢慢成長，即使更新的是最長的 $\|\mathbf{x_{n}}\|^2$，因 $y_{n}$ 不影響長度
或是從以下公式也可看出，會持續更新直到上限
初始於 $\mathbf{w_0} = 0$,經過 $T$ 次的更新後
$$ \frac{\mathbf{w_f^T w_T}}{\|\mathbf{w_f}\|\|\mathbf{w_T}\|}\geq \sqrt{T}\cdot constant $$ 證明如下
若 $\mathbf{w_f}$ 完美 $$ y_{n(t)}\mathbf{w_f^T x_{n(t)}}\geq \underset{n}{min}\ y_{n}\mathbf{w_f^T x_{n}}> 0 \quad (1) $$ 若 $ \mathbf{w_t} $ 只在犯錯才更新
$$ sign(\mathbf{w_t^T x_{n(t)}})\neq y_{n(t)} \Leftrightarrow y_{n(t)}\mathbf{w_t^T x_{n(t)}}\leq 0 \quad (2) $$ 更新的方法
$$ \mathbf{w_t} = \mathbf{w_{t-1}} +y_{n(t)}\mathbf{x_{n(t)}} \quad (3) $$ 針對 $\mathbf{w_t}$ $$ \begin{align*} \mathbf{w_f^T w_t} &= \mathbf{w_f^T}(\mathbf{w_{t-1}}+y_{n(t-1)}\mathbf{x_{n(t-1)}}) \quad by \ (3)\\ &\geq \mathbf{w_f^T}\mathbf{w_{t-1}}+\underset{n}{min}\ y_n\mathbf{w_f^T x_n} \quad by \ (1)\\ &\geq \mathbf{w_0}+T\cdot \underset{n}{min}\ y_n\mathbf{w_f^T x_n} \quad by\ T\ times\\ &\geq T\cdot \underset{n}{min}\ y_n\mathbf{w_f^T x_n} \end{align*} $$ 針對 $\|\mathbf{w_t}\|$ $$ \begin{align*} \|\mathbf{w_t}\|^2 &= \|\mathbf{w_{t-1}}+y_{n(t-1)}\mathbf{x_{n(t-1)}}^2\| \quad by\ (3)\\ &= \| \mathbf{w_{t-1}}\|^2+2y_{n(t-1)}\mathbf{w_{t-1}^T x_{n(t-1)}}+\|y_{n(t-1)}\mathbf{x_{n(t-1)}} \|^2\\ &\leq \| \mathbf{w_{t-1}}\|^2+0+\|y_{n(t-1)}\mathbf{x_{n(t-1)}} \|^2 \quad by\ (2)\\ &\leq \| \mathbf{w_{t-1}}\|^2+\underset{n}{max}\ \|\mathbf{x_{n}} \|^2\\ &\leq \| \mathbf{w_{0}}\|^2+T\cdot \underset{n}{max}\ \|\mathbf{x_{n}} \|^2\\ &= T\cdot \underset{n}{max}\ \|\mathbf{x_{n}} \|^2 \end{align*} $$ 代入得證： $$ \begin{align*} \frac{\mathbf{w_f^T w_t}}{\|\mathbf{w_f}\|\|\mathbf{w_t}\|} &\geq \frac{T\cdot \underset{n}{min}\ y_n\mathbf{w_f^T x_n}}{\|\mathbf{w_f}\|\|\mathbf{w_t}\|}\\ &\geq \frac{T\cdot \underset{n}{min}\ y_n\mathbf{w_f^T x_n}}{\|\mathbf{w_f}\|\cdot \sqrt{T}\cdot \underset{n}{max}\ \|\mathbf{x_{n}} \|}\\ &\geq \frac{\sqrt{T}\cdot \underset{n}{min}\ y_n\mathbf{w_f^T x_n}}{\|\mathbf{w_f}\|\cdot \underset{n}{max}\ \|\mathbf{x_{n}} \|}\\ &= \sqrt{T} \cdot constant \end{align*} $$ 因為 $\frac{\mathbf{w_f^T w_t}}{\|\mathbf{w_f}\|\|\mathbf{w_t}\|}\leq 1$ ，內積最大也只是長度的相乘 $$ \begin{align*} \sqrt{T} \cdot constant&\leq 1\\\\ \frac{\sqrt{T}\cdot \underset{n}{min}\ y_n\mathbf{w_f^T x_n}}{\|\mathbf{w_f}\|\cdot \underset{n}{max}\ \|\mathbf{x_{n}} \|}&\leq 1 \\ \\ T\leq\frac{\|\mathbf{w_f}\|^2\cdot \underset{n}{max}\ \|\mathbf{x_{n}} \|^2}{\underset{n}{min}^2\ y_n\mathbf{w_f^T x_n}}&=\frac{R^2}{\rho ^2} \\ \end{align*} $$ $$ R = \underset{n}{max}\ \|\mathbf{x_{n}}\|\ (最遠的半徑)\\ \rho =\underset{n}{min}\ y_n\frac{\mathbf{w_f^T}}{\|\mathbf{w_f}\|}\mathbf{x_n}\ (最接近分隔線的 \mathbf{x_n} 與法向量的內積)\\ $$

PLA 程式碼

Python 原始碼

import matplotlib.pyplot as plt
import numpy as np
import operator

# 網路上找的 dataset 可以線性分割
rawData = [
    ((-0.4, 0.3), -1),
    ((-0.3, -0.1), -1),
    ((-0.2, 0.4), -1),
    ((-0.1, 0.1), -1),
    ((0.9, -0.5), 1),
    ((0.7, -0.9), 1),
    ((0.8, 0.2), 1),
    ((0.4, -0.6), 1)]

# 加入 x0
dataset = [((1,) + x, y) for x, y in rawData]


# 內積
def dot(*v):
    return sum(map(operator.mul, *v))


# 取 sign (1, 0, -1)
def sign(v):
    if v > 0:
        return 1
    elif v == 0:
        return 0
    else:  # v < 0
        return -1


# 判斷有沒有分類錯誤
def check_error(w, x, y):
    if sign(dot(w, x)) != y:
        return True
    else:
        return False


# 更新 w
def update(w, x, y):
    u = map(operator.mul, [y] * len(x), x)
    w = map(operator.add, w, u)
    return list(w)


# PLA演算法實作
def pla(dataset):
    # 初始化 w
    w = [0] * 3

    t = 0
    while True:
        print("{}: {}".format(t, tuple(w)))
        no_error = True
        for x, y in dataset:
            if check_error(w, x, y):
                w = update(w, x, y)
                no_error = False
                break
        t += 1

        if no_error:
            break

    return w


# 主程式
def main():
    # 執行
    w = pla(dataset)

    # 畫圖
    fig = plt.figure()

    # numrows=1, numcols=1, fignum=1
    ax1 = fig.add_subplot(111)

    xx = list(filter(lambda d: d[1] == -1, dataset))
    ax1.scatter([x[0][1] for x in xx], [x[0][2] for x in xx],
                s=100, c='b', marker="x", label='-1')
    oo = list(filter(lambda d: d[1] == 1, dataset))
    ax1.scatter([x[0][1] for x in oo], [x[0][2] for x in oo],
                s=100, c='r', marker="o", label='1')
    l = np.linspace(-2, 2)

    # w0 + w1x + w2y = 0
    # y = -w0/w2 - w1/w2 x
    if w[2]:
        a, b = -w[1] / w[2], -w[0] / w[2]
        ax1.plot(l, a * l + b, 'b-')
    else:
        ax1.plot([-w[0] / w[1]] * len(l), l, 'b-')

    plt.legend(loc='upper left', scatterpoints=1)
    plt.show()


if __name__ == '__main__':
    main()

PLA 缺點

無法得知 $D$ 是否線性可分
即使線性可分，也無法得知要跑多久，因為需要 $\mathbf{w_f}$

若資料不為線性可分或有雜訊

找到錯誤最少的線
$$ \mathbf{w_g}\leftarrow \underset{\mathbf{w}}{argmin}\sum_{n=1}^{N}\left [ y_n\neq sign(\mathbf{w^T x_n}) \right ] $$ 但為 NP-hard 需花費大量時間解，故使用 Pocket Algorithm (PLA 變形)

Pocket Algorithm (PLA 變形)

初始化 $\mathbf{\widehat{w}}$
For $t = 0, 1, \cdots $

$(\mathbf{x_{n(t)}},y_{n(t)})$ 找到 $\mathbf{w_t}$ 的下一個錯誤 (亂數選擇較佳)
修正錯誤
假如 $\mathbf{w_{t+1}}$ 的錯誤比 $\mathbf{\widehat{w}}$ 少，則將 $\mathbf{w_{t+1}}$ 設為 $\mathbf{\widehat{w}}$
直到足夠的次數或小於設定的錯誤門檻

Pocket Algorithm 程式碼

Python 原始碼

import matplotlib.pyplot as plt
import numpy as np
import operator
import random

# 網路上找的 dataset 再加上亂數的 data 不可以線性分割
rawData = [
    ((-0.4, 0.3), -1),
    ((-0.3, -0.1), -1),
    ((-0.2, 0.4), -1),
    ((-0.1, 0.1), -1),
    ((0.9, -0.5), 1),
    ((0.7, -0.9), 1),
    ((0.8, 0.2), 1),
    ((0.4, -0.6), 1),
    ((0.2, 0.6), -1),
    ((-0.5, -0.5), -1),
    ((0.7, 0.3), 1),
    ((0.9, -0.6), 1),
    ((-0.1, 0.2), -1),
    ((0.3, -0.6), 1),
    ((0.5, 0.1), -1), ]

# 加入 x0
dataset = [((1,) + x, y) for x, y in rawData]


# 內積
def dot(*v):
    return sum(map(operator.mul, *v))


# 取 sign (1, 0, -1)
def sign(v):
    if v > 0:
        return 1
    elif v == 0:
        return 0
    else:  # v < 0
        return -1


# 判斷有沒有分類錯誤
def check_error(w, x, y):
    if sign(dot(w, x)) != y:
        return True
    else:
        return False


# 更新 w
def update(w, x, y):
    u = map(operator.mul, [y] * len(x), x)
    w = map(operator.add, w, u)
    return list(w)


# 總錯誤數
def sum_errors(w, dataset):
    errors = 0
    for x, y in dataset:
        if check_error(w, x, y):
            errors += 1

    return errors


# POCKET 演算法實作
def pocket(dataset):
    # 初始化 w
    w = [0] * 3
    min_e = sum_errors(w, dataset)

    max_t = 500
    for t in range(0, max_t):
        wt = None
        et = None

        while True:
            x, y = random.choice(dataset)
            if check_error(w, x, y):
                wt = update(w, x, y)
                et = sum_errors(wt, dataset)
                break

        if et < min_e:
            w = wt
            min_e = et

        print("{}: {}".format(t, tuple(w)))
        print("min erros: {}".format(min_e))

        t += 1

        if min_e == 0:
            break

    return (w, min_e)


# 主程式
def main():
    # 執行，並輸入新的 list
    w, e = pocket(list(dataset))

    # 畫圖
    fig = plt.figure()

    # numrows=1, numcols=1, fignum=1
    ax1 = fig.add_subplot(111)

    xx = list(filter(lambda d: d[1] == -1, dataset))
    ax1.scatter([x[0][1] for x in xx], [x[0][2] for x in xx],
                s=100, c='b', marker="x", label='-1')
    oo = list(filter(lambda d: d[1] == 1, dataset))
    ax1.scatter([x[0][1] for x in oo], [x[0][2] for x in oo],
                s=100, c='r', marker="o", label='1')
    l = np.linspace(-2, 2)

    # w0 + w1x + w2y = 0
    # y = -w0/w2 - w1/w2 x
    if w[2]:
        a, b = -w[1] / w[2], -w[0] / w[2]
        ax1.plot(l, a * l + b, 'b-')
    else:
        ax1.plot([-w[0] / w[1]] * len(l), l, 'b-')

    plt.legend(loc='upper left', scatterpoints=1)
    plt.show()


if __name__ == '__main__':
    main()

假設資料為線性可分，卻跑 Pocket Algorithm 的缺點

Pocket Algorithm 比 PLA 慢

花力氣存檔，把最佳解存起來
要檢查所有資料的錯誤才能比較 $\mathbf{w_{t+1}}$ 和 $\mathbf{\widehat{w}}$

但跑夠久，仍可得到 $\mathbf{w_{POCKET}} = \mathbf{w_{PLA}}$

子風的知識庫

搜尋此網誌