[ML] 機器學習基石：第十二講 Nonlinear Transformation

ML：基礎學習
課程：機器學習基石
簡介：第十二講 Nonlinear Transformation

利用轉換空間，從非線性變到線性

$$ \begin{align*} \mathbf{x} \in \mathbf{X} &\overset{\Phi }{\rightarrow} \mathbf{z} \in \mathbf{Z} \\ \Phi _2(\mathbf{x})&=(1,x_1,x_2,x_1^2,x_1x_2,x_2^2) \\ \mathbf{\tilde{w}}&=(\tilde{w_0},\tilde{w_1},\tilde{w_2},\tilde{w_3},\tilde{w_4},\tilde{w_5}) \\ H_{\Phi _2}&=\left \{ h(\mathbf{x}):h(\mathbf{x})=\tilde{h}(\Phi _2(\mathbf{x}))\ for\ some\ linear\ \tilde{h}\ on\ \mathbf{Z} \right \} \end{align*} $$

$\Phi ^{-1}$ 不一定存在，所以實際上是將點對應過去確認輸出，再畫回來

事實上，特徵的轉換 $\Phi$ 也是一種轉換
若轉換為二次式 $\Phi_2$，$\mathbf{x} \in \mathbb{R}^d$，請問 $\Phi_2$ 的 dimension 為多少？
二次項：$\binom{d}{2}+d$
一次項：$d$
常數項：1
總共：$\frac{d^2}{2}+\frac{3d}{2}+1$

$\Phi$ 的代價

$$ \begin{align*} \Phi _Q(\mathbf{x})= (&1, \\ &x_1,x_2,\cdots ,x_d, \\ &x_1^2,x_1x_2, \cdots ,x_d^2, \\ &\cdots \\ & x_1^Q,x_1^{Q-1}x_2, \cdots , x_d^Q) \end{align*} $$ 如下，共有這麼多項 $\mathbf{z}=\Phi _Q(\mathbf{x}), \mathbf{\tilde{w}}$ 需要計算與儲存
所以 Q 變大，會導致計算效率變低與儲存空間變大
$$ \underset{\tilde{w_0}}{\underbrace{1}}+\underset{others}{\underbrace{\tilde{d}}} =\binom{Q+d}{Q}=\binom{Q+d}{d}=O(Q^d) $$

重複組合排列概念

$ \mathbf{\tilde{w}}=\tilde{d}+1=O(Q^d) \approx d_{VC}(H_{\Phi _Q})$
有其上限
$d_{VC}(H_{\Phi _Q}) \le \tilde{d}+1$
因 $\tilde{d}+2$ 在 $\mathbf{Z}$ 無法 shatter
所以 $\tilde{d}+2$ 在 $\mathbf{X}$ 必定無法 shatter
所以 Q 變大，會導致 $d_{VC}$ 變大，使得複雜度變高

$E_{in}$ 會變小
$E_{out}$ 卻遠離 $E_{in}$

如何挑選 Q ？

用眼睛挑選嗎？
考慮以下這張圖，以下步驟 $d_{VC}$ 看似在減少

$\Phi _2:\mathbf{z}=(1,x_1,x_2,x_1^2,x_1x_2,x_2^2),d_{VC}=6$
那是不是可以改為 $\mathbf{z}=(1,x_1^2,x_2^2),d_{VC}=3$
或者 $\mathbf{z}=(1,x_1^2+x_2^2),d_{VC}=2$
甚至 $\mathbf{z}=(sign(0.6-x_1^2-x_2^2)),d_{VC}=1$

實際上你的腦中已做了計算，這些計算實際上也是個複雜的 model，所以也需考慮進去
為了 VC-safely，$\Phi$ 需要在偷看資料前就被決定，不然很可能被人為所影響

簡單範例

Q = 50, $\mathbf{x} \in \mathbb{R}^2$，請問 $\tilde{d} $為多少？
$\tilde{d}=\binom{Q+2}{2}-1=\frac{52*51}{2}-1=1325$
可以發現轉換導致 dimension 變超大，那麼要考慮是否有足夠資料可使用？

$\Phi$ 相互間的關係

$$ \begin{align*} \Phi _0(\mathbf{x}) &= (1)\\ \Phi _1(\mathbf{x}) &= (\Phi _0(\mathbf{x}), x_1,x_2,\cdots , x_d)\\ \Phi _2(\mathbf{x}) &= (\Phi _1(\mathbf{x}), x_1^2,x_1x_2,\cdots , x_d^2)\\ \\ \Phi _3(\mathbf{x}) &= (\Phi _2(\mathbf{x}), x_1^3,x_1^2x_2,\cdots , x_d^3)\\ \\ \vdots &= \vdots \\ \Phi _Q(\mathbf{x}) &= (\Phi _{Q-1}(\mathbf{x}), x_1^Q,x_1^{Q-1}x_2,\cdots , x_d^Q)\\ \end{align*} $$

$$ g_i = argmin_{h \in H_i}E_{in}(h)\\ \begin{matrix} H_0 & \subset & H_1 & \subset & H_2 & \subset & H_3 & \subset & \cdots \\ d_{VC}(H_0) & \le & d_{VC}(H_1) & \le & d_{VC}(H_2) & \le & d_{VC}(H_3) & \le & \cdots\\ E_{in}(g_0) & \ge & E_{in}(g_1) & \ge & E_{in}(g_2) & \ge & E_{in}(g_3) & \ge & \cdots\\ \end{matrix} $$

如下圖，所以使用高維度 $H_{1126}$ 不見得會比較好

安全做法：不要一開始就套進高維度轉換

從 $H_1$ 開始試

假如 $E_{in}(g_1)$ 足夠小，那麼就 ok，且又安全
不夠小，那麼再慢慢地增加維度，唯一浪費的只有計算量而已

子風的知識庫

搜尋此網誌