ML: Reinforcement Learning
Course by Professor Hung-yi Lee (李宏毅)
Introduction: Reinforcement Learning
- Package: PyTorch
- Episode: one full round from start to finish, \(\tau=\{ s_1,a_1,r_1,s_2,a_2,r_2,\cdots ,s_T,a_T,r_T\}\)
- Agent: the decision maker; produces actions and interacts with the environment
- Actor: the core of the policy-based approach; responsible for producing actions
- Critic: the core of the value-based approach; responsible for evaluating how good the actor is
- Environment: returns a reward in response to the agent's action
- State: what the agent observes
- Reward: how much the environment rewards a given action
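To make these terms concrete, here is a minimal sketch of one episode of agent/environment interaction in OpenAI Gym, assuming the pre-0.26 Gym API used throughout this post and a random action picker standing in for the actor; it only illustrates the interaction loop, not any learning.

```python
import gym

# Minimal sketch: one episode in the classic Gym API
# (reset() -> state, step() -> (state, reward, done, info)).
env = gym.make("CartPole-v0")

state = env.reset()                                  # initial state s_1
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()               # a random "actor", for illustration only
    next_state, reward, done, _ = env.step(action)   # environment returns r_t and s_{t+1}
    total_reward += reward
    state = next_state

print("episode finished, total reward:", total_reward)
```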
Course notes
Policy-based Approach
Policy Gradient
Given an actor \(\pi _\theta\), the update rule is as follows: $$ \begin{align*} \theta ^{new}&=\theta ^{old}+\eta \triangledown _\theta \bar{R}_{\theta^{old}}\\ \triangledown _\theta \bar{R}_\theta &\approx \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (R(\tau^n) -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ \end{align*}\\ $$ $$ \text{when } R(\tau^n|\pi _\theta(a_t^n|s_t^n)) = \sum _{k=t}^{T_n}\gamma^{k-t} r_k^n\\ \Rightarrow \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (\sum _{k=t}^{T_n}\gamma^{k-t} r_k^n -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ $$
In essence, we are designing an actor and making it better and better.
An episode can be viewed as a trajectory \(\tau=\{ s_1,a_1,r_1,s_2,a_2,r_2,\cdots ,s_T,a_T,r_T\} \),
whose total reward is \(R(\tau)=\sum ^T_{t=1}r_t \).
Even with the same agent, because actions are sampled probabilistically, \(\tau\) differs from run to run,
so under actor \(\pi\) with parameters \(\theta\), a given trajectory \(\tau\) occurs with probability \(P(\tau|\theta) \).
The expected total reward is then
$$ \bar{R}_\theta = E[R(\tau)|\theta] =\sum _\tau R(\tau)P(\tau|\theta) $$ Running \(\pi_\theta\) for N episodes yields \(\{\tau^1,\tau^2, \cdots, \tau^N\} \),
so the expectation can be approximated as
$$ \bar{R}_\theta = \sum _\tau R(\tau)P(\tau|\theta) \approx \frac{1}{N}\sum_{n=1}^NR(\tau^n) \\ $$ The problem can therefore be cast as
$$ \begin{align*} \theta ^* & =\arg\ \underset{\theta}{\max}\bar{R}_\theta\\ &=\arg\ \underset{\theta}{\max}\sum _\tau R(\tau)P(\tau|\theta) \\ \end{align*} $$ and the parameters are updated by gradient ascent:
$$ \theta ^{new}=\theta ^{old}+\eta \triangledown _\theta \bar{R}_{\theta^{old}} $$ $$ \begin{align*} \triangledown _\theta \bar{R}_\theta &=\sum _\tau R(\tau)\triangledown _\theta P(\tau|\theta) \\ &=\sum _\tau R(\tau)P(\tau|\theta)\frac{\triangledown _\theta P(\tau|\theta)}{P(\tau|\theta)} \\ \because \frac{\mathrm{d} \log f(x)}{\mathrm{d} x}=\frac{1}{f(x)}\frac{\mathrm{d} f(x)}{\mathrm{d} x}\quad &=\sum _\tau R(\tau)P(\tau|\theta)\triangledown _\theta \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum ^N_{n=1}R(\tau^n)\triangledown _\theta \log P(\tau^n|\theta) \\ \end{align*}\\ $$ $$ \begin{align*} P(\tau|\theta) &= p(s_1)p(a_1|s_1,\theta )p(r_1,s_2|s_1,a_1)p(a_2|s_2,\theta )p(r_2,s_3|s_2,a_2)\cdots p(a_T|s_T,\theta )p(r_T,s_{T+1}|s_T,a_T)\\ &=p(s_1)\prod ^T_{t=1}p(a_t|s_t,\theta )p(r_t,s_{t+1}|s_t,a_t) \\ \log P(\tau|\theta) &= \log p(s_1) + \sum ^T_{t=1}\log p(a_t|s_t,\theta )+\sum ^T_{t=1}\log p(r_t,s_{t+1}|s_t,a_t)\\ \triangledown _\theta \log P(\tau|\theta) &= \sum ^T_{t=1}\triangledown _\theta \log p(a_t|s_t,\theta )\\ \end{align*}\\ $$ Hence
$$ \begin{align*} \triangledown _\theta \bar{R}_\theta &=\sum _\tau R(\tau)P(\tau|\theta)\triangledown _\theta \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum ^N_{n=1}R(\tau^n)\triangledown _\theta \log P(\tau^n|\theta) \\ &=\frac{1}{N}\sum ^N_{n=1}R(\tau^n)\sum ^{T_n}_{t=1}\triangledown _\theta \log p(a_t^n|s_t^n,\theta )\\ &=\frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}R(\tau^n)\triangledown _\theta \log p(a_t^n|s_t^n,\theta )\\ \end{align*}\\ $$ \(\triangledown _\theta \bar{R}_\theta \) can be read as follows: when \(\tau^n\) receives a positive \(R(\tau^n)\), we want to raise the probability \(p(a_t^n|s_t^n,\theta)\) of choosing \(a_t^n\);
conversely, when \(R(\tau^n)\) is negative, we want to lower the probability \(p(a_t^n|s_t^n,\theta)\) of \(a_t^n\).
Note that \(R(\tau^n)\) is the reward of the whole trajectory \(\tau^n\), not the reward of a single action.
For example, in a game where only shooting enemies scores points, using the immediate reward would make the agent do nothing but shoot.
So why take the log in \(\triangledown _\theta \log p(a_t^n|s_t^n,\theta)\)?
Expanding the term back out:
$$ \triangledown _\theta \log p(a_t^n|s_t^n,\theta)=\frac{\triangledown _\theta p(a_t^n|s_t^n,\theta)}{p(a_t^n|s_t^n,\theta)} $$ Without dividing by the probability \(p(a_t^n|s_t^n,\theta)\), an action that earns a better reward
but appears with lower probability would instead be overlooked by the model.
Dividing by the probability therefore acts as a kind of normalization.
There is also a problem when \(R(\tau^n)\) is always positive.
Ideally, actions a, b, and c would all get taken, so whichever is updated least would simply end up with a lower probability.
In practice, however, we rely on sampling: if a is never sampled, its probability keeps dropping, even though a might be the better choice.
So a baseline \(b\) is subtracted; it has to be designed by hand.
$$ \begin{align*} \triangledown _\theta \bar{R}_\theta &=\sum _\tau \left (R(\tau) -b \right )P(\tau|\theta)\triangledown _\theta \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (R(\tau^n) -b \right )\triangledown _\theta \log p(a_t^n|s_t^n,\theta )\\ \end{align*}\\ $$ And \(p(a_t^n|s_t^n,\theta)\) is simply the probability of producing the action, which is exactly what the policy decides, so it equals \(\pi _\theta(a_t^n|s_t^n)\):
$$ \begin{align*} \triangledown _\theta \bar{R}_\theta &=\sum _\tau \left (R(\tau) -b \right )P(\tau|\theta)\triangledown _\theta \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (R(\tau^n) -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ \end{align*}\\ $$ \(R(\tau^n)\) can be designed however you like; one idea is this:
the reward at a given moment is caused by the action at that moment, and past actions matter little.
In other words, the further a reward lies in the future, the less the current action influences it, so a decay factor \(\gamma\) is applied:
$$ R(\tau^n|\pi _\theta(a_t^n|s_t^n)) = \sum _{k=t}^{T_n}\gamma^{k-t} r_k^n \\ $$ Plugging this into the previous expression:
$$ \begin{align*} \triangledown _\theta \bar{R}_\theta &=\sum _\tau \left (R(\tau) -b \right )P(\tau|\theta)\triangledown _\theta \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (R(\tau^n) -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ \text{when } R(\tau^n|\pi _\theta(a_t^n|s_t^n)) = \sum _{k=t}^{T_n}\gamma^{k-t} r_k^n &\Rightarrow \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (\sum _{k=t}^{T_n}\gamma^{k-t} r_k^n -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ \end{align*}\\ $$
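As a quick illustration of the discounted reward-to-go term \(\sum_{k=t}^{T}\gamma^{k-t} r_k\), the sketch below computes it for every step of one recorded episode by iterating backwards, and then subtracts the mean as a simple baseline \(b\). The helper name and the example rewards are mine for illustration; it mirrors (but is not taken verbatim from) the training code later in the post.

```python
# Minimal sketch (illustrative helper, not part of the original code):
# compute G_t = sum_{k>=t} gamma^(k-t) r_k for one episode,
# then subtract the mean as a crude baseline b.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    running = 0.0
    for r in reversed(rewards):        # backwards: G_t = r_t + gamma * G_{t+1}
        running = r + gamma * running
        returns.insert(0, running)
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
# returns-to-go are [2.71, 1.9, 1.0] before the mean is subtracted
```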
Course notes
Policy-based Approach: Implementation
- Initialize \(PolicyGradient(s)\)
- for each episode
- Initialize \(s\)
- for each step
- Use \(PolicyGradient(s)\) to choose \(a\)
- Feed \(a\) into the environment and obtain \(r\)
- Record \(r\)
- Repeat until the termination condition is reached
- Train \(PolicyGradient(s)\)
$$ \begin{align*} \triangledown _\theta\bar{R}_\theta &= \sum ^{T}_{t=1} \sum _{k=t}^T\gamma^{k-t} r_k \triangledown _\theta \log \pi _\theta(a_t|s_t)\\ \end{align*}\\ $$
A few adjustments are made to the reward term.
Set \(b=0\) and \(N=1\); \(N=1\) means we do not collect several samples before updating, which Adam handles in the same spirit as SGD.
This finally gives
$$ \begin{align*} \triangledown _\theta\bar{R}_\theta &=\frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (\sum _{k=t}^{T_n}\gamma^{k-t} r_k^n -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ b=0,N=1\ &\Rightarrow \sum ^{T}_{t=1}\sum _{k=t}^{T}\gamma^{k-t} r_k \triangledown _\theta \log \pi _\theta(a_t|s_t)\\ \end{align*}\\ $$ PolicyGradient.py
import numpy as np
import torch
from torch.distributions import Categorical

torch.manual_seed(500)  # fix the random seed for reproducibility


class PolicyGradient:
    def __init__(self, n_features, n_actions, learning_rate=0.01):
        self.net = Net(n_features, n_actions)
        print(self.net)

        self.lr = learning_rate
        # reward discount factor
        self.gamma = 0.99

        # the optimizer does the training: pass in all net parameters and the learning rate
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=self.lr)

        self.saved_log_probs = []
        self.rewards = []
        self.eps = np.finfo(np.float32).eps.item()

    def choose_action(self, state):
        state = torch.from_numpy(state).float()
        probs = self.net(state)
        m = Categorical(probs)
        action = m.sample()
        log_prob = m.log_prob(action)
        self.saved_log_probs.append(log_prob)
        # The log-prob could also be computed by hand, but numerical error means the
        # learning rate has to be retuned before anything is learned.
        # Be careful to keep the graph connectivity intact, otherwise backprop breaks.
        # log_prob_m = torch.log(probs[action.item()])
        # self.saved_log_probs.append(log_prob_m)

        return action.item()

    def store_trajectory(self, s, a, r):
        self.rewards.append(r)

    def train(self):
        R = 0
        policy_loss = []
        rewards = []
        # The current reward is caused by the current action; past actions matter little.
        # The further a reward lies in the future, the less the current action influences it.
        # Using the raw total reward cannot tell good actions from bad ones, so learning suffers.
        for r in self.rewards[::-1]:
            R = r + self.gamma * R
            rewards.insert(0, R)

        rewards = torch.tensor(rewards)
        # normalize the rewards and add machine epsilon (self.eps) to avoid division by zero
        rewards = (rewards - rewards.mean()) / (rewards.std() + self.eps)

        for log_prob, reward in zip(self.saved_log_probs, rewards):
            # we maximize, hence the minus sign
            policy_loss.append(-log_prob * reward)

        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()

        del self.rewards[:]
        del self.saved_log_probs[:]


class Net(torch.nn.Module):
    def __init__(self, n_features, n_actions):
        super(Net, self).__init__()
        # define the form of each layer
        self.fc1 = torch.nn.Linear(n_features, 128)
        self.fc2 = torch.nn.Linear(128, n_actions)  # action probabilities

    def forward(self, x):
        # the forward pass of the Module: propagate the input and produce the output
        model = torch.nn.Sequential(
            self.fc1, torch.nn.ReLU(), self.fc2, torch.nn.Softmax(dim=-1)
        )
        return model(x)
run.py
import gym
from PolicyGradient import PolicyGradient
import matplotlib.pyplot as plt
import torch

RENDER = True  # rendering slows training down; turn it on once learning looks good

env = gym.make("CartPole-v0")
env.seed(1)  # fix the random seed for reproducibility
# env = env.unwrapped  # remove the episode step limit

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

agent = PolicyGradient(
    n_features=env.observation_space.shape[0],
    n_actions=env.action_space.n,
    learning_rate=0.005,
)

reward_history = []


def plot_durations():
    y_t = torch.FloatTensor(reward_history)
    plt.figure(1)
    plt.clf()
    plt.title("Training...")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.plot(y_t.numpy())
    # Take 100 episode averages and plot them too
    if len(reward_history) >= 100:
        means = y_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated


for n_episode in range(3000):
    state = env.reset()
    sumR = 0
    for t in range(3000):  # don't loop forever while learning
        if RENDER:
            env.render()

        action = agent.choose_action(state)
        state_, reward, done, _ = env.step(action)
        agent.store_trajectory(state, action, reward)

        sumR += reward
        if done:
            break

        state = state_

    agent.train()

    reward_history.append(sumR)
    if RENDER:
        plot_durations()

    print("episode:", n_episode, "duration:", t, "Reward", sumR)
Course notes
Value-based Approach
Q-learning works only with discrete actions.
$$ Q^\pi(s_t,a_t)^{new} = Q^\pi(s_t,a_t)^{old} + \eta \left [\underset{Real(Target)}{\underbrace{r + \gamma \underset{a_{t+1}}{\max}\ Q^\pi(s_{t+1},a_{t+1})^{old}}}-\underset{estimate}{\underbrace{Q^\pi(s_t,a_t)^{old}}} \right ] $$
In essence, we are designing a critic that measures how good the actor is,
i.e., it estimates the total reward the actor \(\pi\) can still obtain from the current state \(s\).
As a state value function this is written $$ V^\pi(s) $$ so even for the same state, different actors yield different \(V\).
There are two ways to estimate it:
- Monte-Carlo based approach (MC)
- Must wait until an episode finishes before training can start; the idea:
- Starting from \(s_a\) and playing until the episode ends yields \(G_a\)
- Starting from \(s_b\) and playing until the episode ends yields \(G_b\)
- Train on these targets: $$\begin{align*} V^\pi(s_a) &= G_a \\ V^\pi(s_b) &= G_b \\ \end{align*}$$
- Larger variance, since \(G\) sums many random rewards
- Temporal-difference approach (TD)
- No need to wait for an episode to finish before training; the idea:
- Take a segment \(\tau=\{\cdots s_t,a_t,r_t,s_{t+1}\cdots\}\)
- \(t\) and \(t+1\) differ by only a single reward \(r_t\)
- Train on this relation: \(V^\pi(s_t) = V^\pi(s_{t+1}) + r_t \)
- Smaller variance, though \(V^\pi(s_{t+1})\) itself may be estimated inaccurately
Suppose there are eight episodes:
$$ \begin{align*} s_a,r&=0,s_b,r=0,END\\ s_b,r&=1,END\\ s_b,r&=1,END\\ s_b,r&=1,END\\ s_b,r&=1,END\\ s_b,r&=1,END\\ s_b,r&=1,END\\ s_b,r&=0,END\\ \end{align*} $$ With MC:
$$ \begin{align*} V^\pi(s_b)&=\frac{6}{8}=\frac{3}{4} \\ V^\pi(s_a)&=0 \\ \end{align*} $$ With TD:
$$ \begin{align*} V^\pi(s_b)&=\frac{6}{8}=\frac{3}{4} \\ V^\pi(s_a)&=V^\pi(s_b)+r=\frac{3}{4}+0=\frac{3}{4} \\ \end{align*} $$ But this alone cannot evaluate the actor or be used to pick actions,
so we switch to the state-action value function instead, i.e., Q-learning.
Because the set of actions must be finite, it applies only to discrete actions.
$$ Q^\pi(s,a) $$
The problem can then be framed as
$$ \pi(s)=\arg\ \underset{a}{\max}\ Q^\pi(s,a) $$ Now define \(G_t\), the total future return obtainable from the current \(s\):
$$ G_t=r_t+\gamma r_{t+1}+\gamma ^2 r_{t+2}+\cdots =\sum _{k=0}^T \gamma ^k r_{t+k} $$ Rewards further from the present are discounted by \(\gamma\).
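For example, with \(\gamma=0.9\) and an illustrative three-step tail of rewards \(r_t=r_{t+1}=r_{t+2}=1\) (numbers chosen only for this example), the return works out to
$$ G_t = 1 + 0.9\cdot 1 + 0.9^2\cdot 1 = 2.71 $$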
$$ \begin{align*} \underset{a_{t}}{\max}\ Q^\pi(s_t,a_t)&=G_t \\ &=\sum _{k=0}^T \gamma ^k r_{t+k} \\ &=r_t + \sum _{k=1}^T \gamma ^k r_{t+k} \\ &=r_t + \gamma \sum _{k=0}^T \gamma ^k r_{t+1+k} \\ &=r_t + \gamma G_{t+1} \\ &=r_t + \gamma \underset{a_{t+1}}{\max}\ Q^\pi(s_{t+1},a_{t+1}) \\ \end{align*} $$ Parameters are updated in the spirit of gradient descent, here using the TD form of the update:
$$ Q^\pi(s_t,a_t)^{new} = Q^\pi(s_t,a_t)^{old} + \eta \left [\underset{Real(Target)}{\underbrace{r + \gamma \underset{a_{t+1}}{\max}\ Q^\pi(s_{t+1},a_{t+1})^{old}}}-\underset{estimate}{\underbrace{Q^\pi(s_t,a_t)^{old}}} \right ] $$ The actor \(\pi\) itself is not updated;
it simply picks actions with \(\epsilon\)-greedy:
when the random draw exceeds \(\epsilon\), a random action is taken so that \(Q^\pi(s,a)\) gets a chance to discover better possibilities;
otherwise, the action with the largest \(Q^\pi(s,a)\) is chosen.
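A minimal sketch of this \(\epsilon\)-greedy rule, following the post's convention of exploring when the random draw exceeds a high threshold (as in the QLearning.py below); `q_net` is an assumed network mapping a state to one Q-value per action, and the function name is illustrative only.

```python
import numpy as np
import torch

# Illustrative epsilon-greedy action selection (explore when the draw exceeds epsilon).
def epsilon_greedy(q_net, state, n_actions, epsilon=0.95):
    if np.random.random() >= epsilon:                 # explore: random action
        return int(np.random.randint(n_actions))
    with torch.no_grad():                             # exploit: greedy action
        q_values = q_net(torch.from_numpy(state).float())
    return int(torch.argmax(q_values).item())
```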
Value-based Approach: Implementation
QLearning
- Initialize \(Q^\pi(s,a)\)
- for each episode
- Initialize \(s\)
- for each step
- Use \(Q^\pi(s,a)\) to choose \(a\), e.g. with \(\epsilon -greedy\)
- Feed \(a\) into the environment and obtain \(r, {s}'\)
- Record \(s, a, r, done, {s}'\)
- Update \(Q^\pi(s,a)\) on a batch
Note: only the output node of the chosen action \(a\) is updated, not the nodes of all actions.
$$Q^\pi(s,a)^{new} = Q^\pi(s,a)^{old} + \eta \left [\underset{Real(Target)}{\underbrace{r + \gamma \underset{{a}'}{\max}\ Q^\pi({s}',{a}')^{old}}}-\underset{estimate}{\underbrace{Q^\pi(s,a)^{old}}} \right ] $$ - \(s = {s}'\)
- If \(s\) is not a terminal state, go back to the first step
The key points are \(1-Done\) and ReplayMemory.
\(1-Done\) keeps the target accurate instead of letting it grow without bound,
and ReplayMemory enables off-policy training, breaking the correlation between consecutive actions.
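A minimal sketch of the masked TD target described here (the function name and shapes are illustrative; the full implementation follows below): the \((1-\text{done})\) factor zeroes out the bootstrapped term at episode boundaries, so the target cannot keep inflating.

```python
import torch

# Illustrative Q-learning TD target with the (1 - done) mask.
# q_next: Q-values of the next states, shape (batch, n_actions); r, done: shape (batch,)
def td_target(r, done, q_next, gamma=0.99):
    # at terminal transitions (done == 1) the future term is masked out
    return r + gamma * q_next.max(dim=1)[0] * (1 - done)
```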
QLearning.py
import numpy as np
import torch
from collections import namedtuple
import random

torch.manual_seed(500)  # fix the random seed for reproducibility

Trajectory = namedtuple(
    "Transition", ("state", "action", "reward", "done", "next_state")
)


# A very important mechanism: without it, convergence is much harder.
# Try setting capacity & BATCH_SIZE to 1 and see what happens.
class ReplayMemory(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Saves a trajectory."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Trajectory(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


class QLearning:
    def __init__(self, n_features, n_actions, learning_rate=0.01, gamma=0.9):
        self.n_actions = n_actions
        self.n_features = n_features

        self.net = Net(n_features, n_actions)
        print(self.net)

        self.lr = learning_rate
        # Q discount factor
        self.gamma = gamma

        # the optimizer does the training: pass in all net parameters and the learning rate
        self.optimizer = torch.optim.Adam(self.net.parameters(), lr=self.lr)
        # loss function
        self.lossFun = torch.nn.MSELoss()

        self.trajectories = ReplayMemory(10000)
        self.BATCH_SIZE = 50

    def choose_action(self, state):
        state = torch.from_numpy(state).float()
        value = self.net(state)
        action_max_value, action = torch.max(value, 0)

        if np.random.random() >= 0.95:  # epsilon-greedy
            action = np.random.choice(range(self.n_actions), 1)

        return action.item()

    def store_trajectory(self, s, a, r, done, s_):
        self.trajectories.push(s, a, r, done, s_)

    def train(self):
        if len(self.trajectories) < self.BATCH_SIZE:
            return

        trajectories = self.trajectories.sample(self.BATCH_SIZE)
        batch = Trajectory(*zip(*trajectories))

        s = batch.state
        s = torch.tensor(s).float()
        a = batch.action
        a = torch.tensor(a).long()
        a = torch.unsqueeze(a, 1)  # add a dimension at dim=1, e.g. (50,) => (50,1)
        r = batch.reward
        r = torch.tensor(r).float()
        done = batch.done
        done = torch.tensor(done).float()
        s_ = batch.next_state
        s_ = torch.tensor(s_).float()

        # gather values along dim=1 using a as the index
        qValue = self.net(s).gather(1, a).squeeze(1)
        qNext = self.net(s_).detach()  # detach from graph, don't backpropagate

        # done is one of the keys: leaving it out makes the qNext estimate wrong.
        # It is also what lets qValue converge; otherwise the target keeps
        # accumulating upward and becomes inaccurate.
        target = r + self.gamma * qNext.max(1)[0] * (1 - done)

        self.optimizer.zero_grad()
        loss = self.lossFun(target.detach(), qValue)
        loss.backward()
        # torch.nn.utils.clip_grad_norm(self.net.parameters(), 0.5)
        self.optimizer.step()


class Net(torch.nn.Module):
    def __init__(self, n_features, n_actions):
        super(Net, self).__init__()
        # define the form of each layer
        self.fc1 = torch.nn.Linear(n_features, 10)
        self.fc2 = torch.nn.Linear(10, n_actions)  # Q-value of each action

    def forward(self, x):
        # the forward pass of the Module: propagate the input and produce the output
        model = torch.nn.Sequential(self.fc1, torch.nn.ReLU6(), self.fc2)
        return model(x)
run.py
import gym
from QLearning import QLearning
import matplotlib.pyplot as plt
import torch

RENDER = True  # rendering slows training down; turn it on once learning looks good

env = gym.make("CartPole-v0")
env.seed(1)  # fix the random seed for reproducibility
# env = env.unwrapped  # remove the episode step limit

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

agent = QLearning(
    n_features=env.observation_space.shape[0],
    n_actions=env.action_space.n,
    learning_rate=0.01,
    gamma=0.99,
)

reward_history = []


def plot_durations():
    y_t = torch.FloatTensor(reward_history)
    plt.figure(1)
    plt.clf()
    plt.title("Training...")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.plot(y_t.numpy())
    # Take 100 episode averages and plot them too
    if len(reward_history) >= 100:
        means = y_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated


for n_episode in range(3000):
    state = env.reset()
    sumR = 0
    for t in range(3000):  # don't loop forever while learning
        if RENDER:
            env.render()

        action = agent.choose_action(state)
        state_, reward, done, _ = env.step(action)
        agent.store_trajectory(state, action, reward, done, state_)

        sumR += reward
        agent.train()

        if done:
            break

        state = state_

    reward_history.append(sumR)
    if RENDER:
        plot_durations()

    avgR = sum(reward_history[:-11:-1]) / 10
    print(
        "episode: {:4d} duration: {:4d} Reward: {:5.1f} avgR: {:5.1f}".format(
            n_episode, t, sumR, avgR
        )
    )
Course notes
Actor-Critic Approach
Advantage Actor-Critic (A2C)
Given an actor \(\pi _\theta\), the update rule is as follows: $$ \begin{align*} \theta ^{new}&=\theta ^{old}+\eta \triangledown _\theta \bar{R}_{\theta^{old}}\\ \triangledown _\theta \bar{R}_\theta &\approx \frac{1}{N}\sum ^N_{n=1}\sum^{T_n}_{t=1}\left (r_t^n +V^{\pi_\theta}(s_{t+1}^n) -V^{\pi_\theta}(s_t^n) \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ \end{align*} $$
Start from the earlier policy gradient:
$$ \triangledown _\theta \bar{R}_\theta \approx \frac{1}{N}\sum ^N_{n=1}\sum ^{T_n}_{t=1}\left (R(\tau^n) -b \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ $$ Because \(R(\tau^n)\) is obtained by random sampling, it is quite unstable unless \(N\) is large enough.
In practice, however, \(N\) usually cannot be very large, and it is often simply set to \(N = 1\).
So can we estimate \(E[R(\tau^n)]\) instead?
That is exactly \(Q^{\pi_\theta}(s_t^n,a_t^n)\) from Q-learning.
The baseline, in turn, can be replaced by the state value function \(V^{\pi_\theta}(s_t^n) \).
The difference between the two measures how much the current action contributes to the future.
The expression can therefore be rewritten as
$$ \triangledown _\theta \bar{R}_\theta \approx \frac{1}{N}\sum ^N_{n=1}\sum^{T_n}_{t=1}\left (Q^{\pi_\theta}(s_t^n,a_t^n) -V^{\pi_\theta}(s_t^n) \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ $$ Implemented as-is, however, this requires two networks, \(Q\) and \(V\),
and two networks usually mean larger estimation error.
Looking at it from another angle, the next state's total reward plus the current reward is roughly \(Q\)'s reward:
$$ Q^{\pi_\theta}(s_t^n,a_t^n) \approx r_t^n +V^{\pi_\theta}(s_{t+1}^n) $$ so
$$ Q^{\pi_\theta}(s_t^n,a_t^n) -V^{\pi_\theta}(s_t^n) \Rightarrow r_t^n +V^{\pi_\theta}(s_{t+1}^n) -V^{\pi_\theta}(s_t^n)\\ $$ and the expression can be rewritten once more as
$$ \triangledown _\theta \bar{R}_\theta \approx \frac{1}{N}\sum ^N_{n=1}\sum^{T_n}_{t=1}\left (r_t^n +V^{\pi_\theta}(s_{t+1}^n) -V^{\pi_\theta}(s_t^n) \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ $$
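A minimal sketch of the one-step advantage term \(r_t + V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\) that weights \(\log \pi_\theta\) above; `critic` is an assumed network mapping a state tensor to a scalar value, not the full implementation given later in the post.

```python
import torch

# Illustrative one-step advantage estimate A_t = r_t + V(s_{t+1}) - V(s_t).
# `critic` is assumed to map a state tensor to a scalar state value V(s).
def advantage(critic, s, r, s_next):
    with torch.no_grad():        # the critic is treated as fixed here
        v_s = critic(s)
        v_next = critic(s_next)
    return r + v_next - v_s      # positive => the action did better than expected
```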
Actor-Critic Approach: Implementation
Advantage Actor-Critic (A2C)
- Initialize A2C
- for each episode
- Initialize \(s\)
- for each step
- Use the A2C actor \(\pi_\theta (s)\) to choose \(a\)
- Feed \(a\) into the environment and obtain \(r, {s}'\)
- Update \(V^{\pi_\theta}(s)\) with TD
$$V^{\pi_\theta}(s) \rightarrow r +V^{\pi_\theta}({s}') $$ - \(s={s}'\), until the termination condition is reached
- Update \(V^{\pi_\theta}(s)\) with MC
$$V^{\pi_\theta}(s) \rightarrow R $$ - Train the A2C actor \(\pi_\theta (s)\)
$$ \begin{align*} \theta ^{new}&=\theta ^{old}+\eta \triangledown _\theta \bar{R}_{\theta^{old}}\\ R_k &= r_k^n +V^{\pi_\theta}(s_{k+1}^n) -V^{\pi_\theta}(s_k^n)\\ \triangledown _\theta\bar{R}_\theta &= \sum ^{T}_{t=1} \sum _{k=t}^T\gamma^{k-t} R_k \triangledown _\theta \log \pi _\theta(a_t|s_t)\\ \end{align*} $$ - Update the target NN
A few modifications are made to the following expression.
$$ \triangledown _\theta \bar{R}_\theta \approx \frac{1}{N}\sum ^N_{n=1}\sum^{T_n}_{t=1}\left (r_t^n +V^{\pi_\theta}(s_{t+1}^n) -V^{\pi_\theta}(s_t^n) \right )\triangledown _\theta \log \pi _\theta(a_t^n|s_t^n)\\ $$ As with the earlier policy gradient, the idea of discounting is introduced:
a reward is caused by the action taken at that moment, and that action's influence on the future keeps shrinking.
Again setting \(N=1\) (no multi-sample updates; Adam handles this, in the same spirit as SGD),
we finally obtain
$$ \begin{align*} R_k &= r_k^n +V^{\pi_\theta}(s_{k+1}^n) -V^{\pi_\theta}(s_k^n)\\ \triangledown _\theta\bar{R}_\theta &= \sum ^{T}_{t=1} \sum _{k=t}^T\gamma^{k-t} R_k \triangledown _\theta \log \pi _\theta(a_t|s_t)\\ \end{align*} $$ In addition, the critic uses two NNs: one is updated along the way, the other provides the \(V^{\pi_\theta}\) estimates used by the actor.
That way, updating the critic does not simultaneously perturb the actor, which would make training unstable.
The critic itself is updated with both TD and MC.
In my view TD allows per-step updates, but in this example it makes the predictions grow larger and larger, because a reward is received just for staying alive,
so MC is added so that they can converge to reasonable values.
A2C.py
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(500)  # fix the random seed for reproducibility


class A2C:
    def __init__(self, n_actions, n_features, learning_rate=0.01, gamma=0.9):
        self.actorCriticEval = ActorCriticNet(n_actions, n_features)
        self.actorCriticTarget = ActorCriticNet(n_actions, n_features)
        print(self.actorCriticEval)
        print(self.actorCriticTarget)

        self.lr = learning_rate
        # reward discount factor
        self.gamma = gamma

        # the optimizer does the training: pass in all net parameters and the learning rate
        self.optimizerActorCriticEval = torch.optim.Adam(
            self.actorCriticEval.parameters(), lr=self.lr
        )

        self.saved_log_probs = []
        self.rewards = []
        self.states = []

    def choose_action(self, state):
        state = torch.from_numpy(state).float()
        probs, _ = self.actorCriticEval(state)
        m = Categorical(probs)
        action = m.sample()
        log_prob = m.log_prob(action)
        self.saved_log_probs.append(log_prob)

        return action.item()

    def store_trajectory(self, s, a, r, s_):
        self.rewards.append(r)
        self.states.append(s)
        self.nextState = s_

    # trained once per episode
    def trainActor(self):
        R = 0
        policy_loss = []
        rewards = []
        # The current reward is caused by the current action; past actions matter little.
        # The further a reward lies in the future, the less the current action influences it.
        # Using the raw total reward cannot tell good actions from bad ones, so learning suffers.
        nextStates = self.states + [self.nextState]
        for r, s, s_ in zip(self.rewards[::-1], self.states[::-1], nextStates[::-1]):
            _, futureVal = self.actorCriticTarget(torch.tensor(s_).float())
            _, nowVal = self.actorCriticTarget(torch.tensor(s).float())
            R_now = r + futureVal.detach() - nowVal.detach()
            R = R_now + self.gamma * R
            rewards.insert(0, R)

        for log_prob, reward in zip(self.saved_log_probs, rewards):
            # we maximize, hence the minus sign
            policy_loss.append(-log_prob * reward)

        # Actor
        self.optimizerActorCriticEval.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        # gradient clipping to avoid explosion
        # torch.nn.utils.clip_grad_norm(actor_network.parameters(), 0.5)
        self.optimizerActorCriticEval.step()

        del self.rewards[:]
        del self.saved_log_probs[:]

        # print(list(self.actorCriticEval.parameters()))

    # trained once per step
    def trainCriticTD(self):
        r = self.rewards[-1]
        _, futureVal = self.actorCriticTarget(torch.tensor(self.nextState).float())
        val = r + futureVal
        target = val.detach()
        _, predict = self.actorCriticEval(torch.tensor(self.states[-1]).float())
        # print(predict, futureVal)

        self.optimizerActorCriticEval.zero_grad()
        lossFun = torch.nn.MSELoss()
        loss = lossFun(target, predict)
        loss.backward()
        # gradient clipping to avoid explosion
        # torch.nn.utils.clip_grad_norm(actor_network.parameters(), 0.5)
        self.optimizerActorCriticEval.step()

        # print(list(self.actorCriticEval.parameters()))

    def trainCriticMC(self):
        R = 0
        for r, s in zip(self.rewards[::-1], self.states[::-1]):
            R = r + R
            target = torch.tensor(R).float()
            _, predict = self.actorCriticEval(torch.tensor(s).float())

            self.optimizerActorCriticEval.zero_grad()
            lossFun = torch.nn.MSELoss()
            loss = lossFun(target, predict)
            loss.backward()
            # gradient clipping to avoid explosion
            # torch.nn.utils.clip_grad_norm(actor_network.parameters(), 0.5)
            self.optimizerActorCriticEval.step()

            # print(predict.item(), target.item())
            # print(list(self.actorCriticEval.parameters()))

    # gradually update the target NN
    def updateTarget(self):
        for paramEval, paramTarget in zip(
            self.actorCriticEval.parameters(), self.actorCriticTarget.parameters()
        ):
            paramTarget.data = paramTarget.data + 0.1 * (
                paramEval.data - paramTarget.data
            )


class ActorCriticNet(torch.nn.Module):
    def __init__(self, n_actions, n_features):
        super(ActorCriticNet, self).__init__()
        # define the form of each layer
        self.fc1 = torch.nn.Linear(n_features, 10)
        self.fc2 = torch.nn.Linear(10, n_actions)  # actor head: action probabilities
        self.fc3 = torch.nn.Linear(n_features, 10)
        self.fc4 = torch.nn.Linear(10, 1)  # critic head: state value

    def forward(self, x):
        # the forward pass of the Module: propagate the input and produce the output
        x_a = self.fc1(x)
        x_a = F.relu6(x_a)
        action = F.softmax(self.fc2(x_a), dim=-1)

        x_v = self.fc3(x)
        x_v = F.relu6(x_v)
        val = self.fc4(x_v)

        return action, val
run.py
import gym
from A2C import A2C
import matplotlib.pyplot as plt
import torch

RENDER = True  # rendering slows training down; turn it on once learning looks good

env = gym.make("CartPole-v0")
env.seed(1)  # fix the random seed for reproducibility
# env = env.unwrapped  # remove the episode step limit

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

agent = A2C(
    n_actions=env.action_space.n,
    n_features=env.observation_space.shape[0],
    learning_rate=0.01,
    gamma=0.9,
)

reward_history = []


def plot_durations():
    y_t = torch.FloatTensor(reward_history)
    plt.figure(1)
    plt.clf()
    plt.title("Training...")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.plot(y_t.numpy())
    # Take 100 episode averages and plot them too
    if len(reward_history) >= 100:
        means = y_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated


for n_episode in range(3000):
    state = env.reset()
    sumR = 0
    for t in range(3000):  # don't loop forever while learning
        if RENDER:
            env.render()

        action = agent.choose_action(state)
        state_, reward, done, _ = env.step(action)
        agent.store_trajectory(state, action, reward, state_)
        agent.trainCriticTD()

        sumR += reward
        if done:
            break

        state = state_

    agent.trainCriticMC()
    agent.trainActor()
    agent.updateTarget()

    reward_history.append(sumR)
    if RENDER:
        plot_durations()

    avgR = sum(reward_history[:-11:-1]) / 10
    print(
        "episode: {:4d} duration: {:4d} Reward: {:5.1f} avgR: {:5.1f}".format(
            n_episode, t, sumR, avgR
        )
    )
Asynchronous Advantage Actor-Critic (A3C)
- Copy the global parameters to every worker (an A2C agent)
- Each worker initializes its own environment
- Each worker trains and computes updates
- The gradients are pushed up to the global network
\(\theta ^{new}_{global}=\theta ^{old}_{global}+\eta \triangledown \theta_{worker}\)
In short, it is A2C via the shadow-clone technique:
run A2C simultaneously on different CPUs or machines, then push the updates to the top-level global network.
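A minimal sketch of that gradient push, assuming a hypothetical `local_net`/`global_net` pair sharing one architecture and a `loss` computed by the worker; it follows the common pattern found in A3C implementations such as the referenced MorvanZhou/pytorch-A3C repo, not code from this post.

```python
import torch

# Illustrative A3C-style worker update (hypothetical local_net / global_net / loss):
# gradients are computed on the worker's local copy and applied to the shared global net.
def push_gradients_to_global(local_net, global_net, loss, global_optimizer):
    global_optimizer.zero_grad()
    loss.backward()                          # gradients live on the local parameters
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad                   # hand the worker's gradient to the global net
    global_optimizer.step()                  # apply the update to theta_global
    local_net.load_state_dict(global_net.state_dict())  # sync the worker back up
```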
References
DQN 从入门到放弃1 DQN与增强学习
Reinforcement Learning 健身房:OpenAI Gym
深入淺出介紹策略梯度
higgsfield/RL-Adventure
higgsfield/RL-Adventure-2
sweetice/Deep-reinforcement-learning-with-pytorch
Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C)
MorvanZhou/pytorch-A3C