【RL】Actor&Critic

Combining value learning and policy learning

The optimal Q-values and the optimal policy are learned at the same time; in the end, the policy network's output directly gives the optimal action.

  • The Actor (policy network) \(\pi(a|s;\theta)\) approximates the policy \(\pi\) and controls the agent's actions; it is trained with Policy Gradient
  • The Critic (value network) \(q(s,a;w)\), which approximates \(Q_\pi(s,a)\), evaluates the policy; it is trained with TD Learning

Characteristics:

  1. A combination of policy-gradient and value-estimation methods;
  2. The optimal action is obtained directly;
  3. The action space can be either discrete or continuous.

Actor&Critic

State-value function: \(V_\pi(s)=\sum_a \pi(a \mid s) \cdot Q_\pi(s, a) \approx \sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot q(s, a ; \mathbf{w}) .\)

\(\pi(a|s)\) is the policy function; it computes the probability of each action and controls how the agent acts. The policy network \(\pi(a|s;\theta)\) is used to approximate \(\pi(a|s)\).

\(Q_\pi(s,a)\) is the action-value function; it evaluates how good an action is. The value network \(q(s,a;w)\) is used to approximate \(Q_\pi(s,a)\).

Approximating the policy and the action-value function with two networks can be viewed as an athlete (the actor) and a referee (the critic), respectively.

Actor (policy network)

The input is the state; the output is a vector containing the probability of each action.

Since the action probabilities must sum to 1, \(\sum_{a \in \mathcal{A}} \pi(a \mid s, \boldsymbol{\theta})=1\), a softmax activation is used in the output layer.


The policy network is trained under the supervision of the value network to increase the approximate state value \(V(s;\theta,\mathbf{w})\); as training proceeds, the actor's performance gets better and better.

Policy Gradient is used to update the policy network so that the actor's actions receive higher scores.

Policy gradient: \[ \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_A\left[\frac{\partial \log \pi(A \mid s, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s, A ; \mathbf{w})\right] \] The policy network is updated by gradient ascent.
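In practice the expectation over \(A\) is not computed exactly; one action \(a\) is sampled from \(\pi(\cdot \mid s;\boldsymbol{\theta})\) and the single-sample term is used as an unbiased stochastic estimate of the gradient:

\[ \mathbf{g}(a)=\frac{\partial \log \pi(a \mid s, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s, a ; \mathbf{w}), \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \mathbf{g}(a). \]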

Critic (value network)

The input is the state s and the action a; the output is the estimated action value \(q(s,a;w)\).


The value network is trained so that the referee's scoring becomes more precise; the rewards are used to make the estimate \(q(s,a;w)\) more accurate.

TD Learning is used to update the value network, making the referee's scores more accurate.

Predicted action-value: \(q_t=q\left(s_t, a_t ; \mathbf{w}\right)\).

TD target: \(y_t=r_t+\gamma \cdot q\left(s_{t+1}, a_{t+1} ; \mathbf{w}\right)\)

Gradient: \[ \frac{\partial\left(q_t-y_t\right)^2 / 2}{\partial \mathbf{w}}=\left(q_t-y_t\right) \cdot \frac{\partial q\left(s_t, a_t ; \mathbf{w}\right)}{\partial \mathbf{w}} \] The value network is updated by gradient descent.

  • The purpose of training the two networks is for the actor's actions to score higher and higher while the critic's scoring becomes more and more accurate
  • The two networks may or may not share parameters

Actor&Critic algorithm

  1. Observe state \(s_t\) and randomly sample \(a_t \sim \pi\left(\cdot \mid s_t ; \boldsymbol{\theta}_t\right)\).
  2. Perform \(a_t\); then environment gives new state \(s_{t+1}\) and reward \(r_t\).
  3. Randomly sample \(\tilde{a}_{t+1} \sim \pi\left(\cdot \mid s_{t+1} ; \boldsymbol{\theta}_t\right)\). (Do not perform \(\tilde{a}_{t+1} !\) )
  4. Evaluate value network: \(q_t=q\left(s_t, a_t ; \mathbf{w}_t\right)\) and \(q_{t+1}=q\left(s_{t+1}, \tilde{a}_{t+1} ; \mathbf{w}_t\right)\).
  5. Compute TD error: \(\delta_t=q_t-\left(r_t+\gamma \cdot q_{t+1}\right)\).
  6. Differentiate value network: \(\mathbf{d}_{w, t}=\left.\frac{\partial q\left(s_t, a_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\right|_{\mathbf{w}=\mathbf{w}_t}\).
  7. Update value network: \(\mathbf{w}_{t+1}=\mathbf{w}_t-\alpha \cdot \delta_t \cdot \mathbf{d}_{w, t}\).
  8. Differentiate policy network: \(\mathbf{d}_{\theta, t}=\left.\frac{\partial \log \pi\left(a_t \mid s_t, \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}\).
  9. Update policy network: \(\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\beta \cdot q_t \cdot \mathbf{d}_{\theta, t}\) or \(\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\beta\cdot\delta_t\cdot \mathbf{d}_{\theta,t}\) (the latter replaces \(q_t\) with the TD error \(\delta_t\), which acts as a baseline-corrected signal and typically has lower variance)
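A minimal PyTorch sketch of one such update for a discrete action space. The interfaces are assumptions for illustration only (not defined in these notes): `policy_net(s)` returns action probabilities of shape `(1, n_actions)`, `q_net(s)` returns one Q-value per action so that \(q(s,a;\mathbf{w})\) is `q_net(s)[0, a]`, `s`/`s_next` are tensors of shape `(1, state_dim)`, `a` is an int, and `r` is a float:

```python
import torch


def actor_critic_step(policy_net, q_net, policy_optim, q_optim,
                      s, a, r, s_next, gamma=0.98, use_baseline=False):
    """One update following steps 1-9 above (interfaces are hypothetical)."""
    # Steps 3-4: sample a~_{t+1} from the policy (not executed) and evaluate q.
    a_next = torch.multinomial(policy_net(s_next), 1).item()
    q_t = q_net(s)[0, a]
    q_next = q_net(s_next)[0, a_next].detach()

    # Step 5: TD error.
    delta = q_t - (r + gamma * q_next)

    # Steps 6-7: gradient descent on 0.5 * delta^2 updates the value network.
    q_loss = 0.5 * delta ** 2
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()

    # Steps 8-9: gradient ascent on log(pi) * signal updates the policy network;
    # minimizing the negative is equivalent. The signal is q_t or the TD error.
    signal = delta.detach() if use_baseline else q_t.detach()
    log_pi = torch.log(policy_net(s)[0, a])
    policy_loss = -log_pi * signal
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()
```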

Advantage Actor-Critic (A2C)

The value network differs from the one above: here \(v(s;w)\) is used to approximate the state-value function \(V_\pi(s)\).

The state value depends only on the state, not on the action, which makes the network easier to train.

Network architecture


Mathematical principle

A brief sketch of the math behind A2C. The following approximation of the policy gradient is used to update the policy network: \[ \mathbf{g}\left(a_t\right) \approx \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(r_t+\gamma \cdot v\left(s_{t+1} ; \mathbf{w}\right)-v\left(s_t ; \mathbf{w}\right)\right) \]

\(r_t+\gamma \cdot v\left(s_{t+1} ; \mathbf{w}\right)-v\left(s_t ; \mathbf{w}\right)\) is the value network's judgment; it evaluates how good \(a_t\) is and guides the policy network's improvement.

But this term does not contain \(a_t\) explicitly, so how can it evaluate \(a_t\)?

  • \(v(s_t ; \mathbf{w})\) is the value network's evaluation of \(s_t\) at time t; it does not depend on \(a_t\)
  • \(r_t+\gamma \cdot v\left(s_{t+1} ; \mathbf{w}\right)\) approximates \(\mathbb{E}[U_t|s_t,s_{t+1}]\); it is a prediction made at time t+1, based on the observed \(r_t\) and \(s_{t+1}\)
  • At time t+1, \(a_t\) has already been taken, and \(\mathbb{E}[U_t|s_t,s_{t+1}]\) depends on \(a_t\); the difference between the two terms, the Advantage, therefore reflects how good \(a_t\) was
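This matches the theoretical policy gradient with a baseline. By the Bellman equation, \(Q_\pi(s_t,a_t)=\mathbb{E}\left[R_t+\gamma \cdot V_\pi(S_{t+1}) \mid s_t,a_t\right]\), so the advantage \(Q_\pi(s_t,a_t)-V_\pi(s_t)\) can be estimated from a single observed transition as

\[ A_t \approx r_t+\gamma \cdot v\left(s_{t+1} ; \mathbf{w}\right)-v\left(s_t ; \mathbf{w}\right), \]

which is exactly the factor multiplying \(\frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\) in \(\mathbf{g}(a_t)\) above.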

Algorithm

  1. Observe a transition \(\left(s_t, a_t, r_t, s_{t+1}\right)\).

  2. TD target: \(y_t=r_t+\gamma \cdot v\left(s_{t+1} ; \mathbf{w}\right)\).

  3. TD error: \(\delta_t=v\left(s_t ; \mathbf{w}\right)-y_t\).

  4. Update the policy network (actor) by:

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} . \]

  5. Update the value network (critic) by:

\[ \mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}} . \]

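The accompanying PyTorch implementation of this A2C update, in which the actor and the critic share a first hidden layer (ShareLayer):
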
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShareLayer(nn.Module):
    """Hidden layer shared by the actor and the critic."""

    def __init__(self, n_input, n_hidden):
        super(ShareLayer, self).__init__()
        self.l1 = nn.Linear(n_input, n_hidden)
        nn.init.normal_(self.l1.weight, mean=0, std=0.1)
        nn.init.constant_(self.l1.bias, 0.1)

    def forward(self, out):
        out = self.l1(out)
        out = F.relu(out)
        return out


class Actor(nn.Module):
    """Policy network: state -> action probabilities (and raw logits)."""

    def __init__(self, output_n, share_layer):
        super(Actor, self).__init__()
        self.share_layer = share_layer
        self.l2 = nn.Linear(share_layer.l1.out_features, output_n)
        nn.init.normal_(self.l2.weight, 0, 0.1)
        nn.init.constant_(self.l2.bias, 0)

    def forward(self, x):
        out = torch.FloatTensor([x])  # add a batch dimension
        out = self.share_layer(out)
        out = self.l2(out)
        prob = F.softmax(out, dim=1)
        return prob, out


class Critic(nn.Module):
    """Value network: state -> estimated state value v(s; w)."""

    def __init__(self, share_layer):
        super(Critic, self).__init__()
        self.share_layer = share_layer
        self.l2 = nn.Linear(share_layer.l1.out_features, 1)
        nn.init.normal_(self.l2.weight, 0, 0.1)
        nn.init.constant_(self.l2.bias, 0)

    def forward(self, x):
        out = torch.from_numpy(x).float()
        out = self.share_layer(out)
        out = self.l2(out)
        return out


def choose_action(prob):
    """Sample an action index according to the actor's output probabilities."""
    return np.random.choice(prob.shape[1], p=prob[0].detach().numpy())


def actor_learn(optim, logits, a, delta):
    """Policy Gradient update of the policy network.

    :param optim: actor optimizer
    :param logits: actor output before softmax
    :param a: a_t
    :param delta: TD error (here y_t - v(s_t; w), as returned by critic_learn)
    """
    a = torch.tensor([a]).long()
    # cross_entropy = log_softmax + NLL, i.e. -ln pi(a_t | s_t; theta)
    lnp_a = F.cross_entropy(logits, a)
    loss = lnp_a * delta

    optim.zero_grad()
    loss.backward()
    optim.step()


def critic_learn(optim, critic, s, r, s_, gamma):
    """TD Learning update of the value network.

    :param optim: critic optimizer
    :param critic: value network
    :param s: s_t
    :param r: r_t
    :param s_: s_{t+1}
    :param gamma: discount factor
    :return: TD error
    """
    v_s = critic(s)
    v_s_ = critic(s_)

    td_target = r + gamma * v_s_.item()  # .item() detaches the bootstrap target
    td_error = td_target - v_s
    loss = td_error ** 2

    optim.zero_grad()
    loss.backward()
    optim.step()

    return td_error.item()
```

Comparison with REINFORCE

The two algorithms use exactly the same network architecture, but the value network serves a different purpose.

The Advantage Actor-Critic (A2C) algorithm:

  1. Observing a trajectory from time \(t\) to \(t+m-1\).
  2. TD target: \(y_t=\sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i}+\gamma^m \cdot v\left(s_{t+m} ; \mathbf{w}\right)\). (partly based on actually observed rewards, partly on the value network's own estimate)
  3. TD error: \(\delta_t=v\left(s_t ; \mathbf{w}\right)-y_t\).
  4. Update the policy network (actor) by: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} .\)
  5. Update the value network (critic) by: \(\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\)

The REINFORCE algorithm:

  1. Observing a trajectory from time \(t\) to \(n\).
  2. Return: \(u_t=\sum_{i=t}^n \gamma^{i-t} \cdot r_i\). (uses the return \(u_t\), based entirely on actually observed rewards)
  3. Error: \(\delta_t=v\left(s_t ; \mathbf{w}\right)-u_t\).
  4. Update the policy network by: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} .\)
  5. Update the value network by: \(\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\)

Summary

  • A2C with one-step TD target: \(y_t=r_t+\gamma \cdot v\left(s_{t+1} ; \mathbf{w}\right)\). (Use only one reward \((m=1)\))

  • A2C with \(m\)-step TD target: \(y_t=\sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i}+\gamma^m \cdot v\left(s_{t+m} ; \mathbf{w}\right)\).

  • REINFORCE: \(y_t\) becomes \(u_t=\sum_{i=t}^n \gamma^{i-t} \cdot r_i\). (Use all the rewards)

A2C uses bootstrapping. If all the rewards are used and the value network's own estimate (the bootstrap) is not, the method reduces to the REINFORCE algorithm.
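As a concrete illustration of the three targets above, here is a small helper function (hypothetical, not from these notes) that computes the \(m\)-step TD target from the observed rewards and the critic's estimate of \(s_{t+m}\); with \(m=1\) it gives the one-step target, and with all remaining rewards and a zero bootstrap it gives the REINFORCE return \(u_t\):

```python
def m_step_td_target(rewards, v_s_m, gamma):
    """rewards = [r_t, ..., r_{t+m-1}]; v_s_m = v(s_{t+m}; w), or 0.0 for no bootstrap.

    Returns y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * v(s_{t+m}; w).
    """
    y = 0.0
    for r in reversed(rewards):  # accumulate discounted rewards
        y = r + gamma * y
    return y + (gamma ** len(rewards)) * v_s_m
```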

Asynchronous Advantage Actor-Critic (A3C)

Builds on A2C and uses asynchronous training to improve training efficiency.

Multiple actors are placed on different CPUs and run in parallel; several agents, each holding its own copy (worker) of the network, simultaneously update the parameters of the global (master) network from these parallel environments.

The parallel agents do not interfere with one another, while the global network's parameters receive the workers' updates asynchronously and discontinuously; this reduces the correlation between updates and improves convergence.

It can be implemented with Python's Thread; a sketch of the worker-to-global update follows.
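A minimal sketch under assumed names (`global_net`, `local_net`, `global_optim` are hypothetical and not defined in these notes): each worker computes gradients on its own copy of the actor/critic, pushes them to the shared network under a lock, then pulls back the latest weights.

```python
import threading

lock = threading.Lock()  # serializes updates to the shared global network


def push_and_pull(global_optim, global_net, local_net):
    """Apply the worker's gradients to the global network, then sync the worker."""
    with lock:
        global_optim.zero_grad()
        for g_p, l_p in zip(global_net.parameters(), local_net.parameters()):
            # assumes the worker already called loss.backward() on local_net
            g_p.grad = None if l_p.grad is None else l_p.grad.clone()
        global_optim.step()
        # pull the updated global weights back into the worker's copy
        local_net.load_state_dict(global_net.state_dict())
```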


Deep Deterministic Policy Gradient (DDPG)

Learns more effectively over continuous action spaces.

  • A deterministic policy network (Actor): \(a=\pi(s;\theta)\)
  • A value network (Critic): \(q(s,a;w)\)
  • The output of the value network evaluates how good the Actor's action is

The value network is trained with TD Learning (TD error: \(\delta_t=q_t-\left(r_t+\gamma \cdot q_{t+1}\right)\)), the same as before.

The key to training the policy network is to make the action \(a=\pi(s;\theta)\) maximize the value \(q(s,a;w)\).

Therefore \(q(s,\pi(s;\theta);\mathbf{w})\) is differentiated with respect to \(\theta\); applying the chain rule gives the DPG gradient: \[ \mathbf{g}=\frac{\partial q(s, \pi(s ; \boldsymbol{\theta}) ; \mathbf{w})}{\partial \boldsymbol{\theta}}=\frac{\partial a}{\partial \boldsymbol{\theta}} \cdot \frac{\partial q(s, a ; \mathbf{w})}{\partial a} . \] The policy network is then updated by gradient ascent: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \mathbf{g} .\)
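In a deep-learning framework the chain rule need not be applied by hand: maximizing \(q(s,\pi(s;\theta);w)\) is the same as minimizing its negative, and autograd propagates the gradient through the critic into the actor. A minimal sketch (`actor`, `critic(s, a)` and `actor_optim` are assumed interfaces, not defined in these notes):

```python
def ddpg_actor_update(actor, critic, actor_optim, s):
    """Gradient ascent on q(s, pi(s; theta); w) with respect to theta only."""
    a = actor(s)                        # deterministic action a = pi(s; theta)
    actor_loss = -critic(s, a).mean()   # descend on -q  <=>  ascend on q
    actor_optim.zero_grad()
    actor_loss.backward()               # chain rule handled by autograd
    actor_optim.step()                  # only the actor's parameters are stepped;
                                        # the critic's .grad buffers are cleared by
                                        # its own zero_grad() before its TD update
```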

DDPG is further improved by using target networks:

  • Policy network makes a decision: \(a=\pi(s ; \boldsymbol{\theta})\).
  • Update policy network by DPG: \(\quad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \frac{\partial a}{\partial \boldsymbol{\theta}} \cdot \frac{\partial q(s, a ; \boldsymbol{w})}{\partial a}\).
  • Value network computes \(q_t=q(s, a ; \mathbf{w})\).
  • Target networks, \(\pi\left(s ; \boldsymbol{\theta}^{-}\right)\)and \(q\left(s, a ; \mathbf{w}^{-}\right)\), compute \(q_{t+1}\).
  • TD error: \(\delta_t=q_t-\left(r_t+\gamma \cdot q_{t+1}\right)\).
  • Update value network by TD: \(\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial q(s, a ; \mathbf{w})}{\partial \mathbf{w}}\)

Updating the target network parameters

  • Set a hyper-parameter \(\tau \in(0,1)\).
  • Update the target networks by weighted averaging:

\[ \begin{aligned} & \mathbf{w}^{-} \leftarrow \tau \cdot \mathbf{w}+(1-\tau) \cdot \mathbf{w}^{-} . \\ & {\boldsymbol{\theta}^{-}} \leftarrow \tau \cdot \boldsymbol{\theta}+(1-\tau) \cdot \boldsymbol{\theta}^{-} . \end{aligned} \]
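A sketch of this weighted averaging (soft update) in PyTorch, assuming `net` and `target_net` share the same architecture:

```python
import torch


@torch.no_grad()
def soft_update(net, target_net, tau):
    """target <- tau * online + (1 - tau) * target, parameter by parameter."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau)
        p_targ.add_(tau * p)
```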


Proximal Policy Optimization (PPO)

PPO addresses the problem that the learning rate in policy-gradient (PG) methods is hard to choose:

  • If the learning rate is too large, the learned policy does not converge easily;
  • If the learning rate is too small, training takes a long time.

PPO uses the ratio between the new policy and the old policy to limit how much the new policy can change per update, making PG less sensitive to a somewhat larger learning rate.

To decide when the model update should stop, PPO adds a KL-divergence term to the original objective. It measures the difference between the two distributions: the larger the difference, the larger its value and the heavier the penalty.

This keeps the two distributions as similar as possible. The PPO objective is \[ \mathrm{J}_{\mathrm{PPO}}^{\theta^{\prime}}(\theta)=\mathrm{J}^{\theta^{\prime}}(\theta)-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right) \]

\[ \mathrm{J}^{\theta^{\prime}}(\theta)=\mathrm{E}_{\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{\mathrm{p}_\theta\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)}{\mathrm{p}_{\theta^{\prime}}\left(\mathrm{a}_{\mathrm{t}} \mid \mathrm{s}_{\mathrm{t}}\right)} \mathrm{A}^{\theta^{\prime}}\left(\mathrm{s}_{\mathrm{t}}, \mathrm{a}_{\mathrm{t}}\right)\right] \]

During training, PPO can use an adaptive KL penalty coefficient (see the sketch after the following list).

  • When the KL divergence is too large, increase \(\beta\) to strengthen the penalty: if \(\mathrm{KL}\left(\theta, \theta^{\prime}\right)>\mathrm{KL}_{\max }\), increase \(\beta\)
  • When the KL divergence is too small, decrease \(\beta\) to weaken the penalty: if \(\mathrm{KL}\left(\theta, \theta^{\prime}\right)<\mathrm{KL}_{\min }\), decrease \(\beta\)
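A minimal sketch of the penalized objective and the adaptive rule; the tensor names and the factor of 2 used to adjust \(\beta\) are illustrative assumptions, not from these notes:

```python
import torch


def ppo_penalty_loss(log_prob_new, log_prob_old, advantage, kl, beta):
    """Negative of J_PPO = E[(p_theta / p_theta') * A] - beta * KL, for gradient descent."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # p_theta(a|s) / p_theta'(a|s)
    surrogate = (ratio * advantage).mean()
    return -(surrogate - beta * kl)


def adapt_beta(kl, beta, kl_min, kl_max):
    """Adaptive KL penalty: adjust beta after each round of updates."""
    if kl > kl_max:
        beta *= 2.0   # policies drift too far apart -> strengthen the penalty
    elif kl < kl_min:
        beta /= 2.0   # penalty too strong -> weaken it
    return beta
```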


Distributed Proximal Policy Optimization (DPPO)

Essentially a multi-threaded, parallel version of PPO.