【RL】Policy-Based Learning


Idea: estimate the optimal policy directly.

  1. Stochastic policy: estimate the mapping from states to a probability distribution over actions, then sample from that distribution to obtain the output action.

  2. Deterministic policy: directly estimate the mapping from states to optimal actions.

Characteristics: the output of the neural network is the action itself (or its distribution), and the action space can be either discrete or continuous.

The biggest advantage of outputting actions directly is that actions can be chosen from a continuous interval. A value-based method such as Q-learning would have to evaluate the value of infinitely many actions in order to pick one, which it cannot cope with.
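For example, a minimal sketch (with hypothetical state and action dimensions, not part of the original text) of a deterministic policy network that maps a state directly to a continuous action in \([-1, 1]\):

import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
STATE_DIM, ACTION_DIM = 4, 1

# Deterministic policy: state -> action in the continuous range [-1, 1].
# A value-based method would instead have to evaluate Q(s, a) for every
# candidate action, which is infeasible when the action space is continuous.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, ACTION_DIM),
    nn.Tanh(),          # squash the output into [-1, 1]
)

state = torch.randn(1, STATE_DIM)   # dummy state
action = policy(state)              # continuous action, no enumeration over actions needed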

Core Ideas

Use a policy network \(\pi(a|s;\theta)\) to approximate the policy \(\pi(a|s)\).

The agent samples an action according to the output of the policy network: \(a_t \sim \pi\left(\cdot \mid s_t\right)\).

Train the policy network with the policy gradient.

The policy gradient algorithm trains the network parameters by maximizing \(\mathbb{E}_s[V(s;\theta)]\).
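A minimal sketch of the sampling step above (the network architecture and dimensions are placeholder assumptions):

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3                              # hypothetical sizes

# Placeholder policy network pi(a|s;theta): maps a state to action probabilities.
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.Tanh(), nn.Linear(16, N_ACTIONS))

s_t = torch.randn(1, STATE_DIM)                          # dummy state
probs = torch.softmax(policy_net(s_t), dim=1)            # pi(. | s_t; theta)
a_t = torch.distributions.Categorical(probs).sample()    # a_t ~ pi(. | s_t)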


Mathematical Derivation

Discounted return: \(U_t=R_t+\gamma \cdot R_{t+1}+\gamma^2 \cdot R_{t+2}+\gamma^3 \cdot R_{t+3}+\cdots\)

Action-value function: \(Q_\pi\left(s_t, a_t\right)=\mathbb{E}\left[U_t \mid S_t=s_t, A_t=a_t\right] \text {. }\)

State-value function: \(V_\pi(s_t)=\mathbb{E}_A\left[Q_\pi(s_t, A)\right]=\sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\)

Using the policy network \(\pi(a|s;\theta)\) to approximate the policy \(\pi\), the state-value function can likewise be approximated with \(\theta\) as \(V(s;\theta)\): \[ V(s ; \boldsymbol{\theta})=\sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a) . \]

\(V(s ; \boldsymbol{\theta})\) measures how good the current state is (e.g., the chance of winning from it); taking the expectation over states gives \(J(\theta)\): \[ J(\boldsymbol{\theta})=\mathbb{E}_S[V(S ; \boldsymbol{\theta})] \] The goal of policy learning is to update the network parameters \(\theta\) so that \(J(\theta)\) becomes as large as possible.

Update the network parameters with the policy gradient: \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta\cdot \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \] where \(\beta\) is the learning rate and \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\) is the policy gradient; \(\theta\) is updated by gradient ascent.

Policy Gradient

State-value function: \(V_\pi(s_t)=\mathbb{E}_A\left[Q_\pi(s_t, A)\right]=\sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\)

Computing the policy gradient: \[ \begin{aligned} \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} & =\frac{\partial \sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}} \\ & =\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}} \\ & =\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a)\\ & =\sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a) \\ & =\mathbb{E}_A\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right] . \end{aligned} \]
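The step from the third to the fourth line uses the log-derivative identity, which follows from the chain rule: \[ \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\frac{1}{\pi(a \mid s ; \boldsymbol{\theta})} \cdot \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \quad\Longrightarrow\quad \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} . \]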

The derivation above assumes that \(Q_\pi\) is independent of \(\theta\) in order to simplify it; in reality it is not, so the derivation is not fully rigorous.

The derivation yields two equivalent forms of the policy gradient: \[ \begin{aligned} & \text{Form 1:}\quad \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a) . \\ & \text{Form 2:}\quad \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right] . \end{aligned} \]

Discrete Actions

If the action space is discrete, Form 1, \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a)\), can be used to compute the policy gradient (a code sketch follows the list below):

  1. Calculate \(\mathbf{f}({a}, \boldsymbol{\theta})=\frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a)\), for every action \(a \in \mathcal{A}\).

  2. Policy gradient: \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbf{f}(a_1, \boldsymbol{\theta})+\mathbf{f}(a_2, \boldsymbol{\theta})+\mathbf{f}(a_3, \boldsymbol{\theta})\) (here assuming \(\mathcal{A}=\{a_1, a_2, a_3\}\)).
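A hedged sketch of Form 1 with autograd, treating \(Q_\pi(s,a)\) as given constants (the network, dimensions, and Q-values are placeholder assumptions):

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3                         # hypothetical sizes
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.Tanh(), nn.Linear(16, N_ACTIONS))

s = torch.randn(1, STATE_DIM)                        # dummy state
q_values = [1.0, -0.5, 2.0]                          # placeholders for Q_pi(s, a)

# Form 1: sum_a dpi(a|s;theta)/dtheta * Q_pi(s, a).
# Differentiating sum_a pi(a|s;theta) * Q_pi(s, a), with the Q-values held
# constant, gives exactly this gradient.
probs = torch.softmax(policy_net(s), dim=1)[0]
surrogate = sum(probs[a] * q for a, q in enumerate(q_values))
surrogate.backward()                                 # .grad fields now hold the policy gradient
grad_l1_weight = policy_net[0].weight.grad           # e.g., gradient w.r.t. the first layer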

Continuous Actions

If the action space is continuous, Form 2 is used. Since the resulting expectation is an intractable integral, the policy gradient is computed with a Monte Carlo approximation.

Monte Carlo approximation: the quantity to estimate is the expectation \(\mathbb{E}_A[\mathbf{g}(A, \boldsymbol{\theta})]=\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\). Randomly draw one sample \(\hat{a}\) (or many samples) from the action space according to the distribution \(\pi(\cdot|s;\theta)\), compute \(g(\hat{a},\theta)\) (or the average over samples), and use it as an approximation of the policy gradient; this is an unbiased estimate.

Suppose the action is continuous, with action space, e.g., \(\mathcal{A}=[0,1]\): \[ \text{Form 2:}\quad \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right] \]

  1. Randomly sample an action \(\hat{a}\) according to the PDF \(\pi(\cdot \mid s ; \boldsymbol{\theta})\).
  2. Calculate \(\mathbf{g}(\hat{a}, \boldsymbol{\theta})=\frac{\partial \log \pi(\hat{a} \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, \hat{a})\).
  3. Use \(\mathbf{g}(\hat{a}, \boldsymbol{\theta})\) as an approximation to the policy gradient \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\).
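A hedged sketch of this Monte Carlo step for a 1-D Gaussian policy (the network, dimensions, and the Q-value are placeholder assumptions):

import torch
import torch.nn as nn

STATE_DIM = 4                                        # hypothetical
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.Tanh(), nn.Linear(16, 2))

s = torch.randn(1, STATE_DIM)                        # dummy state
mean, log_std = policy_net(s)[0]                     # pi(.|s;theta) = N(mean, std^2)
dist = torch.distributions.Normal(mean, log_std.exp())

a_hat = dist.sample()                                # 1. sample a_hat ~ pi(.|s;theta)
q_hat = 1.7                                          # placeholder for Q_pi(s, a_hat)

# 2. g(a_hat, theta) = dlog pi(a_hat|s;theta)/dtheta * Q_pi(s, a_hat)
log_prob = dist.log_prob(a_hat)
(log_prob * q_hat).backward()                        # 3. .grad fields approximate dV/dtheta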

Policy Learning Algorithm

  1. Observe the state \(s_t\).
  2. Randomly sample action \(a_t\) according to \(\pi\left(\cdot \mid s_t ; \boldsymbol{\theta}_t\right)\).
  3. Compute \(q_t \approx Q_\pi\left(s_t, a_t\right)\) (some estimate).
  4. Differentiate policy network: \(\mathbf{d}_{\theta, t}=\left.\frac{\partial \log \pi\left(a_t \mid s_t, \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}\).
  5. (Approximate) policy gradient: \(\mathbf{g}\left(a_t, \boldsymbol{\theta}_t\right)=q_t \cdot \mathbf{d}_{\theta, t}\).
  6. Update policy network: \(\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\beta \cdot \mathbf{g}\left(a_t, \boldsymbol{\theta}_t\right)\).

However, the value \(Q_\pi(s_t,a_t)\) in step 3 cannot be computed directly; it can be estimated with REINFORCE or with an actor-critic method.
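A sketch of one iteration of steps 1–6, with \(q_t\) left as a placeholder constant (the network, sizes, and dummy state are assumptions for illustration). Minimizing \(-q_t \cdot \log \pi(a_t \mid s_t;\theta)\) with SGD performs exactly the gradient ascent update of step 6:

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, BETA = 4, 3, 1e-3              # hypothetical sizes and learning rate
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 16), nn.Tanh(), nn.Linear(16, N_ACTIONS))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=BETA)

s_t = torch.randn(1, STATE_DIM)                      # 1. observe the state s_t (dummy here)
probs = torch.softmax(policy_net(s_t), dim=1)
a_t = torch.distributions.Categorical(probs).sample()   # 2. a_t ~ pi(.|s_t;theta)
q_t = 1.0                                            # 3. placeholder estimate of Q_pi(s_t, a_t)

# 4-6. d = dlog pi(a_t|s_t;theta)/dtheta, g = q_t * d, theta <- theta + beta * g.
loss = -q_t * torch.log(probs[0, a_t])
optimizer.zero_grad()
loss.backward()
optimizer.step()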

Baseline

Introduce a baseline \(b\) that is independent of \(A\): \[ \begin{aligned} \mathbb{E}_{A \sim \pi}\left[b \cdot \frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]& =b \cdot \mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] \\ & =b \cdot \sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot\left[\frac{1}{\pi(a \mid s ; \boldsymbol{\theta})} \cdot \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] \\ & =b \cdot \sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ & =b \cdot \frac{\partial \sum_a \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ & =b \cdot \frac{\partial 1}{\partial \boldsymbol{\theta}}=0 . \end{aligned} \] So as long as \(b\) is independent of \(A\), we get \(\mathbb{E}_{A \sim \pi}\left[b \cdot \frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]=0\).

Combining this with the policy gradient formula above, the second term is zero: \[ \begin{aligned} \frac{\partial V_\pi(s)}{\partial \boldsymbol{\theta}} &=\mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right]-\mathbb{E}_{A \sim \pi}\left[b \cdot \frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] \\ & =\mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\cdot\left(Q_\pi(s, A)-b\right)\right] . \end{aligned} \] Whatever value \(b\) takes, as long as it is independent of \(A\), the expectation is the same as before.

In the Monte Carlo approximation, \(a_t\) is sampled according to the output probabilities of the policy network, and \(\mathbf{g}(a_t)\) is used to approximate the expectation: \[ \mathbf{g}\left(a_t\right)=\frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(Q_\pi\left(s_t, a_t\right)-b\right) . \] Clearly \(b\) (being independent of \(A_t\)) does not affect the expectation \(\mathbb{E}_{A_t \sim \pi}\left[\mathbf{g}\left(A_t\right)\right]\), but it does change the magnitude of \(g(a_t)\).

If \(b\) is close to \(Q_\pi\), the variance of the Monte Carlo approximation \(g(a_t)\) is reduced and the algorithm converges faster.

One choice is \(b=V_\pi(s_t)\): \(V_\pi(s_t)\) is independent of \(A_t\), and since \(V_\pi(s_t)\) is the expectation of \(Q_\pi(s_t,A_t)\), the two are close in value: \[ \frac{\partial V_\pi\left(s_t\right)}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi\left(A_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(Q_\pi\left(s_t, A_t\right)-V_\pi\left(s_t\right)\right)\right] . \]
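A small numerical sketch (toy probabilities and Q-values, not from the original) illustrating that a baseline independent of \(A\) leaves the expectation of \(\mathbf{g}(A)\) unchanged while shrinking its variance:

import numpy as np

rng = np.random.default_rng(0)

# Toy softmax policy over 2 actions.
probs = np.array([0.3, 0.7])                     # pi(a|s;theta)
q = np.array([10.0, 12.0])                       # made-up Q_pi(s, a)

# Score w.r.t. the first logit of a softmax policy: dln pi(a)/dtheta_0 = 1{a=0} - pi(0),
# which averages to zero under pi (the property the baseline argument relies on).
score = np.array([1.0 - probs[0], -probs[0]])

def estimate(b, n=200_000):
    a = rng.choice(2, size=n, p=probs)           # A ~ pi(.|s;theta)
    g = score[a] * (q[a] - b)                    # g(A) with baseline b
    return g.mean(), g.var()

print(estimate(b=0.0))      # mean ~ -0.42, large variance
print(estimate(b=11.0))     # mean ~ -0.42 as well, but much smaller variance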

REINFORCE

Updates based on data from an entire episode.

Use the policy network \(\pi\) to control the agent, play one game from start to finish, and record every state, action, and reward: \[ s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T . \]

Once all rewards \(r_t\) are observed, all discounted returns \(u_t\;(t=1,2,\ldots,T)\) can be computed: \[ u_t=\sum_{k=t}^T \gamma^{k-t} r_k \] Since \(Q_\pi(s_t,a_t)\) is the expectation of \(U_t\), the observed \(u_t\) can be used to approximate \(q_t\).
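For example, with \(\gamma=0.9\) and a three-step episode with made-up rewards \(r_1=1,\; r_2=0,\; r_3=2\), the returns follow the backward recursion \(u_t=r_t+\gamma\, u_{t+1}\): \[ u_3=2, \qquad u_2=0+0.9\cdot 2=1.8, \qquad u_1=1+0.9\cdot 1.8=2.62 . \] This backward recursion is exactly what the implementation below uses to compute all \(u_t\) in one pass.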

REINFORCE With Baseline

Combining REINFORCE with the baseline method above, the policy gradient is approximated as: \[ \frac{\partial V_\pi\left(s_t\right)}{\partial \boldsymbol{\theta}} \approx \mathbf{g}\left(a_t\right) \approx \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot \left(u_t-v\left(s_t ; \mathbf{w}\right)\right) \] This uses two Monte Carlo approximations and one neural-network approximation:

  1. Approximate expectation using one sample, \(a_t\). (Monte Carlo.)

  2. Approximate \(Q_\pi\left(s_t, a_t\right)\) by \(u_t\). (Another Monte Carlo.)

  3. Approximate \(V_\pi(s)\) by the value network, \(v(s ; \mathbf{w})\).

Build the two neural networks; since both use convolutional layers (in the Mario demo), they can share parameters.
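A hedged sketch of such parameter sharing (the fully connected backbone and layer sizes are assumptions; the Mario demo uses convolutional layers instead):

import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Policy head pi(a|s;theta) and value head v(s;w) on top of shared features."""
    def __init__(self, n_features, n_hidden, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.policy_head = nn.Linear(n_hidden, n_actions)   # logits of pi(.|s;theta)
        self.value_head = nn.Linear(n_hidden, 1)            # v(s;w)

    def forward(self, s):
        h = self.backbone(s)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

# Usage: probs, value = SharedActorCritic(4, 32, 3)(torch.randn(1, 4))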


Combining with policy learning

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(u_t-v\left(s_t ; \mathbf{w}\right)\right) . \] Let \(\delta_t=v(s_t ; \mathbf{w})-u_t\).

Then the update can be written as \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \] The value network \(v(s;\mathbf{w})\) is trained by regression (it should be as close as possible to \(u_t\)).

Since \(v(s;\mathbf{w})\) is an estimate of \(V_\pi(s_t)\), it can be fitted to the return \(u_t\):

Prediction error: \(\delta_t=v\left(s_t ; \mathbf{w}\right)-u_t\).

Gradient: \(\frac{\partial \delta_t^2 / 2}{\partial \mathbf{w}}=\delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\).

Gradient descent update of the value-network parameters: \[ \mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}} . \]

Each episode performs the following steps (a code sketch of steps 2–4 follows the list):

  1. Play one episode and record all states, actions, and rewards: \(s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n .\)
  2. Compute \(u_t=\sum_{i=t}^n \gamma^{i-t} \cdot r_i\) and \(\delta_t=v\left(s_t ; \mathbf{w}\right)-u_t\).
  3. Update the policy network with the policy gradient: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} .\)
  4. Update the value network by regression: \(\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\)
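A hedged sketch of steps 2–4 for one recorded episode, assuming a separate value network \(v(s;\mathbf{w})\) and two optimizers; all names here are hypothetical, and the implementation in the next section is plain REINFORCE, which normalizes the returns instead of learning a baseline:

import torch

def reinforce_with_baseline_update(policy_net, value_net, policy_opt, value_opt,
                                   states, actions, rewards, gamma=0.95):
    """One update from a full episode.

    states: FloatTensor (T, n_features); actions: LongTensor (T,); rewards: list of floats.
    """
    # Step 2: discounted returns u_t (backward recursion) and errors delta_t = v(s_t;w) - u_t.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    u = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    delta = (value_net(states).squeeze(-1) - u).detach()   # treated as a constant below

    # Step 3: policy update  theta <- theta - beta * delta_t * dln pi(a_t|s_t;theta)/dtheta.
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    log_pi_a = log_probs[torch.arange(len(actions)), actions]
    policy_loss = (delta * log_pi_a).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Step 4: value update  w <- w - alpha * delta_t * dv(s_t;w)/dw, i.e. minimize (v - u)^2 / 2.
    value_loss = 0.5 * ((value_net(states).squeeze(-1) - u) ** 2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()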

Implementation

import numpy as np
import torch
import torch.nn as nn


class Net(nn.Module):
    """
    Network with input, hidden, and output layers.
    """
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.l1 = nn.Linear(n_feature, n_hidden)
        self.l2 = nn.Linear(n_hidden, n_output)

    def forward(self, x):
        x = self.l1(x)
        x = torch.tanh(x)
        x = self.l2(x)
        return x


class PolicyGradient:
    def __init__(self, n_actions, n_features, n_hidden=10, learning_rate=0.01, reward_decay=0.95, seed=None):
        self.n_actions = n_actions
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.lr = learning_rate
        self.gamma = reward_decay

        # Stores the states, actions, and rewards of the current episode.
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []

        if seed is not None:
            torch.manual_seed(seed)
            np.random.seed(seed)

        # Build the network.
        self._build_net()

    def _build_net(self):
        """
        Build the policy network and its optimizer.
        """
        self.policy_net = Net(self.n_features, self.n_hidden, self.n_actions)
        self.optimizer = torch.optim.RMSprop(self.policy_net.parameters(), lr=self.lr)

    def choose_action(self, observation):
        """
        Use policy_net to compute the action probability distribution and sample an action from it.
        """
        x = torch.FloatTensor(observation[np.newaxis, :])
        prob = torch.softmax(self.policy_net(x), dim=1).data.numpy()[0]
        return np.random.choice(range(self.n_actions), p=prob)

    def store_transition(self, state, action, reward):
        """
        Store one (s, a, r) transition of the current episode.
        """
        self.ep_obs.append(state)
        self.ep_as.append(action)
        self.ep_rs.append(reward)

    def learn(self):
        """
        Update the network parameters at the end of an episode.
        """
        # Compute the (normalized) discounted return u_t for all t.
        ep_ut = torch.tensor(self._discount_and_norm_rewards(), dtype=torch.float32)

        # Action scores (logits) for every recorded state; cross_entropy applies the softmax internally.
        all_prob = self.policy_net(torch.FloatTensor(np.vstack(self.ep_obs)))

        # Action chosen at every recorded state.
        all_acts = torch.tensor(self.ep_as, dtype=torch.long)

        # cross_entropy combines nn.LogSoftmax() and nn.NLLLoss(), so this is -log pi(a_t|s_t;theta).
        neg_log_prob = torch.nn.functional.cross_entropy(all_prob, all_acts, reduction='none')

        # Mean over t of -log pi(a_t|s_t;theta) * u_t; minimizing it is gradient ascent on the objective.
        loss = torch.mean(neg_log_prob * ep_ut)

        # Backpropagate and update the parameters.
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Clear the episode memory.
        self.ep_rs, self.ep_as, self.ep_obs = [], [], []

        return ep_ut

    def _discount_and_norm_rewards(self):
        """
        Compute the discounted return u_t for all t, then normalize.
        """
        # Use a float dtype so the normalization below works even if rewards are ints.
        ep_ut = np.zeros_like(self.ep_rs, dtype=np.float64)
        discount_reward = 0
        for i in reversed(range(len(self.ep_rs))):
            discount_reward = self.ep_rs[i] + discount_reward * self.gamma
            ep_ut[i] = discount_reward

        # Normalize the returns (acts as a simple baseline and stabilizes training).
        ep_ut -= np.mean(ep_ut)
        ep_ut /= np.std(ep_ut)
        return ep_ut


def main(argv):
    ...
    with gym.make("llvm-ic-v0", benchmark=benchmark) as env:

        RL = PolicyGradient(...)

        for episode in range(1, FLAGS.episodes + 1):
            # Reset the environment at the start of each episode.
            observation = env.reset()

            for step_episode in range(FLAGS.episode_length):
                # Sample an action from the policy and step the environment.
                action = int(RL.choose_action(observation))

                observation_, reward, done, info = env.step(action)
                RL.store_transition(observation, action, reward)

                # Update the network once the episode ends.
                if done or (step_episode == FLAGS.episode_length - 1):
                    RL.learn()
                    break

                observation = observation_