【RL】Policy-Based Learning
Idea: estimate the optimal policy directly.
Stochastic policy: estimate the mapping from a state to a probability distribution over actions, then sample from that distribution to obtain the output action.
Deterministic policy: directly estimate the mapping from a state to the optimal action.
Characteristics: the neural network outputs the action itself, and the action space can be either discrete or continuous.
The biggest advantage of outputting actions directly is that the action can be chosen from a continuous range. A value-based method such as Q-learning would have to evaluate the value of infinitely many actions in order to choose one, which is infeasible.
Core Ideas
Use the policy network \(\pi(a|s;\theta)\) to approximate the policy \(\pi(a|s)\).
The agent samples its action from the policy network's output distribution: \(a_t \sim \pi\left(\cdot \mid s_t\right)\).
The policy network is trained via the policy gradient (Policy Gradient).
The policy gradient algorithm trains the network parameters by maximizing \(\mathbb{E}_S[V(S;\theta)]\); a minimal sketch of the policy network follows.
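As a concrete illustration, here is a minimal sketch of a discrete-action policy network and the sampling step, assuming PyTorch (the layer sizes, state dimension, and number of actions are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """pi(a|s; theta): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, n_actions)

    def forward(self, s):
        h = F.relu(self.fc1(s))
        return F.softmax(self.fc2(h), dim=-1)    # action probabilities, sum to 1

policy = PolicyNet(state_dim=4, n_actions=3)
s_t = torch.randn(1, 4)                          # observed state
probs = policy(s_t)                              # pi(.|s_t; theta)
a_t = Categorical(probs).sample()                # a_t ~ pi(.|s_t; theta)
```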

Mathematical Derivation
Discounted return: \(U_t=R_t+\gamma \cdot R_{t+1}+\gamma^2 \cdot R_{t+2}+\gamma^3 \cdot R_{t+3}+\cdots\)
Action-value function: \(Q_\pi\left(s_t, a_t\right)=\mathbb{E}\left[U_t \mid S_t=s_t, A_t=a_t\right] \text {. }\)
State-value function: \(V_\pi(s_t)=\mathbb{E}_A\left[Q_\pi(s_t, A)\right]=\sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\)
Since the policy network \(\pi(a|s;\theta)\) approximates the policy \(\pi\), the state-value function can likewise be approximated with \(\theta\) as \(V(s;\theta)\): \[ V(s ; \boldsymbol{\theta})=\sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a) . \]
\(V(s ; \boldsymbol{\theta})\) measures how good the current state is (e.g., the chance of winning from it); taking the expectation over states gives \[ J(\boldsymbol{\theta})=\mathbb{E}_S[V(S ; \boldsymbol{\theta})] . \] The goal of policy learning is to update the network parameters \(\theta\) so that \(J(\theta)\) becomes as large as possible.
The parameters are updated with the policy gradient: \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta\cdot \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} , \] where \(\beta\) is the learning rate and \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\) is the policy gradient; \(\theta\) is updated by gradient ascent.
Policy Gradient
State-value function: \(V_\pi(s_t)=\mathbb{E}_A\left[Q_\pi(s_t, A)\right]=\sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\)
计算策略梯度: \[ \begin{aligned} \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} & =\frac{\partial \sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}} \\ & =\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta}) \cdot Q_\pi(s, a)}{\partial \boldsymbol{\theta}} \\ & =\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a)\\ & =\sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a) \\ & =\mathbb{E}_A\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right] . \end{aligned} \]
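The step from the third line to the fourth uses the log-derivative identity, which follows from the chain rule: \[ \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\frac{1}{\pi(a \mid s ; \boldsymbol{\theta})} \cdot \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \quad\Longrightarrow\quad \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\pi(a \mid s ; \boldsymbol{\theta}) \cdot \frac{\partial \log \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} , \] and the last line rewrites the weighted sum over \(a\) as an expectation over \(A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})\).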
The derivation above treats \(Q_\pi\) as if it were independent of \(\theta\) in order to simplify things; in reality it is not, so the derivation is not fully rigorous.
The derivation yields two forms of the policy gradient: \[ \begin{aligned} & \text{Form 1:}\quad \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a) . \\ & \text{Form 2:}\quad \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right] . \end{aligned} \]
Discrete Actions
If the action space is discrete, Form 1, \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a)\), can be used to compute the policy gradient directly (a sketch follows the two steps below):
Calculate \(\mathbf{f}({a}, \boldsymbol{\theta})=\frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, a)\), for every action \(a \in \mathcal{A}\).
Policy gradient (e.g., for \(\mathcal{A}=\{a_1, a_2, a_3\}\)): \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbf{f}(a_1, \boldsymbol{\theta})+\mathbf{f}(a_2, \boldsymbol{\theta})+\mathbf{f}(a_3, \boldsymbol{\theta})\).
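A minimal sketch of Form 1 with automatic differentiation, for a tiny linear-softmax policy over three actions (the parameters, state, and \(Q_\pi\) values below are made up; a real algorithm still has to estimate \(Q_\pi\)):

```python
import torch

# Form 1: d/d(theta) sum_a pi(a|s;theta) * Q_pi(s,a), treating Q_pi as constants.
theta = torch.randn(3, 4, requires_grad=True)     # policy parameters
s = torch.randn(4)                                # observed state
q = torch.tensor([1.0, -0.5, 2.0])                # pretend Q_pi(s, a) for a1, a2, a3

pi = torch.softmax(theta @ s, dim=0)              # pi(a|s; theta)
surrogate = (pi * q).sum()                        # sum_a pi(a|s;theta) * Q_pi(s,a)
surrogate.backward()                              # theta.grad now holds dV(s;theta)/d(theta)
```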
Continuous Actions
If the action space is continuous, Form 2 is used. The resulting expectation is an integral that is very hard to compute exactly, so the policy gradient is estimated by Monte Carlo approximation.
Monte Carlo approximation of the expectation \(\mathbb{E}_A[\mathbf{g}(A, \boldsymbol{\theta})]=\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\): draw one sample (or many samples) \(\hat{a}\) from the action space according to the distribution \(\pi(\cdot|s;\theta)\), compute \(\mathbf{g}(\hat{a},\theta)\) (or the average over the samples), and use it as the policy gradient; this is an unbiased estimate (see the sketch after the steps below).
Suppose the action is continuous, e.g., with action space \(\mathcal{A}=[0,1]\): \[ \text{Form 2:}\quad \frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A \sim \pi(\cdot \mid s ; \boldsymbol{\theta})}\left[\frac{\partial \log \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right] \]
- Randomly sample an action \(\hat{a}\) according to the PDF \(\pi(\cdot \mid s ; \boldsymbol{\theta})\).
- Calculate \(\mathbf{g}(\hat{a}, \boldsymbol{\theta})=\frac{\partial \log \pi(\hat{a} \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, \hat{a})\).
- Use \(\mathbf{g}(\hat{a}, \boldsymbol{\theta})\) as an approximation to the policy gradient \(\frac{\partial V(s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\).
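A minimal sketch of this Monte Carlo estimate, assuming the policy over \(\mathcal{A}=[0,1]\) is parameterized as a Beta distribution (just one possible choice; the parameters and the placeholder \(Q_\pi\) value are made up):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

theta = torch.randn(2, 4, requires_grad=True)     # policy parameters
s = torch.randn(4)                                # observed state

c1, c0 = F.softplus(theta @ s) + 1e-3             # Beta concentration parameters
dist = Beta(c1, c0)                               # pi(.|s; theta), supported on [0, 1]
a_hat = dist.sample()                             # a_hat ~ pi(.|s; theta)
q_hat = 1.0                                       # placeholder for Q_pi(s, a_hat)

(dist.log_prob(a_hat) * q_hat).backward()         # theta.grad = g(a_hat, theta)
```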
Policy Learning Algorithm
- Observe the state \(s_t\).
- Randomly sample action \(a_t\) according to \(\pi\left(\cdot \mid s_t ; \boldsymbol{\theta}_t\right)\).
- Compute \(q_t \approx Q_\pi\left(s_t, a_t\right)\) (some estimate).
- Differentiate policy network: \(\mathbf{d}_{\theta, t}=\left.\frac{\partial \log \pi\left(a_t \mid s_t, \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}\).
- (Approximate) policy gradient: \(\mathbf{g}\left(a_t, \boldsymbol{\theta}_t\right)=q_t \cdot \mathbf{d}_{\theta, t}\).
- Update policy network: \(\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_t+\beta \cdot \mathbf{g}\left(a_t, \boldsymbol{\theta}_t\right)\).
However, the value \(Q_\pi(s_t,a_t)\) needed in step 3 cannot be computed directly, so methods such as REINFORCE or actor-critic are used to estimate it. A single update step of this kind is sketched below.
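With an automatic-differentiation library, the update is usually implemented by minimizing a surrogate loss whose gradient equals \(-\mathbf{g}(a_t, \boldsymbol{\theta}_t)\). A minimal sketch, assuming PyTorch and a placeholder estimate \(q_t\) (the network, state, and learning rate are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

policy = nn.Linear(4, 3)                          # a tiny stand-in policy network
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)   # beta = 1e-2

s_t = torch.randn(1, 4)                           # 1. observe the state
dist = Categorical(F.softmax(policy(s_t), dim=-1))
a_t = dist.sample()                               # 2. sample a_t ~ pi(.|s_t; theta_t)
q_t = 1.0                                         # 3. placeholder estimate of Q_pi(s_t, a_t)

loss = -(q_t * dist.log_prob(a_t)).mean()         # 4-5. grad(loss) = -g(a_t, theta_t)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                  # 6. theta <- theta + beta * g(a_t, theta_t)
```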
Baseline
Introduce a baseline \(b\) that is independent of \(A\): \[ \begin{aligned} \mathbb{E}_{A \sim \pi}\left[b \cdot \frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]& =b \cdot \mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] \\ & =b \cdot \sum_a \pi(a \mid s ; \boldsymbol{\theta}) \cdot\left[\frac{1}{\pi(a \mid s ; \boldsymbol{\theta})} \cdot \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] \\ & =b \cdot \sum_a \frac{\partial \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ & =b \cdot \frac{\partial \sum_a \pi(a \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ & =b \cdot \frac{\partial 1}{\partial \boldsymbol{\theta}}=0 . \\ \end{aligned} \] So as long as \(b\) is independent of \(A\), we have \(\mathbb{E}_{A \sim \pi}\left[b \cdot \frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right]=0\).
Subtracting this zero term from the policy gradient formula above: \[ \begin{aligned} \frac{\partial V_\pi(s)}{\partial \boldsymbol{\theta}} &=\mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s, A)\right]-b \cdot \mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] \\ & =\mathbb{E}_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s ; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot \left(Q_\pi(s, A)-b\right)\right] . \end{aligned} \] Whatever value \(b\) takes, as long as it is independent of \(A\), the expectation is unchanged.
In the Monte Carlo approximation, an action \(a_t\) is sampled according to the policy network's output distribution, and \(\mathbf{g}(a_t)\) is used to approximate the expectation: \[ \mathbf{g}\left(a_t\right)=\frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(Q_\pi\left(s_t, a_t\right)-b\right) . \] Clearly \(b\) (being independent of \(A_t\)) does not affect the expectation \(\mathbb{E}_{A_t \sim \pi}\left[\mathbf{g}\left(A_t\right)\right]\), but it does change the magnitude of \(\mathbf{g}(a_t)\).
If \(b\) is close to \(Q_\pi\), the variance of the Monte Carlo approximation \(\mathbf{g}(a_t)\) is reduced and the algorithm converges faster.
A natural choice is \(b=V_\pi(s_t)\): it is independent of \(A_t\), and since \(V_\pi(s_t)\) is the expectation of \(Q_\pi(s_t,A_t)\), the two are close: \[ \frac{\partial V_\pi\left(s_t\right)}{\partial \boldsymbol{\theta}}=\mathbb{E}_{A_t \sim \pi}\left[\frac{\partial \ln \pi\left(A_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(Q_\pi\left(s_t, A_t\right)-V_\pi\left(s_t\right)\right)\right] . \] A small numerical check of the variance reduction is sketched below.
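This check uses a fixed softmax policy over three actions with made-up \(Q_\pi\) values: the Monte Carlo gradient estimate has (approximately) the same mean with or without the baseline, but a much smaller variance with \(b=V_\pi(s)\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])                # logits; pi = softmax(theta)
pi = np.exp(theta) / np.exp(theta).sum()
q = np.array([10.0, 11.0, 12.0])                  # pretend Q_pi(s, a)
v = pi @ q                                        # V_pi(s), used as the baseline b

def grad_log_pi(a):
    """d ln pi(a) / d theta for a softmax policy: e_a - pi."""
    e = np.zeros(3)
    e[a] = 1.0
    return e - pi

no_baseline, with_baseline = [], []
for _ in range(50_000):
    a = rng.choice(3, p=pi)                       # a ~ pi(.|s; theta)
    no_baseline.append(grad_log_pi(a) * q[a])             # g(a) with b = 0
    with_baseline.append(grad_log_pi(a) * (q[a] - v))     # g(a) with b = V_pi(s)

print(np.mean(no_baseline, axis=0), np.mean(with_baseline, axis=0))             # nearly equal
print(np.var(no_baseline, axis=0).sum(), np.var(with_baseline, axis=0).sum())   # far smaller with b
```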
REINFORCE
REINFORCE updates using an entire episode of data.
Use the policy network \(\pi\) to control the agent and play a full episode from start to finish, recording every state, action, and reward: \[ s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_T, a_T, r_T . \]
Once all the rewards \(r_t\) have been observed, every discounted return \(u_t\;(t=1,2,\ldots,T)\) can be computed: \[ u_t=\sum_{k=t}^T \gamma^{k-t} r_k . \] Since \(Q_\pi(s_t,a_t)\) is the expectation of \(U_t\), \(u_t\) can be used to approximate \(q_t\). A sketch of this computation follows.
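Computing all the \(u_t\) takes a single backward pass over the recorded rewards, using \(u_t=r_t+\gamma\, u_{t+1}\); a minimal sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running    # u_t = r_t + gamma * u_{t+1}
        returns[t] = running
    return returns

# e.g. discounted_returns([1.0, 0.0, 2.0], gamma=0.9) -> [2.62, 1.8, 2.0]
```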
REINFORCE With Baseline
Combining this with the baseline above, the policy gradient is approximated as: \[ \frac{\partial V_\pi\left(s_t\right)}{\partial \boldsymbol{\theta}} \approx \mathbf{g}\left(a_t\right) \approx \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot (u_t-v(s_t;\mathbf{w})) \] This uses two Monte Carlo approximations and one neural-network approximation:
Approximate expectation using one sample, \(a_t\). (Monte Carlo.)
Approximate \(Q_\pi\left(s_t, a_t\right)\) by \(u_t\). (Another Monte Carlo.)
Approximate \(V_\pi(s)\) by the value network, \(v(s ; \mathbf{w})\).
Build the two networks; since both use convolutional layers (in the Mario demo), they can share parameters, as sketched below.
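One way such parameter sharing might look: a shared convolutional backbone with a policy head and a value head (the input size and layer shapes are illustrative, not taken from the actual demo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedNet(nn.Module):
    """Policy pi(a|s;theta) and value v(s;w) sharing one convolutional backbone."""
    def __init__(self, n_actions):
        super().__init__()
        self.backbone = nn.Sequential(            # shared feature extractor
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * 9 * 9                     # for 3x84x84 input frames
        self.policy_head = nn.Linear(feat_dim, n_actions)
        self.value_head = nn.Linear(feat_dim, 1)

    def forward(self, s):
        h = self.backbone(s)
        return F.softmax(self.policy_head(h), dim=-1), self.value_head(h)

probs, value = SharedNet(n_actions=5)(torch.randn(1, 3, 84, 84))
```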

Combining with policy learning
\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} \cdot\left(u_t-v\left(s_t ; \mathbf{w}\right)\right) . \] Let \(\delta_t=v(s_t ; \mathbf{w})-u_t\);
the update can then be written as \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} . \] The value network \(v(s;\mathbf{w})\) is trained by regression (it should be as close to \(u_t\) as possible).
\(v(s;\mathbf{w})\) is an estimate of \(V_\pi(s_t)\), so it can be trained to fit the return \(u_t\):
Prediction error: \(\delta_t=v\left(s_t ; \mathbf{w}\right)-u_t\).
Gradient: \(\frac{\partial\, \delta_t^2 / 2}{\partial \mathbf{w}}=\delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\).
Gradient descent update: \[ \mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}} . \]
In each episode, perform the following steps:
- Play one episode and record all states, actions, and rewards: \(s_1, a_1, r_1, s_2, a_2, r_2, \cdots, s_n, a_n, r_n .\)
- Compute \(u_t=\sum_{i=t}^n \gamma^{i-t} \cdot r_i\) and \(\delta_t=v\left(s_t ; \mathbf{w}\right)-u_t\).
- Update the policy network with the policy gradient: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}-\beta \cdot \delta_t \cdot \frac{\partial \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right)}{\partial \boldsymbol{\theta}} .\)
- Update the value network by regression: \(\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \delta_t \cdot \frac{\partial v\left(s_t ; \mathbf{w}\right)}{\partial \mathbf{w}}\)
Implementation
```python
class Net(nn.Module):
```
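The original listing is truncated after the class header above. Below is a minimal sketch of how the whole algorithm might be implemented, assuming PyTorch, a classic Gym-style environment `env` (`reset()` returns an observation, `step()` returns `(obs, reward, done, info)`), and separate policy and value networks built from the same `Net` class; all hyperparameters and names besides `Net` are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Net(nn.Module):
    """Small MLP used for both the policy network pi(a|s;theta) and the value network v(s;w)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 64)
        self.fc2 = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

def train(env, state_dim, n_actions, episodes=1000, gamma=0.99, beta=1e-3, alpha=1e-3):
    policy_net = Net(state_dim, n_actions)                    # pi(a|s; theta)
    value_net = Net(state_dim, 1)                             # v(s; w)
    policy_opt = torch.optim.SGD(policy_net.parameters(), lr=beta)
    value_opt = torch.optim.SGD(value_net.parameters(), lr=alpha)

    for _ in range(episodes):
        # 1. Play one episode, recording every state, action, and reward.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False                          # classic Gym API assumed
        while not done:
            probs = F.softmax(policy_net(torch.as_tensor(s, dtype=torch.float32)), dim=-1)
            a = Categorical(probs).sample().item()            # a_t ~ pi(.|s_t; theta)
            s_next, r, done, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next

        # 2. Discounted returns u_t and prediction errors delta_t = v(s_t; w) - u_t.
        returns, running = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        s_batch = torch.as_tensor(states, dtype=torch.float32)
        a_batch = torch.as_tensor(actions)
        u_batch = torch.as_tensor(returns, dtype=torch.float32)
        v_batch = value_net(s_batch).squeeze(-1)
        delta = v_batch.detach() - u_batch                    # no gradient into v here

        # 3. Policy update, summed over t: theta <- theta - beta * delta_t * d ln pi / d theta.
        log_probs = Categorical(F.softmax(policy_net(s_batch), dim=-1)).log_prob(a_batch)
        policy_loss = (delta * log_probs).sum()
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

        # 4. Value regression, summed over t: w <- w - alpha * delta_t * dv/dw.
        value_loss = 0.5 * ((v_batch - u_batch) ** 2).sum()
        value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```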