【RL】Multi-Agent Learning
A system contains multiple decision makers.
Challenges
The environment is non-stationary (the Markov property no longer holds): from each agent's perspective, the environment includes all the other agents' policies, so whenever another agent changes its policy, the environment itself changes.
Partial observability: each agent can only observe part of the environment's state.
Centralized learning is hard to scale: a central controller has to exchange a large amount of information with the agents, the networks' input and output dimensions grow with the number of agents, and the joint action space grows exponentially, making convergence difficult.
Basic Concepts
Relationships between agents
Fully cooperative: the agents must cooperate with each other to complete the task (e.g., industrial robots).
Fully competitive: one side's gain is the other side's loss (e.g., predator and prey).
Mixed cooperative & competitive: both competition and cooperation (e.g., robot soccer: teammates within a team cooperate, while the two teams compete).
Self-interested: each agent only tries to maximize its own payoff; its actions may benefit or hurt other agents, but it does not care about their interests (e.g., stock and futures trading systems, self-driving cars).
State & Action
There are \(n\) agents.
Let \(S\) be the state.
Let \(A^i\) be the \(i\)-th agent's action.
State transition: \(p\left(s^{\prime} \mid s, a^1, \cdots, a^n\right)=\mathbb{P}\left(S^{\prime}=s^{\prime} \mid S=s, A^1=a^1, \cdots, A^n=a^n\right).\)
The next state, \(S^{\prime}\), depends on all the agents' actions.
Each agent's action can affect the environment's next state, so every agent can influence all the other agents.
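To make the joint state transition concrete, here is a minimal toy sketch in Python; the class name, state encoding, and transition rule are illustrative assumptions, not any standard multi-agent API.

```python
class ToyMultiAgentEnv:
    """Toy environment: the next state depends on the joint action.

    All names and the transition rule are illustrative assumptions.
    """

    def __init__(self, n_agents, n_states=10, n_actions=4):
        self.n_agents = n_agents
        self.n_states = n_states
        self.n_actions = n_actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, joint_action):
        """joint_action = [a^1, ..., a^n]; models p(s' | s, a^1, ..., a^n)."""
        assert len(joint_action) == self.n_agents
        # Every agent's action enters the transition, so each agent
        # influences the next state seen by all the others.
        self.state = (self.state + sum(joint_action)) % self.n_states
        return self.state
```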
Rewards
Let \(R^i\) be the reward received by the \(i\)-th agent.
Fully cooperative: \(R^1=R^2=\cdots=R^n\).
Fully competitive: \(R^1 \propto-R^2\).
\(R^i\) depends on \(A^i\) as well as all the other agents' actions \(\left\{A^j\right\}_{j \neq i}\).
Returns
Let \(R_t^i\) be the reward received by the \(i\)-th agent at time \(t\). Return (of the \(i\)-th agent): \[ U_t^i=R_t^i+R_{t+1}^i+R_{t+2}^i+R_{t+3}^i+\cdots \]
Discounted return (of the \(i\)-th agent): \[ U_t^i=R_t^i+\gamma \cdot R_{t+1}^i+\gamma^2 \cdot R_{t+2}^i+\gamma^3 \cdot R_{t+3}^i+\cdots \]
Here, \(\gamma \in[0,1]\) is the discount rate.
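A small helper (the name is ours) for computing a single agent's discounted return from a finite episode of its rewards:

```python
def discounted_return(rewards_i, gamma=0.99):
    """U_t^i = R_t^i + gamma*R_{t+1}^i + gamma^2*R_{t+2}^i + ...

    `rewards_i` lists the i-th agent's rewards from time t to the end
    of a finite episode.
    """
    u = 0.0
    for r in reversed(rewards_i):
        u = r + gamma * u
    return u
```

For example, `discounted_return([1.0, 0.0, 2.0], gamma=0.5)` returns 1.5.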
Policy Network
Each agent has its own policy network: \(\pi\left(a^i \mid s ; \boldsymbol{\theta}^i\right)\).
Policy networks can be exchangeable: \(\boldsymbol{\theta}^1=\boldsymbol{\theta}^2=\cdots=\boldsymbol{\theta}^n\).
- Self-driving cars can have the same policy.
Policy networks can be nonexchangeable: \(\boldsymbol{\theta}^i \neq \boldsymbol{\theta}^j\).
- Soccer players have different roles, e.g., striker, defender, goalkeeper.
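A sketch of how the exchangeable vs. nonexchangeable cases might be set up, assuming PyTorch, discrete actions, and an illustrative helper name and architecture:

```python
import torch.nn as nn

def make_policy_nets(n_agents, obs_dim, n_actions, share_params=False):
    """Build one policy network per agent (illustrative sketch).

    share_params=True gives the exchangeable case (theta^1 = ... = theta^n),
    e.g. a fleet of identical self-driving cars; share_params=False gives
    each agent (striker, defender, goalkeeper, ...) its own parameters.
    """
    def build():
        return nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),
        )
    if share_params:
        shared = build()
        return [shared for _ in range(n_agents)]  # same module object for all
    return [build() for _ in range(n_agents)]      # independent parameters
```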
State-Value Function
State-value of the \(i\)-th agent (it depends not only on the agent's own policy but also on all the other agents' policies): \[ V^i\left(s_t ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)=\mathbb{E}\left[U_t^i \mid S_t=s_t\right] . \]
The expectation is taken w.r.t. all the future actions and states except \(S_t\).
Randomness in actions: \(A_t^j \sim \pi\left(\cdot \mid s_t ; \boldsymbol{\theta}^j\right)\), for all \(j=1, \cdots, n\). (That is why the state-value \(V^i\) depends on \(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\).)
Nash Equilibrium
The condition for convergence in multi-agent problems:
When all the other agents' policies are held fixed, the \(i\)-th agent cannot improve its expected return by changing its own policy.
Every agent is playing a best response to all the other agents' policies.
At this equilibrium, no agent has an incentive to change its own policy.
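With \(J^i\) denoting the \(i\)-th agent's objective (defined just below), the equilibrium condition can be written as:
\[
J^i\left(\boldsymbol{\theta}^{1\star}, \cdots, \boldsymbol{\theta}^{i\star}, \cdots, \boldsymbol{\theta}^{n\star}\right) \geq J^i\left(\boldsymbol{\theta}^{1\star}, \cdots, \boldsymbol{\theta}^{i}, \cdots, \boldsymbol{\theta}^{n\star}\right), \quad \text{for every } \boldsymbol{\theta}^{i} \text{ and every } i=1, \cdots, n .
\]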
What happens if a single-agent policy gradient algorithm is applied directly to the multi-agent problem?

The \(i\)-th agent's policy network: \(\pi\left(a^i \mid s ; \boldsymbol{\theta}^i\right)\).
The \(i\)-th agent's state-value function: \(V^i\left(s ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)\).
Objective function: \(J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)=\mathbb{E}_S\left[V^i\left(S ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)\right]\).
Learn the policy network's parameter, \(\boldsymbol{\theta}^i\), by \[ \max _{\boldsymbol{\theta}^i} J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right) \]
- The \(1^{\text {st }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^1} J^1\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
- The \(2^{\text {nd }}\) agent solves: \(\quad \max _{\theta^2} J^2\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
- ...
- The \(n^{\text {th }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^n} J^n\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
The agents do not share a common objective; each agent updates its own \(\boldsymbol{\theta}^i\). When one agent updates its policy, the objective functions of all the other agents change. Since every agent's objective keeps shifting, training is hard to converge.
Therefore, single-agent training methods cannot be applied directly to multi-agent problems.
Centralized vs. Decentralized
Architectures
Fully decentralized: each agent trains its own policy using only its own observations and rewards; agents do not communicate with each other.
Fully centralized: all agents send their information to a central controller; the controller makes the decisions, and the agents only execute them.
Centralized training with decentralized execution: during training, the central controller collects information from all agents to help train their policy networks; after training, each agent makes decisions with its own policy network and no longer relies on communication with the controller.
Partial Observation
An agent may or may not have full knowledge of the state, \(s\).
Let \(o^i\) be the \(i\)-th agent's observation.
Partial observation: \(o^i \neq s\).
Full observation: \(o^1=\cdots=o^n=s\).
Fully Decentralized
This is essentially single-agent reinforcement learning.

The \(i\)-th agent has a policy network (actor): \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\).
The \(i\)-th agent has a value network (critic): \(q\left(o^i, a^i ; \mathbf{w}^i\right)\).
Agents do not share observations and actions.
Train the policy and value networks in the same way as the single-agent setting.
This does not work well in practice.
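A minimal sketch of one fully decentralized training step, assuming PyTorch, discrete actions, a critic that outputs one value per action, and a SARSA-style TD target; the function and argument names are ours:

```python
import torch
import torch.nn.functional as F

def decentralized_update(actor, critic, actor_opt, critic_opt,
                         o, a, r, o_next, gamma=0.99):
    """One independent actor-critic step for a single agent.

    Uses only the agent's own (o^i, a^i, r^i, o^i'); `actor(o)` returns action
    probabilities and `critic(o)` returns one value per action (illustrative).
    """
    # Value network (critic): SARSA-style TD update on q(o^i, a^i; w^i).
    with torch.no_grad():
        a_next = torch.distributions.Categorical(actor(o_next)).sample()
        td_target = r + gamma * critic(o_next)[a_next]
    q_sa = critic(o)[a]
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy network (actor): policy gradient weighted by the critic's value.
    actor_loss = -torch.log(actor(o)[a]) * q_sa.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```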
Fully Centralized
The agents themselves have no policy networks; each agent sends its observation to the central controller, and the controller makes the decisions. The controller learns one policy per agent:
\[
\pi\left(a^i \mid o^1, \cdots, o^n ; \boldsymbol{\theta}^i\right), \quad \text{for all } i=1,2, \cdots, n .
\]
Let \(\mathbf{a}=\left[a^1, a^2, \cdots, a^n\right]\) contain all the agents' actions.
Let \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\) contain all the agents' observations.
The central controller knows \(\mathbf{a}, \mathbf{o}\), and all the rewards.
The controller has \(n\) policy networks and \(n\) value networks:
- Policy network (actor) for the \(i\)-th agent: \(\pi\left(a^i \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\).
- Value network (critic) for the \(i\)-th agent: \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\).
Centralized Training: Training is performed by the controller.
- The controller knows all the observations, actions, and rewards.
- Train \(\pi\left(a^i \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\) using policy gradient.
- Train \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\) using TD algorithm.
Centralized Execution: Decisions are made by the controller.
- For all \(i\), the \(i\)-th agent sends its observation, \(o^i\), to the controller.
- The controller knows \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\).
- For all \(i\), the controller samples action by \(a^i \sim \pi\left(\cdot \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\) and sends \(a^i\) to the \(i\)-th agent.
The central controller knows the global information, so it can make good decisions for all the agents.
But execution is slow: the agents have no decision-making authority and must wait for the controller's decisions (sending and synchronizing the information takes time, and everyone waits for the slowest agent).
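A sketch of one centralized decision step on the controller side, assuming PyTorch actors that each take the concatenated global observation (all names are illustrative):

```python
import torch

def centralized_execution_step(actors, observations):
    """Controller-side decision making.

    `observations` is a list [o^1, ..., o^n] of per-agent observation tensors;
    each element of `actors` is a policy network pi(a^i | o; theta^i) that
    takes the concatenated global observation o.
    """
    o = torch.cat(observations, dim=-1)           # global observation o
    joint_action = []
    for actor in actors:                          # one policy network per agent
        probs = actor(o)                          # pi(. | o; theta^i)
        a_i = torch.distributions.Categorical(probs).sample()
        joint_action.append(int(a_i))             # send a^i back to agent i
    return joint_action
```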
Centralized Training with Decentralized Execution
Each agent has its own policy network (actor): \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\).
The central controller has \(n\) value networks (critics): \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\).
Centralized Training: During training, the central controller knows all the agents' observations, actions, and rewards.
Decentralized Execution: During execution, the central controller and its value networks are not used.
Training phase

The central controller trains the critics, \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\), for all \(i\).
To update \(\mathbf{w}^i\), TD algorithm takes as inputs:
- All the actions: \(\mathbf{a}=\left[a^1, a^2, \cdots, a^n\right]\).
- All the observations: \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\).
- The \(i\)-th reward: \(r^i\).

Each agent locally trains the actor, \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\), using policy gradient.
To update \(\boldsymbol{\theta}^i\), the policy gradient algorithm takes as inputs \(a^i\), \(o^i\), and the critic's value \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\) sent back by the central controller.
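A sketch of one centralized training step under this scheme, assuming PyTorch, discrete actions, hypothetical critic modules that take the concatenated observations together with the joint action, and actors that take only the local observation:

```python
import torch
import torch.nn.functional as F

def ctde_training_step(actors, actor_opts, critics, critic_opts,
                       obs, acts, rewards, next_obs, next_acts, gamma=0.99):
    """One centralized-training step (illustrative sketch, discrete actions).

    obs/next_obs: lists [o^1, ..., o^n] of tensors; acts/next_acts: lists of
    ints [a^1, ..., a^n]; rewards: list [r^1, ..., r^n]. Each critic is a
    module q_i(o, a) over the global observation and joint action (assumed).
    """
    o, o_next = torch.cat(obs), torch.cat(next_obs)    # global observations
    a = torch.tensor(acts, dtype=torch.float32)        # joint action
    a_next = torch.tensor(next_acts, dtype=torch.float32)

    for i in range(len(actors)):
        # Central controller: TD update of q(o, a; w^i) from (o, a, r^i).
        with torch.no_grad():
            td_target = rewards[i] + gamma * critics[i](o_next, a_next)
        q_i = critics[i](o, a)
        critic_loss = F.mse_loss(q_i, td_target)
        critic_opts[i].zero_grad()
        critic_loss.backward()
        critic_opts[i].step()

        # Agent i: local policy-gradient update of pi(a^i | o^i; theta^i),
        # using only (a^i, o^i) and the critic value sent by the controller.
        actor_loss = -torch.log(actors[i](obs[i])[acts[i]]) * q_i.detach()
        actor_opts[i].zero_grad()
        actor_loss.backward()
        actor_opts[i].step()
```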
Execution phase

Each agent receives its own observation and makes decisions with its own policy network; agents do not communicate with each other.
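A matching sketch of decentralized execution, where each agent acts from its own observation and the critics and controller are not used:

```python
import torch

def decentralized_execution_step(actors, observations):
    """Each agent samples its action from pi(. | o^i; theta^i) locally."""
    return [int(torch.distributions.Categorical(actor(o)).sample())
            for actor, o in zip(actors, observations)]
```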