【RL】Multi-Agent Learning
A system contains multiple decision makers.
Challenges
The environment is non-stationary (the Markov property no longer holds): from each agent's perspective, the environment includes all the other agents' policies, so whenever another agent changes its policy, the environment itself changes.
Partial observability: each agent can only observe part of the environment's state.
Centralized learning is hard to scale: a central controller has to exchange a large amount of information with the agents, the networks' input and output dimensions grow with the number of agents, and the joint action space grows exponentially, making convergence difficult.
Basic Concepts
Relationships between agents
Fully cooperative: the agents must cooperate with each other to complete the task (e.g., industrial robots).
Fully competitive: one side's gain is the other side's loss (e.g., predator and prey).
Mixed cooperative & competitive: both competition and cooperation (e.g., robot soccer: teammates within a team cooperate, while the two teams compete).
Self-interested: each agent only tries to maximize its own payoff; its actions may benefit or hurt other agents, but it does not care about their interests (e.g., stock and futures trading systems, self-driving cars).
State & Action
There are \(n\) agents.
Let \(S\) be the state.
Let \(A^i\) be the \(i\)-th agent's action.
State transition: \(p\left(s^{\prime} \mid s, a^1, \cdots, a^n\right)=\mathbb{P}\left(S^{\prime}=s^{\prime} \mid S=s, A^1=a^1, \cdots, A^n=a^n\right).\)
The next state, \(S^{\prime}\), depends on all the agents' actions.
Each agent's action can affect the environment's next state, so every agent can influence all the other agents.
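To make the joint state transition concrete, here is a minimal toy sketch in Python; the class name, state encoding, and transition rule are illustrative assumptions, not any standard multi-agent API.

```python
class ToyMultiAgentEnv:
    """Toy environment: the next state depends on the joint action.

    All names and the transition rule are illustrative assumptions.
    """

    def __init__(self, n_agents, n_states=10, n_actions=4):
        self.n_agents = n_agents
        self.n_states = n_states
        self.n_actions = n_actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, joint_action):
        """joint_action = [a^1, ..., a^n]; models p(s' | s, a^1, ..., a^n)."""
        assert len(joint_action) == self.n_agents
        # Every agent's action enters the transition, so each agent
        # influences the next state seen by all the others.
        self.state = (self.state + sum(joint_action)) % self.n_states
        return self.state
```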
Rewards
Let \(R^i\) be the reward received by the \(i\)-th agent.
Fully cooperative: \(R^1=R^2=\cdots=R^n\).
Fully competitive: \(R^1 \propto-R^2\).
\(R^i\) depends on \(A^i\) as well as all the other agents' actions \(\left\{A^j\right\}_{j \neq i}\).
Returns
Let \(R_t^i\) be the reward received by the \(i\)-th agent at time \(t\). Return (of the \(i\)-th agent): \[ U_t^i=R_t^i+R_{t+1}^i+R_{t+2}^i+R_{t+3}^i+\cdots \]
Discounted return (of the \(i\)-th agent): \[ U_t^i=R_t^i+\gamma \cdot R_{t+1}^i+\gamma^2 \cdot R_{t+2}^i+\gamma^3 \cdot R_{t+3}^i+\cdots \]
Here, \(\gamma \in[0,1]\) is the discount rate.
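A small helper (the name is ours) for computing a single agent's discounted return from a finite episode of its rewards:

```python
def discounted_return(rewards_i, gamma=0.99):
    """U_t^i = R_t^i + gamma*R_{t+1}^i + gamma^2*R_{t+2}^i + ...

    `rewards_i` lists the i-th agent's rewards from time t to the end
    of a finite episode.
    """
    u = 0.0
    for r in reversed(rewards_i):
        u = r + gamma * u
    return u
```

For example, `discounted_return([1.0, 0.0, 2.0], gamma=0.5)` returns 1.5.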
Policy Network
Each agent has its own policy network: \(\pi\left(a^i \mid s ; \boldsymbol{\theta}^i\right)\).
Policy networks can be exchangeable: \(\boldsymbol{\theta}^1=\boldsymbol{\theta}^2=\cdots=\boldsymbol{\theta}^n\).
- Self-driving cars can have the same policy.
Policy networks can be nonexchangeable: \(\boldsymbol{\theta}^i \neq \boldsymbol{\theta}^j\).
- Soccer players have different roles, e.g., striker, defender, goalkeeper.
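A sketch of how the exchangeable vs. nonexchangeable cases might be set up, assuming PyTorch, discrete actions, and an illustrative helper name and architecture:

```python
import torch.nn as nn

def make_policy_nets(n_agents, obs_dim, n_actions, share_params=False):
    """Build one policy network per agent (illustrative sketch).

    share_params=True gives the exchangeable case (theta^1 = ... = theta^n),
    e.g. a fleet of identical self-driving cars; share_params=False gives
    each agent (striker, defender, goalkeeper, ...) its own parameters.
    """
    def build():
        return nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),
        )
    if share_params:
        shared = build()
        return [shared for _ in range(n_agents)]  # same module object for all
    return [build() for _ in range(n_agents)]      # independent parameters
```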
State-Value Function
State-value of the \(i\)-th agent (it depends not only on the agent's own policy but also on all the other agents' policies): \[ V^i\left(s_t ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)=\mathbb{E}\left[U_t^i \mid S_t=s_t\right] . \]
The expectation is taken w.r.t. all the future actions and states except \(S_t\).
Randomness in actions: \(A_t^j \sim \pi\left(\cdot \mid s_t ; \boldsymbol{\theta}^j\right)\), for all \(j=1, \cdots, n\). (That is why the state-value \(V^i\) depends on \(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\).)
Nash Equilibrium
The condition for convergence in multi-agent problems:
When all the other agents' policies are held fixed, the \(i\)-th agent cannot improve its expected return by changing its own policy.
Every agent is playing a best response to all the other agents' policies.
At this equilibrium, no agent has an incentive to change its own policy.
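With \(J^i\) denoting the \(i\)-th agent's objective (defined just below), the equilibrium condition can be written as:
\[
J^i\left(\boldsymbol{\theta}^{1\star}, \cdots, \boldsymbol{\theta}^{i\star}, \cdots, \boldsymbol{\theta}^{n\star}\right) \geq J^i\left(\boldsymbol{\theta}^{1\star}, \cdots, \boldsymbol{\theta}^{i}, \cdots, \boldsymbol{\theta}^{n\star}\right), \quad \text{for every } \boldsymbol{\theta}^{i} \text{ and every } i=1, \cdots, n .
\]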
What happens if a single-agent policy gradient algorithm is applied directly to the multi-agent problem?

The \(i\)-th agent's policy network: \(\pi\left(a^i \mid s ; \boldsymbol{\theta}^i\right)\).
The \(i\)-th agent's state-value function: \(V^i\left(s ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)\).
Objective function: \(J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)=\mathbb{E}_S\left[V^i\left(S ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)\right]\).
Learn the policy network's parameter, \(\boldsymbol{\theta}^i\), by \[ \max _{\boldsymbol{\theta}^i} J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right) \]
- The \(1^{\text {st }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^1} J^1\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
- The \(2^{\text {nd }}\) agent solves: \(\quad \max _{\theta^2} J^2\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
- ...
- The \(n^{\text {th }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^n} J^n\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
The agents do not share a common objective; each agent updates its own \(\boldsymbol{\theta}^i\). When one agent updates its policy, the objective functions of all the other agents change. Since every agent's objective keeps shifting, training is hard to converge.
Therefore, single-agent training methods cannot be applied directly to multi-agent problems.
Centralized vs. Decentralized
Architectures
Fully decentralized: each agent trains its own policy using only its own observations and rewards; agents do not communicate with each other.
Fully centralized: all agents send their information to a central controller; the controller makes the decisions, and the agents only execute them.
Centralized training with decentralized execution: during training, the central controller collects information from all agents to help train their policy networks; after training, each agent makes decisions with its own policy network and no longer relies on communication with the controller.
Partial Observation
An agent may or may not have full knowledge of the state, \(s\).
Let \(o^i\) be the \(i\)-th agent's observation.
Partial observation: \(o^i \neq s\).
Full observation: \(o^1=\cdots=o^n=s\).
Fully Decentralized
This is essentially single-agent reinforcement learning.

The \(i\)-th agent has a policy network (actor): \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\).
The \(i\)-th agent has a value network (critic): \(q\left(o^i, a^i ; \mathbf{w}^i\right)\).
Agents do not share observations and actions.
Train the policy and value networks in the same way as the single-agent setting.
This does not work well in practice.
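A minimal sketch of one fully decentralized training step, assuming PyTorch, discrete actions, a critic that outputs one value per action, and a SARSA-style TD target; the function and argument names are ours:

```python
import torch
import torch.nn.functional as F

def decentralized_update(actor, critic, actor_opt, critic_opt,
                         o, a, r, o_next, gamma=0.99):
    """One independent actor-critic step for a single agent.

    Uses only the agent's own (o^i, a^i, r^i, o^i'); `actor(o)` returns action
    probabilities and `critic(o)` returns one value per action (illustrative).
    """
    # Value network (critic): SARSA-style TD update on q(o^i, a^i; w^i).
    with torch.no_grad():
        a_next = torch.distributions.Categorical(actor(o_next)).sample()
        td_target = r + gamma * critic(o_next)[a_next]
    q_sa = critic(o)[a]
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy network (actor): policy gradient weighted by the critic's value.
    actor_loss = -torch.log(actor(o)[a]) * q_sa.detach()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```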
Fully Centralized
The agents themselves have no policy networks; each agent sends its observation to the central controller, and the controller makes the decisions. The controller learns one policy per agent:
\[
\pi\left(a^i \mid o^1, \cdots, o^n ; \boldsymbol{\theta}^i\right), \quad \text{for all } i=1,2, \cdots, n .
\]
Let \(\mathbf{a}=\left[a^1, a^2, \cdots, a^n\right]\) contain all the agents' actions.
Let \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\) contain all the agents' observations.
The central controller knows \(\mathbf{a}, \mathbf{o}\), and all the rewards.
The controller has \(n\) policy networks and \(n\) value networks:
- Policy network (actor) for the \(i\)-th agent: \(\pi\left(a^i \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\).
- Value network (critic) for the \(i\)-th agent: \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\).
Centralized Training: Training is performed by the controller.
- The controller knows all the observations, actions, and rewards.
- Train \(\pi\left(a^i \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\) using policy gradient.
- Train \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\) using TD algorithm.
Centralized Execution: Decisions are made by the controller.
- For all \(i\), the \(i\)-th agent sends its observation, \(o^i\), to the controller.
- The controller knows \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\).
- For all \(i\), the controller samples action by \(a^i \sim \pi\left(\cdot \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\) and sends \(a^i\) to the \(i\)-th agent.
The central controller knows the global information, so it can make good decisions for all the agents.
But execution is slow: the agents have no decision-making authority and must wait for the controller's decisions (sending and synchronizing the information takes time, and everyone waits for the slowest agent).
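A sketch of one centralized decision step on the controller side, assuming PyTorch actors that each take the concatenated global observation (all names are illustrative):

```python
import torch

def centralized_execution_step(actors, observations):
    """Controller-side decision making.

    `observations` is a list [o^1, ..., o^n] of per-agent observation tensors;
    each element of `actors` is a policy network pi(a^i | o; theta^i) that
    takes the concatenated global observation o.
    """
    o = torch.cat(observations, dim=-1)           # global observation o
    joint_action = []
    for actor in actors:                          # one policy network per agent
        probs = actor(o)                          # pi(. | o; theta^i)
        a_i = torch.distributions.Categorical(probs).sample()
        joint_action.append(int(a_i))             # send a^i back to agent i
    return joint_action
```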
Centralized Training with Decentralized Execution
Each agent has its own policy network (actor): \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\).
The central controller has \(n\) value networks (critics): \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\).
Centralized Training: During training, the central controller knows all the agents' observations, actions, and rewards.
Decentralized Execution: During execution, the central controller and its value networks are not used.
Training phase

The central controller trains the critics, \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\), for all \(i\).
To update \(\mathbf{w}^i\), TD algorithm takes as inputs:
- All the actions: \(\mathbf{a}=\left[a^1, a^2, \cdots, a^n\right]\).
- All the observations: \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\).
- The \(i\)-th reward: \(r^i\).

Each agent locally trains the actor, \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\), using policy gradient.
To update \(\boldsymbol{\theta}^i\), the policy gradient algorithm takes as inputs \(a^i\), \(o^i\), and the critic's value \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\) sent back by the central controller.
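A sketch of one centralized training step under this scheme, assuming PyTorch, discrete actions, hypothetical critic modules that take the concatenated observations together with the joint action, and actors that take only the local observation:

```python
import torch
import torch.nn.functional as F

def ctde_training_step(actors, actor_opts, critics, critic_opts,
                       obs, acts, rewards, next_obs, next_acts, gamma=0.99):
    """One centralized-training step (illustrative sketch, discrete actions).

    obs/next_obs: lists [o^1, ..., o^n] of tensors; acts/next_acts: lists of
    ints [a^1, ..., a^n]; rewards: list [r^1, ..., r^n]. Each critic is a
    module q_i(o, a) over the global observation and joint action (assumed).
    """
    o, o_next = torch.cat(obs), torch.cat(next_obs)    # global observations
    a = torch.tensor(acts, dtype=torch.float32)        # joint action
    a_next = torch.tensor(next_acts, dtype=torch.float32)

    for i in range(len(actors)):
        # Central controller: TD update of q(o, a; w^i) from (o, a, r^i).
        with torch.no_grad():
            td_target = rewards[i] + gamma * critics[i](o_next, a_next)
        q_i = critics[i](o, a)
        critic_loss = F.mse_loss(q_i, td_target)
        critic_opts[i].zero_grad()
        critic_loss.backward()
        critic_opts[i].step()

        # Agent i: local policy-gradient update of pi(a^i | o^i; theta^i),
        # using only (a^i, o^i) and the critic value sent by the controller.
        actor_loss = -torch.log(actors[i](obs[i])[acts[i]]) * q_i.detach()
        actor_opts[i].zero_grad()
        actor_loss.backward()
        actor_opts[i].step()
```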
Execution phase

Each agent receives its own observation and makes decisions with its own policy network; agents do not communicate with each other.
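A matching sketch of decentralized execution, where each agent acts from its own observation and the critics and controller are not used:

```python
import torch

def decentralized_execution_step(actors, observations):
    """Each agent samples its action from pi(. | o^i; theta^i) locally."""
    return [int(torch.distributions.Categorical(actor(o)).sample())
            for actor, o in zip(actors, observations)]
```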