【RL】Multi-Agent Learning

Multiple decision-makers exist within a single system.

Challenges

  • Non-stationary environment (the Markov property no longer holds): from each agent's perspective, the environment includes the other agents' policies, so changes in those policies make the environment non-stationary.

  • Partial observability: each agent can only observe part of the environment's state.

  • Centralized learning is impractical: a central controller would have to exchange a large amount of information with the agents, and the networks' input/output dimensions grow with the number of agents (the joint action space grows exponentially), making training hard to converge.

Basic Concepts

Relationships Between Agents

  • Fully cooperative: the agents must cooperate with each other to complete the task (e.g., industrial robots).

  • Fully competitive: one side's gain is the other side's loss (e.g., predator and prey).

  • Mixed cooperative & competitive: both cooperation and competition are present (e.g., robot soccer: cooperation within a team, competition between teams).

  • Self-interested: each agent only wants to maximize its own return; its actions may benefit or harm others, but it does not care about their interests (e.g., stock and futures trading systems, self-driving cars).

State & Action

There are \(n\) agents.

Let \(S\) be the state.

Let \(A^i\) be the \(i\)-th agent's action.

State transition: \(p\left(s^{\prime} \mid s, a^1, \cdots, a^n\right)=\mathbb{P}\left(S^{\prime}=s^{\prime} \mid S=s, A^1=a^1, \cdots, A^n=a^n\right).\)

The next state, \(S^{\prime}\), depends on all the agents' actions.

Every agent's action can affect the environment's next state, so every agent can influence all the other agents.
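As a concrete illustration of the joint transition, here is a toy environment whose step function consumes the whole joint action. The class, its dynamics, and the reward rule are hypothetical and only for illustration, not a standard multi-agent API.

```python
class JointActionEnv:
    """Toy environment: the next state depends on ALL agents' actions."""

    def __init__(self, n_agents: int, n_states: int, n_actions: int):
        self.n_agents = n_agents
        self.n_states = n_states
        self.n_actions = n_actions
        self.state = 0

    def step(self, joint_action):
        # joint_action = [a^1, ..., a^n]; the transition p(s' | s, a^1, ..., a^n)
        # is a function of the whole joint action, not of any single agent's action.
        assert len(joint_action) == self.n_agents
        self.state = (self.state + sum(joint_action)) % self.n_states
        # Each agent i receives its own reward r^i (a toy rule here).
        rewards = [float(joint_action[i] == joint_action[(i + 1) % self.n_agents])
                   for i in range(self.n_agents)]
        return self.state, rewards
```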

Rewards

Let \(R^i\) be the reward received by the \(i\)-th agent.

Fully cooperative: \(R^1=R^2=\cdots=R^n\).

Fully competitive: \(R^1 \propto-R^2\).

\(R^i\) depends on \(A^i\) as well as all the other agents' actions \(\left\{A^j\right\}_{j \neq i}\).

Returns

Let \(R_t^i\) be the reward received by the \(i\)-th agent at time \(t\). Return (of the \(i\)-th agent): \[ U_t^i=R_t^i+R_{t+1}^i+R_{t+2}^i+R_{t+3}^i+\cdots \]

Discounted return (of the \(i\)-th agent): \[ U_t^i=R_t^i+\gamma \cdot R_{t+1}^i+\gamma^2 \cdot R_{t+2}^i+\gamma^3 \cdot R_{t+3}^i+\cdots \]

Here, \(\gamma \in[0,1]\) is the discount rate.
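A minimal sketch of computing the discounted return from one agent's reward sequence (the function name is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute U_t^i = R_t^i + gamma*R_{t+1}^i + gamma^2*R_{t+2}^i + ... for one agent."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Example: agent i observes rewards [1.0, 0.0, 2.0] from time t onward.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```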

Policy Network

Each agent has its own policy network: \(\pi\left(a^i \mid s ; \boldsymbol{\theta}^i\right)\).

Policy networks can be exchangeable: \(\boldsymbol{\theta}^1=\boldsymbol{\theta}^2=\cdots=\boldsymbol{\theta}^n\).

  • Self-driving cars can have the same policy.

Policy networks can be nonexchangeable: \(\boldsymbol{\theta}^i \neq \boldsymbol{\theta}^j\).

  • Soccer players have different roles, e.g., striker, defender, goalkeeper.
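A PyTorch sketch of how per-agent policy networks might be instantiated, either with separate parameters \(\boldsymbol{\theta}^i\) or with shared (exchangeable) parameters. The architecture, dimensions, and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi(a^i | s; theta^i): maps a state to a probability distribution over actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

n_agents, state_dim, n_actions = 3, 8, 4

# Nonexchangeable: each agent has its own parameters theta^i (e.g., soccer roles).
actors = [PolicyNet(state_dim, n_actions) for _ in range(n_agents)]

# Exchangeable: all agents share one set of parameters (e.g., identical self-driving cars).
shared = PolicyNet(state_dim, n_actions)
actors_shared = [shared for _ in range(n_agents)]
```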

State-Value Function

State-value of the \(i\)-th agent (it depends not only on the agent's own policy but also on all the other agents' policies): \[ V^i\left(s_t ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)=\mathbb{E}\left[U_t^i \mid S_t=s_t\right] . \]

The expectation is taken w.r.t. all the future actions and states except \(S_t\).

Randomness in actions: \(A_t^j \sim \pi\left(\cdot \mid s_t ; \boldsymbol{\theta}^j\right)\), for all \(j=1, \cdots, n\). (That is why the state-value \(V^i\) depends on \(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\).)
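Since \(V^i\) is an expectation over every agent's random actions, one way to approximate it is by Monte Carlo rollouts in which all \(n\) policies are sampled. The sketch below assumes an environment and actors with the illustrative interfaces sketched earlier (one-hot state input, list of per-agent rewards); it is not a standard routine.

```python
import torch
import torch.nn.functional as F

def estimate_state_value(env, actors, i, s0, gamma=0.99, n_rollouts=100, horizon=50):
    """Monte Carlo estimate of V^i(s_t; theta^1, ..., theta^n).

    Every agent j samples A^j ~ pi(. | s; theta^j), which is why the estimate
    depends on ALL policy parameters, not only theta^i.
    """
    total = 0.0
    for _ in range(n_rollouts):
        env.state = s0                     # reset the toy environment to s_t
        s, u, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            s_vec = F.one_hot(torch.tensor(s), env.n_states).float()
            joint_action = [int(torch.multinomial(actor(s_vec), 1)) for actor in actors]
            s, rewards = env.step(joint_action)
            u += discount * rewards[i]     # accumulate agent i's discounted reward
            discount *= gamma
        total += u
    return total / n_rollouts
```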

Nash Equilibrium

The condition under which a multi-agent learning problem is considered to have converged:

When all the other agents' policies are held fixed, the \(i\)-th agent cannot obtain a better expected return by changing its own policy.

Every agent is playing a best response to the other agents' policies.

At this equilibrium, no agent has an incentive to change its own policy.
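One standard way to write this condition with the state-value function defined above: a policy profile \(\left(\boldsymbol{\theta}^{1 \star}, \cdots, \boldsymbol{\theta}^{n \star}\right)\) is a Nash equilibrium if, for every agent \(i\) and every alternative parameter \(\boldsymbol{\theta}^i\), \[ \mathbb{E}_S\left[V^i\left(S ; \boldsymbol{\theta}^{1 \star}, \cdots, \boldsymbol{\theta}^{i \star}, \cdots, \boldsymbol{\theta}^{n \star}\right)\right] \geq \mathbb{E}_S\left[V^i\left(S ; \boldsymbol{\theta}^{1 \star}, \cdots, \boldsymbol{\theta}^{i}, \cdots, \boldsymbol{\theta}^{n \star}\right)\right] . \]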

Directly Applying a Single-Agent Policy Gradient Algorithm to a Multi-Agent Problem


The \(i\)-th agent's policy network: \(\pi\left(a^i \mid s ; \boldsymbol{\theta}^i\right)\).

The \(i\)-th agent's state-value function: \(V^i\left(s ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)\).

Objective function: \(J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)=\mathbb{E}_S\left[V^i\left(S ; \boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right)\right]\).

Learn the policy network's parameter, \(\boldsymbol{\theta}^i\), by \[ \max _{\boldsymbol{\theta}^i} J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right) \]

  • The \(1^{\text {st }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^1} J^1\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
  • The \(2^{\text {nd }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^2} J^2\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).
  • ...
  • The \(n^{\text {th }}\) agent solves: \(\quad \max _{\boldsymbol{\theta}^n} J^n\left(\boldsymbol{\theta}^1, \boldsymbol{\theta}^2, \cdots, \boldsymbol{\theta}^n\right)\).

The agents have no common objective; each one updates its own \(\theta^i\). When one agent updates its policy, the objective functions of all the other agents may change. Since everyone's objective keeps shifting, training is hard to converge.

Therefore, single-agent training methods cannot be applied directly to multi-agent problems.
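Concretely, naive independent learning amounts to simultaneous gradient ascent (the step size \(\beta\) is shown only for illustration): \[ \boldsymbol{\theta}^i \leftarrow \boldsymbol{\theta}^i+\beta \cdot \nabla_{\boldsymbol{\theta}^i} J^i\left(\boldsymbol{\theta}^1, \cdots, \boldsymbol{\theta}^n\right), \quad \text { for all } i=1, \cdots, n . \] As soon as agent \(j\) changes \(\boldsymbol{\theta}^j\), every other \(J^i\) is evaluated at a different point, which is exactly the moving-target problem described above.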

Centralized vs. Decentralized

Architectures

  • Fully decentralized: each agent trains its own policy using only its own observations and rewards; agents do not communicate.

  • Fully centralized: all agents send their information to a central controller, which makes the decisions; the agents only execute them.

  • Centralized training with decentralized execution: during training, a central controller collects information from all agents to help train their policy networks; after training, each agent makes decisions with its own policy network and no longer relies on communication with the controller.

Partial Observation

An agent may or may not have full knowledge of the state, \(s\).

Let \(o^i\) be the \(i\)-th agent's observation.

Partial observation: \(o^i \neq s\).

Full observation: \(\quad o^1=\cdots=o^n=s\).

Fully Decentralized

This is essentially single-agent reinforcement learning.


The \(i\)-th agent has a policy network (actor): \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\).

The \(i\)-th agent has a value network (critic): \(q\left(o^i, a^i ; \mathbf{w}^i\right)\).

Agents do not share observations and actions.

Train the policy and value networks in the same way as the single-agent setting.

This does not work well.
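A hedged sketch of one fully decentralized training step, where agent \(i\) runs an ordinary single-agent actor-critic update on its own \((o^i, a^i, r^i, o^{i\prime})\) transition. The network interfaces, optimizers, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def decentralized_step(actor, critic, actor_opt, critic_opt,
                       o, a, r, o_next, gamma=0.99):
    """One actor-critic update for agent i, using ONLY its local (o^i, a^i, r^i, o^i').

    Assumed (illustrative) interfaces:
      actor(o)     -> action probabilities over agent i's actions, shape (n_actions,)
      critic(o, a) -> scalar estimate of q(o^i, a^i; w^i)
    """
    # SARSA-style TD target: sample the next action from the current policy.
    with torch.no_grad():
        a_next = int(torch.multinomial(actor(o_next), 1))
        td_target = r + gamma * critic(o_next, a_next)

    # Critic: TD learning on q(o^i, a^i; w^i).
    q = critic(o, a)
    critic_loss = F.mse_loss(q, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: policy gradient weighted by the critic's value.
    actor_loss = -critic(o, a).detach() * torch.log(actor(o)[a])
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```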

Fully Centralized

The agents themselves have no policy networks; each agent sends its observation to the central controller, and the controller makes the decisions. \[ \pi\left(a^i \mid o^1, \cdots, o^n ; \boldsymbol{\theta}^i\right) \text {, for all } i=1,2, \cdots, n \text {. } \]

Let \(\mathbf{a}=\left[a^1, a^2, \cdots, a^n\right]\) contain all the agents' actions.

Let \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\) contain all the agents' observations.

The central controller knows \(\mathbf{a}, \mathbf{o}\), and all the rewards.

The controller has \(n\) policy networks and \(n\) value networks:

  • Policy network (actor) for the \(i\)-th agent: \(\pi\left(a^i \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\).
  • Value network (critic) for the \(i\)-th agent: \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\).

Centralized Training: Training is performed by the controller.

  • The controller knows all the observations, actions, and rewards.
  • Train \(\pi\left(a^i \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\) using policy gradient.
  • Train \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\) using TD algorithm.

Centralized Execution: Decisions are made by the controller.

  • For all \(i\), the \(i\)-th agent sends its observation, \(o^i\), to the controller.
  • The controller knows \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\).
  • For all \(i\), the controller samples action by \(a^i \sim \pi\left(\cdot \mid \mathbf{o} ; \boldsymbol{\theta}^i\right)\) and sends \(a^i\) to the \(i\)-th agent.

The central controller knows the global information, so it can make good decisions for all the agents.

However, execution is slow: the agents have no decision-making authority and must wait for the controller (sending and synchronizing information takes time, and everyone waits for the slowest agent).
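A hedged sketch of centralized execution: the controller concatenates all observations and samples every agent's action itself. The class, dimensions, and toy usage below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CentralController:
    """Holds n actors pi(a^i | o; theta^i); o is the concatenation of ALL observations."""

    def __init__(self, n_agents: int, obs_dim: int, n_actions: int, hidden: int = 64):
        self.actors = [
            nn.Sequential(nn.Linear(n_agents * obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))
            for _ in range(n_agents)
        ]

    def decide(self, observations):
        """observations: list of n tensors, one o^i per agent (each of size obs_dim)."""
        o = torch.cat(observations)          # the controller sees o = [o^1, ..., o^n]
        actions = []
        for actor in self.actors:            # sample a^i ~ pi(. | o; theta^i)
            probs = actor(o)
            actions.append(int(torch.multinomial(probs, 1)))
        return actions                       # sent back to the agents for execution

controller = CentralController(n_agents=3, obs_dim=4, n_actions=5)
obs = [torch.randn(4) for _ in range(3)]     # each agent uploads its observation o^i
print(controller.decide(obs))
```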

Centralized Training with Decentralized Execution

Each agent has its own policy network (actor): \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\).

The central controller has \(n\) value networks (critics): \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\).

Centralized Training: During training, the central controller knows all the agents' observations, actions, and rewards.

Decentralized Execution: During execution, the central controller and its value networks are not used.

Training Stage


The central controller trains the critics, \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\), for all \(i\).

To update \(\mathbf{w}^i\), TD algorithm takes as inputs:

  • All the actions: \(\mathbf{a}=\left[a^1, a^2, \cdots, a^n\right]\).
  • All the observations: \(\mathbf{o}=\left[o^1, o^2, \cdots, o^n\right]\).
  • The \(i\)-th reward: \(r^i\).

Each agent locally trains the actor, \(\pi\left(a^i \mid o^i ; \boldsymbol{\theta}^i\right)\), using policy gradient.

To update \(\boldsymbol{\theta}^i\), the policy gradient algorithm takes as inputs \(a^i\), \(o^i\), and the critic value \(q\left(\mathbf{o}, \mathbf{a} ; \mathbf{w}^i\right)\) provided by the central controller.
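A hedged sketch of one CTDE training step under these definitions: the central critic \(q(\mathbf{o}, \mathbf{a}; \mathbf{w}^i)\) is trained by TD on the joint observations and actions, and agent \(i\)'s local actor is updated with its own \((a^i, o^i)\) plus the critic value sent down by the controller. All interfaces and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ctde_training_step(critic_i, critic_opt_i, actor_i, actor_opt_i, actors,
                       obs, actions, r_i, obs_next, i, gamma=0.99):
    """One training step for agent i under centralized training.

    Assumed (illustrative) interfaces:
      critic_i(o_all, a_all) -> scalar estimate of q(o, a; w^i)
      actor_j(o_j)           -> action probabilities for agent j
      obs, actions           -> lists [o^1..o^n], [a^1..a^n] gathered by the controller
    """
    # --- Central controller: TD update of the i-th critic on the JOINT data. ---
    with torch.no_grad():
        a_next = [int(torch.multinomial(actor_j(o_j), 1))        # a'^j ~ pi_j(.|o'^j)
                  for actor_j, o_j in zip(actors, obs_next)]
        td_target = r_i + gamma * critic_i(obs_next, a_next)
    q_i = critic_i(obs, actions)
    critic_loss = F.mse_loss(q_i, td_target)
    critic_opt_i.zero_grad(); critic_loss.backward(); critic_opt_i.step()

    # --- Agent i locally: policy gradient using (a^i, o^i, q^i from the critic). ---
    q_value = critic_i(obs, actions).detach()                    # q^i sent to agent i
    log_prob = torch.log(actor_i(obs[i])[actions[i]])
    actor_loss = -q_value * log_prob
    actor_opt_i.zero_grad(); actor_loss.backward(); actor_opt_i.step()
```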

Execution Stage


Each agent independently receives its own observation and makes decisions with its own policy network; the agents do not communicate with each other.
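For completeness, decentralized execution reduces to each agent sampling from its own actor, assuming an actor interface like the sketches above:

```python
import torch

def decentralized_act(actor_i, o_i):
    # pi(a^i | o^i; theta^i): only the agent's own observation and actor are needed;
    # no message is sent to or received from a central controller.
    probs = actor_i(o_i)
    return int(torch.multinomial(probs, 1))
```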