06.md

# 异步方法

到目前为止，我们已经涵盖了大多数重要主题，例如马尔可夫决策过程，值迭代，Q 学习，策略梯度，深度 Q 网络和参与者批评算法。 这些构成了强化学习算法的核心。 在本章中，我们将继续从 Actor Critic Algorithms 中停止的地方继续搜索，并深入研究用于深度强化学习的高级**异步方法**及其最著名的变体**异步优势演员评论家算法**，通常称为 **A3C 算法**。

但是，在开始使用 A3C 算法之前，让我们修改第 4 章和“策略梯度”中涵盖的 Actor Critic 算法的基础。 如果您还记得，演员评论算法有两个组成部分：

*   演员
*   评论家

参与者采取当前的环境状态并确定要采取的最佳行动，而评论家则通过采取环境状态和行动来扮演策略评估角色，并返回一个分数，以描述该状态对某项行动的良好程度。 如下图所示：

![](img/5b448889-61bf-43ca-82f5-ce5d63662d33.png)

换句话说，演员的行为像孩子，而评论家的行为像父母，孩子探索多种行为，父母批评不良行为并补充良好行为。 因此，行动者批判算法既学习了策略功能，又学习了状态作用值函数。 像策略梯度一样，参与者评论算法也通过梯度上升来更新其参数。 在高维连续状态和动作空间的情况下，Actor-critic 方法非常有效。

因此，让我们从 Google DeepMind 发布的深度强化学习中的异步方法开始，该方法在性能和计算效率方面都超过了 DQN。

我们将在本章介绍以下主题：

*   为什么使用异步方法？
*   异步一站式 Q 学习
*   异步一步式 SARSA
*   异步 n 步 Q 学习
*   异步优势演员评论家
*   OpenAI 体育馆中用于摆锤 v0 的 A3C

# 为什么使用异步方法？

Google DeepMind 和 MILA 的联合团队于 2016 年 6 月发布了用于深度强化学习的[**同步方法**](https://arxiv.org/pdf/1602.01783.pdf)。 它速度更快，并且能够在多核 CPU 上而不是使用 GPU 上显示出良好的效果。 异步方法也适用于连续动作空间和离散动作空间。

如果我们回想起深度 Q 网络的方法，我们会使用经验重播作为存储所有经验的存储，然后使用其中的随机样本来训练我们的深度神经网络，从而反过来预测最大 Q 值。 有利的行动。 但是，随着时间的流逝，它具有高内存使用和大量计算的缺点。 这背后的基本思想是克服这个问题。 因此，不是使用经验重播，而是创建了环境的多个实例，并且多个代理以异步方式并行执行操作（如下图所示）：

![](img/f1e415ce-be75-4cb7-b24e-7e81937e3c04.png)

深度强化学习中异步方法的高级示意图

在异步方法中，为每个线程分配了一个进程，该进程包含一个学习器，该学习器表示与自己的环境副本进行交互的代理网络。 因此，这些多个学习器并行运行以探索他们自己的环境版本。 这种并行方法允许代理在任何给定的时间步长同时经历各种不同的状态，并涵盖了非策略学习和策略学习学习算法的基础。

如上所述，异步方法能够在多核 CPU（而不是 GPU）中显示出良好的效果。 因此，异步方法非常快，因此成为最新的强化学习算法，因为到目前为止，深度强化学习算法的实现依赖于 GPU 驱动的机器和分布式架构，以见证所实现算法的更快收敛。

这些并行运行的多个学习器使用不同的探索策略，从而最大限度地提高了多样性。 不同学习器的不同探索策略会更改参数，并且这些更新与时间相关的机会最少。 因此，不需要经验回放记忆，并且我们依靠使用不同探索策略的并行学习来执行先前在 DQN 中使用的经验回放的角色。

使用并行学习器的好处如下：

*   减少训练时间。
*   不使用经验重播。 因此，基于策略的强化学习方法也可以用于训练神经网络。

深度强化学习中异步方法的不同变体是：

*   异步一站式 Q 学习
*   异步一步式 SARSA
*   异步 n 步 Q 学习
*   **异步优势演员评论家**（**A3C**）

将变体 A3C 应用于各种 Atari 2600 游戏时，在多核 CPU 上获得了更好的基准测试结果，相对于早期的深度强化学习算法而言，其结果所需的时间要短得多，后者需要在 GPU 驱动的机器上运行。 因此，由于依赖于昂贵的硬件资源（如 GPU）以及不同的复杂分布式架构，因此解决了该问题。 由于所有这些优点，A3C 学习代理是当前最先进的强化学习代理。

# 异步一站式 Q 学习

异步单步 Q 学习的架构与 DQN 非常相似。 DQN 中的代理由一组主要网络和目标网络表示，其中一步损失的计算方法是主要网络预测的当前状态 s 的状态作用值与目标状态- 目标网络计算的当前状态的动作值。 相对于策略网络的参数来计算损失的梯度，然后使用梯度下降优化器将损失最小化，从而导致主网络的参数更新。

异步单步 Q 学习中的区别在于，有多个此类学习代理，例如，学习器并行运行并计算此损失。 因此，梯度计算也并行发生在不同的线程中，其中每个学习代理都与自己的环境副本进行交互。 这些梯度在多个时间步长上在不同线程中的累积用于在固定时间步长后或情节结束后更新策略网络参数。 梯度的累积优于策略网络参数更新，因为这样可以避免覆盖每个学习器代理执行的更改。

此外，将不同的探索策略添加到不同的线程可以使学习变得多样化且稳定。 由于更好的探索，因此提高了性能，因为不同线程中的每个学习代理都受到不同的探索策略。 尽管有许多方法可以执行此操作，但一种简单的方法是在使用![](img/396240a6-a09a-4588-b405-54fef7a9be74.png) -greedy 的同时为不同的线程使用不同的 epsilon ![](img/3594b237-6432-4150-bbcb-35b868d50025.png)示例。

异步单步 Q 学习的伪代码如下所示。 这里，以下是全局参数：

*   ![](img/4e2e2f05-a713-40a9-84b4-abd6e47f309f.png)：策略网络的参数（权重和偏差）
*   ![](img/ec675ec6-2574-4fe5-b97c-adfdfa98c693.png)：目标网络的参数（权重和偏差）
*   T：整体时间步长计数器

```py
// Globally shared parameters , and T 
//  is initialized arbitrarily
// T is initialized 0

pseudo-code for each learner running parallel in each of the threads:

Initialize thread level time step counter t=0
Initialize  = 
Initialize network gradients 
Start with the initial state s

repeat until  :
    Choose action a with -greedy policy such that:

    Perform action a
    Receive new state s' and reward r
    Compute target y : 
    Compute the loss, 
    Accumulate the gradient w.r.t. : 
    s = s'
    T = T + 1
    t = t + 1

    if T mod  : 
        Update the parameters of target network :  = 
        # After every  time steps the parameters of target network is updated

    if t mod  or s = terminal state:
        Asynchronous update of  using 
        Clear gradients : 
        #at every  time step in the thread or if s is the terminal state
        #update  using accumulated gradients 
```

# 异步一步式 SARSA

异步单步 SARSA 的架构几乎与异步单步 Q 学习的架构相似，不同之处在于目标网络计算当前状态的目标状态作用值的方式。 SARSA 并未使用目标网络使用下一个状态`s'`的最大 Q 值，而是使用 ε 贪婪为下一个状态`s'`选择动作`a'`。 下一个状态动作对的 Q 值`Q(s', a')`；![](img/88250d05-5523-43d2-b0fc-20f621b58966.png)用于计算当前状态的目标状态动作值。

异步单步 SARSA 的伪代码如下所示。 这里，以下是全局参数：

*   ![](img/3b0e84ed-0408-4420-9d16-382655598e8e.png)：策略网络的参数（权重和偏差）
*   ![](img/4c7f218e-0d62-44fb-b430-252fac104d1a.png)：目标网络的参数（权重和偏差）
*   T：总时间步长计数器

```py
// Globally shared parameters , and T 
//  is initialized arbitrarily
// T is initialized 0

pseudo-code for each learner running parallel in each of the threads:

Initialize thread level time step counter t=0
Initialize  = 
Initialize network gradients 
Start with the initial state s
Choose action a with -greedy policy such that:

repeat until  :
    Perform action a
    Receive new state s' and reward r
    Choose action a' with -greedy policy such that:

    Compute target y : 
    Compute the loss, 
    Accumulate the gradient w.r.t. : 
    s = s'
    T = T + 1
    t = t + 1
    a = a'

    if T mod  : 
         Update the parameters of target network :  = 
         # After every  time steps the parameters of target network is updated

    if t mod  or s = terminal state:
         Asynchronous update of  using 
         Clear gradients : 
         #at every  time step in the thread or if s is the terminal state
         #update  using accumulated gradients 
```

# 异步 n 步 Q 学习

异步 n 步 Q 学习的架构在某种程度上类似于异步单步 Q 学习的架构。 区别在于，使用探索策略最多选择![](img/934eccea-ab73-4d16-8dd9-6ae2bef78aa9.png)步骤或直到达到终端状态，才能选择学习代理动作，以便计算策略网络参数的单个更新。 此过程列出了自上次更新以来![](img/55e36559-5c3e-4598-b8a3-954b2a844161.png)来自环境的奖励。 然后，对于每个时间步长，将损失计算为该时间步长的折现未来奖励与估算 Q 值之间的差。 对于每个时间步长，此损耗相对于特定于线程的网络参数的梯度将被计算和累积。 有多个这样的学习代理并行运行和累积梯度。 这些累积的梯度用于执行策略网络参数的异步更新。

异步 n 步 Q 学习的伪代码如下所示。 这里，以下是全局参数：

*   ![](img/a8678ea8-ad82-4124-a099-603825a7d2f8.png)：策略网络的参数（权重和偏差）
*   ![](img/9008762f-89a2-4438-b034-e141fdf73759.png)：目标网络的参数（权重和偏差）
*   T：总时间步长计数器
*   t：线程级时间步长计数器
*   ![](img/1c3d0299-4182-4715-83f4-eecdcd605678.png)：总时间步长的最大值
*   ![](img/67ab7590-6e10-4049-b5c6-7cce81f3f339.png)：线程中的最大时间步数

```py
// Globally shared parameters , and T 
//  is initialized arbitrarily
// T is initialized 0

pseudo-code for each learner running parallel in each of the threads:

Initialize thread level time step counter t = 1
Initialize  = 
Initialize  = 
Initialize network gradients 

repeat until  :
     Clear gradient : 
     Synchronize thread-specific parameters: 

     Get state 
     r = [] //list of rewards
     a = [] //list of actions
     s = [] //list of actions
     repeat until  is a terminal state or :
             Choose action  with -greedy policy such that:

             Perform action 
             Receive new state  and reward 
             Accumulate rewards by appending  to r
             Accumulate actions by appending  to a
             Accumulate actions by appending  to s
             t = t + 1
             T = T + 1

     Compute returns, R : 
     for  :

            Compute loss, 
            Accumulate gradients w.r.t.  :

     Asynchronous update of  using 
     if T mod  : 
         Update the parameters of target network :  = 
         # After every  time steps the parameters of target network is updated
```

# 异步优势演员评论家

在异步优势参与者批评者的架构中，每个学习代理都包含一个参与者批评者，该学习器结合了基于价值和基于策略的方法的优点。 参与者网络将状态作为输入，并预测该状态的最佳动作，而评论家网络将状态和动作作为输入，并输出动作分数以量化该状态的动作效果。 参与者网络使用策略梯度更新其权重参数，而评论者网络使用`TD(0)`来更新其权重参数，换言之，两个时间步长之间的值估计之差，如第 4 章“策略梯度”中所述。

在第 4 章和“策略梯度”中，我们研究了如何通过从策略梯度的预期未来收益中减去基线函数来更新策略梯度，从而在不影响预期收益的情况下减少方差的情况。 梯度。 预期的未来奖励和基线函数之间的差称为**优势函数**； 它不仅告诉我们状态的好坏，而且还告诉我们该动作的预期好坏。

卷积神经网络既用于演员网络，又用于评论者网络。 在每个![](img/4a53a424-256c-43ac-98a9-b4b0fe5c0156.png)步骤之后或直到达到终端状态之前，将更新策略和操作值参数。 将与以下伪代码一起说明网络更新，熵和目标函数。 此外，将策略![](img/861ea090-c642-4a1f-b1b0-5082002bc932.png)的熵 H 添加到目标函数中，以通过避免过早收敛到次优策略来改善探索。

因此，有多个这样的学习代理程序运行，每个学习代理程序都包含参与者关键网络，其中策略网络参数（即参与者网络参数）使用策略梯度进行更新，其中优势函数用于计算那些策略梯度。

异步单步 SARSA 的伪代码如下所示。 这里，以下是全局参数：

*   ![](img/b61d023c-24db-426f-b276-170dc20d960a.png)：策略网络的参数（权重和偏差）
*   ![](img/9f23bd7c-22ba-4f6c-a6a8-d09a2dd58bfb.png)：值函数逼近器的参数（权重和偏差）
*   T：总时间步长计数器

特定于线程的参数如下：

*   ![](img/98afa05f-22c6-488f-8943-f4d2397cb8bc.png)：策略网络的线程特定参数
*   ![](img/9ce7dded-df9f-41f8-a8ab-b2950ec62a13.png)：值函数近似器的线程特定参数

```py
//Globally shared parameters  and T
//  is initialized arbitrarily
//  is initialized arbitrarily
// T is initialized 0

pseudo-code for each learner running parallel in each of the threads:

//Thread specific parameters  and 
Initialize thread level time step counter t = 1
repeat until  :
    reset gradients :  and 
    synchronize thread specific parameters :  and 

    Get state 
    r = [] //list of rewards
    a = [] //list of actions
    s = [] //list of actions
    repeat until  is a terminal state or :
        Perform  according to policy 
        Receive new state  and reward 
        Accumulate rewards by appending  to r
        Accumulate actions by appending  to a
        Accumulate actions by appending  to s
        t = t + 1
        T = T + 1

    Compute returns, that is expected future rewards R such that:

    for  :

            Accumulate gradients w.r.t.  :

            Accumulate gradients w.r.t.  :

    Asynchronous update of  using  and  using 
```

# OpenAI 体育馆中 Pong-v0 的 A3C

在第 4 章和“策略梯度”中，我们已经讨论过乒乓环境。 我们将使用以下代码在 OpenAI 体育馆中为 Pong-v0 创建 A3C：

```py
import multiprocessing
import threading
import tensorflow as tf
import numpy as np
import gym
import os
import shutil
import matplotlib.pyplot as plt

game_env = 'Pong-v0'
num_workers = multiprocessing.cpu_count()
max_global_episodes = 100000
global_network_scope = 'globalnet'
global_iteration_update = 20
gamma = 0.9
beta = 0.0001
lr_actor = 0.0001 # learning rate for actor
lr_critic = 0.0001 # learning rate for critic
global_running_rate = []
global_episode = 0

env = gym.make(game_env)

num_actions = env.action_space.n

tf.reset_default_graph()
```

输入状态图像预处理功能：

```py
def preprocessing_image(obs): #where I is the single frame of the game as the input
    """ prepro 210x160x3 uint8 frame into 6400 (80x80) 1D float vector """
    #the values below have been precomputed through trail and error by OpenAI team members
    obs = obs[35:195] 
    #cropping the image frame to an extent where it contains on the paddles and ball and area between them
    obs = obs[::2,::2,0] 
    #downsample by the factor of 2 and take only the R of the RGB channel.Therefore, now 2D frame
    obs[obs==144] = 0 #erase background type 1
    obs[obs==109] = 0 #erase background type 2
    obs[obs!=0] = 1 #everything else(other than paddles and ball) set to 1
    return obs.astype('float').ravel() #flattening to 1D
```

以下代码显示了`actor-critic`类，其中包含`actor`和`critic`网络的架构：

```py
class ActorCriticNetwork(object):
    def __init__(self, scope, globalAC=None):

        if scope == global_network_scope: # get global network
            with tf.variable_scope(scope):
                self.s = tf.placeholder(tf.float32, [None,6400], 'state')
                self.a_params, self.c_params = self._build_net(scope)[-2:]
        else: # local net, calculate losses
            with tf.variable_scope(scope):
                self.s = tf.placeholder(tf.float32, [None,6400], 'state')
                self.a_his = tf.placeholder(tf.int32, [None,], 'action')
                self.v_target = tf.placeholder(tf.float32, [None, 1], 'target_vector')

                self.a_prob, self.v, self.a_params, self.c_params = self._build_net(scope)

                td = tf.subtract(self.v_target, self.v, name='temporal_difference_error')
                with tf.name_scope('critic_loss'):
                    self.c_loss = tf.reduce_mean(tf.square(td))

                with tf.name_scope('actor_loss'):
                    log_prob = tf.reduce_sum(tf.log(self.a_prob) * tf.one_hot(self.a_his, num_actions, dtype=tf.float32), axis=1, keep_dims=True)
                    exp_v = log_prob * td
                    entropy = -tf.reduce_sum(self.a_prob * tf.log(self.a_prob + 1e-5),
                                             axis=1, keep_dims=True) #exploration
                    self.exp_v = beta * entropy + exp_v
                    self.a_loss = tf.reduce_mean(-self.exp_v)

                with tf.name_scope('local_grad'):
                    self.a_grads = tf.gradients(self.a_loss, self.a_params)
                    self.c_grads = tf.gradients(self.c_loss, self.c_params)

            with tf.name_scope('sync'):
                with tf.name_scope('pull'):
                    self.pull_a_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.a_params, globalAC.a_params)]
                    self.pull_c_params_op = [l_p.assign(g_p) for l_p, g_p in zip(self.c_params, globalAC.c_params)]
                with tf.name_scope('push'):
                    self.update_a_op = actor_train.apply_gradients(zip(self.a_grads, globalAC.a_params))
                    self.update_c_op = critic_train.apply_gradients(zip(self.c_grads, globalAC.c_params))

    def _build_net(self, scope):
        w_init = tf.random_normal_initializer(0., .1)
        with tf.variable_scope('actor_network'):
            l_a = tf.layers.dense(self.s, 300, tf.nn.relu6, kernel_initializer=w_init, name='actor_layer')
            a_prob = tf.layers.dense(l_a, num_actions, tf.nn.softmax, kernel_initializer=w_init, name='ap')
        with tf.variable_scope('critic_network'):
            l_c = tf.layers.dense(self.s, 100, tf.nn.relu6, kernel_initializer=w_init, name='critic_layer')
            v = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='v') # state value
        a_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')
        c_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')
        return a_prob, v, a_params, c_params

    def update_global(self, feed_dict): # run local
        session.run([self.update_a_op, self.update_c_op], feed_dict) # local gradient applied to global net

    def pull_global(self): # run local
        session.run([self.pull_a_params_op, self.pull_c_params_op])

    def choose_action(self, s): # run local
        s = np.reshape(s,[-1])
        prob_weights = session.run(self.a_prob, feed_dict={self.s: s[np.newaxis, :]})
        action = np.random.choice(range(prob_weights.shape[1]),
                                  p=prob_weights.ravel()) # select action w.r.t the actions prob
        return action
```

代表每个线程中的进程的 worker 类如下所示：

```py
class Worker(object):
    def __init__(self, name, globalAC):
        self.env = gym.make(game_env).unwrapped
        self.name = name
        self.AC = ActorCriticNetwork(name, globalAC)

    def work(self):
        global global_running_rate, global_episode
        total_step = 1
        buffer_s, buffer_a, buffer_r = [], [], []
        while not coordinator.should_stop() and global_episode < max_global_episodes:
            obs = self.env.reset()
            s = preprocessing_image(obs)
            ep_r = 0
            while True:
                if self.name == 'W_0':
                    self.env.render()
                a = self.AC.choose_action(s)

                #print(a.shape)

                obs_, r, done, info = self.env.step(a)
                s_ = preprocessing_image(obs_)
                if done and r<=0: 
                      r = -20
                ep_r += r
                buffer_s.append(np.reshape(s,[-1]))
                buffer_a.append(a)
                buffer_r.append(r)

                if total_step % global_iteration_update == 0 or done: # update global and assign to local net
                    if done:
                        v_s_ = 0 # terminal
                    else:
                        s_ = np.reshape(s_,[-1])
                        v_s_ = session.run(self.AC.v, {self.AC.s: s_[np.newaxis, :]})[0, 0]
                    buffer_v_target = []
                    for r in buffer_r[::-1]: # reverse buffer r
                        v_s_ = r + gamma * v_s_
                        buffer_v_target.append(v_s_)
                    buffer_v_target.reverse()

                    buffer_s, buffer_a, buffer_v_target = np.vstack(buffer_s), np.array(buffer_a), np.vstack(buffer_v_target)
                    feed_dict = {
                        self.AC.s: buffer_s,
                        self.AC.a_his: buffer_a,
                        self.AC.v_target: buffer_v_target,
                    }
                    self.AC.update_global(feed_dict)

                    buffer_s, buffer_a, buffer_r = [], [], []
                    self.AC.pull_global()

                s = s_
                total_step += 1
                if done:
                    if len(global_running_rate) == 0: # record running episode reward
                        global_running_rate.append(ep_r)
                    else:
                        global_running_rate.append(0.99 * global_running_rate[-1] + 0.01 * ep_r)
                    print(
                        self.name,
                        "Ep:", global_episode,
                        "| Ep_r: %i" % global_running_rate[-1],
                          )
                    global_episode += 1
                    break
```

以下代码显示了`main`函数，该函数创建线程池并将工作线程分配给不同的线程：

```py
if __name__ == "__main__":
    session = tf.Session()

    with tf.device("/cpu:0"):
        actor_train = tf.train.RMSPropOptimizer(lr_actor, name='RMSPropOptimiserActor')
        critic_train = tf.train.RMSPropOptimizer(lr_critic, name='RMSPropOptimiserCritic')
        acn_global = ActorCriticNetwork(global_network_scope) # we only need its params
        workers = []
        # Create worker
        for i in range(num_workers):
            i_name = 'W_%i' % i # worker name
            workers.append(Worker(i_name, acn_global))

    coordinator = tf.train.Coordinator()
    session.run(tf.global_variables_initializer())

    worker_threads = []
    for worker in workers:
        job = lambda: worker.work()
        t = threading.Thread(target=job)
        t.start()
        worker_threads.append(t)
    coordinator.join(worker_threads)

    plt.plot(np.arange(len(global_running_rate)), global_running_rate)
    plt.xlabel('step')
    plt.ylabel('Total moving reward')
    plt.show()
```

根据学习的输出的屏幕截图（绿色桨是我们的学习代理）：

![](img/fae36238-a4f7-46b5-aeeb-0110aac10b23.png)

# 概要

我们看到，使用并行学习器更新共享模型可以大大改善学习过程。 我们了解了在深度学习中使用异步方法的原因及其不同的变体，包括异步单步 Q 学习，异步单步 SARSA，异步 n 步 Q 学习和异步优势参与者。 我们还学习了实现 A3C 算法的方法，在该方法中，我们使代理学习了 Breakout and Doom 游戏。

在接下来的章节中，我们将重点介绍不同的领域，以及如何以及可以应用深度强化学习。