OpenAI Gym provides different types of environments. They are as follows:
```py
$ git clone https://github.com/openai/gym
$ cd gym
$ sudo pip install -e . # minimal install
```
This performs a minimal install. You can later run the following command for a full install:
```py
$ sudo pip install -e .[all]
```
For Python 2.7, the following options can be used:
```py
$ sudo pip install gym # minimal install
$ sudo pip install gym[all] # full install
$ sudo pip install gym[atari] # for Atari specific environment installation
```
For Python 3.5, the following options can be used:
```py
$ sudo pip3 install gym # minimal install
$ sudo pip3 install gym[all] # full install
$ sudo pip3 install gym[atari] # for Atari specific environment installation
```
# Understanding an OpenAI Gym environment
To understand the basics of importing the Gym package, loading an environment, and other important functions associated with OpenAI Gym, here is an example of the **Frozen Lake** environment.
Load the Frozen Lake environment in the following way:
```py
import gym
env = gym.make('FrozenLake-v0') # the make function of gym loads the specified environment
```
Next, let's reset the environment. While performing a reinforcement learning task, the agent goes through learning over many episodes. As a result, at the start of each episode the environment needs to be reset so that it returns to its initial state, and the agent begins from the start state. The following code shows the process of resetting the environment:
```py
import gym
env = gym.make('FrozenLake-v0')
s = env.reset() # resets the environment and returns the start state as a value
print(s)
-----------
0 # initial state is 0
```
After executing each action, it may be necessary to display the agent's status in the environment. That state can be visualized in the following way:
```py
env.render()
-------------------
SFFF
FHFH
FFFH
HFFG
```
The preceding output shows that this is an environment consisting of a `4 x 4` grid, that is, 16 states arranged in the preceding manner, where S, H, F, and G denote different kinds of states:
* `S`: the start block
* `F`: the frozen block
* `H`: the hole block
* `G`: the goal block
In this environment, the observation space is discrete with 16 states; printing `env.observation_space` outputs `Discrete(16)`.
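As a quick, illustrative sketch (not part of the original listing), the observation and action spaces can be inspected as follows; for `FrozenLake-v0` the action space is `Discrete(4)`, one action per move direction:

```py
import gym

env = gym.make('FrozenLake-v0')
print(env.observation_space)    # Discrete(16): one state per grid block
print(env.action_space)         # Discrete(4): left, down, right, up
print(env.observation_space.n)  # 16, used later to size the Q-table
print(env.action_space.n)       # 4
```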
# Programming an agent using an OpenAI Gym environment
The environment considered in this section is **Frozen Lake v0**. The actual documentation for the environment can be found on [this page](https://gym.openai.com/envs/FrozenLake-v0/).
This environment consists of a `4 x 4` grid representing a lake. We therefore have 16 grid blocks, where each block can be a start block (S), a frozen block (F), a goal block (G), or a hole block (H). The goal of the agent is thus to learn how to navigate from start to goal without falling into a hole:
```py
import gym
env = gym.make('FrozenLake-v0') # loads the environment FrozenLake-v0
env.render() # will output the environment and the position of the agent
-------------------
SFFF
FHFH
FFFH
HFFG
```
Let's try to implement a basic Q-learning algorithm so that the agent learns how to navigate this 16-block frozen lake from start to goal without falling into a hole:
```py
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time

# Load the environment
env = gym.make('FrozenLake-v0')
print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()
#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    if p > epsilon:
        return np.argmax(Q[s,:])
    else:
        return env.action_space.sample()

# Q-Learning Implementation

# Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])

# set hyperparameters
lr = 0.5       # learning rate
y = 0.9        # discount factor (gamma)
eps = 100000   # total number of training episodes
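
# NOTE: the training loop that belongs here was truncated in this listing;
# the following is a sketch of a standard Q-learning loop built from the
# pieces defined above, not the author's verbatim code
for i in range(eps):
    s = env.reset()                                  # start a new episode
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)   # pick an action
        s_,r,t,_ = env.step(a)                       # act and observe the new state and reward
        # Q-learning update rule
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s_,:]) - Q[s,a])
        s = s_
        if t==True:                                  # episode ends at a hole or at the goal
            break
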
print()
print("Output after learning")
print()
# learning ends with the end of the above loop over several episodes
# let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
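    # NOTE: the body of this loop was truncated; a sketch of a greedy rollout
    # over the learned Q-table is shown here as an assumption
    a = np.argmax(Q[s,:])       # always exploit the learned Q-values
    s_,r,t,_ = env.step(a)
    env.render()                # show the agent's position after each step
    s = s_
    if t==True:
        break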
-------------------
HFFG
```
Maintaining a table is feasible for a small number of states, but in the real world the number of states can become effectively infinite. Therefore, we need a solution that takes in the state information and outputs the Q-values of the actions without using a Q-table. This is where a neural network acts as a function approximator: trained on data consisting of different state information and the corresponding Q-values of all actions, it can predict the Q-values for any new state information given as input. A neural network used to predict Q-values instead of using a Q-table is called a Q-network.
Here, for the `FrozenLake-v0` environment, let's use a single neural network that takes the state information as input, where the state information is represented as a one-hot encoded vector of shape `1 x number of states` (here, 1 x 16), and outputs a vector of shape `1 x number of actions` (here, 1 x 4). The output is the Q-values of all the actions:
```py
# considering there are 16 states numbered from state 0 to state 15, state number 4
# will be represented as the following one-hot encoded vector
input_state = [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
```
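As a small illustrative sketch (not from the original text), such a one-hot vector can be built with `np.identity`, which is also how the state is fed to the network in the listing that follows:

```py
import numpy as np

state = 4
one_hot_state = np.identity(16)[state:state+1]  # row vector of shape (1, 16) with a 1 at index 4
print(one_hot_state)
```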
Let's try to implement this in Python to learn how a basic Q-network algorithm works, so that the agent can navigate this 16-block frozen lake from start to goal without falling into a hole:
```py
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import tensorflow as tf
import random

# Load the Environment
env = gym.make('FrozenLake-v0')

# Q-Network Implementation

## Creating Neural Network
tf.reset_default_graph()
# tensors for inputs, weights, biases, Qtarget
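# NOTE: the tensor definitions were truncated in this listing; the following is a
# sketch consistent with the 1 x 16 input / 1 x 4 output described above, not the
# author's verbatim code
inputs = tf.placeholder(shape=[1,env.observation_space.n],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([env.observation_space.n,env.action_space.n],0,0.01),dtype=tf.float32)
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),dtype=tf.float32)
qpred = tf.add(tf.matmul(inputs,W),b)   # predicted Q-values of all actions for the input state
apred = tf.argmax(qpred,1)              # index of the action with the highest predicted Q-value
qtar = tf.placeholder(shape=[1,env.action_space.n],dtype=tf.float32)  # placeholder for the target Q-values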
loss = tf.reduce_sum(tf.square(qtar-qpred))
train = tf.train.AdamOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)
## Training the neural network
init = tf.global_variables_initializer() #initializing tensor variables
#initializing parameters
with tf.Session() as sess:
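    # NOTE: the initialization lines here were truncated; the hyperparameter
    # values and the episode count below are assumptions, not the author's code
    sess.run(init)            # initialize the network weights
    y = 0.9                   # discount factor
    e = 0.3                   # epsilon used by the epsilon-greedy exploration below
    for i in range(10000):    # loop over training episodes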
        s = env.reset() # resetting the environment at the start of each episode
        r_total = 0     # to calculate the sum of rewards in the current episode
        while(True):
            # running the Q-network created above
            a_pred,q_pred = sess.run([apred,qpred],feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
            # a_pred is the action prediction by the neural network
            # q_pred contains the q_values of the actions at the current state 's'
            if np.random.uniform(low=0,high=1) < e: # performing epsilon-greedy here
                a_pred[0] = env.action_space.sample()
                # exploring a different action by randomly assigning it as the next action
            s_,r,t,_ = env.step(a_pred[0]) # action taken and new state 's_' is encountered with a feedback reward 'r'
            if r==0:
                if t==True:
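                    # NOTE: the reward-shaping lines here were truncated; the
                    # penalty values below are assumptions, not the author's code
                    r = -5                   # episode ended in a hole: strong negative reward
                else:
                    r = -1                   # ordinary frozen block: small step penalty
            if r==1:                         # the environment returns 1 on reaching the goal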
                r = 5 # good positive goal state reward
            q_pred_new = sess.run(qpred,feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
            # q_pred_new contains the q_values of the actions at the new state

            # update the Q-target value for the action taken
            targetQ = q_pred
            max_qpredn = np.max(q_pred_new)
            targetQ[0,a_pred[0]] = r + y*max_qpredn
            # this gives our targetQ

            # train the neural network to minimize the loss
            _ = sess.run(minimizer,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],qtar:targetQ})
            s = s_
            if t==True:
                break

    # learning ends with the end of the loop over several episodes above
    # let's check how much our agent has learned
    print("Output after learning")
    print()
    s = env.reset()
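    env.render()
    # NOTE: the rollout below is a sketch of the truncated evaluation loop,
    # taking greedy actions from the trained network (an assumption)
    while(True):
        a = sess.run(apred,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
        s_,r,t,_ = env.step(a[0])
        env.render()            # show the agent's position after each step
        s = s_
        if t==True:
            break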
```