OpenAI Gym provides different types of environments. They are as follows:
```py
$ git clone https://github.com/openai/gym
$ cd gym
$ sudo pip install -e . # minimal install
```
This performs a minimal install. You can later run the following command for a full install:
```py
$ sudo pip install -e .[all]
```
For Python 2.7, the following options can be used:
```py
$ sudo pip install gym # minimal install
$ sudo pip install gym[all] # full install
$ sudo pip install gym[atari] # for Atari specific environment installation
```
For Python 3.5, the following options can be used:
```py
$ sudo pip3 install gym # minimal install
$ sudo pip3 install gym[all] # full install
$ sudo pip3 install gym[atari] # for Atari specific environment installation
```
# Understanding an OpenAI Gym environment
To understand the basics of importing the Gym package, loading an environment, and other important functions associated with OpenAI Gym, here is an example of the **Frozen Lake** environment.
Load the Frozen Lake environment in the following way:
```py
import gym
env = gym.make('FrozenLake-v0') # the make function of gym loads the specified environment
```
Next, let's reset the environment. While performing a reinforcement learning task, the agent goes through learning over many episodes. As a result, at the start of each episode the environment needs to be reset so that it returns to its initial state, and the agent begins from the start state. The following code shows the process of resetting the environment:
```py
import gym
env = gym.make('FrozenLake-v0')
s = env.reset() # resets the environment and returns the start state as a value
print(s)
-----------
0 # initial state is 0
```
After executing each action, it may be necessary to display the agent's status in the environment. That state can be visualized in the following way:
```py
env.render()
-------------------
SFFF
FHFH
FFFH
HFFG
```
The preceding output shows that this is an environment consisting of a `4 x 4` grid, that is, 16 states arranged in the preceding manner, where S, H, F, and G denote different kinds of states:
* `S`: the start block
* `F`: the frozen block
* `H`: the hole block
* `G`: the goal block
In this environment, the observation space is discrete with 16 states; printing `env.observation_space` outputs `Discrete(16)`.
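As a quick, illustrative sketch (not part of the original listing), the observation and action spaces can be inspected as follows; for `FrozenLake-v0` the action space is `Discrete(4)`, one action per move direction:

```py
import gym

env = gym.make('FrozenLake-v0')
print(env.observation_space)    # Discrete(16): one state per grid block
print(env.action_space)         # Discrete(4): left, down, right, up
print(env.observation_space.n)  # 16, used later to size the Q-table
print(env.action_space.n)       # 4
```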
# Programming an agent using an OpenAI Gym environment
The environment considered in this section is **Frozen Lake v0**. The actual documentation for the environment can be found on [this page](https://gym.openai.com/envs/FrozenLake-v0/).
This environment consists of a `4 x 4` grid representing a lake. We therefore have 16 grid blocks, where each block can be a start block (S), a frozen block (F), a goal block (G), or a hole block (H). The goal of the agent is thus to learn how to navigate from start to goal without falling into a hole:
```py
import gym
env = gym.make('FrozenLake-v0') # loads the environment FrozenLake-v0
env.render() # will output the environment and the position of the agent
-------------------
SFFF
FHFH
FFFH
HFFG
```
Let's try to implement a basic Q-learning algorithm so that the agent learns how to navigate this 16-block frozen lake from start to goal without falling into a hole:
```py
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time

# Load the environment
env = gym.make('FrozenLake-v0')
print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()
#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    if p > epsilon:
        return np.argmax(Q[s,:])
    else:
        return env.action_space.sample()

# Q-Learning Implementation

# Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])

# set hyperparameters
lr = 0.5       # learning rate
y = 0.9        # discount factor (gamma)
eps = 100000   # total number of training episodes
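
# NOTE: the training loop that belongs here was truncated in this listing;
# the following is a sketch of a standard Q-learning loop built from the
# pieces defined above, not the author's verbatim code
for i in range(eps):
    s = env.reset()                                  # start a new episode
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)   # pick an action
        s_,r,t,_ = env.step(a)                       # act and observe the new state and reward
        # Q-learning update rule
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s_,:]) - Q[s,a])
        s = s_
        if t==True:                                  # episode ends at a hole or at the goal
            break
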
print()
print("Output after learning")
print()
# learning ends with the end of the above loop over several episodes
# let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
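    # NOTE: the body of this loop was truncated; a sketch of a greedy rollout
    # over the learned Q-table is shown here as an assumption
    a = np.argmax(Q[s,:])       # always exploit the learned Q-values
    s_,r,t,_ = env.step(a)
    env.render()                # show the agent's position after each step
    s = s_
    if t==True:
        break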
-------------------
HFFG
```
Maintaining a table is feasible for a small number of states, but in the real world the number of states can become effectively infinite. Therefore, we need a solution that takes in the state information and outputs the Q-values of the actions without using a Q-table. This is where a neural network acts as a function approximator: trained on data consisting of different state information and the corresponding Q-values of all actions, it can predict the Q-values for any new state information given as input. A neural network used to predict Q-values instead of using a Q-table is called a Q-network.
Here, for the `FrozenLake-v0` environment, let's use a single neural network that takes the state information as input, where the state information is represented as a one-hot encoded vector of shape `1 x number of states` (here, 1 x 16), and outputs a vector of shape `1 x number of actions` (here, 1 x 4). The output is the Q-values of all the actions:
```py
# considering there are 16 states numbered from state 0 to state 15, state number 4
# will be represented as the following one-hot encoded vector
input_state = [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
```
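As a small illustrative sketch (not from the original text), such a one-hot vector can be built with `np.identity`, which is also how the state is fed to the network in the listing that follows:

```py
import numpy as np

state = 4
one_hot_state = np.identity(16)[state:state+1]  # row vector of shape (1, 16) with a 1 at index 4
print(one_hot_state)
```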
Let's try to implement this in Python to learn how a basic Q-network algorithm works, so that the agent can navigate this 16-block frozen lake from start to goal without falling into a hole:
```py
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import tensorflow as tf
import random

# Load the Environment
env = gym.make('FrozenLake-v0')

# Q-Network Implementation

## Creating Neural Network
tf.reset_default_graph()
# tensors for inputs, weights, biases, Qtarget
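# NOTE: the tensor definitions were truncated in this listing; the following is a
# sketch consistent with the 1 x 16 input / 1 x 4 output described above, not the
# author's verbatim code
inputs = tf.placeholder(shape=[1,env.observation_space.n],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([env.observation_space.n,env.action_space.n],0,0.01),dtype=tf.float32)
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),dtype=tf.float32)
qpred = tf.add(tf.matmul(inputs,W),b)   # predicted Q-values of all actions for the input state
apred = tf.argmax(qpred,1)              # index of the action with the highest predicted Q-value
qtar = tf.placeholder(shape=[1,env.action_space.n],dtype=tf.float32)  # placeholder for the target Q-values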
loss = tf.reduce_sum(tf.square(qtar-qpred))
train = tf.train.AdamOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)
## Training the neural network
init = tf.global_variables_initializer() #initializing tensor variables
#initializing parameters
with tf.Session() as sess:
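    # NOTE: the initialization lines here were truncated; the hyperparameter
    # values and the episode count below are assumptions, not the author's code
    sess.run(init)            # initialize the network weights
    y = 0.9                   # discount factor
    e = 0.3                   # epsilon used by the epsilon-greedy exploration below
    for i in range(10000):    # loop over training episodes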
        s = env.reset() # resetting the environment at the start of each episode
        r_total = 0     # to calculate the sum of rewards in the current episode
        while(True):
            # running the Q-network created above
            a_pred,q_pred = sess.run([apred,qpred],feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
            # a_pred is the action prediction by the neural network
            # q_pred contains the q_values of the actions at the current state 's'
            if np.random.uniform(low=0,high=1) < e: # performing epsilon-greedy here
                a_pred[0] = env.action_space.sample()
                # exploring a different action by randomly assigning it as the next action
            s_,r,t,_ = env.step(a_pred[0]) # action taken and new state 's_' is encountered with a feedback reward 'r'
            if r==0:
                if t==True:
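                    # NOTE: the reward-shaping lines here were truncated; the
                    # penalty values below are assumptions, not the author's code
                    r = -5                   # episode ended in a hole: strong negative reward
                else:
                    r = -1                   # ordinary frozen block: small step penalty
            if r==1:                         # the environment returns 1 on reaching the goal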
                r = 5 # good positive goal state reward
            q_pred_new = sess.run(qpred,feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
            # q_pred_new contains the q_values of the actions at the new state

            # update the Q-target value for the action taken
            targetQ = q_pred
            max_qpredn = np.max(q_pred_new)
            targetQ[0,a_pred[0]] = r + y*max_qpredn
            # this gives our targetQ

            # train the neural network to minimize the loss
            _ = sess.run(minimizer,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],qtar:targetQ})
            s = s_
            if t==True:
                break

    # learning ends with the end of the loop over several episodes above
    # let's check how much our agent has learned
    print("Output after learning")
    print()
    s = env.reset()
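    env.render()
    # NOTE: the rollout below is a sketch of the truncated evaluation loop,
    # taking greedy actions from the trained network (an assumption)
    while(True):
        a = sess.run(apred,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
        s_,r,t,_ = env.step(a[0])
        env.render()            # show the agent's position after each step
        s = s_
        if t==True:
            break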
```