Getting Started
===============

Goal of this tutorial:

  - Understand PARL's abstraction at a high level
  - Train an agent to solve the Cartpole problem with Policy Gradient algorithm

This tutorial assumes that you are already familiar with the basics of the policy gradient algorithm.

Model
-----
First, let's build a ``Model`` that predicts an action given an observation. As an object-oriented framework, PARL expects models to be built on top of ``parl.Model`` by implementing the ``forward`` function.

Here, we construct a neural network with two fully connected layers.

.. code-block:: python

    import parl
    from parl import layers

    class CartpoleModel(parl.Model):
        def __init__(self, act_dim):
            # Two fully connected layers: a tanh hidden layer followed by
            # a softmax output layer that produces action probabilities.
            hid1_size = act_dim * 10

            self.fc1 = layers.fc(size=hid1_size, act='tanh')
            self.fc2 = layers.fc(size=act_dim, act='softmax')

        def forward(self, obs):
            out = self.fc1(obs)
            out = self.fc2(out)
            return out

Algorithm
----------
``Algorithm`` updates the parameters of the model passed to it. In general, the loss function is defined in ``Algorithm``.
In this tutorial, we solve the benchmark `Cartpole` with the `Policy Gradient` algorithm, which has already been implemented in our repository.
Thus, we can simply use this algorithm by importing it from ``parl.algorithms``.

Various other algorithms have also been published in PARL; please visit the algorithms page for more detail. For those who want to implement a new algorithm, please follow the corresponding tutorial.

.. code-block:: python

  model = CartpoleModel(act_dim=2)
  algorithm = parl.algorithms.PolicyGradient(model, lr=1e-3)

Note that each ``Algorithm`` should implement two functions:

- ``learn``

  updates the model's parameters given the transition data.
- ``predict``

  predicts an action given the current environment state.
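
To make this concrete, below is a simplified sketch of how a policy gradient ``Algorithm`` could implement these two functions on top of the fluid API. It is illustrative only: the class name ``MyPolicyGradient`` and its exact details are assumptions, and the official implementation is the one shipped as ``parl.algorithms.PolicyGradient``.

.. code-block:: python

    import parl
    import paddle.fluid as fluid
    from parl import layers

    class MyPolicyGradient(parl.Algorithm):
        """Illustrative sketch, not the official implementation."""

        def __init__(self, model, lr):
            self.model = model
            self.lr = lr

        def predict(self, obs):
            # Forward pass of the model: returns action probabilities.
            return self.model(obs)

        def learn(self, obs, action, reward):
            # REINFORCE-style loss: negative log-probability of the taken
            # action, weighted by the (discounted, normalized) reward.
            act_prob = self.model(obs)
            log_prob = layers.cross_entropy(act_prob, action)
            cost = layers.reduce_mean(log_prob * reward)
            optimizer = fluid.optimizer.Adam(learning_rate=self.lr)
            optimizer.minimize(cost)
            return cost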

Agent
----------
Now we pass the algorithm to an agent, which interacts with the environment to generate training data. Users should build their agents on top of ``parl.Agent`` and implement four functions:

- ``build_program``

  defines the fluid programs. In general, two programs are built here: one for prediction and the other for training.
- ``learn``

  preprocesses the transition data and feeds it into the training program.
- ``predict``

  feeds the current environment state into the prediction program and returns an action to execute.
- ``sample``

  samples an action given the current state; it is usually used for exploration.

.. code-block:: python

    import numpy as np
    import parl
    import paddle.fluid as fluid
    from parl import layers

    class CartpoleAgent(parl.Agent):
        def __init__(self, algorithm, obs_dim, act_dim):
            self.obs_dim = obs_dim
            self.act_dim = act_dim
            super(CartpoleAgent, self).__init__(algorithm)

        def build_program(self):
            self.pred_program = fluid.Program()
            self.train_program = fluid.Program()

            # Prediction program: observation in, action probabilities out.
            with fluid.program_guard(self.pred_program):
                obs = layers.data(
                    name='obs', shape=[self.obs_dim], dtype='float32')
                self.act_prob = self.alg.predict(obs)

            # Training program: observations, actions and rewards in, cost out.
            with fluid.program_guard(self.train_program):
                obs = layers.data(
                    name='obs', shape=[self.obs_dim], dtype='float32')
                act = layers.data(name='act', shape=[1], dtype='int64')
                reward = layers.data(name='reward', shape=[], dtype='float32')
                self.cost = self.alg.learn(obs, act, reward)

        def sample(self, obs):
            # Sample an action from the predicted probabilities (exploration).
            obs = np.expand_dims(obs, axis=0)
            act_prob = self.fluid_executor.run(
                self.pred_program,
                feed={'obs': obs.astype('float32')},
                fetch_list=[self.act_prob])[0]
            act_prob = np.squeeze(act_prob, axis=0)
            act = np.random.choice(range(self.act_dim), p=act_prob)
            return act

        def predict(self, obs):
            # Take the most probable action (evaluation).
            obs = np.expand_dims(obs, axis=0)
            act_prob = self.fluid_executor.run(
                self.pred_program,
                feed={'obs': obs.astype('float32')},
                fetch_list=[self.act_prob])[0]
            act_prob = np.squeeze(act_prob, axis=0)
            act = np.argmax(act_prob)
            return act

        def learn(self, obs, act, reward):
            # Feed a batch of transitions into the training program.
            act = np.expand_dims(act, axis=-1)
            feed = {
                'obs': obs.astype('float32'),
                'act': act.astype('int64'),
                'reward': reward.astype('float32')
            }
            cost = self.fluid_executor.run(
                self.train_program, feed=feed, fetch_list=[self.cost])[0]
            return cost

Start Training
--------------
First, let's build an ``agent``. As shown in the code below, we usually build a model first, then an algorithm, and finally the agent.

.. code-block:: python

    OBS_DIM = 4  # dimension of the CartPole-v0 observation space

    model = CartpoleModel(act_dim=2)
    alg = parl.algorithms.PolicyGradient(model, lr=1e-3)
    agent = CartpoleAgent(alg, obs_dim=OBS_DIM, act_dim=2)
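
With the agent in hand, ``sample`` draws a stochastic action (used for exploration during training), while ``predict`` returns the greedy action (used for evaluation). A quick, hypothetical sanity check with a placeholder observation:

.. code-block:: python

    import numpy as np

    # Placeholder observation of shape (OBS_DIM,); CartPole-v0 observations
    # are 4-dimensional.
    obs = np.zeros(4, dtype='float32')
    print(agent.sample(obs))   # a stochastic action, e.g. 0 or 1
    print(agent.predict(obs))  # the greedy action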

Then we use this agent to interact with the environment and train for around 1000 episodes, after which the agent can solve the problem.

.. code-block:: python

    import gym
    import numpy as np
    from parl.utils import logger

    def run_episode(env, agent, train_or_test='train'):
        obs_list, action_list, reward_list = [], [], []
        obs = env.reset()
        while True:
            obs_list.append(obs)
            if train_or_test == 'train':
                action = agent.sample(obs)   # explore during training
            else:
                action = agent.predict(obs)  # act greedily during testing
            action_list.append(action)

            obs, reward, done, info = env.step(action)
            reward_list.append(reward)

            if done:
                break
        return obs_list, action_list, reward_list

    # GAMMA and calc_discount_norm_reward are defined in the full training
    # script (examples/QuickStart/train.py); see the sketch below.
    env = gym.make("CartPole-v0")
    for i in range(1000):
        obs_list, action_list, reward_list = run_episode(env, agent)
        if i % 10 == 0:
            logger.info("Episode {}, Reward Sum {}.".format(i, sum(reward_list)))

        batch_obs = np.array(obs_list)
        batch_action = np.array(action_list)
        batch_reward = calc_discount_norm_reward(reward_list, GAMMA)

        agent.learn(batch_obs, batch_action, batch_reward)
        if (i + 1) % 100 == 0:
            _, _, reward_list = run_episode(env, agent, train_or_test='test')
            total_reward = np.sum(reward_list)
            logger.info('Test reward: {}'.format(total_reward))
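
The helper ``calc_discount_norm_reward`` and the constant ``GAMMA`` come from the full training script. A minimal sketch of the helper, assuming it computes the discounted cumulative reward for each step and then normalizes the result, could look like this:

.. code-block:: python

    import numpy as np

    def calc_discount_norm_reward(reward_list, gamma):
        """Discounted cumulative rewards, normalized to zero mean and unit std.

        Sketch only; see examples/QuickStart/train.py for the official helper.
        """
        discounted = np.zeros(len(reward_list), dtype='float32')
        running_sum = 0.0
        # Walk backwards so each step accumulates its discounted future rewards.
        for i in reversed(range(len(reward_list))):
            running_sum = gamma * running_sum + reward_list[i]
            discounted[i] = running_sum
        return (discounted - np.mean(discounted)) / np.std(discounted)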

Summary
-----------

.. image:: ../examples/QuickStart/performance.gif
  :width: 300px

.. image:: ./images/quickstart.png
  :width: 300px

In this tutorial, we have shown how to build an agent step by step to solve the `Cartpole` problem.

The complete training code can be found `here <https://github.com/PaddlePaddle/PARL/tree/develop/examples/QuickStart>`_. Give it a quick try by running the following commands:

.. code-block:: shell

	# Install dependencies
	pip install paddlepaddle  
	
	pip install gym
	git clone https://github.com/PaddlePaddle/PARL.git
	cd PARL
	pip install .
	
	# Train model
	cd examples/QuickStart/
	python train.py