# [Proximal Policy Optimization - PPO](https://nn.labml.ai/rl/ppo/index.html)

This is a [PyTorch](https://pytorch.org) implementation of
[Proximal Policy Optimization - PPO](https://papers.labml.ai/paper/1707.06347).

PPO is a policy gradient method for reinforcement learning.
Simple policy gradient methods do a single gradient update per sample (or a set of samples).
Doing multiple gradient steps on a single sample causes problems
because the policy deviates too much, producing a bad policy.
PPO lets us do multiple gradient updates per sample by trying to keep the
policy close to the policy that was used to sample the data.
It does so by clipping the surrogate objective (and thus the gradient) when the updated policy
moves too far from the policy used to sample the data.

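As a rough illustration of the clipped surrogate loss, here is a minimal sketch
(the names `log_pi`, `sampling_log_pi`, and `advantage` are hypothetical placeholders;
see the linked implementation for the actual code):

```python
import torch

def ppo_clipped_loss(log_pi, sampling_log_pi, advantage, clip_eps=0.2):
    """Clipped surrogate objective (a minimal sketch, not the exact implementation).

    `log_pi`: log-probabilities of the taken actions under the current policy.
    `sampling_log_pi`: log-probabilities under the policy that sampled the data.
    `advantage`: advantage estimates (e.g. from GAE).
    """
    # Probability ratio r_t = pi(a|s) / pi_old(a|s)
    ratio = torch.exp(log_pi - sampling_log_pi)
    # Clip the ratio to [1 - eps, 1 + eps] and take the pessimistic bound
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantage, clipped * advantage)
    # Negate because optimizers minimize
    return -surrogate.mean()
```
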
You can find an experiment that uses it [here](https://nn.labml.ai/rl/ppo/experiment.html).
The experiment uses [Generalized Advantage Estimation](https://nn.labml.ai/rl/ppo/gae.html).

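For intuition, here is a minimal single-trajectory sketch of what GAE computes
(it ignores episode termination; the linked implementation handles batched workers and done flags):

```python
import torch

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (a sketch).

    `rewards` and `values` are 1-D tensors of length T;
    `last_value` is the value estimate for the state after the final step.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_advantage = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        last_advantage = delta + gamma * lam * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]
    return advantages
```
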
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb)