add docs

8ebddf33 · TomorrowIsAnOtherDay · 9e5ac477
Showing 4 changed files with 136 additions and 12 deletions

README.cn.md +1 -1

README.md +1 -1

docs/zh_CN/Overview.md +27 -10

docs/zh_CN/tutorial/quick_start.md +107 -0

--- a/README.cn.md
+++ b/README.cn.md
@@ -3,7 +3,7 @@
 </p>
 [English](./README.md) | 简体中文   
-[**Documentation**](https://parl.readthedocs.io)
+[**Documentation**](https://parl.readthedocs.io) | [**Chinese Documentation**](docs/zh_CN/Overview.md)
 > PARL is a high-performance, flexible reinforcement learning framework.
 # Features

--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 </p>
 English | [简体中文](./README.cn.md)   
-[**Documentation**](https://parl.readthedocs.io)
+[**Documentation**](https://parl.readthedocs.io) | [**中文文档**](docs/zh_CN/Overview.md)
 > PARL is a flexible and highly efficient reinforcement learning framework.

--- a/docs/zh_CN/Overview.md
+++ b/docs/zh_CN/Overview.md
@@ -37,15 +37,15 @@
    <td align="center" valign="middle">
      </td>
      <td>
+        <ul>
        <li><b>Tutorials</b></li>
-            <ul>
+           <ul>
-          <li><a href="docs/zh_CN/Tuner/BuiltinTuner.md#BOHB">Getting started: solving the CartPole problem</a></li>
+          <li><a href="tutorial/quick_start.md#quick_start">Getting started: solving the CartPole problem</a></li>
          <li><a href="docs/zh_CN/Tuner/BuiltinTuner.md#BOHB">Agent construction example</a></li>
          <li><a href="docs/zh_CN/Tuner/BuiltinTuner.md#BOHB">Saving and loading models</a></li>
          <li><a href="docs/zh_CN/Tuner/BuiltinTuner.md#BOHB">Plotting training curves</a></li>
-            </ul>
+           </ul>
        </ul>
-      </ul>
      </td>
      <td align="left" >
        <ul>
@@ -76,10 +76,10 @@
        <li><b>Tutorials</b></li>
            <ul><li><a href="docs/zh_CN/TrainingService/PaiMode.md">Deploying a cluster</a></li>
            <li><a href="docs/zh_CN/TrainingService/KubeflowMode.md">Getting-started tutorial</a></li>
-            <li><a href="docs/zh_CN/TrainingService/FrameworkControllerMode.md">Speed-up examples</a></li>
+            <li><a href="docs/zh_CN/TrainingService/.md">Speed-up examples</a></li>
+            <li><a href="docs/zh_CN/TrainingService/.md">Cluster monitoring</a></li>
+            <li><a href="docs/zh_CN/TrainingService/.md">How to debug</a></li>
            </ul>
-            <ul><li><a href="docs/zh_CN/TrainingService/DLTSMode.md">Cluster monitoring and debugging</a></li>        
-      </ul>
      </td>
    </tr>
  </tbody>
@@ -105,12 +105,29 @@ git clone https://github.com/PaddlePaddle/PARL
 cd PARL
 pip install .
 ```
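-If downloads are slow because of network problems, we recommend using the Tsinghua mirror (see the command above).
+If downloads are slow because of network problems, we recommend using the Tsinghua mirror (see the command above).<br>
-If git clone is slow, we recommend using our repository hosted on Gitee, a code-hosting platform in China.
+If you find git clone slow, we recommend using our repository hosted on Gitee, a code-hosting platform in China.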
 ```shell
 git clone https://gitee.com/paddlepaddle/PARL.git
 ```
 ### **About parallelism**
 If you only want to use PARL's parallel features, you do not need to install any deep learning framework.
\ No newline at end of file
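+## Contributing
+This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA), declaring that you have the right to, and actually do, grant us the rights to use your contribution.
+### Code contribution guidelines
+- Code style<br>
+PARL uses yapf to keep code style consistent. Usage: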
+```shell
+pip install yapf==0.24.0
+yapf -i modified_file.py
+```
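+- Continuous integration testing<br>
+When you add code, also add tests that cover it. Put the tests in the `tests` folder next to the code they cover, in a file whose name ends with `_test.py`, so that the continuous integration run picks them up automatically. See also: [example test](../../parl/tests/import_test.py)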
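As a rough sketch of the naming convention above, a test file could look like the following. The file name `example_test.py` and the helper `clip_reward` are hypothetical and exist only for illustration; they are not part of PARL, and the use of `unittest` here is an assumption about the test style.

```python
# Hypothetical file: <package>/tests/example_test.py
import unittest


def clip_reward(reward, low=-1.0, high=1.0):
    """Hypothetical helper under test: clamp a reward into [low, high]."""
    return max(low, min(high, reward))


class ClipRewardTest(unittest.TestCase):
    def test_inside_range_is_unchanged(self):
        self.assertEqual(clip_reward(0.5), 0.5)

    def test_outside_range_is_clamped(self):
        self.assertEqual(clip_reward(3.0), 1.0)
        self.assertEqual(clip_reward(-3.0), -1.0)
```

A file like this can be run locally with `python -m unittest example_test.py` before pushing.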
+## Feedback
+- [Open an issue](https://github.com/PaddlePaddle/PARL/issues) on GitHub.
\ No newline at end of file
--- a/docs/zh_CN/tutorial/quick_start.md
+++ b/docs/zh_CN/tutorial/quick_start.md
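+# **Tutorial: Solving the CartPole problem with PARL**
+This tutorial uses the code in the [example](~/parl/examples/QuickStart) to explain how to build an agent with PARL that solves the classic CartPole problem.
+Goal of this tutorial:
+- Get familiar with the submodules needed when building an agent with PARL.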
+## Model
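+**Model** mainly defines the forward network, usually a policy network or a value function network, whose input is the current environment state.
+First, we build a forward network with two fully connected layers.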
+```python
+import parl
+from parl import layers
+class CartpoleModel(parl.Model):
+    def __init__(self, act_dim):
+        # The hidden layer is 10x wider than the number of actions.
+        hid1_size = act_dim * 10
+        self.fc1 = layers.fc(size=hid1_size, act='tanh')
+        # Softmax output: one probability per action.
+        self.fc2 = layers.fc(size=act_dim, act='softmax')
+    def forward(self, obs):
+        out = self.fc1(obs)
+        out = self.fc2(out)
+        return out
+```
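+The three main steps to define a forward network:
+- Inherit from the `parl.Model` class
+- Declare the layers to be used in the constructor `__init__`
+- Build the network in the `forward` function
+Above, we first declared two fully connected layers and their activation functions in the constructor, then defined the forward computation in `forward`: a state goes in, passes through the two FC layers, and the output is a predicted probability for each action.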
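To make the forward computation concrete, here is a minimal sketch of the same two-layer structure in plain Python, without PARL or any deep learning framework. The helper names (`fc`, `softmax`, `init_params`), the random initialization, and the sizes (4 observation features, 2 actions) are assumptions for illustration, not PARL's API.

```python
import math
import random


def fc(inputs, weights, biases):
    """Fully connected layer: one output per column of `weights`."""
    return [
        sum(x * w_row[j] for x, w_row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(obs, params):
    """tanh hidden layer followed by a softmax output layer."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(v) for v in fc(obs, w1, b1)]
    return softmax(fc(hidden, w2, b2))


def init_params(obs_dim, act_dim, seed=0):
    """Small random weights and zero biases (an arbitrary choice)."""
    rng = random.Random(seed)
    hid1_size = act_dim * 10  # same hidden-width rule as the tutorial model
    w1 = [[rng.gauss(0, 0.1) for _ in range(hid1_size)] for _ in range(obs_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(act_dim)] for _ in range(hid1_size)]
    return w1, [0.0] * hid1_size, w2, [0.0] * act_dim


# Assumed sizes for illustration: 4 observation features, 2 actions.
params = init_params(obs_dim=4, act_dim=2)
act_prob = forward([0.01, -0.02, 0.03, 0.04], params)
print(act_prob)  # one probability per action
```

Because the last layer is a softmax, the outputs are positive and sum to 1, which is what lets the Agent treat them as action probabilities.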
+## Algorithm
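+**Algorithm** defines the concrete algorithm used to update the forward network (Model); that is, it updates the Model by defining a loss function. An Algorithm contains at least one Model. In this tutorial we use the classic PolicyGradient algorithm. Since the PARL repository already implements it, we only need to import and use it.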
+```python
+model = CartpoleModel(act_dim=2)
+algorithm = parl.algorithms.PolicyGradient(model, lr=1e-3)
+```
+After instantiating the Model, we pass it to the algorithm.
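For intuition about what "updating the Model by defining a loss function" means here, the sketch below computes a common form of the policy-gradient (REINFORCE-style) loss in plain Python: the batch mean of `-log pi(a|s) * return`. It is written from the textbook formula, not taken from PARL's source, and the function name `policy_gradient_loss` and the toy batch are hypothetical.

```python
import math


def policy_gradient_loss(act_probs, actions, returns):
    """Batch mean of -log pi(a_t | s_t) * G_t."""
    total = 0.0
    for probs, act, ret in zip(act_probs, actions, returns):
        total += -math.log(probs[act]) * ret
    return total / len(actions)


# Toy batch: action probabilities from the model, actions taken, returns.
act_probs = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
actions = [0, 1, 1]
returns = [1.0, 0.5, -1.0]

loss = policy_gradient_loss(act_probs, actions, returns)
print(loss)
```

Minimizing this quantity raises the probability of actions that were followed by positive returns and lowers it for actions followed by negative returns.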
+## Agent
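+**Agent** handles the interaction between the algorithm and the environment. During that interaction it feeds the generated data to the Algorithm to update the Model; data preprocessing is usually defined here as well.
+To implement our own Agent we inherit from `parl.Agent`. The full Agent code comes first, followed by a function-by-function explanation: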
+```python
+import numpy as np
+import paddle.fluid as fluid
+import parl
+from parl import layers
+class CartpoleAgent(parl.Agent):
+    def __init__(self, algorithm, obs_dim, act_dim):
+        self.obs_dim = obs_dim
+        self.act_dim = act_dim
+        super(CartpoleAgent, self).__init__(algorithm)
+    def build_program(self):
+        self.pred_program = fluid.Program()
+        self.train_program = fluid.Program()
+        with fluid.program_guard(self.pred_program):
+            obs = layers.data(
+                name='obs', shape=[self.obs_dim], dtype='float32')
+            self.act_prob = self.alg.predict(obs)
+        with fluid.program_guard(self.train_program):
+            obs = layers.data(
+                name='obs', shape=[self.obs_dim], dtype='float32')
+            act = layers.data(name='act', shape=[1], dtype='int64')
+            reward = layers.data(name='reward', shape=[], dtype='float32')
+            self.cost = self.alg.learn(obs, act, reward)
+    def sample(self, obs):
+        obs = np.expand_dims(obs, axis=0)
+        act_prob = self.fluid_executor.run(
+            self.pred_program,
+            feed={'obs': obs.astype('float32')},
+            fetch_list=[self.act_prob])[0]
+        act_prob = np.squeeze(act_prob, axis=0)
+        act = np.random.choice(range(self.act_dim), p=act_prob)
+        return act
+    def predict(self, obs):
+        obs = np.expand_dims(obs, axis=0)
+        act_prob = self.fluid_executor.run(
+            self.pred_program,
+            feed={'obs': obs.astype('float32')},
+            fetch_list=[self.act_prob])[0]
+        act_prob = np.squeeze(act_prob, axis=0)
+        act = np.argmax(act_prob)
+        return act
+    def learn(self, obs, act, reward):
+        act = np.expand_dims(act, axis=-1)
+        feed = {
+            'obs': obs.astype('float32'),
+            'act': act.astype('int64'),
+            'reward': reward.astype('float32')
+        }
+        cost = self.fluid_executor.run(
+            self.train_program, feed=feed, fetch_list=[self.cost])[0]
+        return cost
+```
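In the code above, `sample` and `predict` differ only in how an action is picked from `act_prob`: `sample` draws at random according to the probabilities, while `predict` takes the most probable action. The standalone sketch below isolates that difference using only the standard library; the function signatures are simplified assumptions and do not match the Agent methods.

```python
import random


def sample(act_prob, rng=random):
    """Exploration: draw an action index according to its probability."""
    return rng.choices(range(len(act_prob)), weights=act_prob, k=1)[0]


def predict(act_prob):
    """Evaluation: pick the most probable action index."""
    return max(range(len(act_prob)), key=lambda i: act_prob[i])


act_prob = [0.2, 0.8]
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample(act_prob, rng) for _ in range(1000)]

print(predict(act_prob))              # index of the larger probability
print(draws.count(1) / len(draws))    # fraction of draws that chose action 1
```

Random sampling is typically used during training so the agent keeps exploring, and the greedy choice is typically used during evaluation.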
+- Constructor