$$
L(\theta,\mathbf{w})=\sum_t \left( \hat{A}_t \log \pi_{\theta}(a_t|s_t) - \parallel b(s_t)-G_t^{(i)} \parallel^2 \right),
$$
We can then take the gradient of $L(\theta,\mathbf{w})$ with respect to $\theta$ and $\mathbf{w}$ to perform SGD updates.
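Below is a minimal sketch (not part of the original notes) of one such SGD step, assuming PyTorch and two hypothetical networks: `policy_net`, which outputs action logits, and `baseline_net`, which outputs $b(s)$. The advantage is detached so that the first term only contributes gradients through $\log \pi_{\theta}$, matching the formula above where $\hat{A}_t$ multiplies the log-probability as a fixed coefficient.

```python
# Minimal sketch of an SGD step on the objective above (illustrative names:
# policy_net, baseline_net, states, actions, returns are all assumptions).
import torch

def surrogate_loss(policy_net, baseline_net, states, actions, returns):
    """states: [T, obs_dim], actions: [T] (int), returns: [T] Monte Carlo returns G_t."""
    logits = policy_net(states)                          # action logits, [T, num_actions]
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baselines = baseline_net(states).squeeze(-1)         # b(s_t), shape [T]
    advantages = (returns - baselines).detach()          # \hat{A}_t treated as a constant
    pg_term = (advantages * log_probs).sum()             # sum_t \hat{A}_t log pi(a_t|s_t)
    baseline_term = ((baselines - returns) ** 2).sum()   # sum_t ||b(s_t) - G_t||^2
    return -(pg_term - baseline_term)                    # negate: SGD minimizes, L is maximized

# One SGD update over both parameter sets:
# optimizer = torch.optim.SGD(
#     list(policy_net.parameters()) + list(baseline_net.parameters()), lr=1e-3)
# loss = surrogate_loss(policy_net, baseline_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```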
## 7.2 N-step Estimators
In the derivation above, we approximated the policy gradient with Monte Carlo estimates of the return. However, if we have access to a value function (such as the baseline), we can also use TD methods, or a blend of TD and MC, for the policy gradient update:
$$
\hat{G}_t^{(1)} = r_t + \gamma V(s_{t+1})
$$
$$
\hat{G}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})
$$
$$
...
$$
$$
\hat{G}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots,
$$
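As a concrete illustration (a sketch with assumed conventions, not from the notes), the n-step return $\hat{G}_t^{(n)}$ can be computed from a recorded trajectory as follows, where `rewards[k]` holds $r_k$ and `values[k]` holds an estimate of $V(s_k)$, with one extra entry for the state reached at the end of the trajectory.

```python
# Illustrative helper: \hat{G}_t^{(n)} = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}).
# Assumes len(values) == len(rewards) + 1, i.e. values also covers the final state.
def n_step_return(rewards, values, t, n, gamma):
    g, discount = 0.0, 1.0
    for k in range(n):
        if t + k >= len(rewards):        # episode ended early: no bootstrap term
            return g                     # this is just the Monte Carlo tail
        g += discount * rewards[t + k]
        discount *= gamma
    return g + discount * values[t + n]  # bootstrap with gamma^n * V(s_{t+n})
```

Choosing a very large `n` recovers the Monte Carlo estimate $\hat{G}_t^{(\infty)}$, since the loop exhausts the episode before any bootstrapping happens.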
We can also use these estimators to compute the advantage function:
$$
\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)
$$
$$
\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)
$$
$$
...
$$
$$
\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - V(s_t),
$$
Here $\hat{A}_t^{(1)}$ is the pure TD estimate, which has low variance but high bias, while $\hat{A}_t^{(\infty)}$ is the pure MC estimate, which has zero bias but high variance. Choosing an intermediate $\hat{A}_t^{(k)}$ yields an estimator whose bias and variance both lie between these two extremes.
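Continuing the illustrative helper above, the n-step advantage simply subtracts the baseline value of the current state, so `n` acts as the knob that trades bias for variance:

```python
# n = 1 gives the TD estimate \hat{A}_t^{(1)}; a large n approaches the MC estimate.
def n_step_advantage(rewards, values, t, n, gamma):
    return n_step_return(rewards, values, t, n, gamma) - values[t]
```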