$$
L(\theta,\mathbf{w})=\sum_t \left( \hat{A}_t \log \pi_{\theta}(a_t|s_t) - \parallel b(s_t)-G_t^{(i)} \parallel^2 \right),
$$
We can then take the gradient of $L(\theta,\mathbf{w})$ with respect to $\theta$ and $\mathbf{w}$ to perform SGD updates.
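Below is a minimal sketch (not part of the original notes) of one such SGD step, assuming PyTorch and two hypothetical networks: `policy_net`, which outputs action logits, and `baseline_net`, which outputs $b(s)$. The advantage is detached so that the first term only contributes gradients through $\log \pi_{\theta}$, matching the formula above where $\hat{A}_t$ multiplies the log-probability as a fixed coefficient.

```python
# Minimal sketch of an SGD step on the objective above (illustrative names:
# policy_net, baseline_net, states, actions, returns are all assumptions).
import torch

def surrogate_loss(policy_net, baseline_net, states, actions, returns):
    """states: [T, obs_dim], actions: [T] (int), returns: [T] Monte Carlo returns G_t."""
    logits = policy_net(states)                          # action logits, [T, num_actions]
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baselines = baseline_net(states).squeeze(-1)         # b(s_t), shape [T]
    advantages = (returns - baselines).detach()          # \hat{A}_t treated as a constant
    pg_term = (advantages * log_probs).sum()             # sum_t \hat{A}_t log pi(a_t|s_t)
    baseline_term = ((baselines - returns) ** 2).sum()   # sum_t ||b(s_t) - G_t||^2
    return -(pg_term - baseline_term)                    # negate: SGD minimizes, L is maximized

# One SGD update over both parameter sets:
# optimizer = torch.optim.SGD(
#     list(policy_net.parameters()) + list(baseline_net.parameters()), lr=1e-3)
# loss = surrogate_loss(policy_net, baseline_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```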
## 7.2 N-step Estimators
In the derivation above, we approximated the policy gradient with Monte Carlo estimates of the return. However, if we have access to a value function (such as the baseline), we can also use TD methods, or a blend of TD and MC, for the policy gradient update:
$$
\hat{G}_t^{(1)} = r_t + \gamma V(s_{t+1})
$$
$$
\hat{G}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})
$$
$$
...
$$
$$
\hat{G}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots,
$$
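As a concrete illustration (a sketch with assumed conventions, not from the notes), the n-step return $\hat{G}_t^{(n)}$ can be computed from a recorded trajectory as follows, where `rewards[k]` holds $r_k$ and `values[k]` holds an estimate of $V(s_k)$, with one extra entry for the state reached at the end of the trajectory.

```python
# Illustrative helper: \hat{G}_t^{(n)} = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}).
# Assumes len(values) == len(rewards) + 1, i.e. values also covers the final state.
def n_step_return(rewards, values, t, n, gamma):
    g, discount = 0.0, 1.0
    for k in range(n):
        if t + k >= len(rewards):        # episode ended early: no bootstrap term
            return g                     # this is just the Monte Carlo tail
        g += discount * rewards[t + k]
        discount *= gamma
    return g + discount * values[t + n]  # bootstrap with gamma^n * V(s_{t+n})
```

Choosing a very large `n` recovers the Monte Carlo estimate $\hat{G}_t^{(\infty)}$, since the loop exhausts the episode before any bootstrapping happens.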
We can also use these estimators to compute the advantage function:
$$
\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)
$$
$$
\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)
$$
$$
...
$$
$$
\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - V(s_t),
$$
Here $\hat{A}_t^{(1)}$ is the pure TD estimate, which has low variance but high bias, while $\hat{A}_t^{(\infty)}$ is the pure MC estimate, which has zero bias but high variance. Choosing an intermediate $\hat{A}_t^{(k)}$ yields an estimator whose bias and variance both lie between these two extremes.
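Continuing the illustrative helper above, the n-step advantage simply subtracts the baseline value of the current state, so `n` acts as the knob that trades bias for variance:

```python
# n = 1 gives the TD estimate \hat{A}_t^{(1)}; a large n approaches the MC estimate.
def n_step_advantage(rewards, values, t, n, gamma):
    return n_step_return(rewards, values, t, n, gamma) - values[t]
```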