Commit 26834ab6 authored by X xiaowei_xing

test

Parent 0d09c068
@@ -200,9 +200,11 @@ $$
 $$
 $$
-= 2\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 b] - 2\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 r(tau)] = 0,
+= 2\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 b] - 2\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 r(\tau)] = 0,
 $$
 $$
-b = \frac{\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 r(tau)]}{\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2]}。
-$$
\ No newline at end of file
+b = \frac{\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 r(\tau)]}{\mathbb{E} [(\nabla_{\theta} \log \pi_{\theta}(\tau))^2]}。
+$$
+## 3. Off-Policy Policy Gradient
\ No newline at end of file
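The equation this commit corrects gives the variance-minimizing constant baseline $b = \mathbb{E}[(\nabla_{\theta} \log \pi_{\theta}(\tau))^2 r(\tau)] / \mathbb{E}[(\nabla_{\theta} \log \pi_{\theta}(\tau))^2]$. As a minimal illustrative sketch (not part of the commit), the Monte Carlo estimate of this baseline is a weighted average of returns, with the squared score as the weight; the arrays `grad_log_probs` and `returns` below are hypothetical stand-ins for per-trajectory values of $\nabla_{\theta} \log \pi_{\theta}(\tau)$ and $r(\tau)$:

```python
import numpy as np

# Hypothetical sampled trajectories: per-trajectory score values
# (stand-in for grad_theta log pi_theta(tau), here scalar) and returns r(tau).
rng = np.random.default_rng(0)
grad_log_probs = rng.normal(size=1000)
returns = rng.normal(loc=1.0, size=1000)

# Optimal constant baseline:
# b = E[(grad log pi)^2 * r] / E[(grad log pi)^2],
# i.e. a return average weighted by the squared score.
weights = grad_log_probs ** 2
b = np.sum(weights * returns) / np.sum(weights)

# b minimizes the second moment E[(grad log pi)^2 (r - b)^2] of the
# baselined estimator grad_log_prob * (r - b) over constant baselines,
# so subtracting it cannot do worse than using no baseline (b = 0).
est_with_b = grad_log_probs * (returns - b)
est_no_b = grad_log_probs * returns
print(np.mean(est_with_b ** 2) <= np.mean(est_no_b ** 2))
```

Since the objective is quadratic in $b$, this sample estimate is exactly the minimizer on the sampled data, which is what the final comparison checks.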