提交 0e39b2b1 编写于 作者: X xiaowei_xing

test

上级 e5b045f4
......@@ -58,13 +58,22 @@ $$
$$
$$
= \int \nabla_{theta} \pi _{\theta} (\tau) r(\tau) \text{d} \tau
= \int \nabla_{\theta} \pi _{\theta} (\tau) r(\tau) \text{d} \tau
$$
$$
\int \pi_{\theta}(\tau) \frac{\nabla_{theta} \pi_ {\theta} (\tau)}{\pi_{\theta}(\tau)}r(\tau)\text{d}\tau
= \int \pi_{\theta}(\tau) \frac{\nabla_{theta} \pi_ {\theta} (\tau)}{\pi_{\theta}(\tau)}r(\tau)\text{d}\tau
$$
$$
\mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]
= \mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]
$$
通过对数导数技巧,我们将梯度从期望之外转移到了期望之内。这样做的好处就是,我们不再需要对状态转移函数求梯度,正如下面我们将看到的。
$$
\nabla_{theta}J(\theta) = \mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]
$$
$$
= \mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{theta} [\logP(s_1)+\sum_{t=1}^{T}(\log\pi_{\theta}(a_t|s_t) + \log P(s_{t+1}|s_t,a_t))]r(\tau)]
$$
\ No newline at end of file
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册