diff --git a/docs/10.md b/docs/10.md
index b66c5d373efe814c053f36a31fd9250602844116..b59b4ec0a5a8f5d15ea8f21df58f3365dce03cff 100644
--- a/docs/10.md
+++ b/docs/10.md
@@ -58,13 +58,22 @@ $$
 $$
 
 $$
-= \int \nabla_{theta} \pi _{\theta} (\tau) r(\tau) \text{d} \tau
+= \int \nabla_{\theta} \pi _{\theta} (\tau) r(\tau) \text{d} \tau
 $$
 
 $$
-\int \pi_{\theta}(\tau) \frac{\nabla_{theta} \pi_ {\theta} (\tau)}{\pi_{\theta}(\tau)}r(\tau)\text{d}\tau
+= \int \pi_{\theta}(\tau) \frac{\nabla_{\theta} \pi_ {\theta} (\tau)}{\pi_{\theta}(\tau)}r(\tau)\text{d}\tau
 $$
 
 $$
-\mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]
+= \mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]
+$$
+
+Using the log-derivative trick, we move the gradient from outside the expectation to inside it. The benefit is that we no longer need to differentiate the state transition function, as we will see below.
+$$
+\nabla_{\theta}J(\theta) = \mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]
+$$
+
+$$
+= \mathbb{E}_ {\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta} [\log P(s_1)+\sum_{t=1}^{T}(\log\pi_{\theta}(a_t|s_t) + \log P(s_{t+1}|s_t,a_t))]r(\tau)]
 $$
\ No newline at end of file
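
Review note: the identity this hunk completes, $\nabla_{\theta}J(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta}(\tau)}[\nabla_{\theta}\log\pi_{\theta}(\tau)r(\tau)]$, can be sanity-checked numerically. The sketch below is not part of the patch and uses a toy setup that is purely an assumption for illustration: the "trajectory" is a single scalar drawn from a unit-variance Gaussian with mean `theta`, and the reward is `tau ** 2`, so that $J(\theta) = \theta^2 + 1$ and the exact gradient is $2\theta$. The function name `score_function_gradient` is hypothetical.

```python
import random


def score_function_gradient(theta, num_samples=200_000, seed=0):
    """Monte Carlo estimate of grad_theta E[r(tau)] via the log-derivative trick.

    Toy assumptions (not from the patched document):
      tau ~ Normal(theta, 1), so grad_theta log pi_theta(tau) = tau - theta
      r(tau) = tau ** 2,      so J(theta) = theta ** 2 + 1
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        tau = rng.gauss(theta, 1.0)   # sample a "trajectory" from pi_theta
        reward = tau ** 2             # r(tau)
        score = tau - theta           # grad_theta log pi_theta(tau)
        total += score * reward       # integrand inside the expectation
    return total / num_samples


theta = 1.5
estimate = score_function_gradient(theta)
exact = 2.0 * theta                   # d/dtheta (theta^2 + 1)
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

The estimator only needs samples from the policy and the gradient of its log-density. It never differentiates the reward or the sampling process itself, which is the property the added paragraph points to.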