diff --git a/docs/10.md b/docs/10.md
index 3c964c4918964032022b9917d79819714c097243..a3432bf8400e800ce27da68a4c006de6263ceb40 100644
--- a/docs/10.md
+++ b/docs/10.md
@@ -604,9 +604,9 @@ $\text{gradients = loss.gradients(loss, variables)}$
 
 5. J. Schulman et al, "Trust region policy optimization," *ICML*, 2015.
 
-## A TRPO 证明(TRPO Proofs)
+## A. TRPO 证明(TRPO Proofs)
 
-### A.1 奖励调整(Reward Shaping)
+### A.1 奖励调整(Reward Shaping)
 
 这里我们证明[引理 5.1](#lemma51)。
 