From ac85eaadb93f29ef8e49d29aaf8470956fcd189e Mon Sep 17 00:00:00 2001
From: xiaowei_xing <997427575@qq.com>
Date: Fri, 13 Dec 2019 15:34:28 +0900
Subject: [PATCH] test

---
 docs/11&12.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/docs/11&12.md b/docs/11&12.md
index 8e66d5c..55c542d 100644
--- a/docs/11&12.md
+++ b/docs/11&12.md
@@ -106,4 +106,25 @@ $$
 $$
 P[Q(a)>\hat{Q}_ {t}(a)+U_ {t}(a)] \leq e^{-2N_{t}(a)U_{t}(a)^{2}}。
 \tag{6}
-$$
\ No newline at end of file
+$$
+
+Choose a probability $p$ such that
+
+$$
+e^{-2N_{t}(a)U_{t}(a)^{2}} = p,
+\tag{7}
+$$
+
+$$
+U_{t}(a) = \sqrt{\frac{-\log p}{2N_{t}(a)}}.
+\tag{8}
+$$
+
+As we observe more rewards, we shrink $p$; in particular, choosing $p=t^{-4}$, so that $-\log p = 4\log t$, yields the UCB1 algorithm:
+
+$$
+a_{t} = \mathop{\arg\max}_ {a\in A}(Q(a)+\sqrt{\frac{2\log t}{N_{t}(a)}}),
+\tag{9}
+$$
+
+which guarantees asymptotically optimal action selection, i.e., it matches the lower bound of [[1]](\ref1) up to a constant factor.
\ No newline at end of file
-- GitLab
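As a side note, the UCB1 selection rule of equation (9) in the patched text can be sketched in code. This is a minimal illustration under assumed names (`ucb1_select`, a made-up three-armed Bernoulli bandit), not part of the patch or of any particular library:

```python
import math
import random

def ucb1_select(q, n, t):
    """Pick argmax_a Q(a) + sqrt(2 log t / N_t(a)), as in equation (9).

    q: list of estimated action values Q(a)
    n: list of play counts N_t(a)
    t: total number of plays so far
    """
    # Play every arm once first so that N_t(a) > 0 for all a.
    for a, count in enumerate(n):
        if count == 0:
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(2 * math.log(t) / n[a]))

# Tiny simulation on a 3-armed Bernoulli bandit (means chosen for illustration).
means = [0.2, 0.5, 0.8]
q = [0.0] * 3
n = [0] * 3
random.seed(0)
for t in range(1, 2001):
    a = ucb1_select(q, n, t)
    r = 1.0 if random.random() < means[a] else 0.0
    n[a] += 1
    q[a] += (r - q[a]) / n[a]  # incremental mean update of Q(a)
```

Because the exploration bonus shrinks as $\sqrt{2\log t / N_t(a)}$, suboptimal arms are pulled only $O(\log t)$ times, so most plays concentrate on the best arm.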