Commit 15ae0798 authored by xiaowei_xing


Parent 25c94b7d
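@@ -246,3 +246,9 @@ $\bullet$ Probability matching: select the action with the highest probability of being optimal, e.g. Thompson
$\bullet$ Information state space: construct and solve an augmented MDP, which therefore directly incorporates the value of information
## References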
1. <span id="ref1">T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," *Advances in Applied Mathematics*, 1985.</span>
2. <span id="ref2">P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," *Machine Learning*, 2002.</span>
3. <span id="ref3">R. I. Brafman and M. Tennenholtz, "R-max - a general polynomial time algorithm for near-optimal reinforcement learning," *Journal of Machine Learning Research*, 2002.</span>
\ No newline at end of file