1. S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," *Proceedings of the 14th International Conference on Artificial Intelligence and Statistics*, 2011.
2. P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," *Proceedings of the 21st International Conference on Machine Learning*, 2004.
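Here $\hat{A}_t^{(1)}$ is the pure TD estimate, which has low variance but high bias, while $\hat{A}_t^{(\text{inf})}$ is the pure MC estimate, which has zero bias but high variance. Choosing an intermediate $\hat{A}_t^{(k)}$ trades one for the other: as $k$ grows, the bias shrinks and the variance grows, so both quantities lie between the two extremes.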
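To make the trade-off concrete, here is a minimal sketch of a k-step advantage estimate. It assumes the common definition $\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k}) - V(s_t)$, truncated at the end of the episode. The function name `k_step_advantage` and its arguments are illustrative and do not come from the text above.

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """Sketch of a k-step advantage estimate at time step t.

    rewards: r_0 .. r_{T-1} for one episode.
    values:  V(s_0) .. V(s_T); values[T] is the value of the final
             state (assumed to be 0 if that state is terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    end = min(t + k, T)                      # truncate at the episode end
    discounts = gamma ** np.arange(end - t)  # 1, gamma, gamma^2, ...
    k_step_return = np.dot(discounts, rewards[t:end])
    # Bootstrap from the value estimate at the truncation point.
    return k_step_return + gamma ** (end - t) * values[end] - values[t]
```

With `k=1` the estimate relies almost entirely on the value function (the TD case). With `k` at least as large as the remaining episode length it uses only observed rewards plus the final state's value (the MC case).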
## References
1. https://blog.openai.com/evolution-strategies/
2. N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," *Proceedings of the IEEE International Conference on Robotics and Automation*, 2004.