Abstract
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that, given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 - δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide both model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
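To make the bandit claim concrete, the following is a minimal sketch of one elimination scheme whose total sample count scales as O((n/ε²) log(1/δ)). The abstract does not spell out the procedure, so the halving rule, the per-round constants, the function name `median_elimination`, and the simulated Bernoulli arms are all illustrative assumptions rather than the paper's stated method.

```python
import math
import random


def median_elimination(pull, n_arms, eps, delta):
    """Return (arm_index, total_pulls); `pull(i)` yields a reward in [0, 1].

    Illustrative sketch: constants and schedule are assumptions.
    """
    alive = list(range(n_arms))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    total_pulls = 0
    while len(alive) > 1:
        # Per-arm sample size for this round (assumed constants).
        m = math.ceil(4.0 / eps_l ** 2 * math.log(3.0 / delta_l))
        means = {a: sum(pull(a) for _ in range(m)) / m for a in alive}
        total_pulls += m * len(alive)
        # Keep the better half: drop arms below the median empirical mean.
        ranked = sorted(alive, key=lambda a: means[a], reverse=True)
        alive = ranked[: (len(ranked) + 1) // 2]
        # Tighten accuracy and confidence geometrically for the next round.
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return alive[0], total_pulls


# Hypothetical test bed: four Bernoulli arms with a fixed random seed.
rng = random.Random(0)
true_means = [0.9, 0.1, 0.1, 0.1]


def pull(i):
    return 1.0 if rng.random() < true_means[i] else 0.0


best, pulls = median_elimination(pull, len(true_means), eps=0.2, delta=0.1)
print(best, pulls)
```

Because the number of surviving arms halves each round while the per-arm sample size grows only geometrically, the total number of pulls in this sketch stays linear in n, with no extra log n factor.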
| Original language | English |
|---|---|
| Pages (from-to) | 1079-1105 |
| Number of pages | 27 |
| Journal | Journal of Machine Learning Research |
| Volume | 7 |
| State | Published - Jun 2006 |
| Title | Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems |