Abstract
Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs) have become the models of choice in reinforcement learning. This work explores a connection between mistake-bounded learning algorithms and computing a near-best strategy, from a restricted class of strategies, for a given POMDP. We show that if a class of strategies has a mistake-bounded learning algorithm that makes at most d mistakes, then there is an algorithm to compute a near-best strategy from the class in time polynomial in 1/ε (the accuracy parameter), log(1/δ) (the confidence parameter), and H (the horizon), and exponential in d (the mistake bound). The transformation assumes only the ability to execute actions in the POMDP and the ability to reset the POMDP to its initial state.
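The following is a minimal sketch, not the paper's algorithm, of the black-box access model the abstract describes: a candidate strategy's expected H-step return is estimated purely by executing actions and resetting the POMDP, and the strategy with the best estimate is returned. The interfaces `env.reset()`, `env.step()`, and `strategy.act()`, as well as the Hoeffding-style trial count (which assumes per-episode returns lie in [0, H]), are illustrative assumptions; in the paper's setting the candidate set would be derived from the mistake-bounded learner, which is where the exponential dependence on d could enter.

```python
import math


def estimate_return(env, strategy, horizon, num_trials):
    """Estimate a strategy's expected cumulative reward over `horizon` steps,
    using only the two operations the abstract assumes: executing actions
    and resetting the POMDP to its initial state. (Hypothetical interfaces.)"""
    total = 0.0
    for _ in range(num_trials):
        obs = env.reset()                   # reset the POMDP to its initial state
        episode_reward = 0.0
        for _ in range(horizon):
            action = strategy.act(obs)      # strategy maps observation to action
            obs, reward = env.step(action)  # execute the action in the POMDP
            episode_reward += reward
        total += episode_reward
    return total / num_trials


def near_best_strategy(env, candidates, horizon, epsilon, delta):
    """Return a candidate whose estimated return is within ~epsilon of the best,
    with probability at least 1 - delta, via a Hoeffding-style sample size
    (assumes each episode's return lies in [0, horizon])."""
    num_trials = math.ceil((2 * horizon ** 2 / epsilon ** 2)
                           * math.log(2 * len(candidates) / delta))
    best, best_value = None, -math.inf
    for strategy in candidates:
        value = estimate_return(env, strategy, horizon, num_trials)
        if value > best_value:
            best, best_value = strategy, value
    return best
```

The trial count grows with the number of candidates only logarithmically, so the overall running time in this sketch is driven by the size of the candidate set, consistent with the polynomial dependence on 1/ε, log(1/δ), and H stated in the abstract.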
| Original language | English |
| --- | --- |
| Pages | 183-192 |
| Number of pages | 10 |
| DOIs | |
| State | Published - 1999 |
| Event | Proceedings of the 1999 12th Annual Conference on Computational Learning Theory (COLT'99) - Santa Cruz, CA, USA. Duration: 6 Jul 1999 → 9 Jul 1999 |
Conference

| Conference | Proceedings of the 1999 12th Annual Conference on Computational Learning Theory (COLT'99) |
| --- | --- |
| City | Santa Cruz, CA, USA |
| Period | 6/07/99 → 9/07/99 |