Markov decision processes (MDPs) and partially observable MDPs (POMDPs) have become the models of choice in reinforcement learning. This work explores a connection between mistake-bounded learning algorithms and the problem of computing a near-best strategy, from a restricted class of strategies, for a given POMDP. We show that if a class of strategies has a mistake-bound algorithm that makes at most d mistakes, then there is an algorithm that computes a near-best strategy from the class in time polynomial in 1/ε (where ε is the accuracy parameter), log(1/δ) (where δ is the confidence parameter), and H (the horizon), and exponential in d (the mistake bound). Our transformation assumes only the ability to execute actions in the POMDP and the ability to reset the POMDP to its initial state.
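The access model in the last sentence (execute actions, reset to the initial state) can be illustrated with a minimal sketch. This is not the paper's transformation. It shows only how the H-step value of one fixed strategy could be estimated from resets and rollouts, with a sample count taken from the standard Hoeffding bound. The toy environment `TwoStatePOMDP`, its transition and observation probabilities, and the helper names `num_rollouts` and `estimate_value` are all hypothetical and chosen purely for illustration. Per-step rewards are assumed to lie in [0, r_max].

```python
import math
import random


def num_rollouts(epsilon, delta, horizon, r_max=1.0):
    """Rollouts sufficient, by Hoeffding's inequality, to estimate an H-step
    return to within epsilon with probability at least 1 - delta, assuming
    each per-step reward lies in [0, r_max]."""
    value_range = horizon * r_max
    return math.ceil(value_range ** 2 / (2 * epsilon ** 2) * math.log(2 / delta))


class TwoStatePOMDP:
    """Hypothetical toy environment exposing only reset() and step(action)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def _observe(self):
        # Noisy observation: reports the true state with probability 0.8.
        return self.state if self.rng.random() < 0.8 else 1 - self.state

    def reset(self):
        # Return to the fixed initial state and emit an observation.
        self.state = 0
        return self._observe()

    def step(self, action):
        # Reward 1 when the action matches the hidden state, else 0.
        reward = 1.0 if action == self.state else 0.0
        # The hidden state flips with probability 0.3.
        if self.rng.random() < 0.3:
            self.state = 1 - self.state
        return self._observe(), reward


def estimate_value(env, strategy, horizon, n):
    """Average H-step return of a memoryless strategy over n reset-and-run rollouts."""
    total = 0.0
    for _ in range(n):
        obs = env.reset()
        for _ in range(horizon):
            obs, reward = env.step(strategy(obs))
            total += reward
    return total / n


if __name__ == "__main__":
    H, eps, delta = 5, 0.5, 0.05
    n = num_rollouts(eps, delta, H)
    env = TwoStatePOMDP(seed=1)
    # Memoryless strategy that plays the last observation as its action.
    print(n, estimate_value(env, lambda obs: obs, H, n))
```

The sample count grows polynomially in 1/ε and H and logarithmically in 1/δ, which matches the shape of the dependence stated in the abstract for those parameters. The exponential dependence on d comes from the paper's own construction and is not modeled here.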
|Number of pages||10|
|State||Published - 1999|
|Event||Proceedings of the 1999 12th Annual Conference on Computational Learning Theory (COLT'99) - Santa Cruz, CA, USA|
Duration: 6 Jul 1999 → 9 Jul 1999