Reinforcement learning and mistake bounded algorithms

Yishay Mansour*

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review


Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs) have become the models of choice in reinforcement learning. This work explores a connection between mistake-bounded learning algorithms and computing a near-best strategy, from a restricted class of strategies, for a given POMDP. We show that if a class of strategies has a mistake-bound algorithm that makes at most d mistakes, then there is an algorithm that computes a near-best strategy from the class in time polynomial in 1/ε (ε being the accuracy parameter), log(1/δ) (δ being the confidence parameter), and H (the horizon parameter), and exponential in d (the mistake bound). Our transformation assumes only the ability to execute actions in the POMDP and the ability to reset the POMDP to its initial state.
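The access model in the abstract (execute actions, reset to the initial state) can be illustrated with a minimal sketch. It is not the paper's transformation: it only estimates the H-step return of one fixed strategy by repeated resets, using the standard Hoeffding bound to choose the number of episodes. The `ToyPOMDP` class, its dynamics, and all function names are hypothetical, and per-step rewards are assumed to lie in [0, 1].

```python
import math
import random


def episodes_needed(eps, delta, horizon):
    """Hoeffding sample size for episode returns bounded in [0, horizon].

    Guarantees |estimate - true value| <= eps with probability >= 1 - delta.
    """
    return math.ceil(horizon ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


class ToyPOMDP:
    """Hypothetical two-state environment exposing only reset() and step()."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._state = 0

    def reset(self):
        # Return to the fixed initial state and emit an observation.
        self._state = 0
        return self._observe()

    def _observe(self):
        # Noisy observation: reports the hidden state with probability 0.8.
        return self._state if self._rng.random() < 0.8 else 1 - self._state

    def step(self, action):
        # Reward 1 when the action matches the hidden state, else 0.
        reward = 1.0 if action == self._state else 0.0
        # The hidden state flips with probability 0.3.
        if self._rng.random() < 0.3:
            self._state = 1 - self._state
        return self._observe(), reward


def estimate_return(env, strategy, horizon, eps, delta):
    """Monte Carlo estimate of a strategy's expected H-step return."""
    m = episodes_needed(eps, delta, horizon)
    total = 0.0
    for _ in range(m):
        obs = env.reset()
        for _ in range(horizon):
            obs, reward = env.step(strategy(obs))
            total += reward
    return total / m
```

For example, `estimate_return(ToyPOMDP(), lambda obs: obs, horizon=5, eps=0.5, delta=0.1)` evaluates the strategy that acts on the latest observation. The episode count grows polynomially in 1/ε, log(1/δ), and H, which matches the dependence on those three parameters stated in the abstract; the exponential dependence on d comes from the paper's search over the strategy class, which this sketch does not attempt.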

Original language: English
Number of pages: 10
State: Published - 1999
Event: Proceedings of the 1999 12th Annual Conference on Computational Learning Theory (COLT'99) - Santa Cruz, CA, USA
Duration: 6 Jul 1999 to 9 Jul 1999


Conference: Proceedings of the 1999 12th Annual Conference on Computational Learning Theory (COLT'99)
City: Santa Cruz, CA, USA

