PAC bounds for Multi-armed Bandit and Markov Decision Processes

Eyal Even-Dar, Shie Mannor, Yishay Mansour

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 - δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound. We show how, given an algorithm for the PAC-model multi-armed bandit problem, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
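As a rough illustration of the bandit result (a minimal sketch, not the paper's verbatim pseudocode), the following Python code implements a median-elimination-style loop consistent with the O((n/ε²) log(1/δ)) bound stated above: each round samples every surviving arm a fixed number of times, discards the empirically worse half, and tightens the accuracy and confidence parameters geometrically. The `pull(arm)` callback and the [0, 1] reward range are assumptions made for this sketch.

```python
import math
import random

def median_elimination(arms, pull, epsilon, delta):
    """Return an arm whose expected reward is within epsilon of the
    best arm's, with probability at least 1 - delta.

    arms : iterable of arm identifiers
    pull : pull(arm) -> stochastic reward in [0, 1] (assumed interface)
    """
    surviving = list(arms)
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Sample each surviving arm often enough that its empirical mean
        # is within eps_l / 2 of the true mean with high probability.
        n_samples = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {a: sum(pull(a) for _ in range(n_samples)) / n_samples
                 for a in surviving}
        # Keep only the empirically better half of the arms.
        ranked = sorted(surviving, key=means.__getitem__, reverse=True)
        surviving = ranked[:(len(ranked) + 1) // 2]
        # Tighten accuracy and confidence geometrically; since the arm set
        # halves each round, total pulls stay O((n/eps^2) log(1/delta)).
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return surviving[0]

# Illustrative use with Bernoulli arms (hypothetical data):
true_means = [0.4, 0.5, 0.8, 0.45]
best = median_elimination(
    arms=range(len(true_means)),
    pull=lambda a: float(random.random() < true_means[a]),
    epsilon=0.2,
    delta=0.05,
)
```

In the MDP reduction described in the abstract, each backup of the simulated Value Iteration amounts to solving such a bandit instance over the actions at a state, which is where the improved dependence on the number of actions enters.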

Original language: English
Title of host publication: Computational Learning Theory - 15th Annual Conference on Computational Learning Theory, COLT 2002, Proceedings
Editors: Jyrki Kivinen, Robert H. Sloan
Publisher: Springer Verlag
Pages: 255-270
Number of pages: 16
ISBN (Electronic): 354043836X, 9783540438366
DOIs
State: Published - 2002
Event: 15th Annual Conference on Computational Learning Theory, COLT 2002 - Sydney, Australia
Duration: 8 Jul 2002 - 10 Jul 2002

Publication series

Name: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)
Volume: 2375
ISSN (Print): 0302-9743

Conference

Conference: 15th Annual Conference on Computational Learning Theory, COLT 2002
Country/Territory: Australia
City: Sydney
Period: 8/07/02 - 10/07/02
