TY - JOUR

T1 - Online stochastic shortest path with bandit feedback and unknown transition function

AU - Rosenberg, Aviv

AU - Mansour, Yishay

N1 - Publisher Copyright: © 2019 Neural Information Processing Systems Foundation. All rights reserved.

PY - 2019

Y1 - 2019

N2 - We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O}(L|X|\sqrt{|A|T}/\beta)$, where T is the number of episodes, X is the state space, A is the action space, and L is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O}(L^{3/2}|X||A|^{1/4}T^{3/4})$ that holds for an arbitrary transition function. To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function.

AB - We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O}(L|X|\sqrt{|A|T}/\beta)$, where T is the number of episodes, X is the state space, A is the action space, and L is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O}(L^{3/2}|X||A|^{1/4}T^{3/4})$ that holds for an arbitrary transition function. To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function.

UR - http://www.scopus.com/inward/record.url?scp=85090176722&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:85090176722

VL - 32

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

SN - 1049-5258

Y2 - 8 December 2019 through 14 December 2019

ER -