TY - GEN
T1 - Optimism in Face of a Context
T2 - 37th AAAI Conference on Artificial Intelligence, AAAI 2023
AU - Levy, Orin
AU - Mansour, Yishay
N1 - Publisher Copyright: Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2023/6/27
Y1 - 2023/6/27
N2 - We present regret minimization algorithms for stochastic contextual MDPs (CMDPs) under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three settings: the dynamics are known; the dynamics are unknown but independent of the context; and, most challenging, the dynamics are unknown and context-dependent. For the last setting, our algorithm obtains a regret bound of $\widetilde{O}\big((H + 1/p_{\min}) H |S|^{3/2} \sqrt{|A| T \log(\max\{|G|, |P|\}/\delta)}\big)$ with probability $1 - \delta$, where $P$ and $G$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, this is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge of the function class, such as linearity). We present a lower bound of $\Omega\big(\sqrt{T H |S| |A| \ln(|G|) / \ln(|A|)}\big)$ on the expected regret, which holds even when the dynamics are known. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains $\widetilde{O}(T^{3/4})$ regret.
AB - We present regret minimization algorithms for stochastic contextual MDPs (CMDPs) under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three settings: the dynamics are known; the dynamics are unknown but independent of the context; and, most challenging, the dynamics are unknown and context-dependent. For the last setting, our algorithm obtains a regret bound of $\widetilde{O}\big((H + 1/p_{\min}) H |S|^{3/2} \sqrt{|A| T \log(\max\{|G|, |P|\}/\delta)}\big)$ with probability $1 - \delta$, where $P$ and $G$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, this is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge of the function class, such as linearity). We present a lower bound of $\Omega\big(\sqrt{T H |S| |A| \ln(|G|) / \ln(|A|)}\big)$ on the expected regret, which holds even when the dynamics are known. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains $\widetilde{O}(T^{3/4})$ regret.
UR - http://www.scopus.com/inward/record.url?scp=85168239914&partnerID=8YFLogxK
U2 - 10.1609/aaai.v37i7.26025
DO - 10.1609/aaai.v37i7.26025
M3 - Conference contribution
AN - SCOPUS:85168239914
T3 - Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023
SP - 8510
EP - 8517
BT - AAAI-23 Technical Tracks 7
A2 - Williams, Brian
A2 - Chen, Yiling
A2 - Neville, Jennifer
PB - AAAI Press
Y2 - 7 February 2023 through 14 February 2023
ER -