Off-policy evaluation in partially observable environments

Guy Tennenholtz, Shie Mannor, Uri Shalit

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

43 Scopus citations

Abstract

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
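As a point of reference for the importance-sampling baseline mentioned in the abstract, below is a minimal sketch of trajectory-wise importance sampling for off-policy evaluation and of the confounding failure it suffers under partial observability. The environment, policy definitions, and variable names are illustrative assumptions for this sketch, not the paper's actual model or experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)

    def behavior_policy(obs, hidden):
        # Behavior policy (e.g., a clinician) acts on the *hidden* state,
        # which the logged data does not record (illustrative assumption).
        return int(hidden > 0.5) if rng.random() < 0.9 else int(rng.integers(2))

    def target_policy_prob(action, obs):
        # Target (evaluation) policy depends only on the observation.
        p_a1 = 0.8 if obs > 0.5 else 0.2
        return p_a1 if action == 1 else 1.0 - p_a1

    def behavior_policy_prob_obs_only(action, obs):
        # Importance sampling must model the behavior policy from observations
        # alone; this model is mis-specified because the true policy used `hidden`.
        p_a1 = 0.5  # marginally, actions look near-uniform given obs (assumption)
        return p_a1 if action == 1 else 1.0 - p_a1

    def simulate_trajectory(horizon=5):
        traj = []
        for _ in range(horizon):
            hidden = rng.random()  # unobserved confounder
            obs = float(np.clip(hidden + 0.3 * rng.standard_normal(), 0.0, 1.0))
            action = behavior_policy(obs, hidden)
            reward = 1.0 if action == int(hidden > 0.5) else 0.0
            traj.append((obs, action, reward))
        return traj

    def importance_sampling_estimate(trajectories):
        # Trajectory-wise IS: weight each return by the product of
        # pi_e(a|o) / pi_b(a|o). The estimate is biased here because the
        # logged actions actually depended on the unobserved state.
        values = []
        for traj in trajectories:
            weight, ret = 1.0, 0.0
            for obs, action, reward in traj:
                weight *= (target_policy_prob(action, obs)
                           / behavior_policy_prob_obs_only(action, obs))
                ret += reward
            values.append(weight * ret)
        return float(np.mean(values))

    trajectories = [simulate_trajectory() for _ in range(2000)]
    print("IS estimate of target-policy value:", importance_sampling_estimate(trajectories))

The mismatch between the true behavior policy (which conditions on the hidden state) and any observation-only model of it is exactly the confounding that makes naive importance sampling arbitrarily biased, which is the pitfall the abstract refers to; the paper's Decoupled POMDP formulation is aimed at mitigating this bias.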

Original language: English
Title of host publication: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
Publisher: AAAI Press
Pages: 10276-10283
Number of pages: 8
ISBN (Electronic): 9781577358350
DOIs
State: Published - 2020
Externally published: Yes
Event: 34th AAAI Conference on Artificial Intelligence, AAAI 2020 - New York, United States
Duration: 7 Feb 2020 - 12 Feb 2020

Publication series

Name: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence

Conference

Conference: 34th AAAI Conference on Artificial Intelligence, AAAI 2020
Country/Territory: United States
City: New York
Period: 7/02/20 - 12/02/20
