Bayesian incentive-compatible bandit exploration

Yishay Mansour*, Aleksandrs Slivkins*, Vasilis Syrgkanis*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


As self-interested individuals (“agents”) make decisions over time, they utilize information revealed by other agents in the past and produce information that may help agents in the future. This phenomenon is common in a wide range of scenarios in the Internet economy, as well as in medical decisions. Each agent would like to exploit (select the best action given the current information) but would prefer the previous agents to explore (try out various alternatives to collect information). A social planner, by means of a carefully designed recommendation policy, can incentivize the agents to balance exploration and exploitation so as to maximize social welfare. We model the planner's recommendation policy as a multiarmed bandit algorithm under incentive-compatibility constraints induced by agents' Bayesian priors. We design a bandit algorithm which is incentive-compatible and has asymptotically optimal performance, as expressed by regret. Further, we provide a black-box reduction from an arbitrary multiarmed bandit algorithm to an incentive-compatible one, with only a constant multiplicative increase in regret. This reduction works for very general bandit settings that incorporate contexts and arbitrary partial feedback.
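The core idea behind the reduction can be illustrated with a toy simulation. The sketch below (an illustration only, not the paper's exact construction, which calibrates phase lengths and exploration probabilities against the agents' Bayesian priors) hides a small number of exploration rounds inside phases of exploitation recommendations: a few randomly placed agents per phase receive the arm a base "explorer" wants to try, while everyone else is recommended the empirically best arm. Since each agent sees only its own recommendation, it cannot tell which case it is in. All names and parameters here (`simulate_bic_recommendations`, `explore_per_phase`, the round-robin explorer) are hypothetical choices for the demonstration.

```python
import random

def simulate_bic_recommendations(true_means, n_phases=20, phase_len=50,
                                 explore_per_phase=2, seed=0):
    """Toy sketch of hiding exploration inside exploitation recommendations.

    In each phase, `explore_per_phase` randomly placed agents are recommended
    the next arm a simple round-robin explorer wants to sample; all other
    agents in the phase get the empirically best arm so far.  Rewards are
    Bernoulli with the given true means.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [1] * n_arms                       # one seed sample per arm
    sums = [float(rng.random() < m) for m in true_means]
    total_reward = 0.0
    explorer_arm = 0                            # base algorithm: round-robin

    for _ in range(n_phases):
        # positions of the hidden exploration rounds within this phase
        explore_slots = set(rng.sample(range(phase_len), explore_per_phase))
        for t in range(phase_len):
            if t in explore_slots:
                arm = explorer_arm              # exploration, indistinguishable
                explorer_arm = (explorer_arm + 1) % n_arms
            else:                               # exploit the empirical best
                arm = max(range(n_arms), key=lambda a: sums[a] / counts[a])
            reward = float(rng.random() < true_means[arm])
            counts[arm] += 1
            sums[arm] += reward
            total_reward += reward
    return total_reward, counts
```

With a suitable prior, an agent recommended an arm cannot profitably deviate: the recommendation is far more likely to be the exploit arm than an exploration round, which is what makes the scheme Bayesian incentive-compatible in the paper's full treatment.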

Original language: English
Pages (from-to): 1132-1161
Number of pages: 30
Journal: Operations Research
Issue number: 4
State: Published - Jul 2020


Keywords

  • Bayesian incentive-compatibility
  • Mechanism design
  • Multiarmed bandits
  • Regret

