TY - JOUR

T1 - Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor

AU - Hansen, Thomas Dueholm

AU - Miltersen, Peter Bro

AU - Zwick, Uri

PY - 2013/2

Y1 - 2013/2

N2 - Ye [2011] recently showed that the simplex method with Dantzig's pivoting rule, as well as Howard's policy iteration algorithm, solve discounted Markov decision processes (MDPs) with a constant discount factor in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most O(mn/(1-γ) log(n/(1-γ))) iterations, where n is the number of states, m is the total number of actions in the MDP, and 0 < γ < 1 is the discount factor. We improve Ye's analysis in two respects. First, we improve the bound given by Ye and show that Howard's policy iteration algorithm actually terminates after at most O(m/(1-γ) log(n/(1-γ))) iterations. Second, and more importantly, we show that the same bound applies to the number of iterations performed by the strategy iteration (or strategy improvement) algorithm, a generalization of Howard's policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first strongly polynomial algorithm for solving these games, resolving a long-standing open problem. Combined with other recent results, this provides a complete characterization of the complexity of the standard strategy iteration algorithm for 2-player turn-based stochastic games: it is strongly polynomial for a fixed discount factor, and exponential otherwise.

AB - Ye [2011] recently showed that the simplex method with Dantzig's pivoting rule, as well as Howard's policy iteration algorithm, solve discounted Markov decision processes (MDPs) with a constant discount factor in strongly polynomial time. More precisely, Ye showed that both algorithms terminate after at most O(mn/(1-γ) log(n/(1-γ))) iterations, where n is the number of states, m is the total number of actions in the MDP, and 0 < γ < 1 is the discount factor. We improve Ye's analysis in two respects. First, we improve the bound given by Ye and show that Howard's policy iteration algorithm actually terminates after at most O(m/(1-γ) log(n/(1-γ))) iterations. Second, and more importantly, we show that the same bound applies to the number of iterations performed by the strategy iteration (or strategy improvement) algorithm, a generalization of Howard's policy iteration algorithm used for solving 2-player turn-based stochastic games with discounted zero-sum rewards. This provides the first strongly polynomial algorithm for solving these games, resolving a long-standing open problem. Combined with other recent results, this provides a complete characterization of the complexity of the standard strategy iteration algorithm for 2-player turn-based stochastic games: it is strongly polynomial for a fixed discount factor, and exponential otherwise.

KW - Algorithms

KW - Design

KW - Performance

UR - http://www.scopus.com/inward/record.url?scp=84874917679&partnerID=8YFLogxK

U2 - 10.1145/2432622.2432623

DO - 10.1145/2432622.2432623

M3 - Article

AN - SCOPUS:84874917679

SN - 0004-5411

VL - 60

JO - Journal of the ACM

JF - Journal of the ACM

IS - 1

M1 - 2432623

ER -