Memory augmented policy optimization for program synthesis and semantic parsing

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, Ni Lao

Research output: Contribution to journal › Conference article › peer-review

90 Scopus citations

Abstract

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimates. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization. Our key idea is to express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside a memory buffer, and a separate expectation over trajectories outside of the buffer. To design an efficient algorithm based on this idea, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to speed up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WIKITABLEQUESTIONS benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WIKISQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at goo.gl/TXBp4e.
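The decomposition described in the abstract can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' implementation: the five-trajectory space, the logits, the rewards, and the helper names (`softmax`, `expected_return`, `decomposed_return`, `clipped_weight`) are all hypothetical. The clipping rule shown, a lower bound `alpha` on the buffer weight, is an assumed form; the abstract names memory weight clipping but does not specify it.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy setup: five enumerable trajectories with sparse 0/1 rewards.
logits = [2.0, 0.5, 0.1, -1.0, -0.3]
rewards = [0.0, 1.0, 0.0, 1.0, 0.0]
probs = softmax(logits)

# The memory buffer stores the indices of the high-reward trajectories.
buffer = {1, 3}


def expected_return(probs, rewards):
    """Expected return computed directly over all trajectories."""
    return sum(p * r for p, r in zip(probs, rewards))


def decomposed_return(probs, rewards, buffer):
    """Expected return as a weighted sum of inside- and outside-buffer terms."""
    pi_b = sum(probs[i] for i in buffer)  # total policy probability of the buffer
    outside_idx = [i for i in range(len(probs)) if i not in buffer]
    # Expectations under the policy renormalized inside / outside the buffer.
    inside = sum(probs[i] / pi_b * rewards[i] for i in buffer)
    outside = sum(probs[i] / (1.0 - pi_b) * rewards[i] for i in outside_idx)
    return pi_b * inside + (1.0 - pi_b) * outside, pi_b


def clipped_weight(pi_b, alpha):
    """Assumed clipping rule: keep the buffer weight at or above alpha."""
    return max(pi_b, alpha)


full = expected_return(probs, rewards)
split, pi_b = decomposed_return(probs, rewards, buffer)
print(f"direct={full:.6f} decomposed={split:.6f} buffer_weight={pi_b:.6f}")
```

In this sketch the two quantities agree because the split only regroups the same sum. In practice the inside term could be computed by enumerating the buffer while the outside term is estimated by sampling, which is where a variance reduction would come from.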

Original language: English
Pages (from-to): 9994-10006
Number of pages: 13
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: 2 Dec 2018 → 8 Dec 2018

Funding

Funders                      Funder number
Israel Science Foundation    942/16
