From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, Kristina Toutanova

Research output: Contribution to journalConference articlepeer-review

1 Scopus citations


Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have been often coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use - via pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.

Original languageEnglish
JournalAdvances in Neural Information Processing Systems
StatePublished - 2023
Externally publishedYes
Event37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: 10 Dec 202316 Dec 2023


Dive into the research topics of 'From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces'. Together they form a unique fingerprint.

Cite this