TY - JOUR
T1 - Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
T2 - 41st International Conference on Machine Learning, ICML 2024
AU - Ghandeharioun, Asma
AU - Caciularu, Avi
AU - Pearce, Adam
AU - Dixon, Lucas
AU - Geva, Mor
N1 - Publisher Copyright:
Copyright 2024 by the author(s)
PY - 2024
Y1 - 2024
AB - Understanding the internal representations of large language models (LLMs) can help explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and multihop reasoning error correction.
UR - http://www.scopus.com/inward/record.url?scp=85203818410&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85203818410
SN - 2640-3498
VL - 235
SP - 15466
EP - 15490
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 21 July 2024 through 27 July 2024
ER -