TY - GEN
T1 - Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
T2 - 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
AU - Yona, Gal
AU - Aharoni, Roee
AU - Geva, Mor
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Factual questions can typically be answered correctly at different levels of granularity. For example, both “August 4, 1961” and “1961” are correct answers to the question “When was Barack Obama born?”. Standard question answering (QA) evaluation protocols, however, do not take this into account explicitly and instead compare a predicted answer against reference answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset. We evaluate models using a range of decoding methods on GRANOLA-EQ, including a new algorithm, Decoding with Response Aggregation (DRAG), which is geared towards aligning the answer granularity with the model's uncertainty. Our experiments show that large language models with standard decoding methods tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20-point increase in accuracy on average, which further increases for rare entities, revealing that standard evaluation and decoding schemes may underestimate the knowledge encapsulated in language models.
UR - http://www.scopus.com/inward/record.url?scp=85204450059&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204450059
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 6737
EP - 6751
BT - Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A2 - Ku, Lun-Wei
A2 - Martins, Andre F. T.
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
Y2 - 11 August 2024 through 16 August 2024
ER -