L-CiteEval: Do Long-Context Models Truly Leverage Context for Responding?
October 3, 2024
Authors: Zecheng Tang, Keyan Zhou, Juntao Li, Baibei Ji, Jianye Hou, Min Zhang
cs.AI
Abstract
Long-context models (LCMs) have made remarkable strides in recent years,
offering users great convenience for handling tasks that involve long context,
such as document summarization. As the community increasingly prioritizes the
faithfulness of generated results, merely ensuring the accuracy of LCM outputs
is insufficient, as it is quite challenging for humans to verify the results
from the extremely lengthy context. Yet, although some efforts have been made
to assess whether LCMs respond truly based on the context, these works either
are limited to specific tasks or heavily rely on external evaluation resources
like GPT-4.In this work, we introduce L-CiteEval, a comprehensive multi-task
benchmark for long-context understanding with citations, aiming to evaluate
both the understanding capability and faithfulness of LCMs. L-CiteEval covers
11 tasks from diverse domains, spanning context lengths from 8K to 48K, and
provides a fully automated evaluation suite. Through testing with 11
cutting-edge closed-source and open-source LCMs, we find that although these
models show minor differences in their generated results, open-source models
substantially trail behind their closed-source counterparts in terms of
citation accuracy and recall. This suggests that current open-source LCMs are
prone to responding based on their inherent knowledge rather than the given
context, posing a significant risk to the user experience in practical
applications. We also evaluate the RAG approach and observe that RAG can
significantly improve the faithfulness of LCMs, albeit with a slight decrease
in the generation quality. Furthermore, we discover a correlation between the
attention mechanisms of LCMs and the citation generation process.
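The citation accuracy and recall reported above can be understood as a set comparison between the context chunks a model cites and the gold evidence chunks. The sketch below illustrates that idea only; the function name, inputs, and exact metric definitions are assumptions, not the L-CiteEval implementation.

```python
def citation_precision_recall(predicted, gold):
    """Compare cited chunk indices against gold evidence chunk indices.

    predicted: iterable of chunk indices cited in the model response.
    gold: iterable of chunk indices that actually support the answer.
    Returns (precision, recall); empty sets score 0.0 by convention here.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Example: the model cites chunks {1, 2, 3}; the gold evidence is {2, 3, 4}.
# Two of three citations are correct, and two of three evidence chunks are
# cited, so precision = recall = 2/3.
p, r = citation_precision_recall([1, 2, 3], [2, 3, 4])
```

A fully automated suite like the one the paper describes would compute such scores directly from model outputs, with no external judge model in the loop.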