Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Abstract

Large language models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored contributing factor is the temporality of questions: whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves state-of-the-art performance on this task. Finally, we demonstrate the practical utility of evergreen classification in three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o's retrieval behavior.
