Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
May 27, 2025
Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
cs.AI
Abstract
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored contributing factor is the temporality of questions: whether they are evergreen (their answers remain stable over time) or mutable (their answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves state-of-the-art performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
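
As a rough illustration of what a lightweight multilingual evergreen classifier in the spirit of EG-E5 could look like (this is a sketch, not the authors' released model), the snippet below pairs a multilingual E5 encoder with a binary classification head. The checkpoint name `intfloat/multilingual-e5-base`, the mean-pooling choice, and the label convention (1 = evergreen, 0 = mutable) are assumptions for illustration only; the paper's abstract does not specify these details.

```python
# Hypothetical sketch of an evergreen/mutable question classifier:
# multilingual E5 encoder + linear head. Not the authors' implementation.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

BASE = "intfloat/multilingual-e5-base"  # assumed backbone, not confirmed by the abstract


class EvergreenClassifier(nn.Module):
    def __init__(self, model_name: str = BASE):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)  # 0 = mutable, 1 = evergreen

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token embeddings over the attention mask, as is typical for E5 models.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)


tokenizer = AutoTokenizer.from_pretrained(BASE)
model = EvergreenClassifier()

# E5-family models expect a "query: " prefix on inputs.
questions = [
    "query: What is the boiling point of water at sea level?",   # likely evergreen
    "query: Who is the current president of France?",            # likely mutable
]
batch = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(batch["input_ids"], batch["attention_mask"]), dim=-1)
print(probs)  # head is untrained here; fine-tune on EverGreenQA labels before use
```

In practice the head would be fine-tuned on the EverGreenQA labels with a standard cross-entropy objective; the classifier's evergreen probability could then be used for the downstream applications mentioned above, such as filtering QA datasets or gating retrieval.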