Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Abstract

Large language models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions: whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves state-of-the-art (SoTA) performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o's retrieval behavior.
