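Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Abstract

Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions: whether a question is evergreen (its answer remains stable over time) or mutable (its answer changes over time). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves state-of-the-art (SoTA) performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.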

