Learning to Reason for Hallucination Span Detection

Abstract

Large language models (LLMs) often generate hallucinations: unsupported content that undermines reliability. While most prior work frames hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision-making process. This naturally raises the question of whether explicit reasoning can help with the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate the reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
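
To make the idea of a span-level reward concrete, the following is a minimal sketch and not the paper's implementation. It assumes hallucinated spans are given as half-open character offsets and scores a prediction by character-overlap F1; the function names, the offset representation, and the handling of the empty case are all hypothetical choices made for illustration. The second helper shows the group-wise reward standardization commonly associated with Group Relative Policy Optimization; Class-Aware Policy Optimization is not modeled here.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # assumed format: half-open character offsets [start, end)


def covered_positions(spans: List[Span]) -> Set[int]:
    """Return the set of character positions covered by any span."""
    positions: Set[int] = set()
    for start, end in spans:
        positions.update(range(start, end))
    return positions


def span_f1_reward(predicted: List[Span], gold: List[Span]) -> float:
    """Character-overlap F1 between predicted and gold spans (illustrative)."""
    pred_pos = covered_positions(predicted)
    gold_pos = covered_positions(gold)
    if not pred_pos and not gold_pos:
        # Assumption: correctly predicting "no hallucination" earns full reward.
        return 1.0
    overlap = len(pred_pos & gold_pos)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_pos)
    recall = overlap / len(gold_pos)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gold = [(5, 15)]
    # Three hypothetical sampled predictions for the same input.
    samples = [[(0, 10)], [(5, 15)], []]
    rewards = [span_f1_reward(pred, gold) for pred in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, a response whose spans overlap the gold annotation more closely receives a higher reward, and therefore a higher advantage relative to the other samples in its group.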

