Learning to Reason for Factuality

Abstract

Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has used automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly using such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or less relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and we apply online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate and a 23% increase in answer detail level, with no degradation in overall response helpfulness.
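To make the idea of a reward that jointly considers factual precision, detail level, and relevance more concrete, here is a minimal Python sketch. It is not the reward function from the paper: the `JudgedResponse` fields, the log-scaled detail bonus, the `detail_weight` and `irrelevant_penalty` constants, and the way the three terms are combined are all illustrative assumptions. The claim counts and the relevance flag are assumed to come from some external verifier and judge that are not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Hypothetical output of an external claim verifier and relevance judge."""
    supported_claims: int  # claims the verifier marked as supported
    total_claims: int      # all atomic claims extracted from the response
    is_relevant: bool      # whether the response addresses the question


def factuality_reward(r: JudgedResponse,
                      detail_weight: float = 0.1,
                      irrelevant_penalty: float = -1.0) -> float:
    """Combine precision, detail, and relevance into one scalar reward.

    The functional form and the constants are assumptions for illustration.
    """
    # Irrelevant or claim-free answers get a fixed penalty, so the policy
    # cannot raise precision by dodging the question or saying nothing.
    if not r.is_relevant or r.total_claims == 0:
        return irrelevant_penalty
    precision = r.supported_claims / r.total_claims
    # Log-scaled bonus on supported claims: rewards detail with diminishing returns.
    detail = math.log1p(r.supported_claims)
    return precision + detail_weight * detail


short = JudgedResponse(supported_claims=2, total_claims=2, is_relevant=True)
detailed = JudgedResponse(supported_claims=18, total_claims=20, is_relevant=True)
off_topic = JudgedResponse(supported_claims=5, total_claims=5, is_relevant=False)

for name, resp in [("short", short), ("detailed", detailed), ("off_topic", off_topic)]:
    print(name, round(factuality_reward(resp), 3))
```

Under a precision-only reward, the two-claim answer (precision 1.0) would beat the twenty-claim answer (precision 0.9), which is the kind of pressure toward less detailed responses that the abstract describes as reward hacking. In this sketch the detail term and the relevance gate push against that pressure; the particular weights would need tuning in any real setup.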
