Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Abstract

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but they typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines, which are expensive to deploy at RL scale and often unreliable for facts about rare entities, precisely where accurate reward signals matter most. We propose CorVer (Corpus Verify), a lightweight, plug-in process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline in every cell, with an average gain of +4.1 pp on TriviaQA. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8x to 8.4x faster.
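
To make the step from sentence-level credit to token-level advantages concrete, here is a minimal sketch of one way such an alignment could work. It is not the paper's implementation: the function names, the additive combination with a response-level advantage, the `weight` parameter, the whitespace tokenization, and the example reward values are all assumptions chosen for illustration. In this sketch each token simply inherits the reward of the sentence whose character span contains the token's start offset.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def whitespace_token_spans(text: str) -> List[Span]:
    """Character spans of whitespace-separated tokens (stand-in for a real tokenizer)."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def naive_sentence_spans(text: str) -> List[Span]:
    """Character spans of sentences, split naively on ., ! or ? (illustrative only)."""
    return [m.span() for m in re.finditer(r"[^.!?]+[.!?]", text)]


def token_advantages(
    token_spans: List[Span],
    sentence_spans: List[Span],
    sentence_rewards: List[float],
    response_advantage: float,
    weight: float = 0.5,  # hypothetical mixing coefficient, not from the paper
) -> List[float]:
    """Broadcast each sentence's reward to the tokens that start inside it.

    Tokens that fall outside every sentence span keep the plain
    response-level advantage.
    """
    advantages = []
    for tok_start, _tok_end in token_spans:
        bonus = 0.0
        for (s_start, s_end), reward in zip(sentence_spans, sentence_rewards):
            if s_start <= tok_start < s_end:
                bonus = reward
                break
        advantages.append(response_advantage + weight * bonus)
    return advantages


# Toy example: the first sentence is assumed to be corpus-supported (+1.0),
# the second unsupported (-1.0). These scores are made up for illustration.
text = "Paris is in France. It is in Asia."
tokens = whitespace_token_spans(text)
sentences = naive_sentence_spans(text)
example = token_advantages(tokens, sentences, [1.0, -1.0], response_advantage=0.25)
print(example)
```

With a real model, the token spans would more plausibly come from the tokenizer's offset mapping than from whitespace splitting, and the sentence rewards would come from the corpus lookup rather than being hard-coded.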