CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Acting and Learning on Knowledge-Intensive Tasks

Abstract

LLM agents in knowledge-intensive question answering take retrieval and reasoning actions with incomplete knowledge of whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over-retrieving when the evidence in hand already suffices, which wastes compute. To give agents a more complete picture of the state space they operate in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent's state with two additional signals: a calibrated self-confidence score and a grounding verifier score. We show that CalVerT improves agents in both training-free and training-based settings. On four QA benchmarks, CalVerT raises F1 by triggering retrieval when agents over-rely on parametric knowledge, while cutting redundant retrieval when agents already have sufficient context to answer. In the training-free setting, CalVerT augments existing QA frameworks without any additional training. CalVerT also improves trained systems: simply augmenting an agent's state with telemetry yields improvements after reinforcement learning, compared with an agent that receives identical training but no CalVerT telemetry.
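
The state augmentation described in the abstract can be illustrated with a minimal sketch. It assumes the agent's state is text and that the two scores are simply appended to it. The `Telemetry` class, the `augment_state` function, the `[telemetry]` block format, and the example question are hypothetical names and values chosen for illustration; the abstract does not specify how the scores are computed or rendered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical container for the two scores named in the abstract."""
    confidence: float  # calibrated self-confidence in the current answer, in [0, 1]
    grounding: float   # verifier score for how well the evidence supports it, in [0, 1]


def augment_state(state: str, telemetry: Telemetry) -> str:
    """Append a telemetry block to the agent's textual state (assumed format)."""
    for name, value in (("confidence", telemetry.confidence),
                        ("grounding", telemetry.grounding)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (
        f"{state}\n"
        "[telemetry]\n"
        f"calibrated_confidence: {telemetry.confidence:.2f}\n"
        f"grounding_score: {telemetry.grounding:.2f}"
    )


if __name__ == "__main__":
    # Illustrative values only: a confident draft answer with weak grounding.
    state = "Question: Who wrote Dubliners?\nDraft answer: James Joyce"
    print(augment_state(state, Telemetry(confidence=0.91, grounding=0.35)))
```

In this illustration the draft answer carries high confidence but a low grounding score, which is the kind of state in which the abstract reports that retrieval is triggered. The sketch only exposes the scores; what the agent does with them is left to the agent, whether prompted or trained.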