CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Acting and Learning in Knowledge-Intensive Tasks

Abstract

LLM agents in knowledge-intensive question answering take retrieval and reasoning actions without complete knowledge of whether their current answer is uncertain, unsupported, or already complete. This produces two failure modes: committing to confident but unsupported answers, which hurts accuracy, and over-retrieving when the evidence in hand already suffices, which wastes compute. To give agents a more complete picture of the state space they operate in, we introduce calibrated verifier telemetry (CalVerT), which augments the agent's state with two additional signals: a calibrated self-confidence score and a grounding verifier score. We show that CalVerT can improve agents in both training-free and training-based settings. On four QA benchmarks, we find that CalVerT raises F1 by triggering retrieval when agents over-rely on parametric knowledge, while cutting redundant retrieval when agents already have sufficient context to answer. CalVerT can augment existing QA frameworks without any training. It also improves trained systems: simply augmenting an agent's state with telemetry yields improvements after reinforcement learning, compared with an identically trained agent that lacks CalVerT telemetry.
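The idea of augmenting an agent's state with the two telemetry signals can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `AgentState` and `Telemetry` types, the text rendering in `augment_state`, the threshold rule in `suggest_action`, and the threshold values themselves. The abstract does not say how the scores are computed or how the agent consumes them, so both scores are simply taken as given numbers in [0, 1].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    """Hypothetical agent state: question, retrieved passages, draft answer."""
    question: str
    passages: List[str] = field(default_factory=list)
    draft_answer: str = ""


@dataclass
class Telemetry:
    """The two signals named in the abstract, assumed here to lie in [0, 1]."""
    confidence: float  # calibrated self-confidence score
    grounding: float   # grounding verifier score


def augment_state(state: AgentState, telemetry: Telemetry) -> str:
    """Render the state plus telemetry as text an agent could condition on.

    The text format is an assumption, not the paper's prompt.
    """
    lines = [
        f"Question: {state.question}",
        f"Retrieved passages: {len(state.passages)}",
        f"Draft answer: {state.draft_answer or '(none)'}",
        f"Telemetry: confidence={telemetry.confidence:.2f}, "
        f"grounding={telemetry.grounding:.2f}",
    ]
    return "\n".join(lines)


def suggest_action(
    telemetry: Telemetry,
    conf_threshold: float = 0.7,    # hypothetical value
    ground_threshold: float = 0.5,  # hypothetical value
) -> str:
    """Illustrative threshold rule, not the paper's policy."""
    # Weak grounding: the answer is unsupported, however confident
    # the agent is, so fetch more evidence.
    if telemetry.grounding < ground_threshold:
        return "retrieve"
    # Grounded and confident: further retrieval would be redundant.
    if telemetry.confidence >= conf_threshold:
        return "answer"
    # Grounded but uncertain: reason over the evidence already in hand.
    return "reason"


if __name__ == "__main__":
    state = AgentState(question="Who wrote Dune?", draft_answer="Frank Herbert")
    telemetry = Telemetry(confidence=0.9, grounding=0.2)
    print(augment_state(state, telemetry))
    print(suggest_action(telemetry))
```

In this sketch a confident but weakly grounded answer leads to retrieval, which corresponds to the first failure mode, while a confident and well-grounded answer stops retrieval, which corresponds to the second. A trained agent would presumably learn such behavior from the augmented state rather than follow fixed thresholds.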