When Agents Commit Too Early: Diagnosing Premature Commitment in LLM Agents

Abstract

Long-horizon LLM agents can fail quietly: they settle on one reading of the evidence early, then spend the rest of the run defending it. We call this premature commitment. Final-answer scoring misses this failure mode because it sees only the answer, not whether the process has already collapsed onto a stable path. We define representational commitment as cross-run hidden-state convergence at a fixed reasoning step, and use it as an early diagnostic of trajectory consistency. On Llama-3.1-70B running ReAct on HotpotQA, step-4 hidden-state similarity predicts downstream behavioral consistency (r = -0.35, partial r = -0.45), with a signature that is localized both in time and across layers. The pattern replicates on Qwen-2.5-72B and Phi-3-14B, and the correlation is stronger on StrategyQA (r = -0.83). The signal does not track correctness: committed-wrong and committed-correct questions are not separable in activation similarity. That boundary is central to the claim. Commitment tells us whether an agent has settled, not whether it is right. A runtime monitor detects inconsistent trajectories from hidden states at AUROC up to 0.97 (0.85 to 0.88 under a stricter split), and a prompting intervention cuts behavioral variance by 28% against a token-matched control while leaving accuracy statistically unchanged. We also test whether the signal can route self-consistency compute; on a harder benchmark it helps only modestly and is matched by a simpler output-based baseline. The result is a diagnostic for a hidden process failure with clear limits, not a general accuracy lever.
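
As a rough illustration of the definition above, cross-run convergence at a fixed step can be summarized as the mean pairwise similarity of hidden states collected from independent runs on the same question. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the use of cosine similarity, the function names `cosine` and `commitment_score`, and the toy vectors are all hypothetical, and the abstract does not specify which layer, token position, or similarity measure is used.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def commitment_score(step_states):
    """Mean pairwise cosine similarity across runs at one reasoning step.

    step_states holds one hidden-state vector per independent run, all taken
    at the same step (and, by assumption here, the same layer and position).
    """
    if len(step_states) < 2:
        raise ValueError("need at least two runs to measure convergence")
    pairs = list(combinations(step_states, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 3-dimensional stand-ins for hidden states from three runs (assumed data).
converged = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
diverged = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(commitment_score(converged) > commitment_score(diverged))
```

In this toy setup the runs that point in the same direction score higher than the runs that point in orthogonal directions. A high score only indicates that the runs have settled on similar representations; consistent with the abstract, it says nothing about whether the settled answer is correct.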