Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Abstract

As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that, across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (coefficient of variation, CV: 15.2%) and the highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with the lowest accuracy (4%). Within a single model, however, consistency amplifies outcomes rather than guaranteeing correctness: it reinforces correct and incorrect interpretations alike. Of Claude's failures, 71% stem from "consistent wrong interpretation", that is, making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves early strategic agreement similar to Claude's (diverging at step 3.4 vs. 3.2, respectively) but exhibits 2.1× higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
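
The variance figures above are reported as a coefficient of variation (CV), the standard deviation divided by the mean. The following is a minimal sketch of how such a figure could be computed. It assumes, since the abstract does not say so, that the measured quantity is the number of agent steps per run, that the CV is taken per task and then averaged across tasks, and that the population standard deviation is used. The step counts and task names are hypothetical and are not data from the study.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the CV (population std / mean) as a percentage."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(values) / m


def mean_task_cv(runs_by_task):
    """Average the per-task CV over all tasks (an assumed aggregation)."""
    return mean(coefficient_of_variation(v) for v in runs_by_task.values())


# Hypothetical step counts: 2 tasks x 5 runs each (not data from the paper).
steps = {
    "task_a": [10, 10, 10, 10, 10],  # identical runs, so CV is 0
    "task_b": [8, 12, 10, 9, 11],    # some run-to-run variation
}

overall_cv = mean_task_cv(steps)

# Ratio of the two CVs reported in the abstract (GPT-5 vs. Claude).
variance_ratio = 32.2 / 15.2
```

Under these assumptions, a CV of 0 means every run of a task took the same number of steps, and `variance_ratio` reproduces the roughly 2.1× gap between GPT-5 and Claude quoted in the abstract.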
