When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Abstract

Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem but to sample plausible, boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions (no reflection, bounded reflection, and native reasoning) across two primary model families, and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 with native reasoning ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 with bounded reflection recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.
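To make the solver-versus-sampler distinction concrete, the following is a minimal sketch of how terminal outcomes could be summarized per condition. It is an illustration under stated assumptions, not the evaluation code behind the results above: the outcome labels (`compromise`, `authority_decision`, `deadlock`), the function name `summarize_outcomes`, and both sample lists are hypothetical.

```python
import math
from collections import Counter


def summarize_outcomes(outcomes):
    """Return (compromise_rate, entropy_bits) for a list of terminal-outcome labels.

    The label "compromise" is an assumed name for compromise endings;
    entropy is the Shannon entropy of the label distribution, in bits.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    counts = Counter(outcomes)
    total = len(outcomes)
    compromise_rate = counts.get("compromise", 0) / total
    # Sum p * log2(1/p) over observed labels; a single label yields exactly 0.0.
    entropy_bits = sum(
        (c / total) * math.log2(total / c) for c in counts.values()
    )
    return compromise_rate, entropy_bits


# Hypothetical data: every run ends in an authority decision (a collapsed sampler).
collapsed_runs = ["authority_decision"] * 15
# Hypothetical data: a mix of endings (a more varied sampler).
mixed_runs = ["compromise"] * 8 + ["authority_decision"] * 4 + ["deadlock"] * 3

collapsed_rate, collapsed_entropy = summarize_outcomes(collapsed_runs)
mixed_rate, mixed_entropy = summarize_outcomes(mixed_runs)
print(collapsed_rate, collapsed_entropy)
print(mixed_rate, mixed_entropy)
```

A condition whose runs all terminate the same way scores zero on both measures, regardless of how much the intermediate dialogue varies. That is one way the diversity-without-fidelity pattern could show up in a summary of this kind.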

