Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a foundation model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench comprises two complementary games: Matching Pairs, in which card identities briefly revealed at specific locations must later be recalled, and 3D Maze, in which egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol that controls for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier multimodal large language models (MLLMs). Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
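To make the non-Markov structure of Matching Pairs concrete, the following is a minimal toy sketch, not the RNG-Bench implementation. The names `MatchingPairs`, `play_with_memory`, and `play_without_memory` are hypothetical, and flipping both cards in a single action is a simplifying assumption. The environment returns only the identities revealed on the current turn, so a policy can act well only if it carries earlier observations forward itself.

```python
import random


class MatchingPairs:
    """Toy game (hypothetical): cards are visible only on the turn they are flipped."""

    def __init__(self, n_pairs, seed=0):
        rng = random.Random(seed)
        self.cards = list(range(n_pairs)) * 2
        rng.shuffle(self.cards)
        self.matched = [False] * len(self.cards)

    def flip(self, i, j):
        """Flip two distinct unmatched positions; return the two identities seen."""
        assert i != j and not self.matched[i] and not self.matched[j]
        if self.cards[i] == self.cards[j]:
            self.matched[i] = self.matched[j] = True
        return self.cards[i], self.cards[j]

    def done(self):
        return all(self.matched)


def find_known_pair(seen, matched):
    """Return two remembered, still-unmatched positions with equal identity, if any."""
    first = {}
    for pos, ident in seen.items():
        if matched[pos]:
            continue
        if ident in first:
            return first[ident], pos
        first[ident] = pos
    return None


def play_with_memory(env):
    """Policy that remembers every identity it has seen (position -> identity)."""
    seen, turns = {}, 0
    while not env.done():
        pair = find_known_pair(seen, env.matched)
        if pair is None:
            # No remembered match: explore the next two never-seen positions.
            unseen = [p for p in range(len(env.matched)) if p not in seen]
            pair = (unseen[0], unseen[1])
        a, b = env.flip(*pair)
        seen[pair[0]], seen[pair[1]] = a, b
        turns += 1
    return turns


def play_without_memory(env, seed=0):
    """Policy that conditions only on which positions are still unmatched."""
    rng, turns = random.Random(seed), 0
    while not env.done():
        open_positions = [p for p, m in enumerate(env.matched) if not m]
        env.flip(*rng.sample(open_positions, 2))
        turns += 1
    return turns


if __name__ == "__main__":
    print("with memory:   ", play_with_memory(MatchingPairs(8, seed=1)))
    print("without memory:", play_without_memory(MatchingPairs(8, seed=1), seed=2))
```

In this toy setting the remembering policy needs at most twice as many turns as there are pairs, because each position is explored once and each pair is then matched once. The memoryless policy has no such bound and can only guess among the remaining positions.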