Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a foundation model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench comprises two complementary games: Matching Pairs, in which card identities briefly revealed at specific locations must later be recalled, and 3D Maze, in which egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol that controls for instance-level variance, and a Memory Gap metric that separates forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and frontier multimodal large language models (MLLMs) remain far from saturating them. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
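To make the non-Markov property of Matching Pairs concrete, the following is a minimal Python sketch under stated assumptions, not the RNG-Bench implementation. The class `MatchingPairsToy`, the two policies, and the observation fields are hypothetical names introduced only for illustration. Each observation describes a single flipped card, so a policy can only complete pairs efficiently if it keeps its own record of earlier reveals.

```python
import random


class MatchingPairsToy:
    """Hypothetical toy pairs game for illustration; not the RNG-Bench code."""

    def __init__(self, n_pairs=8, seed=0):
        self.cards = [i // 2 for i in range(2 * n_pairs)]
        random.Random(seed).shuffle(self.cards)
        self.matched = set()   # positions already removed from play
        self.pending = None    # first card of the current turn, if any

    @property
    def done(self):
        return len(self.matched) == len(self.cards)

    def flip(self, pos):
        """Reveal one card. The observation describes this flip only."""
        if pos in self.matched or pos == self.pending:
            raise ValueError(f"illegal flip at {pos}")
        obs = {"pos": pos, "card": self.cards[pos], "matched": False}
        if self.pending is None:
            self.pending = pos
        else:
            if self.cards[self.pending] == self.cards[pos]:
                self.matched.update((self.pending, pos))
                obs["matched"] = True
            self.pending = None
        return obs


def known_pair(seen):
    """Return two remembered positions holding the same card, if any."""
    by_card = {}
    for pos, card in seen.items():
        if card in by_card:
            return by_card[card], pos
        by_card[card] = pos
    return None, None


def play_with_memory(env):
    """Policy that records every reveal; returns the number of flips used."""
    seen = {}                              # position -> card, unmatched only
    unseen = list(range(len(env.cards)))   # positions never flipped
    flips = 0
    while not env.done:
        first, second = known_pair(seen)
        if first is None:
            first = unseen.pop()           # explore a new card
        card = env.flip(first)["card"]
        seen[first] = card
        if second is None:                 # recall the partner if it was seen
            second = next((p for p, c in seen.items()
                           if c == card and p != first), None)
        if second is None:
            second = unseen.pop()          # partner unknown: explore again
        obs = env.flip(second)
        seen[second] = obs["card"]
        flips += 2
        if obs["matched"]:
            del seen[first], seen[second]
    return flips


def play_without_memory(env, seed=0):
    """Baseline that ignores past reveals and flips random unmatched cards."""
    rng = random.Random(seed)
    flips = 0
    while not env.done:
        # Matched cards are treated as visibly removed from the board.
        live = [p for p in range(len(env.cards)) if p not in env.matched]
        first, second = rng.sample(live, 2)
        env.flip(first)
        env.flip(second)
        flips += 2
    return flips
```

In this sketch the remembering policy needs at most two flips per card, whereas the memoryless baseline has no such bound. That contrast is the kind of difference a metric separating forgetting from action selection would need to capture; the Memory Gap metric itself is not defined in the abstract and is not reproduced here.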