Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

June 17, 2026
Authors: Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang
cs.AI

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
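
To make the non-Markov setting concrete, the following is a minimal toy sketch of a Matching Pairs game in Python. It is not the RNG-Bench harness, and it does not implement the paper's Memory Gap metric, whose definition is not given in the abstract. All names here (`MatchingPairs`, `play`, `use_memory`) are hypothetical. The sketch only compares an agent that remembers every card it has seen with one that forgets everything between turns, on identical boards.

```python
import random


class MatchingPairs:
    """Toy board (hypothetical): a card's identity is visible only while flipped."""

    def __init__(self, n_pairs, seed):
        self.cards = [i // 2 for i in range(2 * n_pairs)]  # two cards per identity
        random.Random(seed).shuffle(self.cards)
        self.matched = [False] * len(self.cards)

    def hidden(self):
        """Positions that are still face down."""
        return [p for p, m in enumerate(self.matched) if not m]

    def reveal(self, pos):
        """Flip one card and return its identity."""
        return self.cards[pos]

    def resolve(self, a, b):
        """Remove the two flipped cards from play if they match."""
        if self.cards[a] == self.cards[b]:
            self.matched[a] = self.matched[b] = True


def play(n_pairs, use_memory, seed):
    """Return the number of turns (two flips each) needed to clear the board."""
    env = MatchingPairs(n_pairs, seed)
    rng = random.Random(seed + 1)
    seen = {}  # position -> identity: the agent's reconstruction of past observations
    turns = 0
    while env.hidden():
        if not use_memory:
            seen.clear()  # memoryless agent: acts on the current observation only
        hidden = env.hidden()
        first = second = None
        # If two remembered positions share an identity, take that pair.
        by_identity = {}
        for p in hidden:
            if p in seen:
                if seen[p] in by_identity:
                    first, second = by_identity[seen[p]], p
                    break
                by_identity[seen[p]] = p
        if first is None:
            # Otherwise flip an unseen card, then its remembered partner if known.
            first = rng.choice([p for p in hidden if p not in seen])
            seen[first] = env.reveal(first)
            partners = [p for p in hidden if p != first and seen.get(p) == seen[first]]
            unseen = [p for p in hidden if p not in seen]
            second = partners[0] if partners else rng.choice(unseen)
        seen[second] = env.reveal(second)
        env.resolve(first, second)
        turns += 1
    return turns


if __name__ == "__main__":
    seeds = range(50)
    with_memory = sum(play(6, True, s) for s in seeds) / len(seeds)
    without_memory = sum(play(6, False, s) for s in seeds) / len(seeds)
    print(f"mean turns with memory: {with_memory:.1f}, without: {without_memory:.1f}")
```

Both agents use the same action rule and face the same shuffled boards, so any difference in turn count comes from whether earlier observations are retained. This is the kind of separation between forgetting and action selection that the abstract describes, shown here only in toy form.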