ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
March 19, 2026
Authors: Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
cs.AI
Abstract
Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks, such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; and (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we find that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, and even generalizes to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.