

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

March 19, 2026
作者: Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
cs.AI

Abstract

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
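To make the notion of "proactiveness" concrete, the sketch below shows one hypothetical way a benchmark could score whether a model's reply requests a simple user intervention (e.g., removing an occlusion) rather than attempting a direct answer. The keyword heuristic, pattern list, and function names are illustrative assumptions, not ProactiveBench's actual evaluation protocol.

```python
import re

# Hypothetical patterns signalling a request for user intervention.
# These are illustrative only; the real benchmark's judging criteria
# are not reproduced here.
INTERVENTION_PATTERNS = [
    r"\bremove the (obstruction|occlusion|object)\b",
    r"\bcould you (move|remove|retake|redraw)\b",
    r"\bplease (provide|upload|take) (a clearer|another) (photo|image)\b",
]

def is_proactive(reply: str) -> bool:
    """Return True if the reply asks the user for a simple intervention."""
    text = reply.lower()
    return any(re.search(p, text) for p in INTERVENTION_PATTERNS)

def proactiveness_rate(replies: list[str]) -> float:
    """Fraction of model replies that request an intervention."""
    if not replies:
        return 0.0
    return sum(is_proactive(r) for r in replies) / len(replies)

# Toy usage: one proactive reply, one direct guess.
replies = [
    "Could you remove the object blocking the view?",
    "It looks like a cat.",
]
print(proactiveness_rate(replies))  # 0.5
```

In practice a benchmark would likely use a stronger judge (e.g., an LLM-based or human evaluation) instead of keyword matching; the sketch only conveys the structure of the measurement.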