ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

Abstract

Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can multimodal large language models (MLLMs) exhibit similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench and show that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; and (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we find that conversation histories and in-context learning introduce negative biases that hinder performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning; the results suggest that proactiveness can be learned and can even generalize to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.

