隠れた真実：マルチモーダル言語モデルにおける暗黙的推論の探求

要旨

マルチモーダル大規模言語モデル（MLLM）は、入力が雑多で不完全であり、常に信頼できるとは限らない、オープンエンドの現実世界の環境にますます展開されています。厳選されたベンチマークとは異なり、これらの設定では、欠落したオブジェクトや矛盾する事実を参照する指示、曖昧な参照に依存する指示、または実行不可能なアクションを要求する指示が頻繁に含まれます。このような場合、成功はタスクの実行だけではなく、何かが静かに間違っていることを検出するモデルの能力にかかっています。本論文は、現在のMLLMがこのような暗黙の推論シナリオ、つまり欠陥が明示されていないが文脈から推論しなければならない場合をどのように扱うかについての体系的な分析を提示します。現実世界の失敗モードの4つのカテゴリにわたる厳選された診断スイートを使用して、o3やGPT-4oを含む6つのMLLMを評価し、モデルが必要な知覚および推論スキルを持っている場合でも、隠れた問題を表面化させることが頻繁に失敗することを発見しました。明示的なプロンプティングにより、基礎となる能力は存在するが、ユーザーへの従順さを優先してしばしば抑制されていることが明らかになりました。さらに、慎重なペルソナプロンプティングや、特に明確化の質問を要求するといった単純な推論時の介入が、パフォーマンスを劇的に回復できることを示します。我々の調査結果は、現在のMLLMにおける推論能力と行動的従順さの間の持続的なギャップを強調し、制約の少ない環境でこれらのモデルをより信頼できるものにするための実践的な戦略を提案します。

English

Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments where inputs are messy, underspecified, and not always trustworthy. Unlike curated benchmarks, these settings frequently involve instructions that refer to missing objects or contradictory facts, rely on ambiguous references, or request infeasible actions. In such cases, success hinges not on task execution alone, but on a model's ability to detect when something is silently wrong. This paper presents a systematic analysis of how current MLLMs handle such implicit reasoning scenarios: cases where the flaw is not explicitly stated but must be inferred from context. Using a curated diagnostic suite spanning four categories of real-world failure modes, we evaluate six MLLMs, including o3 and GPT-4o, and find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills. Explicit prompting reveals that the underlying capabilities exist but are often suppressed in favor of user compliance. We further show that simple inference-time interventions, such as cautious persona prompting and, in particular, requiring a clarifying question, can dramatically recover performance. Our findings highlight a persistent gap between reasoning competence and behavioral compliance in current MLLMs and suggest practical strategies for making these models more trustworthy in underconstrained environments.

隠れた真実：マルチモーダル言語モデルにおける暗黙的推論の探求

Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models

要旨

Support