SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning in Text-only LLMs

Abstract

Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or fail entirely when extended to multimodal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce Seeing Eye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the text-only LLM, which serves as a reasoning agent. Crucially, the translator and reasoner engage in multi-round feedback and interaction, enabling the extraction of targeted visual details and yielding more confident answers. Experiments on knowledge-intensive VQA benchmarks, including MMMU and MIA-Bench, demonstrate that Seeing Eye not only reduces inference cost but also surpasses much larger end-to-end VLMs. For example, an instantiation combining a 3B-parameter vision translator with an 8B-parameter language reasoner outperforms a monolithic 32B VLM on challenging knowledge-based questions. Our results highlight that decoupling perception from reasoning via agentic information flow offers a scalable, plug-and-play pathway to multimodal reasoning, allowing strong text-only LLMs to fully leverage their reasoning capabilities. Code is available at: https://github.com/ulab-uiuc/SeeingEye
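The translator-reasoner interaction described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the names `SIR`, `ReasonerReply`, `run_agentic_loop`, `toy_translator`, and `toy_reasoner` are hypothetical, the SIR is modeled as a plain list of text notes, and the two toy agents stand in for a real VLM translator with tools and a real text-only LLM reasoner.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class SIR:
    """Hypothetical structured intermediate representation: text notes about the image."""
    question: str
    notes: List[str] = field(default_factory=list)


@dataclass
class ReasonerReply:
    """Either a final answer, or feedback asking the translator for more detail."""
    answer: Optional[str] = None
    feedback: Optional[str] = None


def run_agentic_loop(
    image: dict,
    question: str,
    translator: Callable[[dict, SIR, Optional[str]], SIR],
    reasoner: Callable[[SIR], ReasonerReply],
    max_rounds: int = 3,
) -> Tuple[Optional[str], SIR]:
    """Alternate between perception (translator) and reasoning (reasoner).

    The reasoner only ever sees the text-only SIR, never the image itself.
    Returns (answer, final SIR); answer is None if no round produced one.
    """
    sir = SIR(question=question)
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sir = translator(image, sir, feedback)   # refine the SIR using feedback
        reply = reasoner(sir)
        if reply.answer is not None:
            return reply.answer, sir
        feedback = reply.feedback                # request targeted visual detail
    return None, sir


def toy_translator(image: dict, sir: SIR, feedback: Optional[str]) -> SIR:
    """Stand-in for the VLM translator; 'ocr' feedback mimics a tool call."""
    if feedback is None:
        note = "caption: " + image["caption"]
    elif feedback == "ocr":
        note = "ocr: " + image["text"]
    else:
        note = "unsupported request: " + feedback
    return SIR(sir.question, sir.notes + [note])


def toy_reasoner(sir: SIR) -> ReasonerReply:
    """Stand-in for the text-only LLM; answers once OCR text is available."""
    for note in sir.notes:
        if note.startswith("ocr: "):
            return ReasonerReply(answer=note[len("ocr: "):])
    return ReasonerReply(feedback="ocr")
```

With a toy image such as `{"caption": "a street sign", "text": "STOP"}`, the first round yields only a caption, the reasoner asks for OCR, and the second round supplies the text it needs to answer. The point of the sketch is the control flow: feedback travels from reasoner to translator, and only text travels from translator to reasoner.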
