

SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs

October 29, 2025
Authors: Weijia Zhang, Zijia Liu, Haoru Li, Haoqi Chen, Jiaxuan You
cs.AI

Abstract

Recent advances in text-only large language models (LLMs), such as DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models remain fragile or entirely incapable when extended to multi-modal tasks. Existing approaches largely rely on single-form captions, which lack diversity and often fail to adapt across different types of Visual Question Answering (VQA) benchmarks. As a result, they provide no principled or efficient channel for transmitting fine-grained visual information. We introduce Seeing Eye, a modular framework that unlocks multimodal reasoning in text-only LLMs through an agent-based small VLM translator. This translator acts as a perception agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively distill multimodal inputs into structured intermediate representations (SIRs) tailored to the question. These SIRs are then passed to the text-only LLM, which serves as a reasoning agent. Crucially, the translator and reasoner engage in multi-round feedback and interaction, enabling the extraction of targeted visual details and yielding more confident answers. Experiments on knowledge-intensive VQA benchmarks, including MMMU and MIA-Bench, demonstrate that Seeing Eye not only reduces inference cost but also surpasses much larger end-to-end VLMs. For example, an instantiation combining a 3B-parameter vision translator with an 8B-parameter language reasoner outperforms a monolithic 32B VLM on challenging knowledge-based questions. Our results highlight that decoupling perception from reasoning via agent information flow offers a scalable and plug-and-play pathway to multimodal reasoning, allowing strong text-only LLMs to fully leverage their reasoning capabilities. Code is available at: https://github.com/ulab-uiuc/SeeingEye
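The abstract describes the framework only at a high level. As a rough illustration of the agentic information flow it outlines (a small VLM translator that invokes OCR/crop tools and distills the image into a structured intermediate representation, which a text-only LLM reasoner consumes across multiple feedback rounds), here is a minimal Python sketch. All names below (`SIR`, `vlm_translate`, `llm_reason`, `run_ocr`, `crop_region`, `seeing_eye`) are hypothetical placeholders, not the authors' released API.

```python
# Minimal sketch of a Seeing Eye-style agentic information flow, as described in
# the abstract. All names here (SIR, vlm_translate, llm_reason, run_ocr,
# crop_region) are hypothetical placeholders, not the authors' actual code.
from dataclasses import dataclass, field


@dataclass
class SIR:
    """Structured intermediate representation passed from translator to reasoner."""
    caption: str
    ocr_text: str = ""
    region_notes: list[str] = field(default_factory=list)


def run_ocr(image) -> str:
    """Placeholder for an OCR tool call made by the perception agent."""
    return "<ocr text>"


def crop_region(image, box):
    """Placeholder for an image-cropping tool call made by the perception agent."""
    return image


def vlm_translate(image, question: str, feedback: str | None = None) -> SIR:
    """Perception agent: a small VLM that invokes tools and distills the image
    into a question-tailored SIR, optionally guided by reasoner feedback."""
    sir = SIR(caption=f"<caption conditioned on: {question}>")
    sir.ocr_text = run_ocr(image)
    if feedback:  # e.g. a request like "zoom into the chart legend"
        sir.region_notes.append(f"<details extracted for: {feedback}>")
    return sir


def llm_reason(question: str, sir: SIR) -> tuple[str, str | None]:
    """Reasoning agent: a text-only LLM. Returns (answer, follow_up_request);
    follow_up_request is None once the reasoner is confident."""
    prompt = f"Question: {question}\nVisual summary: {sir}"
    return f"<answer derived from: {prompt[:40]}...>", None


def seeing_eye(image, question: str, max_rounds: int = 3) -> str:
    """Multi-round feedback loop between the translator and the reasoner."""
    feedback = None
    answer = ""
    for _ in range(max_rounds):
        sir = vlm_translate(image, question, feedback)
        answer, feedback = llm_reason(question, sir)
        if feedback is None:  # reasoner is confident; stop querying the translator
            break
    return answer
```

The key design point mirrored here is the decoupling: the translator never answers the question, and the reasoner never sees pixels; all visual information reaches the LLM only through the question-conditioned SIR, refined by the feedback loop.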