SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In Text-only LLMs
October 29, 2025
Authors: Weijia Zhang, Zijia Liu, Haoru Li, Haoqi Chen, Jiaxuan You
cs.AI
Abstract
Recent advances in text-only large language models (LLMs), such as
DeepSeek-R1, demonstrate remarkable reasoning ability. However, these models
remain fragile or entirely incapable when extended to multimodal tasks.
Existing approaches largely rely on single-form captions, which lack diversity
and often fail to adapt across different types of Visual Question Answering
(VQA) benchmarks. As a result, they provide no principled or efficient channel
for transmitting fine-grained visual information. We introduce Seeing Eye, a
modular framework that unlocks multimodal reasoning in text-only LLMs through
an agent-based small VLM translator. This translator acts as a perception
agent: it can invoke specialized tools (e.g., OCR and crop) and iteratively
distill multimodal inputs into structured intermediate representations (SIRs)
tailored to the question. These SIRs are then passed to the text-only LLM,
which serves as a reasoning agent. Crucially, the translator and reasoner
engage in multi-round feedback and interaction, enabling the extraction of
targeted visual details and yielding more confident answers. Experiments on
knowledge-intensive VQA benchmarks, including MMMU and MIA-Bench, demonstrate
that Seeing Eye not only reduces inference cost but also surpasses much larger
end-to-end VLMs. For example, an instantiation combining a 3B-parameter vision
translator with an 8B-parameter language reasoner outperforms a monolithic 32B
VLM on challenging knowledge-based questions. Our results highlight that
decoupling perception from reasoning via agent information flow offers a
scalable and plug-and-play pathway to multimodal reasoning, allowing strong
text-only LLMs to fully leverage their reasoning capabilities. Code is
available at: https://github.com/ulab-uiuc/SeeingEye
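
To make the agentic information flow concrete, here is a minimal sketch of the decoupled perception/reasoning loop the abstract describes. It is not the released implementation: the names (SIR, seeing_eye_loop), the "FEEDBACK:" reply convention, and the translator/reasoner callables are illustrative assumptions.

```python
# Minimal sketch of the decoupled perception/reasoning flow described above.
# All names here (SIR, seeing_eye_loop, the "FEEDBACK:" convention) are
# illustrative assumptions, not the released SeeingEye code.
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class SIR:
    """Structured intermediate representation handed from translator to reasoner."""
    question: str
    observations: List[str] = field(default_factory=list)  # captions, OCR text, crop notes

    def render(self) -> str:
        evidence = "\n".join(f"- {obs}" for obs in self.observations)
        return f"Question: {self.question}\nVisual evidence:\n{evidence}"


def seeing_eye_loop(
    image: Any,
    question: str,
    translator: Callable[[Any, str, Optional[str]], List[str]],  # small VLM + tools (OCR, crop)
    reasoner: Callable[[str], str],                              # text-only LLM
    max_rounds: int = 3,
) -> str:
    """Multi-round feedback between a perception agent and a reasoning agent."""
    feedback: Optional[str] = None
    sir = SIR(question=question)

    for _ in range(max_rounds):
        # Perception: distill the image into question-tailored text, optionally
        # guided by the reasoner's feedback (e.g. "read the legend in the top right").
        sir.observations.extend(translator(image, question, feedback))

        # Reasoning: the text-only LLM sees only the SIR, never the raw pixels.
        reply = reasoner(sir.render())
        if reply.startswith("FEEDBACK:"):
            feedback = reply[len("FEEDBACK:"):].strip()  # request more targeted detail
        else:
            return reply  # confident final answer

    # Round budget exhausted: answer from the best available evidence.
    return reasoner(sir.render() + "\nGive your best final answer.")
```

In the instantiation reported above, the translator role would be played by a roughly 3B-parameter VLM with OCR and cropping tools, and the reasoner by a roughly 8B-parameter text-only LLM; because the two sides communicate only through text, either component can be swapped independently, which is what makes the pathway plug-and-play.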