Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
November 21, 2025
Authors: Mark Endo, Serena Yeung-Levy
cs.AI
Abstract
Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, more efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately degrades visual capabilities rather than the abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find that performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
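The abstract describes Extract+Think as a two-stage pipeline: a visual extraction step that surfaces instruction-relevant details, followed by step-by-step reasoning over those details to produce an answer. The sketch below shows one way such an inference pipeline could be wired up; it is a minimal illustration, not the paper's released code, and the `extract_then_think` function, the `vlm_generate` callable, and both prompt templates are hypothetical assumptions.

```python
from typing import Any, Callable

# Hypothetical sketch of a two-stage Extract+Think inference pipeline.
# `vlm_generate(image, prompt) -> str` stands in for any small multimodal
# model's image-conditioned text generation call; it is not an API from
# the paper.

def extract_then_think(
    vlm_generate: Callable[[Any, str], str],
    image: Any,
    instruction: str,
) -> str:
    # Stage 1: visual extraction -- ask the model to surface the
    # instruction-relevant visual details before attempting an answer.
    details = vlm_generate(
        image,
        f"Instruction: {instruction}\n"
        "List the visual details in the image that are relevant "
        "to this instruction.",
    )
    # Stage 2: step-by-step reasoning over the extracted details.
    return vlm_generate(
        image,
        f"Instruction: {instruction}\n"
        f"Relevant visual details: {details}\n"
        "Reason step by step, then state the final answer.",
    )
```

Keeping extraction as an explicit, separate call mirrors the abstract's finding that perception, not just reasoning, is a bottleneck when the LLM is downscaled: the perceptual step is trained and exercised on its own rather than folded implicitly into answer generation.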