Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

November 21, 2025
Authors: Mark Endo, Serena Yeung-Levy
cs.AI

Abstract

Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
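The abstract describes Extract+Think as a two-stage inference procedure: the model first extracts instruction-relevant visual details, then reasons step by step over them to produce an answer. Below is a minimal sketch of that pipeline, assuming a generic chat-style multimodal model exposed as a callable; the `vlm` interface, the function name `extract_then_think`, and the prompt wording are illustrative assumptions, not the paper's actual implementation or templates.

```python
from typing import Callable

# Hypothetical two-stage Extract+Think pipeline, following the abstract.
# `vlm` stands in for any multimodal model callable: (image, prompt) -> str.

def extract_then_think(vlm: Callable[[bytes, str], str],
                       image: bytes, question: str) -> str:
    # Stage 1: visual extraction -- elicit the instruction-relevant
    # visual details before any reasoning takes place.
    extraction_prompt = (
        "List the visual details in the image that are relevant to "
        f"answering the following question, without answering it:\n{question}"
    )
    details = vlm(image, extraction_prompt)

    # Stage 2: step-by-step reasoning over the extracted details
    # to generate the final answer.
    reasoning_prompt = (
        f"Question: {question}\n"
        f"Relevant visual details:\n{details}\n"
        "Reason step by step over these details, then give the final answer."
    )
    return vlm(image, reasoning_prompt)
```

Separating the two stages in this way mirrors the paper's framing: stage 1 targets the perception bottleneck identified in the analysis, while stage 2 handles reasoning, so a small LLM is never asked to do both implicitly in a single pass.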