Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs
March 16, 2025
Authors: Xiaoying Zhang, Da Peng, Yipeng Zhang, Zonghao Guo, Chengyue Wu, Chi Chen, Wei Ke, Helen Meng, Maosong Sun
cs.AI
Abstract
Despite their impressive capabilities, Multimodal Large Language Models
(MLLMs) face challenges with fine-grained perception and complex reasoning.
Prevalent multimodal pre-training approaches focus on enhancing perception by
training on high-quality image captions due to the extremely high cost of
collecting chain-of-thought (CoT) reasoning data for improving reasoning. While
leveraging advanced MLLMs for caption generation enhances scalability, the
outputs often lack comprehensiveness and accuracy. In this paper, we introduce
Self-Improving cognition (SIcog), a self-learning framework designed to
construct next-generation foundation MLLMs by enhancing their systematic
cognitive capabilities through multimodal pre-training with self-generated
data. Specifically, we propose Chain-of-Description, an approach that improves
an MLLM's systematic perception by enabling step-by-step visual understanding,
ensuring greater comprehensiveness and accuracy. Additionally, we adopt a
structured CoT reasoning technique to enable MLLMs to integrate in-depth
multimodal reasoning. To construct a next-generation foundation MLLM with
self-improved cognition, SIcog first equips an MLLM with systematic perception
and reasoning abilities using minimal external annotations. The enhanced models
then generate detailed captions and CoT reasoning data, which are further
curated through self-consistency. This curated data is ultimately used for
multimodal pre-training to develop next-generation foundation models. Extensive
experiments on both low- and high-resolution MLLMs across diverse benchmarks
demonstrate that, with merely 213K self-generated pre-training samples, SIcog
produces next-generation foundation MLLMs with significantly improved
cognition, achieving benchmark-leading performance compared to prevalent
pre-training approaches.
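The abstract states that self-generated captions and CoT data are "curated through self-consistency" before pre-training. A common way to realize such curation is majority voting over multiple sampled outputs; the sketch below illustrates that generic heuristic, not the paper's exact rule, and the function name and agreement threshold are illustrative assumptions.

```python
from collections import Counter

def self_consistency_filter(candidates, min_agreement=0.5):
    """Generic self-consistency curation sketch (assumed, not the
    paper's exact procedure): keep a self-generated sample only if
    a majority of independently sampled outputs agree on it."""
    if not candidates:
        return None
    # Count how often each candidate output was produced.
    counts = Counter(candidates)
    answer, freq = counts.most_common(1)[0]
    # Retain the modal output only when it clears the agreement threshold.
    if freq / len(candidates) >= min_agreement:
        return answer
    return None
```

For example, three sampled answers `["A", "A", "B"]` would be curated down to `"A"` (agreement 2/3), while `["A", "B", "C"]` would be discarded as inconsistent.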