

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

October 15, 2025
Authors: Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
cs.AI

Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, our data curation pipeline, and its underlying framework, DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B-parameter model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state of the art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
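To make the dual-level CoT enrichment idea concrete, here is a minimal Python sketch of how a curation step might filter noisy QA pairs and then route each survivor to a short or long chain-of-thought rewrite. All names here (QAPair, is_noisy, needs_long_cot, curate) and the routing heuristic are hypothetical illustrations for this abstract, not the actual HoneyPipe/DataStudio API, which the paper's code release defines.

```python
# Hypothetical sketch of noise filtering plus dual-level (short/long) CoT
# enrichment routing, inspired by the abstract. Not the real HoneyPipe API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str
    image_id: str


def is_noisy(pair: QAPair) -> bool:
    """Placeholder noise filter: a real pipeline would combine rule-based
    checks, deduplication, and model-based quality scoring."""
    return not pair.answer.strip() or pair.question == pair.answer


def needs_long_cot(pair: QAPair) -> bool:
    """Crude complexity heuristic standing in for whatever criterion routes
    a sample to a long-form reasoning trace instead of a short one."""
    keywords = ("prove", "derive", "step", "why", "calculate")
    return any(k in pair.question.lower() for k in keywords)


def curate(
    raw: List[QAPair],
    short_cot: Callable[[QAPair], QAPair],
    long_cot: Callable[[QAPair], QAPair],
) -> List[QAPair]:
    """Drop noisy pairs, then enrich each remaining pair with a short or
    long chain-of-thought rewrite based on estimated question complexity."""
    curated: List[QAPair] = []
    for pair in raw:
        if is_noisy(pair):
            continue
        enriched = long_cot(pair) if needs_long_cot(pair) else short_cot(pair)
        curated.append(enriched)
    return curated
```

In practice the short_cot and long_cot callables would be model-backed rewriters (e.g., an LLM prompted to produce brief versus extended reasoning traces); the sketch only shows the filtering-and-routing skeleton such a pipeline implies.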