
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

October 15, 2025
Authors: Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
cs.AI

Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
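To make the dual-level CoT idea concrete, here is a minimal Python sketch of how such an enrichment pass might route each QA pair to either a short or a long rationale. Everything in it (QAPair, classify_complexity, generate_cot, call_llm) is a hypothetical illustration under assumptions, not HoneyPipe's actual API.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    cot: str | None = None  # rationale filled in by the enrichment pass

def call_llm(prompt: str) -> str:
    """Stub standing in for a real model endpoint; swap in an actual client."""
    return f"[model-generated rationale for a {len(prompt)}-char prompt]"

def classify_complexity(pair: QAPair) -> str:
    """Toy router: send longer, open-ended questions to the long-CoT path."""
    words = len(pair.question.split())
    open_ended = pair.question.lower().startswith(("why", "how", "explain"))
    return "long" if words > 30 or open_ended else "short"

def generate_cot(pair: QAPair, level: str) -> str:
    """Ask the model for a rationale at the requested level of detail."""
    prompt = (
        f"Write a {level} chain-of-thought explaining how to answer:\n"
        f"Q: {pair.question}\nKnown answer: {pair.answer}"
    )
    return call_llm(prompt)

def enrich(pairs: list[QAPair]) -> list[QAPair]:
    """Attach a short or long CoT to every QA pair, per the router's decision."""
    for pair in pairs:
        pair.cot = generate_cot(pair, classify_complexity(pair))
    return pairs

print(enrich([QAPair("Why does the moon have phases?", "Orbital geometry")])[0].cot)
```

The routing heuristic here is deliberately crude; the point is only the two-tier structure the abstract describes, where simple pairs receive a brief rationale and complex ones a longer multi-step one.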