A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
June 11, 2025
Authors: Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang
cs.AI
Abstract
Recent advancements in Large Multimodal Models (LMMs) have significantly
improved multimodal understanding and generation. However, these models still
struggle to generate tightly interleaved image-text outputs, primarily due to
the limited scale, quality and instructional richness of current training
datasets. To address this, we introduce InterSyn, a large-scale multimodal
dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR)
method. InterSyn features multi-turn, instruction-driven dialogues with tightly
interleaved image-text responses, providing rich object diversity and rigorous
automated quality refinement, making it well-suited for training
next-generation instruction-following LMMs. Furthermore, to address the lack of
reliable evaluation tools capable of assessing interleaved multimodal outputs,
we introduce SynJudge, an automatic evaluation model designed to quantitatively
assess multimodal outputs along four dimensions: text content, image content,
image quality, and image-text synergy.
Experimental studies show that the SEIR method leads to substantially higher
dataset quality compared to an otherwise identical process without refinement.
Moreover, LMMs trained on InterSyn achieve uniform performance gains across
all evaluation metrics, confirming InterSyn's utility for advancing multimodal
systems.
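
The abstract describes a refinement loop (SEIR) and a judge that scores outputs along four dimensions, but gives no implementation details. The following is a minimal sketch of what a generic evaluate-then-refine loop of this kind might look like, not the authors' pipeline. Every name here (`Scores`, `refine_until_accepted`, `toy_judge`, `toy_refine`), the 1-to-5 score scale, the acceptance threshold, and the round limit are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# The four dimensions named in the abstract; the field names are hypothetical.
DIMENSIONS = ("text_content", "image_content", "image_quality", "image_text_synergy")


@dataclass
class Scores:
    text_content: float
    image_content: float
    image_quality: float
    image_text_synergy: float

    def lowest(self) -> float:
        """Return the weakest dimension, used here as the acceptance criterion."""
        return min(getattr(self, name) for name in DIMENSIONS)


def refine_until_accepted(
    draft: str,
    judge: Callable[[str], Scores],
    refine: Callable[[str, Scores], str],
    threshold: float = 4.0,   # assumed cutoff on an assumed 1-5 scale
    max_rounds: int = 3,      # assumed cap on refinement rounds
) -> Tuple[str, Scores, int]:
    """Score a draft, then refine and re-score until it passes or rounds run out."""
    scores = judge(draft)
    rounds = 0
    while scores.lowest() < threshold and rounds < max_rounds:
        draft = refine(draft, scores)
        scores = judge(draft)
        rounds += 1
    return draft, scores, rounds


# Toy stand-ins for a real judge and refiner (placeholders, not real models):
# each '+' appended to the draft raises every score by one point, capped at 5.
def toy_judge(sample: str) -> Scores:
    value = float(min(5, 2 + sample.count("+")))
    return Scores(value, value, value, value)


def toy_refine(sample: str, scores: Scores) -> str:
    return sample + "+"


final_sample, final_scores, rounds_used = refine_until_accepted("a", toy_judge, toy_refine)
```

With the toy stand-ins, the draft starts below the threshold and is accepted after two refinement rounds. Gating on the lowest dimension rather than the mean is one possible design choice: it prevents a strong text score from masking a weak image score, which seems in keeping with the emphasis on image-text synergy, though the source does not say how scores are combined.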