Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

Abstract

The emergence of large-scale multi-modal generative models has drastically advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging because model-centric and data-centric development have historically followed isolated paths, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through applications on state-of-the-art LLaVA-like and DiT-based models, yields significant performance boosts, such as topping the VBench leaderboard. We also uncover fruitful insights from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior. With the hope of fostering deeper understanding and future progress in multi-modal data and generative modeling, our code, datasets, and models are maintained and accessible at https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md.

