WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
November 14, 2025
Authors: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua
cs.AI
Abstract
Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions and fail to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. The suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark of 100 tasks based on 480 images. It features a hybrid VLM-judge evaluation framework that judges outputs based on both the reference image and the combination of the original image with the editing instructions, and it assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k improves visual comprehension, image editing, and comprehension-generation collaboration, and helps UMMs develop emergent visual-memory capabilities. Extensive evaluations on WEAVEBench also expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a new perspective and a foundation for the multimodal community to study in-context interleaved comprehension and generation.
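To make the notion of an interleaved, multi-turn sample concrete, the following is a minimal Python sketch of how one such record might be represented. The class names, field names, and the `history_before` helper are hypothetical illustrations and are not taken from the WEAVE release. The only figures used are the headline counts stated in the abstract, where the dialogue-turn count is given as "over 370K", so the per-sample turn average is approximate.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One dialogue turn: text plus zero or more images (hypothetical schema)."""
    role: str                      # e.g. "user" or "assistant"
    text: str
    image_paths: List[str] = field(default_factory=list)


@dataclass
class InterleavedSample:
    """A multi-turn sample mixing text and images (hypothetical schema)."""
    task: str                      # "comprehension", "editing", or "generation"
    turns: List[Turn]

    def num_images(self) -> int:
        # Count images across every turn of the sample.
        return sum(len(t.image_paths) for t in self.turns)


def history_before(sample: InterleavedSample, turn_index: int) -> List[Turn]:
    """Return the earlier turns a model would need to condition on at turn_index."""
    return sample.turns[:turn_index]


# Approximate per-sample averages implied by the counts in the abstract.
AVG_TURNS_PER_SAMPLE = 370_000 / 100_000    # about 3.7 turns per sample
AVG_IMAGES_PER_SAMPLE = 500_000 / 100_000   # about 5 images per sample
AVG_IMAGES_PER_BENCH_TASK = 480 / 100       # 4.8 images per WEAVEBench task
```

Under this sketch, an editing request at turn 3 would be answered with access to `history_before(sample, 3)`, which is the context dependence that single-turn datasets do not exercise.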