
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

November 14, 2025
Authors: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua
cs.AI

Abstract

Recent advances in unified multimodal models (UMMs) have driven substantial progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions and fail to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. The suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark of 100 tasks built on 480 images. It uses a hybrid VLM-judge evaluation framework that scores outputs against both the reference image and the combination of the original image with its editing instructions, and it assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments show that training on WEAVE-100k improves visual comprehension, image editing, and comprehension-generation collaboration, and that it helps UMMs develop emergent visual-memory capabilities. Extensive evaluations on WEAVEBench, in turn, expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a new perspective and a foundation for the multimodal community to study in-context interleaved comprehension and generation.
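To give a rough sense of scale, the headline counts in the abstract can be turned into per-sample averages. This is only a back-of-the-envelope sketch: it assumes the rounded figures quoted above (the abstract says "over" 370K turns, so the true averages may be slightly higher), and the dictionary and function names are illustrative, not part of any WEAVE release.

```python
# Rounded headline counts taken from the abstract. These are assumptions for
# illustration; the exact dataset statistics are not given in the abstract.
WEAVE_100K = {"samples": 100_000, "turns": 370_000, "images": 500_000}
WEAVEBENCH = {"tasks": 100, "images": 480}


def per_unit(total: int, units: int) -> float:
    """Return the average number of `total` items per unit."""
    return total / units


# Approximate averages implied by the rounded counts.
turns_per_sample = per_unit(WEAVE_100K["turns"], WEAVE_100K["samples"])
images_per_sample = per_unit(WEAVE_100K["images"], WEAVE_100K["samples"])
images_per_task = per_unit(WEAVEBENCH["images"], WEAVEBENCH["tasks"])

print(f"WEAVE-100k: about {turns_per_sample:.1f} turns per sample")
print(f"WEAVE-100k: about {images_per_sample:.1f} images per sample")
print(f"WEAVEBench: about {images_per_task:.1f} images per task")
```

Under these rounded counts, a typical WEAVE-100k sample involves several dialogue turns and several images, which is consistent with the multi-turn, context-dependent setting the abstract describes.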