Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs
May 30, 2026
Authors: Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick
cs.AI
Abstract
Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.
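The modality gap described in the abstract compares two success rates over the same set of edits: text-side efficacy and VQA accuracy on generated images. A minimal sketch of that comparison follows. It assumes each edit yields one boolean per modality; the record type `EditOutcome`, its field names, and the helper functions are hypothetical illustrations, not the UniKE schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class EditOutcome:
    """Hypothetical per-edit record (field names are assumptions)."""
    text_correct: bool  # edited model states the new fact in its text output
    vqa_correct: bool   # a VQA judge confirms the new fact in the generated image


def rate(flags: Iterable[bool]) -> float:
    """Return the percentage of True values, or 0.0 for an empty input."""
    values = list(flags)
    return 100.0 * sum(values) / len(values) if values else 0.0


def modality_gap(outcomes: List[EditOutcome]) -> Tuple[float, float, float]:
    """Return (text efficacy %, VQA accuracy %, gap in percentage points)."""
    text = rate(o.text_correct for o in outcomes)
    vqa = rate(o.vqa_correct for o in outcomes)
    return text, vqa, text - vqa


# Toy data for illustration only; these are not results from the paper.
toy = [
    EditOutcome(text_correct=True, vqa_correct=True),
    EditOutcome(text_correct=True, vqa_correct=False),
    EditOutcome(text_correct=True, vqa_correct=False),
    EditOutcome(text_correct=False, vqa_correct=False),
]
text_eff, vqa_acc, gap = modality_gap(toy)
print(f"text efficacy {text_eff:.1f}%, VQA accuracy {vqa_acc:.1f}%, gap {gap:.1f} pp")
```

The gap is reported in percentage points rather than as a ratio, matching how the abstract states gains (for example, "up to 18.6 percentage points").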