Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Abstract

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating their internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we find a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates the edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains of up to 18.6 percentage points. Mechanistic analysis associates this gap with only partial alignment between the edited textual representations and the conditioning pathways for visual generation: edits that suffice for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer, and they motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.