Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
May 12, 2026
Authors: Yabo Zhang, Kunchang Li, Dewei Zhou, Xinyu Huang, Xun Wang
cs.AI
Abstract
While recent advances in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies in order to match descriptions with their visual targets. To address these challenges, we propose Images iN SEnTences (INSET), a unified generation model that embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that uses VLMs and LLMs to synthesize 15M high-quality interleaved samples from standard image and video datasets, constructing rich, long-horizon sequences. Evaluations on InterleaveBench show that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with the performance gap widening as input complexity increases. Beyond standard generation, our approach extends naturally to multimodal image editing, integrating visual content as part of the instruction to enable highly expressive and creative visual manipulations.
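The contrast between a structurally separated layout and an interleaved one can be illustrated with a toy sequence builder. This is only a minimal sketch, not the authors' implementation: the `<img0>` placeholder syntax, the string stand-ins for visual tokens, and the function names `interleave`, `separated`, and `gap` are all assumptions made for illustration. It shows how splicing image tokens into their semantic slot keeps them adjacent to the words that describe them, whereas placing all images up front leaves a longer span for the model to bridge.

```python
import re

# Hypothetical placeholder syntax marking where an image belongs in the text.
PLACEHOLDER = re.compile(r"(<img\d+>)")


def interleave(instruction, image_tokens):
    """Splice each image's tokens into the slot marked by its placeholder."""
    sequence = []
    for piece in PLACEHOLDER.split(instruction):
        if PLACEHOLDER.fullmatch(piece):
            sequence.extend(image_tokens[piece])  # image sits in its slot
        else:
            sequence.extend(piece.split())  # plain whitespace tokenization
    return sequence


def separated(instruction, image_tokens):
    """Baseline layout: all image tokens first, then the text, with the
    placeholders kept as ordinary reference words."""
    sequence = []
    for tokens in image_tokens.values():
        sequence.extend(tokens)
    sequence.extend(instruction.split())
    return sequence


def gap(sequence, word, token):
    """Positional distance between a describing word and an image token."""
    return abs(sequence.index(word) - sequence.index(token))


instruction = "the dog <img0> sits beside the cat <img1>"
# String stand-ins for visual features; a real model would use embeddings.
image_tokens = {
    "<img0>": ["[I0_a]", "[I0_b]"],
    "<img1>": ["[I1_a]", "[I1_b]"],
}

inter = interleave(instruction, image_tokens)
sep = separated(instruction, image_tokens)

print(inter)
print(gap(inter, "cat", "[I1_a]"), gap(sep, "cat", "[I1_a]"))
```

In this toy case the word "cat" is one position away from its image tokens in the interleaved layout and eight positions away in the separated one, and the separated gap grows as more images and text are added.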