Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

Abstract

While recent advances in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose Images iN SEnTences (INSET), a unified generation model that embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, using VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench show that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with the performance gap widening as input complexity increases. Beyond standard generation, our approach extends naturally to multimodal image editing: because visual content is integrated as part of the instruction, it supports highly expressive and creative visual manipulations.
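As a rough illustration of what "positioning visual features directly at their corresponding semantic slots" could look like, the minimal sketch below splices per-image feature vectors into a text embedding sequence at the position where each image is mentioned. This is not INSET's implementation: the `<name>` placeholder syntax, the functions `embed_word` and `build_interleaved_sequence`, and the toy vectors are all hypothetical stand-ins for a real tokenizer, text embedder, and image encoder.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def embed_word(word: str, dim: int = 4) -> Vector:
    """Toy text embedding (hypothetical): a deterministic vector from character codes."""
    total = sum(ord(c) for c in word)
    return [((total + i) % 7) / 7.0 for i in range(dim)]


def build_interleaved_sequence(
    instruction: str,
    images: Dict[str, List[Vector]],
    dim: int = 4,
) -> Tuple[List[Vector], List[str]]:
    """Replace each whitespace-separated <name> placeholder with that image's features.

    Returns the interleaved vector sequence and a parallel list of labels
    ("txt" or "img:<name>") marking where each vector came from.
    """
    sequence: List[Vector] = []
    kinds: List[str] = []
    for piece in instruction.split():
        if piece.startswith("<") and piece.endswith(">"):
            name = piece[1:-1]
            feats = images[name]  # assumed output of some image encoder
            for vec in feats:
                if len(vec) != dim:
                    raise ValueError(f"image '{name}' has a feature of size {len(vec)}, expected {dim}")
            # The image features sit exactly where the image is referred to in the
            # sentence, so they are adjacent to the words that describe them.
            sequence.extend(feats)
            kinds.extend([f"img:{name}"] * len(feats))
        else:
            sequence.append(embed_word(piece, dim))
            kinds.append("txt")
    return sequence, kinds


if __name__ == "__main__":
    toy_images = {
        "cat": [[0.1] * 4, [0.2] * 4],              # two feature vectors
        "dog": [[0.3] * 4, [0.4] * 4, [0.5] * 4],   # three feature vectors
    }
    seq, kinds = build_interleaved_sequence("put <cat> next to <dog> outdoors", toy_images)
    print(kinds)
```

In this toy layout the vectors for `cat` sit directly between "put" and "next", rather than in a separate image block that the text must refer back to, which is the locality argument made above.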

