Communication-Inspired Tokenization for Structured Image Representations

February 24, 2026
Authors: Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro
cs.AI

Abstract

Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
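
To make the encoding procedure described in the abstract more concrete, the following is a toy sketch of the recurrent message update only: a fixed-length sequence of discrete tokens is revised once per observed crop. It is not the authors' implementation. Every name here (`encode_message`, `init_params`, `nearest_code`, the parameter keys) is hypothetical, the weights are random rather than trained, crops are chosen uniformly at random rather than by an attention mechanism, a single linear map plus nearest-codebook lookup stands in for the transformer, and the flow-matching decoder and both training losses are omitted.

```python
import math
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])))


def random_crop(image, size, rng):
    """Flattened square crop at a random location (placeholder for a learned crop policy)."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [image[r][c] for r in range(top, top + size) for c in range(left, left + size)]


def matvec(vector, matrix):
    """Row vector times matrix, with the matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def init_params(num_tokens=8, codebook_size=16, dim=4, crop_size=4, seed=0):
    """Random, untrained parameters; all sizes are illustrative assumptions."""
    rng = random.Random(seed)

    def gauss(rows, cols, scale=1.0):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "codebook": gauss(codebook_size, dim),        # discrete vocabulary
        "pos": gauss(num_tokens, dim),                # per-slot offsets
        "w_crop": gauss(crop_size * crop_size, dim, 1.0 / crop_size),
        "w_msg": gauss(dim, dim, 1.0 / math.sqrt(dim)),
        "crop_size": crop_size,
    }


def encode_message(image, params, num_steps=4, seed=0):
    """Build a fixed-length discrete message by observing one crop per step."""
    rng = random.Random(seed)
    codebook, pos = params["codebook"], params["pos"]
    tokens = [0] * len(pos)  # fixed token budget; never grows or shrinks
    for _ in range(num_steps):
        crop = random_crop(image, params["crop_size"], rng)
        crop_feat = matvec(crop, params["w_crop"])
        new_tokens = []
        for token, offset in zip(tokens, pos):
            # Combine the current token with the new observation, then
            # re-discretize, so every slot may be rewritten at every step.
            state = [e + o for e, o in zip(codebook[token], offset)]
            mixed = matvec(state, params["w_msg"])
            latent = [math.tanh(m + f) for m, f in zip(mixed, crop_feat)]
            new_tokens.append(nearest_code(latent, codebook))
        tokens = new_tokens
    return tokens
```

Calling `encode_message(image, init_params())` on a 2D list of pixel intensities returns a list of token indices whose length equals the token budget regardless of `num_steps`. In the described framework, that final message would then condition the decoder that reconstructs the full image.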