Communication-Inspired Tokenization for Structured Image Representations
February 24, 2026
Authors: Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro
cs.AI
Abstract
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
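The encoding loop described in the abstract (observe a crop, update a fixed-length discrete message, repeat, then condition a flow-matching decoder) can be illustrated with a toy sketch. This is not the authors' implementation. Every name and value below (`BUDGET`, `CODEBOOK_SIZE`, `random_crop`, `quantize`, `update_message`, `encode`, `flow_matching_pair`) is hypothetical. Random crops stand in for the attentive crop selection, a simple averaging rule stands in for the transformer update, and the linear interpolation path is only one common flow-matching choice, since the abstract does not say which path COMiT uses.

```python
import random

BUDGET = 8           # fixed token budget of the message (illustrative value)
CODEBOOK_SIZE = 16   # number of discrete token ids (illustrative value)


def random_crop(image, size, rng):
    """Cut a size x size window out of a 2D image stored as a list of rows."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def quantize(value):
    """Map a float in [0, 1] to the nearest of CODEBOOK_SIZE evenly spaced codes."""
    return min(CODEBOOK_SIZE - 1, max(0, round(value * (CODEBOOK_SIZE - 1))))


def update_message(message, crop):
    """Toy stand-in for the transformer update step (an assumption).

    Every slot is recomputed from its old value and the new crop, so earlier
    tokens can be revised rather than only appended to.
    """
    flat = [px for row in crop for px in row]
    new_message = []
    for slot, token in enumerate(message):
        old = token / (CODEBOOK_SIZE - 1)   # de-quantize the previous token
        seen = flat[slot % len(flat)]       # crude per-slot crop feature
        new_message.append(quantize(0.5 * old + 0.5 * seen))
    return new_message


def encode(image, steps=4, crop_size=4, seed=0):
    """Build a fixed-length discrete message from a sequence of crops."""
    rng = random.Random(seed)
    message = [0] * BUDGET                  # start from an all-zero message
    for _ in range(steps):
        message = update_message(message, random_crop(image, crop_size, rng))
    return message


def flow_matching_pair(noise, image, t):
    """Linear-path flow matching (assumed): interpolated input and target velocity."""
    x_t = [[(1 - t) * n + t * x for n, x in zip(nr, xr)]
           for nr, xr in zip(noise, image)]
    velocity = [[x - n for n, x in zip(nr, xr)]
                for nr, xr in zip(noise, image)]
    return x_t, velocity


# An 8x8 gradient image with pixel values in [0, 1].
image = [[(r + c) / 14 for c in range(8)] for r in range(8)]
message = encode(image)
```

In a real system the decoder would be a network trained to predict `velocity` from `x_t`, `t`, and `message`. The sketch only shows the control flow: the message length never changes across steps, and each step may rewrite any token.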