Communication-Inspired Tokenization for Structured Image Representations

Abstract

Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
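
The encoding loop described in the abstract (observe a crop, update a fixed-length discrete message, repeat) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the random projections `w_msg` and `w_obs`, the mean pooling of crop features, and the nearest-neighbour codebook lookup are all assumptions standing in for the learned transformer and whatever quantizer the paper actually uses, and the flow-matching decoder is omitted.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    # Squared Euclidean distances, shape (num_vectors, codebook_size).
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def encode_message(crops, codebook, num_tokens, rng):
    """Toy recurrent encoder: revise a fixed-length token message once per crop."""
    dim = codebook.shape[1]
    # Hypothetical stand-ins for a learned update (assumption, not from the paper).
    w_msg = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    w_obs = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    tokens = np.zeros(num_tokens, dtype=np.int64)  # initial message: all slots at code 0
    for crop in crops:
        message = codebook[tokens]       # (num_tokens, dim) embeddings of current message
        observation = crop.mean(axis=0)  # (dim,) pooled features of the new crop
        # Combine the existing message with the new observation in every slot.
        updated = np.tanh(message @ w_msg + observation @ w_obs)
        # Re-discretize all slots, so earlier tokens can be revised, not only appended.
        tokens = quantize(updated, codebook)
    return tokens

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))                  # 16 codes of dimension 8
crops = [rng.normal(size=(4, 8)) for _ in range(3)]  # 3 crops, 4 feature vectors each
message = encode_message(crops, codebook, num_tokens=5, rng=rng)
```

The message length stays at `num_tokens` regardless of how many crops are observed, which mirrors the fixed token budget; in the actual method the final message would then condition the decoder that reconstructs the full image.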

