LoST: Level of Semantics Tokenization for 3D Shapes

Abstract

Tokenization is a fundamental technique in the generative modeling of various modalities. It plays a particularly critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, the optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods rely primarily on geometric level-of-detail (LoD) hierarchies, which were originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, so that early prefixes decode into complete, plausible shapes that carry the principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks such as semantic retrieval, while using only 0.1% to 10% of the tokens needed by prior AR models.
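
The abstract describes RIDA only at a high level, as aligning the relational structure of two feature spaces, and does not give its formula. One plausible reading is sketched below under stated assumptions: the "relational structure" of a batch is taken to be its pairwise Euclidean distance matrix, each matrix is normalized by its mean off-diagonal distance, and the loss is the mean squared difference between the two. The function names, the choice of Euclidean distance, and the normalization are all illustrative assumptions, not the paper's actual definition.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance matrix between the rows of x (shape [n, d])."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Clamp small negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(d2, 0.0))


def relational_alignment_loss(shape_latents, semantic_feats, eps=1e-8):
    """Hypothetical relational alignment loss (an assumption, not RIDA itself).

    shape_latents:  [n, d_shape] array, one latent per shape in the batch.
    semantic_feats: [n, d_sem] array, e.g. pooled DINO features per shape.
    The two feature dimensions may differ; only distances are compared.
    """
    d_shape = pairwise_distances(shape_latents)
    d_sem = pairwise_distances(semantic_feats)

    # Compare only off-diagonal entries (distinct pairs of shapes).
    n = d_shape.shape[0]
    mask = ~np.eye(n, dtype=bool)

    # Normalize by the mean pairwise distance so the loss ignores global scale.
    d_shape = d_shape / (d_shape[mask].mean() + eps)
    d_sem = d_sem / (d_sem[mask].mean() + eps)

    return float(np.mean((d_shape[mask] - d_sem[mask]) ** 2))
```

Under these assumptions the loss is near zero whenever the two spaces agree on relative distances, for example when one is a uniformly scaled copy of the other, and it grows as shapes that are semantically close drift apart in the latent space.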
