LoST: Level of Semantics Tokenization for 3D Shapes

Abstract

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.
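The abstract gives no formula for RIDA, so the following is only an illustrative sketch of what aligning the "relational structure" of two spaces could look like, not the authors' loss. It assumes Euclidean pairwise distances, normalization by the mean distance, and a mean squared gap between the two distance matrices. The function names (`pairwise_distances`, `normalize`, `relational_alignment_loss`) are hypothetical. Because only distances are compared, the two feature spaces may have different dimensionality.

```python
import math


def pairwise_distances(points):
    """Return the n x n matrix of Euclidean distances between row vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def normalize(matrix):
    """Divide by the mean off-diagonal distance so scale differences cancel.

    Assumption: mean normalization is one simple choice; the paper may
    use a different scheme.
    """
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    if mean == 0:
        return [row[:] for row in matrix]
    return [[d / mean for d in row] for row in matrix]


def relational_alignment_loss(shape_latents, semantic_features):
    """Mean squared gap between the two normalized distance matrices.

    Both inputs hold one vector per shape, in the same order.
    """
    if len(shape_latents) != len(semantic_features) or len(shape_latents) < 2:
        raise ValueError("need the same number (>= 2) of vectors in both spaces")
    d_shape = normalize(pairwise_distances(shape_latents))
    d_sem = normalize(pairwise_distances(semantic_features))
    n = len(d_shape)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum((d_shape[i][j] - d_sem[i][j]) ** 2 for i, j in pairs) / len(pairs)
```

Under these assumptions the loss is near zero when the semantic features are a uniformly scaled copy of the shape latents, and it grows as the relative distances between shapes diverge across the two spaces.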

