Aligning Text, Images, and 3D Structure Token-by-Token

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining the critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, both synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and we show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
