Scalable 3D Captioning with Pretrained Models

Abstract

We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. Cap3D uses pretrained models for image captioning, image-text alignment, and large language modeling (LLMs) to consolidate captions from multiple views of a 3D asset, completely sidestepping the time-consuming and costly process of manual annotation. We apply Cap3D to Objaverse, a recently introduced large-scale 3D dataset, producing 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, shows that Cap3D surpasses human-authored descriptions in quality, cost, and speed. With effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k annotations collected from the ABO dataset. Finally, we finetune text-to-3D models on both Cap3D and human captions and show that finetuning on Cap3D captions performs better; we also benchmark state-of-the-art (SOTA) methods including Point-E, Shap-E, and DreamFusion.
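The three-stage flow described in the abstract (caption each view, check image-text alignment, consolidate with an LLM) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function name `caption_3d_object` and its three callable parameters are hypothetical placeholders for whatever pretrained models are plugged in, and the choice to keep the single best-aligned caption per view is an assumption that the abstract does not state.

```python
from typing import Callable, List, Sequence, TypeVar

View = TypeVar("View")  # a rendered image of the 3D asset; the type is left open


def caption_3d_object(
    views: Sequence[View],
    propose_captions: Callable[[View], Sequence[str]],  # hypothetical image captioner
    alignment_score: Callable[[View, str], float],      # hypothetical image-text scorer
    consolidate: Callable[[Sequence[str]], str],        # hypothetical LLM summarizer
) -> str:
    """Produce one caption for a 3D object from its rendered views."""
    if not views:
        raise ValueError("at least one rendered view is required")

    best_per_view: List[str] = []
    for view in views:
        candidates = propose_captions(view)
        if not candidates:
            continue  # skip views for which the captioner returned nothing
        # Assumption: keep only the candidate that best matches this view.
        best = max(candidates, key=lambda caption: alignment_score(view, caption))
        best_per_view.append(best)

    if not best_per_view:
        raise ValueError("no captions were produced for any view")

    # The consolidation step merges the per-view captions into one description.
    return consolidate(best_per_view)
```

In practice each callable would wrap a pretrained model; passing them in as parameters keeps the sketch independent of any particular library or model choice.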
