Scalable 3D Captioning with Pretrained Models

Abstract

We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. Cap3D uses pretrained models for image captioning, image-text alignment, and large language modeling (LLMs) to consolidate captions from multiple views of a 3D asset, completely sidestepping the time-consuming and costly process of manual annotation. We apply Cap3D to Objaverse, a recently introduced large-scale 3D dataset, producing 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, shows that Cap3D surpasses human-authored descriptions in quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k annotations collected from the ABO dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions and show that Cap3D outperforms human captions; we also benchmark state-of-the-art (SOTA) methods, including Point-E, Shap-E, and DreamFusion.
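
The abstract describes the pipeline only at a high level: caption several views of a 3D asset, use image-text alignment, and consolidate the results with an LLM. The following is a minimal sketch of that flow, not the authors' implementation. The function `caption_3d_object`, its three callable parameters, and the toy stand-ins are all hypothetical names introduced here, and the choice to use the alignment score to keep one best caption per view is an assumption made for illustration. In practice each callable would wrap a pretrained model.

```python
from typing import Callable, List, Sequence, TypeVar

View = TypeVar("View")  # any representation of one rendered view, e.g. an image


def caption_3d_object(
    views: Sequence[View],
    propose_captions: Callable[[View], List[str]],
    alignment_score: Callable[[View, str], float],
    consolidate: Callable[[List[str]], str],
) -> str:
    """Caption each view, keep the best-aligned caption per view, then merge.

    All three callables are placeholders (assumptions): a captioning model,
    an image-text alignment model, and an LLM-based summarizer.
    """
    if not views:
        raise ValueError("at least one rendered view is required")
    best_per_view: List[str] = []
    for view in views:
        candidates = propose_captions(view)
        if not candidates:
            continue  # skip views for which the captioner returned nothing
        # Assumed selection rule: highest alignment score wins for this view.
        best = max(candidates, key=lambda c: alignment_score(view, c))
        best_per_view.append(best)
    return consolidate(best_per_view)


# Toy stand-ins so the sketch runs without any model.
# Here a "view" is just the set of words visible from that angle.
def toy_captioner(view):
    return ["a red chair", "a wooden table", "a chair with wooden legs"]


def toy_score(view, caption):
    # Count how many caption words are "visible" in the view.
    return len(view & set(caption.split()))


def toy_consolidate(captions):
    # Deduplicate while preserving order, then join; an LLM would rewrite instead.
    return "; ".join(dict.fromkeys(captions))


if __name__ == "__main__":
    toy_views = [{"red", "chair"}, {"chair", "wooden", "legs"}]
    print(caption_3d_object(toy_views, toy_captioner, toy_score, toy_consolidate))
```

Passing the three stages as callables keeps the sketch independent of any particular captioning, alignment, or language model, none of which the abstract names.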

