Scalable 3D Captioning with Pretrained Models
June 12, 2023
Authors: Tiange Luo, Chris Rockwell, Honglak Lee, Justin Johnson
cs.AI
Abstract
We introduce Cap3D, an automatic approach for generating descriptive text for
3D objects. This approach utilizes pretrained models from image captioning,
image-text alignment, and LLMs to consolidate captions from multiple views of a
3D asset, completely side-stepping the time-consuming and costly process of
manual annotation. We apply Cap3D to the recently introduced large-scale 3D
dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted
using 41k human annotations from the same dataset, demonstrates that Cap3D
surpasses human-authored descriptions in terms of quality, cost, and speed.
Through effective prompt engineering, Cap3D rivals human performance in
generating geometric descriptions on 17k collected annotations from the ABO
dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions,
and show Cap3D outperforms; we also benchmark SOTA methods, including Point-E,
Shap-E, and DreamFusion.
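
As a rough illustration of the pipeline the abstract describes, the sketch below captions several rendered views of a 3D asset, keeps the candidate caption best aligned with each image, and asks an LLM to consolidate the per-view captions into one description. The specific model choices (BLIP-2, CLIP, GPT-4) and the `render_views` helper are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Minimal sketch of a Cap3D-style captioning pipeline (assumed components:
# BLIP-2 for per-view captioning, CLIP for caption selection, GPT-4 for
# consolidation; the paper's actual prompts and models may differ).
import torch
from PIL import Image
from transformers import (
    Blip2Processor, Blip2ForConditionalGeneration,
    CLIPProcessor, CLIPModel,
)
from openai import OpenAI

blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment


def render_views(asset_path: str, n_views: int = 8) -> list[Image.Image]:
    """Hypothetical stand-in for an offline renderer (e.g., Blender)."""
    raise NotImplementedError


def caption_one_view(image: Image.Image, n_candidates: int = 5) -> str:
    # 1) Sample several candidate captions for this view with BLIP-2.
    inputs = blip_proc(images=image, return_tensors="pt")
    out = blip.generate(**inputs, do_sample=True,
                        num_return_sequences=n_candidates, max_new_tokens=30)
    candidates = blip_proc.batch_decode(out, skip_special_tokens=True)
    # 2) Keep the candidate CLIP scores as best aligned with the image.
    clip_in = clip_proc(text=candidates, images=image,
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**clip_in).logits_per_image  # shape (1, n_candidates)
    return candidates[scores.argmax().item()]


def caption_3d_asset(asset_path: str) -> str:
    views = render_views(asset_path)
    view_captions = [caption_one_view(v) for v in views]
    # 3) Consolidate the per-view captions into a single object description.
    prompt = ("These captions describe different views of the same 3D object. "
              "Write one consolidated description:\n" + "\n".join(view_captions))
    resp = llm.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

Selecting the highest-scoring caption per view before consolidation filters out hallucinated per-view captions, which is what lets the final LLM step produce a description faithful to the underlying geometry.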