

Scalable 3D Captioning with Pretrained Models

June 12, 2023
Authors: Tiange Luo, Chris Rockwell, Honglak Lee, Justin Johnson
cs.AI

Abstract

We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. This approach utilizes pretrained models from image captioning, image-text alignment, and an LLM to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions, show that Cap3D outperforms, and benchmark state-of-the-art models including Point-E, Shap-E, and DreamFusion.
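The abstract outlines a multi-stage pipeline: render several views of a 3D asset, caption each view with a pretrained image captioner, use an image-text alignment model to keep the best caption per view, and have an LLM consolidate the per-view captions into a single description. The sketch below illustrates that flow, assuming off-the-shelf BLIP and CLIP checkpoints from Hugging Face `transformers`; the specific model names, the `render_views` renderer, the `call_llm` helper, and the prompt wording are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of multi-view captioning + alignment-based selection.
# Model choices and helpers are assumptions, not the paper's exact setup.
from PIL import Image
import torch
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    CLIPProcessor, CLIPModel,
)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")


def candidate_captions(view: Image.Image, n: int = 5) -> list[str]:
    """Generate several candidate captions for one rendered view."""
    inputs = blip_proc(images=view, return_tensors="pt")
    out = blip.generate(**inputs, num_beams=n, num_return_sequences=n, max_new_tokens=30)
    return [blip_proc.decode(seq, skip_special_tokens=True) for seq in out]


def select_caption(view: Image.Image, candidates: list[str]) -> str:
    """Keep the candidate that the image-text alignment model scores highest."""
    inputs = clip_proc(text=candidates, images=view, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape: (1, num_candidates)
    return candidates[logits.argmax().item()]


def consolidation_prompt(per_view_captions: list[str]) -> str:
    """Build an LLM prompt that fuses per-view captions into one description."""
    joined = "\n".join(f"- {c}" for c in per_view_captions)
    return (
        "Given captions describing different views of the same 3D object, "
        "write one concise description of the object:\n" + joined
    )


# Usage (assumes `views` is a list of PIL images rendered from a 3D asset):
# views = render_views("asset.glb", num_views=8)            # hypothetical renderer
# per_view = [select_caption(v, candidate_captions(v)) for v in views]
# final_caption = call_llm(consolidation_prompt(per_view))  # any LLM API
```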