3D CoCa: Contrastive Learners are 3D Captioners

Abstract

3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and the weak cross-modal alignment of existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach uses a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. By aligning 3D and textual representations, this joint training paradigm yields stronger spatial reasoning and richer semantic grounding. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-art methods, by 10.2% and 5.76% in CIDEr at 0.5 IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.
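To make the idea of jointly optimizing contrastive and captioning objectives concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the loss formulation, so the symmetric InfoNCE form, the temperature of 0.07, the weighting factor `lam`, and all function names here are illustrative assumptions.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(logits):
    # Numerically stable log-softmax over a list of scores.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def contrastive_loss(scene_embs, text_embs, temperature=0.07):
    # Symmetric InfoNCE (assumed form): scene i and text i are a matching pair,
    # every other pairing in the batch acts as a negative.
    s = [l2_normalize(v) for v in scene_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(s)
    sim = [[sum(a * b for a, b in zip(s[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    scene_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_scene = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (scene_to_text + text_to_scene)

def caption_loss(token_logits, target_ids):
    # Mean cross-entropy of the target caption tokens under the decoder logits.
    return -sum(
        log_softmax(logits)[y] for logits, y in zip(token_logits, target_ids)
    ) / len(target_ids)

def joint_loss(scene_embs, text_embs, token_logits, target_ids, lam=1.0):
    # Hypothetical combination: contrastive term plus a weighted captioning term.
    return (contrastive_loss(scene_embs, text_embs)
            + lam * caption_loss(token_logits, target_ids))
```

In this sketch, a batch whose scene and text embeddings are well aligned yields a small contrastive term, so the total is dominated by the captioning term; misaligned pairs raise the contrastive term and push the shared feature space toward agreement between the two modalities.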
