TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos

Abstract

Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To address this, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features, then use them to constrain the video generation process with pixel-level precision, which yields high-quality and consistent TTVs. Building on this, we also design a method for downstream reconstruction from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY serve not only as high-quality dynamic previews but also as a reliable, 3D-aware intermediate representation that can be back-projected into UV textures or used to supervise neural rendering methods such as 3D Gaussian Splatting (3DGS). This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.
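
As a rough illustration of what a 360-degree turntable trajectory involves, the sketch below computes evenly spaced camera poses on a circular orbit around an object centered at the origin. It is not TAPESTRY's implementation: the function names `look_at` and `turntable_views`, the default radius and elevation, and the OpenGL-style camera convention (camera looking down its negative Z axis) are all assumptions made for this example.

```python
import numpy as np


def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 world-to-camera matrix (assumed OpenGL-style, camera looks down -Z)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)

    # Rows of the rotation are the camera axes expressed in world coordinates.
    rot = np.stack([right, true_up, -forward])
    view = np.eye(4)
    view[:3, :3] = rot
    view[:3, 3] = -rot @ eye
    return view


def turntable_views(num_frames=8, radius=2.5, elevation_deg=15.0):
    """Return (num_frames, 4, 4) view matrices orbiting the origin at a fixed elevation."""
    elev = np.deg2rad(elevation_deg)
    # endpoint=False avoids duplicating the first frame at 360 degrees.
    azimuths = np.linspace(0.0, 2.0 * np.pi, num_frames, endpoint=False)
    views = []
    for az in azimuths:
        eye = radius * np.array([
            np.cos(elev) * np.sin(az),
            np.sin(elev),
            np.cos(elev) * np.cos(az),
        ])
        views.append(look_at(eye, np.zeros(3)))
    return np.stack(views)


views = turntable_views(num_frames=8)
print(views.shape)
```

Each view matrix could then be used to render per-frame geometric conditioning images (for example depth or normal maps) from the mesh, and later to back-project generated frames onto the surface; which features are rendered and how they are encoded is specific to the method and not shown here.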
