SeqTex: Generate Mesh Textures in Video Sequence

Abstract

Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods fine-tune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps, an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
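
The contrast between modeling UV textures in isolation and modeling them jointly with multi-view renderings can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: $T$ denotes the UV texture map, $V_1, \dots, V_N$ the multi-view renderings, and $c$ the conditioning (mesh geometry together with an image or text prompt). Under that reading, isolated modeling targets the first distribution, the sequence formulation targets the second, and the UV texture distribution is recovered from the joint by marginalizing over the views.

```latex
% Illustrative notation only; symbols T, V_i, c, and \theta are assumptions.
p_\theta(T \mid c)
\qquad \text{vs.} \qquad
p_\theta(V_1, \dots, V_N, T \mid c),
\qquad
p_\theta(T \mid c) = \int p_\theta(V_1, \dots, V_N, T \mid c)\, dV_1 \cdots dV_N .
```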

