SeqTex: Generate Mesh Textures in Video Sequence

Abstract

Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods fine-tune image generation foundation models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps, an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
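To make the sequence formulation concrete, the following is a minimal sketch of how multi-view renderings and a UV texture map could be packed into a single token sequence. It is an illustration under stated assumptions, not the SeqTex implementation: the function names `patchify` and `build_joint_sequence`, the patch size, and the choice to keep the UV map at a higher spatial resolution than the views (one possible reading of "adaptive token resolution") are all hypothetical. The real model operates on latents from a pretrained video model, which this sketch omits.

```python
import numpy as np


def patchify(frames: np.ndarray, patch: int) -> np.ndarray:
    """Split (N, H, W, C) images into flat tokens.

    Returns an array of shape (N * H/patch * W/patch, patch * patch * C).
    """
    n, h, w, c = frames.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    x = frames.reshape(n, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (n, h/p, w/p, p, p, c)
    return x.reshape(n * (h // patch) * (w // patch), patch * patch * c)


def build_joint_sequence(views: np.ndarray, uv_map: np.ndarray, patch: int = 8):
    """Concatenate multi-view tokens and UV tokens into one sequence.

    views:  (V, H, W, C) multi-view renderings (assumed layout).
    uv_map: (H_uv, W_uv, C) UV texture map; it may be larger than the views,
            in which case it simply contributes more tokens (an assumption).
    Returns (tokens, is_uv), where is_uv marks which tokens belong to the UV map.
    """
    view_tokens = patchify(views, patch)
    uv_tokens = patchify(uv_map[None], patch)  # add a leading frame axis
    tokens = np.concatenate([view_tokens, uv_tokens], axis=0)
    is_uv = np.concatenate([
        np.zeros(len(view_tokens), dtype=bool),
        np.ones(len(uv_tokens), dtype=bool),
    ])
    return tokens, is_uv
```

A model trained on such a sequence sees view tokens and UV tokens together, which is the sense in which it can learn their joint distribution; the `is_uv` mask is the kind of signal a decoupled-branch design could use to route tokens, though how SeqTex does this is not specified in the abstract.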