ChatPaper.ai

SeqTex: Generate Mesh Textures in Video Sequence

July 6, 2025
Authors: Ze Yuan, Xin Yu, Yangtian Sun, Yuan-Chen Guo, Yan-Pei Cao, Ding Liang, Xiaojuan Qi
cs.AI

Abstract

Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
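To make the core idea concrete, here is a minimal, hypothetical sketch of how multi-view tokens and UV tokens can be processed as one joint sequence with a geometry-informed attention bias. This is not the authors' implementation: the function names, the additive-bias formulation, and the toy dimensions are all illustrative assumptions; the actual SeqTex architecture builds on a pretrained video diffusion backbone with decoupled branches and adaptive token resolution.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_informed_attention(mv_tokens, uv_tokens, geo_bias):
    """Joint self-attention over concatenated multi-view and UV tokens.

    `geo_bias` is an additive attention bias (hypothetical simplification):
    entries linking a UV token to the view tokens its surface point projects
    onto are boosted, steering cross-domain feature alignment.
    """
    x = np.concatenate([mv_tokens, uv_tokens], axis=0)  # (N_mv + N_uv, d)
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # token-to-token affinities
    scores = scores + geo_bias           # inject geometric correspondence
    weights = softmax(scores, axis=-1)
    return weights @ x                   # attended features, same shape as x

# Toy example: 4 multi-view tokens and 2 UV tokens, 8-dim features.
rng = np.random.default_rng(0)
mv = rng.standard_normal((4, 8))
uv = rng.standard_normal((2, 8))
bias = np.zeros((6, 6))
bias[4, 0] = bias[0, 4] = 5.0  # pretend UV token 0 projects onto view token 0
out = geometry_informed_attention(mv, uv, bias)
print(out.shape)  # (6, 8)
```

Treating both domains as one sequence is what lets image-space priors flow into the UV domain: every UV token can attend to every rendered view, and the geometry bias tells it which views matter.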