SeqTex: Generate Mesh Textures in Video Sequence

Abstract

Training native 3D texture generative models remains a fundamental yet challenging problem, largely because large-scale, high-quality 3D texture datasets are scarce. This scarcity hinders generalization to real-world scenarios. To address it, most existing methods fine-tune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps, an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors of video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
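As a rough illustration of the sequence formulation, the sketch below lays out multi-view renderings and one UV texture map as segments of a single token sequence. It is not the SeqTex implementation. The resolutions, patch sizes, and names (`Segment`, `tokens_per_image`, `build_joint_layout`) are hypothetical. The abstract does not say how adaptive token resolution is realized, so the sketch simply assumes it can be read as giving the higher-resolution UV map a coarser patch size than the views.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str    # "view_0", "view_1", ..., or "uv"
    start: int   # index of the segment's first token in the joint sequence
    length: int  # number of tokens in the segment


def tokens_per_image(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for one image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def build_joint_layout(num_views, view_hw, uv_hw, view_patch, uv_patch):
    """Place the view tokens first and the UV tokens last in one sequence.

    Assumption for illustration only: each domain has its own patch size.
    """
    segments = []
    cursor = 0
    per_view = tokens_per_image(view_hw[0], view_hw[1], view_patch)
    for i in range(num_views):
        segments.append(Segment(f"view_{i}", cursor, per_view))
        cursor += per_view
    uv_tokens = tokens_per_image(uv_hw[0], uv_hw[1], uv_patch)
    segments.append(Segment("uv", cursor, uv_tokens))
    return segments


# Hypothetical setup: four 512x512 views and one 1024x1024 UV map.
layout = build_joint_layout(4, (512, 512), (1024, 1024), view_patch=16, uv_patch=32)
total_tokens = layout[-1].start + layout[-1].length

# Same setup, but with the UV map tokenized as finely as the views.
fine_layout = build_joint_layout(4, (512, 512), (1024, 1024), view_patch=16, uv_patch=16)
fine_total = fine_layout[-1].start + fine_layout[-1].length
```

Under these assumed numbers, the coarser UV patch size keeps the joint sequence at 5120 tokens instead of 8192, which is the kind of trade-off between texture detail and compute that the abstract refers to.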
