Helix4D: Complex 4D Mesh Generation

Abstract

Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D generation to video-conditioned 4D generation. Our design arises from two key questions: (a) how to let Trellis2's frame-local attention share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, so that our model inherits Trellis2's quality on rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the 3D encoding with no additional parameters. Extensive experiments on ActionBench and on our own challenging complex-dynamics dataset demonstrate the effectiveness of Helix4D for high-quality dynamic mesh generation.
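To make the two mechanisms concrete, the following is a minimal sketch and not the authors' implementation. The function names (`cross_frame_mask`, `allocate_bands`, `rope_angles`), the symmetric window, the per-axis band count, and the RoPE base of 10000 are all assumptions chosen for illustration; the abstract does not specify any of them. The first function builds a frame-level visibility pattern in which every frame sees its temporal neighbors plus the anchor frame 0. The second reassigns the lowest-frequency bands of each spatial axis to a time axis, so the total number of bands stays the same.

```python
from __future__ import annotations


def cross_frame_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when tokens of frame i may attend to tokens of frame j.

    Assumption: the window is symmetric (|i - j| <= window); the source does
    not say whether it is symmetric or causal.
    """
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            in_window = abs(i - j) <= window  # local sliding window
            is_anchor = j == 0                # first frame is always visible
            row.append(in_window or is_anchor)
        mask.append(row)
    return mask


def allocate_bands(bands_per_axis: int, time_bands_per_axis: int,
                   base: float = 10000.0) -> list[tuple[str, float]]:
    """Return (owner_axis, frequency) for every rotary band.

    Frequencies decrease with the band index, so the last
    `time_bands_per_axis` bands of each spatial axis are the lowest-frequency
    ones. Those bands are handed to the time axis "t" (hypothetical layout).
    """
    freqs = [base ** (-k / bands_per_axis) for k in range(bands_per_axis)]
    keep = bands_per_axis - time_bands_per_axis
    layout = []
    for axis in "xyz":
        for k, freq in enumerate(freqs):
            owner = axis if k < keep else "t"
            layout.append((owner, freq))
    return layout


def rope_angles(layout: list[tuple[str, float]],
                pos: dict[str, float]) -> list[float]:
    """Rotation angle of each band: coordinate of the owning axis times frequency."""
    return [pos[owner] * freq for owner, freq in layout]
```

For example, `cross_frame_mask(5, 1)` lets frame 4 attend to frames 3 and 4 and to the anchor frame 0, but not to frames 1 or 2. With `allocate_bands(4, 1)` the layout still has 12 bands in total, 3 of which now belong to time, and at `t = 0` those reassigned bands apply no rotation.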