Helix4D: Complex 4D Mesh Generation

Abstract

Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D generation to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2's frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, which lets our model inherit Trellis2's quality on rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D to 4D with no additional parameters. Extensive experiments demonstrate the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and on our own challenging complex-dynamics dataset.
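
The two mechanisms described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the window is taken to be symmetric, attention is modeled at the frame level rather than per token, RoPE bands are indexed from highest to lowest frequency, and the function names `frame_attention_mask` and `split_rope_bands` are hypothetical.

```python
def frame_attention_mask(num_frames, window):
    """Return mask[q][k] == True when query frame q may attend to key frame k.

    Assumed rule (not taken from the paper): frame q sees every frame within
    `window` steps of itself, plus the anchor frame 0.
    """
    mask = []
    for q in range(num_frames):
        row = []
        for k in range(num_frames):
            in_window = abs(q - k) <= window  # symmetric sliding window
            is_anchor = k == 0                # first frame is always visible
            row.append(in_window or is_anchor)
        mask.append(row)
    return mask


def split_rope_bands(num_bands, num_time_bands):
    """Assign rotary frequency bands to space or to time.

    Bands are indexed from highest frequency (0) to lowest (num_bands - 1).
    The `num_time_bands` lowest-frequency bands are reassigned to the time
    axis and the rest stay spatial, so the total band count is unchanged.
    """
    if not 0 <= num_time_bands <= num_bands:
        raise ValueError("num_time_bands must lie in [0, num_bands]")
    first_time_band = num_bands - num_time_bands
    spatial = list(range(first_time_band))
    temporal = list(range(first_time_band, num_bands))
    return spatial, temporal


if __name__ == "__main__":
    # Frame 4 with window 1 sees frames 3 and 4, plus the anchor frame 0.
    print(frame_attention_mask(5, 1)[4])
    # With 8 bands, the 2 lowest-frequency bands (6 and 7) encode time.
    print(split_rope_bands(8, 2))
```

In this sketch the anchor column of the mask is always enabled, which is one simple way for every frame to read from the first frame regardless of how far the window has moved. Reassigning existing bands rather than appending new ones keeps the encoding width fixed, which is consistent with the claim that no parameters are added.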