Helix4D: Complex 4D Mesh Generation

Abstract

Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D generation to video-conditioned 4D generation. Our design arises from two key questions: (a) how to let Trellis2's frame-local attention share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, which lets it inherit Trellis2's quality on rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D with no additional parameters. Extensive experiments on ActionBench and our own challenging complex-dynamics set show that Helix4D is effective for high-quality dynamic mesh generation.
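The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the symmetric window, the per-axis band split, and the choice to reuse the existing low frequencies for time are all assumptions made for illustration; the abstract does not specify these details.

```python
def cross_frame_mask(num_frames, window):
    """Frame-level attention mask: mask[q][k] is True when tokens of
    frame q may attend to tokens of frame k.

    Assumption: a symmetric window of radius `window` around each frame,
    plus the first frame (index 0) as an anchor visible to every frame.
    """
    mask = []
    for q in range(num_frames):
        row = []
        for k in range(num_frames):
            in_window = abs(q - k) <= window  # sliding temporal window
            is_anchor = k == 0                # first frame is always visible
            row.append(in_window or is_anchor)
        mask.append(row)
    return mask


def rope_angles_4d(x, y, z, t, bands_per_axis=8, time_bands_per_axis=1,
                   base=10000.0):
    """Rotation angles for one token at spatial position (x, y, z), frame t.

    Assumption: a 3D RoPE that gives each axis `bands_per_axis` frequency
    bands, ordered from high to low frequency. The lowest-frequency
    `time_bands_per_axis` bands of each axis are driven by t instead of
    the spatial coordinate, so the number of angles (and of parameters)
    stays the same as in the 3D encoding.
    """
    freqs = [base ** (-i / bands_per_axis) for i in range(bands_per_axis)]
    n_spatial = bands_per_axis - time_bands_per_axis
    angles = []
    for coord in (x, y, z):
        # High-frequency bands keep encoding the spatial coordinate.
        angles.extend(coord * f for f in freqs[:n_spatial])
        # Low-frequency bands are repurposed to encode time.
        angles.extend(t * f for f in freqs[n_spatial:])
    return angles


if __name__ == "__main__":
    for row in cross_frame_mask(num_frames=5, window=1):
        print("".join("x" if allowed else "." for allowed in row))
    print(len(rope_angles_4d(1.0, 2.0, 3.0, 5.0)))
```

Setting `time_bands_per_axis=0` reduces the second function to a plain 3D encoding, which is one way to see that the extension only reassigns existing bands and adds none.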