ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Abstract

Recent latent world models such as V-JEPA2 show promising ability to forecast future world states from video observations. However, dense prediction from a short observation window limits temporal context and can bias the predictor toward local, low-level extrapolation, which makes long-horizon semantics hard to capture and reduces downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames. They are nonetheless not ideal as standalone dense predictors, for three reasons: compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided, JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance through a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
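
The dual-temporal pathway and the multi-layer aggregation can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify window sizes, sampling counts, or how layers are combined, so the function names (`dense_indices`, `uniform_indices`, `aggregate_layers`), the choice of the most recent frames for the dense window, and the softmax-weighted sum over layers are all assumptions made for illustration.

```python
import math

def dense_indices(num_frames, window):
    """Dense pathway (assumed): the most recent `window` frames at stride 1."""
    start = max(0, num_frames - window)
    return list(range(start, num_frames))

def uniform_indices(num_frames, num_samples):
    """Sparse pathway (assumed): `num_samples` frames spread evenly over the clip."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

def aggregate_layers(layer_feats, layer_logits):
    """Combine per-layer feature vectors with softmax weights (an assumed
    stand-in for the paper's hierarchical pyramid extraction module)."""
    peak = max(layer_logits)
    exps = [math.exp(x - peak) for x in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats)) for d in range(dim)]

# Hypothetical 64-frame clip: 8 dense frames vs. 8 uniformly strided frames.
dense = dense_indices(64, 8)      # frames 56..63, stride 1
sparse = uniform_indices(64, 8)   # frames 0, 9, 18, ..., 63, stride 9
guidance = aggregate_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With the same frame budget, the uniform branch covers the whole clip at a larger stride while the dense branch covers only a short window, which is the trade-off the abstract describes.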
