ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
March 23, 2026
Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
cs.AI
Abstract
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
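To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch of (a) the dual-temporal pathway, i.e., a dense recent frame window for the JEPA branch plus uniform, larger-stride sampling for the VLM thinker branch, and (b) a hierarchical pyramid aggregation of multi-layer VLM hidden states into guidance features. Every name and design detail here (`PyramidGuidance`, `dual_temporal_indices`, `guidance_loss`, the shallow-to-deep layer grouping, the cosine-alignment objective, all dimensions) is an illustrative assumption, not the paper's actual implementation.

```python
# Hedged sketch of the dual-temporal pathway and hierarchical pyramid
# representation extraction described in the abstract. Names and design
# choices are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidGuidance(nn.Module):
    """Aggregate multi-layer VLM hidden states into guidance features.

    Hidden states from L layers are split into `num_levels` depth groups
    (shallow -> deep, one plausible proxy for the VLM's progressive
    reasoning); each group is mean-pooled over layers, projected, and the
    levels are fused into a single guidance representation.
    """

    def __init__(self, vlm_dim: int, guide_dim: int, num_levels: int = 3):
        super().__init__()
        self.num_levels = num_levels
        self.level_proj = nn.ModuleList(
            [nn.Linear(vlm_dim, guide_dim) for _ in range(num_levels)]
        )
        self.fuse = nn.Linear(num_levels * guide_dim, guide_dim)

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: list of [B, T_sparse, vlm_dim], one per VLM layer.
        assert len(hidden_states) >= self.num_levels
        chunk = len(hidden_states) // self.num_levels  # trailing layers dropped
        levels = []
        for i, proj in enumerate(self.level_proj):
            group = hidden_states[i * chunk:(i + 1) * chunk]
            pooled = torch.stack(group, dim=0).mean(dim=0)  # [B, T_sparse, D]
            levels.append(proj(pooled))
        return self.fuse(torch.cat(levels, dim=-1))  # [B, T_sparse, guide_dim]


def dual_temporal_indices(num_frames: int, dense_window: int, stride: int):
    """Dense recent window for the JEPA branch; uniform larger-stride
    sampling over the full clip for the VLM thinker branch."""
    dense = list(range(num_frames - dense_window, num_frames))
    sparse = list(range(0, num_frames, stride))
    return dense, sparse


def guidance_loss(pred_latents: torch.Tensor, guide_feats: torch.Tensor):
    # One plausible alignment objective (assumed, not from the paper):
    # negative cosine similarity between time-averaged predictor latents
    # and VLM guidance features, both assumed projected to guide_dim.
    p = F.normalize(pred_latents.mean(dim=1), dim=-1)
    g = F.normalize(guide_feats.mean(dim=1), dim=-1)
    return -(p * g).sum(dim=-1).mean()


if __name__ == "__main__":
    B, T_sparse, num_layers, vlm_dim = 2, 8, 12, 768
    hs = [torch.randn(B, T_sparse, vlm_dim) for _ in range(num_layers)]
    guide = PyramidGuidance(vlm_dim, guide_dim=256)(hs)
    pred = torch.randn(B, 16, 256)  # stand-in for JEPA predictor latents
    print(dual_temporal_indices(num_frames=64, dense_window=16, stride=8))
    print(guidance_loss(pred, guide).item())
```

The shallow-to-deep layer grouping is just one simple way to expose progressive reasoning signals from multiple VLM depths; the paper's actual pyramid aggregation and alignment objective may differ.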