ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
March 23, 2026
Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
cs.AI
Abstract
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
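To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch of (a) the dual-temporal pathway, i.e., a dense recent frame window for the JEPA branch plus uniform, larger-stride sampling for the VLM thinker branch, and (b) a hierarchical pyramid aggregation of multi-layer VLM hidden states into guidance features. Every name and design detail here (`PyramidGuidance`, `dual_temporal_indices`, `guidance_loss`, the shallow-to-deep layer grouping, the cosine-alignment objective, all dimensions) is an illustrative assumption, not the paper's actual implementation.

```python
# Hedged sketch of the dual-temporal pathway and hierarchical pyramid
# representation extraction described in the abstract. Names and design
# choices are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidGuidance(nn.Module):
    """Aggregate multi-layer VLM hidden states into guidance features.

    Hidden states from L layers are split into `num_levels` depth groups
    (shallow -> deep, one plausible proxy for the VLM's progressive
    reasoning); each group is mean-pooled over layers, projected, and the
    levels are fused into a single guidance representation.
    """

    def __init__(self, vlm_dim: int, guide_dim: int, num_levels: int = 3):
        super().__init__()
        self.num_levels = num_levels
        self.level_proj = nn.ModuleList(
            [nn.Linear(vlm_dim, guide_dim) for _ in range(num_levels)]
        )
        self.fuse = nn.Linear(num_levels * guide_dim, guide_dim)

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: list of [B, T_sparse, vlm_dim], one per VLM layer.
        assert len(hidden_states) >= self.num_levels
        chunk = len(hidden_states) // self.num_levels  # trailing layers dropped
        levels = []
        for i, proj in enumerate(self.level_proj):
            group = hidden_states[i * chunk:(i + 1) * chunk]
            pooled = torch.stack(group, dim=0).mean(dim=0)  # [B, T_sparse, D]
            levels.append(proj(pooled))
        return self.fuse(torch.cat(levels, dim=-1))  # [B, T_sparse, guide_dim]


def dual_temporal_indices(num_frames: int, dense_window: int, stride: int):
    """Dense recent window for the JEPA branch; uniform larger-stride
    sampling over the full clip for the VLM thinker branch."""
    dense = list(range(num_frames - dense_window, num_frames))
    sparse = list(range(0, num_frames, stride))
    return dense, sparse


def guidance_loss(pred_latents: torch.Tensor, guide_feats: torch.Tensor):
    # One plausible alignment objective (assumed, not from the paper):
    # negative cosine similarity between time-averaged predictor latents
    # and VLM guidance features, both assumed projected to guide_dim.
    p = F.normalize(pred_latents.mean(dim=1), dim=-1)
    g = F.normalize(guide_feats.mean(dim=1), dim=-1)
    return -(p * g).sum(dim=-1).mean()


if __name__ == "__main__":
    B, T_sparse, num_layers, vlm_dim = 2, 8, 12, 768
    hs = [torch.randn(B, T_sparse, vlm_dim) for _ in range(num_layers)]
    guide = PyramidGuidance(vlm_dim, guide_dim=256)(hs)
    pred = torch.randn(B, 16, 256)  # stand-in for JEPA predictor latents
    print(dual_temporal_indices(num_frames=64, dense_window=16, stride=8))
    print(guidance_loss(pred, guide).item())
```

The shallow-to-deep layer grouping is just one simple way to expose progressive reasoning signals from multiple VLM depths; the paper's actual pyramid aggregation and alignment objective may differ.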