Latent Implicit Visual Reasoning
December 24, 2025
Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig
cs.AI
Abstract
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
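To make the described mechanism concrete, below is a minimal, hypothetical PyTorch sketch of what unsupervised visual reasoning tokens could look like: a small set of learnable query tokens that cross-attend over the full grid of image patch embeddings and re-encode it in a task-adaptive way before being handed to the language model. The module name, token count, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' code): learnable "visual reasoning
# tokens" that attend globally over image patch features and re-encode them.
import torch
import torch.nn as nn


class VisualReasoningTokens(nn.Module):
    """K learnable query tokens read all image patch embeddings via
    cross-attention and produce K task-adaptive visual summaries.
    All names and hyperparameters here are illustrative assumptions."""

    def __init__(self, num_tokens: int = 8, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Learnable latent queries, discovered end-to-end without explicit supervision.
        self.latents = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, dim) from the vision encoder.
        b = patch_embeds.size(0)
        q = self.norm_q(self.latents).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(patch_embeds)
        # Each reasoning token attends globally over every image patch.
        attended, _ = self.cross_attn(q, kv, kv)
        tokens = attended + self.mlp(attended)
        # (batch, num_tokens, dim); these would be appended to the LMM input sequence.
        return tokens


if __name__ == "__main__":
    patches = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid
    reasoner = VisualReasoningTokens()
    visual_tokens = reasoner(patches)
    print(visual_tokens.shape)            # torch.Size([2, 8, 1024])
```

In this reading, the latent queries play the role of the task-adaptive re-encoding the abstract describes: because they are trained only through the downstream objective, no helper images, depth maps, or crops are needed to tell the model which visual abstractions to form.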