MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning
February 11, 2026
Authors: Chenhao Zhang, Yazhe Niu, Hongsheng Li
cs.AI
Abstract
Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural background knowledge, and Theory of Mind (ToM) capabilities, all of which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework comprises three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench.
Our fully open-source MetaphorStar family, trained with TFQ-GRPO on TFQ-Data, improves performance on image implication benchmarks by an average of 82.6%. Compared with more than 20 mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice and Open-Style Questions and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves general understanding ability, especially complex visual reasoning. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. All model weights, datasets, and method code are open-sourced at https://metaphorstar.github.io.
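The abstract names TFQ-GRPO but gives no implementation details. As background, GRPO-style (Group Relative Policy Optimization) reinforcement learning replaces a learned value critic with group-relative advantages: for each prompt, the policy samples a group of responses, scores each one, and normalizes the scores against the group's own statistics. The sketch below is a minimal, hypothetical illustration of that normalization step only; the function name `group_relative_advantages`, the binary correctness reward, and all parameters are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of the group-relative advantage computation used by
# GRPO-style visual RL methods. NOT the paper's TFQ-GRPO implementation:
# the abstract does not specify rewards, group size, or normalization.
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one prompt's group of sampled-response rewards into advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # Responses scoring above the group mean get positive advantages and are
    # reinforced; below-mean responses are penalized. No value critic is needed.
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four sampled answers to one image-implication question,
# rewarded 1.0 if the final answer is correct and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[1.0, -1.0, -1.0, 1.0]
```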