Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework

Abstract

Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel at basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: missing context obscures the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching for and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Using the lightweight GPT-4o-mini model, our framework achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a substantial improvement on the Chinese benchmark, performing comparably to the GPT-4o model on Multiple-Choice Question (MCQ) and outperforming it by 36.7% on Open-Style Question (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the fields of vision-language reasoning and human-AI interaction. Our project is publicly available at https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep.
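The three stages named in the abstract (Perception, Search, Reasoning) can be pictured as a simple pipeline. The following is a minimal Python sketch of that control flow only; it is not the authors' implementation, which is available at the project URL. Every name here (`perceive`, `search`, `reason`, `understand_image`, the `llm` and `retrieve` callables, the prompt wording, and the `max_rounds` cap) is a hypothetical placeholder chosen for illustration. The model and the knowledge source are passed in as plain functions, so the sketch makes no claim about any particular API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: any function mapping a prompt/query string to a string.
LLM = Callable[[str], str]
Retriever = Callable[[str], str]


@dataclass
class ImplicationResult:
    description: str      # text produced by the Perception stage
    knowledge: List[str]  # notes gathered by the Search stage
    implication: str      # final answer from the Reasoning stage


def perceive(image_ref: str, llm: LLM) -> str:
    """Perception: turn the image into a multi-level textual description."""
    return llm(f"PERCEIVE: describe objects, scene, and mood of {image_ref}")


def search(description: str, llm: LLM, retrieve: Retriever,
           max_rounds: int = 3) -> List[str]:
    """Search: repeatedly ask what is still ambiguous and look it up."""
    knowledge: List[str] = []
    for _ in range(max_rounds):
        query = llm(
            f"QUERY: description={description}; known={knowledge}. "
            "Name one thing to look up, or reply DONE."
        )
        if query.strip().upper() == "DONE":
            break  # the model reports nothing further to resolve
        knowledge.append(retrieve(query))
    return knowledge


def reason(description: str, knowledge: List[str], llm: LLM) -> str:
    """Reasoning: derive the implied meaning from description plus knowledge."""
    return llm(
        f"REASON: description={description}; knowledge={knowledge}. "
        "Explain step by step, then state the implied meaning."
    )


def understand_image(image_ref: str, llm: LLM,
                     retrieve: Retriever) -> ImplicationResult:
    """Run the three stages in order and keep each intermediate output."""
    description = perceive(image_ref, llm)
    knowledge = search(description, llm, retrieve)
    implication = reason(description, knowledge, llm)
    return ImplicationResult(description, knowledge, implication)
```

Keeping each stage's output in `ImplicationResult` makes it possible to inspect where an interpretation went wrong: in the description, in the retrieved knowledge, or in the final reasoning step.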
