Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework
May 22, 2025
Authors: Chenhao Zhang, Yazhe Niu
cs.AI
Abstract
Metaphorical comprehension in images remains a critical challenge for AI
systems, as existing models struggle to grasp the nuanced cultural, emotional,
and contextual implications embedded in visual content. While multimodal large
language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they
face a fundamental limitation on image implication tasks: contextual
gaps that obscure the relationships between different visual elements and their
abstract meanings. Inspired by the human cognitive process, we propose Let
Androids Dream (LAD), a novel framework for image implication understanding and
reasoning. LAD addresses missing context through a three-stage framework:
(1) Perception: converting visual information into rich and multi-level textual
representations, (2) Search: iteratively searching and integrating cross-domain
knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned
image implications via explicit reasoning. Our framework with the lightweight
GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English
image implication benchmark and a substantial improvement on the Chinese benchmark,
performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ)
and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work
provides new insights into how AI can more effectively interpret image
implications, advancing the field of vision-language reasoning and human-AI
interaction. Our project is publicly available at
https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep.
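
The three stages described in the abstract (Perception, Search, Reasoning) can be pictured as a simple pipeline. The sketch below is illustrative only and is not taken from the authors' repository: the function names, prompts, the `max_rounds` parameter, and the callable types `Vision`, `LLM`, and `Search` are all assumptions, standing in for whatever model and search backends one would actually plug in.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces (assumptions, not the paper's API):
Vision = Callable[[bytes, str], str]   # (image bytes, prompt) -> text
LLM = Callable[[str], str]             # prompt -> text
Search = Callable[[str], List[str]]    # query -> list of text snippets


@dataclass
class Context:
    description: str                                    # Perception output
    evidence: List[str] = field(default_factory=list)   # Search output


def perceive(image: bytes, vision: Vision) -> str:
    """Stage 1 (Perception): turn visual information into a textual description."""
    return vision(image, "Describe the scene, its elements, and how they relate.")


def search(description: str, llm: LLM, search_fn: Search, max_rounds: int = 2) -> List[str]:
    """Stage 2 (Search): iteratively look up knowledge that may resolve ambiguity."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        query = llm(f"Description: {description}\nKnown: {evidence}\nWhat should be looked up next?")
        hits = search_fn(query)
        if not hits:  # nothing was found for this query, so stop early
            break
        evidence.extend(hits)
    return evidence


def reason(ctx: Context, llm: LLM) -> str:
    """Stage 3 (Reasoning): derive the implied meaning from description plus evidence."""
    return llm(
        f"Description: {ctx.description}\nEvidence: {ctx.evidence}\n"
        "Explain the implied meaning of the image step by step."
    )


def run_pipeline(image: bytes, vision: Vision, llm: LLM, search_fn: Search) -> str:
    """Chain the three stages: Perception, then Search, then Reasoning."""
    description = perceive(image, vision)
    evidence = search(description, llm, search_fn)
    return reason(Context(description, evidence), llm)
```

Passing the models and the search backend in as plain callables keeps the sketch independent of any particular provider; swapping in a different model would only change the arguments given to `run_pipeline`.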