Think Visually, Reason Textually: Vision-Language Synergy in ARC
November 19, 2025
Authors: Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
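The abstract does not spell out implementation details, so the following is a minimal sketch, assuming a Pillow-based grid renderer and placeholder multimodal-model calls. The color palette and the helper names `induce_rule`, `execute_rule`, and `verify_visually` are illustrative assumptions, not the authors' released code; they stand in for whatever vision-language API backs the pipeline.

```python
# Minimal sketch (not the authors' code) of the workflow the abstract describes:
# grids are rendered as images for visual rule induction and verification (VLSR),
# rule execution stays textual, and a modality-switch check (MSSC) retries on failure.
from PIL import Image

# A 10-color palette for ARC cell values 0-9; the exact colors are an assumption.
PALETTE = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]

def render_grid(grid: list[list[int]], cell: int = 24) -> Image.Image:
    """Render an ARC grid as an RGB image, one cell-by-cell square per value."""
    h, w = len(grid), len(grid[0])
    img = Image.new("RGB", (w * cell, h * cell))
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            img.paste(PALETTE[v], (c * cell, r * cell, (c + 1) * cell, (r + 1) * cell))
    return img

# Placeholders for multimodal-model calls; any vision-language API could back these.
def induce_rule(example_images: list[Image.Image]) -> str: ...
def execute_rule(rule: str, test_grid: list[list[int]], feedback: str | None = None) -> list[list[int]]: ...
def verify_visually(rule: str, output_image: Image.Image) -> bool: ...

def solve(train_pairs: list[tuple[list[list[int]], list[list[int]]]],
          test_grid: list[list[int]], max_rounds: int = 3) -> list[list[int]]:
    """VLSR: induce the rule from rendered examples, then execute it in text.
    MSSC: render the candidate output, verify it visually, and re-execute on failure."""
    example_images = [render_grid(g) for pair in train_pairs for g in pair]
    rule = induce_rule(example_images)                    # visual pattern abstraction
    candidate = execute_rule(rule, test_grid)             # symbolic, textual execution
    for _ in range(max_rounds - 1):
        if verify_visually(rule, render_grid(candidate)): # visual verification step
            break
        candidate = execute_rule(rule, test_grid, feedback="previous output failed the visual check")
    return candidate
```

In this sketch the self-correction signal comes only from switching modalities (checking a rendered image against the induced rule), matching the abstract's claim that vision handles verification while language handles precise execution; how the real system encodes that feedback is not specified in the abstract.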