A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
December 16, 2025
Authors: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen
cs.AI
Abstract
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training on annotated datasets, which leads to poor generalization to novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
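To make the division of labor concrete, below is a minimal sketch of how the three stages described in the abstract could be wired together. The class names, wrapper objects, and method signatures (visualize, select_part, locate) are assumptions introduced purely for illustration; the abstract does not specify the authors' actual implementation or APIs.

# Hypothetical sketch of a Dreamer/Thinker/Spotter pipeline, assuming wrapper
# objects around the underlying foundation models. Not the authors' code.

from dataclasses import dataclass

import numpy as np


@dataclass
class AffordanceResult:
    part_name: str     # object part the Thinker chose (e.g., "handle")
    mask: np.ndarray   # pixel-level interaction region from the Spotter


class A4AgentPipeline:
    def __init__(self, dreamer, thinker, spotter):
        self.dreamer = dreamer  # generative model: imagines the interaction
        self.thinker = thinker  # vision-language model: names the part to use
        self.spotter = spotter  # vision foundation model: segments that part

    def predict(self, image: np.ndarray, instruction: str) -> AffordanceResult:
        # Stage 1 (Dreamer): visualize how the instructed interaction would look.
        imagined = self.dreamer.visualize(image, instruction)
        # Stage 2 (Thinker): decide what object part should be interacted with.
        part_name = self.thinker.select_part(image, imagined, instruction)
        # Stage 3 (Spotter): locate where that part lies in the original image.
        mask = self.spotter.locate(image, part_name)
        return AffordanceResult(part_name=part_name, mask=mask)

Because each stage only exchanges an image, a part name, and a mask, any of the three models can in principle be swapped for a stronger pre-trained alternative without retraining, which is the property the abstract attributes to the training-free, test-time design.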