

A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

December 16, 2025
作者: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen
cs.AI

Abstract

Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training on annotated datasets, which leads to poor generalization to novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
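The abstract describes the three-stage decomposition but not its interface. Below is a minimal, hypothetical Python sketch of how such a training-free pipeline could be wired together at test time. The class names (`A4Agent`, `AffordanceResult`), the `predict` orchestrator, and the callable signatures are illustrative assumptions, not the paper's actual API; in practice the Dreamer would wrap a generative image model, the Thinker a large vision-language model, and the Spotter a promptable vision foundation model such as a segmenter.

```python
# Hypothetical sketch of the A4-Agent three-stage pipeline from the abstract.
# All names and interfaces here are assumptions for illustration; the paper's
# actual implementation is not specified in this summary.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AffordanceResult:
    part_name: str  # object part chosen by the Thinker (e.g. "handle")
    mask: Any       # interaction region localized by the Spotter


class A4Agent:
    def __init__(
        self,
        dreamer: Callable[[Any, str], Any],        # (image, instruction) -> imagined interaction image
        thinker: Callable[[Any, Any, str], str],   # (image, dream, instruction) -> part name
        spotter: Callable[[Any, str], Any],        # (image, part name) -> segmentation mask
    ):
        self.dreamer = dreamer
        self.thinker = thinker
        self.spotter = spotter

    def predict(self, image: Any, instruction: str) -> AffordanceResult:
        # Stage 1 (Dreamer): visualize how the interaction would look.
        dream = self.dreamer(image, instruction)
        # Stage 2 (Thinker): decide what object part to interact with.
        part = self.thinker(image, dream, instruction)
        # Stage 3 (Spotter): locate where that part is in the original image.
        mask = self.spotter(image, part)
        return AffordanceResult(part_name=part, mask=mask)


# Usage with stub models (real versions would be frozen pre-trained models,
# coordinated at test time with no task-specific fine-tuning):
agent = A4Agent(
    dreamer=lambda img, instr: img,              # stub generative model
    thinker=lambda img, dream, instr: "handle",  # stub vision-language model
    spotter=lambda img, part: [[0, 0], [1, 1]],  # stub mask
)
result = agent.predict(image=None, instruction="pick up the mug")
print(result.part_name)  # -> "handle"
```

Note how the decoupling shows up in the interface: each stage is an independently swappable callable, which is what lets the framework stay zero-shot rather than training a monolithic model end to end.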