FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI
November 17, 2025
Authors: Yuhang Peng, Yizhou Pan, Xinning He, Jihaoyu Yang, Xinyu Yin, Han Wang, Xiaoji Zheng, Chao Gao, Jiangtao Gong
cs.AI
Abstract
As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks. To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into an interaction-enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We also present and publicly release the FreeAskWorld benchmark, a large-scale dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work highlights that interaction itself serves as an additional information modality.
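To make the Direction Inquiry loop concrete, here is a minimal sketch of one closed-loop step in which a navigation policy may pause to ask an LLM-driven pedestrian for guidance, and the reply is fed back as an extra input modality. All names here (Observation, llm_helper_reply, agent_step, the "ask" action) are illustrative assumptions, not the paper's actual API:

```python
# Hypothetical sketch of a Direction Inquiry step; not the FreeAskWorld API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Observation:
    rgb: object        # current camera frame (placeholder)
    instruction: str   # natural-language goal, e.g. "find the pharmacy"

def llm_helper_reply(question: str, scene_summary: str) -> str:
    """Stand-in for the LLM-driven pedestrian: returns a guidance utterance.

    In the paper's framework this role is played by an LLM conditioned on
    intention and social-cognition state; here we return a canned response.
    """
    return f"Head straight, then turn left at the intersection ({scene_summary})."

def agent_step(policy: Callable[[Observation], str], obs: Observation) -> str:
    """One closed-loop step: the policy either moves or asks for directions."""
    action = policy(obs)  # e.g. "forward", "left", "right", or "ask"
    if action == "ask":
        question = f"Excuse me, how do I get to: {obs.instruction}?"
        guidance = llm_helper_reply(question, scene_summary="near the plaza")
        # The reply becomes an additional information modality for the
        # next decision, appended to the agent's instruction context.
        obs.instruction += " | guidance: " + guidance
        action = policy(obs)
    return action

# Usage: a trivial policy that always asks once, then moves forward.
if __name__ == "__main__":
    def toy_policy(obs: Observation) -> str:
        return "forward" if "guidance" in obs.instruction else "ask"

    obs = Observation(rgb=None, instruction="find the pharmacy")
    print(agent_step(toy_policy, obs))  # -> "forward", after one inquiry
```

The design point this illustrates is that asking is itself an action in the closed loop: the agent trades a navigation step for information, and the helper's utterance changes the state the policy conditions on.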