FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI
November 17, 2025
Authors: Yuhang Peng, Yizhou Pan, Xinning He, Jihaoyu Yang, Xinyu Yin, Han Wang, Xiaoji Zheng, Chao Gao, Jiangtao Gong
cs.AI
Abstract
As embodied intelligence emerges as a core frontier of artificial intelligence research, simulation platforms must evolve beyond low-level physical interaction to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. The framework supports scalable, realistic human-agent simulation and includes a modular data generation pipeline tailored to diverse embodied tasks. To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into an interaction-enriched Direction Inquiry setting, in which agents can actively seek and interpret navigational guidance. We also publicly release the FreeAskWorld benchmark, a large-scale dataset comprising reconstructed environments, six task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support the training and evaluation of embodied AI systems. We benchmark VLN models and human participants under both open-loop and closed-loop settings. Experimental results show that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving improved semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work shows that interaction itself can serve as an additional information modality.
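To make the closed-loop Direction Inquiry setting concrete, the sketch below shows what one such episode could look like. It is a minimal illustration under loud assumptions: the `ToyWorld` class, its `ask_direction` oracle, and every other name are hypothetical stand-ins invented here, not the released FreeAskWorld API; a one-dimensional corridor replaces the reconstructed 3D environments, and a scripted oracle replaces the LLM-driven pedestrian.

```python
# Hypothetical sketch of a closed-loop Direction Inquiry episode.
# ToyWorld and all names below are illustrative stand-ins, not the
# FreeAskWorld API: a 1-D corridor replaces the reconstructed 3D
# scenes, and a scripted oracle replaces the LLM-driven pedestrian.
import random


class ToyWorld:
    """A corridor of `length` cells with a hidden goal cell."""

    def __init__(self, length: int = 12, seed: int = 0):
        rng = random.Random(seed)
        self.length = length
        self.goal = rng.randrange(length)
        self.agent = rng.randrange(length)

    def step(self, action: str) -> bool:
        """Move one cell; return True once the agent stands on the goal."""
        if action == "left":
            self.agent = max(0, self.agent - 1)
        elif action == "right":
            self.agent = min(self.length - 1, self.agent + 1)
        return self.agent == self.goal

    def ask_direction(self) -> str:
        """Scripted oracle playing the role of the LLM-driven pedestrian."""
        return "left" if self.goal < self.agent else "right"


def run_episode(world: ToyWorld, ask_every: int = 4, max_steps: int = 50) -> bool:
    """Closed loop: act on the last answer, re-asking every few steps."""
    heading = world.ask_direction()          # initial direction inquiry
    for t in range(max_steps):
        if t > 0 and t % ask_every == 0:     # interaction as an extra
            heading = world.ask_direction()  # information modality
        if world.step(heading):
            return True
    return False


if __name__ == "__main__":
    wins = sum(run_episode(ToyWorld(seed=s)) for s in range(100))
    print(f"success rate: {wins}/100")
```

The toy loop is only meant to mirror the structure the abstract describes: the agent's periodic inquiry acts as a second information channel alongside its own observations, and the closed loop lets fresh guidance correct accumulated navigation error.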