

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

February 18, 2025
作者: Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
cs.AI

摘要

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies faced several challenges, including evaluation randomness, dependence on indirect metrics, and limited domain coverage. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited world modeling capabilities. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.
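To make "execution-based" evaluation concrete: rather than judging generated PDDL by surface similarity to a reference, an evaluator can mechanically check whether the output is structurally valid. The sketch below is a hypothetical, minimal illustration (not the paper's actual metric): it checks that a candidate domain has balanced parentheses and contains the required top-level sections. The `validate_pddl` function and the `switch` domain are invented for this example.

```python
# Hypothetical sketch of one execution-based check an LLM-generated
# PDDL domain might face: balanced parentheses plus the presence of
# required top-level sections. Not the benchmark's real evaluator.
REQUIRED = ("define", ":predicates", ":action")

def validate_pddl(domain: str) -> bool:
    """Return True if the text is paren-balanced and has required sections."""
    depth = 0
    for ch in domain:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing paren with no matching opener
                return False
    return depth == 0 and all(key in domain for key in REQUIRED)

# A tiny well-formed domain in the spirit of the benchmark's PDDL targets.
sample = """
(define (domain switch)
  (:predicates (on))
  (:action flip
    :parameters ()
    :effect (on)))
"""
print(validate_pddl(sample))   # well-formed -> True
print(validate_pddl("(define"))  # unbalanced, missing sections -> False
```

A real multi-criteria evaluator would go further, e.g. passing the domain through an actual PDDL parser and attempting to solve reference problems with a planner, so that semantic as well as syntactic errors are caught.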

