Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Abstract

Recently, there has been growing interest in using large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been studied extensively in the context of world modeling, prior work has faced several challenges: randomness in evaluation, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce Text2World, a novel benchmark based on the Planning Domain Definition Language (PDDL). It features hundreds of diverse domains and employs multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform the others. However, even the best-performing model still demonstrates limited world modeling capabilities. Building on these insights, we examine several promising strategies for enhancing the world modeling capabilities of LLMs, including test-time scaling and agent training. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research on leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.
