DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents
September 26, 2025
Authors: Yansong Ning, Rui Liu, Jun Wang, Kai Chen, Wei Li, Jun Fang, Kan Zheng, Naiqiang Tan, Hao Liu
cs.AI
Abstract
Travel planning (TP) agents have recently emerged as building blocks that
interact with external tools and resources to generate travel itineraries and
ensure an enjoyable user experience. Despite these benefits, existing studies
rely on hand-crafted prompts and fixed agent workflows, which hinders the
development of more flexible and autonomous TP agents. This paper proposes
DeepTravel, an end-to-end agentic reinforcement learning framework for building
autonomous travel planning agents that can plan, execute tools, and reflect on
tool responses to explore, verify, and refine intermediate actions in
multi-step reasoning. To achieve this, we first construct a robust sandbox
environment by caching transportation, accommodation, and POI data, which
allows TP agents to be trained without being constrained by real-world API
limitations (e.g., inconsistent outputs). Moreover, we develop a hierarchical
reward modeling system, in which a trajectory-level verifier first checks
spatiotemporal feasibility and filters out unsatisfactory travel itineraries,
and a turn-level verifier then validates the consistency of itinerary details
with tool responses, enabling an efficient and precise reward service. Finally,
we propose a replay-augmented reinforcement learning method that enables the TP
agent to periodically replay from a failure experience buffer, notably
improving its agentic capability. We deploy the trained TP agent in the DiDi
Enterprise Solutions App and conduct comprehensive online and offline
evaluations, demonstrating that DeepTravel enables small LLMs (e.g.,
Qwen3-32B) to significantly outperform existing frontier LLMs such as OpenAI
o1, o3, and DeepSeek-R1 in travel planning tasks.
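
The two-stage reward and the failure buffer described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the binary 0/1 reward, the `Stop` record, the dictionary-equality consistency check, and every function and class name here are assumptions made only for illustration.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Stop:
    """Hypothetical itinerary stop; times are hours since the trip start."""
    name: str
    start: float
    end: float
    travel_to_next: float = 0.0  # hours needed to reach the next stop


def trajectory_feasible(stops):
    """Trajectory-level check (assumed form): stops must not run backwards
    in time, and each stop must leave enough time to reach the next one."""
    if any(s.end < s.start for s in stops):
        return False
    return all(cur.end + cur.travel_to_next <= nxt.start
               for cur, nxt in zip(stops, stops[1:]))


def turn_consistent(claims, tool_response):
    """Turn-level check (assumed form): every detail the agent states must
    match the value returned by the tool in that turn."""
    return all(tool_response.get(key) == value for key, value in claims.items())


def hierarchical_reward(stops, turns):
    """Run the cheap trajectory-level gate first; only itineraries that pass
    are checked turn by turn. The binary reward is an assumption."""
    if not trajectory_feasible(stops):
        return 0.0
    return 1.0 if all(turn_consistent(c, r) for c, r in turns) else 0.0


class FailureBuffer:
    """Bounded store of queries whose rollouts earned zero reward, so they
    can be sampled again in later training steps."""

    def __init__(self, capacity=1000, seed=0):
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, query, reward):
        if reward == 0.0:
            self.items.append(query)

    def sample(self, k):
        pool = list(self.items)
        return self.rng.sample(pool, min(k, len(pool)))
```

In a training loop, one would call `hierarchical_reward` on each rollout, pass the query and its reward to `FailureBuffer.add`, and periodically mix `FailureBuffer.sample(k)` back into the batch. The replay schedule and batch composition are not specified in the abstract, so they are left open here.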