DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents

Abstract

Travel planning (TP) agents have recently emerged as a building block that interacts with external tools and resources to generate travel itineraries and ensure an enjoyable user experience. Despite these benefits, existing studies rely on hand-crafted prompts and fixed agent workflows, which hinders the development of more flexible and autonomous TP agents. This paper proposes DeepTravel, an end-to-end agentic reinforcement learning framework for building autonomous travel planning agents that can autonomously plan, execute tools, and reflect on tool responses to explore, verify, and refine intermediate actions in multi-step reasoning. To achieve this, we first construct a robust sandbox environment by caching transportation, accommodation, and point-of-interest (POI) data, which allows TP agents to be trained without being constrained by the limitations of real-world APIs (e.g., inconsistent outputs). Moreover, we develop a hierarchical reward modeling system, in which a trajectory-level verifier first checks spatiotemporal feasibility and filters out unsatisfactory travel itineraries, and a turn-level verifier then validates the consistency of itinerary details with tool responses, enabling an efficient and precise reward service. Finally, we propose a replay-augmented reinforcement learning method that enables the TP agent to periodically replay from a failure experience buffer, leading to the emergence of notable agentic capacity. We deploy the trained TP agent on the DiDi Enterprise Solutions App and conduct comprehensive online and offline evaluations, demonstrating that DeepTravel enables small LLMs (e.g., Qwen3 32B) to significantly outperform existing frontier LLMs such as OpenAI o1, o3, and DeepSeek R1 in travel planning tasks.
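
To make the two-stage reward and the failure replay idea more concrete, the following is a minimal Python sketch, not the authors' implementation. All names here (Turn, Trajectory, trajectory_level_ok, turn_level_ok, hierarchical_reward, FailureBuffer) are hypothetical. The binary 0/1 reward, the reduction of "spatiotemporal feasibility" to a simple time-overlap check, and the uniform sampling from the buffer are assumptions made for illustration; the abstract does not specify any of them.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Turn:
    claim: dict          # itinerary details the agent stated in this turn
    tool_response: dict  # cached sandbox response the claim should match


@dataclass
class Trajectory:
    query: str
    turns: list = field(default_factory=list)
    # (start_hour, end_hour) for each itinerary stop, in visiting order
    schedule: list = field(default_factory=list)


def trajectory_level_ok(traj):
    """Coarse check (assumed): stops have positive duration and do not overlap."""
    prev_end = float("-inf")
    for start, end in traj.schedule:
        if start < prev_end or end <= start:
            return False
        prev_end = end
    return True


def turn_level_ok(turn):
    """Fine check (assumed): every claimed field equals the tool response."""
    return all(turn.tool_response.get(k) == v for k, v in turn.claim.items())


def hierarchical_reward(traj):
    """Run the cheap trajectory-level check first; inspect turns only if it passes."""
    if not trajectory_level_ok(traj):
        return 0.0
    return 1.0 if all(turn_level_ok(t) for t in traj.turns) else 0.0


class FailureBuffer:
    """Bounded store of queries whose rollouts failed, for periodic replay."""

    def __init__(self, capacity=1000):
        self._queries = deque(maxlen=capacity)  # oldest entries are evicted

    def add(self, query):
        self._queries.append(query)

    def sample(self, k, rng=random):
        # Uniform sampling without replacement (an assumption of this sketch).
        k = min(k, len(self._queries))
        return rng.sample(list(self._queries), k)

    def __len__(self):
        return len(self._queries)
```

In a training loop built on this sketch, a query whose trajectory receives a reward of 0.0 would be added to the buffer, and every few steps a batch drawn with sample would be mixed back into the rollout queries. Running the trajectory-level check first means the per-turn comparison is skipped for itineraries that are already infeasible, which matches the efficiency motivation stated in the abstract.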

