TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

Abstract

Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face three challenges: (1) candidate points of interest (POIs) must be pruned while maintaining a high recall rate; (2) a single reasoning path restricts exploration of the feasible solution space; (3) simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct a spatially aware set of candidate POIs. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability to explore the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.
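To make the gating idea concrete, the following is a minimal sketch of one way a sigmoid gate could combine hard-constraint and soft-constraint rewards. The abstract does not specify the reward formulation, so the function name `gated_reward`, the additive form, the `threshold` and `sharpness` parameters, and the assumption that both scores lie in [0, 1] are all illustrative rather than taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_reward(hard_score: float, soft_score: float,
                 threshold: float = 0.9, sharpness: float = 20.0) -> float:
    """Combine hard- and soft-constraint scores with a sigmoid gate.

    Assumptions (hypothetical, not from the paper):
      - hard_score and soft_score are both in [0, 1];
      - the gate opens as hard_score approaches `threshold`;
      - `sharpness` controls how abruptly the gate opens.
    """
    # Gate is near 0 when hard constraints are poorly satisfied and
    # rises smoothly as hard_score passes the threshold.
    gate = sigmoid(sharpness * (hard_score - threshold))
    # The soft-constraint term contributes only in proportion to the gate.
    return hard_score + gate * soft_score


if __name__ == "__main__":
    # Same soft score, different hard-constraint satisfaction levels.
    for hard in (0.2, 0.8, 1.0):
        print(hard, round(gated_reward(hard, soft_score=1.0), 4))
```

Under these assumptions, a plan that violates most hard constraints gains almost nothing from a high soft-constraint score, while a plan that meets them is rewarded for preference alignment as well. A smooth gate, rather than a hard cutoff, keeps the reward continuous in the hard-constraint score.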