TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Abstract

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.
