TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
February 2, 2026
Authors: Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng
cs.AI
Abstract
As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce TRIP-Bench, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.