Failing to Explore: Language Models on Interactive Tasks
January 29, 2026
Authors: Mahdi JafariRaviz, Keivan Rezaei, Arshia Soltani Moakhar, Zahra Sodagar, Yize Cheng, Soheil Feizi
cs.AI
Abstract
We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore-exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.
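The abstract names simple explore-exploit heuristics as baselines without detailing them; as a hedged illustration, a minimal sketch of one such heuristic (explore-then-exploit on a generic budget-limited bandit) might look as follows. All names here (pull, explore_then_exploit, the Bernoulli toy environment) are assumptions for illustration, not the authors' implementation.

```python
import random

def explore_then_exploit(pull, n_arms, budget, explore_frac=0.5):
    """Hypothetical explore-then-exploit heuristic under a fixed budget.

    Spends explore_frac of the budget cycling uniformly over the arms,
    then commits the remaining interactions to the arm with the best
    observed mean reward. pull(arm) is assumed to return a scalar reward
    for one interaction with the environment.
    """
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    collected = 0.0

    # Exploration phase: round-robin over arms to estimate their means.
    n_explore = int(budget * explore_frac)
    for t in range(n_explore):
        arm = t % n_arms
        r = pull(arm)
        totals[arm] += r
        counts[arm] += 1
        collected += r

    # Exploitation phase: commit to the empirically best arm.
    means = [totals[a] / counts[a] if counts[a] else float("-inf")
             for a in range(n_arms)]
    best = max(range(n_arms), key=means.__getitem__)
    for _ in range(budget - n_explore):
        collected += pull(best)

    return collected

# Toy usage: three Bernoulli arms with hidden success probabilities.
probs = [0.2, 0.5, 0.8]
pull = lambda a: float(random.random() < probs[a])
print(explore_then_exploit(pull, n_arms=len(probs), budget=60))
```

Under this reading, the paper's first intervention, splitting a fixed budget into k parallel executions, would amount to running such a routine k times with budget/k interactions each and keeping the best run.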