QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Abstract

Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.
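
The abstract describes two ideas: estimating stepwise Q-values over a reasoning tree, and using those values to guide generation one step at a time. The toy sketch below illustrates only the general shape of that idea and is not the authors' implementation. It makes several assumptions that do not come from the abstract: rewards are attached to leaf nodes only, Q-values are obtained with a Bellman-style max backup, the discount `gamma` is a free parameter, and the names `Node`, `backup_q`, `q_guided_rollout`, and `build_toy_tree` are hypothetical. The guided rollout here reads the backed-up tree values directly, which stands in for whatever learned scorer a real system would use at inference time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One step in a toy reasoning tree (hypothetical structure)."""
    action: str
    reward: float = 0.0  # assumed: outcome reward, nonzero only at leaves
    children: List["Node"] = field(default_factory=list)
    q: Optional[float] = None  # filled in by backup_q


def backup_q(node: Node, gamma: float = 1.0) -> float:
    """Assumed Bellman-style backup: Q = reward + gamma * max child Q."""
    if not node.children:
        node.q = node.reward
    else:
        best_child = max(backup_q(child, gamma) for child in node.children)
        node.q = node.reward + gamma * best_child
    return node.q


def q_guided_rollout(root: Node) -> List[str]:
    """Greedy stepwise search: at each step take the child with the highest Q."""
    path: List[str] = []
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.q)
        path.append(node.action)
    return path


def build_toy_tree() -> Node:
    """Made-up example tree; actions and rewards are illustrative only."""
    return Node("start", children=[
        Node("search", children=[
            Node("click A", reward=0.2),
            Node("click B", reward=1.0),
        ]),
        Node("buy now", reward=0.5),
    ])


if __name__ == "__main__":
    tree = build_toy_tree()
    backup_q(tree, gamma=0.9)
    print(q_guided_rollout(tree))
```

In this toy tree, "buy now" pays off immediately but less than the best outcome reachable through "search", so the stepwise Q-values steer the rollout toward the longer path. That is the sense in which per-step values can capture long-term value that a single outcome reward on the whole trajectory does not expose at intermediate steps.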
