Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

June 10, 2024
Authors: Joongwon Kim, Bhargavi Paranjape, Tushar Khot, Hannaneh Hajishirzi
cs.AI

Abstract

Language agents perform complex tasks by using tools to execute each step precisely. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering. We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space to address a diverse set of complex tasks involving numerical, tabular, and knowledge-based reasoning. Husky iterates between two stages: 1) generating the next action to take towards solving a given task and 2) executing the action using expert models and updating the current solution state. We identify a thorough ontology of actions for addressing complex tasks and curate high-quality data to train expert models for executing these actions. Our experiments show that Husky outperforms prior language agents across 14 evaluation datasets. Moreover, we introduce HuskyQA, a new evaluation set which stress tests language agents for mixed-tool reasoning, with a focus on retrieving missing knowledge and performing numerical reasoning. Despite using 7B models, Husky matches or even exceeds frontier LMs such as GPT-4 on these tasks, showcasing the efficacy of our holistic approach in addressing complex reasoning problems. Our code and models are available at https://github.com/agent-husky/Husky-v1.
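
The two-stage iteration described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `SolutionState`, `run_agent`, `toy_generate_action`, the tool labels `"knowledge"`, `"math"`, and `"finish"`, and the `max_steps` cap are all hypothetical, and the learned action generator and expert models are replaced by hard-coded toy functions so the loop can run on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SolutionState:
    """Hypothetical container for the task and the steps executed so far."""
    task: str
    history: List[Tuple[str, str, str]] = field(default_factory=list)  # (tool, step, output)


# Assumed interfaces: a generator proposes (tool, step); an expert executes a step.
ActionGenerator = Callable[[SolutionState], Tuple[str, str]]
Expert = Callable[[str, SolutionState], str]


def run_agent(
    task: str,
    generate_action: ActionGenerator,
    experts: Dict[str, Expert],
    max_steps: int = 10,
) -> Optional[str]:
    """Alternate between action generation and action execution."""
    state = SolutionState(task)
    for _ in range(max_steps):
        # Stage 1: generate the next action towards solving the task.
        tool, step = generate_action(state)
        if tool == "finish":
            return step  # the step text carries the final answer
        # Stage 2: execute the action with the matching expert, then update the state.
        output = experts[tool](step, state)
        state.history.append((tool, step, output))
    return None  # no answer within the step budget


# Toy stand-ins for the learned models (assumptions for illustration only).
def toy_generate_action(state: SolutionState) -> Tuple[str, str]:
    if len(state.history) == 0:
        return ("knowledge", "legs per spider")
    if len(state.history) == 1:
        legs = state.history[0][2]
        return ("math", f"3 * {legs}")
    return ("finish", state.history[-1][2])


def toy_knowledge_expert(step: str, state: SolutionState) -> str:
    facts = {"legs per spider": "8"}
    return facts[step]


def toy_math_expert(step: str, state: SolutionState) -> str:
    left, right = step.split(" * ")
    return str(int(left) * int(right))


TOY_EXPERTS: Dict[str, Expert] = {
    "knowledge": toy_knowledge_expert,
    "math": toy_math_expert,
}

if __name__ == "__main__":
    answer = run_agent(
        "How many legs do 3 spiders have in total?",
        toy_generate_action,
        TOY_EXPERTS,
    )
    print(answer)
```

In this toy run the agent first retrieves a missing fact, then performs a numerical step on it, then finishes, which mirrors the mixed-tool pattern the abstract attributes to HuskyQA. In a real system the generator and each expert would presumably be model calls rather than lookup tables.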
