PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
October 17, 2025
Authors: Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu
cs.AI
Abstract
Tool-augmented large language models (LLMs) are emerging as deep research
agents, systems that decompose complex queries, retrieve external evidence, and
synthesize grounded responses. Yet current agents remain limited by shallow
retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce
PokeeResearch-7B, a 7B-parameter deep research agent built under a unified
reinforcement learning framework for robustness, alignment, and scalability.
PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from
AI Feedback (RLAIF) framework to optimize policies using LLM-based reward
signals that capture factual accuracy, citation faithfulness, and instruction
adherence. A chain-of-thought-driven multi-call reasoning scaffold further
enhances robustness through self-verification and adaptive recovery from tool
failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves
state-of-the-art performance among 7B-scale deep research agents. This
highlights that careful reinforcement learning and reasoning design can produce
efficient, resilient, and research-grade AI agents. The model and inference
code are open-sourced under the MIT license at
https://github.com/Pokee-AI/PokeeResearchOSS.
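
The abstract describes an annotation-free RLAIF setup in which LLM-based reward signals score factual accuracy, citation faithfulness, and instruction adherence. The snippet below is a minimal sketch of what such a judge-based reward could look like; it is not the authors' implementation, and `judge_llm`, the prompt format, and the mean-score aggregation are assumptions for illustration only.

```python
# Hypothetical sketch of an RLAIF-style reward: an LLM judge scores a candidate
# response on three criteria and the mean score is used as the scalar reward.
# `judge_llm` is a stand-in for a real judge-model backend (assumption).
import json

CRITERIA = ("factual_accuracy", "citation_faithfulness", "instruction_adherence")

def judge_llm(prompt: str) -> str:
    """Hypothetical judge-model call; replace with an actual LLM backend."""
    raise NotImplementedError

def rlaif_reward(question: str, response: str, evidence: list[str]) -> float:
    prompt = (
        "Score the response on each criterion from 0 to 1 and reply as JSON "
        f"with keys {list(CRITERIA)}.\n"
        f"Question: {question}\nEvidence: {evidence}\nResponse: {response}"
    )
    scores = json.loads(judge_llm(prompt))
    # Scalar reward: mean of the per-criterion scores (an assumed aggregation).
    return sum(float(scores[c]) for c in CRITERIA) / len(CRITERIA)
```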
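
The abstract also mentions a chain-of-thought-driven multi-call reasoning scaffold with self-verification and adaptive recovery from tool failures. The following is a minimal sketch of that control flow under stated assumptions: `call_llm` and `web_search` are hypothetical stand-ins, and the call budget, retry policy, and prompts are illustrative rather than the released PokeeResearch code.

```python
# Hypothetical sketch of a multi-call research loop with self-verification and
# adaptive recovery from tool failures. Not the released implementation.
from typing import Optional

MAX_CALLS = 8          # per-query budget of reasoning/tool steps (assumed)
MAX_TOOL_RETRIES = 2   # adaptive recovery: retries per failed tool call (assumed)

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with the actual model backend."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical search tool; may raise on transient failures."""
    raise NotImplementedError

def research(question: str) -> Optional[str]:
    evidence: list[str] = []
    for _ in range(MAX_CALLS):
        # Chain-of-thought step: pick the next sub-query or decide to answer.
        step = call_llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Think step by step, then output the next search query, or ANSWER if ready."
        )
        if step.strip() == "ANSWER":
            draft = call_llm(
                f"Answer the question using only the evidence.\n{question}\n{evidence}"
            )
            # Self-verification: check the draft against the gathered evidence.
            verdict = call_llm(
                f"Is this answer fully supported by the evidence? Reply YES or NO.\n"
                f"{draft}\n{evidence}"
            )
            if verdict.strip().upper().startswith("YES"):
                return draft
            continue  # verification failed: keep researching
        # Adaptive recovery: retry (and rephrase) when the tool call fails.
        query = step
        for attempt in range(MAX_TOOL_RETRIES + 1):
            try:
                evidence.append(web_search(query))
                break
            except Exception:
                if attempt == MAX_TOOL_RETRIES:
                    evidence.append(f"[search failed for: {query}]")
                else:
                    query = call_llm(f"The search failed. Rephrase this query: {query}")
    return None  # budget exhausted without a verified answer
```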