PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
October 17, 2025
Authors: Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu
cs.AI
Abstract
Tool-augmented large language models (LLMs) are emerging as deep research
agents, systems that decompose complex queries, retrieve external evidence, and
synthesize grounded responses. Yet current agents remain limited by shallow
retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce
PokeeResearch-7B, a 7B-parameter deep research agent built under a unified
reinforcement learning framework for robustness, alignment, and scalability.
PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from
AI Feedback (RLAIF) framework to optimize policies using LLM-based reward
signals that capture factual accuracy, citation faithfulness, and instruction
adherence. A chain-of-thought-driven multi-call reasoning scaffold further
enhances robustness through self-verification and adaptive recovery from tool
failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves
state-of-the-art performance among 7B-scale deep research agents. This
highlights that careful reinforcement learning and reasoning design can produce
efficient, resilient, and research-grade AI agents. The model and inference
code are open-sourced under the MIT license at
https://github.com/Pokee-AI/PokeeResearchOSS.
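
A minimal sketch of what an annotation-free RLAIF reward along the three axes named in the abstract (factual accuracy, citation faithfulness, instruction adherence) might look like. The JudgeScores structure, the call_llm_judge helper, and the equal weighting are illustrative assumptions, not the paper's released recipe.

```python
from dataclasses import dataclass


@dataclass
class JudgeScores:
    factual_accuracy: float       # 0-1: is the answer supported by the evidence?
    citation_faithfulness: float  # 0-1: do the cited sources back the claims?
    instruction_adherence: float  # 0-1: does the answer follow the query's instructions?


def call_llm_judge(query: str, answer: str, evidence: list[str]) -> JudgeScores:
    """Placeholder judge (assumption): in practice this would prompt a separate
    LLM to grade the answer and parse its structured output. Fixed scores keep
    the sketch runnable."""
    return JudgeScores(factual_accuracy=0.9, citation_faithfulness=0.8, instruction_adherence=1.0)


def rlaif_reward(query: str, answer: str, evidence: list[str],
                 weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Scalar reward for policy optimization: a weighted sum of judge scores,
    requiring no human annotation."""
    s = call_llm_judge(query, answer, evidence)
    w_fact, w_cite, w_instr = weights
    return (w_fact * s.factual_accuracy
            + w_cite * s.citation_faithfulness
            + w_instr * s.instruction_adherence)


if __name__ == "__main__":
    print(rlaif_reward("Who wrote 'On Computable Numbers'?",
                       "Alan Turing [1]", ["[1] Turing, 1936"]))
```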
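
A minimal sketch, under stated assumptions, of the kind of multi-call reasoning loop the abstract describes: the agent alternates tool calls and reasoning, self-verifies its answer before finishing, and recovers from tool failures instead of crashing. The generate, web_search, and verify helpers are placeholders, not the PokeeResearch-7B implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder policy-LLM call (assumption): returns the next action."""
    return "FINISH: example answer"


def web_search(query: str) -> str:
    """Placeholder search tool (assumption); may raise on network or parsing errors."""
    return "example evidence"


def verify(question: str, answer: str, evidence: list[str]) -> bool:
    """Placeholder self-verification step (assumption): check the draft answer
    against the gathered evidence before accepting it."""
    return True


def research(question: str, max_calls: int = 8) -> str:
    evidence: list[str] = []
    for _ in range(max_calls):
        action = generate(f"Question: {question}\nEvidence: {evidence}\nNext step?")
        if action.startswith("FINISH:"):
            answer = action[len("FINISH:"):].strip()
            # Self-verification: only accept the answer if it checks out;
            # otherwise keep the feedback in context and continue reasoning.
            if verify(question, answer, evidence):
                return answer
            evidence.append("verification failed; revise the answer")
        else:
            try:
                evidence.append(web_search(action))
            except Exception as err:
                # Adaptive recovery: surface the tool failure to the model
                # so it can reformulate the query on the next call.
                evidence.append(f"tool error: {err}; try a different query")
    return "unable to produce a verified answer within the call budget"
```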