PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
October 17, 2025
Authors: Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu
cs.AI
Abstract
Tool-augmented large language models (LLMs) are emerging as deep research
agents, systems that decompose complex queries, retrieve external evidence, and
synthesize grounded responses. Yet current agents remain limited by shallow
retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce
PokeeResearch-7B, a 7B-parameter deep research agent built under a unified
reinforcement learning framework for robustness, alignment, and scalability.
PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from
AI Feedback (RLAIF) framework to optimize policies using LLM-based reward
signals that capture factual accuracy, citation faithfulness, and instruction
adherence. A chain-of-thought-driven multi-call reasoning scaffold further
enhances robustness through self-verification and adaptive recovery from tool
failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves
state-of-the-art performance among 7B-scale deep research agents. This
highlights that careful reinforcement learning and reasoning design can produce
efficient, resilient, and research-grade AI agents. The model and inference
code are open-sourced under the MIT license at
https://github.com/Pokee-AI/PokeeResearchOSS.
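
The abstract describes an annotation-free RLAIF setup in which LLM-based reward signals score factual accuracy, citation faithfulness, and instruction adherence. The snippet below is a minimal sketch of what such a judge-based reward could look like; it is not the authors' implementation, and `judge_llm`, the prompt format, and the mean-score aggregation are assumptions for illustration only.

```python
# Hypothetical sketch of an RLAIF-style reward: an LLM judge scores a candidate
# response on three criteria and the mean score is used as the scalar reward.
# `judge_llm` is a stand-in for a real judge-model backend (assumption).
import json

CRITERIA = ("factual_accuracy", "citation_faithfulness", "instruction_adherence")

def judge_llm(prompt: str) -> str:
    """Hypothetical judge-model call; replace with an actual LLM backend."""
    raise NotImplementedError

def rlaif_reward(question: str, response: str, evidence: list[str]) -> float:
    prompt = (
        "Score the response on each criterion from 0 to 1 and reply as JSON "
        f"with keys {list(CRITERIA)}.\n"
        f"Question: {question}\nEvidence: {evidence}\nResponse: {response}"
    )
    scores = json.loads(judge_llm(prompt))
    # Scalar reward: mean of the per-criterion scores (an assumed aggregation).
    return sum(float(scores[c]) for c in CRITERIA) / len(CRITERIA)
```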
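
The abstract also mentions a chain-of-thought-driven multi-call reasoning scaffold with self-verification and adaptive recovery from tool failures. The following is a minimal sketch of that control flow under stated assumptions: `call_llm` and `web_search` are hypothetical stand-ins, and the call budget, retry policy, and prompts are illustrative rather than the released PokeeResearch code.

```python
# Hypothetical sketch of a multi-call research loop with self-verification and
# adaptive recovery from tool failures. Not the released implementation.
from typing import Optional

MAX_CALLS = 8          # per-query budget of reasoning/tool steps (assumed)
MAX_TOOL_RETRIES = 2   # adaptive recovery: retries per failed tool call (assumed)

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with the actual model backend."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical search tool; may raise on transient failures."""
    raise NotImplementedError

def research(question: str) -> Optional[str]:
    evidence: list[str] = []
    for _ in range(MAX_CALLS):
        # Chain-of-thought step: pick the next sub-query or decide to answer.
        step = call_llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Think step by step, then output the next search query, or ANSWER if ready."
        )
        if step.strip() == "ANSWER":
            draft = call_llm(
                f"Answer the question using only the evidence.\n{question}\n{evidence}"
            )
            # Self-verification: check the draft against the gathered evidence.
            verdict = call_llm(
                f"Is this answer fully supported by the evidence? Reply YES or NO.\n"
                f"{draft}\n{evidence}"
            )
            if verdict.strip().upper().startswith("YES"):
                return draft
            continue  # verification failed: keep researching
        # Adaptive recovery: retry (and rephrase) when the tool call fails.
        query = step
        for attempt in range(MAX_TOOL_RETRIES + 1):
            try:
                evidence.append(web_search(query))
                break
            except Exception:
                if attempt == MAX_TOOL_RETRIES:
                    evidence.append(f"[search failed for: {query}]")
                else:
                    query = call_llm(f"The search failed. Rephrase this query: {query}")
    return None  # budget exhausted without a verified answer
```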