

Open Data Synthesis For Deep Research

August 30, 2025
Authors: Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu
cs.AI

Abstract

Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning and knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blur intermediate nodes into valid sub-problems, and convert these trees into natural-language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. Our code and datasets are available at https://github.com/VectorSpaceLab/InfoSeek.
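The abstract does not reproduce the formal HCSP definition. One plausible reading, sketched here with assumed notation (not the paper's): the answer at each node of the hierarchy is the unique entity satisfying that node's local constraints together with relational constraints tying it to the answers of its children.

```latex
% Assumed notation: C(v) = local constraints at node v,
% child(v) = its sub-problems, r_u = the relation linking v to child u.
\mathrm{ans}(v) \;=\; \text{the unique } x \ \text{s.t.}\
\bigwedge_{c \in C(v)} c(x) \;\wedge\; \bigwedge_{u \in \mathrm{child}(v)} r_u\bigl(x,\, \mathrm{ans}(u)\bigr)
```

Under this reading, a flat CSP is the depth-one special case and a multi-hop chain is the single-child special case, which is consistent with the paper's claim that HCSPs differ fundamentally from both.

The synthesis loop the abstract describes (build a Research Tree, blur intermediate nodes, verbalize the root) follows the same recursion. Below is a minimal Python sketch; the class and function names and the "is tied to" relation wording are invented for illustration, not taken from the InfoSeek codebase.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchNode:
    """A node in a Research Tree: an entity plus the textual constraints
    that identify it, with children acting as nested sub-problems."""
    entity: str
    constraints: list[str] = field(default_factory=list)
    children: list["ResearchNode"] = field(default_factory=list)

def blur(node: ResearchNode) -> str:
    """Hide a node's entity behind its constraints; children are blurred
    recursively, so resolving the description forces a traversal of the
    full hierarchy rather than a single lookup."""
    clues = list(node.constraints)
    for child in node.children:
        clues.append("is tied to " + blur(child))  # placeholder relation wording
    return "the entity that " + " and ".join(clues)

def to_question(root: ResearchNode) -> str:
    """Verbalize the whole tree as one question whose verifiable answer
    is the root entity."""
    return f"Which is {blur(root)}?"

# Toy depth-2 example (contents invented for illustration).
leaf = ResearchNode("Pierre Curie", ["shared the 1903 Nobel Prize in Physics"])
root = ResearchNode("Marie Curie", ["discovered radium"], children=[leaf])
print(to_question(root))
# Which is the entity that discovered radium and is tied to the entity
# that shared the 1903 Nobel Prize in Physics?
```

Because every node retains its entity and constraints, the meta-information the abstract highlights (intermediate steps, retrieval labels) falls out of the structure as a byproduct, which is what enables the compound reward design and trajectory-level exploration mentioned above.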