Open Data Synthesis For Deep Research
August 30, 2025
作者: Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu
cs.AI
Abstract
Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations.
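As a rough sketch of this formalization (illustrative notation, not necessarily the paper's), a Deep Research task with a verifiable answer can be written as finding the root value of a constraint tree:

\[ \text{find } x_r \quad \text{s.t.} \quad C_v\bigl(x_v;\ \{x_u : u \in \mathrm{ch}(v)\}\bigr) = 1 \quad \forall v \in V, \]

where \(T=(V,E)\) is the tree rooted at \(r\), \(\mathrm{ch}(v)\) denotes the children of node \(v\), and leaf values are grounded directly in retrieved evidence. Each internal value becomes identifiable only after its own sub-constraints are resolved, which is what separates an HCSP from a flat CSP over a single shared variable set.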
However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning or knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blur intermediate nodes into valid sub-problems, and convert these trees into natural language questions that require traversing the full hierarchy.
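A minimal sketch of how such a Research Tree might be built, assuming hypothetical planner/browser agent interfaces (propose_subclaims, verify, and blur are illustrative names, not the repository's actual API):

```python
# Hypothetical sketch of InfoSeek-style Research Tree construction.
# The dual-agent interfaces here are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Node:
    claim: str                       # a verifiable fact grounded in a webpage
    question: str = ""               # the "blurred" sub-question replacing it
    children: list["Node"] = field(default_factory=list)

def build_tree(claim: str, planner, browser, depth: int, fanout: int) -> Node:
    """Recursively expand a claim into a tree of webpage-grounded constraints."""
    node = Node(claim=claim)
    if depth == 0:
        return node
    # The planner agent proposes candidate sub-claims; the browser agent keeps
    # only those it can verify against retrieved pages, so every edge stays
    # grounded in real evidence.
    for sub in planner.propose_subclaims(claim, n=fanout):
        if browser.verify(sub):
            node.children.append(build_tree(sub, planner, browser, depth - 1, fanout))
    # Blur the node: rewrite the explicit claim as a sub-question whose answer
    # can only be pinned down by jointly satisfying the children.
    node.question = planner.blur(claim, [c.claim for c in node.children])
    return node
```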
It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling.
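Rejection sampling here can be read as the standard recipe: draw several candidate trajectories per question and keep only those whose final answer verifies against the gold label. A minimal sketch, with policy and verifier as assumed interfaces rather than the repository's API:

```python
# Illustrative rejection-sampling loop for trajectory curation.
def collect_trajectories(questions, policy, verifier, k: int = 8):
    kept = []
    for q in questions:
        # Sample up to k reasoning/search trajectories for this question and
        # keep the first one whose final answer the verifier accepts.
        for _ in range(k):
            traj = policy.sample_trajectory(q.text)
            if verifier.is_correct(traj.final_answer, q.gold_answer):
                kept.append((q, traj))
                break  # one verified trajectory per question suffices for SFT
    return kept
```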
Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini 2.5 Flash), while achieving performance comparable to stronger APIs (e.g., Gemini 2.5 Pro).
By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration.
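For instance, a compound reward could combine the outcome signal with credit for reproducing stored intermediate answers and for retrieving labeled evidence documents. The sketch below is an assumption about how such a reward might be wired, not the paper's exact design; the weights and attribute names are illustrative:

```python
# Hedged sketch of a compound reward enabled by InfoSeek's meta-information.
def compound_reward(traj, meta, w_ans=1.0, w_step=0.3, w_ret=0.2):
    # Outcome reward: exact match against the gold answer.
    r_ans = float(traj.final_answer == meta.gold_answer)
    # Intermediate-step reward: fraction of stored sub-answers the trajectory
    # reproduced, recoverable because InfoSeek keeps every tree node.
    r_step = sum(s in traj.intermediate_answers for s in meta.sub_answers) \
        / max(len(meta.sub_answers), 1)
    # Retrieval reward: overlap between retrieved documents and labeled evidence.
    r_ret = len(set(traj.retrieved_ids) & set(meta.retrieval_labels)) \
        / max(len(meta.retrieval_labels), 1)
    return w_ans * r_ans + w_step * r_step + w_ret * r_ret
```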
Our code and datasets are available at https://github.com/VectorSpaceLab/InfoSeek.