Dual-View Training for Instruction-Following Information Retrieval
April 20, 2026
Authors: Qingcheng Zeng, Puxuan Yu, Aman Mehta, Fuheng Zhao, Rajhans Samdani
cs.AI
Abstract
Instruction-following information retrieval (IF-IR) studies retrieval systems that must not only find documents relevant to a query but also obey explicit user constraints such as required attributes, exclusions, or output preferences. However, most retrievers are trained primarily for semantic relevance and often fail to distinguish documents that merely match the topic from those that also satisfy the instruction. We propose a dual-view data synthesis strategy based on polarity reversal: given a query, a document that is relevant under the instruction, and a hard negative that matches the query but violates the instruction, we prompt an LLM to generate a complementary instruction under which the two documents swap relevance labels. Because the same document pair appears under both instructions with opposite labels, the training signal forces the retriever to re-evaluate the same candidate set in light of the instruction rather than rely on fixed topical cues. On a 305M-parameter encoder, our method improves performance on the FollowIR benchmark by 45%, surpassing general-purpose embedding models of comparable or larger scale. Through head-to-head comparisons at matched data budgets, we further show that data diversity and instruction supervision play complementary roles: the former preserves general retrieval quality, while the latter improves instruction sensitivity. These results highlight the value of targeted data synthesis for building retrieval systems that are both broadly capable and instruction-aware.
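
The polarity-reversal step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` record, the prompt wording in `build_prompt`, and the `generate` callable (standing in for whatever LLM is used) are all assumptions introduced here for illustration. The sketch only shows how one labeled triple could yield two training views whose relevance labels are swapped.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    """One training view (hypothetical record layout)."""
    query: str
    instruction: str
    positive: str  # document relevant under this instruction
    negative: str  # matches the query but violates this instruction


def build_prompt(ex: Example) -> str:
    """Hypothetical prompt; the abstract does not give the actual wording."""
    return (
        "Query: " + ex.query + "\n"
        "Original instruction: " + ex.instruction + "\n"
        "Document A (relevant under the original instruction): " + ex.positive + "\n"
        "Document B (not relevant under the original instruction): " + ex.negative + "\n"
        "Write a new instruction for the same query under which Document B "
        "is relevant and Document A is not."
    )


def dual_view(ex: Example, generate: Callable[[str], str]) -> List[Example]:
    """Return the original view plus its polarity-reversed counterpart.

    `generate` is an assumed text-in, text-out LLM wrapper supplied by the caller.
    """
    complementary = generate(build_prompt(ex)).strip()
    reversed_view = Example(
        query=ex.query,
        instruction=complementary,
        positive=ex.negative,  # labels swap under the complementary instruction
        negative=ex.positive,
    )
    return [ex, reversed_view]
```

With a stub such as `generate = lambda prompt: "Only include documents that ..."`, `dual_view` returns two views over the same query and document pair. A contrastive trainer that consumes both views cannot score well by ranking on topical match alone, since the same two documents appear once in each role.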