OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

May 30, 2026
Authors: Maksim Savkin, Mikhail Goncharov, Alexander Gambashidze, Alla Chepurova, Dmitrii Tarasov, Nikita Andriianov, Daria Pugacheva, Vasily Konovalov, Andrey Galichin, Ivan Oseledets
cs.AI

Abstract

Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2 to 6 times their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.
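
The abstract says the models cite sources through literal quotes from the context. One way a downstream user might check that property is sketched below. This is an illustration only: the abstract does not specify the output format, so the `Citation` structure, its field names, and the whitespace normalization are all assumptions, not the OCC-RAG schema.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical structure; the real OCC-RAG trace format is not given in the abstract.
    passage_id: int  # index into the list of supplied passages
    quote: str       # text the answer claims to copy verbatim from that passage


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break exact matching."""
    return " ".join(text.split())


def unsupported_citations(passages: list[str], citations: list[Citation]) -> list[Citation]:
    """Return the citations whose quote is not a literal substring of the cited passage."""
    bad = []
    for c in citations:
        quote = normalize(c.quote)
        in_range = 0 <= c.passage_id < len(passages)
        # An empty quote or an out-of-range passage index counts as unsupported.
        if not in_range or not quote or quote not in normalize(passages[c.passage_id]):
            bad.append(c)
    return bad


passages = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital\nof Poland.",
]
citations = [
    Citation(0, "born in Warsaw"),
    Citation(1, "capital of Poland"),
    Citation(1, "largest city in Poland"),  # not present in passage 1
]
flagged = unsupported_citations(passages, citations)
```

Here only the third citation is flagged, because its quote does not appear in the passage it points to. Exact substring matching is a deliberately strict choice for this sketch; a looser check (for example, case-insensitive matching) would be a different design decision.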