OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
May 30, 2026
Authors: Maksim Savkin, Mikhail Goncharov, Alexander Gambashidze, Alla Chepurova, Dmitrii Tarasov, Nikita Andriianov, Daria Pugacheva, Vasily Konovalov, Andrey Galichin, Ivan Oseledets
cs.AI
Abstract
Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2-6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.
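The abstract states that citations are grounded in literal quotes from the context, which suggests a simple downstream check: each quoted span should appear verbatim in the passage it points to. The sketch below illustrates that idea only. The abstract does not specify the model's output format, so the `Citation` schema, the `passage_id` and `quote` field names, and the whitespace normalization are all assumptions made for illustration, not the authors' format or evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical schema: the real OCC-RAG output format is not given in the abstract.
    passage_id: int  # index into the list of supplied passages
    quote: str       # text the answer claims to copy verbatim from that passage


def normalize(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not break matching."""
    return " ".join(text.split())


def unsupported_citations(passages: list[str], citations: list[Citation]) -> list[Citation]:
    """Return the citations whose quote is not a literal substring of its passage."""
    bad = []
    for c in citations:
        in_range = 0 <= c.passage_id < len(passages)
        # An out-of-range passage index or an empty quote counts as unsupported.
        if (
            not in_range
            or not c.quote.strip()
            or normalize(c.quote) not in normalize(passages[c.passage_id])
        ):
            bad.append(c)
    return bad


passages = [
    "Marie Curie was born in Warsaw in 1867.",
    "Warsaw is the capital of Poland.",
]
citations = [
    Citation(0, "born in Warsaw"),
    Citation(1, "Warsaw is the capital of Poland"),
    Citation(1, "Warsaw is the largest city in Poland"),  # not present in passage 1
]
bad = unsupported_citations(passages, citations)
print([c.quote for c in bad])  # prints ['Warsaw is the largest city in Poland']
```

A check of this kind only confirms that a quote exists in the cited passage; it says nothing about whether the quote actually supports the answer, which is what faithfulness benchmarks are meant to measure.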