OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

Abstract

Recent progress in language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, which is optimized for faithful question answering (QA) grounded in the provided context. This task aligns directly with the OCC design approach: it requires multi-hop reasoning over the supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples that target multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces whose source citations are grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2 to 6 times their size on benchmarks for multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un).
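To make the idea of citations "grounded in literal quotes from the context" concrete, the following is a minimal sketch of how such grounding could be checked after generation. It is not the authors' implementation: the `Citation` structure, its field names, and the whitespace normalization are assumptions introduced purely for illustration, since the abstract does not specify the trace format.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical structure: the abstract does not define the trace format.
    passage_id: int  # index into the list of supplied passages
    quote: str       # text the model claims to have copied verbatim


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def ungrounded_citations(
    passages: list[str], citations: list[Citation]
) -> list[Citation]:
    """Return citations whose quote is not a literal substring of the cited passage."""
    bad = []
    for c in citations:
        in_range = 0 <= c.passage_id < len(passages)
        quote = normalize(c.quote)
        # An empty quote or an out-of-range passage index counts as ungrounded.
        if not in_range or not quote or quote not in normalize(passages[c.passage_id]):
            bad.append(c)
    return bad


passages = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
]
citations = [
    Citation(0, "born in Warsaw"),
    Citation(1, "capital of  Poland"),      # extra space is tolerated
    Citation(1, "largest city in Poland"),  # not present in passage 1
]
failed = ungrounded_citations(passages, citations)
```

In this toy example only the third citation is flagged, because its quote never appears in the cited passage. A check of this kind verifies only that a quote is literal; it says nothing about whether the reasoning built on the quote is correct.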