OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

Abstract

Recent progress in language model development has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task aligns directly with the OCC design approach: it requires multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples that target multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2 to 6 times their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.
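To make the idea of citations "grounded in literal quotes from the context" concrete, the following is a minimal sketch of how such quotes could be checked against the supplied passages. It is an illustration only, not the OCC-RAG implementation or output format: the `Citation` class, its `passage_id` and `quote` fields, and the `unsupported_citations` helper are all hypothetical names introduced here, and the whitespace normalization is an assumed design choice.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical structure; the actual OCC-RAG trace format is not specified here.
    passage_id: int  # index into the list of supplied passages
    quote: str       # text the model claims to quote verbatim


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def unsupported_citations(passages: list[str], citations: list[Citation]) -> list[Citation]:
    """Return citations whose quote is not a literal substring of the cited passage."""
    bad = []
    for c in citations:
        quote = normalize(c.quote)
        in_range = 0 <= c.passage_id < len(passages)
        # An empty quote, an out-of-range passage index, or a quote that does
        # not appear verbatim in the cited passage all count as unsupported.
        if not quote or not in_range or quote not in normalize(passages[c.passage_id]):
            bad.append(c)
    return bad


passages = [
    "Marie Curie was born in Warsaw in 1867.",
    "Warsaw is the capital of Poland.",
]
citations = [
    Citation(passage_id=0, quote="born in Warsaw"),
    Citation(passage_id=1, quote="Warsaw is the capital of France"),
]
flagged = unsupported_citations(passages, citations)
```

Here only the second citation is flagged, because its quote does not occur verbatim in passage 1. A check of this kind verifies that a quote exists in the context; it does not by itself verify that the quote supports the answer.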