Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Abstract

Multiple-choice question answering (MCQA) benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a paired-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive: assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.
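To make the reported quantities concrete, the following is a minimal sketch of how a macro accuracy difference and a paired-bootstrap percentile interval could be computed from per-row outcomes. It is an illustration under stated assumptions, not the CGR implementation: the row layout `(group, direct_correct, assisted_correct)`, the function names `macro_difference` and `paired_bootstrap_interval`, the choice of rows as the resampling unit, the grouping used for the macro average, the 95% level, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Tuple

# Hypothetical row layout (an assumption, not the CGR schema):
# (group label, direct answer correct?, assisted answer correct?)
Row = Tuple[str, bool, bool]


def macro_difference(rows: List[Row]) -> float:
    """Unweighted mean over groups of (assisted - direct) accuracy, in points."""
    by_group = defaultdict(list)
    for group, direct_ok, assisted_ok in rows:
        by_group[group].append((direct_ok, assisted_ok))
    diffs = []
    for pairs in by_group.values():
        direct_acc = sum(d for d, _ in pairs) / len(pairs)
        assisted_acc = sum(a for _, a in pairs) / len(pairs)
        diffs.append(assisted_acc - direct_acc)
    return 100.0 * sum(diffs) / len(diffs)


def paired_bootstrap_interval(
    rows: List[Row],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for the macro difference.

    Rows are resampled with replacement. Each row carries both its direct
    and its assisted outcome, so the two channels stay paired.
    """
    rng = random.Random(seed)
    n = len(rows)
    stats = []
    for _ in range(n_resamples):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        stats.append(macro_difference(sample))
    stats.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


# Toy data, invented purely for illustration.
toy_rows: List[Row] = (
    [("task_a", False, True)] * 6
    + [("task_a", True, True)] * 4
    + [("task_b", True, True)] * 5
    + [("task_b", True, False)] * 1
    + [("task_b", False, False)] * 4
)

if __name__ == "__main__":
    print("macro difference (points):", macro_difference(toy_rows))
    print("paired-bootstrap interval:", paired_bootstrap_interval(toy_rows))
```

Keeping both outcomes on the same row is what makes the bootstrap paired: each resample compares the two inference modes on the same items, so item difficulty does not inflate the interval. A macro average also lets a single regressing group, such as a time-series task, pull the estimate down regardless of how many rows it contributes.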