Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Abstract

Multiple-choice question answering (MCQA) benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference narrows to +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, the observed regressions occur on Time-MQA, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.
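The headline quantities above (macro direct and assisted accuracy, a direct-signal gate, and a pair-bootstrap interval on their difference) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the CGR implementation: the row layout `(group, direct_correct, assisted_correct)`, the function names, reading the gate as "direct accuracy strictly above a threshold", resampling rows within each group, and the 95% percentile interval. The abstract does not specify the resampling unit or the confidence level.

```python
import random
from collections import defaultdict


def group_rows(rows):
    """Group (direct_correct, assisted_correct) pairs by a partition key.

    Each row is a hypothetical (group, direct_correct, assisted_correct) tuple.
    """
    groups = defaultdict(list)
    for key, direct_ok, assisted_ok in rows:
        groups[key].append((int(direct_ok), int(assisted_ok)))
    return dict(groups)


def direct_accuracy(pairs):
    """Fraction of rows in one group answered correctly by the direct channel."""
    return sum(d for d, _ in pairs) / len(pairs)


def apply_direct_gate(groups, threshold=0.0):
    """Keep groups whose direct accuracy strictly exceeds the threshold.

    Assumption: threshold=0.0 mimics a non-zero-baseline partition and
    threshold=0.30 mimics a stricter 30% direct-signal gate.
    """
    return {k: p for k, p in groups.items() if direct_accuracy(p) > threshold}


def macro_accuracies(groups):
    """Return (macro direct, macro assisted) as unweighted means over groups."""
    if not groups:
        raise ValueError("no groups left after gating")
    direct = [direct_accuracy(p) for p in groups.values()]
    assisted = [sum(a for _, a in p) / len(p) for p in groups.values()]
    return sum(direct) / len(direct), sum(assisted) / len(assisted)


def paired_bootstrap_ci(groups, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the macro (assisted - direct) difference.

    Rows are resampled with replacement inside each group, and each draw keeps
    the direct and assisted outcomes of the same row together (the pairing).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled = {k: rng.choices(p, k=len(p)) for k, p in groups.items()}
        direct, assisted = macro_accuracies(resampled)
        diffs.append(assisted - direct)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy rows, invented purely for illustration (not data from the paper).
toy_rows = (
    [("dataset_a", d, a) for d, a in [(1, 1), (0, 1), (0, 1), (0, 0)]]
    + [("dataset_b", d, a) for d, a in [(1, 1), (1, 1), (0, 1), (0, 1)]]
    + [("dataset_c", d, a) for d, a in [(0, 1), (0, 0)]]
)
nonzero = apply_direct_gate(group_rows(toy_rows), threshold=0.0)
macro_direct, macro_assisted = macro_accuracies(nonzero)
interval = paired_bootstrap_ci(nonzero)
print(macro_direct, macro_assisted, interval)
```

In this sketch the gate is applied once to the observed data, before resampling, so the interval describes variation within a fixed partition. Re-applying the gate inside each bootstrap draw would be a different design choice and would generally give a different interval.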