Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds
May 12, 2026
Authors: Prateek Biswas, Dhaval Patel, Vedant Khandelwal, Shuxin Lin, Amit Sheth
cs.AI
Abstract
Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab > 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.
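The macro difference and its paired bootstrap interval can be illustrated with a minimal sketch. The abstract does not state the resampling unit or the grouping used for macro averaging, so the choices here are assumptions: items are resampled with replacement within each group, each draw keeps an item's direct and assisted outcomes together, and the macro average is an unweighted mean over groups. The function names, group names, and toy data are hypothetical and are not taken from the CGR release.

```python
import random


def macro_diff(groups):
    """Macro-averaged (assisted - direct) accuracy, in percentage points.

    `groups` maps a group name to a list of (direct_correct, assisted_correct)
    pairs, one pair per item, each value being 0 or 1.
    """
    diffs = []
    for pairs in groups.values():
        direct_acc = sum(d for d, _ in pairs) / len(pairs)
        assisted_acc = sum(a for _, a in pairs) / len(pairs)
        diffs.append(100.0 * (assisted_acc - direct_acc))
    # Unweighted mean over groups (assumption: every group counts equally).
    return sum(diffs) / len(diffs)


def paired_bootstrap_ci(groups, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the macro difference.

    Each draw resamples whole (direct, assisted) pairs, so the pairing
    between the two conditions is preserved for every item.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resampled = {
            name: rng.choices(pairs, k=len(pairs))
            for name, pairs in groups.items()
        }
        stats.append(macro_diff(resampled))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data only; these are NOT results from the paper.
toy_groups = {
    "toy_a": [(0, 1), (1, 1), (0, 0), (0, 1)],
    "toy_b": [(1, 1), (0, 1), (1, 0), (1, 1)],
}

point = macro_diff(toy_groups)
lower, upper = paired_bootstrap_ci(toy_groups)
print(f"macro difference: {point:+.2f} points, interval: [{lower:.2f}, {upper:.2f}]")
```

Resampling pairs rather than resampling the two conditions independently keeps the item-level correlation between direct and assisted outcomes, which is what makes the interval a paired one. The reported headline figure follows the same arithmetic at the top level: 66.21 - 38.11 = 28.10 percentage points.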