T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Abstract

Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional, larger model acting as the verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks that require memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive (MMLU-Pro) tasks. Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
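The idea of delegating a memorization-heavy check to a code interpreter can be illustrated with a toy best-of-N selection. This is only a minimal sketch under stated assumptions, not the paper's implementation: the candidate format (`steps`, `score`, `answer`), the helper names `safe_eval`, `tool_verify`, and `select_best`, and the choice to filter with the tool before ranking by a model-assigned score are all hypothetical and introduced here for illustration.

```python
import ast
import operator

# Arithmetic operators the toy "interpreter" accepts (an illustrative subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def safe_eval(expr):
    """Evaluate a plain arithmetic expression by walking its AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval"))


def tool_verify(steps):
    """Return True only if every (expression, claimed_value) pair checks out."""
    for expr, claimed in steps:
        try:
            if abs(safe_eval(expr) - claimed) > 1e-9:
                return False
        except (ValueError, SyntaxError, ZeroDivisionError):
            return False
    return True


def select_best(candidates):
    """Drop candidates the tool rejects, then pick the highest-scored one.

    `score` stands in for whatever the sLM verifier would assign (hypothetical).
    If the tool rejects every candidate, fall back to the full pool.
    """
    passed = [c for c in candidates if tool_verify(c["steps"])]
    pool = passed or candidates
    return max(pool, key=lambda c: c["score"])


# Hypothetical candidates: the higher-scored one contains an arithmetic slip.
candidates = [
    {"answer": "86", "steps": [("12 * 7", 86)], "score": 0.9},
    {"answer": "84", "steps": [("12 * 7", 84)], "score": 0.6},
]
best = select_best(candidates)
print(best["answer"])
```

In this sketch the model-assigned score alone would pick the wrong candidate; the interpreter check removes it without the small model having to recall that 12 × 7 is 84, which is the kind of memorization the abstract says sLMs struggle with.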
