TinyGSM: achieving >80% on GSM8k with small language models

Abstract

Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving ability remains an open question. Specifically, for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark is 34B parameters. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated entirely by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, and 2) the use of a verifier, which selects the final output from multiple candidate generations.
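The second component, choosing one output from several candidate generations, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `solution()` entry point, the helper names `run_candidate` and `select_with_verifier`, and above all `toy_score`, a lookup table that stands in for the 1.3B verifier model. It shows the general generate-then-verify pattern, not the authors' implementation.

```python
from typing import Callable, List, Optional


def run_candidate(code: str) -> Optional[float]:
    """Run a candidate Python solution and return its numeric answer.

    Assumes (hypothetically) that each candidate defines `solution()`.
    Returns None if the code fails to compile, run, or yield a number.
    """
    namespace: dict = {}
    try:
        # Illustration only: real generated code should run in a sandbox.
        exec(code, namespace)
        return float(namespace["solution"]())
    except Exception:
        return None


def select_with_verifier(
    question: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> Optional[float]:
    """Return the answer of the runnable candidate with the highest score."""
    best_score: Optional[float] = None
    best_answer: Optional[float] = None
    for code in candidates:
        answer = run_candidate(code)
        if answer is None:
            continue  # skip candidates that do not execute
        s = score(question, code)
        if best_score is None or s > best_score:
            best_score, best_answer = s, answer
    return best_answer


QUESTION = "Tom has 3 bags with 4 apples each and eats 2 apples. How many are left?"
CANDIDATES = [
    "def solution():\n    return 3 + 4 - 2\n",    # wrong reasoning
    "def solution():\n    return 3 * 4 - 2\n",    # correct reasoning
    "def solution():\n    return 3 * 4 - two\n",  # crashes (undefined name)
]
# Hypothetical scores; a trained verifier model would produce these.
TOY_SCORES = {CANDIDATES[0]: 0.2, CANDIDATES[1]: 0.9, CANDIDATES[2]: 0.95}


def toy_score(question: str, code: str) -> float:
    """Stand-in for a learned verifier: look up a fixed score."""
    return TOY_SCORES[code]


chosen = select_with_verifier(QUESTION, CANDIDATES, toy_score)
print(chosen)  # 10.0
```

In this toy setup the crashing candidate is discarded even though it has the highest score, because only candidates that execute are ranked. Whether the actual system filters on execution before scoring is not stated in the abstract; that ordering is a design choice of this sketch.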
