VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters, developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building on the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline comprising curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations show that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it scores 94.3 on AIME26 (97.1 with claim-level test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6, and it generalizes strongly out of distribution, with a 96.1% acceptance rate on recent unseen LeetCode contests. These results place it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis: verifiable reasoning can be compressed into compact reasoning cores, whereas open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.
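For readers unfamiliar with the Pass@1 metric cited for LiveCodeBench v6, the sketch below shows one common way such a score is computed: the unbiased pass@k estimator widely used for code benchmarks. This is an illustration only. The abstract does not say how many samples per problem were drawn or which estimator was used, so the function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` input layout are assumptions made here, not the report's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that passed all tests,
    k: sample budget being scored. (Hypothetical helper, not from the report.)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, as a percentage.

    results: one (n, c) pair per problem (assumed layout).
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of passing samples per problem, c / n, averaged over all problems, so a reported Pass@1 of 80.2 would correspond to roughly 80.2% of sampled solutions passing under this convention.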