
Z1: Efficient Test-time Scaling with Code

April 1, 2025
Authors: Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
cs.AI

Abstract

Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time compute scaling, yet this often entails longer contexts and substantial reasoning-token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, enabling them to reduce excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with short and long solution trajectories. Second, we present a novel Shifted Thinking Window that mitigates overthinking overhead by removing context-delimiting tags (e.g., <think>...</think>) and capping reasoning tokens. Trained on long and short trajectory data and equipped with the Shifted Thinking Window, our model, Z1-7B, adjusts its reasoning level to the complexity of the problem and exhibits efficient test-time scaling across different reasoning tasks, matching R1-Distill-Qwen-7B's performance with about 30% of its average thinking tokens. Notably, although fine-tuned only on code trajectories, Z1-7B generalizes to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.
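
The Shifted Thinking Window described above can be approximated with an ordinary generate-then-continue loop: the model reasons in a single context without <think>...</think> delimiters, and if it exhausts a fixed thinking budget, a short hint is appended to push it to answer directly. The sketch below illustrates this idea with the Hugging Face transformers API; the checkpoint id, token budgets, and hint wording are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a shifted-thinking-window style token budget, assuming a
# Hugging Face causal LM. Model id, budgets, and hint text are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "efficientscaling/Z1-7B"   # hypothetical hub id; replace with the released checkpoint
THINKING_CAP = 4096                      # assumed reasoning-token budget
HINT = "\nTime is limited; I will give the final answer now.\n"  # illustrative hint, not the paper's exact wording

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_with_shifted_window(problem: str, answer_budget: int = 1024) -> str:
    # No <think>...</think> delimiters: reasoning and answer share one context.
    prompt = f"Question: {problem}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Phase 1: let the model reason, but cap the number of new (thinking) tokens.
    draft = model.generate(**inputs, max_new_tokens=THINKING_CAP, do_sample=False)
    text = tokenizer.decode(draft[0], skip_special_tokens=True)

    # If generation ended on its own (EOS before the cap), return it unchanged.
    if draft.shape[-1] - inputs["input_ids"].shape[-1] < THINKING_CAP:
        return text

    # Phase 2: the cap was hit, so append the hint and force a direct answer.
    cont = tokenizer(text + HINT, return_tensors="pt").to(model.device)
    final = model.generate(**cont, max_new_tokens=answer_budget, do_sample=False)
    return tokenizer.decode(final[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate_with_shifted_window("Write a Python function that reverses a linked list."))
```

Because simple problems tend to terminate well before the cap, this scheme only pays the truncation-and-hint cost on hard inputs, which is consistent with the abstract's claim that reasoning effort scales with problem complexity.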
