Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

Abstract

In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains the structured reasoning skills of a 1.5B-parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. We then perform supervised fine-tuning (SFT) on a separate 10K-sample reasoning dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope (approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT), our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
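
To make the idea of a custom reward under GRPO concrete, the following is a minimal sketch, not the reward functions actually used for ThinkJSON, which the abstract does not specify. It assumes a hypothetical reward `schema_reward` that scores a completion by the fraction of expected fields present with the expected type, and a hypothetical helper `group_relative_advantages` that normalizes rewards within one group of sampled completions, which is the group-relative step GRPO is named for. The flat `{field: type}` schema format is also an assumption made for illustration.

```python
import json
from statistics import mean, pstdev


def schema_reward(output: str, schema: dict) -> float:
    """Hypothetical reward in [0, 1] (illustrative, not the paper's reward).

    Returns 0.0 if the output is not a JSON object; otherwise the fraction
    of schema fields that are present with the expected Python type.
    """
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    if not schema:
        return 1.0
    hits = sum(1 for key, typ in schema.items() if isinstance(obj.get(key), typ))
    return hits / len(schema)


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed toy schema and three sampled completions for the same prompt.
schema = {"name": str, "age": int}
completions = [
    '{"name": "Ada", "age": 36}',  # fully conforms
    '{"name": "Ada"}',             # missing one field
    "not json",                    # unparseable
]
rewards = [schema_reward(c, schema) for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch, completions that conform to the schema better than their group's average receive a positive advantage and the rest a negative one, so the policy update favors schema-conformant outputs without needing a separate value model.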
