Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

Abstract

In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains the structured reasoning skills of a 1.5B-parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. We then perform supervised fine-tuning (SFT) on a separate 10K-sample reasoning dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope (approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT), our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
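To make the idea of a custom reward under GRPO more concrete, the following is a minimal sketch, not the reward design used in the paper. It assumes a hypothetical reward that scores a completion by whether it parses as JSON and how many required fields carry the expected type, and then applies the group-relative normalization that GRPO is known for: each reward is centered on the group mean and scaled by the group standard deviation. The names `schema_reward`, `group_relative_advantages`, and the `required` mapping are illustrative assumptions, as is the choice of the population standard deviation.

```python
import json
from statistics import mean, pstdev


def schema_reward(output: str, required: dict[str, type]) -> float:
    """Hypothetical reward: 0.0 for invalid JSON, otherwise the fraction of
    required keys that are present with the expected Python type."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    if not required:
        return 1.0
    hits = sum(1 for key, typ in required.items() if isinstance(obj.get(key), typ))
    return hits / len(required)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).
    Using the population standard deviation is an assumption of this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of completions sampled for a single prompt.
schema = {"name": str, "age": int}
completions = [
    '{"name": "Ada", "age": 36}',  # fully schema-compliant
    '{"name": "Ada"}',             # missing one required field
    'not json',                    # fails to parse
]
rewards = [schema_reward(c, schema) for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch the compliant completion receives a positive advantage and the unparseable one a negative advantage, so a policy update would shift probability toward schema-valid outputs. A real setup would likely validate against a full JSON Schema and combine several reward terms.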
