Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

Abstract

In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains the structured reasoning skills of a 1.5B-parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. We then perform supervised fine-tuning (SFT) on a separate 10K-sample reasoning dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
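To make the idea of a schema-adherence reward under GRPO more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the reward functions, so `schema_reward`, its 0.5/0.5 weighting, and the flat `required_keys` schema are all hypothetical. The only part taken from GRPO itself is the group-relative normalization, where each sampled completion's reward is compared with the mean and standard deviation of its group.

```python
import json
import statistics


def schema_reward(output: str, required_keys: set[str]) -> float:
    """Hypothetical reward: 0.5 for parsing as a JSON object,
    plus up to 0.5 for the fraction of required keys present."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(obj, dict):
        return 0.0  # valid JSON, but not an object
    if not required_keys:
        return 1.0
    present = len(required_keys.intersection(obj))
    return 0.5 + 0.5 * present / len(required_keys)


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of sampled completions,
    as GRPO does: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All completions scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Example group of completions for one prompt (illustrative strings only).
outputs = ['{"name": "a", "price": 1}', '{"name": "a"}', "not json"]
rewards = [schema_reward(o, {"name", "price"}) for o in outputs]
advantages = group_relative_advantages(rewards)
```

In this sketch, completions that satisfy more of the schema receive a higher advantage than their siblings in the same group, which is the signal the policy update would use. A real reward would likely validate types and nesting against a full schema rather than only checking top-level keys.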
