Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Abstract

The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, less well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we apply this synthetic data approach to instruction tuning of open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated by state-of-the-art results on the SPIDER and BIRD benchmarks, narrowing the performance gap between open-source models and methods that prompt closed-source models.
