Automating Database-Native Function Code Synthesis with LLMs

Abstract

Database systems incorporate an ever-growing number of functions into their kernels (known as database native functions) to support scenarios such as new applications and business migration. This growth creates an urgent demand for automatic synthesis of database native functions. Recent LLM-based code generation tools (e.g., Claude Code) show promise, but they are too generic for database-specific development. Synthesizing a single function is inherently complex and error-prone, since it may involve registering multiple function units, linking internal references, and implementing the logic correctly, so these tools often hallucinate or overlook critical context. To this end, we propose DBCooker, an LLM-based system for automatically synthesizing database native functions. DBCooker consists of three components. First, the function characterization module aggregates multi-source declarations, identifies function units that require specialized coding, and traces cross-unit dependencies. Second, we design three operations to address the main synthesis challenges: (1) a pseudo-code-based coding plan generator that constructs structured implementation skeletons by identifying key elements such as reusable referenced functions; (2) a hybrid fill-in-the-blank model guided by probabilistic priors and component awareness to integrate core logic with reusable routines; and (3) three-level progressive validation, comprising syntax checking, standards compliance, and LLM-guided semantic verification. Finally, an adaptive orchestration strategy unifies these operations with existing tools and dynamically sequences them based on the orchestration history of similar functions. Results show that DBCooker outperforms other methods on SQLite, PostgreSQL, and DuckDB, with 34.55% higher accuracy on average, and can synthesize new functions that are absent from the latest SQLite (v3.50).
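
To make the idea of three-level progressive validation concrete, the following is a minimal sketch and not DBCooker's implementation. It assumes that each level is a function returning an error message or nothing, and that validation stops at the first failing level so cheap checks run before expensive ones. All names here (`check_syntax`, `check_standards`, `make_semantic_check`, `validate`) and the individual rules (brace balancing, a ban on raw `malloc`) are hypothetical stand-ins; the semantic level takes a caller-supplied judge in place of an LLM.

```python
from typing import Callable, List, Optional, Tuple

# A check takes candidate source text and returns an error message,
# or None when the candidate passes. (Hypothetical interface.)
Check = Callable[[str], Optional[str]]


def check_syntax(code: str) -> Optional[str]:
    """Level 1 stand-in: parentheses and braces must balance.

    A real system would presumably invoke a compiler or parser instead.
    """
    closers = {")": "(", "}": "{"}
    stack: List[str] = []
    for ch in code:
        if ch in "({":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return f"unbalanced '{ch}'"
    return "unclosed delimiter" if stack else None


def check_standards(code: str) -> Optional[str]:
    """Level 2 stand-in: an assumed project rule that bans raw malloc."""
    if "malloc(" in code:
        return "use the engine allocator instead of malloc"
    return None


def make_semantic_check(judge: Callable[[str], bool]) -> Check:
    """Level 3 stand-in: wrap any judge (e.g. an LLM call) as a check."""
    def check(code: str) -> Optional[str]:
        return None if judge(code) else "semantic judge rejected the code"
    return check


def validate(code: str, levels: List[Tuple[str, Check]]) -> Tuple[bool, str]:
    """Run the levels in order and stop at the first failure."""
    for name, check in levels:
        error = check(code)
        if error is not None:
            return False, f"{name}: {error}"
    return True, "all levels passed"


# Example wiring; the lambda is a placeholder for an LLM-based judge.
LEVELS: List[Tuple[str, Check]] = [
    ("syntax", check_syntax),
    ("standards", check_standards),
    ("semantic", make_semantic_check(lambda code: "return" in code)),
]
```

Calling `validate(candidate, LEVELS)` returns a pass/fail flag plus the name of the first level that failed, which is the kind of feedback a synthesis loop could hand back to the generator. Ordering the levels from cheapest to most expensive is a design assumption of this sketch.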

