Automating Database-Native Function Code Synthesis with LLMs

Abstract

Database systems build an ever-growing number of functions into their kernels (database-native functions) to serve scenarios such as new application support and business migration. This growth creates urgent demand for automatic synthesis of database-native functions. Recent LLM-based code generation tools (e.g., Claude Code) show promise, but they are too generic for database-specific development. Database function synthesis is inherently complex and error-prone: synthesizing even a single function may involve registering multiple function units, linking internal references, and implementing the logic correctly, so these tools often hallucinate or overlook critical context. To this end, we propose DBCooker, an LLM-based system for automatically synthesizing database-native functions. It consists of three components. First, a function characterization module aggregates multi-source declarations, identifies function units that require specialized coding, and traces cross-unit dependencies. Second, we design operations to address the main synthesis challenges: (1) a pseudo-code-based coding plan generator that constructs structured implementation skeletons by identifying key elements such as reusable referenced functions; (2) a hybrid fill-in-the-blank model, guided by probabilistic priors and component awareness, that integrates core logic with reusable routines; and (3) three-level progressive validation, comprising syntax checking, standards compliance, and LLM-guided semantic verification. Finally, an adaptive orchestration strategy unifies these operations with existing tools and dynamically sequences them using the orchestration history of similar functions. Results show that DBCooker outperforms other methods on SQLite, PostgreSQL, and DuckDB (34.55% higher accuracy on average) and can synthesize new functions absent from the latest SQLite (v3.50).
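To make the idea of three-level progressive validation concrete, the following is a minimal sketch, not DBCooker's implementation. It assumes the candidate code is Python rather than database kernel code, so that the example stays self-contained. The names `check_syntax`, `check_standards`, `check_semantics`, `validate`, and `repeat_text` are hypothetical. The standards rule (snake_case names with docstrings) is an invented placeholder, and the third level swaps the LLM-guided semantic verification described in the abstract for fixed input/output cases. The point it illustrates is only the progressive ordering: cheap checks run first, and validation stops at the first failing level.

```python
import ast
import re
from typing import Callable, List, Tuple

# A check takes candidate source code and returns (passed, message).
Check = Callable[[str], Tuple[bool, str]]


def check_syntax(source: str) -> Tuple[bool, str]:
    """Level 1: reject code that does not parse."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    return True, "syntax ok"


def check_standards(source: str) -> Tuple[bool, str]:
    """Level 2: hypothetical rule, functions are snake_case and documented."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            if not re.fullmatch(r"[a-z_][a-z0-9_]*", node.name):
                return False, f"{node.name}: not snake_case"
            if ast.get_docstring(node) is None:
                return False, f"{node.name}: missing docstring"
    return True, "standards ok"


def check_semantics(source: str) -> Tuple[bool, str]:
    """Level 3: stand-in for LLM-guided verification, using fixed test cases."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable only for trusted demo input
    func = namespace.get("repeat_text")  # hypothetical target function
    if func is None:
        return False, "repeat_text is not defined"
    for args, expected in [(("ab", 2), "abab"), (("x", 0), "")]:
        if func(*args) != expected:
            return False, f"repeat_text{args} != {expected!r}"
    return True, "semantics ok"


def validate(source: str, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks in order and stop at the first failure."""
    messages: List[str] = []
    for check in checks:
        ok, message = check(source)
        messages.append(message)
        if not ok:
            return False, messages
    return True, messages


LEVELS: List[Check] = [check_syntax, check_standards, check_semantics]

GOOD = '''
def repeat_text(text, count):
    """Return text repeated count times."""
    return text * count
'''
BAD_SYNTAX = "def repeat_text(text, count:\n    return text * count\n"
BAD_STYLE = "def RepeatText(text, count):\n    return text * count\n"

if __name__ == "__main__":
    for label, candidate in [("good", GOOD), ("bad syntax", BAD_SYNTAX), ("bad style", BAD_STYLE)]:
        print(label, validate(candidate, LEVELS))
```

Ordering the levels from cheapest to most expensive means a candidate with a syntax error never reaches the costlier semantic stage. In a real system that stage would involve an LLM call or a run against the database's test suite.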
