Technical Report: Full-Stack Fine-Tuning for the Q Programming Language
August 9, 2025
作者: Brendan R. Hogan, Will Brown, Adel Boyarsky, Anderson Schneider, Yuriy Nevmyvaka
cs.AI
Abstract
Even though large language models are becoming increasingly capable, it is
still unreasonable to expect them to excel at tasks that are under-represented
on the Internet. Leveraging LLMs for specialized applications, particularly in
niche programming languages and private domains, remains challenging and
largely unsolved. In this work, we address this gap by presenting a
comprehensive, open-source approach for adapting LLMs to the Q programming
language, a popular tool in quantitative finance that is much less present on
the Internet compared to Python, C, Java, and other "mainstream" languages and
is therefore not a strong suit of general-purpose AI models. We introduce a new
LeetCode-style evaluation dataset for Q, benchmark major frontier models on the
dataset, and then apply pretraining, supervised fine-tuning, and reinforcement
learning to train a suite of reasoning and non-reasoning models based on the
Qwen-2.5 series, spanning five parameter sizes (1.5B, 3B, 7B, 14B, 32B). Our
best model achieves a pass@1 accuracy of 59 percent on our Q benchmark,
surpassing the best-performing frontier model, Claude Opus-4, by 29.5
percentage points.
Additionally, all models, even our 1.5B model, outperform GPT-4.1 on this task.
In addition to releasing models, code, and data, we provide a detailed
blueprint for dataset construction, model pretraining, supervised fine-tuning,
and reinforcement learning. Our methodology is broadly applicable, and we
discuss how these techniques can be extended to other tasks, including those
where evaluation may rely on soft or subjective signals.
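As a point of reference for the pass@1 figures above, a minimal sketch of how pass@k is conventionally estimated (following the standard unbiased estimator; the per-problem outcomes below are hypothetical, not data from this work):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem,
    c of which pass the unit tests, evaluated at budget k.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem (n = k = 1), pass@1 reduces to the
# fraction of problems whose one completion passes its tests.
outcomes = [True, False, True, True, False]  # hypothetical benchmark results
score = sum(pass_at_k(1, int(ok), 1) for ok in outcomes) / len(outcomes)
print(f"pass@1 = {score:.0%}")  # → pass@1 = 60%
```

Averaging the per-problem estimates over the benchmark gives the reported accuracy; for k > 1 the combinatorial correction avoids the bias of simply resampling k completions.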