

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

April 1, 2026
作者: Jack Young
cs.AI

Abstract

Tuning a single initial state matrix per recurrent layer, using roughly 48 execution-verified HumanEval training solutions and incurring zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (a GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 ± 1.7 pp (10 seeds). On FalconH1-7B (a Mamba-2 hybrid), S0 reaches 71.8% ± 1.3 versus 71.4% ± 2.4 for LoRA (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by 13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but at per-step inference cost. Taken together, these results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
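The core recipe can be sketched in a few lines of PyTorch: make the initial recurrent state a trainable parameter, freeze everything else, and optimize only that state. The toy linear-recurrence layer below is illustrative only; the names (`ToyRecurrentLayer`, `S0`) and the state update are assumptions for exposition, not the actual GatedDeltaNet or Mamba-2 internals or the paper's released library.

```python
import torch

# Hypothetical minimal recurrent layer; the update rule is a toy
# linear recurrence, not the paper's architecture.
class ToyRecurrentLayer(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W = torch.nn.Linear(d, d)                    # ordinary model weight
        self.S0 = torch.nn.Parameter(torch.zeros(d, d))   # tunable initial state

    def forward(self, x):  # x: (seq_len, d)
        S = self.S0                                       # start from the tuned state
        outs = []
        for t in range(x.shape[0]):
            S = 0.9 * S + torch.outer(x[t], x[t])         # toy state update
            outs.append(self.W(x[t]) + S @ x[t])
        return torch.stack(outs)

layer = ToyRecurrentLayer(8)
# S0 tuning: freeze every weight, leave only the initial state trainable.
for name, p in layer.named_parameters():
    p.requires_grad = (name == "S0")

opt = torch.optim.Adam([layer.S0], lr=1e-2)
x = torch.randn(5, 8)
loss = layer(x).pow(2).mean()                             # stand-in training loss
loss.backward()
W_before = layer.W.weight.clone()
opt.step()
assert layer.W.weight.equal(W_before)                     # weights untouched
assert not layer.S0.detach().equal(torch.zeros(8, 8))     # only S0 moved
```

Because the learned `S0` replaces the (usually zero) default initial state, the forward pass is unchanged in cost, which is the zero-inference-overhead property; switching tasks just means loading a different state tensor, with no weight merging.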
PDF · April 3, 2026