Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
June 9, 2025
Author: Asankhaya Sharma
cs.AI
Abstract
Large language models have transformed natural language processing, yet
supervised fine-tuning (SFT) remains computationally intensive. This paper
formally proves that capabilities acquired through SFT can be approximated by a
base transformer model using inference-time techniques, specifically in-context
learning (ICL), without altering model parameters, under idealized assumptions
including unbounded computational resources and access to the fine-tuning
dataset. We extend these results to practical scenarios with finite context
lengths and partial dataset access. For text generation tasks with fixed output
length $l$, datasets of size $O\!\left(\frac{mV}{\varepsilon^2} \log \frac{m}{\delta}\right)$
or, with bounded context, $O\!\left(\frac{l \log V}{\varepsilon^2} \log \frac{1}{\delta}\right)$
suffice to approximate fine-tuned behavior across $m$ contexts within error
$\varepsilon$, where $V$ is the vocabulary size and $\delta$ is the failure
probability. For linear classification, datasets of size $O\!\left(\frac{d}{\varepsilon}\right)$
or, with fixed context, $O\!\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$
are sufficient, where $d$ is the input dimension.
Grounded in the Turing completeness of transformers, these results provide a
theoretical foundation for resource-efficient deployment of large language
models, with practical techniques like retrieval-augmented generation bridging
theory to real-world applications.
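
As an informal illustration of the inference-time recipe described above (not code from the paper), the minimal Python sketch below conditions a frozen base model on demonstrations drawn from the fine-tuning dataset instead of updating its weights. The `base_model` callable, the prompt template, and the demonstration count `k` are hypothetical stand-ins for whatever model interface and ICL format one actually uses.

```python
import random
from typing import Callable, Sequence, Tuple

def build_icl_prompt(dataset: Sequence[Tuple[str, str]],
                     query: str,
                     k: int,
                     seed: int = 0) -> str:
    """Assemble an in-context prompt: k (input, output) demonstrations
    sampled from the fine-tuning dataset, followed by the new query."""
    rng = random.Random(seed)
    demos = rng.sample(list(dataset), min(k, len(dataset)))
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def elicit_via_icl(base_model: Callable[[str], str],
                   dataset: Sequence[Tuple[str, str]],
                   query: str,
                   k: int = 32) -> str:
    """Approximate fine-tuned behavior with a frozen base model by
    conditioning on demonstrations rather than changing its parameters."""
    return base_model(build_icl_prompt(dataset, query, k))

if __name__ == "__main__":
    # Toy stand-in for a frozen LLM: echoes the last demonstration's output.
    def toy_base_model(prompt: str) -> str:
        outputs = [line[len("Output: "):] for line in prompt.splitlines()
                   if line.startswith("Output: ")]
        return outputs[-1] if outputs else ""

    sft_data = [("2+2", "4"), ("3+5", "8"), ("7+1", "8")]
    print(elicit_via_icl(toy_base_model, sft_data, query="4+4", k=2))
```

In practice the same pattern underlies retrieval-augmented generation: rather than sampling demonstrations uniformly, one retrieves the examples most relevant to the query, which keeps the prompt within the bounded-context regime the abstract refers to.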