Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
June 9, 2025
Author: Asankhaya Sharma
cs.AI
Abstract
Large language models have transformed natural language processing, yet
supervised fine-tuning (SFT) remains computationally intensive. This paper
formally proves that capabilities acquired through SFT can be approximated by a
base transformer model using inference-time techniques, specifically in-context
learning (ICL), without altering model parameters, under idealized assumptions
including unbounded computational resources and access to the fine-tuning
dataset. We extend these results to practical scenarios with finite context
lengths and partial dataset access. For text generation tasks with fixed output
length $l$, datasets of size $O\!\left(\frac{mV}{\varepsilon^2} \log \frac{m}{\delta}\right)$
or, with bounded context, $O\!\left(\frac{l \log V}{\varepsilon^2} \log \frac{1}{\delta}\right)$
suffice to approximate fine-tuned behavior across $m$ contexts within error
$\varepsilon$, where $V$ is the vocabulary size and $\delta$ is the failure
probability. For linear classification, datasets of size $O\!\left(\frac{d}{\varepsilon}\right)$
or, with fixed context, $O\!\left(\frac{1}{\varepsilon^2} \log \frac{1}{\delta}\right)$
are sufficient, where $d$ is the input dimension.
Grounded in the Turing completeness of transformers, these results provide a
theoretical foundation for resource-efficient deployment of large language
models, with practical techniques like retrieval-augmented generation bridging
theory to real-world applications.
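
As an informal illustration of the inference-time recipe described above (not code from the paper), the minimal Python sketch below conditions a frozen base model on demonstrations drawn from the fine-tuning dataset instead of updating its weights. The `base_model` callable, the prompt template, and the demonstration count `k` are hypothetical stand-ins for whatever model interface and ICL format one actually uses.

```python
import random
from typing import Callable, Sequence, Tuple

def build_icl_prompt(dataset: Sequence[Tuple[str, str]],
                     query: str,
                     k: int,
                     seed: int = 0) -> str:
    """Assemble an in-context prompt: k (input, output) demonstrations
    sampled from the fine-tuning dataset, followed by the new query."""
    rng = random.Random(seed)
    demos = rng.sample(list(dataset), min(k, len(dataset)))
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def elicit_via_icl(base_model: Callable[[str], str],
                   dataset: Sequence[Tuple[str, str]],
                   query: str,
                   k: int = 32) -> str:
    """Approximate fine-tuned behavior with a frozen base model by
    conditioning on demonstrations rather than changing its parameters."""
    return base_model(build_icl_prompt(dataset, query, k))

if __name__ == "__main__":
    # Toy stand-in for a frozen LLM: echoes the last demonstration's output.
    def toy_base_model(prompt: str) -> str:
        outputs = [line[len("Output: "):] for line in prompt.splitlines()
                   if line.startswith("Output: ")]
        return outputs[-1] if outputs else ""

    sft_data = [("2+2", "4"), ("3+5", "8"), ("7+1", "8")]
    print(elicit_via_icl(toy_base_model, sft_data, query="4+4", k=2))
```

In practice the same pattern underlies retrieval-augmented generation: rather than sampling demonstrations uniformly, one retrieves the examples most relevant to the query, which keeps the prompt within the bounded-context regime the abstract refers to.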