Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques
June 9, 2025
Author: Asankhaya Sharma
cs.AI
Abstract
Large language models have transformed natural language processing, yet
supervised fine-tuning (SFT) remains computationally intensive. This paper
formally proves that capabilities acquired through SFT can be approximated by a
base transformer model using inference-time techniques, specifically in-context
learning (ICL), without altering model parameters, under idealized assumptions
including unbounded computational resources and access to the fine-tuning
dataset. We extend these results to practical scenarios with finite context
lengths and partial dataset access. For text generation tasks with fixed output
length $l$, datasets of size $O\left( \frac{m V}{\varepsilon^2} \log \frac{m}{\delta} \right)$ or, with bounded context,
$O\left( \frac{l \log V}{\varepsilon^2} \log \frac{1}{\delta} \right)$ suffice to approximate
fine-tuned behavior across $m$ contexts within error $\varepsilon$, where $V$
is the vocabulary size and $\delta$ is the failure probability. For linear
classification, datasets of size $O\left( \frac{d}{\varepsilon} \right)$ or, with fixed context,
$O\left( \frac{1}{\varepsilon^2} \log \frac{1}{\delta} \right)$ are sufficient, where $d$ is the input dimension.
Grounded in the Turing completeness of transformers, these results provide a
theoretical foundation for resource-efficient deployment of large language
models, with practical techniques like retrieval-augmented generation bridging
theory to real-world applications.
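
To make the inference-time idea concrete, the sketch below shows one way a base model could be prompted with examples drawn from a fine-tuning dataset instead of updating its weights, in the spirit of the retrieval-augmented ICL setting the abstract describes. This is a minimal illustration, not the paper's method: the dataset format, the token-overlap retrieval, the prompt template, and the bound evaluation (which drops all constants) are assumptions made here for demonstration.

```python
import math

def dataset_size_bound_text_gen(m, V, eps, delta):
    """Evaluate the order-of-growth term (m*V/eps^2) * log(m/delta) from the
    abstract's text-generation bound. Constants are omitted, so this is an
    illustrative growth rate, not an exact required sample count."""
    return (m * V / eps**2) * math.log(m / delta)

def build_icl_prompt(dataset, query, k=4):
    """Select k examples from a (hypothetical) fine-tuning dataset using a
    crude token-overlap score and format them as an in-context prompt for a
    base model. A practical retrieval-augmented setup would typically use
    embedding similarity instead of token overlap."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))

    ranked = sorted(dataset, key=lambda ex: overlap(ex["input"], query), reverse=True)
    shots = ranked[:k]
    lines = [f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in shots]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    # Toy stand-in for a fine-tuning dataset of (input, output) pairs.
    data = [
        {"input": "Translate 'bonjour' to English", "output": "hello"},
        {"input": "Translate 'merci' to English", "output": "thank you"},
    ]
    print(build_icl_prompt(data, "Translate 'au revoir' to English", k=2))

    # Growth term for m=100 contexts, V=50,000 tokens, eps=0.1, delta=0.01.
    print(f"{dataset_size_bound_text_gen(100, 50_000, 0.1, 0.01):.2e}")
```

The resulting prompt string would then be passed to the base model's generation API; no parameters are modified, which is the resource-efficiency point the abstract makes.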