Priority Sampling of Large Language Models for Compilers
February 28, 2024
Authors: Dejan Grubisic, Chris Cummins, Volker Seeker, Hugh Leather
cs.AI
Abstract
Large language models show great potential in generating and optimizing code.
Widely used sampling methods such as Nucleus Sampling increase the diversity of
generation but often produce repeated samples for low temperatures and
incoherent samples for high temperatures. Furthermore, the temperature
coefficient has to be tuned for each task, limiting its usability. We present
Priority Sampling, a simple and deterministic sampling technique that produces
unique samples ordered by the model's confidence. Each new sample expands the
unexpanded token with the highest probability in the augmented search tree.
Additionally, Priority Sampling supports generation based on regular expressions,
which provides a controllable and structured exploration process. Priority
Sampling outperforms Nucleus Sampling for any number of samples, boosting the
original model's improvement over -Oz from 2.87% to 5%. Moreover, with just 30
samples it outperforms the autotuner that was used to generate the labels for
training the original model.
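To make the mechanism concrete, below is a minimal, self-contained Python sketch of the idea described in the abstract. Everything in it is illustrative rather than taken from the paper's implementation: the toy bigram model, the next_token_logprobs interface, the TOP_K and MAX_LEN constants, and the choice to rank unexpanded branches by the branch token's own log-probability are all assumptions made for this example.

```python
import heapq
import math

EOS = "<eos>"
TOP_K = 3     # alternative branches recorded per decoding step (illustrative)
MAX_LEN = 64  # safety cap on sequence length

# Toy stand-in for a language model: a fixed bigram table mapping the last
# token to (next_token, log_probability) pairs. It exists only to make the
# sketch executable; a real use would call an LLM's forward pass instead.
TOY_MODEL = {
    None:    [("-Oz", math.log(0.6)), ("-O3", math.log(0.3)), ("-O1", math.log(0.1))],
    "-Oz":   [(EOS, math.log(0.7)), ("-flto", math.log(0.3))],
    "-O3":   [(EOS, math.log(0.8)), ("-flto", math.log(0.2))],
    "-O1":   [(EOS, math.log(1.0))],
    "-flto": [(EOS, math.log(1.0))],
}

def next_token_logprobs(prefix):
    """Return (token, log_prob) pairs for the next token after `prefix`."""
    last = prefix[-1] if prefix else None
    return TOY_MODEL[last]

def priority_sampling(num_samples):
    """Return up to `num_samples` unique sequences in priority order."""
    # Max-heap of unexpanded branches: (-log_prob, prefix, branch_token).
    # heapq is a min-heap, so scores are negated.
    frontier = [(0.0, (), None)]  # root: empty prefix, nothing committed yet
    samples = []
    while frontier and len(samples) < num_samples:
        _, prefix, token = heapq.heappop(frontier)
        seq = prefix if token is None else prefix + (token,)
        # Greedily complete the sequence; at every step, record the runner-up
        # tokens as new unexpanded branches of the search tree.
        while len(seq) < MAX_LEN and (not seq or seq[-1] != EOS):
            candidates = sorted(next_token_logprobs(seq),
                                key=lambda tl: tl[1], reverse=True)[:TOP_K]
            for alt_token, alt_lp in candidates[1:]:
                heapq.heappush(frontier, (-alt_lp, seq, alt_token))
            seq += (candidates[0][0],)  # follow the most probable token
        samples.append(seq)  # unique by construction: each branch differs in a token
    return samples

if __name__ == "__main__":
    for s in priority_sampling(4):
        print(" ".join(s))
```

Running the sketch prints four distinct completions ordered by the model's confidence. Because every popped branch differs from all previous samples in at least one token, deduplication comes for free, and the whole process is deterministic for a fixed model, matching the temperature-free behavior the abstract describes.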