Priority Sampling of Large Language Models for Compilers
February 28, 2024
Authors: Dejan Grubisic, Chris Cummins, Volker Seeker, Hugh Leather
cs.AI
Abstract
Large language models show great potential in generating and optimizing code.
Widely used sampling methods such as Nucleus Sampling increase the diversity of
generation but often produce repeated samples for low temperatures and
incoherent samples for high temperatures. Furthermore, the temperature
coefficient has to be tuned for each task, limiting its usability. We present
Priority Sampling, a simple and deterministic sampling technique that produces
unique samples ordered by the model's confidence. Each new sample expands the
unexpanded token with the highest probability in the augmented search tree.
Additionally, Priority Sampling supports generation based on regular expressions,
which provides a controllable and structured exploration process. Priority
Sampling outperforms Nucleus Sampling for any number of samples, boosting the
original model's improvement over -Oz from 2.87% to 5%. Moreover, with just 30
samples it outperforms the autotuner that was used to generate the labels for
training the original model.
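To make the mechanism concrete, below is a minimal, self-contained Python sketch of the idea described in the abstract. Everything in it is illustrative rather than taken from the paper's implementation: the toy bigram model, the next_token_logprobs interface, the TOP_K and MAX_LEN constants, and the choice to rank unexpanded branches by the branch token's own log-probability are all assumptions made for this example.

```python
import heapq
import math

EOS = "<eos>"
TOP_K = 3     # alternative branches recorded per decoding step (illustrative)
MAX_LEN = 64  # safety cap on sequence length

# Toy stand-in for a language model: a fixed bigram table mapping the last
# token to (next_token, log_probability) pairs. It exists only to make the
# sketch executable; a real use would call an LLM's forward pass instead.
TOY_MODEL = {
    None:    [("-Oz", math.log(0.6)), ("-O3", math.log(0.3)), ("-O1", math.log(0.1))],
    "-Oz":   [(EOS, math.log(0.7)), ("-flto", math.log(0.3))],
    "-O3":   [(EOS, math.log(0.8)), ("-flto", math.log(0.2))],
    "-O1":   [(EOS, math.log(1.0))],
    "-flto": [(EOS, math.log(1.0))],
}

def next_token_logprobs(prefix):
    """Return (token, log_prob) pairs for the next token after `prefix`."""
    last = prefix[-1] if prefix else None
    return TOY_MODEL[last]

def priority_sampling(num_samples):
    """Return up to `num_samples` unique sequences in priority order."""
    # Max-heap of unexpanded branches: (-log_prob, prefix, branch_token).
    # heapq is a min-heap, so scores are negated.
    frontier = [(0.0, (), None)]  # root: empty prefix, nothing committed yet
    samples = []
    while frontier and len(samples) < num_samples:
        _, prefix, token = heapq.heappop(frontier)
        seq = prefix if token is None else prefix + (token,)
        # Greedily complete the sequence; at every step, record the runner-up
        # tokens as new unexpanded branches of the search tree.
        while len(seq) < MAX_LEN and (not seq or seq[-1] != EOS):
            candidates = sorted(next_token_logprobs(seq),
                                key=lambda tl: tl[1], reverse=True)[:TOP_K]
            for alt_token, alt_lp in candidates[1:]:
                heapq.heappush(frontier, (-alt_lp, seq, alt_token))
            seq += (candidates[0][0],)  # follow the most probable token
        samples.append(seq)  # unique by construction: each branch differs in a token
    return samples

if __name__ == "__main__":
    for s in priority_sampling(4):
        print(" ".join(s))
```

Running the sketch prints four distinct completions ordered by the model's confidence. Because every popped branch differs from all previous samples in at least one token, deduplication comes for free, and the whole process is deterministic for a fixed model, matching the temperature-free behavior the abstract describes.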