The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

July 15, 2024
作者: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin
cs.AI

Abstract

Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the performance differences between greedy decoding and sampling, identifying benchmarks' consistency regarding non-determinism, and examining unique model behaviors. Through extensive experiments, we observe that greedy decoding generally outperforms sampling methods for most evaluated tasks. We also observe consistent performance across different LLM sizes and alignment methods, noting that alignment can reduce sampling variance. Moreover, our best-of-N sampling approach demonstrates that smaller LLMs can match or surpass larger models such as GPT-4-Turbo, highlighting the untapped potential of smaller LLMs. This research shows the importance of considering non-determinism in LLM evaluations and provides insights for future LLM development and evaluation.
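The best-of-N sampling approach mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `generate` stub stands in for an LLM call (greedy decoding at temperature 0, stochastic sampling otherwise), and `score` stands in for whatever quality signal (e.g., a reward model or an exact-match check) selects among candidates.

```python
import random

def generate(prompt: str, temperature: float) -> str:
    # Stand-in for an LLM call. Greedy decoding (temperature == 0) is
    # deterministic; sampling introduces the run-to-run variance the
    # paper argues evaluations should not ignore.
    if temperature == 0.0:
        return f"greedy answer to: {prompt}"
    return f"sampled answer {random.randint(0, 9)} to: {prompt}"

def score(output: str) -> float:
    # Hypothetical quality metric used to rank candidates.
    return float(len(output))

def best_of_n(prompt: str, n: int = 8, temperature: float = 0.7) -> str:
    """Draw n stochastic samples and keep the highest-scoring one."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=score)
```

Under this scheme, a smaller model that produces at least one strong candidate among N samples can match a larger model's single greedy output, which is the effect the paper reports for GPT-4-Turbo comparisons.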
