The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
July 15, 2024
Authors: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin
cs.AI
Abstract
Current evaluations of large language models (LLMs) often overlook
non-determinism, typically focusing on a single output per example. This limits
our understanding of LLM performance variability in real-world applications.
Our study addresses this issue by exploring key questions about the performance
differences between greedy decoding and sampling, identifying benchmarks'
consistency regarding non-determinism, and examining unique model behaviors.
Through extensive experiments, we observe that greedy decoding generally
outperforms sampling methods for most evaluated tasks. We also observe
consistent performance across different LLM sizes and alignment methods, noting
that alignment can reduce sampling variance. Moreover, our best-of-N sampling
approach demonstrates that smaller LLMs can match or surpass larger models such
as GPT-4-Turbo, highlighting the untapped potential of smaller LLMs. This
research shows the importance of considering non-determinism in LLM evaluations
and provides insights for future LLM development and evaluation.
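To make the decoding strategies the abstract contrasts concrete, here is a minimal, self-contained sketch of greedy decoding, temperature sampling, and best-of-N selection over a toy logit vector. This is illustrative only: the function names, the toy scorer, and the simplification to a single token step are assumptions, not the paper's actual experimental setup.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=None):
    """Pick a token index from raw logits.

    temperature == 0.0 corresponds to greedy decoding (argmax);
    higher temperatures flatten the softmax distribution before sampling,
    which is the source of the non-determinism the paper studies.
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point drift

def best_of_n(generate, score, n):
    """Best-of-N sampling: draw n candidates, keep the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

A usage sketch: with `generate=lambda: sample_token(logits, temperature=1.0, rng=rng)` and a task-specific `score` function (e.g. a reward model or exact-match checker), `best_of_n` returns the strongest of `n` sampled outputs, which is how a smaller model's sampled candidates can be compared against a larger model's single greedy output.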