The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Abstract

Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the performance differences between greedy decoding and sampling, identifying benchmarks' consistency regarding non-determinism, and examining unique model behaviors. Through extensive experiments, we observe that greedy decoding generally outperforms sampling methods for most evaluated tasks. We also observe consistent performance across different LLM sizes and alignment methods, noting that alignment can reduce sampling variance. Moreover, our best-of-N sampling approach demonstrates that smaller LLMs can match or surpass larger models such as GPT-4-Turbo, highlighting the untapped potential of smaller LLMs. This research shows the importance of considering non-determinism in LLM evaluations and provides insights for future LLM development and evaluation.
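The three quantities the abstract contrasts (a single greedy output, the average over sampled outputs, and best-of-N) can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes a toy setup in which each example is a small set of candidate answers with model logits and a known correctness score, and it assumes best-of-N selection is done by an oracle that keeps the highest-scoring draw. The names `softmax`, `evaluate`, and `TOY_EXAMPLES` are hypothetical and exist only for this illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(examples, n_samples=8, temperature=1.0, seed=0):
    """Compare greedy, mean-of-samples, and oracle best-of-N scores.

    Each example is (logits, scores): one logit and one correctness
    score per candidate answer. This toy setup is an assumption, not
    the benchmark protocol used in the paper.
    """
    rng = random.Random(seed)
    greedy, sampled, best = [], [], []
    for logits, scores in examples:
        indices = list(range(len(logits)))
        # Greedy decoding: always take the highest-logit candidate.
        greedy.append(scores[max(indices, key=lambda i: logits[i])])
        # Sampling: draw N candidates from the softmax distribution.
        probs = softmax(logits, temperature)
        draws = [scores[i] for i in rng.choices(indices, weights=probs, k=n_samples)]
        sampled.append(sum(draws) / n_samples)
        # Best-of-N with an oracle selector: keep the best draw.
        best.append(max(draws))
    count = len(examples)
    return {
        "greedy": sum(greedy) / count,
        "sampling_mean": sum(sampled) / count,
        "best_of_n": sum(best) / count,
    }


# Two toy examples: the top-logit answer is right in the first, wrong in the second.
TOY_EXAMPLES = [
    ([2.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [1.0, 0.0]),
]

if __name__ == "__main__":
    print(evaluate(TOY_EXAMPLES, n_samples=8, temperature=1.0, seed=0))
```

In this sketch the best-of-N score can never fall below the sampling mean, because the maximum of a set of draws is at least their average. That makes it an upper bound on what sampling could achieve with a perfect selector, not a score a deployed system would obtain directly.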
