UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in Large Language Models

Abstract

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse toward a single plausible answer means that they fail to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that uses the Kolmogorov-Smirnov statistical test to quantify how well a model's outputs approximate a black-box target distribution. KS@N is the rate at which the test fails to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0% to over 20%. No model achieves over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution to this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
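
To make the KS@N definition concrete, a score of this kind might be computed roughly as follows. This is a minimal sketch and not the benchmark's actual implementation: the significance level `alpha = 0.05`, the asymptotic p-value approximation, the aggregation as a fraction of problems, and all function names (`ks_statistic`, `ks_pvalue`, `ks_at_n`) are assumptions made for illustration.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in set(xs) | set(ys):  # empirical CDFs only jump at observed values
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The alternating series converges poorly here; the p-value is ~1.
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_at_n(problems, n, alpha=0.05):
    """Fraction of problems whose model sample of size n is NOT rejected.

    `problems` is a list of (model_sampler, reference_sample) pairs, where
    model_sampler() returns one numeric outcome (hypothetical interface).
    alpha = 0.05 is an assumed significance level.
    """
    passed = 0
    for model_sampler, reference in problems:
        sample = [model_sampler() for _ in range(n)]
        d = ks_statistic(sample, reference)
        if ks_pvalue(d, n, len(reference)) >= alpha:
            passed += 1
    return passed / len(problems)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for ground-truth samples of a fair six-sided die.
    reference = [rng.randint(1, 6) for _ in range(1000)]
    calibrated = lambda: rng.randint(1, 6)  # samples from the target
    collapsed = lambda: 3                   # always the same plausible answer
    print("calibrated:", ks_at_n([(calibrated, reference)], n=100))
    print("collapsed: ", ks_at_n([(collapsed, reference)], n=100))
```

The collapsed sampler illustrates the failure mode described above: every individual answer is plausible, yet the sample as a whole is rejected against the target distribution. For discrete outcomes with many ties, the asymptotic KS p-value is conservative, so a real evaluation might prefer an exact or permutation-based variant.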