UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in Large Language Models

Abstract


We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse toward a single plausible answer means they fail to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that uses the Kolmogorov-Smirnov test to quantify how well a model's outputs approximate black-box target distributions. KS@N is the rate at which the test fails to reject the hypothesis that a model's samples of size N and ground-truth samples come from the same distribution, with larger N indicating greater difficulty. Across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model achieves over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
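
To make the KS@N idea concrete, the following is a minimal sketch of how such a score could be computed, not the benchmark's actual implementation. Several details are assumptions that the abstract does not specify: the significance level (`alpha=0.05`), the number of ground-truth samples (`n_ref=1000`), the asymptotic p-value approximation, and the hypothetical representation of each problem as a pair of sampler functions.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov series (an approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # small-sample correction
    if lam < 0.05:
        return 1.0  # the series converges poorly here; p is essentially 1
    total = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * total))


def ks_at_n(problems, n, n_ref=1000, alpha=0.05, seed=0):
    """Fraction of problems where the KS test fails to reject at level alpha.

    `problems` is a list of (model_sampler, reference_sampler) pairs; each
    sampler takes a random.Random instance and returns one number.
    The pair format, alpha, and n_ref are assumptions made for illustration.
    """
    rng = random.Random(seed)
    passed = 0
    for model_sampler, reference_sampler in problems:
        model = [model_sampler(rng) for _ in range(n)]
        reference = [reference_sampler(rng) for _ in range(n_ref)]
        d = ks_statistic(model, reference)
        if ks_pvalue(d, n, n_ref) >= alpha:
            passed += 1
    return passed / len(problems)


def uniform(rng):
    """Stand-in target distribution: uniform on [0, 1)."""
    return rng.random()


def collapsed(rng):
    """Stand-in for a model that always gives the single most plausible answer."""
    return 0.5


print(ks_at_n([(collapsed, uniform)], n=100))  # 0.0: the collapsed model is rejected
```

In this sketch a model that always returns the same answer scores 0 even though that answer is the mean of the target, which illustrates why calibration to the distribution, rather than plausibility of individual outputs, is what the metric rewards. Larger N makes the test more sensitive, so small mismatches that pass at small N are rejected at larger N.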