UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract

We introduce UnpredictaBench, a benchmark that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse toward a single plausible answer amounts to a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that uses the Kolmogorov-Smirnov statistical test to quantify how well a model's outputs approximate a black-box target distribution. KS@N is the rate at which the test fails to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Testing open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0% to over 20%. No model achieves over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, and getting it right is a necessary first step toward using LLMs as stand-ins for complex systems.
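To make the definition of KS@N concrete, the following is a minimal sketch of how such a score could be computed. It is an illustration under stated assumptions, not the benchmark's actual implementation. The abstract does not state the significance level, the size of the ground-truth sample, or how outcomes are encoded, so the value `alpha=0.05`, the asymptotic p-value approximation, the restriction to numeric outcomes, and the function names `ks_statistic`, `ks_pvalue`, and `ks_at_n` are all assumptions made here for illustration.

```python
import math


def ks_statistic(model_sample, reference_sample):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs."""
    a, b = sorted(model_sample), sorted(reference_sample)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples so that
        # ties are handled before the CDFs are compared.
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value from the Kolmogorov distribution.
    This is an approximation (assumed here); exact methods exist."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The survival function is numerically 1 for very small lam.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return max(0.0, min(1.0, 2.0 * total))


def ks_at_n(problems, alpha=0.05):
    """Fraction of problems whose model sample is NOT rejected.

    `problems` is a list of (model_sample, reference_sample) pairs, where
    each model sample has size N. `alpha` is an assumed significance level.
    """
    passed = 0
    for model_sample, reference_sample in problems:
        d = ks_statistic(model_sample, reference_sample)
        p = ks_pvalue(d, len(model_sample), len(reference_sample))
        if p >= alpha:
            passed += 1
    return passed / len(problems)
```

As a usage example, a model that collapses to a single answer, such as `[0.5] * 100` compared against 100 evenly spread reference values on [0, 1), is rejected by this test, while a sample that matches the reference passes. This also shows why larger N is harder: the test gains power as N grows, so smaller mismatches between the model's distribution and the target lead to rejection.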