Qworld: Question-Specific Evaluation Criteria for LLMs

Abstract

Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods either define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer to that question must address. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than criteria produced by prior methods. Applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences along dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
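The recursive expansion described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the abstract does not specify prompts, stopping rules, or how the model is called, so the `Node` type, the `expand` function, the `horizontal_rounds` parameter, and the `stub_propose` stand-in for an LLM call are all hypothetical names chosen for illustration. Hierarchical expansion is modeled as descending one level (question, scenario, perspective, criterion); horizontal expansion is modeled as repeatedly asking for more siblings at the same level.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Levels named in the abstract, ordered from root to leaf.
LEVELS = ("question", "scenario", "perspective", "criterion")


@dataclass
class Node:
    level: str
    text: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical interface: given a parent node and the sibling texts produced
# so far, return additional child texts. In practice this would wrap an LLM.
Proposer = Callable[[Node, List[str]], List[str]]


def expand(node: Node, propose: Proposer, horizontal_rounds: int = 2) -> Node:
    """Recursively expand a node down to binary criteria (assumed scheme)."""
    depth = LEVELS.index(node.level)
    if depth == len(LEVELS) - 1:
        return node  # criteria are leaves; nothing further to expand

    child_level = LEVELS[depth + 1]
    seen: List[str] = []
    # Horizontal expansion: keep asking for siblings the tree does not have yet.
    for _ in range(horizontal_rounds):
        added = False
        for text in propose(node, list(seen)):
            if text not in seen:
                seen.append(text)
                added = True
        if not added:
            break

    # Hierarchical expansion: descend one level for every child.
    node.children = [
        expand(Node(child_level, text), propose, horizontal_rounds)
        for text in seen
    ]
    return node


def leaf_criteria(node: Node) -> List[str]:
    """Collect the criterion texts at the leaves of the tree."""
    if node.level == "criterion":
        return [node.text]
    return [c for child in node.children for c in leaf_criteria(child)]


def stub_propose(parent: Node, existing: List[str]) -> List[str]:
    """Deterministic stand-in for an LLM: at most two children per node."""
    child_level = LEVELS[LEVELS.index(parent.level) + 1]
    if len(existing) >= 2:
        return []
    return [f"{child_level} {len(existing) + 1} of [{parent.text}]"]


root = expand(Node("question", "How should I manage a mild fever at home?"),
              stub_propose)
criteria = leaf_criteria(root)
```

With the stub above, each node receives two children, so the tree ends in eight criteria. A real proposer would return question-dependent scenarios, perspectives, and criteria, and the number of leaves would vary per question.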
