Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
February 19, 2025
Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber
cs.AI
Abstract
Multiple choice question answering (MCQA) is popular for LLM evaluation due
to its simplicity and human-like testing, but we argue for its reform. We first
reveal flaws in MCQA's format, as it struggles to: 1) test
generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge.
We instead advocate for generative formats based on human testing, where LLMs
construct and explain answers, better capturing user needs and knowledge while
remaining easy to score. We then show that even when MCQA is a useful format, its
datasets suffer from: leakage; unanswerability; shortcuts; and saturation. For
each issue, we give fixes from education, like rubrics to guide MCQ writing;
scoring methods to bridle guessing; and Item Response Theory to build harder
MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful
explanations), showing how our prior solutions better measure or address these
issues. While we do not need to desert MCQA, we encourage more efforts in
refining the task based on educational testing, advancing evaluation.
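As a concrete illustration of two of the education-derived fixes named in the abstract, the sketch below shows the standard textbook forms of formula scoring (each wrong answer is penalized so that random guessing has an expected score of zero) and the three-parameter-logistic Item Response Theory model used to characterize item difficulty. This is a minimal sketch of the classical formulations, not the paper's exact procedure; the function names and example numbers are illustrative assumptions.

```python
import math

def formula_score(num_right, num_wrong, num_options):
    """Formula scoring from educational testing: each wrong answer costs
    1/(k-1) points, so purely random guessing has an expected score of zero."""
    return num_right - num_wrong / (num_options - 1)

def p_correct_3pl(theta, a, b, c):
    """Three-parameter-logistic IRT model: probability that a test taker of
    ability `theta` answers correctly an item with discrimination `a`,
    difficulty `b`, and guessing floor `c`."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical 4-option benchmark: 70 right, 30 wrong out of 100 items.
print(formula_score(70, 30, 4))                        # 60.0: raw accuracy is discounted for guessing
# A hard item (difficulty b = 1.5) posed to an average-ability model (theta = 0).
print(round(p_correct_3pl(0.0, 1.2, 1.5, 0.25), 2))    # ~0.36: near the 0.25 guessing floor
```

Under these standard forms, items whose estimated difficulty parameter is high (and whose correct-response probability stays near the guessing floor for most models) are the candidates for building harder MCQs, while formula scoring keeps lucky guesses from inflating reported accuracy.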