Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
February 19, 2025
Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber
cs.AI
Abstract
Multiple choice question answering (MCQA) is popular for LLM evaluation due
to its simplicity and human-like testing, but we argue for its reform. We first
reveal flaws in MCQA's format, as it struggles to: 1) test
generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge.
We instead advocate for generative formats based on human testing, where LLMs
construct and explain answers, better capturing user needs and knowledge while
remaining easy to score. We then show that even when MCQA is a useful format, its
datasets suffer from: leakage; unanswerability; shortcuts; and saturation. For
each issue, we give fixes from education, like rubrics to guide MCQ writing;
scoring methods to bridle guessing; and Item Response Theory to build harder
MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful
explanations), showing how our prior solutions better measure or address these
issues. While we do not need to desert MCQA, we encourage more efforts in
refining the task based on educational testing, advancing evaluation.
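As a concrete illustration of two of the education-derived fixes named in the abstract, the sketch below shows the standard textbook forms of formula scoring (each wrong answer is penalized so that random guessing has an expected score of zero) and the three-parameter-logistic Item Response Theory model used to characterize item difficulty. This is a minimal sketch of the classical formulations, not the paper's exact procedure; the function names and example numbers are illustrative assumptions.

```python
import math

def formula_score(num_right, num_wrong, num_options):
    """Formula scoring from educational testing: each wrong answer costs
    1/(k-1) points, so purely random guessing has an expected score of zero."""
    return num_right - num_wrong / (num_options - 1)

def p_correct_3pl(theta, a, b, c):
    """Three-parameter-logistic IRT model: probability that a test taker of
    ability `theta` answers correctly an item with discrimination `a`,
    difficulty `b`, and guessing floor `c`."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical 4-option benchmark: 70 right, 30 wrong out of 100 items.
print(formula_score(70, 30, 4))                        # 60.0: raw accuracy is discounted for guessing
# A hard item (difficulty b = 1.5) posed to an average-ability model (theta = 0).
print(round(p_correct_3pl(0.0, 1.2, 1.5, 0.25), 2))    # ~0.36: near the 0.25 guessing floor
```

Under these standard forms, items whose estimated difficulty parameter is high (and whose correct-response probability stays near the guessing floor for most models) are the candidates for building harder MCQs, while formula scoring keeps lucky guesses from inflating reported accuracy.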