Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Abstract

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation and subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, which better capture user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we give fixes from education, such as rubrics to guide MCQ writing, scoring methods to rein in guessing, and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to abandon MCQA, we encourage more efforts in refining the task based on educational testing to advance evaluations.

