WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

February 25, 2025
Authors: Ahmed Elhady, Eneko Agirre, Mikel Artetxe
cs.AI

Abstract

We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a technique often used in educational tests. We show that WiCkeD can be applied automatically to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. Model performance drops by 12.1 points on average relative to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also reveals that some models are more sensitive to the extra reasoning required, providing information that the original benchmarks do not. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
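
The core transformation described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal sketch under stated assumptions: the function name `wickedize` is hypothetical, the replaced option is chosen uniformly at random, and "None of the above" is placed last in the list. When the dropped option happens to be the correct one, "None of the above" becomes the gold answer.

```python
import random

NONE_OF_THE_ABOVE = "None of the above"


def wickedize(choices, answer_idx, rng):
    """Replace one randomly chosen option with "None of the above".

    Hypothetical helper, not the paper's code. Assumptions: the option to
    drop is sampled uniformly, and "None of the above" is appended as the
    last option. Returns (new_choices, new_answer_idx).
    """
    if not 0 <= answer_idx < len(choices):
        raise ValueError("answer_idx out of range")

    drop = rng.randrange(len(choices))  # index of the option to remove
    kept = [c for i, c in enumerate(choices) if i != drop]
    new_choices = kept + [NONE_OF_THE_ABOVE]

    if drop == answer_idx:
        # The correct option was removed, so "None of the above" is now correct.
        new_answer_idx = len(new_choices) - 1
    elif drop < answer_idx:
        # An earlier option was removed, so the correct one shifts up by one.
        new_answer_idx = answer_idx - 1
    else:
        new_answer_idx = answer_idx
    return new_choices, new_answer_idx


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variant is reproducible
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    new_options, gold = wickedize(options, answer_idx=0, rng=rng)
    print(new_options, "gold:", new_options[gold])
```

The number of options stays the same, so existing evaluation harnesses that expect a fixed number of choices would keep working; the model must now rule out every remaining option before it can select "None of the above".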
