LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
March 4, 2025
Authors: Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacs, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
cs.AI
Abstract
Effective evaluation of the reasoning capabilities of large language models
(LLMs) is susceptible to overestimation due to data exposure of evaluation
benchmarks. We introduce a framework for producing linguistic reasoning
problems that reduces the effect of memorisation in model performance estimates
and apply this framework to develop LINGOLY-TOO, a challenging evaluation
benchmark for linguistic reasoning. By developing orthographic templates, we
dynamically obfuscate the writing systems of real languages to generate
numerous question variations. These variations preserve the reasoning steps
required for each solution while reducing the likelihood of specific problem
instances appearing in model training data. Our experiments demonstrate that
frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with
advanced reasoning. Our analysis also shows that LLMs exhibit noticeable
variance in accuracy across permutations of the same problem, and on average
perform better on questions appearing in their original orthography. Our
findings highlight the opaque nature of response generation in LLMs and provide
evidence that prior data exposure contributes to overestimating the reasoning
capabilities of frontier models.
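The obfuscation mechanism described in the abstract can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the benchmark's orthographic templates are hand-crafted per language, whereas the sketch below assumes a toy single-character grapheme inventory and hypothetical helper names (`make_orthographic_template`, `obfuscate`). It shows the core idea of applying a consistent grapheme permutation, which changes surface forms while preserving the structure a solver must reason over.

```python
import random

def make_orthographic_template(graphemes: list[str], seed: int) -> dict[str, str]:
    # Hypothetical helper: build a substitution map by randomly
    # permuting the language's grapheme inventory. A seeded RNG
    # makes each problem variant reproducible.
    rng = random.Random(seed)
    shuffled = list(graphemes)
    rng.shuffle(shuffled)
    return dict(zip(graphemes, shuffled))

def obfuscate(text: str, mapping: dict[str, str]) -> str:
    # Apply the substitution consistently; characters outside the
    # inventory (spaces, punctuation) pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text)

# Toy example: two obfuscated variants of the same problem text.
graphemes = list("aehlmnpt")  # assumed inventory, not a real template
problem = "mata pana hale"
for seed in (1, 2):
    print(obfuscate(problem, make_orthographic_template(graphemes, seed)))
```

Because each mapping is a bijection applied uniformly across a problem, every variant demands the same reasoning steps as the original while its surface forms are unlikely to appear in training data, which is the property the benchmark relies on.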