LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
March 4, 2025
Authors: Jude Khouja, Karolina Korgul, Simi Hellsten, Lingyi Yang, Vlad Neacs, Harry Mayne, Ryan Kearns, Andrew Bean, Adam Mahdi
cs.AI
Abstract
Effective evaluation of the reasoning capabilities of large language models
(LLMs) is susceptible to overestimation due to data exposure of evaluation
benchmarks. We introduce a framework for producing linguistic reasoning
problems that reduces the effect of memorisation in model performance estimates
and apply this framework to develop LINGOLY-TOO, a challenging evaluation
benchmark for linguistic reasoning. By developing orthographic templates, we
dynamically obfuscate the writing systems of real languages to generate
numerous question variations. These variations preserve the reasoning steps
required for each solution while reducing the likelihood of specific problem
instances appearing in model training data. Our experiments demonstrate that
frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with
advanced reasoning. Our analysis also shows that LLMs exhibit noticeable
variance in accuracy across permutations of the same problem, and on average
perform better on questions appearing in their original orthography. Our
findings highlight the opaque nature of response generation in LLMs and provide
evidence that prior data exposure contributes to overestimating the reasoning
capabilities of frontier models.
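The obfuscation mechanism described in the abstract can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the benchmark's orthographic templates are hand-crafted per language, whereas the sketch below assumes a toy single-character grapheme inventory and hypothetical helper names (`make_orthographic_template`, `obfuscate`). It shows the core idea of applying a consistent grapheme permutation, which changes surface forms while preserving the structure a solver must reason over.

```python
import random

def make_orthographic_template(graphemes: list[str], seed: int) -> dict[str, str]:
    # Hypothetical helper: build a substitution map by randomly
    # permuting the language's grapheme inventory. A seeded RNG
    # makes each problem variant reproducible.
    rng = random.Random(seed)
    shuffled = list(graphemes)
    rng.shuffle(shuffled)
    return dict(zip(graphemes, shuffled))

def obfuscate(text: str, mapping: dict[str, str]) -> str:
    # Apply the substitution consistently; characters outside the
    # inventory (spaces, punctuation) pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text)

# Toy example: two obfuscated variants of the same problem text.
graphemes = list("aehlmnpt")  # assumed inventory, not a real template
problem = "mata pana hale"
for seed in (1, 2):
    print(obfuscate(problem, make_orthographic_template(graphemes, seed)))
```

Because each mapping is a bijection applied uniformly across a problem, every variant demands the same reasoning steps as the original while its surface forms are unlikely to appear in training data, which is the property the benchmark relies on.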