Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses
August 1, 2024
Authors: Gabriele Sarti, Tommaso Caselli, Malvina Nissim, Arianna Bisazza
cs.AI
Abstract
Rebuses are puzzles requiring constrained multi-step reasoning to identify a
hidden phrase from a set of images and letters. In this work, we introduce a
large collection of verbalized rebuses for the Italian language and use it to
assess the rebus-solving capabilities of state-of-the-art large language
models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly
on this task, ad-hoc fine-tuning seems to improve models' performance. However,
we find that performance gains from training are largely motivated by
memorization. Our results suggest that rebus solving remains a challenging test
bed to evaluate large language models' linguistic proficiency and sequential
instruction-following skills.