Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Abstract

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that the performance gains from training are largely driven by memorization. Our results suggest that rebus solving remains a challenging test bed for evaluating large language models' linguistic proficiency and sequential instruction-following skills.
