LTD-Bench: Evaluating Large Language Models by Letting Them Draw

November 4, 2025
Authors: Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji
cs.AI

Abstract

Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research: they rely on opaque numerical metrics that conceal fundamental limitations in spatial reasoning and give no intuitive picture of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical ability, particularly for applications that require an understanding of the physical world. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores into directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively more difficult levels, methodically evaluating both directions of the language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs that achieve impressive results on traditional benchmarks show profound deficiencies in establishing bidirectional mappings between language and spatial concepts, a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.
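To make the dot-matrix idea concrete, here is a minimal sketch of how a text-only model output could be turned into something a reader can inspect at a glance. The abstract does not specify the benchmark's actual grid format, so the encoding used here (rows of `0` and `1` characters) and the helper names `parse_dot_matrix` and `render` are assumptions made purely for illustration, not the authors' implementation.

```python
def parse_dot_matrix(text):
    """Parse rows of '0'/'1' characters into a rectangular grid of ints.

    Assumed encoding (hypothetical, not taken from the paper):
    one row per line, '1' = filled dot, '0' = empty dot.
    """
    rows = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty dot matrix")
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            raise ValueError("rows have different lengths")
        if any(ch not in "01" for ch in row):
            raise ValueError("only '0' and '1' are allowed")
    return [[int(ch) for ch in row] for row in rows]


def render(grid, on="#", off="."):
    """Render the grid as text so a human can judge the shape directly."""
    return "\n".join("".join(on if cell else off for cell in row) for row in grid)


# Hypothetical model response to a prompt such as "draw the letter O on a 5x5 grid".
model_output = """
01110
10001
10001
10001
01110
"""

grid = parse_dot_matrix(model_output)
print(render(grid))
```

Rendering the grid rather than scoring it mirrors the abstract's point: a malformed shape is obvious to a non-expert without reading any metric.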