LTD-Bench: Evaluating Large Language Models by Letting Them Draw
November 4, 2025
Authors: Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji
cs.AI
Abstract
Current evaluation paradigms for large language models (LLMs) represent a
critical blind spot in AI research: they rely on opaque numerical metrics that
conceal fundamental limitations in spatial reasoning while providing no
intuitive understanding of model capabilities. This deficiency creates a
dangerous disconnect between reported performance and practical abilities,
particularly for applications requiring physical world understanding. We
introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation
from abstract scores to directly observable visual outputs by requiring models
to generate drawings through dot matrices or executable code. This approach
makes spatial reasoning limitations immediately apparent even to non-experts,
bridging the fundamental gap between statistical performance and intuitive
assessment. LTD-Bench implements a comprehensive methodology with complementary
generation tasks (testing spatial imagination) and recognition tasks (assessing
spatial perception) across three progressively challenging difficulty levels,
methodically evaluating both directions of the critical language-spatial
mapping. Our extensive experiments with state-of-the-art models expose an
alarming capability gap: even LLMs achieving impressive results on traditional
benchmarks demonstrate profound deficiencies in establishing bidirectional
mappings between language and spatial concepts, a fundamental limitation that
undermines their potential as genuine world models. Furthermore, LTD-Bench's
visual outputs enable powerful diagnostic analysis, offering a potential
approach to investigate model similarity.
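
To make the idea of a dot-matrix drawing concrete, here is a minimal Python sketch of how a textual model response might be turned into something a person can inspect at a glance. The response format (rows of `0`/`1` characters), the example prompt, and the helper names `parse_dot_matrix` and `render` are all assumptions made for illustration; the abstract does not specify LTD-Bench's actual task format or tooling.

```python
def parse_dot_matrix(text):
    """Parse rows of '0'/'1' characters into a rectangular grid of ints.

    The 0/1 row format is an assumed, hypothetical response format.
    """
    rows = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty dot matrix")
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            raise ValueError("rows have different lengths")
        if any(ch not in "01" for ch in row):
            raise ValueError("only '0' and '1' are allowed")
    return [[int(ch) for ch in row] for row in rows]


def render(grid, on="#", off="."):
    """Render the grid as text so the drawing can be checked by eye."""
    return "\n".join("".join(on if cell else off for cell in row) for row in grid)


# Hypothetical model response to a prompt such as
# "Draw the letter T on a 5x5 dot matrix".
response = """
11111
00100
00100
00100
00100
"""

grid = parse_dot_matrix(response)
print(render(grid))
```

Printing the rendered grid shows whether the shape is recognisable as the requested letter, which is the kind of direct visual check the abstract contrasts with opaque numerical scores.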