DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Abstract

Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for task automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example in civil engineering. We therefore propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representative task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. It is an open-source benchmark that rigorously tests AI agents' proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.
