DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Abstract

Large Language Model (LLM) agents have shown great potential for solving real-world problems and are a promising solution for task automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example in civil engineering. We therefore propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representative task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark that rigorously tests AI agents' proficiency in interpreting intricate, long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and to identify improvement targets for integrating LLMs into engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.

