DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Abstract

Large Language Model (LLM) agents have shown great potential for solving real-world problems and are a promising route to task automation in industry. However, more benchmarks are needed to evaluate automation agents systematically from an industrial perspective, for example in civil engineering. We therefore propose DrafterBench for the comprehensive evaluation of LLM agents on technical drawing revision, a representative task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. It is an open-source benchmark that rigorously tests AI agents' proficiency in interpreting intricate, long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and to identify improvement targets for integrating LLMs into engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.
