AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

Abstract

Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. Because these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations of LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
