AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

Abstract

Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
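The three-agent design described above can be pictured with a minimal sketch. The abstract names the agents (Examiner, Questioner, Assessor) but does not specify their interfaces, so everything below is an assumption made for illustration: the division of labor (the Examiner proposes topics to test, the Questioner writes questions per topic, the Assessor judges the target model's answers), the single-pass loop, the function names, and the definition of the identification success rate as the fraction of questions on which a failure was found. The toy callables stand in for LLM calls. For the actual implementation, see the authors' repository.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    """One question posed to the target model and the verdict on its answer."""
    topic: str
    question: str
    answer: str
    failed: bool


def detect_weaknesses(
    task: str,
    examiner: Callable[[str], List[str]],    # hypothetical: task -> topics to probe
    questioner: Callable[[str], List[str]],  # hypothetical: topic -> questions
    assessor: Callable[[str, str], bool],    # hypothetical: (question, answer) -> failed?
    target_model: Callable[[str], str],      # model under test: question -> answer
) -> List[Finding]:
    """Run one pass of an assumed examiner -> questioner -> assessor pipeline."""
    findings: List[Finding] = []
    for topic in examiner(task):
        for question in questioner(topic):
            answer = target_model(question)
            failed = assessor(question, answer)
            findings.append(Finding(topic, question, answer, failed))
    return findings


def identification_success_rate(findings: List[Finding]) -> float:
    """Fraction of questions that exposed a failure (assumed definition)."""
    if not findings:
        return 0.0
    return sum(f.failed for f in findings) / len(findings)


# Toy stand-ins for the LLM-powered agents, for demonstration only.
def toy_examiner(task: str) -> List[str]:
    return ["addition", "subtraction"]


def toy_questioner(topic: str) -> List[str]:
    return [f"{topic}: 2 and 1", f"{topic}: 5 and 3"]


def toy_target(question: str) -> str:
    # Pretend the target model only knows how to handle addition questions.
    return "ok" if question.startswith("addition") else "unknown"


def toy_assessor(question: str, answer: str) -> bool:
    return answer == "unknown"


if __name__ == "__main__":
    results = detect_weaknesses(
        "arithmetic", toy_examiner, toy_questioner, toy_assessor, toy_target
    )
    print(identification_success_rate(results))  # prints 0.5
```

In this toy run, the two subtraction questions are flagged and the two addition questions pass. A real system would presumably loop, feeding the Assessor's findings back to the other agents to probe deeper, which is one way to read the "collaboration" the abstract mentions. The flagged questions are also the kind of targeted data the abstract says can guide model improvement.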
