

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

June 24, 2024
作者: Jiale Cheng, Yida Lu, Xiaotao Gu, Pei Ke, Xiao Liu, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang
cs.AI

Abstract

Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
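The abstract describes a round of weakness detection as a collaboration among three LLM-powered agents: an Examiner that maps out sub-skills of a task, a Questioner that generates probing test cases, and an Assessor that judges the target model's answers. The following is a minimal sketch of that loop; the agent internals here are hypothetical stubs (in the paper each role is played by an LLM), and only the three role names and their interaction come from the abstract.

```python
def examiner(task: str) -> list[str]:
    """Propose sub-skills of the task to probe (stub taxonomy)."""
    return [f"{task}: constraint following", f"{task}: format compliance"]

def questioner(skill: str) -> str:
    """Generate one probing test question for a sub-skill (stub template)."""
    return f"Write an answer exercising '{skill}'."

def assessor(question: str, answer: str) -> bool:
    """Judge whether the answer reveals a weakness.
    Stub heuristic: treat very short answers as failures; the paper's
    Assessor is an LLM judge, not a length check."""
    return len(answer.strip()) < 10

def autodetect_round(task: str, target_model) -> list[str]:
    """Run one detection round; return the probes the target model failed."""
    weaknesses = []
    for skill in examiner(task):
        question = questioner(skill)
        if assessor(question, target_model(question)):
            weaknesses.append(question)
    return weaknesses

# Usage: a toy "model" that answers tersely, so both probes are flagged.
weak_model = lambda q: "ok"
found = autodetect_round("instruction-following", weak_model)
```

In the paper's setting, the flagged questions then serve as targeted training data for model improvement, which the abstract contrasts with untargeted augmentation such as Self-Instruct.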
