此乃汝之Doge,若汝悦之:探索大型語言模型混合中的欺騙與魯棒性
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
March 7, 2025
作者: Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
cs.AI
摘要
大型語言模型(LLM)代理混合架構(MoA)通過在推理時利用多個LLM的協作,在AlpacaEval 2.0等知名基準測試中取得了最先進的性能。儘管取得了這些成功,但對MoA的安全性和可靠性的評估仍然缺失。我們首次全面研究了MoA在面對故意提供誤導性回應的欺騙性LLM代理時的魯棒性。我們考察了欺騙性信息的傳播、模型大小和信息可用性等因素,並揭示了關鍵的脆弱性。在AlpacaEval 2.0上,流行的LLaMA 3.1-70B模型與3層MoA(6個LLM代理)結合時,長度控制勝率(LC WR)達到49.2%。然而,我們證明,僅在MoA中引入一個精心指示的欺騙性代理,即可將性能降低至37.9%,從而完全抵消了MoA的所有增益。在QuALITY這項多項選擇理解任務中,影響同樣嚴重,準確率驚人地下降了48.5%。部分受到歷史上威尼斯總督投票過程的啟發,該過程旨在最小化影響和欺騙,我們提出了一系列無監督防禦機制,能夠恢復大部分損失的性能。
English
Mixture of large language model (LLMs) Agents (MoA) architectures achieve
state-of-the-art performance on prominent benchmarks like AlpacaEval 2.0 by
leveraging the collaboration of multiple LLMs at inference time. Despite these
successes, an evaluation of the safety and reliability of MoA is missing. We
present the first comprehensive study of MoA's robustness against deceptive LLM
agents that deliberately provide misleading responses. We examine factors like
the propagation of deceptive information, model size, and information
availability, and uncover critical vulnerabilities. On AlpacaEval 2.0, the
popular LLaMA 3.1-70B model achieves a length-controlled Win Rate (LC WR) of
49.2% when coupled with 3-layer MoA (6 LLM agents). However, we demonstrate
that introducing only a single carefully-instructed deceptive agent
into the MoA can reduce performance to 37.9%, effectively nullifying all MoA
gains. On QuALITY, a multiple-choice comprehension task, the impact is also
severe, with accuracy plummeting by a staggering 48.5%. Inspired in part by the
historical Doge of Venice voting process, designed to minimize influence and
deception, we propose a range of unsupervised defense mechanisms that recover
most of the lost performance.Summary
AI-Generated Summary