How Far Can They Go? Red-Teaming Large Language Models for Online Influence

Abstract

As agents built on large language models (LLMs) increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. To that end, we focus on locally deployed open-source LLMs rather than frontier API-only models, because they better match the operational constraints of privacy-conscious malicious actors operating in social media environments. We introduce an empirical red-teaming framework for measuring an LLM's Overton Window (OW), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to narrow as model size increases, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, which motivates a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.