How Far Will They Go? Red-Teaming Large Language Models for Online Influence

Abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. To that end, we focus on locally deployed open-source LLMs rather than frontier API-only models, because the former better match the operational constraints of privacy-conscious malicious actors operating in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract as model size increases, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, which motivates a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and can help future researchers design stronger countermeasures against LLM-enabled influence campaigns.