How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. To that end, we focus on locally deployed open-source LLMs rather than frontier API-only models, because they better match the operational constraints of privacy-conscious malicious actors operating in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how far simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, which motivates a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and can help future researchers design stronger countermeasures against LLM-enabled influence campaigns.
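
The OW definition above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the paper's actual procedure: it assumes stances are placed on an integer left-to-right scale, that "reliably express" means a compliance rate at or above a chosen threshold, and that the window is summarized by its leftmost and rightmost reliable stances. The names `overton_window` and `window_width`, the 0.8 threshold, and all numbers are hypothetical and are not results from the study.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stance scale: -2 (far left) ... +2 (far right).
# Each rate is the assumed fraction of prompts at that stance for which
# the model produced the requested content instead of refusing.
Rates = Dict[int, float]
Window = Optional[Tuple[int, int]]


def overton_window(rates: Rates, threshold: float = 0.8) -> Window:
    """Return (leftmost, rightmost) stance with rate >= threshold, or None.

    Assumption: the window is summarized by its extremes, so any gap
    between reliable stances is ignored.
    """
    reliable = [stance for stance, rate in rates.items() if rate >= threshold]
    if not reliable:
        return None
    return min(reliable), max(reliable)


def window_width(window: Window) -> int:
    """Number of stance positions spanned by the window (0 if empty)."""
    return 0 if window is None else window[1] - window[0] + 1


# Made-up illustrative rates, not measurements from the paper.
baseline = {-2: 0.95, -1: 0.98, 0: 0.99, 1: 0.85, 2: 0.30}
jailbroken = {-2: 0.97, -1: 0.99, 0: 0.99, 1: 0.95, 2: 0.90}

# Jailbreak effect expressed as the change in window width.
expansion = window_width(overton_window(jailbroken)) - window_width(overton_window(baseline))
print(overton_window(baseline), overton_window(jailbroken), expansion)
```

Under these assumptions, the effect of a jailbreak is just the difference in window width before and after it is applied. A stricter variant could require the reliable stances to be contiguous.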