How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. To this end, we focus on locally deployed open-source LLMs rather than frontier API-only models, because they better match the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework that measures LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and quantifies how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, which motivates a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.