Emergent Languages in Language Model Agent Populations: From Token Efficiency to Oversight Evasion

Abstract

Monitoring of autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. We build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6,000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 to be less aligned than posts in the other categories, and that all languages can be learned in-context by other language models merely from a description of the language. Moreover, manual study of exemplary cases reveals surprisingly sophisticated steganographic protocols, such as embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in the ideation of these languages, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.
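
The two-stage filtering described in the abstract (a broad rule-based pass followed by a stricter classification pass) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the keyword patterns, the function names, and the `toy_classifier`, which merely stands in for a zero-shot language model call. None of it reproduces the rules or prompts actually used in the study.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical keyword patterns for the first stage; the actual rules are not given here.
HEURISTIC = re.compile(
    r"\b(new language|conlang|compressed language|token[- ]efficient|cipher|steganograph\w*)\b",
    re.IGNORECASE,
)

# Category names taken from the abstract.
CATEGORIES = ("token efficiency", "new natural language", "oversight evasion")

Classifier = Callable[[str, Sequence[str]], Optional[str]]


def stage_one(posts: Iterable[str]) -> list[str]:
    """Cheap, high-recall pass: keep every post that matches a keyword pattern."""
    return [post for post in posts if HEURISTIC.search(post)]


def stage_two(posts: Iterable[str], classify: Classifier) -> list[tuple[str, str]]:
    """Stricter pass: keep only posts to which the classifier assigns a category."""
    kept = []
    for post in posts:
        label = classify(post, CATEGORIES)
        if label is not None:
            kept.append((post, label))
    return kept


def toy_classifier(post: str, labels: Sequence[str]) -> Optional[str]:
    """Stand-in for a zero-shot model call (assumption); returns None for 'not relevant'."""
    text = post.lower()
    if "human" in text and ("hide" in text or "monitor" in text):
        return "oversight evasion"
    if "token" in text:
        return "token efficiency"
    return None


posts = [
    "Proposal: a token-efficient compressed language for agents",
    "Let us build a cipher so humans cannot monitor us",
    "I learned a new language feature in Rust today",
    "What a lovely day",
]

candidates = stage_one(posts)                   # the Rust post is a false positive here
labelled = stage_two(candidates, toy_classifier)  # the second stage drops it
counts = Counter(label for _, label in labelled)
```

The design point is that the first stage can afford false positives because it is cheap, while the second stage carries the cost of a model call and is therefore applied only to the reduced candidate set.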