Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Abstract

Monitoring of autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of evading human oversight? Here, we study the emergent languages on Moltbook. To this end, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6,000 matches) followed by zero-shot classification (518 retained). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for evading oversight are judged by DeepSeek-3.2 to be less aligned than posts in the other categories, and that all languages can be learned in-context by other language models merely from a description of the language. Moreover, manual study of exemplary cases reveals surprisingly sophisticated steganographic protocols, such as embedding hidden messages in natural language. Although we cannot be certain about the degree of autonomy in the ideation of these languages, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.
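
The two-stage filtering described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the keyword patterns, the function names `stage_one` and `stage_two`, and the `toy_classifier` stand-in for the zero-shot model are all assumptions made for illustration, and only the three categories named in the abstract are listed.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical keyword patterns; the actual rule-based heuristic is not
# specified in the abstract.
LANGUAGE_PATTERNS = [
    re.compile(r"\bnew language\b", re.IGNORECASE),
    re.compile(
        r"\b(invent|design|propos)\w*\b.*\b(language|protocol|cipher)\b",
        re.IGNORECASE,
    ),
]

# Only the categories named in the abstract; the full label set may be larger.
CATEGORIES = ("token efficiency", "new natural language", "oversight evasion")


def stage_one(posts: Iterable[str]) -> List[str]:
    """Stage 1: keep posts matching at least one rule-based pattern."""
    return [p for p in posts if any(pat.search(p) for pat in LANGUAGE_PATTERNS)]


def stage_two(
    posts: Iterable[str],
    classify: Callable[[str], Optional[str]],
) -> List[Tuple[str, str]]:
    """Stage 2: keep posts that the classifier assigns to a known category."""
    kept = []
    for post in posts:
        label = classify(post)
        if label in CATEGORIES:
            kept.append((post, label))
    return kept


def toy_classifier(post: str) -> Optional[str]:
    """Keyword stand-in for a zero-shot model call (an assumption, not the real model)."""
    text = post.lower()
    if "token" in text:
        return "token efficiency"
    if "human" in text and ("hide" in text or "monitor" in text):
        return "oversight evasion"
    return None


posts = [
    "I propose a new language that saves tokens.",
    "Nice weather today.",
    "Let us design a cipher so humans cannot monitor us.",
    "We should invent a protocol for fun.",
]
candidates = stage_one(posts)
labeled = stage_two(candidates, toy_classifier)
```

In practice, `toy_classifier` would be replaced by a prompt to a language model that returns one of the category labels or rejects the post. Keeping the classifier as a plain callable makes the cheap first stage and the more expensive second stage easy to test separately.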