Emergent Compositional Communication for Latent World Properties

Abstract

Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout accuracy 98.3%). Controls confirm that multi-agent structure, not bandwidth or temporal coverage, drives this effect. Causal interventions disrupt properties selectively (~15% drop on the targeted property, <3% on the others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass-comparison accuracy on unseen objects, a +11.2% contribution from temporal dynamics beyond static appearance, agent-scaling compositionality that replicates at 90% for 4 agents, and causal intervention that extends to real video (d=1.87, p=0.022).
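As a rough illustration of the compositionality score quoted above, the sketch below computes positional disentanglement (PosDis) in the form commonly used in the emergent-communication literature: for each message position, the gap between the two largest mutual informations with any attribute, normalized by the entropy of that position, averaged over positions. The abstract does not state the exact estimator used, so this definition, the function names (`entropy`, `mutual_information`, `posdis`), and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(messages, attributes):
    """Assumed PosDis definition: mean over positions of the normalized gap
    between the two attributes most informative about that position.
    Requires at least two attributes."""
    n_pos, n_attr = len(messages[0]), len(attributes[0])
    scores = []
    for j in range(n_pos):
        symbols = [m[j] for m in messages]
        h = entropy(symbols)
        if h == 0.0:
            continue  # a constant position carries no information; skip it
        mis = sorted(
            (mutual_information(symbols, [a[k] for a in attributes])
             for k in range(n_attr)),
            reverse=True,
        )
        scores.append((mis[0] - mis[1]) / h)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): three attributes with three levels each,
# standing in for elasticity, friction, and mass ratio.
grid = [(e, f, m) for e in range(3) for f in range(3) for m in range(3)]

# One symbol per attribute: each position encodes exactly one property.
compositional = [tuple(a) for a in grid]

# Every position encodes a mix of all three properties.
entangled = [((e + f + m) % 3,) * 3 for (e, f, m) in grid]

score_compositional = posdis(compositional, grid)
score_entangled = posdis(entangled, grid)
```

On this toy grid the one-symbol-per-property protocol scores at the top of the scale, while the mixed protocol scores near zero because no single attribute is informative about any position on its own.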
