Emergent Compositional Communication for Latent World Properties
March 18, 2026
Author: Tomek Kaszyński
cs.AI
Abstract
Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm multi-agent structure -- not bandwidth or temporal coverage -- drives this effect. Causal intervention shows surgical property disruption (~15% drop on targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially-visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass-comparison accuracy on unseen objects, temporal dynamics contributing +11.2% beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).
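The Gumbel-Softmax bottleneck mentioned above can be sketched as follows: a speaker's logits for each message slot are perturbed with Gumbel noise and discretised into a one-hot symbol (the straight-through variant commonly used for emergent-communication messages). This is a minimal illustrative sketch, not the paper's implementation; the slot count (one per latent property), vocabulary size, and temperature are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, hard=True, rng=rng):
    """Sample a (near) one-hot symbol per message slot.

    Adds Gumbel(0, 1) noise to the logits and applies a
    temperature-scaled softmax; with hard=True the forward output is
    exactly one-hot, as in the straight-through estimator.
    """
    # Gumbel(0, 1) noise via inverse transform sampling.
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = logits + g
    # Numerically stable softmax((logits + g) / tau).
    z = np.exp((y - y.max(axis=-1, keepdims=True)) / tau)
    soft = z / z.sum(axis=-1, keepdims=True)
    if not hard:
        return soft
    # Forward pass: hard one-hot; in an autodiff framework gradients
    # would flow through `soft` instead.
    onehot = np.zeros_like(soft)
    onehot[np.arange(soft.shape[0]), soft.argmax(axis=-1)] = 1.0
    return onehot

# Assumed toy setup: a 3-slot message, one slot per latent property
# (elasticity, friction, mass ratio), over a vocabulary of 8 symbols.
logits = rng.normal(size=(3, 8))
msg = gumbel_softmax(logits, tau=0.5)
```

A positionally disentangled protocol in this setting is one where, across episodes, each of the three slots comes to encode exactly one of the three properties, which is what the PosDis score in the abstract quantifies.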