Emergent Compositional Communication for Latent World Properties

Abstract

Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout accuracy 98.3%). Controls confirm that multi-agent structure, not bandwidth or temporal coverage, drives this effect. Causal intervention shows property-specific disruption (~15% drop on the targeted property, <3% on the others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on Physics 101 real camera footage confirms 85.6% mass-comparison accuracy on unseen objects, a +11.2% contribution from temporal dynamics beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).
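The abstract reports compositionality as a PosDis score. As a rough illustration only, the sketch below computes positional disentanglement in its commonly used form: for each message position, the gap between the two largest mutual informations with any property, normalized by that position's entropy, averaged over non-constant positions. It is an assumption that the paper's PosDis follows this definition. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(attributes, messages):
    """Positional disentanglement (assumed definition, see lead-in).

    attributes: list of tuples, one discrete value per property (at least 2).
    messages:   list of tuples, one discrete symbol per message position.
    """
    n_attrs = len(attributes[0])
    n_pos = len(messages[0])
    total, non_constant = 0.0, 0
    for j in range(n_pos):
        symbols = [m[j] for m in messages]
        h_j = entropy(symbols)
        if h_j <= 0.0:
            continue  # a constant position carries no information; skip it
        # Mutual information between this position and every property,
        # sorted so that mi[0] is the best match and mi[1] the runner-up.
        mi = sorted(
            (mutual_information([a[i] for a in attributes], symbols)
             for i in range(n_attrs)),
            reverse=True,
        )
        total += (mi[0] - mi[1]) / h_j
        non_constant += 1
    return total / non_constant if non_constant else 0.0


# Hypothetical toy data: three properties, each binned into 3 levels.
toy_attributes = list(product(range(3), repeat=3))
# Position k encodes property k exactly.
perfect_messages = [tuple(a) for a in toy_attributes]
# Every position mixes two properties, so none is tied to a single one.
entangled_messages = [((a + b) % 3, (b + c) % 3, (a + c) % 3)
                      for a, b, c in toy_attributes]

print(posdis(toy_attributes, perfect_messages))
print(posdis(toy_attributes, entangled_messages))
```

On this toy data the first score should be close to 1 and the second close to 0, which is the contrast a value such as PosDis=0.999 is meant to convey: each message position is informative about one property and nearly uninformative about the rest.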
