Emergent Compositional Communication for Latent World Properties

Abstract

Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, holdout 98.3%). Controls confirm that the multi-agent structure, rather than bandwidth or temporal coverage, drives this effect. Causal intervention shows targeted property disruption (~15% drop on the targeted property, <3% on the others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on real camera footage from Physics 101 confirms 85.6% mass-comparison accuracy on unseen objects, with temporal dynamics contributing +11.2% beyond static appearance, agent-scaling compositionality replicating at 90% for 4 agents, and causal intervention extending to real video (d=1.87, p=0.022).
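
As an illustration of the compositionality score quoted above, the following is a minimal sketch of positional disentanglement (PosDis) under its commonly used definition: for each message position, take the gap between the highest and second-highest mutual information with any attribute, normalize by the entropy of that position, and average over positions. The function names (`entropy`, `mutual_information`, `posdis`) and the toy data are assumptions made for this example; the abstract does not specify the exact implementation used to obtain PosDis=0.999.

```python
from collections import Counter
from itertools import product

import numpy as np


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = np.array(list(Counter(xs).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(messages, attributes):
    """Positional disentanglement (hypothetical helper, assumed definition).

    messages:   (n_samples, n_positions) array of discrete symbols
    attributes: (n_samples, n_attributes) array of discrete property bins
    """
    messages = np.asarray(messages)
    attributes = np.asarray(attributes)
    if attributes.shape[1] < 2:
        raise ValueError("PosDis needs at least two attributes")
    scores = []
    for j in range(messages.shape[1]):
        symbols = messages[:, j].tolist()
        h = entropy(symbols)
        if h == 0.0:
            continue  # a constant position carries no information; skip it
        mi = sorted(
            (mutual_information(symbols, attributes[:, k].tolist())
             for k in range(attributes.shape[1])),
            reverse=True,
        )
        scores.append((mi[0] - mi[1]) / h)
    return float(np.mean(scores)) if scores else 0.0


# Toy data: three binary properties (stand-ins for binned elasticity,
# friction, and mass ratio), enumerated over all 8 combinations.
attrs = np.array(list(product([0, 1], repeat=3)))

# Each position copies exactly one property: fully disentangled.
compositional = attrs.copy()

# Every position carries the XOR of two properties: no position is
# informative about any single property on its own.
entangled = np.repeat((attrs[:, 0] ^ attrs[:, 1])[:, None], 3, axis=1)

score_compositional = posdis(compositional, attrs)
score_entangled = posdis(entangled, attrs)
```

In this toy setup the copy protocol scores at the top of the range and the XOR protocol at the bottom, which is the contrast the metric is designed to expose. Continuous properties such as friction would need to be binned before applying a count-based estimate like this one.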
