FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models

Abstract

World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer-based predictors operating in learned latent spaces, at a twofold cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self-attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof-of-concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction-diffusion type. Instead of relying on a separate neural network predictor, the model produces its future-state prediction directly by integrating the PDE. In a strictly parameter-matched three-way ablation on unconditional UCF-101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against a Transformer baseline (self-attention) and a ConvLSTM baseline (convolutional recurrence). All three models converge to comparable single-step prediction loss, but FluidWorld achieves 2x lower reconstruction error and produces representations with 10-15% higher spatial structure preservation and 18-25% higher effective dimensionality. Critically, it maintains coherent multi-step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer-grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large-scale compute. These results indicate that PDE-based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter-efficient alternative to both attention and convolutional recurrence for world modeling.
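As a rough illustration of what "the PDE integration itself produces the prediction" can mean, the sketch below takes explicit Euler steps of a generic reaction-diffusion equation, du/dt = D * laplacian(u) + R(u), on a small 2D grid. It is not the FluidWorld implementation: the function names, the periodic boundary handling, the cubic reaction term, and the values of `diffusion` and `dt` are all assumptions chosen for illustration, and a learned model would presumably replace the fixed reaction term. The point is only that each step touches a fixed number of neighbors per cell, so the cost per step grows linearly with the number of cells N.

```python
# Illustrative sketch only: every name and constant here is an assumption,
# not taken from the FluidWorld model.

def laplacian(u):
    """5-point Laplacian with periodic boundaries on a 2D grid (list of lists)."""
    h, w = len(u), len(u[0])
    return [
        [
            u[(i - 1) % h][j] + u[(i + 1) % h][j]
            + u[i][(j - 1) % w] + u[i][(j + 1) % w]
            - 4.0 * u[i][j]
            for j in range(w)
        ]
        for i in range(h)
    ]


def reaction(x):
    """Assumed placeholder reaction term (cubic, with fixed points at -1, 0, 1)."""
    return x - x ** 3


def step(u, diffusion=0.2, dt=0.1):
    """One explicit Euler step of du/dt = diffusion * laplacian(u) + reaction(u)."""
    lap = laplacian(u)
    h, w = len(u), len(u[0])
    return [
        [u[i][j] + dt * (diffusion * lap[i][j] + reaction(u[i][j])) for j in range(w)]
        for i in range(h)
    ]


def rollout(u, n_steps):
    """Multi-step prediction: repeatedly integrate the same dynamics."""
    for _ in range(n_steps):
        u = step(u)
    return u


# Usage: a single active cell spreads to its neighbors through diffusion.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
future = rollout(grid, n_steps=10)
```

In this toy setting a multi-step rollout is simply repeated application of `step`, which mirrors the idea of obtaining longer-horizon predictions by integrating the same dynamics further in time. The step size is kept small (`dt * diffusion` well below 1/4) so that the explicit scheme stays stable on a unit-spaced grid.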