RS-WorldModel: A Unified Model for Remote Sensing Understanding and Future Scene Forecasting
March 16, 2026
Authors: Linrui Xu, Zhongan Wang, Fei Shen, Gang Xu, Huiping Zhuang, Ming Li, Haifeng Li
cs.AI
Abstract
Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1-million-sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) Synergistic Instruction Tuning (SIT) jointly trains understanding and forecasting; (3) Verifiable Reinforcement Optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120 times larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
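For context on the reported forecasting metric: FID compares the Gaussian statistics (mean and covariance) of feature embeddings, typically Inception-v3 activations, extracted from generated and reference images. A minimal sketch of the standard Fréchet distance formula, assuming the feature means and covariances have already been computed (the feature-extraction backbone and any evaluation details specific to RS-WorldModel are not described here):

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to image features:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2))."""
    diff = mu1 - mu2
    # matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# identical statistics give a distance of zero
mu, sigma = np.zeros(4), np.eye(4)
print(round(fid(mu, sigma, mu, sigma), 6))  # → 0.0
```

Lower FID indicates generated-image statistics closer to the real distribution, which is the sense in which 43.13 outperforms the listed baselines.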