RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense Forecasting

Abstract

Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120 times larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
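
For context on the FID figure quoted above: the Fréchet Inception Distance is conventionally defined as the Fréchet distance between two Gaussians fitted to deep-network features of real and generated images, and lower values are better. The formula below is the standard definition, not something stated in the abstract. The symbols are notation assumed here for illustration: \mu_r, \Sigma_r are the mean and covariance of features from real images, and \mu_g, \Sigma_g are those from generated (forecast) images. The abstract does not say which feature extractor or how many samples were used.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

Two identical feature distributions give an FID of 0, so under this definition a lower score such as the reported 43.13 indicates forecast scenes whose feature statistics lie closer to those of real imagery than the baselines' do.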
