OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Abstract

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far had limited impact in this domain because of a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources, are inherently multimodal, noisy, and weakly labeled, and lack unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection that integrates sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated benchmark for rigorous evaluation. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation shows that models trained on our data achieve significant performance improvements. All datasets are publicly released to advance marine artificial intelligence and support the development of domain-specific MLLMs.

