OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Abstract

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far had limited impact in this domain because of a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and are inherently multimodal, noisy, and weakly labeled, with no unified schema or semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, high-quality, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visualizations, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated benchmark for rigorous evaluation. We establish a multi-stage quality control process to ensure scientific validity and cross-modal alignment. Experiments show that models trained on our data achieve significant performance improvements. All datasets are publicly released to advance marine artificial intelligence and empower domain-specific MLLMs.