OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Abstract

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far had limited impact in this domain because of a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and are inherently multimodal, noisy, and weakly labeled, with no unified schema or semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.
