

Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora

November 10, 2025
Authors: Khalil Hennara, Ahmad Bastati, Muhammad Hreden, Mohamed Motasim Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
cs.AI

Abstract

The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. Recent research shows that large multimodal models trained on natural documents, where images and text are interleaved, outperform those trained only on image-text pairs across a wide range of benchmarks, leveraging advanced pre-trained models to enforce semantic alignment, image-sequence consistency, and textual coherence. For Arabic, however, the lack of high-quality multimodal datasets that preserve document structure has limited progress. In this paper, we present our pipeline Wasm for processing the Common Crawl dataset to create a new Arabic multimodal dataset that uniquely provides markdown output. Unlike existing Arabic corpora that focus solely on text extraction, our approach preserves the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios. We provide a comprehensive comparative analysis of our data processing pipeline against those used for major existing datasets, highlighting the convergences in filtering strategies and justifying our specific design choices. To support future research, we publicly release a representative dataset dump along with the multimodal processing pipeline for Arabic.
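The abstract does not include the pipeline itself, so as a rough illustration only (not the authors' implementation), the core idea of converting web HTML into markdown while keeping images interleaved with text in document order can be sketched with Python's standard-library `html.parser`. All class and function names below are hypothetical:

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Illustrative sketch: walk HTML in document order and emit
    markdown blocks, keeping images interleaved with the text."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished markdown blocks, in order
        self.buf = []     # text fragments of the block in progress

    def _flush(self):
        # Close the current text block, if it has any content.
        text = " ".join(self.buf).strip()
        if text:
            self.blocks.append(text)
        self.buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and a.get("src"):
            # Emit the image as its own block, preserving its position.
            self._flush()
            self.blocks.append(f"![{a.get('alt', '')}]({a['src']})")
        elif tag in ("p", "h1", "h2", "h3"):
            self._flush()
            if tag.startswith("h"):
                self.buf.append("#" * int(tag[1]))  # markdown heading marker

    def handle_endtag(self, tag):
        if tag in ("p", "h1", "h2", "h3"):
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self.buf.append(data.strip())


def html_to_markdown(html: str) -> str:
    """Convert an HTML fragment to interleaved markdown blocks."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser._flush()
    return "\n\n".join(parser.blocks)
```

A real pipeline over Common Crawl would additionally need boilerplate removal, deduplication, and language filtering, which this sketch omits.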