DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

June 19, 2026
Authors: Cong Wan, Zeyu Guo, Zijian Cai, Jiangyang Li, SongLin Dong, Lin Peng, Xiangyang Luo, Zhiheng Ma, Yihong Gong
cs.AI

Abstract

Massive unstructured multimodal streams suffer from high "data entropy," which impedes both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms rely heavily on heuristic rules or general-purpose vision-language models (VLMs); they are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability and propose a paradigm shift towards Agentic Data Tailoring, which actively refines and structures data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline that grounds generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building on this dataset, the DataClaw_0-9B model combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers tailored data with high information density, facilitating efficient model adaptation to new tasks under limited-training-data regimes. Project page: https://czjdsg.github.io/MakeAnyData
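
The abstract names Group Relative Policy Optimization (GRPO) but does not describe the reward design or training setup. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several candidate outputs are sampled for the same input, and each reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` term, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Illustrative only: this is the generic group-normalization step,
    not the paper's actual reward or training code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate tailored outputs of one raw stream
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Candidates that score above the group mean receive a positive advantage and those below receive a negative one, so no separately learned value function is needed to supply a baseline.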