DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

Abstract

Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, which rely heavily on heuristic rules or general-purpose VLMs, are costly and monotonous, and they fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability and propose a paradigm shift towards Agentic Data Tailoring, which actively refines and structures data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline that grounds generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building on this dataset, the DataClaw_0-9B model combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers tailored data with high information density, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: https://czjdsg.github.io/MakeAnyData