cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Abstract

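Computer-Aided Design (CAD) plays a central role in engineering and manufacturing because it makes precise, editable 3D models possible. Accepting data from multiple sensors or from the user as input to CAD reconstruction can broaden access to design applications. Existing methods, however, are usually restricted to a single input modality (point clouds, images, or text), which limits their generality and robustness. Building on recent advances in vision-language models (VLMs), we propose a multi-modal CAD reconstruction model that handles all three of these input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning with online feedback that is obtained programmatically. We are also the first to explore RL fine-tuning of LLMs for CAD tasks, and we show that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model surpasses existing single-modal methods on all three input modalities. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, one of which is a real-world dataset.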

English

Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLMs), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using programmatically obtained online feedback. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, including a real-world one.
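The abstract names GRPO as an online RL algorithm but does not spell out how it uses the feedback. As a rough illustration only, the group-relative step that gives GRPO its name can be sketched as follows: several candidates are sampled for the same input, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the example reward values are assumptions made for this sketch; the reward actually used for cadrille is not described in this abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive value and the rest a negative one.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate CAD programs sampled for one input.
rewards = [0.9, 0.5, 0.0, 0.6]
advantages = group_relative_advantages(rewards)
```

In a full training loop these advantages would weight the policy-gradient update for each sampled candidate; that part is omitted here because it depends on details the abstract does not give.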
