Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models

Abstract

Creating Computer-Aided Design (CAD) models requires significant expertise and effort. Text-to-CAD, which converts textual descriptions into CAD parametric sequences, is crucial for streamlining this process. Recent studies have used ground-truth parametric sequences, known as sequential signals, as supervision to achieve this goal. However, CAD models are inherently multimodal, comprising parametric sequences and the corresponding rendered visual objects. Moreover, the rendering process from parametric sequences to visual objects is many-to-one: distinct sequences can render into the same object. Therefore, both sequential and visual signals are critical for effective training. In this work, we introduce CADFusion, a framework that uses Large Language Models (LLMs) as the backbone and alternates between two training stages: the sequential learning (SL) stage and the visual feedback (VF) stage. In the SL stage, we train LLMs on ground-truth parametric sequences, enabling them to generate logically coherent parametric sequences. In the VF stage, we reward parametric sequences that render into visually preferred objects and penalize those that do not, allowing LLMs to learn how rendered visual objects are perceived and evaluated. The two stages alternate throughout training, ensuring balanced learning and preserving the benefits of both signals. Experiments demonstrate that CADFusion significantly improves performance, both qualitatively and quantitatively.
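
The many-to-one rendering property and the alternation of the two stages can be illustrated with a small toy sketch. Everything in it is a hypothetical stand-in and not the CADFusion implementation: the "parametric sequence" is reduced to a list of 2D line segments, `render` simply forgets drawing order and direction, "visually preferred" is approximated by an exact match with a reference render, and the SL-then-VF order within a round is an assumption.

```python
from typing import FrozenSet, List, Tuple

Point = Tuple[int, int]
Segment = Tuple[Point, Point]
Sequence = List[Segment]


def render(sequence: Sequence) -> FrozenSet[FrozenSet[Point]]:
    """Toy renderer (assumption): the visual object is the set of
    undirected edges, so drawing order and direction are discarded."""
    return frozenset(frozenset(segment) for segment in sequence)


def sequence_matches(candidate: Sequence, reference: Sequence) -> bool:
    """Sequential signal: the candidate must equal the reference step by step."""
    return candidate == reference


def visual_matches(candidate: Sequence, reference: Sequence) -> bool:
    """Visual signal: only the rendered objects are compared."""
    return render(candidate) == render(reference)


def split_by_visual_preference(candidates: List[Sequence], reference: Sequence):
    """Partition candidates into those to reward and those to penalize.
    Exact render match is a stand-in for a real visual preference judgment."""
    preferred = [c for c in candidates if visual_matches(c, reference)]
    rejected = [c for c in candidates if not visual_matches(c, reference)]
    return preferred, rejected


def training_schedule(num_rounds: int) -> List[str]:
    """Stage order when the two stages alternate, one of each per round."""
    schedule = []
    for _ in range(num_rounds):
        schedule.append("SL")  # supervised step on ground-truth sequences
        schedule.append("VF")  # reward/penalty step on rendered objects
    return schedule


# Two different sequences that draw the same unit square, plus a triangle.
square_a = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1)), ((0, 1), (0, 0))]
square_b = [((1, 1), (1, 0)), ((1, 0), (0, 0)), ((0, 0), (0, 1)), ((0, 1), (1, 1))]
triangle = [((0, 0), (1, 0)), ((1, 0), (0, 1)), ((0, 1), (0, 0))]

preferred, rejected = split_by_visual_preference([square_b, triangle], square_a)
```

In this toy setting `square_b` differs from `square_a` as a sequence yet renders to the same object, so a purely sequential signal would treat it as wrong while a visual signal would reward it. That gap is the reason the abstract gives for using both signals.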
