Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning
April 7, 2026
Authors: Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, Lijun Wu
cs.AI
Abstract
Graphics program synthesis is pivotal for interpreting and editing visual data, enabling the reverse engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics thanks to its programmatic flexibility, its demand for rigorous spatial precision poses a significant challenge for multimodal large language models. Progress is currently stifled by two gaps: (1) a data quality gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; and (2) an evaluation gap: no benchmark measures both structural and visual fidelity. To address these, we present a closed-loop framework comprising SciTikZ-230K, a large-scale, high-quality dataset built by our Execution-Centric Data Engine and covering 11 diverse scientific disciplines, and SciTikZ-Bench, a multifaceted benchmark spanning basic geometric constructs to intricate hierarchical schematics that evaluates both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning paradigm, which uses Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these components, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary models such as Gemini-2.5-Pro and much larger models such as Qwen3-VL-235B-A22B-Instruct.
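The Round-Trip Verification idea can be illustrated with a minimal reward sketch. This is an assumption-laden illustration, not the paper's exact formulation: the function name `round_trip_reward`, the `min_ink` threshold, and the hard `-1.0` penalty are all hypothetical. It assumes a pipeline where each sampled TikZ program is compiled, its rendering is compared against the target image, and degenerate outputs (compile failures or near-blank drawings) are penalized so that the policy cannot game the visual-similarity signal with trivial code.

```python
# Hedged sketch of a round-trip verification reward for RL over TikZ programs.
# Assumed inputs (all hypothetical names, not from the paper):
#   compiled   - whether the candidate TikZ code produced a rendering at all
#   visual_sim - similarity in [0, 1] between the rendering and the target image
#   ink_ratio  - fraction of non-background pixels in the rendering; values
#                below `min_ink` indicate a degenerate (near-blank) drawing

def round_trip_reward(compiled: bool,
                      visual_sim: float,
                      ink_ratio: float,
                      min_ink: float = 0.01) -> float:
    """Reward for one sampled TikZ program under round-trip verification."""
    # Degenerate code: it either fails to compile or renders an (almost)
    # empty canvas. Hard penalty so the policy cannot exploit the similarity
    # metric by emitting trivial programs.
    if not compiled or ink_ratio < min_ink:
        return -1.0
    # Otherwise reward visual self-consistency between the rendering and
    # the input image.
    return visual_sim


if __name__ == "__main__":
    print(round_trip_reward(False, 0.9, 0.50))   # compile failure -> -1.0
    print(round_trip_reward(True, 0.8, 0.001))   # blank drawing   -> -1.0
    print(round_trip_reward(True, 0.8, 0.20))    # valid drawing   -> 0.8
```

In a full system, `visual_sim` and `ink_ratio` would be computed from the compiled image (e.g., via a perceptual similarity model and pixel statistics), and the "dual" aspect would add a second, code-side consistency check; this sketch only captures the degenerate-code penalty.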