GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Abstract

Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Moreover, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol data and natural image-text data, we introduce unimodal pre-training to develop a diagram encoder and a symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k.

