RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Abstract

This study presents a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment characterized by label ambiguity, occlusion, and background blending. A custom dataset with both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruit) annotations was developed to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, which uses a DINOv2 backbone and deformable attention, excelled at global context modeling and effectively identified partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction and is optimized for computational efficiency and edge deployment. In single-class detection, RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464, demonstrating its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently performed better in complex spatial scenarios. In multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its ability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Analysis of the training dynamics highlighted RF-DETR's rapid convergence, particularly in the single-class setting, where it plateaued within 10 epochs, demonstrating the efficiency with which transformer-based architectures adapt to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agriculture applications and indicate that YOLOv12 is suited to fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
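The two metrics reported in the abstract, mAP@50 and mAP@50:95, differ in how strictly a predicted box must overlap the ground-truth box, which is one way a model can lead on one metric and trail on the other. The following is a minimal Python sketch of that difference, assuming the common COCO convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The abstract does not say which evaluation toolkit was used, and the function names and box coordinates here are purely illustrative, not taken from the study.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which this prediction counts as a match."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# Hypothetical boxes: the prediction is shifted 10 px from the ground truth.
gt = (100, 100, 200, 200)
pred = (110, 110, 210, 210)

print(f"IoU = {iou(pred, gt):.4f}")
print("Counts as a match at thresholds:", thresholds_passed(pred, gt))
```

In this example the shifted box counts as a correct detection under mAP@50 but stops matching at the stricter thresholds that mAP@50:95 averages over, so a model with slightly looser boxes can score well on the first metric and lower on the second.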
