ATLAS: Benchmarking and Adapting Large Language Models for Global Trade via Harmonized Tariff Code Classification

Abstract

Accurate classification of products under the Harmonized Tariff Schedule (HTS) is a critical bottleneck in global trade, yet it has received little attention from the machine learning community. Misclassification can halt shipments entirely: major postal operators have suspended deliveries to the U.S. because of incomplete customs documentation. We introduce the first benchmark for HTS code classification, derived from the U.S. Customs Rulings Online Search System (CROSS). Evaluating leading large language models (LLMs), we find that our fine-tuned Atlas model (LLaMA-3.3-70B) achieves 40 percent fully correct 10-digit classifications and 57.5 percent correct 6-digit classifications, improvements of 15 points over GPT-5-Thinking and 27.5 points over Gemini-2.5-Pro-Thinking. Beyond accuracy, Atlas is roughly five times cheaper than GPT-5-Thinking and eight times cheaper than Gemini-2.5-Pro-Thinking, and it can be self-hosted to guarantee data privacy in high-stakes trade and compliance workflows. While Atlas sets a strong baseline, the benchmark remains highly challenging: even Atlas reaches only 40 percent accuracy at the 10-digit level. By releasing both the dataset and the model, we aim to position HTS classification as a new community benchmark task and invite future work in retrieval, reasoning, and alignment.
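The two headline metrics differ only in how many leading digits of the predicted code must match the reference ruling. The sketch below illustrates one way such prefix-level exact-match accuracy could be computed. It is an assumption for illustration, not the scoring procedure used for the benchmark: the helper names `normalize` and `prefix_accuracy` are hypothetical, and the sample codes are made-up placeholders rather than benchmark data.

```python
def normalize(code: str) -> str:
    """Keep only the digits of an HTS code, e.g. '8471.30.0100' -> '8471300100'."""
    return "".join(ch for ch in code if ch.isdigit())


def prefix_accuracy(predictions, references, digits):
    """Fraction of predictions whose first `digits` digits equal the reference's.

    Assumption: a prediction shorter than `digits` counts as incorrect.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        p, r = normalize(pred), normalize(ref)
        if len(p) >= digits and len(r) >= digits and p[:digits] == r[:digits]:
            hits += 1
    return hits / len(references)


# Illustrative placeholder codes only (not taken from CROSS or the benchmark).
refs = ["8471.30.0100", "6109.10.0012", "9403.60.8081", "3926.90.9985"]
preds = ["8471.30.0100", "6109.10.0040", "9403.50.9045", "3926.90.9985"]

ten_digit = prefix_accuracy(preds, refs, 10)  # full 10-digit match
six_digit = prefix_accuracy(preds, refs, 6)   # match on the 6-digit prefix
print(f"10-digit: {ten_digit:.3f}, 6-digit: {six_digit:.3f}")
```

Under this definition, 6-digit accuracy can never be lower than 10-digit accuracy on the same predictions, which is consistent with the 57.5 percent versus 40 percent figures reported above.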