ATLAS: Benchmarking and Adapting LLMs for Global Trade via Harmonized Tariff Code Classification

Abstract

Accurate classification of products under the Harmonized Tariff Schedule (HTS) is a critical bottleneck in global trade, yet it has received little attention from the machine learning community. Misclassification can halt shipments entirely, with major postal operators suspending deliveries to the U.S. due to incomplete customs documentation. We introduce the first benchmark for HTS code classification, derived from the U.S. Customs Rulings Online Search System (CROSS). Evaluating leading LLMs, we find that our fine-tuned Atlas model (LLaMA-3.3-70B) achieves 40 percent fully correct 10-digit classifications and 57.5 percent correct 6-digit classifications, improvements of 15 percentage points over GPT-5-Thinking and 27.5 percentage points over Gemini-2.5-Pro-Thinking. Beyond accuracy, Atlas is roughly five times cheaper than GPT-5-Thinking and eight times cheaper than Gemini-2.5-Pro-Thinking, and it can be self-hosted to guarantee data privacy in high-stakes trade and compliance workflows. While Atlas sets a strong baseline, the benchmark remains highly challenging: Atlas itself reaches only 40 percent accuracy at the full 10-digit level. By releasing both the dataset and the model, we aim to position HTS classification as a new community benchmark task and invite future work in retrieval, reasoning, and alignment.
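
The two headline metrics can be illustrated with a minimal sketch. It assumes that predictions and references are HTS code strings, that a 10-digit classification is "fully correct" when all ten digits agree, and that a 6-digit classification is correct when the first six digits agree. The abstract does not spell out the scoring procedure, so the function names and the example codes below are hypothetical and purely illustrative, not benchmark data.

```python
import re


def normalize_hts(code: str) -> str:
    """Drop dots, spaces, and any other non-digit characters."""
    return re.sub(r"\D", "", code)


def prefix_match(predicted: str, reference: str, digits: int) -> bool:
    """True when both codes have at least `digits` digits and agree on them."""
    p = normalize_hts(predicted)
    r = normalize_hts(reference)
    if len(p) < digits or len(r) < digits:
        return False
    return p[:digits] == r[:digits]


def accuracy_at(predictions, references, digits: int) -> float:
    """Fraction of items whose first `digits` digits match the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(prefix_match(p, r, digits) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative code pairs only (assumed, not taken from the benchmark).
example_predictions = ["8471.30.0100", "6109.10.0012", "9403.60.8081", "8517.13.0000"]
example_references = ["8471.30.0100", "6109.10.0040", "9403.50.9045", "8517.13.0000"]

acc_10 = accuracy_at(example_predictions, example_references, digits=10)
acc_6 = accuracy_at(example_predictions, example_references, digits=6)
print(f"10-digit accuracy: {acc_10:.3f}")
print(f"6-digit accuracy:  {acc_6:.3f}")
```

Under this scoring, 6-digit accuracy can never fall below 10-digit accuracy, since any fully correct 10-digit code is also correct at six digits; this is consistent with the 57.5 percent and 40 percent figures reported above.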