ATLAS: Benchmarking and Adapting LLMs for Global Trade via Harmonized Tariff Code Classification

Abstract

Accurate classification of products under the Harmonized Tariff Schedule (HTS) is a critical bottleneck in global trade, yet it has received little attention from the machine learning community. Misclassification can halt shipments entirely, with major postal operators suspending deliveries to the U.S. due to incomplete customs documentation. We introduce the first benchmark for HTS code classification, derived from the U.S. Customs Rulings Online Search System (CROSS). Evaluating leading LLMs, we find that our fine-tuned Atlas model (LLaMA-3.3-70B) achieves 40 percent fully correct 10-digit classifications and 57.5 percent correct 6-digit classifications, improvements of 15 points over GPT-5-Thinking and 27.5 points over Gemini-2.5-Pro-Thinking. Beyond accuracy, Atlas is roughly five times cheaper than GPT-5-Thinking and eight times cheaper than Gemini-2.5-Pro-Thinking, and can be self-hosted to guarantee data privacy in high-stakes trade and compliance workflows. While Atlas sets a strong baseline, the benchmark remains highly challenging, with only 40 percent 10-digit accuracy. By releasing both dataset and model, we aim to position HTS classification as a new community benchmark task and invite future work in retrieval, reasoning, and alignment.
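
The two headline metrics, fully correct 10-digit classification and correct 6-digit classification, can be read as prefix matches on the digits of the code. The snippet below is a minimal sketch under that assumption, not the evaluation code used for the benchmark. The function names `normalize_hts` and `prefix_accuracy` and the sample codes are hypothetical and chosen only for illustration.

```python
import re


def normalize_hts(code: str) -> str:
    """Drop dots, spaces, and any other non-digit characters from an HTS code."""
    return re.sub(r"\D", "", code)


def prefix_accuracy(predicted, gold, digits):
    """Fraction of predictions whose first `digits` digits equal the gold code's.

    Assumption: a prediction shorter than `digits` digits counts as incorrect.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = 0
    for p, g in zip(predicted, gold):
        p_digits, g_digits = normalize_hts(p), normalize_hts(g)
        long_enough = len(p_digits) >= digits and len(g_digits) >= digits
        if long_enough and p_digits[:digits] == g_digits[:digits]:
            hits += 1
    return hits / len(gold)


# Hypothetical codes for illustration only; they are not taken from CROSS.
gold = ["8471.30.0100", "6109.10.0012", "9503.00.0073", "3926.90.9985"]
pred = ["8471.30.0100", "6109.10.0040", "9504.50.0000", "3926.90.9985"]

acc10 = prefix_accuracy(pred, gold, 10)  # full 10-digit match
acc6 = prefix_accuracy(pred, gold, 6)    # match on the first 6 digits only
print(f"10-digit accuracy: {acc10:.2f}, 6-digit accuracy: {acc6:.2f}")
```

Under this reading, 6-digit accuracy can never be lower than 10-digit accuracy on the same predictions, which is consistent with the 57.5 percent and 40 percent figures reported above.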