BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications
September 29, 2025
Authors: Andrés Fernández García, Javier de la Rosa, Julio Gonzalo, Roser Morante, Enrique Amigó, Alejandro Benito-Santos, Jorge Carrillo-de-Albornoz, Víctor Fresno, Adrian Ghajari, Guillermo Marco, Laura Plaza, Eva Sánchez Salido
cs.AI
Abstract
The ability to summarize long documents succinctly is increasingly important in daily life due to information overload, yet there is a notable lack of such summaries for Spanish documents in general, and in the legal domain in particular. In this work, we present BOE-XSUM, a curated dataset comprising 3,648 concise, plain-language summaries of documents sourced from Spain's "Boletín Oficial del Estado" (BOE), the State Official Gazette. Each entry in the dataset includes a short summary, the original text, and its document type label. We evaluate the performance of medium-sized large language models (LLMs) fine-tuned on BOE-XSUM, comparing them to general-purpose generative models in a zero-shot setting. Results show that fine-tuned models significantly outperform their non-specialized counterparts. Notably, the best-performing model, BERTIN GPT-J 6B (32-bit precision), achieves a 24% relative performance gain over the top zero-shot model, DeepSeek-R1 (accuracies of 41.6% vs. 33.5%).
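
As a rough illustration of the entry structure described in the abstract, the sketch below models one BOE-XSUM record in Python. The field names (summary, source_text, doc_type) and the example values are illustrative assumptions, not the dataset's published schema or loading API.

```python
# Minimal sketch (not from the paper): one possible representation of a
# BOE-XSUM entry, mirroring the three components named in the abstract.
# Field names and values are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class BoeXsumEntry:
    summary: str      # concise, plain-language summary
    source_text: str  # original text of the BOE document
    doc_type: str     # document type label (e.g., decree, notification)


entry = BoeXsumEntry(
    summary="Plain-language summary of the decree...",
    source_text="Full text as published in the Boletín Oficial del Estado...",
    doc_type="decree",
)
print(entry.doc_type)
```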