TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
October 7, 2025
Author: Adam Filipek
cs.AI
Abstract
Modern natural language processing models have achieved unprecedented scale,
yet the tools for their evaluation often remain a computational bottleneck,
limiting the pace of research. This is particularly acute for in-training
evaluation metrics, such as per-sentence reward signals in Reinforcement
Learning, which must operate efficiently on batches of token IDs directly on
the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the
BLEU metric designed from the ground up for this specific use case. Our
approach is fully vectorized for GPU-accelerated, per-sentence computation
within PyTorch and introduces a memory-efficient counting mechanism. By
creating a compact, batch-specific dictionary of n-grams using
torch.unique, our method avoids the prohibitive memory costs of
traditional hashing-based vectorization, making it practical for
large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard
library for token-ID-based BLEU calculation on the CPU. Experiments show that
TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and
over 40x on data-center-class hardware (NVIDIA A100). This performance
transforms a significant bottleneck into a negligible part of the training
loop. By clearly defining its role as a "Token-ID BLEU" for development
purposes and open-sourcing our implementation, we provide a powerful tool for
accelerating research in areas like RL-based model fine-tuning.
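The memory-efficient counting mechanism described above can be illustrated with a short sketch. This is not the authors' released implementation; it is a minimal, hypothetical reconstruction of the core idea: rather than hashing n-grams into a vocabulary-sized space, `torch.unique` builds a compact dictionary of only the n-grams that actually occur in the batch, and per-sentence counts are then obtained with a single offset `bincount`.

```python
import torch

def batched_ngram_counts(token_ids: torch.Tensor, n: int) -> torch.Tensor:
    """Count n-grams per sentence on the GPU.

    token_ids: (batch, seq_len) LongTensor of token IDs.
    Returns a (batch, num_unique_ngrams) count matrix whose columns index
    a compact, batch-specific n-gram dictionary (illustrative sketch).
    """
    batch, seq_len = token_ids.shape
    # Sliding window extracts all n-grams: (batch, seq_len - n + 1, n)
    ngrams = token_ids.unfold(dimension=1, size=n, step=1)
    flat = ngrams.reshape(-1, n)  # (batch * num_ngrams, n)
    # Compact dictionary: unique n-gram rows plus inverse indices into it.
    # Memory scales with the n-grams present in the batch, not vocab**n.
    unique, inverse = torch.unique(flat, dim=0, return_inverse=True)
    num_ngrams = seq_len - n + 1
    # Offset each sentence's IDs so one bincount yields per-sentence counts.
    offsets = torch.arange(batch, device=token_ids.device)
    offsets = offsets.repeat_interleave(num_ngrams) * unique.size(0)
    counts = torch.bincount(inverse + offsets,
                            minlength=batch * unique.size(0))
    return counts.reshape(batch, unique.size(0))

# Two sentences; bigram (1, 2) appears twice in the first sentence.
x = torch.tensor([[1, 2, 1, 2], [1, 2, 3, 4]])
counts = batched_ngram_counts(x, n=2)
```

From such clipped candidate/reference count matrices, the modified n-gram precisions of BLEU follow by elementwise minimum and summation; the sketch covers only the counting step that the abstract identifies as the memory bottleneck.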