TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
October 7, 2025
Author: Adam Filipek
cs.AI
Abstract
Modern natural language processing models have achieved unprecedented scale,
yet the tools for their evaluation often remain a computational bottleneck,
limiting the pace of research. This is particularly acute for in-training
evaluation metrics, such as per-sentence reward signals in Reinforcement
Learning, which must operate efficiently on batches of token IDs directly on
the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the
BLEU metric designed from the ground up for this specific use case. Our
approach is fully vectorized for GPU-accelerated, per-sentence computation
within PyTorch and introduces a memory-efficient counting mechanism. By
creating a compact, batch-specific dictionary of n-grams using
torch.unique, our method avoids the prohibitive memory costs of
traditional hashing-based vectorization, making it practical for
large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard
library for token-ID-based BLEU calculation on the CPU. Experiments show that
TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and
over 40x on data-center-class hardware (NVIDIA A100). This performance
transforms a significant bottleneck into a negligible part of the training
loop. By clearly defining its role as a "Token-ID BLEU" for development
purposes and open-sourcing our implementation, we provide a powerful tool for
accelerating research in areas like RL-based model fine-tuning.
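The memory-efficient counting mechanism described above can be illustrated with a short sketch. This is not the authors' released implementation; it is a minimal, hypothetical reconstruction of the core idea: rather than hashing n-grams into a vocabulary-sized space, `torch.unique` builds a compact dictionary of only the n-grams that actually occur in the batch, and per-sentence counts are then obtained with a single offset `bincount`.

```python
import torch

def batched_ngram_counts(token_ids: torch.Tensor, n: int) -> torch.Tensor:
    """Count n-grams per sentence on the GPU.

    token_ids: (batch, seq_len) LongTensor of token IDs.
    Returns a (batch, num_unique_ngrams) count matrix whose columns index
    a compact, batch-specific n-gram dictionary (illustrative sketch).
    """
    batch, seq_len = token_ids.shape
    # Sliding window extracts all n-grams: (batch, seq_len - n + 1, n)
    ngrams = token_ids.unfold(dimension=1, size=n, step=1)
    flat = ngrams.reshape(-1, n)  # (batch * num_ngrams, n)
    # Compact dictionary: unique n-gram rows plus inverse indices into it.
    # Memory scales with the n-grams present in the batch, not vocab**n.
    unique, inverse = torch.unique(flat, dim=0, return_inverse=True)
    num_ngrams = seq_len - n + 1
    # Offset each sentence's IDs so one bincount yields per-sentence counts.
    offsets = torch.arange(batch, device=token_ids.device)
    offsets = offsets.repeat_interleave(num_ngrams) * unique.size(0)
    counts = torch.bincount(inverse + offsets,
                            minlength=batch * unique.size(0))
    return counts.reshape(batch, unique.size(0))

# Two sentences; bigram (1, 2) appears twice in the first sentence.
x = torch.tensor([[1, 2, 1, 2], [1, 2, 3, 4]])
counts = batched_ngram_counts(x, n=2)
```

From such clipped candidate/reference count matrices, the modified n-gram precisions of BLEU follow by elementwise minimum and summation; the sketch covers only the counting step that the abstract identifies as the memory bottleneck.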