TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
October 7, 2025
Author: Adam Filipek
cs.AI
Abstract
Modern natural language processing models have achieved unprecedented scale,
yet the tools for their evaluation often remain a computational bottleneck,
limiting the pace of research. This is particularly acute for in-training
evaluation metrics, such as per-sentence reward signals in Reinforcement
Learning, which must operate efficiently on batches of token IDs directly on
the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the
BLEU metric designed from the ground up for this specific use case. Our
approach is fully vectorized for GPU-accelerated, per-sentence computation
within PyTorch and introduces a memory-efficient counting mechanism. By
creating a compact, batch-specific dictionary of n-grams using
torch.unique, our method avoids the prohibitive memory costs of
traditional hashing-based vectorization, making it practical for
large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard
library for token-ID-based BLEU calculation on the CPU. Experiments show that
TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and
over 40x on data-center-class hardware (NVIDIA A100). This performance
transforms a significant bottleneck into a negligible part of the training
loop. By clearly defining its role as a "Token-ID BLEU" for development
purposes and open-sourcing our implementation, we provide a powerful tool for
accelerating research in areas like RL-based model fine-tuning.
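The counting mechanism the abstract describes lends itself to a short PyTorch sketch. The snippet below is an illustrative reconstruction, not the paper's released code: the function name batched_ngram_counts and its interface are assumptions, and it presumes equal-length sequences with no padding mask, which a real implementation would also need to handle.

import torch

def batched_ngram_counts(token_ids: torch.Tensor, n: int):
    """Illustrative sketch (not the paper's code) of torch.unique-based
    n-gram counting: build a compact, batch-specific n-gram dictionary
    instead of hashing into a vocabulary-sized space."""
    batch, seq_len = token_ids.shape
    # Slide a window of length n over each sequence to enumerate n-grams.
    ngrams = token_ids.unfold(1, n, 1)            # (batch, seq_len - n + 1, n)
    flat = ngrams.reshape(-1, n)                  # every n-gram in the batch
    # torch.unique over rows assigns each distinct n-gram a compact ID, so
    # the counting space scales with the n-grams actually present in the
    # batch, not with vocab_size ** n.
    unique_ngrams, compact_ids = torch.unique(flat, dim=0, return_inverse=True)
    num_unique = unique_ngrams.size(0)
    # Offset each sentence's compact IDs into its own slice of the count
    # space so a single bincount yields per-sentence counts.
    offsets = torch.arange(batch, device=token_ids.device) * num_unique
    offsets = offsets.repeat_interleave(ngrams.size(1))
    counts = torch.bincount(compact_ids + offsets, minlength=batch * num_unique)
    return counts.view(batch, num_unique), unique_ngrams

# Example: per-sentence bigram counts for a batch of two token-ID sequences.
batch = torch.tensor([[1, 2, 2, 3], [4, 4, 4, 4]])
counts, vocab = batched_ngram_counts(batch, n=2)
# counts[1] shows the bigram (4, 4) occurring three times in the second sentence.

Because the width of the count tensor is the number of distinct n-grams actually present in the batch rather than vocab_size ** n, memory stays modest even for large-vocabulary models; clipped counts for the modified n-gram precisions can then be obtained with an elementwise torch.minimum against reference counts expressed in the same compact ID space.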