TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
October 7, 2025
Author: Adam Filipek
cs.AI
Abstract
Modern natural language processing models have achieved unprecedented scale,
yet the tools for their evaluation often remain a computational bottleneck,
limiting the pace of research. This is particularly acute for in-training
evaluation metrics, such as per-sentence reward signals in Reinforcement
Learning, which must operate efficiently on batches of token IDs directly on
the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the
BLEU metric designed from the ground up for this specific use case. Our
approach is fully vectorized for GPU-accelerated, per-sentence computation
within PyTorch and introduces a memory-efficient counting mechanism. By
creating a compact, batch-specific dictionary of n-grams using
torch.unique, our method avoids the prohibitive memory costs of
traditional hashing-based vectorization, making it practical for
large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard
library for token-ID-based BLEU calculation on the CPU. Experiments show that
TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and
over 40x on data-center-class hardware (NVIDIA A100). This performance
transforms a significant bottleneck into a negligible part of the training
loop. By clearly defining its role as a "Token-ID BLEU" for development
purposes and open-sourcing our implementation, we provide a powerful tool for
accelerating research in areas like RL-based model fine-tuning.
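The counting mechanism the abstract describes lends itself to a short PyTorch sketch. The snippet below is an illustrative reconstruction, not the paper's released code: the function name batched_ngram_counts and its interface are assumptions, and it presumes equal-length sequences with no padding mask, which a real implementation would also need to handle.

import torch

def batched_ngram_counts(token_ids: torch.Tensor, n: int):
    """Illustrative sketch (not the paper's code) of torch.unique-based
    n-gram counting: build a compact, batch-specific n-gram dictionary
    instead of hashing into a vocabulary-sized space."""
    batch, seq_len = token_ids.shape
    # Slide a window of length n over each sequence to enumerate n-grams.
    ngrams = token_ids.unfold(1, n, 1)            # (batch, seq_len - n + 1, n)
    flat = ngrams.reshape(-1, n)                  # every n-gram in the batch
    # torch.unique over rows assigns each distinct n-gram a compact ID, so
    # the counting space scales with the n-grams actually present in the
    # batch, not with vocab_size ** n.
    unique_ngrams, compact_ids = torch.unique(flat, dim=0, return_inverse=True)
    num_unique = unique_ngrams.size(0)
    # Offset each sentence's compact IDs into its own slice of the count
    # space so a single bincount yields per-sentence counts.
    offsets = torch.arange(batch, device=token_ids.device) * num_unique
    offsets = offsets.repeat_interleave(ngrams.size(1))
    counts = torch.bincount(compact_ids + offsets, minlength=batch * num_unique)
    return counts.view(batch, num_unique), unique_ngrams

# Example: per-sentence bigram counts for a batch of two token-ID sequences.
batch = torch.tensor([[1, 2, 2, 3], [4, 4, 4, 4]])
counts, vocab = batched_ngram_counts(batch, n=2)
# counts[1] shows the bigram (4, 4) occurring three times in the second sentence.

Because the width of the count tensor is the number of distinct n-grams actually present in the batch rather than vocab_size ** n, memory stays modest even for large-vocabulary models; clipped counts for the modified n-gram precisions can then be obtained with an elementwise torch.minimum against reference counts expressed in the same compact ID space.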