
ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

May 12, 2025
Authors: Truc Mai-Thanh Nguyen, Dat Minh Nguyen, Son T. Luu, Kiet Van Nguyen
cs.AI

Abstract

Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly on e-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for the MRHP task in Vietnamese. The dataset covers four domains and comprises 2K products with 46K reviews. Building a dataset at this scale requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time falls from 90 to 120 seconds per task to 20 to 40 seconds per task while data quality is maintained and overall costs drop by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiments on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess the quality differences between the two. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP
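
The timing figures in the abstract can be turned into a rough scale estimate. The sketch below is a back-of-the-envelope calculation only: it assumes one annotation task per review and a single annotation pass, neither of which the abstract states, and the function names are illustrative rather than taken from the ViMRHP repository.

```python
# Back-of-the-envelope check of the annotation-time figures in the abstract.
# Assumption (not from the paper): one annotation task per review, single pass.

REVIEWS = 46_000                 # "46K reviews" from the abstract
MANUAL_SECONDS = (90, 120)       # per-task range without AI assistance
ASSISTED_SECONDS = (20, 40)      # per-task range with AI assistance


def total_hours(seconds_per_task: float, tasks: int = REVIEWS) -> float:
    """Total annotation time in hours for a given per-task duration."""
    return seconds_per_task * tasks / 3600


def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Most and least favourable pairings of the two per-task ranges.
best_case = reduction(MANUAL_SECONDS[1], ASSISTED_SECONDS[0])   # 120 s -> 20 s
worst_case = reduction(MANUAL_SECONDS[0], ASSISTED_SECONDS[1])  # 90 s -> 40 s

if __name__ == "__main__":
    for label, (low, high) in (("manual", MANUAL_SECONDS),
                               ("assisted", ASSISTED_SECONDS)):
        print(f"{label}: {total_hours(low):.0f} to {total_hours(high):.0f} hours")
    print(f"per-task time reduction: {worst_case:.0%} to {best_case:.0%}")
```

Under these assumptions the per-task time reduction lies between roughly 56% and 83%. The reported cost reduction of about 65% falls inside that range, although cost and annotation time need not scale identically.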