More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Text

Abstract

Detecting Schwartz values in political text is difficult because implicit cues often depend on the surrounding argument and on fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base and DeBERTa-v3-large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8 to 4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, neither scaling from DeBERTa-v3-base to large nor scaling from 12B to larger LLMs guarantees gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.
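To make the compared input conditions concrete, the following is a minimal sketch of how sentence, window, and full-document inputs might be assembled, and how retrieved knowledge could be concatenated into the input under early fusion. The abstract does not specify the window size, the separator, or the exact fusion format, so the function names, the `[SEP]` string, and the default `window=1` are illustrative assumptions rather than the setup used in the study.

```python
from typing import List, Optional

# Hypothetical separator for illustration; real tokenizers define their own.
SEP = " [SEP] "


def build_context(sentences: List[str], idx: int,
                  mode: str = "sentence", window: int = 1) -> str:
    """Return the context string for the target sentence at position idx.

    mode="sentence": the target sentence only.
    mode="window":   the target plus up to `window` neighbors on each side.
    mode="document": all sentences of the document.
    """
    if mode == "sentence":
        return sentences[idx]
    if mode == "window":
        lo = max(0, idx - window)
        hi = min(len(sentences), idx + window + 1)
        return " ".join(sentences[lo:hi])
    if mode == "document":
        return " ".join(sentences)
    raise ValueError(f"unknown mode: {mode}")


def early_fusion_input(target: str, context: str,
                       knowledge: Optional[List[str]] = None) -> str:
    """Concatenate target, context, and retrieved passages into one input.

    Assumption: 'early fusion' is taken here to mean joining retrieved text
    with the model input before encoding, as opposed to combining separately
    encoded representations later.
    """
    parts = [target]
    if context != target:
        parts.append(context)
    if knowledge:
        parts.append(" ".join(knowledge))
    return SEP.join(parts)
```

In this sketch the target sentence always comes first, so that it would survive truncation when a long document or many retrieved passages exceed the model's input limit; that ordering is a design choice of the example, not something stated in the abstract.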