Benchmarking Large Language Models for Political Science: A United Nations Perspective

Abstract

Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stakes political decision-making remains largely unexplored. This paper addresses that gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process (drafting, voting, and discussing) and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates both the potential and the challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench repository can be accessed at: https://github.com/yueqingliang1/UNBench.