AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns
June 16, 2025
Author: Evgeny Markhasin
cs.AI
Abstract
We present and evaluate a suite of proof-of-concept (PoC), structured
workflow prompts designed to elicit human-like hierarchical reasoning while
guiding Large Language Models (LLMs) in high-level semantic and linguistic
analysis of scholarly manuscripts. The prompts target two non-trivial
analytical tasks: identifying unsubstantiated claims in summaries
(informational integrity) and flagging ambiguous pronoun references (linguistic
clarity). We conducted a systematic, multi-run evaluation on two frontier
models (Gemini 2.5 Pro and ChatGPT Plus o3) under varied context
conditions. Our results for the informational integrity task reveal a
significant divergence in model performance: while both models successfully
identified an unsubstantiated head of a noun phrase (95% success), ChatGPT
consistently failed (0% success) to identify an unsubstantiated adjectival
modifier that Gemini correctly flagged (95% success), raising a question
regarding potential influence of the target's syntactic role. For the
linguistic analysis task, both models performed well (80-90% success) with full
manuscript context. In a summary-only setting, however, ChatGPT achieved a
perfect (100%) success rate, while Gemini's performance was substantially
degraded. Our findings suggest that structured prompting is a viable
methodology for complex textual analysis but show that prompt performance may
be highly dependent on the interplay between the model, task type, and context,
highlighting the need for rigorous, model-specific testing.
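
The abstract describes a multi-run, per-condition evaluation with success rates reported per model, task, and context. As a rough illustration of that kind of protocol only, the following Python sketch (not from the paper; `call_model`, `Trial`, the model identifiers, and the string-containment scoring are all assumptions) repeats one prompt over several runs and reports the fraction of runs that flag an expected target.

```python
"""Minimal sketch of a multi-run prompt evaluation harness (illustrative only)."""

from dataclasses import dataclass
from typing import Callable


@dataclass
class Trial:
    model: str      # e.g. "gemini-2.5-pro" or "o3" (identifiers assumed)
    task: str       # e.g. "informational_integrity" or "linguistic_clarity"
    context: str    # e.g. "full_manuscript" or "summary_only"
    success: bool   # did this run flag the expected target?


def run_condition(call_model: Callable[[str, str], str],
                  model: str, task: str, context: str,
                  prompt: str, document: str,
                  expected_flag: str, n_runs: int = 20) -> list[Trial]:
    """Repeat the same structured prompt n_runs times and score each run.

    A run counts as a success if the model's output mentions the expected
    target (e.g. an unsubstantiated adjectival modifier). This simple
    string-containment check stands in for manual scoring.
    """
    trials = []
    for _ in range(n_runs):
        output = call_model(model, f"{prompt}\n\n{document}")
        trials.append(Trial(model, task, context,
                            expected_flag.lower() in output.lower()))
    return trials


def success_rate(trials: list[Trial]) -> float:
    """Fraction of runs in which the target was flagged."""
    return sum(t.success for t in trials) / len(trials)


if __name__ == "__main__":
    # Stub stand-in for a real LLM API client, for illustration only.
    def fake_model(_model: str, _prompt: str) -> str:
        return "The modifier 'novel' in the summary is unsubstantiated."

    trials = run_condition(fake_model, "gemini-2.5-pro",
                           "informational_integrity", "summary_only",
                           prompt="<structured workflow prompt>",
                           document="<abstract text>",
                           expected_flag="novel")
    print(f"success rate: {success_rate(trials):.0%}")
```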