AI-Facilitated Analysis of Abstracts and Conclusions: Flagging Unsubstantiated Claims and Ambiguous Pronouns
June 16, 2025
Author: Evgeny Markhasin
cs.AI
Abstract
We present and evaluate a suite of proof-of-concept (PoC), structured workflow prompts designed to elicit human-like hierarchical reasoning while guiding Large Language Models (LLMs) in high-level semantic and linguistic analysis of scholarly manuscripts. The prompts target two non-trivial analytical tasks: identifying unsubstantiated claims in summaries (informational integrity) and flagging ambiguous pronoun references (linguistic clarity). We conducted a systematic, multi-run evaluation on two frontier models (Gemini Pro 2.5 Pro and ChatGPT Plus o3) under varied context conditions. Our results for the informational integrity task reveal a significant divergence in model performance: while both models successfully identified an unsubstantiated head of a noun phrase (95% success), ChatGPT consistently failed (0% success) to identify an unsubstantiated adjectival modifier that Gemini correctly flagged (95% success), raising a question regarding the potential influence of the target's syntactic role. For the linguistic analysis task, both models performed well (80-90% success) with full manuscript context. In a summary-only setting, however, ChatGPT achieved a perfect (100%) success rate, while Gemini's performance was substantially degraded. Our findings suggest that structured prompting is a viable methodology for complex textual analysis but show that prompt performance may be highly dependent on the interplay between the model, task type, and context, highlighting the need for rigorous, model-specific testing.
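To make the evaluation setup concrete, the sketch below illustrates one way a structured workflow prompt and a multi-run success-rate measurement could be wired together. It is not the paper's released code: the prompt text, the call_model stub, and the canned reply are hypothetical placeholders, and a real study would route call_model to an actual Gemini or ChatGPT client and score outputs against the paper's own criteria.

```python
"""Illustrative sketch only: a minimal multi-run harness for a structured
"flag unsubstantiated claims" prompt. All identifiers here (WORKFLOW_PROMPT,
call_model, the canned stub reply) are assumptions invented for this example."""
import json

# A structured workflow prompt in the spirit described by the abstract:
# the model is walked through explicit steps and asked for machine-readable output.
WORKFLOW_PROMPT = """\
You are reviewing a scholarly abstract for informational integrity.
Step 1: List every factual claim made in the abstract.
Step 2: For each claim, check whether the provided manuscript context substantiates it.
Step 3: Report unsubstantiated claims as a JSON list of objects with the
        fields "claim" and "reason". Return [] if none are found.

Abstract:
{abstract}

Context (empty in the summary-only condition):
{context}
"""

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; returns a canned reply so the
    sketch runs end to end. Replace with your provider's client."""
    return json.dumps([{"claim": "novel state-of-the-art method",
                        "reason": "superlative not supported by the context"}])

def run_evaluation(abstract: str, context: str,
                   expected_substring: str, runs: int = 10) -> float:
    """Repeat the prompt several times and report the fraction of runs in
    which the expected unsubstantiated claim is flagged (the success rate)."""
    successes = 0
    for _ in range(runs):
        reply = call_model(WORKFLOW_PROMPT.format(abstract=abstract, context=context))
        try:
            flagged = json.loads(reply)
        except json.JSONDecodeError:
            continue  # malformed output counts as a failed run
        if any(expected_substring in item.get("claim", "") for item in flagged):
            successes += 1
    return successes / runs

if __name__ == "__main__":
    rate = run_evaluation(
        abstract="We introduce a novel state-of-the-art method ...",
        context="",  # summary-only condition
        expected_substring="state-of-the-art",
    )
    print(f"success rate: {rate:.0%}")
```

Under this kind of harness, the per-target success rates reported in the abstract (e.g., 95% vs. 0% on the adjectival modifier) would simply be the flag rate for each expected target across repeated runs and context conditions.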