From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions

February 19, 2025
Authors: Nathanaël Carraz Rakotonirina, Mohammed Hamdy, Jon Ander Campos, Lucas Weber, Alberto Testoni, Marzieh Fadaee, Sandro Pezzelle, Marco Del Tredici
cs.AI

Abstract

Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs' ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long instruction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
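To make the setting concrete, the sketch below shows one way a multi-session coding evaluation of this kind could be assembled: simple instructions accumulate across sessions, interleaved with irrelevant content, and the model is finally asked to write code that should respect all earlier instructions. This is an illustrative assumption only; the session contents, the call_model stub, and the follows_instructions check are hypothetical placeholders, not the authors' actual MemoryCode data or evaluation pipeline.

```python
# Minimal, hypothetical sketch of a multi-session instruction-tracking setup.
# All session texts and helper functions are illustrative assumptions.

from typing import List

# Simulated sessions: some carry coding instructions the model must remember,
# others are irrelevant filler.
SESSIONS: List[str] = [
    "Session 1: From now on, start every function name with the prefix 'x_'.",
    "Session 2: The team lunch has been moved to Friday.",          # irrelevant
    "Session 3: Also, always add a docstring to every function you write.",
]

TASK = "Write a Python function that returns the n-th Fibonacci number."


def build_prompt(sessions: List[str], task: str) -> str:
    """Concatenate the full interaction history, then pose the coding task."""
    history = "\n\n".join(sessions)
    return f"{history}\n\nNow: {task}"


def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM call; plug in a real client here."""
    raise NotImplementedError


def follows_instructions(code: str) -> bool:
    """Toy check: were the instructions from earlier sessions applied?"""
    uses_prefix = "def x_" in code
    has_docstring = '"""' in code or "'''" in code
    return uses_prefix and has_docstring


if __name__ == "__main__":
    prompt = build_prompt(SESSIONS, TASK)
    print(prompt)
    # generated = call_model(prompt)
    # print("instructions followed:", follows_instructions(generated))
```

A setup like this makes the paper's finding easy to state: models typically satisfy each instruction when it appears alone in the prompt, but accuracy on the final task drops once the relevant instructions are separated by many sessions of distracting material.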
