BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge

August 31, 2023
Authors: Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, Mark Gerstein
cs.AI

Abstract

Pre-trained language models like ChatGPT have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks. Moreover, in bioinformatics, generating functional programs poses additional notable challenges due to the amount of domain knowledge, the need for complicated data operations, and intricate functional dependencies between the operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. In relation to function-code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub and 253 examples from the Rosalind Project. BioCoder incorporates a fuzz-testing framework for evaluation, and we have applied it to evaluate many models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
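
To make the fuzz-testing idea concrete, here is a minimal sketch of how a generated function might be checked against a reference implementation on randomized inputs. It is an illustration only and does not reproduce BioCoder's actual framework: the function names (`reference_gc_content`, `candidate_gc_content`, `random_dna`, `fuzz_test`), the GC-content task, and the trial count are all hypothetical choices made for this example.

```python
import random


def reference_gc_content(seq: str) -> float:
    """Hypothetical golden implementation: fraction of G/C bases in a DNA string."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def candidate_gc_content(seq: str) -> float:
    """Hypothetical stand-in for model-generated code under test."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def random_dna(rng: random.Random, max_len: int = 50) -> str:
    """Draw a random DNA string of length 0..max_len."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ACGT") for _ in range(length))


def fuzz_test(candidate, reference, trials: int = 200, seed: int = 0) -> bool:
    """Return True if candidate matches reference on every random input."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    for _ in range(trials):
        seq = random_dna(rng)
        try:
            got = candidate(seq)
        except Exception:
            return False  # a crash counts as a failed test
        if abs(got - reference(seq)) > 1e-9:
            return False  # output disagrees with the reference
    return True


if __name__ == "__main__":
    print(fuzz_test(candidate_gc_content, reference_gc_content))
```

Comparing against a reference on many random inputs, rather than on a few hand-written cases, is what lets this style of test catch edge cases such as empty sequences.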