FuzzCoder: Byte-level Fuzzing Test via Large Language Model
September 3, 2024
Authors: Liqun Yang, Jian Yang, Chaoren Wei, Guanglin Niu, Ge Zhang, Yunli Wang, Linzheng Chai, Wanxu Xia, Hongcheng Guo, Shun Zhang, Jiaheng Liu, Yuwei Yin, Junran Peng, Jiaxin Ma, Liang Sun, Zhoujun Li
cs.AI
Abstract
Fuzzing is an important dynamic program analysis technique designed for
finding vulnerabilities in complex software. Fuzzing involves presenting a
target program with crafted malicious input to cause crashes, buffer overflows,
memory errors, and exceptions. Crafting malicious inputs in an efficient manner
is a difficult open problem and the best approaches often apply uniform random
mutations to pre-existing valid inputs. In this work, we propose to adopt
fine-tuned large language models (FuzzCoder) to learn patterns in the input
files from successful attacks to guide future fuzzing explorations.
Specifically, we develop a framework that leverages code LLMs to guide the
mutation process of inputs in fuzzing. The mutation process is formulated as
a sequence-to-sequence modeling task, where the LLM receives a sequence of
bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on a
purpose-built instruction dataset (Fuzz-Instruct), which collects successful
fuzzing history from a heuristic fuzzing tool. FuzzCoder can predict mutation
locations and mutation strategies in input files to trigger abnormal behaviors
of the program. Experimental results show that FuzzCoder based on AFL (American
Fuzzy Lop) gains significant improvements in terms of effective proportion of
mutation (EPM) and number of crashes (NC) for various input formats including
ELF, JPG, MP3, and XML.
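
To make the idea of predicted mutation locations and strategies concrete, the following is a minimal Python sketch of how a list of (position, strategy) pairs might be applied to a seed byte sequence. It is an illustration only, not the authors' implementation: the strategy names (`flip_bit`, `increment`, `set_max`), the `apply_mutations` helper, and the hard-coded `predicted` list standing in for model output are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def flip_bit(data: bytearray, pos: int) -> None:
    """Flip the lowest bit of the byte at pos."""
    data[pos] ^= 0x01


def increment(data: bytearray, pos: int) -> None:
    """Add 1 to the byte at pos, wrapping around at 256."""
    data[pos] = (data[pos] + 1) % 256


def set_max(data: bytearray, pos: int) -> None:
    """Overwrite the byte at pos with 0xFF."""
    data[pos] = 0xFF


# Hypothetical strategy table; a real fuzzer such as AFL has its own operators.
STRATEGIES: Dict[str, Callable[[bytearray, int], None]] = {
    "flip_bit": flip_bit,
    "increment": increment,
    "set_max": set_max,
}


def apply_mutations(seed: bytes, predictions: List[Tuple[int, str]]) -> bytes:
    """Apply each (position, strategy) pair to a copy of the seed.

    Pairs with an out-of-range position or an unknown strategy are skipped.
    """
    data = bytearray(seed)
    for pos, name in predictions:
        if 0 <= pos < len(data) and name in STRATEGIES:
            STRATEGIES[name](data, pos)
    return bytes(data)


if __name__ == "__main__":
    seed = bytes([0x7F, 0x45, 0x4C, 0x46])  # the four ELF magic bytes
    # Stand-in for model output: assumed (position, strategy) predictions.
    predicted = [(0, "flip_bit"), (3, "increment")]
    print(apply_mutations(seed, predicted).hex())  # prints 7e454c47
```

In this sketch the model's role is reduced to supplying the `predicted` list; a random-mutation baseline would instead sample positions and strategies uniformly.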