FuzzCoder: Byte-level Fuzzing Test via Large Language Model

Abstract

Fuzzing is an important dynamic program analysis technique for finding vulnerabilities in complex software. It involves presenting a target program with crafted malicious inputs intended to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs efficiently is a difficult open problem, and the best existing approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt a fine-tuned large language model (FuzzCoder) that learns patterns in the input files of successful attacks to guide future fuzzing explorations. Specifically, we develop a framework that leverages code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as a sequence-to-sequence modeling task, in which the LLM receives a sequence of bytes and outputs the mutated byte sequence. FuzzCoder is fine-tuned on a purpose-built instruction dataset (Fuzz-Instruct), whose successful fuzzing history is collected from a heuristic fuzzing tool. FuzzCoder can predict both the mutation positions and the mutation strategies in input files that trigger abnormal program behavior. Experimental results show that FuzzCoder, built on AFL (American Fuzzy Lop), gains significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML.
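As a rough illustration of what predicting mutation positions and strategies could mean at the byte level, the sketch below applies a list of (position, strategy) pairs to a seed input and contrasts it with a uniform random baseline. Everything named here (`STRATEGIES`, `apply_mutations`, `random_plan`, and the three strategy names) is hypothetical and is not taken from FuzzCoder or AFL; in the proposed framework the plan would come from the fine-tuned model rather than from a random generator.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical in-place byte mutations, loosely inspired by common fuzzer operators.
def flip_high_bit(data: bytearray, pos: int) -> None:
    data[pos] ^= 0x80

def add_one(data: bytearray, pos: int) -> None:
    data[pos] = (data[pos] + 1) & 0xFF  # wrap around at 255

def set_max(data: bytearray, pos: int) -> None:
    data[pos] = 0xFF

STRATEGIES: Dict[str, Callable[[bytearray, int], None]] = {
    "flip_high_bit": flip_high_bit,
    "add_one": add_one,
    "set_max": set_max,
}

Plan = List[Tuple[int, str]]

def apply_mutations(seed: bytes, plan: Plan) -> bytes:
    """Apply each (position, strategy) pair to a copy of the seed.

    Out-of-range positions and unknown strategy names are skipped, since a
    learned predictor may emit invalid pairs.
    """
    out = bytearray(seed)
    for pos, name in plan:
        if 0 <= pos < len(out) and name in STRATEGIES:
            STRATEGIES[name](out, pos)
    return bytes(out)

def random_plan(seed: bytes, n: int, rng: random.Random) -> Plan:
    """Baseline: choose positions and strategies uniformly at random."""
    names = sorted(STRATEGIES)
    return [(rng.randrange(len(seed)), rng.choice(names)) for _ in range(n)]

if __name__ == "__main__":
    seed = b"\x7fELF\x02\x01"
    # A plan of the kind a model might predict (hand-written here as an assumption).
    predicted = [(4, "add_one"), (0, "flip_high_bit")]
    print(apply_mutations(seed, predicted))
    print(apply_mutations(seed, random_plan(seed, 3, random.Random(0))))
```

All three strategies overwrite bytes in place, so positions in a plan stay valid as earlier mutations are applied; operators that insert or delete bytes would need position bookkeeping that this sketch leaves out.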