Iterative Graph Alignment
August 29, 2024
Authors: Fangyuan Yu, Hardeep Singh Arora, Matt Johnson
cs.AI
Abstract
By compressing diverse narratives, LLMs go beyond memorization, achieving
intelligence by capturing generalizable causal relationships. However, they
suffer from local 'representation gaps' due to insufficient training data
diversity, limiting their real-world utility, especially in tasks requiring
strict alignment to rules. Traditional alignment methods relying on heavy human
annotations are inefficient and unscalable. Recent self-alignment techniques
also fall short, as they often depend on self-selection-based prompting and
memorization-based learning. To address these issues, we introduce Iterative
Graph Alignment (IGA), an annotation-free rule-based alignment algorithm. A
teacher model (VLM) employs Iterative Graph Prompting (IGP) to create logical
graphs and reference answers. The student model (LLM) identifies local
knowledge gaps by attempting to align its responses with these references,
collaborating with helper models to generate diverse answers. These aligned
responses are then used for iterative supervised fine-tuning (SFT). Our
evaluations across five rule-based scenarios demonstrate IGP's effectiveness,
with a 73.12% alignment improvement in Claude Sonnet 3.5, and
Llama3-8B-Instruct achieving an 86.20% improvement, outperforming Claude
Sonnet 3.5 in rule-based alignment.
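To make the described pipeline concrete, below is a minimal Python sketch of one way the IGA loop could be structured, based only on the abstract. All names and signatures (teacher, student_answer, is_aligned, helpers, sft_step) are hypothetical placeholders assumed for illustration, not the authors' actual implementation.

```python
# Minimal sketch of an Iterative Graph Alignment (IGA) loop, assuming
# hypothetical callables for the teacher VLM, student LLM, alignment
# judge, helper models, and SFT step. Not the paper's actual API.

from typing import Callable, List, Tuple

def iterative_graph_alignment(
    prompts: List[str],
    teacher: Callable[[str], Tuple[str, str]],           # IGP: prompt -> (logical graph, reference answer)
    student_answer: Callable[[str, str], str],           # (prompt, graph) -> student response
    is_aligned: Callable[[str, str], bool],              # checks response against the reference answer
    helpers: List[Callable[[str, str], str]],            # helper models proposing diverse answers
    sft_step: Callable[[List[Tuple[str, str]]], None],   # fine-tunes the student on (prompt, answer) pairs
    num_rounds: int = 3,
) -> None:
    for _ in range(num_rounds):
        training_pairs: List[Tuple[str, str]] = []
        for prompt in prompts:
            # Teacher runs Iterative Graph Prompting to produce a logical
            # graph and a reference answer for this prompt.
            graph, reference = teacher(prompt)
            response = student_answer(prompt, graph)
            if is_aligned(response, reference):
                # Student is already aligned: keep its own answer.
                training_pairs.append((prompt, response))
            else:
                # Local knowledge gap: collect diverse aligned answers
                # from helper models to fill it.
                for helper in helpers:
                    candidate = helper(prompt, graph)
                    if is_aligned(candidate, reference):
                        training_pairs.append((prompt, candidate))
        # Iterative supervised fine-tuning on the aligned responses.
        sft_step(training_pairs)
```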