AMBEDKAR: A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models
September 2, 2025
Authors: Snehasis Mukhopadhyay, Aryan Kasat, Shivam Dubey, Rahul Karthikeyan, Dhruv Sood, Vinija Jain, Aman Chadha, Amitava Das
cs.AI
Abstract
Large Language Models (LLMs) can inadvertently reflect societal biases
present in their training data, leading to harmful or prejudiced outputs. In
the Indian context, our empirical evaluations across a suite of models reveal
that biases around caste and religion are particularly salient. Yet, most
existing mitigation strategies are Western-centric and fail to address these
local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian
vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide LLM
outputs toward fairness, neutrality, and inclusion in line with Articles 14 to
17 of the Indian Constitution. Our approach introduces a Constitution-Aware Decoding Layer, guided by the
AI Constitution of India and applied only at inference time, without any
parameter updates to the base model. We incorporate a speculative decoding
algorithm that proactively reduces casteist and communal bias during
generation. This mitigation layer operates directly within the decoding
process, avoiding changes to model internals and lowering the computational and
infrastructural costs associated with retraining. We reinterpret speculative
decoding not merely as an efficiency tool but as a mechanism for fairness. In
this framework, a Small Language Model (SLM) acts as a potentially biased
generator, while a constitutionally guided Large Language Model (LLM) serves as
the verifier. Rather than accelerating generation, the LLM enforces bias-robust
trajectories in the SLM outputs. This inversion of roles gives rise to a
fairness-by-speculation paradigm. Our approach yields an absolute reduction in
bias of up to 26.41 percent compared to the baseline. Our source code, datasets,
and results are available at https://anonymous.4open.science/r/AMBEDKAR-983B/.
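
As a rough illustration of the fairness-by-speculation loop described in the abstract, the Python sketch below gates tokens drafted by a small model with the standard speculative-decoding acceptance test, where the verifier distribution stands in for the constitution-guided LLM. All names here (accept_token, speculate, LogProbFn, draft_step, verifier_step) are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of fairness-by-speculation (assumed interfaces, not the paper's code).
import math
import random
from typing import Callable, List

Token = str
# log p(token | prefix) under a given model
LogProbFn = Callable[[List[Token], Token], float]
# samples one next token given the prefix
StepFn = Callable[[List[Token]], Token]

def accept_token(prefix: List[Token], token: Token,
                 draft_logprob: LogProbFn, verifier_logprob: LogProbFn) -> bool:
    """Speculative-decoding acceptance test: keep the drafted token with
    probability min(1, p_verifier / p_draft). Tokens the constitution-guided
    verifier considers unlikely (e.g. biased continuations) tend to be vetoed."""
    log_ratio = verifier_logprob(prefix, token) - draft_logprob(prefix, token)
    if log_ratio >= 0.0:          # verifier at least as confident: always accept
        return True
    return random.random() < math.exp(log_ratio)

def speculate(prefix: List[Token], draft_step: StepFn, verifier_step: StepFn,
              draft_logprob: LogProbFn, verifier_logprob: LogProbFn,
              max_new_tokens: int = 16) -> List[Token]:
    """Let the SLM draft tokens and the LLM verifier veto them; on rejection,
    fall back to a token sampled from the verifier (a simplification of the
    usual residual-distribution resampling step)."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        drafted = draft_step(out)
        if accept_token(out, drafted, draft_logprob, verifier_logprob):
            out.append(drafted)
        else:
            out.append(verifier_step(out))
    return out

The role inversion described in the abstract shows up in the acceptance test: the small model only proposes, while the verifier's constitution-aligned distribution decides which continuations survive.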