HI-TransPA: Hearing Impairments Translation Personal Assistant

November 13, 2025
Authors: Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng
cs.AI

Abstract

To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
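The quality-guided curriculum described in the abstract can be pictured with a minimal sketch: per-sample quality scores produced by the preprocessing pipeline gate which clips enter each training epoch, and the threshold is relaxed over time so clean, high-confidence samples are seen first and harder ones are folded in later. The names below (`QualityCurriculumSampler`, `quality_score`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): curriculum scheduling driven by
# per-sample quality scores from a preprocessing/curation pipeline.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    path: str             # path to an audio-visual clip (hypothetical field)
    quality_score: float  # 0.0 (noisy) .. 1.0 (clean), assigned during curation

class QualityCurriculumSampler:
    """Yields, for each epoch, the samples whose quality clears a threshold
    that is gradually lowered, so training starts on clean clips and
    progressively incorporates harder cases."""

    def __init__(self, samples: List[Sample], start_threshold: float = 0.8,
                 end_threshold: float = 0.0, total_epochs: int = 10):
        self.samples = samples
        self.start = start_threshold
        self.end = end_threshold
        self.total_epochs = total_epochs

    def epoch_samples(self, epoch: int) -> List[Sample]:
        # Linearly relax the quality threshold over the course of training.
        frac = min(epoch / max(self.total_epochs - 1, 1), 1.0)
        threshold = self.start + frac * (self.end - self.start)
        eligible = [s for s in self.samples if s.quality_score >= threshold]
        random.shuffle(eligible)
        return eligible

# Usage: feed each epoch's eligible subset to the usual training loop.
if __name__ == "__main__":
    data = [Sample(f"clip_{i:04d}.mp4", random.random()) for i in range(1000)]
    sampler = QualityCurriculumSampler(data)
    for epoch in range(10):
        pool = sampler.epoch_samples(epoch)
        print(f"epoch {epoch}: {len(pool)} samples eligible")
```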