HI-TransPA: Hearing Impairments Translation Personal Assistant
November 13, 2025
Authors: Zhiming Ma, Shiyu Gan, Junhao Zhao, Xianming Li, Qingyun Pan, Peidong Wang, Mingjun Pan, Yuhao Mo, Jiajie Cheng, Chengxin Chen, Zhonglun Cao, Chonghan Liu, Shi Cheng
cs.AI
Abstract
To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
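As a rough illustration of the quality-score-guided curriculum described in the abstract, the sketch below orders multimodal samples by their curated quality score and builds cumulative training pools, starting from clean, high-confidence samples and progressively admitting harder ones. This is not the authors' code: the `Sample` fields, the score range, the thresholds, and the number of stages are illustrative assumptions.

```python
# Minimal sketch of quality-score-guided curriculum scheduling.
# Assumes each sample already carries a quality score in [0, 1]
# produced by a preprocessing/curation pipeline (hypothetical fields).
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    audio_path: str       # indistinct speech clip
    lip_video_path: str   # stabilized high-frame-rate lip crop
    transcript: str       # target text
    quality: float        # multimodal quality score, higher = cleaner


def curriculum_stages(samples: List[Sample],
                      thresholds=(0.9, 0.7, 0.0)) -> List[List[Sample]]:
    """Return cumulative training pools: stage 0 contains only clean,
    high-confidence samples; later stages add progressively harder ones."""
    ordered = sorted(samples, key=lambda s: s.quality, reverse=True)
    return [[s for s in ordered if s.quality >= t] for t in thresholds]


# Usage: train first on curriculum_stages(data)[0], then continue
# fine-tuning on the larger pools from the later stages.
```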