GLiClass: Generalist Lightweight Model for Sequence Classification Tasks

Abstract

Classification is one of the most widespread tasks in AI applications, often serving as the first step in filtering, sorting, and categorizing data. Because modern AI systems must handle large volumes of input data, and errors made in early pipeline stages propagate downstream, high efficiency and accuracy are critical. Moreover, classification requirements can change dynamically with user needs, which calls for models with strong zero-shot capabilities. Generative LLMs have become the mainstream choice for zero-shot classification because of their versatility, but they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, which significantly reduces efficiency with large label sets. Embedding-based approaches are efficient but struggle in complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture to sequence classification tasks. Our approach achieves accuracy and efficiency comparable to embedding-based methods while retaining the flexibility needed for zero-shot and few-shot learning. Additionally, we adapt proximal policy optimization (PPO) to multi-label text classification, which makes it possible to train classifiers in data-sparse conditions or from human feedback.
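To make the cross-encoder bottleneck concrete, the following minimal sketch counts encoder forward passes under two simplified cost models. It is an illustration only, not the GLiClass implementation: the function names are hypothetical, and the joint model assumes that all candidate labels fit into the same input as the text, so that one pass scores every label.

```python
def cross_encoder_passes(num_texts: int, num_labels: int) -> int:
    """Pairwise scoring: one forward pass per (text, label) pair."""
    return num_texts * num_labels


def joint_encoder_passes(num_texts: int, num_labels: int) -> int:
    """Joint scoring (assumption): one forward pass per text,
    with all labels encoded alongside the text in the same input."""
    return num_texts


if __name__ == "__main__":
    num_texts = 1000
    for num_labels in (4, 32, 128):
        pairwise = cross_encoder_passes(num_texts, num_labels)
        joint = joint_encoder_passes(num_texts, num_labels)
        print(f"{num_labels:>3} labels: {pairwise:>7} pairwise passes, {joint} joint passes")
```

Under these assumptions the pairwise count grows linearly with the size of the label set, while the joint count depends only on the number of texts. The sketch ignores the longer input sequence that joint encoding requires, so it overstates the practical gap.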
