Feature Story
Lightweight Safety Classification Using Pruned Language Models
Dec 19, 2024 · arxiv.org

The findings suggest that a single general-purpose LLM can effectively classify content safety, detect prompt injections, and generate output tokens simultaneously. Alternatively, these smaller LLMs can be pruned to their optimal intermediate layer to function solely as feature extractors. The consistent results across different transformer architectures imply that robust feature extraction is a fundamental capability of most LLMs.
Key takeaways
- The Layer Enhanced Classification (LEC) technique combines a Penalized Logistic Regression classifier with the hidden state of an LLM's optimal intermediate transformer layer for improved content safety and prompt injection classification.
- LEC outperforms GPT-4o and specialized models, achieving superior performance even when built on small general-purpose models and encoder architectures like DeBERTa v3.
- Intermediate transformer layers are more effective than final layers for classification tasks, allowing simple classifiers to be trained on fewer than 100 high-quality examples.
- Robust feature extraction is a common capability across different transformer architectures, suggesting that a single general-purpose LLM can handle multiple tasks efficiently.
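The core of the LEC recipe is training a simple penalized classifier on hidden states pulled from an intermediate transformer layer. The sketch below illustrates that second stage only, using synthetic vectors as a stand-in for the pooled intermediate-layer activations (extracting real hidden states from an LLM is assumed, not shown); the gradient-descent logistic regression with an L2 penalty is a minimal illustration, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for intermediate-layer hidden states: in LEC these would be
# activations from an LLM's optimal intermediate transformer layer.
# (Synthetic features here, purely for illustration.)
n, d = 100, 64  # fewer than 100 labeled examples, per the takeaways above
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)  # safe/unsafe labels

def train_penalized_logreg(X, y, lam=0.1, lr=0.1, steps=500):
    """L2-penalized logistic regression fit by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigmoid probabilities
        grad = X.T @ (p - y) / len(y) + lam * w     # logistic loss + L2 penalty
        w -= lr * grad
    return w

w = train_penalized_logreg(X, y)
acc = ((X @ w > 0).astype(float) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the classifier is this small, the expensive part of inference is a single truncated forward pass to the chosen layer, which is what makes pruning the model at that layer attractive.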