Feature Story
Lightweight Safety Classification Using Pruned Language Models
Dec 19, 2024 · arxiv.org

The findings suggest that a single general-purpose LLM can effectively classify content safety, detect prompt injections, and generate output tokens simultaneously. Alternatively, these smaller LLMs can be pruned to their optimal intermediate layer to function solely as feature extractors. The consistent results across different transformer architectures imply that robust feature extraction is a fundamental capability of most LLMs.
Key takeaways
- The Layer Enhanced Classification (LEC) technique combines a Penalized Logistic Regression classifier with the hidden state of an LLM's optimal intermediate transformer layer for improved content safety and prompt injection classification.
- LEC outperforms GPT-4o and specialized models, achieving superior performance even when built on small general-purpose models and encoder architectures like DeBERTa v3.
- Intermediate transformer layers are more effective than final layers for classification tasks, allowing simple classifiers to be trained on fewer than 100 high-quality examples.
- Robust feature extraction is a common capability across different transformer architectures, suggesting that a single general-purpose LLM can handle multiple tasks efficiently.
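The core of the LEC recipe is training a simple penalized classifier on hidden states pulled from an intermediate transformer layer. The sketch below illustrates that second stage only, using synthetic vectors as a stand-in for the pooled intermediate-layer activations (extracting real hidden states from an LLM is assumed, not shown); the gradient-descent logistic regression with an L2 penalty is a minimal illustration, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for intermediate-layer hidden states: in LEC these would be
# activations from an LLM's optimal intermediate transformer layer.
# (Synthetic features here, purely for illustration.)
n, d = 100, 64  # fewer than 100 labeled examples, per the takeaways above
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)  # safe/unsafe labels

def train_penalized_logreg(X, y, lam=0.1, lr=0.1, steps=500):
    """L2-penalized logistic regression fit by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigmoid probabilities
        grad = X.T @ (p - y) / len(y) + lam * w     # logistic loss + L2 penalty
        w -= lr * grad
    return w

w = train_penalized_logreg(X, y)
acc = ((X @ w > 0).astype(float) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the classifier is this small, the expensive part of inference is a single truncated forward pass to the chosen layer, which is what makes pruning the model at that layer attractive.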