distilbert-base-uncased is a compact, faster alternative to BERT produced by knowledge distillation. It retains about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. The model was pretrained on English Wikipedia and BookCorpus with BERT base as the teacher, using three objectives: distillation loss, masked language modeling (MLM), and cosine embedding loss. It is uncased (it treats "english" and "English" identically) and is suited to a wide range of downstream NLP tasks such as sequence classification, token classification, and question answering. While efficient, it inherits the biases present in the original BERT model. DistilBERT is released under the Apache 2.0 license and is compatible with PyTorch, TensorFlow, and JAX.
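As a minimal sketch (assuming the `transformers` library and a PyTorch backend are installed), the pretrained MLM head can be exercised directly through the fill-mask pipeline:

```python
# Minimal sketch: querying the pretrained masked-language-modeling head.
# Assumes `pip install transformers torch`.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="distilbert-base-uncased")

# The uncased tokenizer lowercases input, so casing in the prompt does not matter.
for prediction in unmasker("Hello, I'm a [MASK] model."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```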
Features
- 40% smaller and 60% faster than BERT base
- Trained with distillation, MLM, and cosine loss
- Achieves 97% of BERT's performance on GLUE benchmarks
- Pretrained on BookCorpus and English Wikipedia
- Uncased: capitalization is ignored
- Ideal for fine-tuning on classification and QA tasks (see the sketch after this list)
- Available for PyTorch, TensorFlow, and JAX
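The sketch below shows one way to set up the model for fine-tuning on sequence classification, assuming `transformers` and `torch` are installed; the two-label setup and the example sentences are illustrative choices, not part of the released model.

```python
# Minimal sketch: preparing distilbert-base-uncased for fine-tuning on a
# binary sequence classification task (num_labels=2 is an illustrative choice).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Tokenize a small illustrative batch; padding/truncation give uniform tensors.
inputs = tokenizer(
    ["The movie was great!", "The plot made no sense."],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([1, 0])

# When labels are passed, the forward pass returns a loss, so the model can be
# dropped into a standard training loop or the Trainer API.
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)
```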