DINOv2 is a self-supervised learning framework for computer vision that produces strong, general-purpose image representations without human labels. It builds on the student–teacher self-distillation approach of DINO and adapts it to modern Vision Transformer (ViT) backbones with a carefully tuned recipe for data augmentation, optimization, and multi-crop training. The core promise is that a single pretrained backbone transfers well to many downstream tasks (linear-probe classification, retrieval, detection, and segmentation), often with little or no fine-tuning. The repository includes code for training, evaluation, and feature extraction, along with utilities for running k-NN and linear-probe baselines to assess representation quality. Pretrained checkpoints cover multiple model sizes, so practitioners can trade accuracy for speed and memory depending on their deployment constraints.
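The student–teacher distillation mentioned above can be illustrated with a minimal, dependency-free sketch in the style of DINO. It is not the repository's implementation: the function names, the temperatures (0.1 and 0.04), and the momentum (0.996) are assumptions chosen for illustration, and a real training loop operates on batched tensors over several augmented crops. The idea shown is that the student is trained to match a centered, sharpened teacher distribution, while the teacher's weights track the student through an exponential moving average instead of receiving gradients.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions.

    The teacher output is centered (subtract a running mean) and
    sharpened (low temperature). Temperatures here are illustrative.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy usage: three output dimensions, zero center.
center = [0.0, 0.0, 0.0]
aligned = distillation_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
mismatched = distillation_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

In this toy setup the loss is small when the student agrees with the teacher and large when it does not, which is the signal that drives the student's updates.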
Features
- Self-supervised training recipe for ViT backbones using student–teacher distillation
- Strong, task-agnostic features that transfer to classification, retrieval, and segmentation
- Ready-to-use pretrained weights at multiple model scales
- Baseline evaluation scripts for linear probes and k-NN classifiers
- Feature extraction utilities for downstream pipelines and nearest-neighbor search
- Reproducible configs and training utilities for large-scale pretraining
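As a rough illustration of the k-NN baseline listed above, the sketch below classifies a query feature vector by cosine similarity against a small bank of labeled features. It assumes the features have already been extracted by a frozen backbone. The helper names `l2_normalize` and `knn_predict` are hypothetical, and the plain majority vote used here may differ from the voting scheme in the repository's own evaluation script.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_predict(query, bank_features, bank_labels, k=3):
    """Predict a label by majority vote over the k most similar bank features."""
    q = l2_normalize(query)
    scored = []
    for feat, label in zip(bank_features, bank_labels):
        f = l2_normalize(feat)
        cosine = sum(a * b for a, b in zip(q, f))
        scored.append((cosine, label))
    # Highest cosine similarity first, then keep the top k neighbors.
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]


# Toy feature bank standing in for embeddings from a pretrained backbone.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([1.0, 0.05], bank, labels, k=3)
```

Because the backbone stays frozen, a check like this measures the quality of the representation itself rather than the effect of fine-tuning.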