clip-vit-base-patch16 is a vision-language model by OpenAI designed for zero-shot image classification. It aligns images and text in a shared embedding space, using a Vision Transformer (ViT-B/16) as the image encoder and a Transformer with masked self-attention as the text encoder, and is trained with a contrastive loss on large-scale web-sourced (image, caption) pairs. Because it can relate text and images without task-specific fine-tuning, it generalizes broadly across domains and is commonly used in research on robustness, generalization, and semantic alignment across modalities.

Despite strong benchmark results, CLIP struggles with fine-grained classification and object counting, and it exhibits fairness issues across demographic groups. Its biases are shaped by the composition of the training data and the design of the class labels, particularly with respect to race and gender. The model is not intended for deployment without careful in-domain testing and is unsuitable for surveillance or face recognition.
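As a rough sketch of how this shared embedding space yields a zero-shot classifier, the snippet below encodes one image and a few candidate prompts with the Hugging Face `transformers` CLIP classes, L2-normalizes both sets of embeddings, and treats the scaled cosine similarities as class logits. The image path and prompt strings are placeholders chosen for illustration, not part of the model card.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg")  # placeholder image path (assumption)
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # illustrative labels

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize so the dot product is a cosine similarity in the shared space
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Scale by the learned temperature and turn similarities into label probabilities
logits = model.logit_scale.exp() * image_emb @ text_emb.T
probs = logits.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```

The label whose prompt embedding lies closest to the image embedding receives the highest probability; that comparison is the entire "classifier", with no task-specific fine-tuning involved.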
Features
- Zero-shot classification by comparing image-text similarity
- Uses ViT-B/16 (Vision Transformer) architecture
- Processes and embeds both text and image inputs
- Outputs scaled cosine similarities between image and text embeddings as logits
- Trained on 400M+ (image, text) pairs from web data
- Pretrained on English-language text and publicly available web imagery
- Evaluated on 30+ vision datasets, such as ImageNet and CIFAR
- Supports inference via Hugging Face Transformers and CLIPProcessor (see the usage sketch after this list)
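For the common case, the `CLIPModel` forward pass computes the image-text logits directly, so the manual normalization shown earlier is not needed. A minimal usage sketch, again with an assumed local image file and illustrative candidate labels:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg")  # placeholder image path (assumption)
labels = ["a photo of a cat", "a photo of a dog"]  # illustrative candidate labels

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one similarity score per (image, label) pair
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```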