clip-vit-large-patch14-336 is a vision-language model developed by OpenAI as part of the CLIP (Contrastive Language–Image Pre-training) family. It pairs a Vision Transformer (ViT-L) image encoder, which splits 336×336 inputs into 14×14 pixel patches, with a text encoder, and learns joint representations of images and text through contrastive pre-training. Although the exact training data is undisclosed, the model was trained from scratch and enables powerful zero-shot classification by aligning visual and textual features in a shared embedding space. Users can apply it to tasks such as zero-shot image recognition, text-based image search, or image-text similarity scoring, without any task-specific training.
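A minimal sketch of zero-shot classification with this checkpoint via Hugging Face Transformers is shown below; the model id "openai/clip-vit-large-patch14-336" is the published checkpoint name, while the image file "cat.jpg" and the label prompts are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("cat.jpg")  # illustrative local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

No label set is fixed in advance: the candidate classes are just text prompts, which is what makes the classification zero-shot.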
Features
- Vision Transformer architecture with 336×336 input resolution
- Supports zero-shot image classification and retrieval
- Joint image-text embedding space for multi-modal tasks (see the retrieval sketch after this list)
- Compatible with Hugging Face Transformers and PyTorch
- Fine-tunable for domain-specific vision-language tasks
- Base for many fine-tuned adapters and visual apps
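The sketch below uses the shared embedding space for simple text-to-image retrieval. It assumes a small in-memory list of images; the file names and the query string are illustrative, and only the standard CLIPModel feature-extraction methods are used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

images = [Image.open(p) for p in ["img1.jpg", "img2.jpg", "img3.jpg"]]  # illustrative files
query = "a dog playing in the snow"

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# L2-normalize both sides so the dot product is cosine similarity, then rank images
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: image {best} (score {scores[best]:.3f})")
```

Because images and text land in the same embedding space, the same pattern works in the other direction (image-to-text matching) by ranking candidate captions against one image embedding.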
Categories
AI Models