LLaVA
LLaVA (Large Language-and-Vision Assistant) is a multimodal model that connects a vision encoder to the Vicuna language model for general-purpose visual and language understanding. Trained end to end, it exhibits strong chat capabilities, mirroring the multimodal behavior of models such as GPT-4. Notably, LLaVA-1.5 achieved state-of-the-art results across 11 benchmarks using only publicly available data and roughly one day of training on a single 8-A100 node, outperforming methods that rely on billion-scale datasets. LLaVA's development also produced a multimodal instruction-following dataset generated with language-only GPT-4: 158,000 unique language-image instruction-following samples spanning conversations, detailed descriptions, and complex reasoning tasks, which were instrumental in training the model across a wide range of visual and language tasks.
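A minimal sketch of how a LLaVA-style pipeline is typically queried, assuming the community-hosted LLaVA-1.5 checkpoint on Hugging Face; the checkpoint id, image path, and prompt template are assumptions for illustration and are not part of this description.

```python
# Minimal sketch: querying a LLaVA-1.5 checkpoint with transformers.
# Checkpoint id, image path, and prompt template are assumptions based on
# the community-hosted Hugging Face release.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
# LLaVA-1.5 expects an <image> placeholder inside a USER/ASSISTANT turn.
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```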
Learn more
SmolVLM
SmolVLM-Instruct is a compact multimodal model that combines vision and language processing for tasks such as image captioning, visual question answering, and multimodal storytelling. It accepts both text and image inputs and is optimized for efficiency in smaller, resource-constrained environments. Built with SmolLM2 as its text decoder and SigLIP as its image encoder, the model performs well on tasks that require integrating textual and visual information. SmolVLM-Instruct can also be fine-tuned for specific applications, giving businesses and developers a versatile tool for building interactive systems that take multimodal inputs.
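A minimal usage sketch, assuming the Hugging Face-hosted SmolVLM-Instruct checkpoint and its chat-template interface; the checkpoint id, image path, and question are placeholders, not details from this page.

```python
# Minimal sketch: image question answering with SmolVLM-Instruct.
# Checkpoint id, image path, and question are assumptions for illustration.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
# The chat template interleaves image tokens with the text prompt.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```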
Learn more
PaliGemma 2
PaliGemma 2, the next evolution in tunable vision-language models, builds on the Gemma 2 models, adding vision capabilities and making fine-tuning for strong task performance easier than ever. With PaliGemma 2, these models can see, understand, and interact with visual input, opening up a world of new possibilities. It offers scalable performance across multiple model sizes (3B, 10B, and 28B parameters) and input resolutions (224px, 448px, and 896px). PaliGemma 2 generates detailed, contextually relevant image captions, going beyond simple object identification to describe actions, emotions, and the overall narrative of a scene. The accompanying technical report documents leading performance in chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation. For existing PaliGemma users, upgrading to PaliGemma 2 is designed to be straightforward.
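A minimal captioning sketch, assuming the 3B / 224px PaliGemma 2 checkpoint on Hugging Face; the checkpoint id, the task-prefix prompt, and the image path are assumptions, and the exact prompt format can vary between variants and transformers versions, so the model card is the authoritative reference.

```python
# Minimal sketch: image captioning with a PaliGemma 2 checkpoint.
# Checkpoint id, task-prefix prompt ("caption en"), and image path are
# assumptions for illustration; consult the model card for exact usage.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # assumed 3B / 224px variant
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "<image>caption en"  # PaliGemma models use short task prefixes

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
# Strip the prompt tokens so only the generated caption is printed.
new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```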
Learn more
Eyewey
Eyewey lets you train your own models, access pre-trained computer vision models and app templates, and learn how to build AI apps or solve a business problem with computer vision in a couple of hours. Start by creating a detection dataset with images of the object you want to train on; each dataset can hold up to 5,000 images. Once images are added, they are automatically pushed into training, and you are notified when the model finishes. You can then download the model for use in detection, or integrate it into the pre-existing app templates for quick development. The companion mobile app, available on both Android and iOS, uses computer vision to assist people with complete blindness in their day-to-day lives: it can alert users to hazardous objects and signs, detect common objects, recognize text and currencies, and understand basic scenes through deep learning.
Learn more