  • 1
    speaker-diarization-3.1

    Speaker diarization pipeline fully in PyTorch, no ONNX required

    speaker-diarization-3.1 is a state-of-the-art speaker diarization pipeline built with pyannote.audio 3.1, fully implemented in PyTorch for easier deployment and faster inference by removing reliance on ONNX. It processes mono audio sampled at 16kHz (resampling and downmixing handled automatically) and outputs speaker annotations in RTTM format. The pipeline performs speaker segmentation and embedding, allowing for optional specification or estimation of the number of speakers. It supports GPU acceleration and in-memory waveform processing. Designed for fully automatic operation—no need for manual VAD, speaker count, or fine-tuning—the model has been benchmarked across multiple datasets like AMI, DIHARD, and VoxConverse using strict diarization error rate (DER) metrics. It demonstrates robust performance in realistic, overlapping, and noisy audio environments.
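
    A minimal usage sketch of the pipeline described above, assuming pyannote.audio 3.1+ is installed, the model's gated access has been accepted, and a Hugging Face token is available; the file names and token are placeholders:

    import torch
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN")
    pipeline.to(torch.device("cuda"))  # optional GPU acceleration

    # num_speakers is optional; the pipeline can also estimate it automatically.
    diarization = pipeline("meeting.wav", num_speakers=2)

    with open("meeting.rttm", "w") as f:
        diarization.write_rttm(f)  # speaker annotations in RTTM format
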
  • 2
    stable-diffusion-2-1

    Latent diffusion model for high-quality text-to-image generation

    Stable Diffusion 2.1 is a text-to-image generation model developed by Stability AI, building on the 768-v architecture with additional fine-tuning for improved safety and image quality. It uses a latent diffusion framework that operates in a compressed image space, enabling faster and more efficient image synthesis while preserving detail. The model is conditioned on text prompts via the OpenCLIP-ViT/H encoder and supports generation at resolutions up to 768×768. Released under the OpenRAIL++ license, it permits research and commercial use with specific content restrictions. Stable Diffusion 2.1 is designed for creative tasks such as digital art, design prototyping, and educational tools, but is not suitable for generating factual representations or non-English content. The model was trained on filtered subsets of LAION-5B, with additional steps to reduce NSFW content.
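
    A minimal text-to-image sketch, assuming the diffusers library and a CUDA GPU are available; the prompt and output file name are illustrative:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    image = pipe("a photo of an astronaut riding a horse on mars").images[0]
    image.save("astronaut.png")
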
  • 3
    stable-diffusion-3-medium

    Efficient text-to-image model with enhanced quality and typography

    Stable Diffusion 3 Medium is a next-generation text-to-image model by Stability AI, designed using a Multimodal Diffusion Transformer (MMDiT) architecture. It offers notable improvements in image quality, prompt comprehension, typography, and computational efficiency over previous versions. The model integrates three fixed, pretrained text encoders—OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL—to interpret complex prompts more effectively. Trained on 1 billion synthetic and filtered public images, it was fine-tuned on 30 million high-quality aesthetic images and 3 million preference-labeled samples. SD3 Medium is optimized for both local deployment and cloud API use, with support via ComfyUI, Diffusers, and other tooling. It is distributed under the Stability AI Community License, permitting research and commercial use for organizations under $1M in annual revenue. While equipped with safety mitigations, developers are encouraged to apply additional safeguards.
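
    A minimal sketch, assuming a recent diffusers release with SD3 support and that the gated "stabilityai/stable-diffusion-3-medium-diffusers" repository (the exact repo id is an assumption) is accessible; prompt and settings are illustrative:

    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    image = pipe(
        "a cat holding a sign that says hello world",
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    image.save("sd3_cat.png")
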
  • 4
    stable-diffusion-3.5-large

    Advanced MMDiT text-to-image model for high-quality visual generation

    Stable Diffusion 3.5 Large is a multimodal diffusion transformer (MMDiT) developed by Stability AI, designed for generating high-quality images from text prompts. It integrates three pretrained text encoders—OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL—with QK-normalization for improved training stability and prompt understanding. This model excels in handling typography, detailed scenes, and creative compositions while maintaining resource efficiency. It supports inference via ComfyUI, Hugging Face Diffusers, and various APIs, and is compatible with quantization techniques for low-VRAM deployment. Stable Diffusion 3.5 Large is trained on filtered public and synthetic datasets, with a focus on aesthetic quality and prompt adherence. It is released under the Stability AI Community License, free for non-commercial use by entities with under $1M in annual revenue. Safety mitigations have been implemented during training, but developers are advised to conduct their own testing.
  • 5
    stable-diffusion-inpainting

    Latent text-to-image model for high-quality inpainting from prompts

    Stable Diffusion Inpainting is a powerful text-to-image latent diffusion model designed specifically for inpainting tasks, allowing users to modify or regenerate parts of images using text prompts and masks. Based on the Stable Diffusion v1.2 architecture, it was further fine-tuned with 440k steps of inpainting-specific training on the LAION-Aesthetics v2 5+ dataset. The model takes an image, a binary mask, and a descriptive prompt to realistically fill in masked regions while keeping the surrounding content intact. Its UNet architecture was adapted with 5 additional input channels to handle encoded masked images and masks. The model supports use through the Hugging Face diffusers library and tools like AUTOMATIC1111, offering accessible integration. Though highly capable, the model retains the original limitations of Stable Diffusion, such as struggles with text rendering, compositional logic, and demographic bias.
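
    A minimal inpainting sketch, assuming diffusers and Pillow are installed; "photo.png", "mask.png", and the repository id are assumptions for illustration (white mask pixels mark the region to repaint):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",  # assumed repo id
        torch_dtype=torch.float16)
    pipe = pipe.to("cuda")

    init_image = Image.open("photo.png").convert("RGB").resize((512, 512))
    mask_image = Image.open("mask.png").convert("L").resize((512, 512))

    result = pipe(prompt="a vase of flowers on the table",
                  image=init_image, mask_image=mask_image).images[0]
    result.save("inpainted.png")
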
  • 6
    stable-diffusion-v-1-4-original

    Stable Diffusion v1.4 generates photorealistic images from text prompts

    Stable Diffusion v1.4 is a latent diffusion model that generates images from text, trained at 512×512 resolution using the LAION-Aesthetics v2 5+ dataset. Built on the weights of v1.2, it uses a CLIP ViT-L/14 encoder to guide image generation through cross-attention mechanisms. It supports classifier-free guidance by dropping 10% of text conditioning during training, enhancing creative control. The model runs efficiently while producing visually coherent and high-quality results, though it struggles with compositional prompts, fine details, and photorealistic faces. Stable Diffusion v1.4 primarily supports English and may underperform in other languages. It is licensed under CreativeML OpenRAIL-M and is intended for research and creative use, not for generating factual or identity-representative content. Developers emphasize safety, bias awareness, and the importance of responsible deployment due to its training on unfiltered web data.
  • 7
    stable-diffusion-v1-5

    Text-to-image diffusion model for high-quality image generation

    Stable Diffusion v1-5 is a latent text-to-image diffusion model capable of producing high-quality, photo-realistic images from natural language prompts. It builds upon the v1.2 checkpoint and was fine-tuned with 595,000 additional steps at 512x512 resolution on the “laion-aesthetics v2 5+” dataset. This model improves generation fidelity through classifier-free guidance sampling, including 10% prompt dropout during training. It leverages a CLIP ViT-L/14 text encoder and a UNet-based diffusion architecture operating in latent space to enable fast and efficient image synthesis. Stable Diffusion v1-5 is compatible with Diffusers, ComfyUI, AUTOMATIC1111, and other user interfaces. Its intended use is for research and creative applications such as digital art, design, and exploration of generative models. While powerful, it has known limitations with photorealism, compositionality, and cultural representation, and requires responsible usage under the CreativeML OpenRAIL-M license.
  • 8
    stable-diffusion-xl-base-1.0

    Advanced base model for high-quality text-to-image generation

    stable-diffusion-xl-base-1.0 is a next-generation latent diffusion model developed by Stability AI for producing highly detailed images from text prompts. It forms the core of the SDXL pipeline and can be used on its own or paired with a refinement model for enhanced results. This base model utilizes two pretrained text encoders—OpenCLIP-ViT/G and CLIP-ViT/L—for richer text understanding and improved image quality. The model supports two-stage generation, where the base model creates initial latents and the refiner further denoises them using techniques like SDEdit for sharper outputs. SDXL-base shows significant performance improvement over previous versions such as Stable Diffusion 1.5 and 2.1, especially when paired with the refiner. It is compatible with PyTorch, ONNX, and OpenVINO runtimes, offering flexibility for various hardware setups. Although it delivers high visual fidelity, it still faces challenges with complex composition, photorealism, and rendering legible text.
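
    A minimal sketch of the two-stage base-plus-refiner flow described above, assuming diffusers is installed and the companion "stabilityai/stable-diffusion-xl-refiner-1.0" checkpoint is available; the 80/20 split point and prompt are illustrative:

    import torch
    from diffusers import DiffusionPipeline

    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")

    prompt = "a majestic lion jumping from a big stone at night"
    # Stage 1: the base model produces latents for the first 80% of the steps.
    latents = base(prompt=prompt, num_inference_steps=40,
                   denoising_end=0.8, output_type="latent").images
    # Stage 2: the refiner denoises the remaining steps for a sharper result.
    image = refiner(prompt=prompt, num_inference_steps=40,
                    denoising_start=0.8, image=latents).images[0]
    image.save("sdxl_lion.png")
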
  • 9
    stable-video-diffusion-img2vid-xt

    Generates high-quality short videos from a single still image input

    Stable Video Diffusion Img2Vid XT is an advanced image-to-video latent diffusion model developed by Stability AI, designed to generate short video clips from a single static image. It produces 25 frames at 576x1024 resolution, offering improved temporal consistency by fine-tuning from an earlier 14-frame version. The model operates without text prompts and instead uses a single input frame to guide visual generation, making it ideal for stylized motion or animation. It includes both a standard frame-wise decoder and a fine-tuned f8-decoder to enhance coherence across frames. Despite its high quality, output videos are short (under 4 seconds) and not always fully photorealistic. Faces, text, and realistic motion may be inconsistently rendered, and the model cannot generate legible writing. It is suited for creative video generation, research, and educational applications under a community license, with image-level watermarking enabled by default.
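
    A minimal image-to-video sketch, assuming a diffusers release with Stable Video Diffusion support; the conditioning image and output path are placeholders:

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16")
    pipe = pipe.to("cuda")

    image = load_image("input_frame.png").resize((1024, 576))
    frames = pipe(image, decode_chunk_size=8).frames[0]  # 25 generated frames
    export_to_video(frames, "generated.mp4", fps=7)
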
  • 10
    starcoder

    Code generation model trained on 80+ languages with FIM support

    StarCoder is a 15.5B parameter language model developed by BigCode for code generation tasks across more than 80 programming languages. It is trained on 1 trillion tokens from the permissively licensed dataset The Stack v1.2, using the Fill-in-the-Middle (FIM) objective and Multi-Query Attention to enhance performance. With an extended context window of 8192 tokens and pretraining in bfloat16, StarCoder can generate, complete, or refactor code in various languages, with English as the primary natural language. While it is not an instruction-tuned model, it can act as a capable technical assistant when prompted appropriately. Developers can use it for general-purpose code generation, with fine control over prefix/middle/suffix tokens. The model has some limitations: generated code may contain bugs or licensing constraints, and attribution must be observed when output resembles training data. StarCoder is licensed under the BigCode OpenRAIL-M license.
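
    A minimal sketch of the fill-in-the-middle prompting mentioned above, assuming transformers (plus accelerate for device_map) and access to the gated "bigcode/starcoder" checkpoint; the code fragment is illustrative:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "bigcode/starcoder"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

    # FIM prompt: the model generates what belongs between prefix and suffix.
    prompt = "<fim_prefix>def print_hello():\n    <fim_suffix>\n    return None<fim_middle>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0]))
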
  • 11
    t5-base

    Flexible text-to-text transformer model for multilingual NLP tasks

    t5-base is a pre-trained transformer model from Google’s T5 (Text-To-Text Transfer Transformer) family that reframes all NLP tasks into a unified text-to-text format. With 220 million parameters, it can handle a wide range of tasks, including translation, summarization, question answering, and classification. Unlike traditional models like BERT, which output class labels or spans, T5 always generates text outputs. It was trained on the C4 dataset, along with a variety of supervised NLP benchmarks, using both unsupervised denoising and supervised objectives. The model supports multiple languages, including English, French, Romanian, and German. Its flexible architecture and consistent input/output format simplify model reuse and transfer learning across different NLP tasks. T5-base achieves competitive performance across 24 language understanding tasks, as documented in its research paper.
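
    A minimal text-to-text sketch, assuming transformers and sentencepiece are installed; the task prefix follows T5's convention of expressing every task as plain text:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    inputs = tokenizer("translate English to German: The house is wonderful.",
                       return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
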
  • 12
    t5-small

    T5-Small: Lightweight text-to-text transformer for NLP tasks

    T5-Small is a lightweight variant of the Text-To-Text Transfer Transformer (T5), designed to handle a wide range of NLP tasks using a unified text-to-text approach. Developed by researchers at Google, this model reframes all tasks—such as translation, summarization, classification, and question answering—into the format of input and output as plain text strings. With only 60 million parameters, T5-Small is compact and suitable for fast inference or deployment in constrained environments. It was pretrained on the C4 dataset using both unsupervised denoising and supervised learning on tasks like sentiment analysis, NLI, and QA. Despite its size, it performs competitively across 24 NLP benchmarks, making it a strong candidate for prototyping and fine-tuning. T5-Small is compatible with major deep learning frameworks including PyTorch, TensorFlow, JAX, and ONNX. The model is open-source under the Apache 2.0 license and has wide support across Hugging Face's ecosystem.
  • 13
    table-transformer-detection

    Transformer model for detecting tables in document images

    table-transformer-detection is a fine-tuned DETR-based model by Microsoft for detecting tables in document images. Built on the Transformer architecture, it was trained on the PubTables1M dataset and excels at locating tabular structures in unstructured documents like PDFs. The model leverages the "normalize before" variant of DETR, applying layer normalization before attention layers. With 28.8 million parameters, it performs end-to-end object detection specific to tables without requiring handcrafted features. It is particularly useful in document understanding tasks where precise table extraction is critical. While Hugging Face provided the model card, the original authors released the training setup and paper. The model is implemented in PyTorch and uses Safetensors format for safe and efficient storage.
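
    A minimal detection sketch, assuming transformers, torch, and Pillow are installed; "page.png" is a placeholder document image and the 0.7 threshold is illustrative:

    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    model_id = "microsoft/table-transformer-detection"
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = TableTransformerForObjectDetection.from_pretrained(model_id)

    image = Image.open("page.png").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw predictions into thresholded boxes in image coordinates.
    target_sizes = torch.tensor([image.size[::-1]])
    detections = processor.post_process_object_detection(
        outputs, threshold=0.7, target_sizes=target_sizes)[0]
    for score, box in zip(detections["scores"], detections["boxes"]):
        print(f"table detected (score {score:.2f}) at {box.tolist()}")
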
  • 14
    twitter-roberta-base-sentiment-latest

    RoBERTa model for English sentiment analysis on Twitter data

    twitter-roberta-base-sentiment-latest is a RoBERTa-based transformer model fine-tuned on over 124 million tweets collected between 2018 and 2021. Designed for sentiment analysis in English, it categorizes tweets as Negative, Neutral, or Positive. The model is optimized using the TweetEval benchmark and integrated with the TweetNLP ecosystem for seamless deployment. Its training emphasizes real-world, social media content, making it highly effective for analyzing informal or noisy text. This updated version improves performance over earlier Twitter sentiment models. It supports both PyTorch and TensorFlow and includes example pipelines for quick implementation. With strong classification accuracy and ease of use, it’s ideal for social media monitoring, brand sentiment tracking, and public opinion research.
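
    A minimal sketch using the transformers pipeline API, assuming the "cardiffnlp/twitter-roberta-base-sentiment-latest" repository id; the example tweet is illustrative:

    from transformers import pipeline

    sentiment = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest")

    print(sentiment("Covid cases are increasing fast!"))
    # e.g. [{'label': 'negative', 'score': 0.7...}]
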
  • 15
    unidepth-v2-vitl14

    Metric monocular depth estimation (vision model)

    Estimates absolute (metric) depth from single RGB images, along with camera intrinsics and uncertainty. Designed to generalize across domains (zero-shot) using a self‑prompting camera module and pseudo-spherical prediction space.
  • 16
    vit-age-classifier

    Vision Transformer model fine-tuned for facial age classification

    vit-age-classifier is a Vision Transformer (ViT) model fine-tuned by nateraw to classify a person's age based on their facial image. Trained on the FairFace dataset, the model predicts age group categories using facial features with high accuracy. It leverages the robust image representation capabilities of ViT for fine-grained facial analysis. With 85.8 million parameters, the model operates efficiently for image classification tasks on faces. The model outputs probabilities for predefined age classes and is compatible with Hugging Face’s transformers library using ViTFeatureExtractor. It's suitable for integration into pipelines for demographic analysis, social science research, or personalized UI experiences. However, users should be aware of dataset bias and ethical implications when deploying facial analysis models.
  • 17
    vit-base-patch16-224

    Transformer model for image classification with patch-based input.

    vit-base-patch16-224 is a Vision Transformer (ViT) model developed by Google for image classification tasks. It was pretrained on ImageNet-21k (14 million images, 21,843 classes) and fine-tuned on ImageNet-1k (1 million images, 1,000 classes), both using 224x224 resolution. The model treats images as sequences of 16x16 pixel patches, which are linearly embedded and processed using a transformer encoder. A special [CLS] token is used to summarize the image for classification. ViT learns high-quality representations that can be adapted to downstream visual tasks with minimal additional training. This model has 86.6 million parameters and is available in PyTorch, TensorFlow, and JAX implementations. While the model card was written by Hugging Face, the weights were originally converted from JAX to PyTorch by the community.
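
    A minimal classification sketch, assuming transformers, torch, and Pillow are installed; "cat.jpg" is a placeholder image:

    import torch
    from PIL import Image
    from transformers import ViTForImageClassification, ViTImageProcessor

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
    model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

    image = Image.open("cat.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    predicted = logits.argmax(-1).item()
    print(model.config.id2label[predicted])  # one of the 1,000 ImageNet-1k labels
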
  • 18
    vit-base-patch16-224-in21k

    Base Vision Transformer pretrained on ImageNet-21k at 224x224

    vit-base-patch16-224-in21k is a base-sized Vision Transformer (ViT) model pretrained by Google on the large-scale ImageNet-21k dataset, comprising 14 million images across over 21,000 classes. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." This model uses 16x16 image patches and absolute positional embeddings, turning image classification into a token-based sequence modeling task akin to NLP transformers. While it lacks task-specific fine-tuned heads, it provides strong image representations useful for transfer learning and feature extraction. The model is compatible with PyTorch, TensorFlow, and JAX, and includes a pretrained pooler that facilitates downstream use cases. It is typically used by adding a linear classification head on top of the [CLS] token's output. The ViT architecture demonstrated that transformers, when scaled and trained properly, can match or exceed convolutional models in image recognition.
  • 19
    vitmatte-small-composition-1k

    Lightweight ViT-based model for accurate image matting tasks

    vitmatte-small-composition-1k is a Vision Transformer (ViT)-based model for image matting, trained on the Composition-1k dataset. Image matting involves separating foreground objects from the background in a detailed and visually precise way, particularly useful in photo editing and compositing. This model uses a plain ViT backbone with a lightweight head, showcasing strong performance without requiring complex, handcrafted architectures. Introduced in the paper ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers, the model leverages the representational power of pretrained transformers for pixel-level foreground prediction. Its simplicity and efficiency make it suitable for research, content creation, and visual effects pipelines. It can be fine-tuned or used directly for foreground extraction tasks with high fidelity. The model is released under an Apache-2.0 license and is easy to integrate into vision workflows via PyTorch.
  • 20
    voice-activity-detection

    Detects speech activity in audio using pyannote.audio 2.1 pipeline

    The voice-activity-detection model by pyannote is a neural pipeline for detecting when speech occurs in audio recordings. Built on pyannote.audio 2.1, it identifies segments of active speech within any audio file, making it valuable for preprocessing tasks like transcription, diarization, or voice-controlled systems. The model was trained using datasets such as AMI, DIHARD, and VoxConverse, and it requires users to authenticate via Hugging Face for access. To use the model, users must accept usage conditions and provide a Hugging Face access token. Once initialized, the pipeline returns time-stamped intervals of detected speech. The model is ideal for academic research and production environments seeking high-accuracy voice detection. It is released under the MIT license and supports applications in speech recognition, speaker segmentation, and conversational AI.
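
    A minimal sketch, assuming pyannote.audio 2.1+ is installed and the gated model's usage conditions have been accepted; "audio.wav" and the token are placeholders:

    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/voice-activity-detection", use_auth_token="HF_TOKEN")

    vad = pipeline("audio.wav")
    for segment in vad.get_timeline().support():
        # each segment is a time-stamped interval of detected speech
        print(f"speech from {segment.start:.1f}s to {segment.end:.1f}s")
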
  • 21
    waifu-diffusion

    Waifu Diffusion creates anime-style images from text prompts

    Waifu Diffusion is a text-to-image latent diffusion model fine-tuned on high-quality anime-style artwork using Stable Diffusion as its base. Tailored for anime fans and artists, it allows users to generate detailed and stylized anime images from written prompts. The model performs especially well with common anime tropes and visual features like eye color, hairstyles, and character poses. It integrates seamlessly with the diffusers library and supports fast inference on GPU using PyTorch. Users can run it locally or via web UIs like Gradio or Google Colab for ease of use. The generated outputs are unrestricted in ownership, though usage must comply with the CreativeML OpenRAIL-M license. The project is maintained by independent contributors and builds upon work from Stability AI and NovelAI.
  • 22
    wav2vec2-large-xlsr-53-portuguese

    Portuguese ASR model fine-tuned on XLSR-53 for 16kHz audio input

    wav2vec2-large-xlsr-53-portuguese is an automatic speech recognition (ASR) model fine-tuned on Portuguese using the Common Voice 6.1 dataset. It is based on Facebook’s wav2vec2-large-xlsr-53, a multilingual self-supervised learning model, and is optimized to transcribe Portuguese speech sampled at 16kHz. The model performs well without a language model, though adding one can improve word error rate (WER) and character error rate (CER). It achieves a WER of 11.3% (or 9.01% with LM) on Common Voice test data, demonstrating high accuracy for a single-language ASR model. Inference can be done using HuggingSound or via a custom PyTorch script using Hugging Face Transformers and Librosa. Training scripts and evaluation methods are open source and available on GitHub. It is released under the Apache 2.0 license and intended for ASR tasks in Brazilian Portuguese.
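
    A minimal sketch of the custom Transformers-plus-Librosa route mentioned above; the repository id and audio file name are assumptions for illustration:

    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"  # assumed repo id
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    speech, _ = librosa.load("sample_pt.wav", sr=16_000)  # model expects 16 kHz mono
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids))
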
  • 23
    wav2vec2-large-xlsr-53-russian

    Russian ASR model fine-tuned on Common Voice and CSS10 datasets

    wav2vec2-large-xlsr-53-russian is a fine-tuned automatic speech recognition (ASR) model based on Facebook’s wav2vec2-large-xlsr-53 and optimized for Russian. It was trained using Mozilla’s Common Voice 6.1 and CSS10 datasets to recognize Russian speech with high accuracy. The model operates best with audio sampled at 16kHz and can transcribe Russian speech directly without a language model. It achieves a Word Error Rate (WER) of 13.3% and Character Error Rate (CER) of 2.88% on the Common Voice test set, with even better results when used with a language model. The model supports both PyTorch and JAX and is compatible with the Hugging Face Transformers and HuggingSound libraries. It is ideal for Russian voice transcription tasks in research, accessibility, and interface development. The training was made possible with compute support from OVHcloud, and the training scripts are publicly available for replication.
  • 24
    wespeaker-voxceleb-resnet34-LM

    Speaker embedding model for voice verification and identification

    wespeaker-voxceleb-resnet34-LM is a pretrained speaker embedding model wrapped for use in pyannote.audio (v3.1+), built on the WeSpeaker toolkit and trained on the VoxCeleb dataset. It leverages a ResNet34 architecture and is designed for speaker recognition, verification, and diarization tasks. The model outputs dense embeddings from full audio, excerpts, or sliding windows, allowing for flexible speaker comparison using cosine similarity. Embeddings can be extracted easily with PyTorch and integrated into pipelines for audio processing. Originally released under a CC BY 4.0 license, this model benefits from high-quality data and is suitable for academic and research use, particularly in scenarios where robust speaker identity modeling is required.
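
    A minimal embedding-comparison sketch, assuming pyannote.audio 3.1+, numpy, and scipy are installed; the audio files are placeholders and cosine distance is used to compare the two speakers as described above:

    import numpy as np
    from scipy.spatial.distance import cdist
    from pyannote.audio import Inference, Model

    model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
    inference = Inference(model, window="whole")  # one embedding per whole file

    emb1 = np.atleast_2d(inference("speaker1.wav"))
    emb2 = np.atleast_2d(inference("speaker2.wav"))

    distance = cdist(emb1, emb2, metric="cosine")[0, 0]
    print(f"cosine distance: {distance:.3f}")  # smaller means more similar voices
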
  • 25
    whisper-large-v3

    High-accuracy multilingual speech recognition and translation model

    Whisper-large-v3 is OpenAI’s most advanced multilingual automatic speech recognition (ASR) and speech translation model, featuring 1.54 billion parameters and trained on 5 million hours of labeled and pseudo-labeled audio. Built on a Transformer-based encoder-decoder architecture, it supports 99 languages and delivers significant improvements in transcription accuracy, robustness to noise, and handling of diverse accents. Compared to previous versions, v3 introduces a 128 Mel bin spectrogram input and better support for Cantonese, achieving up to 20% error reduction over Whisper-large-v2. It handles zero-shot transcription and translation, performs language detection automatically, and supports features like word-level timestamps and long-form audio processing. The model integrates well with Hugging Face Transformers and supports optimizations such as batching, SDPA, and Flash Attention 2.
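
    A minimal transcription sketch using the transformers pipeline; "interview.mp3" is a placeholder audio file and GPU use is optional:

    import torch
    from transformers import pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        device=device)

    # return_timestamps=True yields segment-level timestamps alongside the text.
    result = asr("interview.mp3", return_timestamps=True)
    print(result["text"])
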