Browse free open source LLM Inference tools and projects for Windows and Linux below. Use the toggles on the left to filter open source LLM Inference tools by OS, license, language, programming language, and project status.

  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 1
    ONNX Runtime

    ONNX Runtime

    ONNX Runtime: cross-platform, high performance ML inferencing

    ONNX Runtime is a cross-platform inference and training machine-learning accelerator. ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms. ONNX Runtime training can accelerate the model training time on multi-node NVIDIA GPUs for transformer models with a one-line addition for existing PyTorch training scripts. Support for a variety of frameworks, operating systems and hardware platforms. Built-in optimizations that deliver up to 17X faster inferencing and up to 1.4X faster training.
    Downloads: 89 This Week
    Last Update:
    See Project
  • 2
    ncnn

    ncnn

    High-performance neural network inference framework for mobile

    ncnn is a high-performance neural network inference computing framework designed specifically for mobile platforms. It brings artificial intelligence right at your fingertips with no third-party dependencies, and speeds faster than all other known open source frameworks for mobile phone cpu. ncnn allows developers to easily deploy deep learning algorithm models to the mobile platform and create intelligent APPs. It is cross-platform and supports most commonly used CNN networks, including Classical CNN (VGG AlexNet GoogleNet Inception), Face Detection (MTCNN RetinaFace), Segmentation (FCN PSPNet UNet YOLACT), and more. ncnn is currently being used in a number of Tencent applications, namely: QQ, Qzone, WeChat, and Pitu.
    Downloads: 25 This Week
    Last Update:
    See Project
  • 3
    TensorRT

    TensorRT

    C++ library for high performance inference on NVIDIA GPUs

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40X faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded, or automotive product platforms. TensorRT is built on CUDA®, NVIDIA’s parallel programming model, and enables you to optimize inference leveraging libraries, development tools, and technologies in CUDA-X™ for artificial intelligence, autonomous machines, high-performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also leverages sparse tensor cores providing an additional performance boost.
    Downloads: 20 This Week
    Last Update:
    See Project
  • 4
    OpenVINO

    OpenVINO

    OpenVINO™ Toolkit repository

    OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. Boost deep learning performance in computer vision, automatic speech recognition, natural language processing and other common tasks. Use models trained with popular frameworks like TensorFlow, PyTorch and more. Reduce resource demands and efficiently deploy on a range of Intel® platforms from edge to cloud. This open-source version includes several components: namely Model Optimizer, OpenVINO™ Runtime, Post-Training Optimization Tool, as well as CPU, GPU, MYRIAD, multi device and heterogeneous plugins to accelerate deep learning inferencing on Intel® CPUs and Intel® Processor Graphics. It supports pre-trained models from the Open Model Zoo, along with 100+ open source and public models in popular formats such as TensorFlow, ONNX, PaddlePaddle, MXNet, Caffe, Kaldi.
    Downloads: 17 This Week
    Last Update:
    See Project
  • Fully Managed MySQL, PostgreSQL, and SQL Server Icon
    Fully Managed MySQL, PostgreSQL, and SQL Server

    Automatic backups, patching, replication, and failover. Focus on your app, not your database.

    Cloud SQL handles your database ops end to end, so you can focus on your app.
    Try Free
  • 5
    ONNX

    ONNX

    Open standard for machine learning interoperability

    ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers. Open Neural Network Exchange (ONNX) is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional ML. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types. Currently we focus on the capabilities needed for inferencing (scoring). ONNX is widely supported and can be found in many frameworks, tools, and hardware. Enabling interoperability between different frameworks and streamlining the path from research to production helps increase the speed of innovation in the AI community.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 6
    DeepSpeed

    DeepSpeed

    Deep learning optimization library: makes distributed training easy

    DeepSpeed is an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for Deep Learning Training and Inference. With DeepSpeed you can: 1. Train/Inference dense or sparse models with billions or trillions of parameters 2. Achieve excellent system throughput and efficiently scale to thousands of GPUs 3. Train/Inference on resource constrained GPU systems 4. Achieve unprecedented low latency and high throughput for inference 5. Achieve extreme compression for an unparalleled inference latency and model size reduction with low costs DeepSpeed offers a confluence of system innovations, that has made large scale DL training effective, and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of scale that is possible. These innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, ZeRO-Infinity, etc. fall under the training pillar.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 7
    Hello AI World

    Hello AI World

    Guide to deploying deep-learning inference networks

    Hello AI World is a great way to start using Jetson and experiencing the power of AI. In just a couple of hours, you can have a set of deep learning inference demos up and running for realtime image classification and object detection on your Jetson Developer Kit with JetPack SDK and NVIDIA TensorRT. The tutorial focuses on networks related to computer vision, and includes the use of live cameras. You’ll also get to code your own easy-to-follow recognition program in Python or C++, and train your own DNN models onboard Jetson with PyTorch. Ready to dive into deep learning? It only takes two days. We’ll provide you with all the tools you need, including easy to follow guides, software samples such as TensorRT code, and even pre-trained network models including ImageNet and DetectNet examples. Follow these directions to integrate deep learning into your platform of choice and quickly develop a proof-of-concept design.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 8
    Coqui STT

    Coqui STT

    The deep learning toolkit for speech-to-text

    Coqui STT is a fast, open-source, multi-platform, deep-learning toolkit for training and deploying speech-to-text models. Coqui STT is battle-tested in both production and research. Multiple possible transcripts, each with an associated confidence score. Experience the immediacy of script-to-performance. With Coqui text-to-speech, production times go from months to minutes. With Coqui, the post is a pleasure. Effortlessly clone the voices of your talent and have the clone handle the problems in post. With Coqui, dubbing is a delight. Effortlessly clone the voice of your talent into another language and let the clone do the dub. With text-to-speech, experience the immediacy of script-to-performance. Cast from a wide selection of high-quality, directable, emotive voices or clone a voice to suit your needs. With Coqui text-to-speech, production times go from months to minutes.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 9
    MegEngine

    MegEngine

    Easy-to-use deep learning framework with 3 key features

    MegEngine is a fast, scalable and easy-to-use deep learning framework with 3 key features. You can represent quantization/dynamic shape/image pre-processing and even derivation in one model. After training, just put everything into your model and inference it on any platform at ease. Speed and precision problems won't bother you anymore due to the same core inside. In training, GPU memory usage could go down to one-third at the cost of only one additional line, which enables the DTR algorithm. Gain the lowest memory usage when inferencing a model by leveraging our unique pushdown memory planner. NOTE: MegEngine now supports Python installation on Linux-64bit/Windows-64bit/MacOS(CPU-Only)-10.14+/Android 7+(CPU-Only) platforms with Python from 3.5 to 3.8. On Windows 10 you can either install the Linux distribution through Windows Subsystem for Linux (WSL) or install the Windows distribution directly. Many other platforms are supported for inference.
    Downloads: 2 This Week
    Last Update:
    See Project
  • AI-generated apps that pass security review Icon
    AI-generated apps that pass security review

    Stop waiting on engineering. Build production-ready internal tools with AI—on your company data, in your cloud.

    Retool lets you generate dashboards, admin panels, and workflows directly on your data. Type something like “Build me a revenue dashboard on my Stripe data” and get a working app with security, permissions, and compliance built in from day one. Whether on our cloud or self-hosted, create the internal software your team needs without compromising enterprise standards or control.
    Try Retool free
  • 10
    AWS Deep Learning Containers

    AWS Deep Learning Containers

    A set of Docker images for training and serving models in TensorFlow

    AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet. Deep Learning Containers provide optimized environments with TensorFlow and MXNet, Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries and are available in the Amazon Elastic Container Registry (Amazon ECR). The AWS DLCs are used in Amazon SageMaker as the default vehicles for your SageMaker jobs such as training, inference, transforms etc. They've been tested for machine learning workloads on Amazon EC2, Amazon ECS and Amazon EKS services as well. This project is licensed under the Apache-2.0 License. Ensure you have access to an AWS account i.e. setup your environment such that awscli can access your account via either an IAM user or an IAM role.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 11
    GPT-NeoX

    GPT-NeoX

    Implementation of model parallel autoregressive transformers on GPUs

    This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training. For those looking for a TPU-centric codebase, we recommend Mesh Transformer JAX. If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the Hugging Face transformers library instead which supports GPT-NeoX models.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    PaddlePaddle

    PaddlePaddle

    PArallel Distributed Deep LEarning: Machine Learning Framework

    PaddlePaddle is an open source deep learning industrial platform with advanced technologies and a rich set of features that make innovation and application of deep learning easier. It is the only independent R&D deep learning platform in China, and has been widely adopted in various sectors including manufacturing, agriculture and enterprise service. PaddlePaddle covers core deep learning frameworks, basic model libraries, end-to-end development kits and more, with support for both dynamic and static graphs.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 13
    AWS Neuron

    AWS Neuron

    Powering Amazon custom machine learning chips

    AWS Neuron is a software development kit (SDK) for running machine learning inference using AWS Inferentia chips. It consists of a compiler, run-time, and profiling tools that enable developers to run high-performance and low latency inference using AWS Inferentia-based Amazon EC2 Inf1 instances. Using Neuron developers can easily train their machine learning models on any popular framework such as TensorFlow, PyTorch, and MXNet, and run it optimally on Amazon EC2 Inf1 instances. You can continue to use the same ML frameworks you use today and migrate your software onto Inf1 instances with minimal code changes and without tie-in to vendor-specific solutions. Neuron is pre-integrated into popular machine learning frameworks like TensorFlow, MXNet and Pytorch to provide a seamless training-to-inference workflow. It includes a compiler, runtime driver, as well as debug and profiling utilities with a TensorBoard plugin for visualization.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    DALI

    DALI

    A GPU-accelerated library containing highly optimized building blocks

    The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It provides a collection of highly optimized building blocks for loading and processing image, video and audio data. It can be used as a portable drop-in replacement for built-in data loaders and data iterators in popular deep learning frameworks. Deep learning applications require complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and many other augmentations. These data processing pipelines, which are currently executed on the CPU, have become a bottleneck, limiting the performance and scalability of training and inference. DALI addresses the problem of the CPU bottleneck by offloading data preprocessing to the GPU. Additionally, DALI relies on its own execution engine, built to maximize the throughput of the input pipeline.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    DeepDetect

    DeepDetect

    Deep Learning API and Server in C++14 support for Caffe, PyTorch

    The core idea is to remove the error sources and difficulties of Deep Learning applications by providing a safe haven of commoditized practices, all available as a single core. While the Open Source Deep Learning Server is the core element, with REST API, and multi-platform support that allows training & inference everywhere, the Deep Learning Platform allows higher level management for training neural network models and using them as if they were simple code snippets. Ready for applications of image tagging, object detection, segmentation, OCR, Audio, Video, Text classification, CSV for tabular data and time series. Neural network templates for the most effective architectures for GPU, CPU, and Embedded devices. Training in a few hours and with small data thanks to 25+ pre-trained models. Full Open Source, with an ecosystem of tools (API clients, video, annotation, ...) Fast Server written in pure C++, a single codebase for Cloud, Desktop & Embedded.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    DeepSpeed MII

    DeepSpeed MII

    MII makes low-latency and high-throughput inference possible

    MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. The Deep Learning (DL) open-source community has seen tremendous growth in the last few months. Incredibly powerful text generation models such as the Bloom 176B, or image generation model such as Stable Diffusion are now available to anyone with access to a handful or even a single GPU through platforms such as Hugging Face. While open-sourcing has democratized access to AI capabilities, their application is still restricted by two critical factors: inference latency and cost. DeepSpeed-MII is a new open-source python library from DeepSpeed, aimed towards making low-latency, low-cost inference of powerful models not only feasible but also easily accessible. MII offers access to the highly optimized implementation of thousands of widely used DL models. MII-supported models achieve significantly lower latency and cost compared to their original implementation.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    MACE

    MACE

    Deep learning inference framework optimized for mobile platforms

    Mobile AI Compute Engine (or MACE for short) is a deep learning inference framework optimized for mobile heterogeneous computing on Android, iOS, Linux and Windows devices. Runtime is optimized with NEON, OpenCL and Hexagon, and Winograd algorithm is introduced to speed up convolution operations. The initialization is also optimized to be faster. Chip-dependent power options like big.LITTLE scheduling, Adreno GPU hints are included as advanced APIs. UI responsiveness guarantee is sometimes obligatory when running a model. Mechanism like automatically breaking OpenCL kernel into small units is introduced to allow better preemption for the UI rendering task. Graph level memory allocation optimization and buffer reuse are supported. The core library tries to keep minimum external dependencies to keep the library footprint small.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    MMDeploy

    MMDeploy

    OpenMMLab Model Deployment Framework

    MMDeploy is an open-source deep learning model deployment toolset. It is a part of the OpenMMLab project. Models can be exported and run in several backends, and more will be compatible. All kinds of modules in the SDK can be extended, such as Transform for image processing, Net for Neural Network inference, Module for postprocessing and so on. Install and build your target backend. ONNX Runtime is a cross-platform inference and training accelerator compatible with many popular ML/DNN frameworks. Please read getting_started for the basic usage of MMDeploy.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    OpenVINO Training Extensions

    OpenVINO Training Extensions

    Trainable models and NN optimization tools

    OpenVINO™ Training Extensions provide a convenient environment to train Deep Learning models and convert them using the OpenVINO™ toolkit for optimized inference. When ote_cli is installed in the virtual environment, you can use the ote command line interface to perform various actions for templates related to the chosen task type, such as running, training, evaluating, exporting, etc. ote train trains a model (a particular model template) on a dataset and saves results in two files. ote optimize optimizes a pre-trained model using NNCF or POT depending on the model format. NNCF optimization used for trained snapshots in a framework-specific format. POT optimization used for models exported in the OpenVINO IR format.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Ray

    Ray

    A unified framework for scalable computing

    Modern workloads like deep learning and hyperparameter tuning are compute-intensive and require distributed or parallel execution. Ray makes it effortless to parallelize single machine code — go from a single CPU to multi-core, multi-GPU or multi-node with minimal code changes. Accelerate your PyTorch and Tensorflow workload with a more resource-efficient and flexible distributed execution framework powered by Ray. Accelerate your hyperparameter search workloads with Ray Tune. Find the best model and reduce training costs by using the latest optimization algorithms. Deploy your machine learning models at scale with Ray Serve, a Python-first and framework agnostic model serving framework. Scale reinforcement learning (RL) with RLlib, a framework-agnostic RL library that ships with 30+ cutting-edge RL algorithms including A3C, DQN, and PPO. Easily build out scalable, distributed systems in Python with simple and composable primitives in Ray Core.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    SageMaker Hugging Face Inference Toolkit

    SageMaker Hugging Face Inference Toolkit

    Library for serving Transformers models on Amazon SageMaker

    SageMaker Hugging Face Inference Toolkit is an open-source library for serving Transformers models on Amazon SageMaker. This library provides default pre-processing, predict and postprocessing for certain Transformers models and tasks. It utilizes the SageMaker Inference Toolkit for starting up the model server, which is responsible for handling inference requests. For the Dockerfiles used for building SageMaker Hugging Face Containers, see AWS Deep Learning Containers. The SageMaker Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. The Hugging Face Inference Toolkit allows user to override the default methods of the HuggingFaceHandlerService. SageMaker Hugging Face Inference Toolkit is licensed under the Apache 2.0 License.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    SageMaker MXNet Inference Toolkit

    SageMaker MXNet Inference Toolkit

    Toolkit for allowing inference and serving with MXNet in SageMaker

    SageMaker MXNet Inference Toolkit is an open-source library for serving MXNet models on Amazon SageMaker. This library provides default pre-processing, predict and postprocessing for certain MXNet model types and utilizes the SageMaker Inference Toolkit for starting up the model server, which is responsible for handling inference requests. AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet. Deep Learning Containers provide optimized environments with TensorFlow and MXNet, Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries and are available in the Amazon Elastic Container Registry (Amazon ECR). The AWS DLCs are used in Amazon SageMaker as the default vehicles for your SageMaker jobs such as training, inference, transforms etc. They've been tested for machine learning workloads on Amazon EC2, Amazon ECS and Amazon EKS services as well.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    TNN

    TNN

    Uniform deep learning inference framework for mobile

    TNN, a high-performance, lightweight neural network inference framework open sourced by Tencent Youtu Lab. It also has many outstanding advantages such as cross-platform, high performance, model compression, and code tailoring. The TNN framework further strengthens the support and performance optimization of mobile devices on the basis of the original Rapidnet and ncnn frameworks. At the same time, it refers to the high performance and good scalability characteristics of the industry's mainstream open source frameworks, and expands the support for X86 and NV GPUs. On the mobile phone, TNN has been used by many applications such as mobile QQ, weishi, and Pitu. As a basic acceleration framework for Tencent Cloud AI, TNN has provided acceleration support for the implementation of many businesses. Everyone is welcome to participate in the collaborative construction to promote the further improvement of the TNN inference framework.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    TensorFlow Probability

    TensorFlow Probability

    Probabilistic reasoning and statistical analysis in TensorFlow

    TensorFlow Probability is a library for probabilistic reasoning and statistical analysis. TensorFlow Probability (TFP) is a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware (TPU, GPU). It's for data scientists, statisticians, ML researchers, and practitioners who want to encode domain knowledge to understand data and make predictions. Since TFP inherits the benefits of TensorFlow, you can build, fit, and deploy a model using a single language throughout the lifecycle of model exploration and production. TFP is open source and available on GitHub. Tools to build deep probabilistic models, including probabilistic layers and a `JointDistribution` abstraction. Variational inference and Markov chain Monte Carlo. A wide selection of probability distributions and bijectors. Optimizers such as Nelder-Mead, BFGS, and SGLD.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    pipeless

    pipeless

    A computer vision framework to create and deploy apps in minutes

    Pipeless is an open-source computer vision framework to create and deploy applications without the complexity of building and maintaining multimedia pipelines. It ships everything you need to create and deploy efficient computer vision applications that work in real-time in just minutes. Pipeless is inspired by modern serverless technologies. It provides the development experience of serverless frameworks applied to computer vision. You provide some functions that are executed for new video frames and Pipeless takes care of everything else. You can easily use industry-standard models, such as YOLO, or load your custom model in one of the supported inference runtimes. Pipeless ships some of the most popular inference runtimes, such as the ONNX Runtime, allowing you to run inference with high performance on CPU or GPU out-of-the-box. You can deploy your Pipeless application with a single command to edge and IoT devices or the cloud.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • Next
MongoDB Logo MongoDB