Alternatives to XLNet
Compare XLNet alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to XLNet in 2026. Compare features, ratings, user reviews, pricing, and more from XLNet competitors and alternatives in order to make an informed decision for your business.
-
1
Vertex AI
Google
Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case. Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. You can use BigQuery ML to create and execute machine learning models in BigQuery using standard SQL queries on existing business intelligence tools and spreadsheets, or you can export datasets from BigQuery directly into Vertex AI Workbench and run your models from there. Use Vertex Data Labeling to generate highly accurate labels for your data collection. Vertex AI Agent Builder enables developers to create and deploy enterprise-grade generative AI applications. It offers both no-code and code-first approaches, allowing users to build AI agents using natural language instructions or by leveraging frameworks like LangChain and LlamaIndex. -
2
Google AI Studio
Google
Google AI Studio is a unified development platform that helps teams explore, build, and deploy applications using Google’s most advanced AI models, including Gemini 3. It brings text, image, audio, and video models together in one interactive playground. With vibe coding, developers can use natural language to quickly turn ideas into working AI applications. The platform reduces friction by generating functional apps that are ready for deployment with minimal setup. Built-in integrations like Google Search enhance real-world use cases. Google AI Studio also centralizes API key management, usage monitoring, and billing. It offers a fast, intuitive path from prompt to production powered by vibe coding workflows. -
3
LM-Kit.NET
LM-Kit
LM-Kit.NET is a cutting-edge, high-level inference SDK designed specifically to bring the advanced capabilities of Large Language Models (LLM) into the C# ecosystem. Tailored for developers working within .NET, LM-Kit.NET provides a comprehensive suite of powerful Generative AI tools, making it easier than ever to integrate AI-driven functionality into your applications. The SDK is versatile, offering specialized AI features that cater to a variety of industries. These include text completion, Natural Language Processing (NLP), content retrieval, text summarization, text enhancement, language translation, and much more. Whether you are looking to enhance user interaction, automate content creation, or build intelligent data retrieval systems, LM-Kit.NET offers the flexibility and performance needed to accelerate your project. -
4
BERT
Google
BERT is a large language model and a method of pre-training language representations. Pre-training refers to how BERT is first trained on a large source of text, such as Wikipedia. You can then apply the training results to other Natural Language Processing (NLP) tasks, such as question answering and sentiment analysis. With BERT and AI Platform Training, you can train a variety of NLP models in about 30 minutes.Starting Price: Free -
5
GPT-4
OpenAI
GPT-4 (Generative Pre-trained Transformer 4) is a large-scale unsupervised language model, yet to be released by OpenAI. GPT-4 is the successor to GPT-3 and part of the GPT-n series of natural language processing models, and was trained on a dataset of 45TB of text to produce human-like text generation and understanding capabilities. Unlike most other NLP models, GPT-4 does not require additional training data for specific tasks. Instead, it can generate text or answer questions using only its own internally generated context as input. GPT-4 has been shown to be able to perform a wide variety of tasks without any task specific training data such as translation, summarization, question answering, sentiment analysis and more.Starting Price: $0.0200 per 1000 tokens -
6
Megatron-Turing
NVIDIA
Megatron-Turing Natural Language Generation model (MT-NLG), is the largest and the most powerful monolithic transformer English language model with 530 billion parameters. This 105-layer, transformer-based MT-NLG improves upon the prior state-of-the-art models in zero-, one-, and few-shot settings. It demonstrates unmatched accuracy in a broad set of natural language tasks such as, Completion prediction, Reading comprehension, Commonsense reasoning, Natural language inferences, Word sense disambiguation, etc. With the intent of accelerating research on the largest English language model till date and enabling customers to experiment, employ and apply such a large language model on downstream language tasks - NVIDIA is pleased to announce an Early Access program for its managed API service to MT-NLG mode. -
7
RoBERTa
Meta
RoBERTa builds on BERT’s language masking strategy, wherein the system learns to predict intentionally hidden sections of text within otherwise unannotated language examples. RoBERTa, which was implemented in PyTorch, modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective, and training with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modeling objective compared with BERT and leads to better downstream task performance. We also explore training RoBERTa on an order of magnitude more data than BERT, for a longer amount of time. We used existing unannotated NLP datasets as well as CC-News, a novel set drawn from public news articles.Starting Price: Free -
8
InstructGPT
OpenAI
InstructGPT is an open-source framework for training language models to generate natural language instructions from visual input. It uses a generative pre-trained transformer (GPT) model and the state-of-the-art object detector, Mask R-CNN, to detect objects in images and generate natural language sentences that describe the image. InstructGPT is designed to be effective across domains such as robotics, gaming and education; it can assist robots in navigating complex tasks with natural language instructions, or help students learn by providing descriptive explanations of processes or events.Starting Price: $0.0200 per 1000 tokens -
9
Cohere
Cohere AI
Cohere is an enterprise AI platform that enables developers and businesses to build powerful language-based applications. Specializing in large language models (LLMs), Cohere provides solutions for text generation, summarization, and semantic search. Their model offerings include the Command family for high-performance language tasks and Aya Expanse for multilingual applications across 23 languages. Focused on security and customization, Cohere allows flexible deployment across major cloud providers, private cloud environments, or on-premises setups to meet diverse enterprise needs. The company collaborates with industry leaders like Oracle and Salesforce to integrate generative AI into business applications, improving automation and customer engagement. Additionally, Cohere For AI, their research lab, advances machine learning through open-source projects and a global research community.Starting Price: Free -
10
Amazon Nova
Amazon
Amazon Nova is a new generation of state-of-the-art (SOTA) foundation models (FMs) that deliver frontier intelligence and industry leading price-performance, available exclusively on Amazon Bedrock. Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro are understanding models that accept text, image, or video inputs and generate text output. They provide a broad selection of capability, accuracy, speed, and cost operation points. Amazon Nova Micro is a text only model that delivers the lowest latency responses at very low cost. Amazon Nova Lite is a very low-cost multimodal model that is lightning fast for processing image, video, and text inputs. Amazon Nova Pro is a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Pro’s capabilities, coupled with its industry-leading speed and cost efficiency, makes it a compelling model for almost any task, including video summarization, Q&A, math & more. -
11
AI21 Studio
AI21 Studio
AI21 Studio provides API access to Jurassic-1 large-language-models. Our models power text generation and comprehension features in thousands of live applications. Take on any language task. Our Jurassic-1 models are trained to follow natural language instructions and require just a few examples to adapt to new tasks. Use our specialized APIs for common tasks like summarization, paraphrasing and more. Access superior results at a lower cost without reinventing the wheel. Need to fine-tune your own custom model? You're just 3 clicks away. Training is fast, affordable and trained models are deployed immediately. Give your users superpowers by embedding an AI co-writer in your app. Drive user engagement and success with features like long-form draft generation, paraphrasing, repurposing and custom auto-complete.Starting Price: $29 per month -
12
Llama
Meta
Llama (Large Language Model Meta AI) is a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. Smaller, more performant models such as Llama enable others in the research community who don’t have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field. Training smaller foundation models like Llama is desirable in the large language model space because it requires far less computing power and resources to test new approaches, validate others’ work, and explore new use cases. Foundation models train on a large set of unlabeled data, which makes them ideal for fine-tuning for a variety of tasks. We are making Llama available at several sizes (7B, 13B, 33B, and 65B parameters) and also sharing a Llama model card that details how we built the model in keeping with our approach to Responsible AI practices. -
13
Alpa
Alpa
Alpa aims to automate large-scale distributed training and serving with just a few lines of code. Alpa was initially developed by folks in the Sky Lab, UC Berkeley. Some advanced techniques used in Alpa have been written in a paper published in OSDI'2022. Alpa community is growing with new contributors from Google. A language model is a probability distribution over sequences of words. It predicts the next word based on all the previous words. It is useful for a variety of AI applications, such the auto-completion in your email or chatbot service. For more information, check out the language model wikipedia page. GPT-3 is very large language model, with 175 billion parameters, that uses deep learning to produce human-like text. Many researchers and news articles described GPT-3 as "one of the most interesting and important AI systems ever produced". GPT-3 is gradually being used as a backbone in the latest NLP research and applications.Starting Price: Free -
14
NLP Cloud
NLP Cloud
Fast and accurate AI models suited for production. Highly-available inference API leveraging the most advanced NVIDIA GPUs. We selected the best open-source natural language processing (NLP) models from the community and deployed them for you. Fine-tune your own models - including GPT-J - or upload your in-house custom models, and deploy them easily to production. Upload or Train/Fine-Tune your own AI models - including GPT-J - from your dashboard, and use them straight away in production without worrying about deployment considerations like RAM usage, high-availability, scalability... You can upload and deploy as many models as you want to production.Starting Price: $29 per month -
15
GPT-3.5
OpenAI
GPT-3.5 is the next evolution of GPT 3 large language model from OpenAI. GPT-3.5 models can understand and generate natural language. We offer four main models with different levels of power suitable for different tasks. The main GPT-3.5 models are meant to be used with the text completion endpoint. We also offer models that are specifically meant to be used with other endpoints. Davinci is the most capable model family and can perform any task the other models can perform and often with less instruction. For applications requiring a lot of understanding of the content, like summarization for a specific audience and creative content generation, Davinci is going to produce the best results. These increased capabilities require more compute resources, so Davinci costs more per API call and is not as fast as the other models.Starting Price: $0.0200 per 1000 tokens -
16
LUIS
Microsoft
Language Understanding (LUIS): A machine learning-based service to build natural language into apps, bots, and IoT devices. Quickly create enterprise-ready, custom models that continuously improve. Add natural language to your apps. Designed to identify valuable information in conversations, LUIS interprets user goals (intents) and distills valuable information from sentences (entities), for a high quality, nuanced language model. LUIS integrates seamlessly with the Azure Bot Service, making it easy to create a sophisticated bot. Powerful developer tools are combined with customizable pre-built apps and entity dictionaries, such as Calendar, Music, and Devices, so you can build and deploy a solution more quickly. Dictionaries are mined from the collective knowledge of the web and supply billions of entries, helping your model to correctly identify valuable information from user conversations. Active learning is used to continuously improve the quality of the models. -
17
LLaVA
LLaVA
LLaVA (Large Language-and-Vision Assistant) is an innovative multimodal model that integrates a vision encoder with the Vicuna language model to facilitate comprehensive visual and language understanding. Through end-to-end training, LLaVA exhibits impressive chat capabilities, emulating the multimodal functionalities of models like GPT-4. Notably, LLaVA-1.5 has achieved state-of-the-art performance across 11 benchmarks, utilizing publicly available data and completing training in approximately one day on a single 8-A100 node, surpassing methods that rely on billion-scale datasets. The development of LLaVA involved the creation of a multimodal instruction-following dataset, generated using language-only GPT-4. This dataset comprises 158,000 unique language-image instruction-following samples, including conversations, detailed descriptions, and complex reasoning tasks. This data has been instrumental in training LLaVA to perform a wide array of visual and language tasks effectively.Starting Price: Free -
18
BLOOM
BigScience
BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks. -
19
Chinchilla
Google DeepMind
Chinchilla is a large language model. Chinchilla uses the same compute budget as Gopher but with 70B parameters and 4× more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher. -
20
ALBERT
Google
ALBERT is a self-supervised Transformer model that was pretrained on a large corpus of English data. This means it does not require manual labelling, and instead uses an automated process to generate inputs and labels from raw texts. It is trained with two distinct objectives in mind. The first is Masked Language Modeling (MLM), which randomly masks 15% of words in the input sentence and requires the model to predict them. This technique differs from RNNs and autoregressive models like GPT as it allows the model to learn bidirectional sentence representations. The second objective is Sentence Ordering Prediction (SOP), which entails predicting the ordering of two consecutive segments of text during pretraining. -
21
ESMFold
Meta
ESMFold shows how AI can give us new tools to understand the natural world, much like the microscope, which enabled us to see into the world at an infinitesimal scale and opened up a whole new understanding of life. AI can help us understand the immense scope of natural diversity, and see biology in a new way. Much of AI research has focused on helping computers understand the world in a way similar to how humans do. The language of proteins is one that is beyond human comprehension and has eluded even the most powerful computational tools. AI has the potential to open up this language to our understanding. Studying AI in new domains such as biology can also give insight into artificial intelligence more broadly. Our work reveals connections across domains: large language models that are behind advances in machine translation, natural language understanding, speech recognition, and image generation are also able to learn deep information about biology.Starting Price: Free -
22
NVIDIA NeMo
NVIDIA
NVIDIA NeMo LLM is a service that provides a fast path to customizing and using large language models trained on several frameworks. Developers can deploy enterprise AI applications using NeMo LLM on private and public clouds. They can also experience Megatron 530B—one of the largest language models—through the cloud API or experiment via the LLM service. Customize your choice of various NVIDIA or community-developed models that work best for your AI applications. Within minutes to hours, get better responses by providing context for specific use cases using prompt learning techniques. Leverage the power of NVIDIA Megatron 530B, one of the largest language models, through the NeMo LLM Service or the cloud API. Take advantage of models for drug discovery, including in the cloud API and NVIDIA BioNeMo framework. -
23
GPT-3
OpenAI
Our GPT-3 models can understand and generate natural language. We offer four main models with different levels of power suitable for different tasks. Davinci is the most capable model, and Ada is the fastest. The main GPT-3 models are meant to be used with the text completion endpoint. We also offer models that are specifically meant to be used with other endpoints. Davinci is the most capable model family and can perform any task the other models can perform and often with less instruction. For applications requiring a lot of understanding of the content, like summarization for a specific audience and creative content generation, Davinci is going to produce the best results. These increased capabilities require more compute resources, so Davinci costs more per API call and is not as fast as the other models.Starting Price: $0.0200 per 1000 tokens -
24
DeepSeek-V2
DeepSeek
DeepSeek-V2 is a state-of-the-art Mixture-of-Experts (MoE) language model introduced by DeepSeek-AI, characterized by its economical training and efficient inference capabilities. With a total of 236 billion parameters, of which only 21 billion are active per token, it supports a context length of up to 128K tokens. DeepSeek-V2 employs innovative architectures like Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache and DeepSeekMoE for cost-effective training through sparse computation. This model significantly outperforms its predecessor, DeepSeek 67B, by saving 42.5% in training costs, reducing the KV cache by 93.3%, and enhancing generation throughput by 5.76 times. Pretrained on an 8.1 trillion token corpus, DeepSeek-V2 excels in language understanding, coding, and reasoning tasks, making it a top-tier performer among open-source models.Starting Price: Free -
25
Azure OpenAI Service
Microsoft
Apply advanced coding and language models to a variety of use cases. Leverage large-scale, generative AI models with deep understandings of language and code to enable new reasoning and comprehension capabilities for building cutting-edge applications. Apply these coding and language models to a variety of use cases, such as writing assistance, code generation, and reasoning over data. Detect and mitigate harmful use with built-in responsible AI and access enterprise-grade Azure security. Gain access to generative models that have been pretrained with trillions of words. Apply them to new scenarios including language, code, reasoning, inferencing, and comprehension. Customize generative models with labeled data for your specific scenario using a simple REST API. Fine-tune your model's hyperparameters to increase accuracy of outputs. Use the few-shot learning capability to provide the API with examples and achieve more relevant results.Starting Price: $0.0004 per 1000 tokens -
26
OpenAI
OpenAI
OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome. Apply our API to any language task — semantic search, summarization, sentiment analysis, content generation, translation, and more — with only a few examples or by specifying your task in English. One simple integration gives you access to our constantly-improving AI technology. Explore how you integrate with the API with these sample completions. -
27
Codestral Mamba
Mistral AI
As a tribute to Cleopatra, whose glorious destiny ended in tragic snake circumstances, we are proud to release Codestral Mamba, a Mamba2 language model specialized in code generation, available under an Apache 2.0 license. Codestral Mamba is another step in our effort to study and provide new architectures. It is available for free use, modification, and distribution, and we hope it will open new perspectives in architecture research. Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length. It allows users to engage with the model extensively with quick responses, irrespective of the input length. This efficiency is especially relevant for code productivity use cases, this is why we trained this model with advanced code and reasoning capabilities, enabling it to perform on par with SOTA transformer-based models.Starting Price: Free -
28
Claude Pro
Anthropic
Claude Pro is an advanced large language model designed to handle complex tasks while maintaining a friendly, accessible demeanor. Trained on extensive, high-quality data, it excels at understanding context, interpreting subtle nuances, and producing well-structured, coherent responses across a wide range of topics. By leveraging robust reasoning capabilities and a refined knowledge base, Claude Pro can draft detailed reports, compose creative content, summarize lengthy documents, and even assist in coding tasks. Its adaptive algorithms continuously improve its ability to learn from feedback, ensuring that its output remains accurate, reliable, and helpful. Whether serving professionals seeking expert support or individuals looking for quick, informative answers, Claude Pro delivers a versatile and productive conversational experience.Starting Price: $18/month -
29
PaLM 2
Google
PaLM 2 is our next generation large language model that builds on Google’s legacy of breakthrough research in machine learning and responsible AI. It excels at advanced reasoning tasks, including code and math, classification and question answering, translation and multilingual proficiency, and natural language generation better than our previous state-of-the-art LLMs, including PaLM. It can accomplish these tasks because of the way it was built – bringing together compute-optimal scaling, an improved dataset mixture, and model architecture improvements. PaLM 2 is grounded in Google’s approach to building and deploying AI responsibly. It was evaluated rigorously for its potential harms and biases, capabilities and downstream uses in research and in-product applications. It’s being used in other state-of-the-art models, like Med-PaLM 2 and Sec-PaLM, and is powering generative AI features and tools at Google, like Bard and the PaLM API. -
30
Mistral 7B
Mistral AI
Mistral 7B is a 7.3-billion-parameter language model that outperforms larger models like Llama 2 13B across various benchmarks. It employs Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to efficiently handle longer sequences. Released under the Apache 2.0 license, Mistral 7B is accessible for deployment across diverse platforms, including local environments and major cloud services. Additionally, a fine-tuned version, Mistral 7B Instruct, demonstrates enhanced performance in instruction-following tasks, surpassing models like Llama 2 13B Chat.Starting Price: Free -
31
Tiny Aya
Cohere AI
Tiny Aya is a family of open-weight multilingual language models from Cohere Labs designed to deliver powerful, adaptable AI that can run efficiently on local devices, including phones and laptops, without requiring constant cloud connectivity. It focuses on enabling high-quality text understanding and generation across more than 70 languages, including many lower-resource languages that are often underserved by mainstream models. Built with lightweight architectures around 3.35 billion parameters, Tiny Aya is optimized for balanced multilingual representation and realistic compute constraints, making it suitable for edge deployment and offline use. The models support downstream adaptation and instruction tuning, allowing developers to customize behavior for specific applications while maintaining strong cross-lingual performance.Starting Price: Free -
32
Yi-Large
01.AI
Yi-Large is a proprietary large language model developed by 01.AI, offering a 32k context length with both input and output costs at $2 per million tokens. It stands out with its advanced capabilities in natural language processing, common-sense reasoning, and multilingual support, performing on par with leading models like GPT-4 and Claude3 in various benchmarks. Yi-Large is designed for tasks requiring complex inference, prediction, and language understanding, making it suitable for applications like knowledge search, data classification, and creating human-like chatbots. Its architecture is based on a decoder-only transformer with enhancements such as pre-normalization and Group Query Attention, and it has been trained on a vast, high-quality multilingual dataset. This model's versatility and cost-efficiency make it a strong contender in the AI market, particularly for enterprises aiming to deploy AI solutions globally.Starting Price: $0.19 per 1M input token -
33
Qwen2-VL
Alibaba
Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. Understanding videos of 20 min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside imagesStarting Price: Free -
34
Mistral Large 2
Mistral AI
Mistral AI has launched the Mistral Large 2, an advanced AI model designed to excel in code generation, multilingual capabilities, and complex reasoning tasks. The model features a 128k context window, supporting dozens of languages including English, French, Spanish, and Arabic, as well as over 80 programming languages. Mistral Large 2 is tailored for high-throughput single-node inference, making it ideal for large-context applications. Its improved performance on benchmarks like MMLU and its enhanced code generation and reasoning abilities ensure accuracy and efficiency. The model also incorporates better function calling and retrieval, supporting complex business applications.Starting Price: Free -
35
ERNIE 3.0 Titan
Baidu
Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, We design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. -
36
PanGu-Σ
Huawei
Significant advancements in the field of natural language processing, understanding, and generation have been achieved through the expansion of large language models. This study introduces a system which utilizes Ascend 910 AI processors and the MindSpore framework to train a language model with over a trillion parameters, specifically 1.085T, named PanGu-{\Sigma}. This model, which builds upon the foundation laid by PanGu-{\alpha}, takes the traditionally dense Transformer model and transforms it into a sparse one using a concept known as Random Routed Experts (RRE). The model was efficiently trained on a dataset of 329 billion tokens using a technique called Expert Computation and Storage Separation (ECSS), leading to a 6.3-fold increase in training throughput via heterogeneous computing. Experimentation indicates that PanGu-{\Sigma} sets a new standard in zero-shot learning for various downstream Chinese NLP tasks. -
37
Code Llama
Meta
Code Llama is a large language model (LLM) that can use text prompts to generate code. Code Llama is state-of-the-art for publicly available LLMs on code tasks, and has the potential to make workflows faster and more efficient for current developers and lower the barrier to entry for people who are learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust, well-documented software. Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts. Code Llama is free for research and commercial use. Code Llama is built on top of Llama 2 and is available in three models: Code Llama, the foundational code model; Codel Llama - Python specialized for Python; and Code Llama - Instruct, which is fine-tuned for understanding natural language instructions.Starting Price: Free -
38
Gemini 1.5 Pro
Google
The Gemini 1.5 Pro AI model is a state-of-the-art language model designed to deliver highly accurate, context-aware, and human-like responses across a variety of applications. Built with cutting-edge neural architecture, it excels in natural language understanding, generation, and reasoning tasks. The model is fine-tuned for versatility, supporting tasks like content creation, code generation, data analysis, and complex problem-solving. Its advanced algorithms ensure nuanced comprehension, enabling it to adapt to different domains and conversational styles seamlessly. With a focus on scalability and efficiency, the Gemini 1.5 Pro is optimized for both small-scale implementations and enterprise-level integrations, making it a powerful tool for enhancing productivity and innovation. -
39
LongLLaMA
LongLLaMA
This repository contains the research preview of LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more. LongLLaMA is built upon the foundation of OpenLLaMA and fine-tuned using the Focused Transformer (FoT) method. LongLLaMA code is built upon the foundation of Code Llama. We release a smaller 3B base variant (not instruction tuned) of the LongLLaMA model on a permissive license (Apache 2.0) and inference code supporting longer contexts on hugging face. Our model weights can serve as the drop-in replacement of LLaMA in existing implementations (for short context up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models.Starting Price: Free -
40
NVIDIA NeMo Megatron
NVIDIA
NVIDIA NeMo Megatron is an end-to-end framework for training and deploying LLMs with billions and trillions of parameters. NVIDIA NeMo Megatron, part of the NVIDIA AI platform, offers an easy, efficient, and cost-effective containerized framework to build and deploy LLMs. Designed for enterprise application development, it builds upon the most advanced technologies from NVIDIA research and provides an end-to-end workflow for automated distributed data processing, training large-scale customized GPT-3, T5, and multilingual T5 (mT5) models, and deploying models for inference at scale. Harnessing the power of LLMs is made easy through validated and converged recipes with predefined configurations for training and inference. Customizing models is simplified by the hyperparameter tool, which automatically searches for the best hyperparameter configurations and performance for training and inference on any given distributed GPU cluster configuration. -
41
Qwen3-Max
Alibaba
Qwen3-Max is Alibaba’s latest trillion-parameter large language model, designed to push performance in agentic tasks, coding, reasoning, and long-context processing. It is built atop the Qwen3 family and benefits from the architectural, training, and inference advances introduced there; mixing thinker and non-thinker modes, a “thinking budget” mechanism, and support for dynamic mode switching based on complexity. The model reportedly processes extremely long inputs (hundreds of thousands of tokens), supports tool invocation, and exhibits strong performance on benchmarks in coding, multi-step reasoning, and agent benchmarks (e.g., Tau2-Bench). While its initial variant emphasizes instruction following (non-thinking mode), Alibaba plans to bring reasoning capabilities online to enable autonomous agent behavior. Qwen3-Max inherits multilingual support and extensive pretraining on trillions of tokens, and it is delivered via API interfaces compatible with OpenAI-style functions.Starting Price: Free -
42
ERNIE Bot
Baidu
ERNIE Bot is an AI-powered conversational assistant developed by Baidu, designed to facilitate seamless and natural interactions with users. Built on the ERNIE (Enhanced Representation through Knowledge Integration) model, ERNIE Bot excels at understanding complex queries and generating human-like responses across various domains. Its capabilities include processing text, generating images, and engaging in multimodal communication, making it suitable for a wide range of applications such as customer support, virtual assistants, and enterprise automation. With its advanced contextual understanding, ERNIE Bot offers an intuitive and efficient solution for businesses seeking to enhance their digital interactions and automate workflows.Starting Price: Free -
43
Yi-Lightning
Yi-Lightning
Yi-Lightning, developed by 01.AI under the leadership of Kai-Fu Lee, represents the latest advancement in large language models with a focus on high performance and cost-efficiency. It boasts a maximum context length of 16K tokens and is priced at $0.14 per million tokens for both input and output, making it remarkably competitive. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, incorporating fine-grained expert segmentation and advanced routing strategies, which contribute to its efficiency in training and inference. This model has excelled in various domains, achieving top rankings in categories like Chinese, math, coding, and hard prompts on the chatbot arena, where it secured the 6th position overall and 9th in style control. Its development included comprehensive pre-training, supervised fine-tuning, and reinforcement learning from human feedback, ensuring both performance and safety, with optimizations in memory usage and inference speed. -
44
GLM-4.7-FlashX
Z.ai
GLM-4.7 FlashX is a lightweight, high-speed version of the GLM-4.7 large language model created by Z.ai that balances efficiency and performance for real-time AI tasks across English and Chinese while offering the core capabilities of the broader GLM-4.7 family in a more resource-friendly package. It is positioned alongside GLM-4.7 and GLM-4.7 Flash, delivering optimized agentic coding and general language understanding with faster response times and lower resource needs, making it suitable for applications that require rapid inference without heavy infrastructure. As part of the GLM-4.7 model series, it inherits the model’s strengths in programming, multi-step reasoning, and robust conversational understanding, and it supports long contexts for complex tasks while remaining lightweight enough for deployment with constrained compute budgets.Starting Price: $0.07 per 1M tokens -
45
Qwen3-Omni
Alibaba
Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video and delivers real-time streaming responses in text and natural speech. It uses a Thinker-Talker architecture with a Mixture-of-Experts (MoE) design, early text-first pretraining, and mixed multimodal training to support strong performance across all modalities without sacrificing text or image quality. The model supports 119 text languages, 19 speech input languages, and 10 speech output languages. It achieves state-of-the-art results: across 36 audio and audio-visual benchmarks, it hits open-source SOTA on 32 and overall SOTA on 22, outperforming or matching strong closed-source models such as Gemini-2.5 Pro and GPT-4o. To reduce latency, especially in audio/video streaming, Talker predicts discrete speech codecs via a multi-codebook scheme and replaces heavier diffusion approaches. -
46
YandexGPT
Yandex
Take advantage of the capabilities of generative language models to improve and optimize your applications and web services. Get an aggregated result of accumulated textual data whether it be information from work chats, user reviews, or other types of data. YandexGPT will help both summarize and interpret the information. Speed up text creation as you improve their quality and style. Create template texts for newsletters, product descriptions for online stores and other applications. Develop a chatbot for your support service: teach the bot to answer various user questions, both common and more complicated. Use the API to integrate the service with your applications and automate processes. -
47
GPT-J
EleutherAI
GPT-J is a cutting-edge language model created by the research organization EleutherAI. In terms of performance, GPT-J exhibits a level of proficiency comparable to that of OpenAI's renowned GPT-3 model in a range of zero-shot tasks. Notably, GPT-J has demonstrated the ability to surpass GPT-3 in tasks related to generating code. The latest iteration of this language model, known as GPT-J-6B, is built upon a linguistic dataset referred to as The Pile. This dataset, which is publicly available, encompasses a substantial volume of 825 gibibytes of language data, organized into 22 distinct subsets. While GPT-J shares certain capabilities with ChatGPT, it is important to note that GPT-J is not designed to operate as a chatbot; rather, its primary function is to predict text. In a significant development in March 2023, Databricks introduced Dolly, a model that follows instructions and is licensed under Apache.Starting Price: Free -
48
GPT-NeoX
EleutherAI
An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library. This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.Starting Price: Free -
49
Hippocratic AI
Hippocratic AI
Hippocratic AI is the new state of the art (SOTA) model, outperforming GPT-4 on 105 of 114 healthcare exams and certifications. Hippocratic AI has outperformed GPT-4 on 105 out of 114 tests and certifications, outperformed by a margin of five percent or more on 74 of the certifications, and outperformed by a margin of ten percent or more on 43 of the certifications. Most language models pre-train on the common crawl of the Internet, which may include incorrect and misleading information. Unlike these LLMs, Hippocratic AI is investing heavily in legally acquiring evidence-based healthcare content. We’re conducting a unique Reinforcement Learning with Human Feedback process using healthcare professionals to train and validate the model’s readiness for deployment. We call this RLHF-HP. Hippocratic AI will not release the model until a large number of these licensed professionals deem it safe. -
50
GPT-4V (Vision)
OpenAI
GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.