Showing 68 open source projects for "tokenizer"

  • 1
    Tokenizer

    A small library for converting tokenized PHP source code into XML

    You can add this library as a local, per-project dependency to your project using Composer. If you only need it during development, for instance to run your project's test suite, add it as a development-time dependency.
    Downloads: 0 This Week
  • 2
    SentencePiece

    Unsupervised text tokenizer for Neural Network-based text generation

    SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems where the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]) with the extension of direct training from raw sentences.
    Downloads: 1 This Week
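
The unigram language model mentioned above scores a segmentation by summing the log-probabilities of its pieces and picks the best-scoring split. A minimal sketch of that segmentation step follows, assuming the piece probabilities are already known. The function name `viterbi_segment` and the toy vocabulary are hypothetical and are not part of the SentencePiece API; training the probabilities is out of scope here.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of `text` into pieces from `log_probs`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in text[:i]
    best[0] = 0.0
    max_len = max(map(len, log_probs))
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Toy vocabulary (made-up probabilities, for illustration only).
toy_vocab = {c: math.log(0.01) for c in "helo"}
toy_vocab.update({"he": math.log(0.03), "llo": math.log(0.03), "hello": math.log(0.05)})
print(viterbi_segment("hello", toy_vocab))
```

Dynamic programming keeps the search linear in the text length (times the longest piece length), which is why a fixed vocabulary with per-piece scores is enough to tokenize arbitrary input.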
  • 3
    Tiktoken

    tiktoken is a fast BPE tokeniser for use with OpenAI's models

    tiktoken is a high-performance tokenizer library, based on byte-pair encoding (BPE), designed for use with OpenAI’s models. It encodes text to token IDs and decodes token IDs back to text efficiently, with minimal overhead. Because tokenization is a fundamental step in preparing text for models, tiktoken is optimized for speed, memory, and correctness in model contexts (e.g. matching OpenAI’s internal tokenization).
    Downloads: 1 This Week
  • 4
    minbpe

    Minimal, clean code for the Byte Pair Encoding (BPE) algorithm

    ...It operates on UTF-8 encoded bytes rather than Unicode characters, which makes it robust to arbitrary text inputs and avoids needing a language-specific character vocabulary. The repository is structured as a teaching-oriented implementation that shows how to train a tokenizer by learning merge rules, then apply those merges to encode text into token IDs and decode tokens back into text. It is intentionally small and readable so developers can understand each stage of BPE, including the mechanics of pair counting, merge application, and vocabulary growth. The project is especially useful for practitioners who want to demystify how LLM tokenizers work or who need a lightweight reference implementation for experimentation.
    Downloads: 0 This Week
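
The stages named in the description (pair counting, merge application, vocabulary growth, then encode and decode) can be sketched in a few lines. This is an illustrative byte-level BPE written for this listing, not code taken from the minbpe repository; the function names are assumptions.

```python
from collections import Counter

def pair_counts(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn merge rules, starting from the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step                 # vocabulary grows by one per merge
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    """Apply the learned merges, in training order, to the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Expand token IDs back into bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Working on bytes means any input string round-trips through `encode` and `decode`, including text in scripts the training data never contained.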
  • 5
    1D Visual Tokenization and Generation

    This repo contains the code for 1D tokenizer and generator

    The 1D Visual Tokenization and Generation project from ByteDance introduces a “one-dimensional” tokenizer for images. Instead of representing an image as a large grid of 2D tokens, as many prior generative and image-modeling systems do, it compresses the image into as few as 32 discrete tokens (more can be used optionally). This compact representation speeds up generation and reconstruction considerably while retaining strong fidelity.
    Downloads: 0 This Week
  • 6
    FireRedTTS-2

    Long-form streaming TTS system for multi-speaker dialogue generation

    FireRedTTS2 is a next-generation open-source text-to-speech (TTS) system focused on long-form, streaming speech synthesis for multi-speaker dialogue, delivering stable natural speech with context-aware prosody and reliable speaker transitions that support real-time and conversational applications. It features a specialized streaming speech tokenizer and a dual-transformer architecture that enables low latency and high-quality synthesis, making it suitable for interactive systems like chatbots, podcasts, and applications where dynamic turn-taking between speakers is essential. FireRedTTS2 supports multilingual output and speaker flexibility, enabling scenarios that involve language switching, cross-lingual voice cloning, and expressive dialogue generation that maintains consistency over longer utterances.
    Downloads: 0 This Week
  • 7
    gTTS

    Python library and CLI tool to interface with Google Translate

    ...It lets you send text to the Google Translate TTS endpoint and receive spoken audio back as MP3 data, either written to a file, a file-like object, or standard output. The library is designed to handle long texts, using a speech-specific sentence tokenizer that keeps intonation and punctuation natural while splitting requests into acceptable chunks. It supports customizable text pre-processors, which can correct pronunciations, tweak formatting, or handle domain-specific vocabulary before sending it to the API. gTTS is primarily aimed at developers who want a quick way to add cloud-backed speech to scripts, apps, or pipelines without managing any model weights locally. ...
    Downloads: 1 This Week
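
Splitting long text into request-sized chunks at punctuation, as described above, might look roughly like the sketch below. It is a generic illustration, not gTTS's own tokenizer; the function name `split_for_tts` and the default `max_chars=100` are assumptions chosen for the example.

```python
import re

def split_for_tts(text, max_chars=100):
    """Split `text` into chunks of at most `max_chars`, preferring punctuation breaks."""
    # Break after punctuation followed by whitespace, keeping the punctuation attached.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?;:,])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word-level pieces when a sentence alone is too long.
        parts = [sentence] if len(sentence) <= max_chars else sentence.split()
        for part in parts:
            candidate = f"{current} {part}".strip()
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
            # Hard-slice a single word that still exceeds the limit.
            while len(part) > max_chars:
                chunks.append(part[:max_chars])
                part = part[max_chars:]
            current = part
    if current:
        chunks.append(current)
    return chunks

for chunk in split_for_tts("Hello there. This is a long sentence, with a comma! Short?", 20):
    print(repr(chunk))
```

Breaking at punctuation first keeps each request a natural phrase, so the synthesized audio is less likely to pause in the middle of a clause.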
  • 8
    nanochat

    The best ChatGPT that $100 can buy

    nanochat is a from-scratch, end-to-end “mini ChatGPT” that shows the entire path from raw text to a chatty web app in one small, dependency-lean codebase. The repository stitches together every stage of the lifecycle: tokenizer training, pretraining a Transformer on a large web corpus, mid-training on dialogue and multiple-choice tasks, supervised fine-tuning, optional reinforcement learning for alignment, and finally efficient inference with caching. Its north star is approachability and speed: you can boot a fresh GPU box and drive the whole pipeline via a single script, producing a usable chat model in hours and a clear markdown report of what happened. ...
    Downloads: 0 This Week
  • 9
    VoxCPM

    TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

    VoxCPM is a tokenizer-free text-to-speech system that models speech in a continuous space, aiming for extremely realistic, context-aware synthesis and true-to-life zero-shot voice cloning. Instead of converting speech into discrete tokens, it uses an end-to-end diffusion-autoregressive architecture built on the MiniCPM-4 backbone, combining hierarchical language modeling, finite scalar quantization (FSQ), and local Diffusion Transformers.
    Downloads: 4 This Week
  • 10
    ZhParser

    PostgreSQL extension for full-text search of Chinese language

    zhparser is a PostgreSQL extension for full-text search of Chinese text. It integrates with PostgreSQL's text search engine to tokenize Chinese characters using a dictionary-based segmentation algorithm. zhparser is a valuable tool for improving search accuracy and performance in Chinese-language applications.
    Downloads: 1 This Week
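
The listing says only that zhparser uses "a dictionary-based segmentation algorithm". As a generic illustration of that family, the sketch below uses forward maximum matching, a classic dictionary method swapped in here for clarity; it is not claimed to be zhparser's actual algorithm, and the function name and toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or word in dictionary:
                tokens.append(word)
                i += size
                break
    return tokens

# Toy dictionary for illustration only.
toy_dictionary = {"中文", "全文", "搜索", "全文搜索"}
print(forward_max_match("中文全文搜索", toy_dictionary))
```

Chinese text has no spaces between words, so a full-text index needs a segmentation step like this before it can match query terms against document terms.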
  • 11
    Step-Audio-EditX

    LLM-based Reinforcement Learning audio edit model

    ...Rather than treating audio editing as low-level waveform manipulation, this model converts speech into a sequence of discrete “audio tokens” via a dual-codebook tokenizer that combines a linguistic token stream with a semantic (prosody/emotion/style) token stream, abstracting audio editing into high-level token operations. This allows users to modify not only what is said (the text) but also how it is said: emotion, tone, speaking style, prosody, accent, and even paralinguistic cues. Because the model is trained with a “large-margin learning” objective over many synthesized and natural speech samples, it gains robust control over expressive attributes and can perform iterative editing: for example, you could record a line, then ask the model to “make it sadder,” “speak slower,” or “change accent to X.”
    Downloads: 0 This Week
  • 12
    Bumblebee

    Pre-trained Neural Network models in Axon

    Bumblebee provides pre-trained Neural Network models on top of Axon. It includes integration with Hugging Face Models, allowing anyone to download models and perform Machine Learning tasks with a few lines of code. The best way to get started with Bumblebee is with Livebook. Our announcement video shows how to use Livebook's Smart Cells to perform different Neural Network tasks with a few clicks. You can then tweak the code and deploy it. First, add Bumblebee and EXLA as dependencies in your mix.exs. EXLA is an...
    Downloads: 0 This Week
  • 13
    AudioCraft

    Audiocraft is a library for audio processing and generation

    ...It includes MusicGen for music generation conditioned on text (and optionally melody) and AudioGen for text-conditioned sound effects and environmental audio. Both models operate over discrete audio tokens produced by a neural codec (EnCodec), which acts like a tokenizer for waveforms and enables efficient sequence modeling. The repo provides inference scripts, checkpoints, and simple Python APIs so you can generate clips from prompts or incorporate the models into applications. It also contains training code and recipes, so researchers can fine-tune on custom data or explore new objectives without building infrastructure from scratch. ...
    Downloads: 5 This Week
  • 14
    Gemma in PyTorch

    The official PyTorch implementation of Google's Gemma models

    ...It includes model definitions, configuration files, and loading utilities for multiple parameter scales, enabling quick evaluation and downstream adaptation. The repository demonstrates text generation pipelines, tokenizer setup, quantization paths, and adapters for low-rank or parameter-efficient fine-tuning. Example notebooks walk through instruction tuning and evaluation so teams can benchmark and iterate rapidly. The code is organized to be legible and hackable, exposing attention blocks, positional encodings, and head configurations. With standard PyTorch abstractions, it integrates easily into existing training loops, loggers, and evaluation harnesses.
    Downloads: 0 This Week
  • 15
    LLMs-from-scratch

    Implement a ChatGPT-like LLM in PyTorch from scratch, step by step

    LLMs-from-scratch is an educational codebase that walks through implementing modern large-language-model components step by step. It emphasizes building blocks—tokenization, embeddings, attention, feed-forward layers, normalization, and training loops—so learners understand not just how to use a model but how it works internally. The repository favors clear Python and NumPy or PyTorch implementations that can be run and modified without heavyweight frameworks obscuring the logic. Chapters...
    Downloads: 2 This Week
  • 16
    Qwen3-Coder

    Qwen3-Coder is the code version of Qwen3

    Qwen3-Coder is the latest and most powerful agentic code model developed by the Qwen team at Alibaba Cloud. Its flagship version, Qwen3-Coder-480B-A35B-Instruct, features a massive 480 billion-parameter Mixture-of-Experts architecture with 35 billion active parameters, delivering top-tier performance on coding and agentic tasks. This model sets new state-of-the-art benchmarks among open models for agentic coding, browser-use, and tool-use, matching performance comparable to leading models...
    Downloads: 19 This Week
  • 17
    Janus

    Unified Multimodal Understanding and Generation Models

    Janus is a sophisticated open-source project from DeepSeek AI that aims to unify both visual understanding and image generation in a single model architecture. Rather than having separate systems for “look and describe” and “prompt and generate”, Janus uses an autoregressive transformer framework with a decoupled visual encoder—allowing it to ingest images for comprehension and to produce images from text prompts with shared internal representations. The design tackles long-standing...
    Downloads: 2 This Week
  • 18
    Happy-LLM

    Large Language Model Principles and Practice Tutorial from Scratch

    ...It explains the Transformer architecture, pre-training paradigms, and model scaling strategies while also providing hands-on coding examples so readers can implement and experiment with their own models. The tutorial emphasizes practical understanding by walking users through building and training small language models, including tokenizer construction, pre-training workflows, and fine-tuning methods.
    Downloads: 0 This Week
  • 19
    torchtext

    Data loaders and abstractions for text and NLP

    ...Please refer to pytorch.org for the details of PyTorch installation. LTS versions are distributed through a different channel than the other versioned releases. Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses. To build torchtext from source, you need git, CMake and C++11 compiler such as g++. When building from source, make sure that you have the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. ...
    Downloads: 0 This Week
  • 20
    IK Analysis for Elasticsearch

    A plugin that integrates Lucene IK analyzer into elasticsearch

    ...Starting from version 3.0, IK has developed into a general-purpose word segmentation component for Java, independent of the Lucene project, while still providing a default implementation optimized for Lucene. In the 2012 version, IK implemented a simple algorithm for resolving word segmentation ambiguity, marking the evolution of the IK tokenizer from pure dictionary word segmentation to simulated semantic word segmentation.
    Downloads: 0 This Week
  • 21
    Enhanced version of Brill's Parts-of-Speech Tagger with built-in Tokenizer and Lemmatizer.
    Downloads: 12 This Week
  • 22
    GPT-2

    Code for the paper Language Models are Unsupervised Multitask Learners

    This repository contains the code and model weights for GPT-2, a large-scale unsupervised language model described in the OpenAI paper “Language Models are Unsupervised Multitask Learners.” The intent is to provide a starting point for researchers and engineers to experiment with GPT-2: generate text, fine‐tune on custom datasets, explore model behavior, or study its internal phenomena. The repository includes scripts for sampling, training, downloading pre-trained models, and utilities for...
    Downloads: 9 This Week
  • 23
    RE/flex lexical analyzer generator

    The regex-centric, fast lexical analyzer generator for C++

    A C++ high-performance regex library and Flex-compatible lexical analyzer generator with full Unicode support, new indentation anchors, lazy quantifiers, and many other modern features. Accepts Flex lexer specification syntax and is compatible with Bison/Yacc parsers. Generates reusable source code that is easy to understand. Supports fast scanning of UTF-8/16/32 files, strings, and streams. The reflex scanner generator generates clean C++ lexer class code that is thread-safe. Generates...
    Downloads: 9 This Week
  • 24
    LLaMA

    Inference code for Llama models

    “Llama” is the repository from Meta (formerly Facebook/Meta Research) containing the inference code for LLaMA (Large Language Model Meta AI) models. It provides utilities to load pre-trained LLaMA model weights, run inference (text generation, chat, completions), and work with tokenizers. It includes tokenizer utilities, download scripts, and shell helpers to fetch model weights with the correct licensing/permissions. Example scripts for chat completions and text completions show how to call the models in code. This repo is a core piece of the Llama model infrastructure, used by researchers and developers to run LLaMA models locally or in their infrastructure. ...
    Downloads: 0 This Week
  • 25
    llm

    An ecosystem of Rust libraries for working with large language models

    llm is an ecosystem of Rust libraries for working with large language models - it's built on top of the fast, efficient GGML library for machine learning. The primary entry point for developers is the llm crate, which wraps the llm-base and the supported model crates. Documentation for the released version is available on Docs.rs. For end-users, there is a CLI application, llm-cli, which provides a convenient interface for interacting with supported models. Text generation can be done as a...
    Downloads: 0 This Week