csm-1b

CSM-1B (Conversational Speech Model) is a text-to-speech model developed by Sesame that generates speech from text and, optionally, audio prompts. It pairs a LLaMA-based backbone with a smaller audio decoder that emits Mimi audio codes, so its output is a sequence of RVQ (residual vector quantization) codes that are then decoded into a waveform. The model handles both single-sentence generation and conversational generation, where earlier turns are supplied as text and audio context. It has not been fine-tuned to mimic specific voices, so it can produce a wide range of synthetic speaker identities. CSM-1B runs natively in Hugging Face Transformers (v4.52.1 and later) and supports batched inference, CUDA graph compilation, and fine-tuning with the standard Transformers Trainer. It is optimized for English; its ability in other languages is limited and comes from non-English data that ended up in the training set. CSM-1B is released under the Apache-2.0 license, with strict ethical use guidelines that prohibit impersonation, misinformation, and other forms of misuse.
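The description above refers to RVQ audio codes. As a rough illustration of the idea only (this is not Mimi's actual codec), the toy sketch below quantizes one vector in stages, where each stage encodes the residual left over by the previous one. The codebooks are random, and every name and size here (`rvq_encode`, `rvq_decode`, `DIM`, `STAGES`, `ENTRIES`) is made up for the example. Each toy codebook includes a zero vector so that a later stage can never make the reconstruction worse.

```python
import random


def sq_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_error(codebook[i], vec))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (assumed values): random codebooks, each stage at a finer scale.
DIM, STAGES, ENTRIES = 4, 3, 16
rng = random.Random(0)


def make_codebook(scale):
    """One zero vector plus ENTRIES - 1 random vectors in [-scale, scale]."""
    entries = [[0.0] * DIM]
    for _ in range(ENTRIES - 1):
        entries.append([rng.uniform(-scale, scale) for _ in range(DIM)])
    return entries


codebooks = [make_codebook(1.0 / (2 ** stage)) for stage in range(STAGES)]

frame = [0.3, -0.7, 0.1, 0.5]           # stand-in for one latent audio frame
codes = rvq_encode(frame, codebooks)    # one integer code per stage
approx = rvq_decode(codes, codebooks)   # coarse-to-fine reconstruction
```

In a real codec the codebooks are learned rather than random, and a model such as CSM-1B predicts the integer codes directly instead of computing them from a known frame.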

Features

Text-to-speech generation using RVQ audio code output
LLaMA-based backbone with a smaller audio decoder that emits Mimi audio codes
Supports full conversational input with contextual audio
Batched inference and CUDA graph support for efficiency
Fine-tuning available via Transformers’ Trainer API
Native support in Hugging Face Transformers (v4.52.1+)
Open-ended voice generation without predefined speakers
Ethical use policy to prevent impersonation and misuse
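
The features above can be tried with a short script. The following is a minimal sketch, assuming Transformers v4.52.1 or later is installed and that the checkpoint is published on the Hugging Face Hub as `sesame/csm-1b` (the repository ID is an assumption here). The calls follow the model's Hugging Face usage example as best recalled and should be checked against the installed version; the helper names `build_conversation` and `synthesize` are hypothetical.

```python
def build_conversation(turns):
    """Turn (speaker_id, text, audio_or_None) tuples into chat-template messages.

    Context turns carry both text and audio; the final turn, which the model
    should speak, carries text only.
    """
    messages = []
    for speaker_id, text, audio in turns:
        content = [{"type": "text", "text": text}]
        if audio is not None:
            # Assumed key name, following the model card's context example.
            content.append({"type": "audio", "path": audio})
        messages.append({"role": str(speaker_id), "content": content})
    return messages


def synthesize(turns, out_path="output.wav", model_id="sesame/csm-1b"):
    """Generate speech for the last turn and write it to out_path."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoProcessor, CsmForConditionalGeneration

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

    inputs = processor.apply_chat_template(
        build_conversation(turns),
        tokenize=True,
        return_dict=True,
    ).to(device)

    # output_audio=True asks for a decoded waveform rather than raw codes.
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, out_path)
    return out_path


# Single-sentence request from speaker 0, with no audio context.
messages = build_conversation([(0, "The past is just a story we tell ourselves.", None)])
```

Calling `synthesize([(0, "Hello from CSM.", None)])` would download the checkpoint and write `output.wav`; access to the model repository may require accepting its terms first.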

Additional Project Details

Registered: 2025-06-27

Similar Business Software

Chatterbox

Chatterbox is a free, open source voice cloning AI model developed by Resemble AI, licensed under MIT. It enables zero-shot voice cloning using just 5 seconds of reference audio, eliminating the need for training. The model offers expressive speech synthesis with unique emotion control, allowing...

Piper TTS

Piper is a fast, local neural text-to-speech (TTS) system optimized for devices like the Raspberry Pi 4, designed to deliver high-quality speech synthesis without relying on cloud services. It utilizes neural network models trained with VITS and exported to ONNX Runtime, enabling efficient and...

MARS6

CAMB.AI's MARS6 is a text-to-speech (TTS) model and the first speech model available on the Amazon Web Services (AWS) Bedrock platform. This integration allows developers to incorporate advanced TTS capabilities into generative AI applications, facilitating the creation...
