Moshi

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. Mimi processes 24 kHz audio, down to a 12.5 Hz representation with a bandwidth of 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size), yet performs better than existing, non-streaming, codecs like SpeechTokenizer (50 Hz, 4kbps), or SemantiCodec (50 Hz, 1.3kbps). Moshi models two streams of audio: one corresponds to Moshi, and the other one to the user. At inference, the stream from the user is taken from the audio input, and the one for Moshi is sampled from the model's output. Along these two audio streams, Moshi predicts text tokens corresponding to its own speech, its inner monologue, which greatly improves the quality of its generation. A small Depth Transformer models inter codebook dependencies for a given time step, while a large, 7B parameter Temporal Transformer models the temporal dependencies.

Features

Converts JSON to Java and Kotlin objects
Annotation-based mapping for flexibility
Supports custom adapters for complex models
Lightweight with minimal dependencies
JSON serialization and deserialization

Project Samples

Project Activity

See All Activity >

License

Apache License V2.0

Follow Moshi

Moshi Web Site

Other Useful Business Software

Cut Cloud Costs with Google Compute Engine

Save up to 91% with Spot VMs and get automatic sustained-use discounts. One free VM per month, plus $300 in credits.

Save on compute costs with Compute Engine. Reduce your batch jobs and workload bill 60-91% with Spot VMs. Compute Engine's committed use offers customers up to 70% savings through sustained use discounts. Plus, you get one free e2-micro VM monthly and $300 credit to start.

Try Compute Engine

Rate This Project

User Reviews

Be the first to post a review of Moshi!

Additional Project Details

Operating Systems

Linux, Mac, Windows

Programming Language

Python

Related Categories

Python Speech Software

Registered

2024-11-04

Similar Business Software

LALAL.AI

LALAL.AI is a next-generation audio separation service powered by advanced AI technology. With a suite of innovative tools - Stem Splitter, Voice Cleaner, Voice Changer, Voice Cloner, LALAL.AI enables users to take their audio content to the next level. Stem Splitter The core service of...

See Software
3Q

3Q's European online video platform provides access to reliable and secure video hosting and streaming technology. With its own infrastructure, global CDN, an accessible video player, intuitive video management system and smart AI solutions, AV content can be stored, managed and distributed via...

See Software
LTX

Control every aspect of your video using AI, from ideation to final edits, on one holistic platform. We’re pioneering the integration of AI and video production, enabling the transformation of a single idea into a cohesive, AI-generated video. LTX empowers individuals to share their visions,...

See Software
Switcher Studio

Switcher Studio lets you capture video from multiple camera angles — and edit it in real time — helping you connect with your community in the most engaging way yet. Stream it live or record it for later. Captivate your audience with content that matters — and make it look amazing. No need to...

See Software
Canva

Design anything. Publish anywhere. Use Canva’s drag-and-drop feature and professional layouts to design consistently stunning graphics. Design presentations, social media graphics with thousands of beautiful forms, over 100 million stock photos, video & audio, and all the tools you need. Design...

See Software
Google Cloud Speech-to-Text

Google Cloud’s Speech API processes more than 1 billion voice minutes per month with close to human levels of understanding for many commonly spoken languages. Powered by the best of Google's AI research and technology, Google Cloud's Speech-to-Text API helps you accurately transcribe speech...

See Software