Explore the development journey of Moshi, the first speech-to-speech model, from its inception to real-time inference. This session dives into key decisions, such as training at scale on vast datasets and the shift from traditional frameworks like PyTorch to Rust and Candle. Learn how these choices affected performance, and discover how 500,000 Moshi sessions were served on optimized L4 GPU clusters, minimizing computational demand while maintaining real-time accuracy.

Moshi’s behind the scenes: From conception to inference