Arming the rebels with GPUs: Gradium, Kyutai, and Audio AI
By Justin Gage and Rebecca Dodd
February 12, 2026

Disclosure: Amplify is an investor in Gradium.

If AI research is Star Wars and OpenAI is the Death Star, then without a doubt the rebels are building audio models. The best models for voice – TTS, STT, STS, and the like – are not coming from the big labs. Instead, they're built by their underfunded, understaffed, and underhyped siblings: a wave of startups that is improbably crushing benchmarks with every model release. And if you believe, as many researchers do, that audio is the biggest future modality for AI, this is one of the more interesting and underdiscussed topics in genAI today.

One of these improbably cutting-edge startups is Gradium, born out of the open lab Kyutai. In summer 2024, on a stage in Paris, a Kyutai researcher (his name is Neil) demoed the first realtime audio conversation with AI. This model, Moshi, could respond in real time, change its voice style and volume on request, and even recite an original poem in a French accent (research shows poems sound better this way).

You've probably seen audio AI demos before, and you may not be particularly impressed. Didn't OpenAI do this a few years ago? Well, not exactly: this was the first full-duplex conversational AI model. Moshi could interrupt, be interrupted, backchannel ("uh-huh", "I see"), and respond in around 160ms – faster than the gap in most human conversations. This demo happened before OpenAI released Advanced Voice Mode, and a full year before xAI released a similar demo (with more latency).

This would have been a groundbreaking release from a major lab, except it wasn't from a major lab: it came from a team of four researchers who built it completely from scratch (without a pre-trained base) in six months.
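Full duplex means listening and speaking happen on two concurrent audio streams instead of strict turn-taking. As a loose, hypothetical sketch of that scheduling idea only (not Kyutai's actual architecture, which models both streams jointly inside the network), here is the difference expressed with Python's asyncio: two tasks share time, so the "model" can hear new frames while mid-reply.

```python
import asyncio

async def listener(frames, events):
    # Consume user audio frames continuously, even while the model speaks.
    for frame in frames:
        events.append(("heard", frame))
        await asyncio.sleep(0)  # yield so the speaker task can run

async def speaker(replies, events):
    # Emit model audio (including backchannels) without waiting for silence.
    for reply in replies:
        events.append(("said", reply))
        await asyncio.sleep(0)  # yield so the listener task can run

async def conversation():
    events = []
    # Half-duplex would await listener() fully before calling speaker();
    # full-duplex runs both directions at once.
    await asyncio.gather(
        listener(["frame-1", "frame-2", "frame-3"], events),
        speaker(["uh-huh", "I see"], events),
    )
    return events

events = asyncio.run(conversation())
```

In the resulting event log, "said" entries are interleaved with "heard" entries: speech overlaps listening, which is what makes interruption and backchanneling possible at all.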
The model is open source, and can even run on mobile. Oh, and the team was part of a non-profit with extremely limited funding. How did they do it?

Based on extensive interviews with the Gradium team, this post goes into technical depth on an incredibly interesting niche of the increasingly top-heavy AI world:

- A brief history of audio ML, and why it's consistently overlooked
- Dynamics of big labs, and why small teams of researchers can outperform them
- The anatomy of training a voice AI model, and how it differs from text
- Core Gradium / Kyutai research: full-duplex models, audio codecs, oh my!

Let's get to it.

A brief history of audio ML, and why it's consistently overlooked

If you watch any of the science fiction movies incessantly invoked in this space – 2001: A Space Odyssey, Her, Iron Man – the AI speaks in a distinctly natural, human-sounding voice. One simply needs to ask Siri what time it is (it took 5 seconds for me this morning) to realize how far our devices can be from this ideal. There's an obvious question here: how d