AI voice synthesis tech 🤝 Impromptu

Only a little more than a year ago, after gaining beta access to Dall-E, I enjoyed exploring the visual possibilities of “words that could create images.” At the start of 2023, as an early tester of GPT-4, I was firmly in the thrall of text conversations. The countless exploratory chats that followed inspired my last book, Impromptu: Amplifying our Humanity through AI.

Most recently, I added audio to my experimentation. Voice.

My voice.

Kind of.

Our desire for devices to speak has a long history, dating back to at least 1769, when Wolfgang von Kempelen fashioned a speaking machine of sorts. It took nearly two centuries to go from that first device, which leveraged a kitchen tool used to stoke fires in wood-burning stoves, to something we would recognize as a truly modern progenitor. That reasonable approximation of a “computer voice” arrived in 1961, when researchers at Bell Labs got their IBM computer to sing “Daisy Bell.”

Today the sound of a computer speaking in an unnatural, monotone robot voice strikes us as passé. Over the last decade, Alexa, Siri, Cortana, and their ilk began responding to our questions in a much more fluid and human digital voice. Given the recent explosive progress in AI, it’s perhaps not surprising just how far voice synthesis has evolved, especially for short passages. But I wanted to try something a bit more difficult: could I approximate my voice for a much, much longer piece, perhaps something the length of a book?

Though I had wanted there to be an audiobook of Impromptu in my voice, I hadn’t been able to carve out the time to record it. I had found my test case. With that goal in mind, I recently experimented with two voice synthesis technologies: Vall-E, from Microsoft, and Respeecher, a Ukraine-based startup.

Vall-E is a text-to-speech (TTS) system, which means it takes written text as input and produces as output an audio file in a voice modeled on a target sample of the desired speaker. After being impressed by the early sound bites, I sent over the text file of Impromptu along with a few podcasts of my voice. While Vall-E is purely a research project for Microsoft with no plans for commercial use, the potential of what it can do when trained on as little as three seconds of a speaker’s audio is immense.

Respeecher offers an alternative approach to voice synthesis: speech-to-speech (STS). In this method, two sets of audio are submitted. As with Vall-E, the first was a target sample: 120 minutes of my voice. But here, instead of a text file, the second was a batch of audio called the “source”: the manuscript of Impromptu as recorded by Scott, a professional voice actor. Respeecher’s algorithms then mapped my “voice sound” from the target file onto the words Scott spoke in the source file.

As a lifelong fan of tabletop role-playing games like Dungeons & Dragons, I’m perhaps unsurprisingly inclined to describe this experience in language from that universe.

Vall-E (and TTS) conjures. Seemingly out of thin air. All of a sudden, there’s my voice, doing a very good job reading the text.

Respeecher (and STS) morphs. One minute I’m listening to Scott’s strong narration. And the next moment, I hear my voice, but with his perfect evocation.

Vall-E really did nail my voice. In certain passages, I doubt my family could tell I didn’t record it (link to snippet/podcast). But as you’d expect in these early days, the fit and finish is still a work in progress. Language nuances are tricky to get exactly right. Pauses can be a bit too long, or too short. It’s that subtle dance across commas, semicolons, and other punctuation that humans instinctively deliver. Or the intonation can sometimes feel off: the emphasis non-existent or at the wrong moment. But make no mistake, it got many sentences perfect, and I would basically forget what I was listening to. Only when my ear caught something unnatural was I reminded that I hadn’t recorded it. And this is just the beginning. Listen for yourself: we’ve just released the Vall-E audiobook as a podcast (link).

Respeecher, for its part, matched my voice to Scott’s professional reading of the Impromptu text. The end result is a nearly perfect audio artifact with amazing intonation. It sounds flawlessly human (though a bit less of a match to “my” voice than the Vall-E recording) and, most importantly, it likewise didn’t require multiple days of my time to record. If we hadn’t prioritized releasing an early version to show folks what that sounded like, the Respeecher team was confident that with more iterations we could have gotten that match near-perfect. Listen to an audio sample for yourself on Audible (link).

Very successful quests, indeed.

My takeaway is that I can see how I’d use these technologies in my work. While Vall-E is purely a research project for Microsoft, I could use it or another TTS model trained on my voice to produce audio versions of future essays to release as a podcast feed. It’s the kind of thing I’ve thought about in the past, but couldn’t get to. While not perfect, it would be a win for those who prefer or require an audio version of text that previously had none. Likewise, I suspect I’d use an STS system for future audiobooks. These are both examples of how AI can help creators produce more, faster.

It’s not hard to see more broadly promising uses for such capabilities. Think of how a startup like Sanas thoughtfully offers to help with global translation. Allowing a speaker to choose from a multitude of accents gives them control over how they’d like to be heard, providing them flexibility and potentially new economic opportunities in our global workplace.

Or consider the power of hearing a real-time translation in the original speaker’s voice. Whether in the context of high-stakes foreign diplomacy or the enjoyment of a global television show, the jarringly different voice of the translator will become a thing of the past. Hollywood has already started using variants of this technology, and you may have even heard a recent example in The Mandalorian.

Most personally inspiring, though, is the potential of these audio synthesis technologies to help those with speech and voice disabilities. A range of conditions, from Parkinson’s to cancer, leave millions of patients unable or struggling to speak in their own voice. These technologies offer a new way to be heard.

Of course, opportunity also brings with it the need for responsibility and the potential for abuse. In our new world, the obvious fake email from a family member asking us to wire them money becomes the more startling familiar voice on the other end of the phone making a similar request. Deepfakes have grown in volume in recent years, and as synthetic technologies develop further, the spread of such content will only accelerate.

Given today’s arguments over “alternative facts” and what constitutes fake news, our near-term reality only grows more complex. Organizations like Microsoft, OpenAI, and Respeecher have admirably focused on ethical paths forward. Providing greater sources of trust requires improving, deepening, and integrating new technologies that can detect synthesized voice. In a world where it grows harder to trust our eyes and our ears, things will need to change. Moving forward, it’s likely we’ll need to invent and integrate newer identity verification and signature technologies, such as a system to watermark AI-generated content, per President Biden’s recent executive order on the safe, secure and trustworthy development and use of AI.

Without the STS AI morphing technology, I probably still would have hired Scott to narrate the audiobook of Impromptu; it would just have been his voice, not mine. Without the TTS technology, I wouldn’t feasibly be able to release my essays as podcasts. These personal use cases are easy win-wins made possible by AI. Obviously, there are many others that are much more complex. We see this already in the headlines, as AI featured prominently in the Hollywood strikes of both the Writers Guild and the Screen Actors Guild. It’s our immediate reminder of how bumpy the path to progress remains; moving society forward requires close collaboration, compromise, and trust.

I remain broadly optimistic on a societal level that we can sit down and reach mutually beneficial and equitable long-term solutions together. Intricately complex collaboration has always been at the forefront of our most powerfully defining human traits.

Whether talking to Pi, translating a conversation, or amplifying a communication, conversational computing is an exciting frontier area just entering its prime time. I hope you take a moment to explore what’s possible today. You can do that by listening to Vall-E summon Impromptu (link to podcast), downloading the Audible version of Respeecher morphing Scott’s voice to mine, or having a conversation with Pi.

For a quick bit of fun, try this out. The following are excerpts from the introduction and conclusion of Impromptu. Which was recorded by Vall-E, which by Respeecher, and which by yours truly?

Intro to Impromptu clip: Recording 1