I cloned my voice with AI and my mother couldn’t tell the difference
The technology is getting shockingly cheap and easy to use.
A couple of weeks ago, I used AI software to clone my voice. The resulting audio sounded pretty convincing to me, but I wanted to see what others thought.
So I created a test audio file from the first 12 paragraphs of this article. Seven randomly chosen paragraphs were read in my real voice, while the other five were generated by AI. Then I asked members of my family whether they could tell the difference.
My mother was stumped. “All of the paragraphs sounded like you,” she told me afterward. She thought she had identified telltale signs of the computer-generated audio. But she was wrong more often than she was right, correctly identifying only five out of 12 paragraphs.
Other members of my family had better luck. My wife, sister, brother, and mother-in-law got all 12 paragraphs right. My father went 10 for 12.
When I opened up the experiment to the broader internet (you can try your luck here), the results weren’t great for my ego.
“The real voices had much more richness and emotional flavor,” one anonymous participant wrote. “The AI voices sounded like a mopey person with a cold. At least I hope that's right and I'm not insulting your actual voice! I've never met you in person.”
Unfortunately, this person guessed wrong about every single paragraph: that “mopey person with a cold” was me. Another zero-for-12 listener wrote that the AI voice (actually my voice) “lacks variations in timbre and cadence.”
A grad school friend whom I haven’t seen in years guessed wrong 11 out of 12 times. A former employee was wrong 10 out of 12 times.
Overall, people who didn’t know me well barely did better than a coin flip, guessing correctly only 54 percent of the time. Here are the results, with the speakers identified, for you to hear yourself:
So my cloned voice wasn’t perfect, but it was remarkably good. And creating it was surprisingly cheap and easy.
Voice cloning has improved a lot in three years
Back in 2020, researchers at MIT worked with a company called Respeecher to generate a fake video of Richard Nixon announcing the failure of the Apollo 11 Moon landing. A behind-the-scenes video shows the laborious process required to clone Nixon’s voice. The MIT researchers collected hundreds of short clips of Nixon’s voice and then had a voice actor record himself speaking the same words. The actor then read Nixon’s alternate moon landing speech, and the software modified his words to sound like Nixon’s voice.
This process seems to yield excellent results: Last year, Respeecher won a contract to clone the voice of James Earl Jones as Darth Vader in future Star Wars projects. But it comes at a high cost. When I reached out to Respeecher recently to give their service a try, they informed me that “a project usually takes several weeks with fees from 4-digit to 6-digit in $USD.”
I didn’t have thousands of dollars to spend, so I went with a little-known startup called Play.ht instead. All I had to do was upload a 30-minute video of me reading text of my choice, then wait a few hours.
Play.ht is a text-to-speech service, so I didn’t need to hire a voice actor. Once it had been trained on my voice, the software could generate realistic human speech from written text in just a few minutes. Best of all, I didn’t have to pay a dime. I was able to clone my voice using Play.ht’s free plan. Commercial plans start at $39 per month.
Realistic text-to-speech systems like Play.ht are hard to build because human beings pronounce the same word differently depending on the context. Our pronunciation shifts based on what comes before or after a word in a sentence, and we follow complex, largely subconscious rules about which words in a sentence to emphasize.
There’s also some purely random variation in how human beings pronounce words. Sometimes we stop to take a breath, pause to think about what we’re saying, or simply get distracted. So any system that always pronounces words or phrases in exactly the same way is going to sound a bit robotic.
A voice-to-voice system like Respeecher doesn’t need to worry about these issues as much because it can follow the lead of the voice actor who supplied the source audio. In a text-to-speech system, in contrast, the AI system needs to understand human speech well enough to know how long to pause, which words to emphasize, and so forth.
Play.ht says its system uses a transformer, a type of neural network that was invented at Google in 2017 and has become the foundation of many generative AI systems since then. (The T in GPT, OpenAI’s family of large language models, stands for transformer.)
What makes a transformer model powerful is its ability to “pay attention” to multiple parts of its input at the same time. When Play.ht’s model generates the audio for a new word, it isn’t just “thinking about” the current word or the one that came before it; it’s taking into account the structure of the sentence as a whole. This allows it to vary the speed, emphasis, and other characteristics of speech in a way that mirrors the speech patterns of the person whose voice is being cloned.
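To make the idea of “paying attention” a bit more concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside a transformer. It is purely illustrative: Play.ht hasn’t published the details of its model, and a real text-to-speech network stacks many such layers with learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of all value rows, with weights
    based on how strongly a query matches every key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax: attention weights per word
    return weights @ V                                    # blend information from the whole sentence

# Toy example: a "sentence" of 5 words, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
# In a real model, Q, K, and V come from learned linear projections of x.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (5, 8): every word's new representation reflects the whole sentence
```

The key point is that every output mixes information from every position in the input, which is what lets the model shape pauses and emphasis based on the whole sentence rather than one word at a time.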
The challenge of text-to-speech voice cloning
Play.ht is designed for creative professionals making podcasts, audiobooks, instructional videos, television ads, and so forth. The startup is actually a bit of an underdog in this market, as it is competing with a sophisticated audio editing tool called Descript.
The original version of Descript, launched in 2017, automatically generated a transcript from an audio file. You could delete words from the transcript and Descript would automatically delete the corresponding portion of the audio file.
In 2019, Descript acquired a voice-cloning startup called Lyrebird and integrated its technology into the product. As a result, since 2020 it has also been possible to add words to a transcript and have Descript generate realistic audio of your voice saying those words—a feature Descript calls Overdub. Like Play.ht, Overdub needs to be trained using a lengthy audio sample of the target voice.
To test Overdub out, I created another 12-paragraph audio file using Descript and challenged family and friends to say which paragraphs were my real voice and which were generated by Overdub. This was far from a rigorous scientific experiment, but overall it seemed like the cloned voice generated by Play.ht was a bit more convincing than the one generated by Descript’s Overdub technology. You can compare Overdub’s output to my real voice here:
This may not matter much in practice because the two products are designed for slightly different use cases. Play.ht is optimized for generating long audio files from scratch—for example, a complete audio book. In contrast, Overdub is designed to add short phrases to an existing audio file. It’s much harder to detect a synthetic voice in short audio clips, so I suspect Overdub’s voices are plenty realistic for this application.
And Descript uses its AI technology to enhance audio in other ways. A feature called Studio Sound, for example, takes normal audio—perhaps produced using a low-quality microphone in a noisy room—and uses AI to make it sound like it was recorded in a studio. It doesn’t just remove background noise, it subtly alters the speaker’s voice so it sounds like it was recorded with a better microphone.
Descript can also help in the opposite direction: If you add a new audio clip to an existing recording, Descript can add subtle background noise to make sure the new clip has the same “room tone” as the surrounding audio.
Tools like this are a boon for independent creative professionals because they eliminate much of the tedious post-production work required to publish high-quality audio content. But they could also be a boon to criminals and other troublemakers.
The dark side of voice cloning
Last month the Washington Post reported on a Canadian grandmother who was fooled by scammers using voice cloning technology. A man who sounded just like her grandson Brandon called to say he was in jail and needed money.
According to the Post, the woman and her husband “dashed to their bank in Regina, Saskatchewan, and withdrew 3,000 Canadian dollars ($2,207 in U.S. currency), the daily maximum. They hurried to a second branch for more money.”
Luckily, a manager at the second branch warned them that the call had likely been a scam. They didn’t send the money and Brandon turned out to be fine. But scams like this are only going to become more common in the next few years.
Recent months have also seen a proliferation of fake audio of various celebrities—from Joe Biden to Taylor Swift—saying a variety of funny and sometimes offensive things. While most of these clips are harmless, the trend worries Duncan Crabtree-Ireland, the executive director of SAG-AFTRA, a union that represents a broad spectrum of performers, from actors to singers and broadcast journalists. He’s concerned about people using voice cloning to create fake celebrity endorsements, deceiving customers and depriving his members of revenue they are entitled to.
It’s easy to imagine fake audio causing more serious harms. Voice cloning could be used to humiliate celebrities (or non-celebrities for that matter) with fake, sexually explicit audio clips. Political operatives could use fake audio to trick voters in the final days of an election. Imagine someone leaking fake audio of a political candidate saying something embarrassing, or circulating a fake radio or television broadcast on social media.
The leaders of Play.ht and Descript are acutely aware of these dangers. Play.ht CEO Hammad Syed told me that the company has put several safeguards in place, including manual review of training audio and automatic detection of attempts to generate racist or sexually explicit audio.
Descript takes an extra step to make sure users don’t clone someone else’s voice without permission. When someone tries to create a new Overdub voice, the software asks the owner of the voice to read into the microphone a short statement agreeing to have their voice cloned. Descript checks to make sure the voice recorded by the microphone matches the voice in the audio file being used for training. This should make it difficult for anyone to use Overdub for impersonation scams or to clone the voice of a celebrity.
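Descript hasn’t published how its check works, but consent verification like this is commonly built on speaker embeddings: convert each recording into a fixed-length “voiceprint” vector and compare the two with cosine similarity. Below is a rough sketch under that assumption; the `embed_speaker` function here is a deliberately crude stand-in for a trained speaker-embedding model, and the similarity threshold is invented for illustration.

```python
import numpy as np

def embed_speaker(audio: np.ndarray, frame: int = 512) -> np.ndarray:
    """Toy 'voiceprint': average log-magnitude spectrum over fixed-size frames.
    A real system would use a trained speaker-embedding model instead."""
    n = len(audio) - len(audio) % frame
    frames = audio[:n].reshape(-1, frame)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spectra).mean(axis=0)

def same_speaker(consent_clip: np.ndarray, training_audio: np.ndarray,
                 threshold: float = 0.75) -> bool:
    """Cosine-similarity check: does the consent statement sound like the voice
    in the training audio? The 0.75 threshold is made up for illustration."""
    a, b = embed_speaker(consent_clip), embed_speaker(training_audio)
    similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold
```

In practice the embedding model, not the comparison step, does the hard work of distinguishing one voice from another.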
Unlike Play.ht, Descript doesn’t restrict the kind of content people can generate with Overdub once a voice has been created.
Many of the celebrity voice-cloning videos released in recent months were made using software from a company called ElevenLabs. Back in January, 4chan users started using ElevenLabs software to produce fake clips of celebrities engaging in hate speech. ElevenLabs responded by removing the voice-cloning feature from its free tier and releasing a tool to help the public identify fake audio clips.
You could imagine this technology becoming a subject of government regulation, but none of the people I talked to for this story seemed to think that was a good idea.
“We're not looking to ban technology or halt forward progress on technology,” SAG-AFTRA’s Crabtree-Ireland told me. “We are instead looking to work with companies developing these technologies to make sure it's respectful.” He said he has gotten a “surprisingly positive reaction” when he has approached technology companies about implementing appropriate safeguards.
Legislation in this area might ultimately prove futile because it’s only a matter of time before voice cloning software is efficient enough to run entirely on a personal computer. Once that happens, it will become very difficult for governments to limit its distribution or use.
So the most important countermeasure against the misuse of voice cloning may be to make sure the public understands that high-quality voice cloning software exists. Most abuses of voice cloning depend on people wrongly assuming that audio is genuine. If the public knows about voice cloning technology, perhaps they’ll be appropriately cautious about believing the evidence they encounter with their own ears.