Our columnist replaced herself with AI voice and video to see how humanlike the tech can be. The results were eerie.
By guest author Joanna Stern from the Wall Street Journal.
The bad news: She can fool my family and trick my bank.
Maybe you’ve played around with chatbots like OpenAI’s ChatGPT and Google’s Bard, or image generators like Dall-E. If you thought they blurred the line between AI and human intelligence, you ain’t seen—or heard—nothing yet.
Over the past few months, I’ve been testing Synthesia, a tool that creates artificially intelligent avatars from recorded video and audio (aka deepfakes). Type in anything and your video avatar parrots it back.
Since I do a lot of voice and video work, I thought this could make me more productive, and take away some of the drudgery. That’s the AI promise, after all. So I went to a studio and recorded about 30 minutes of video and nearly two hours of audio that Synthesia would use to train my clone. A few weeks later, AI Joanna was ready.
Then I attempted the ultimate day off, Ferris Bueller style. Could AI me—paired with ChatGPT-generated text—replace actual me in videos, meetings and phone calls? It was…eye-opening or, dare I say, AI-opening. (Let’s just blame AI Joanna for my worst jokes.)
Eventually AI Joanna might write columns and host my videos. For now, she’s at her best illustrating the double-edged sword of generative-AI voice and video tools.
My video avatar looks like an avatar.
Video is a lot of work. Hair, makeup, wardrobe, cameras, lighting, microphones. Synthesia promises to eradicate that work, and that’s why corporations already use it. You know those boring compliance training videos? Why pay actors to star in a live-action version when AI can do it all? Synthesia charges $1,000 a year to create and maintain a custom avatar, plus an additional monthly subscription fee. It offers stock avatars for a lower monthly cost.
I asked ChatGPT to generate a TikTok script about an iOS tip, written in the voice of Joanna Stern. I pasted it into Synthesia, clicked “generate” and suddenly “I” was talking. It was like looking at my reflection in a mirror, albeit one that removes hand gestures and facial expressions. For quick sentences, the avatar can be quite convincing. The longer the text, the more her bot nature comes through.
On TikTok, where people have the attention span of goldfish, those computer-like attributes are less noticeable. Still, some viewers quickly picked up on it. For the record, I would rather eat live eels than utter the phrase “TikTok fam,” but AI me had no problem with it.
The bot-ness got very obvious on work video calls. I downloaded clips of her saying common meeting remarks (“Hey everyone!” “Sorry, I was muted.”) then used software to pump them into Google Meet. Apparently AI Joanna’s perfect posture and lack of wit were dead giveaways.
All this will get better, though. Synthesia has some avatars in beta that can nod up and down, raise their eyebrows and more.
My AI voice sounds a lot like me.
When my sister’s fish died, could I have called with condolences? Yes. On a phone interview with Snap CEO Evan Spiegel, could I have asked every question myself? Sure. But in both cases, my AI voice was a convincing stand-in. At first.
I didn’t use Synthesia’s voice clone for those calls. Instead, I used one generated by ElevenLabs, an AI speech-software developer. My producer Kenny Wassus gathered about 90 minutes of my voice from previous videos and we uploaded the files to the tool—no studio visit needed. In under two minutes, it cloned my voice. In ElevenLabs’s web-based tool, type in any text, click Generate, and within seconds “my” voice says it aloud. Creating a voice clone with ElevenLabs starts at $5 a month.
Compared with Synthesia Joanna, the ElevenLabs me sounds more humanlike, with better intonations and flow.
My sister, whom I call several times a week, said the bot sounded just like me, but noticed the bot didn’t pause to take breaths. When I called my dad and asked for his Social Security number, he only knew something was up because it sounded like a recording of me.
The potential for misuse is real.
The ElevenLabs voice was so good it fooled my Chase credit card’s voice biometric system.
I cued AI Joanna up with several things I knew Chase would ask, then dialed customer service. At the biometric step, when the automated system asked for my name and address, AI Joanna responded. Hearing my bot’s voice, the system recognized it as me and immediately connected me to a representative. When our video intern called and did his best Joanna impression, the automated system asked for further verification.
A Chase spokeswoman said the bank uses voice biometrics, along with other tools, to verify callers are who they say they are. She added that the feature is meant for customers to quickly and securely identify themselves, but to complete transactions and other financial requests, customers must provide additional information.
What’s most worrying: ElevenLabs made a very good clone without much friction. All I had to do was click a button saying I had the “necessary rights or consents” to upload audio files and create the clone, and that I wouldn’t use it for fraudulent purposes.
That means anyone on the internet could take hours of my voice—or yours, or Joe Biden’s or Tom Brady’s—to save and use. The Federal Trade Commission is already warning about AI-voice related scams.
Synthesia requires that the audio and video include verbal consent, which I gave when I filmed and recorded with the company.
ElevenLabs only allows cloning in paid accounts, so any use of a cloned voice that breaks company policies can be traced to an account holder, company co-founder Mati Staniszewski told me. The company is working on an authentication tool so people can upload any audio to check if it was created using ElevenLabs technology.
Both systems allowed me to generate some horrible things in my voice, including death threats.
A Synthesia spokesman said my account was designated for use with a news organization, which means it can say words and phrases that might otherwise be filtered. The company said its moderators flagged and deleted my problematic phrases later on. When my account was changed to the standard type, I was no longer able to generate those same phrases.
Mr. Staniszewski said ElevenLabs can identify all content made with its software. If content breaches the company’s terms of service, he added, ElevenLabs can ban the originating account and, in cases of lawbreaking, assist authorities.
This stuff is hard to spot.
When I asked Hany Farid, a digital-forensics expert at the University of California, Berkeley, how we can spot synthetic audio and video, he had two words: good luck.
“Not only can I generate this stuff, I can carpet-bomb the internet with it,” he said, adding that you can’t make everyone an AI detective.
Sure, my video clone is clearly not me, but it will only get better. And if my own parents and sister can’t really hear the difference in my voice, can I expect others to?
I got a bit of hope from hearing about the Adobe-led Content Authenticity Initiative. Over 1,000 media and tech companies, academics and more aim to create an embedded “nutrition label” for media. Photos, videos and audio on the internet might one day come with verifiable information attached. Synthesia is a member of the initiative.
I feel good about being a human.
Unlike AI Joanna, who never smiles, real Joanna had something to smile about after this. ChatGPT generated text lacking my personality and expertise. My video clone was lacking the things that make me me. And while my video producer likes using my AI voice in early edits to play with timing, my real voice has more energy, emotion and cadence.
Will AI get better at all of that? Absolutely. But I also plan to use these tools to afford me more time to be a real human. Meanwhile, I’m at least sitting up a lot straighter in meetings now.
Corrections & Amplifications
A caption accompanying an image in an earlier version of this article misspelled the name of the Synthesia web tool as Sythesia. (Corrected on April 28, 2023)