Spotify and OpenAI are working together to let podcasters translate their episodes into other languages, almost completely automatically and even in the podcaster's own voice. The AI translation tool is initially available to a select group of podcasters, who can have their episodes translated from English into Spanish, French, and German.
The translations are produced with OpenAI's Whisper. Whisper itself is specialized in transcribing speech into text; the translated text is then converted back into speech. To reproduce a podcaster's original voice, a synthetic version of that voice has to be created first, and a few seconds of audio input are enough for that. For safety reasons, among others, this is not available to everyone. OpenAI writes: "These capabilities also present new risks, such as the potential to impersonate public figures or commit fraud."
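Neither Spotify nor OpenAI has published the exact pipeline, and the podcaster-specific voice synthesis is not publicly available, but the general flow described here (transcribe, translate, then synthesize speech) can be sketched with OpenAI's public Python SDK. The file names, the target language, and the stock "alloy" voice below are placeholder assumptions standing in for the restricted voice-cloning step.

```python
# Rough sketch of a transcribe -> translate -> synthesize pipeline using
# OpenAI's public Python SDK (openai >= 1.0). This is NOT the actual
# Spotify/OpenAI integration: the cloned podcaster voice is not publicly
# available, so a stock TTS voice ("alloy") stands in for it, and the
# file names and target language are arbitrary examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the original episode audio with Whisper.
with open("episode.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Translate the transcript into the target language.
translation = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Translate the user's text into Spanish."},
        {"role": "user", "content": transcript.text},
    ],
)
translated_text = translation.choices[0].message.content

# 3. Convert the translated text back into speech.
#    (A real episode-length transcript would need to be chunked; the TTS
#    endpoint accepts only a few thousand characters per request.)
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # stand-in; the real product uses a synthetic podcaster voice
    input=translated_text,
)

with open("episode_es.mp3", "wb") as f:
    f.write(speech.content)
```

In the actual product, the final step would use a voice model conditioned on a few seconds of the podcaster's own audio; that is precisely the capability OpenAI restricts for the safety reasons quoted above.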
That is why a different version of the speech technology powers ChatGPT's newly announced voice assistant feature. For this, OpenAI worked with professional voice actors and synthesized their voices. The results are said to still sound a bit tinny, but close enough to the originals to enable a very different experience from earlier voice assistants such as Alexa or Siri.
Spotify writes in its announcement of Voice Translation for Podcasters: "With recent advances, we asked ourselves: are there more ways to bridge the language barrier so these voices can be heard around the world?" The early testers are Dax Shepard, Monica Padman, Lex Fridman, Bill Simmons, and Steven Bartlett, all English-speaking podcasters. Both selected past episodes and upcoming episodes will be translated. More podcasters are to follow soon; Trevor Noah, for example, is already in the pipeline.
YouTube has also already released an AI-powered translation feature for videos, which lets YouTubers create alternative audio tracks to reach a wider audience. So far, this feature is likewise limited to select creators and languages. It works much like the OpenAI and Spotify collaboration: it first creates a transcript of the audio track, translates it, and then passes the translation to a text-to-speech model. At Google, the Aloud team is responsible for it. There is no way to use a synthesized version of your own voice yet; for now, a generic computer-generated voice speaks the translation, though Google has already announced that this will change.