Microsoft Unveils VALL-E, a Text-to-Speech AI Model with Alarming Implications

Microsoft's latest artificial intelligence model, VALL-E, is making waves in the tech community. While VALL-E's capabilities are impressive, they also raise serious concerns about the potential for misuse.

Microsoft researchers have announced a new artificial intelligence (AI) text-to-speech model called VALL-E. Given just a 3-second audio sample of a speaker, the model can synthesize new speech that closely resembles that speaker's voice.
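To make the idea of zero-shot voice cloning concrete, here is a minimal illustrative sketch of the input/output contract described above: a short prompt clip plus its transcript, and the text to be spoken in that voice. VALL-E has not been released, so the names and the placeholder implementation below are assumptions for illustration only, not Microsoft's API.

```python
# Illustrative sketch only: mirrors the "3-second prompt + target text -> speech"
# contract described in the article. All names are made up; this is NOT VALL-E.
import numpy as np

def clone_voice(prompt_audio: np.ndarray, prompt_text: str,
                target_text: str, sr: int = 16000) -> np.ndarray:
    """Placeholder for a zero-shot TTS model: given ~3 s of a speaker and its
    transcript, generate the target text in that speaker's voice."""
    # A real model would encode the prompt into discrete acoustic tokens and
    # condition generation of the target text on them; here we simply return
    # silence of a plausible length as a stand-in.
    duration_s = max(1.0, 0.4 * len(target_text.split()))
    return np.zeros(int(sr * duration_s), dtype=np.float32)

# Usage: a 3-second enrollment clip is all the speaker data the model needs.
prompt = np.random.randn(3 * 16000).astype(np.float32)  # stand-in for a real clip
speech = clone_voice(prompt, "Hi, this is me.", "Words the speaker never actually said.")
```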


VALL-E was trained on LibriLight, an audio library compiled by Meta. It contains 60,000 hours of English audio from more than 7,000 speakers, most of it drawn from LibriVox public-domain audiobooks. For VALL-E to produce convincing results, the voice in the sample must closely match a voice in the training data.


The AI model can also preserve the speaker's emotional tone and mimic the “acoustic environment” of the audio sample, further enhancing the realism of the synthesized speech. Yet this functionality raises serious concerns about potential abuse, such as spoofing voice-identification systems or impersonating a specific speaker.


To mitigate these risks, Microsoft says it will follow its AI principles as the model evolves. It is also possible to build a detection model that identifies whether a given audio clip was synthesized by VALL-E.
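As a rough idea of what such a detector could look like, here is a minimal sketch of a binary classifier that labels clips as real or synthesized. The feature choice (mean MFCCs), the classifier, and the file lists are illustrative assumptions, not Microsoft's actual detection model.

```python
# Minimal sketch of a real-vs-synthetic speech detector.
# Features, classifier, and file paths are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def clip_features(path: str, sr: int = 16000) -> np.ndarray:
    """Summarize a clip with mean MFCCs, a common baseline audio feature."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

# Hypothetical training data: genuine recordings vs. TTS-generated clips.
real_paths = ["real_0.wav", "real_1.wav"]
fake_paths = ["synth_0.wav", "synth_1.wav"]

X = np.stack([clip_features(p) for p in real_paths + fake_paths])
y = np.array([0] * len(real_paths) + [1] * len(fake_paths))  # 1 = synthesized

detector = LogisticRegression(max_iter=1000).fit(X, y)
print(detector.predict_proba(clip_features("unknown_clip.wav").reshape(1, -1)))
```

A production detector would of course need far more data and stronger features, but the basic shape, extract audio features and train a classifier to flag synthetic speech, is the same.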

Progress in speech synthesis will further fuel the spread of deepfakes; the two are closely tied together. To better understand the power of deepfakes, I invite you to take a look at one of the most recent examples. In the video below, Joe Rogan and the neuroscientist Andrew Huberman appear to promote a scam product on Joe Rogan's podcast. Everything in this video was generated by an AI model: the two protagonists never said any of the words you can hear.

The previous tweet showcases the power of deepfakes for scammers, but the use of this technology in politics could have far more serious implications.

The previous deepfake example wasn't made with VALL-E, since the model isn't currently available to the public. But it appears so powerful and easy to use that its potential impact is already raising concerns among experts and the public alike. With deepfakes on the rise, the spread of misinformation online is already a major problem, and the creation of VALL-E adds to these concerns, raising questions about the future of text-to-speech and the spread of fake news.


The development of VALL-E highlights the ongoing ethical and technical challenges of AI technology. As AI continues to advance, it is important that researchers, policymakers, and society at large consider the potential implications of these developments and ensure responsible use.