2023 is shaping up to be a big year for artificial intelligence. Several companies have joined this new era, and Microsoft is one of the most prominent. Its AI efforts go beyond integrating ChatGPT into Bing.
What is VALL-E: Microsoft’s Commitment to AI
Microsoft has a secret weapon, and it is VALL-E. Doesn’t ring a bell? In this article, we explain what VALL-E is and what it could be used for.
What is VALL-E?
VALL-E is a neural codec language model for text-to-speech synthesis. In other words, it is a tool that can replicate a voice and make it read any text you give it. Microsoft says that a recording of just three seconds is enough to imitate a voice. Doesn’t that seem amazing?
The most interesting part is that Microsoft is working to make VALL-E and ChatGPT work together. If you think about it for a moment, this would combine generative text AI with voice AI.
To put it more simply, imagine it as an update to ChatGPT that lets it read its answers aloud in whatever voice you want.

With this option, you could ask for the answers to be read to you in the voice of your favorite celebrity. All you would need is a three-second recording to make that fantasy come true.
In addition, it can imitate not only the voice itself but also the speaker’s original cadence and the tone in which the voice was recorded.

Beyond ChatGPT, it can also be combined with other text-to-speech (TTS) tools and voice editing applications.
What is VALL-E: process
The process is simple: on one side, you enter the text you want to synthesize; on the other, you add a three-second recording of the voice you want the model to imitate.

Next, the text is converted into phonemes, while the recording is run through an audio codec encoder that turns it into discrete acoustic tokens.

Both inputs then go into a neural codec language model, which generates new acoustic tokens for the requested text in the prompt speaker’s voice. Finally, an audio codec decoder turns those tokens back into audio, producing the personalized speech.
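To make the data flow above a little more concrete, here is a toy Python sketch. It is not Microsoft’s code and contains no real model: every stage (the phoneme table, the “codec”, the “language model”) is a hypothetical placeholder with made-up names and toy values. Only the order of the steps mirrors the description. For simplicity, the stand-in model emits one token per phoneme, which a real system does not do.

```python
from typing import List

# Hypothetical toy stand-ins: none of these functions are VALL-E's real components.
# They only mirror the data flow: text -> phonemes, audio -> tokens,
# (phonemes + prompt tokens) -> new tokens -> audio.

FRAME = 4    # samples per codec frame (toy value)
LEVELS = 16  # number of discrete codes per frame (toy codebook size)

# Toy grapheme-to-phoneme table; a real system would use a proper G2P tool.
TOY_G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}


def text_to_phonemes(text: str) -> List[str]:
    """Step 1a: convert text into a phoneme sequence (toy dictionary lookup)."""
    phonemes: List[str] = []
    for word in text.lower().split():
        # Unknown words fall back to their letters, just to keep the toy total.
        phonemes.extend(TOY_G2P.get(word, list(word.upper())))
    return phonemes


def codec_encode(audio: List[float]) -> List[int]:
    """Step 1b: toy 'audio codec encoder' producing one discrete token per frame.

    Each frame is averaged, and the mean (assumed to lie in [-1, 1]) is
    quantized to one of LEVELS integer codes.
    """
    tokens = []
    for start in range(0, len(audio) - FRAME + 1, FRAME):
        mean = sum(audio[start:start + FRAME]) / FRAME
        code = round((mean + 1.0) / 2.0 * (LEVELS - 1))
        tokens.append(max(0, min(LEVELS - 1, code)))
    return tokens


def toy_codec_language_model(phonemes: List[str], prompt_tokens: List[int]) -> List[int]:
    """Step 2: stand-in for the neural codec language model.

    A real model predicts new acoustic tokens conditioned on the phonemes and
    the voice prompt. Here we just emit one token per phoneme, cycling through
    the prompt's tokens so the output reuses the prompt's codes.
    """
    if not prompt_tokens:
        raise ValueError("need a non-empty voice prompt")
    return [prompt_tokens[i % len(prompt_tokens)] for i in range(len(phonemes))]


def codec_decode(tokens: List[int]) -> List[float]:
    """Step 3: toy 'audio codec decoder' turning each code into a flat frame."""
    audio: List[float] = []
    for code in tokens:
        value = code / (LEVELS - 1) * 2.0 - 1.0  # map code back to [-1, 1]
        audio.extend([value] * FRAME)
    return audio


def synthesize(text: str, voice_prompt: List[float]) -> List[float]:
    """Run the toy pipeline end to end: text + voice prompt -> waveform."""
    phonemes = text_to_phonemes(text)
    prompt_tokens = codec_encode(voice_prompt)
    new_tokens = toy_codec_language_model(phonemes, prompt_tokens)
    return codec_decode(new_tokens)


if __name__ == "__main__":
    # A pretend "three-second recording", shortened to 8 samples.
    prompt = [0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.5, 0.5]
    speech = synthesize("hello world", prompt)
    print(len(speech))  # 8 phonemes * 4 samples per frame = 32
```

The point of the sketch is the shape of the pipeline: two independent front ends (text to phonemes, audio to tokens) that meet in a single model, followed by a decoder back to audio.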
On the VALL-E project page, you can find several examples of what this technology can do. But naturally, a question may have crossed your mind as you learned about the potential of this tool: what about voice impersonation?
At the bottom of the page, there is a section containing VALL-E’s ethics statement. In it, the authors admit that the model carries potential risks of misuse, such as impersonating a specific speaker.

They also explain that their experiments were carried out under the assumption that the speaker agrees to be the target of speech synthesis. For this reason, they consider it necessary to include a protocol that ensures the speaker approves the use of their voice.
As is often the case with artificial intelligence, these questions touch on human ethics. As AI advances, this is undoubtedly an issue we cannot ignore.

We must not forget that technology is a tool meant to help, but it can be misused depending on whose hands it falls into. Technology will keep advancing, and we must also learn to protect ourselves from its possible misuse.