VALL-E, a neural codec language model, can reproduce a voice from a three-second audio recording
Text-to-speech models usually require significantly longer training samples, whereas VALL-E creates a much more natural-sounding synthetic voice from just a few seconds of audio.