VALL-E, a neural codec language model, can reproduce a voice from a three-second audio recording

Text-to-speech models usually require significantly longer training samples, whereas VALL-E creates a much more natural-sounding synthetic voice from just a few seconds of audio.
