
The Google Assistant now sounds more natural thanks to software from Alphabet’s DeepMind artificial intelligence research group. With Tacotron 2, a system that trains neural networks to generate natural-sounding speech from text, Google has reached a new milestone in the quest to make computer-generated speech indistinguishable from human speech, and it should make the Assistant on your phone and smart speaker sound a lot more life-like.
Google asserts that previous approaches to text-to-speech (TTS) have so far failed to achieve a genuinely natural sound. Techniques such as concatenative synthesis, in which pre-recorded samples of speech are stitched together, and statistical parametric speech synthesis have been insufficient, the company explains: “The audio produced by these systems often sounds muffled and unnatural compared to human speech.”
In the past, the goal of the WaveNet technology was to move the Assistant from synthesized speech to a more natural speech pattern. Synthesized speech like you’d get from Google Assistant or Apple’s Siri is normally stitched together from small bits of recorded speech. This is called “concatenative text-to-speech,” and it’s why some answers can sound a bit off when they’re read back to you.
Since bits of speech are essentially glued together, it’s hard to account for emotion or inflection. To get around that, most voice models are trained on samples that have as little variance as possible. That lack of variance in the speech pattern is why synthesized voices can sound a bit robotic, and it’s where WaveNet comes in.
With WaveNet, instead of recording hours of words, phrases, and fragments and then linking them together, the technology uses real speech to train a neural network. WaveNet learned the underlying structure of speech: which tones followed others, and which waveforms were realistic and which weren’t. Using that data, the network was able to synthesize voice samples one at a time, taking into account the sample before it. By being aware of the preceding waveform, WaveNet could create speech patterns that sound more natural. The advantages of this approach were subtle, but you could definitely hear them.
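The sample-by-sample loop described above can be sketched in a few lines. This is a toy illustration, not WaveNet itself: the real model is a deep stack of dilated convolutions, while here a fixed linear function of the last few samples stands in for the network. What it does show is the autoregressive pattern, where each new sample is predicted from, and then appended to, the samples that came before it.

```python
import random

RECEPTIVE_FIELD = 4  # how many past samples the toy model looks at

def toy_predict_next(history, weights):
    # A weighted sum of the recent past stands in for the network's output.
    return sum(h * w for h, w in zip(history, weights))

def generate(n_samples, seed=0):
    rng = random.Random(seed)
    # Small random weights stand in for learned parameters (hypothetical).
    weights = [rng.uniform(-0.2, 0.2) for _ in range(RECEPTIVE_FIELD)]
    # Start from a short burst of noise as the initial context.
    audio = [rng.uniform(-1.0, 1.0) for _ in range(RECEPTIVE_FIELD)]
    for _ in range(n_samples):
        context = audio[-RECEPTIVE_FIELD:]
        audio.append(toy_predict_next(context, weights))
    return audio[RECEPTIVE_FIELD:]  # drop the seed context

samples = generate(16000)  # one "second" at 16 kHz, generated sample by sample
```

Because every output depends on the outputs before it, generation is inherently sequential, which is exactly why early WaveNet was so slow to run.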
You may have noticed a difference in Google Assistant over the last few days. With Tacotron 2, Google has incorporated ideas from its previous TTS systems, WaveNet and the first Tacotron, to reach a new level of fidelity. Software engineers Jonathan Shen and Ruoming Pang explain:
In a nutshell, it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
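The two-stage pipeline the engineers describe can be sketched as follows. The function names, shapes, and the crude duration guess are illustrative assumptions, not Google’s actual API; the real components are large neural networks. Only the numbers quoted in the post (80 mel channels, 12.5 ms frames, 24 kHz output) are taken from the source.

```python
FRAME_HOP_MS = 12.5    # one spectrogram frame every 12.5 ms (from the post)
N_MEL_CHANNELS = 80    # 80-dimensional audio spectrogram (from the post)
SAMPLE_RATE = 24_000   # 24 kHz output waveform (from the post)

def seq2seq_to_spectrogram(text):
    """Stand-in for the sequence-to-sequence model: maps a sequence of
    letters to spectrogram frames. Here we just emit zero-valued frames,
    sized by a rough ~60 ms-per-character speaking-rate assumption."""
    est_duration_ms = len(text) * 60
    n_frames = int(est_duration_ms / FRAME_HOP_MS)
    return [[0.0] * N_MEL_CHANNELS for _ in range(n_frames)]

def wavenet_vocoder(spectrogram):
    """Stand-in for the WaveNet-like vocoder: converts the frames into a
    24 kHz waveform. One 12.5 ms frame corresponds to 300 samples."""
    samples_per_frame = int(SAMPLE_RATE * FRAME_HOP_MS / 1000)  # 300
    return [0.0] * (len(spectrogram) * samples_per_frame)

spec = seq2seq_to_spectrogram("Hello there!")
wave = wavenet_vocoder(spec)
```

The key design choice is the spectrogram as an intermediate representation: the first network only has to get pronunciation and prosody right in a compact feature space, and the vocoder handles the hard problem of producing raw audio.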
The result of all this work is a digital voice that can handle some of the most subtle nuances of human speech. Not only can Tacotron 2 handle increasingly complex words and correctly interpret text that contains errors, it has also noticeably improved at grasping the nuances of punctuation, intonation, and pronunciation based on the semantic context of a sentence.
Tacotron 2 correctly pronounces and intones heteronyms such as “read,” “desert,” and “present” based on their intended meanings. Google shows how confident it is in its TTS capabilities by pitting Tacotron 2 samples against recordings of a real human reading the same text.
The system has come a long way in a short time. Just 12 months ago, when it was introduced, generating 0.02 seconds of speech took one second of processing. In those 12 months, the team made the process 1,000 times faster: it can now generate 20 seconds of higher-quality audio in just one second of processing time. The team has also increased the quality of the audio, bumping the waveform resolution of each sample from 8 bits to 16 bits, the resolution used for CDs.
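The quoted speedup can be sanity-checked as a real-time-factor calculation (a small illustrative computation, not from the post):

```python
# Real-time factor: seconds of audio generated per second of compute.
rtf_then = 0.02 / 1.0   # a year ago: 0.02 s of speech per 1 s of processing
rtf_now = 20.0 / 1.0    # today: 20 s of speech per 1 s of processing

speedup = rtf_now / rtf_then
print(round(speedup))  # prints 1000, matching the claimed 1,000x speedup
```

Note that a real-time factor below 1 means the system cannot keep up with playback, so the jump from 0.02 to 20 is the difference between an offline research demo and something usable in a live assistant.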

