Articulatory speech synthesis

1. Why articulatory speech synthesis?

Modern personal computers have gigabytes of RAM and hundreds or thousands of gigabytes of storage. This has made possible concatenative speech synthesizers with very large databases (hundreds of megabytes), which can produce very high quality speech. Newer synthesizers based on deep neural networks (DNNs) also produce speech of excellent quality. But both methods depend on recorded speech: creating a new high quality voice requires recording many hours of speech.

Articulatory synthesis produces speech by simulating the human phonatory system. Because the sound comes from a simulation, an articulatory synthesizer does not need to depend on recorded speech. Producing high quality speech this way is very difficult, but changing the voice, for example from male to female, is relatively easy: the user just needs to modify some parameters.

The most interesting characteristic of articulatory synthesizers is that they do not treat the phonatory system as a black box. For example, with articulatory synthesizers we can easily understand the cause of the differences between the "ba" and "ma" sounds (the velum aperture), or between "ba" and "da" (the constriction position).
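The comparison above can be sketched in a few lines of Python. This is only an illustration, not the data model of any particular synthesizer: the `StopGesture` class, its two fields, and the `differing_parameters` helper are hypothetical names, and the values are the textbook descriptions of the three consonants rather than measured parameters.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StopGesture:
    """Hypothetical, simplified description of the consonant in a syllable."""
    constriction_place: str  # where the vocal tract is closed
    velum_open: bool         # True lets air flow through the nasal cavity


# Illustrative values only (assumed for this sketch, not taken from a real model).
GESTURES = {
    "ba": StopGesture(constriction_place="lips", velum_open=False),
    "ma": StopGesture(constriction_place="lips", velum_open=True),
    "da": StopGesture(constriction_place="alveolar ridge", velum_open=False),
}


def differing_parameters(a: str, b: str) -> list:
    """Return the names of the parameters that differ between two syllables."""
    ga, gb = GESTURES[a], GESTURES[b]
    return [f.name for f in fields(StopGesture)
            if getattr(ga, f.name) != getattr(gb, f.name)]


print(differing_parameters("ba", "ma"))  # ['velum_open']
print(differing_parameters("ba", "da"))  # ['constriction_place']
```

In this toy representation each pair of syllables differs in exactly one parameter, which is the kind of explanation a black-box model cannot give directly.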

The video "Real-time control of an articulatory speech synthesizer" demonstrates this flexibility. The speech quality in it is not good, but it shows the potential of articulatory synthesis.

2. Software