Text To Speech

Text-to-speech (TTS) technology converts a sequence of words into speech. Traditional TTS pipelines, or engines, generate speech in a few steps:

The text normalization (or tokenization) step converts raw text containing symbols such as numbers and abbreviations into the equivalent words.
In the text-to-phoneme (grapheme-to-phoneme) conversion step, a phonetic transcription is assigned to each word.
The prosodic phrasing step divides the text into prosodic units, such as phrases and sentences, and marks them.
The prosody prediction step predicts the target prosody (pitch contour, phoneme durations), which is used to generate and control the output speech.
Finally, the synthesizer is used to convert the symbolic linguistic representation into sound.
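
As an illustration of the first step only, the following is a minimal sketch of text normalization in Python. The abbreviation table, the function names, and the restriction to English integers below 1000 are all assumptions made for this example; a production normalizer would also handle dates, currencies, ordinals, and context-dependent abbreviations.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (illustration only).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + number_to_words(rest)


def normalize(text):
    """Expand known abbreviations and short digit strings into words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,;:!?")  # drop surrounding punctuation
        if re.fullmatch(r"[0-9]{1,3}", token):
            words.append(number_to_words(int(token)))
        elif token:
            words.append(token)
    return " ".join(words)
```

For example, `normalize("Dr. Smith lives at 42 Main St.")` yields `doctor smith lives at forty two main street`. Note that this sketch always expands "St." to "street"; choosing between "street" and "saint" is the kind of ambiguity a real normalizer has to resolve from context.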

Synthesized speech can be created by concatenating units of recorded speech that are stored in a database, as in [1][2]. Common units used in concatenative synthesizers are phones or diphones. Alternatively, statistical parametric synthesizers, also known as HMM-based synthesizers (based on hidden Markov models), can be used to create the synthesized speech [3][4]. In these systems, the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. More recently, neural networks have been used as acoustic models for statistical parametric synthesizers [5]. In addition, end-to-end DNN-based speech synthesizers such as Tacotron [7] by Google and Deep Voice [6] from Baidu are an active area of research. A state-of-the-art synthesizer based on Tacotron, developed for the Arabic language, is available on GitHub [8].
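
The concatenative approach can be sketched in a few lines of Python. This is a simplified illustration, not the method of any cited system: the function names, the `_` silence marker, and the linear crossfade at the joints are assumptions for this example. Unit-selection systems additionally choose among many candidate recordings for each unit, which is omitted here.

```python
def diphones(phones):
    """Turn a phone sequence into diphone names, padded with silence '_'."""
    padded = ["_"] + list(phones) + ["_"]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]


def crossfade_concat(units, overlap):
    """Join lists of samples, linearly crossfading `overlap` samples per joint."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(phones, database, overlap=2):
    """Look up each diphone in `database` (name -> samples) and join them."""
    return crossfade_concat([database[name] for name in diphones(phones)], overlap)
```

Here `database` stands in for the inventory of recorded units; in this sketch it is just a dictionary mapping a diphone name such as `"k-a"` to a list of waveform samples.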

Since Modern Standard Arabic (MSA) is usually written without diacritics, the first step in developing an Arabic TTS engine [2] is to restore the diacritics of each word in the text [9][10][11]. The diacritized text is then passed to a phonetic transcription module to generate the phoneme sequence for each phrase [12]. Finally, a synthesizer (e.g., concatenative, parametric, or neural) can be used to synthesize the speech.
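
The ordering of these stages can be sketched as follows. The three stage functions are passed in as parameters because the diacritizer, phonetizer, and synthesizer are separate components; all names here are hypothetical and do not correspond to the API of any cited tool. The only concrete logic is the check for Arabic diacritic marks, which occupy the Unicode range U+064B to U+0652.

```python
# Arabic diacritic marks (tanween, short vowels, shadda, sukun): U+064B..U+0652.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Remove Arabic diacritic marks, leaving the bare letters."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def has_diacritics(text):
    """Return True if the text contains at least one Arabic diacritic mark."""
    return any(ch in ARABIC_DIACRITICS for ch in text)


def run_pipeline(text, diacritize, phonetize, synthesize):
    """Chain the stages: restore diacritics, transcribe, then synthesize.

    Simplifying assumption: text with any diacritic is treated as fully
    diacritized; partially diacritized input is not handled.
    """
    if not has_diacritics(text):
        text = diacritize(text)
    phonemes = phonetize(text)
    return synthesize(phonemes)
```

Passing the stages as callables keeps the sketch independent of any particular diacritizer or synthesizer, so each one can be swapped without changing the control flow.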

[1] A. Hunt and A. Black, "Unit selection in a concatenative speech synthesis system using a large speech database." In ICASSP-96, volume 1, pages 373--376, Atlanta, Georgia, 1996.

[2] Hifny, Yasser, et al. "ArabTalk®: An Implementation for Arabic Text To Speech System." The proceedings of the 4th Conference on Language Engineering. 2004.

[3] HMM/DNN-based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp/

[4] Abdel-Hamid, Ossama, Sherif Mahdy Abdou, and Mohsen Rashwan. "Improving Arabic HMM based speech synthesis quality." Ninth International Conference on Spoken Language Processing. 2006.

[5] Merlin: The Neural Network (NN) based Speech Synthesis System, https://github.com/CSTR-Edinburgh/merlin

[6] Ping, Wei, et al. "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning." (2018).

[7] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).

[8] https://github.com/youssefsharief/arabic-tacotron-tts

[9] Rashwan, Mohsen AA, et al. "A stochastic Arabic diacritizer based on a hybrid of factorized and unfactorized textual features." IEEE Transactions on Audio, Speech, and Language Processing 19.1 (2011): 166-175.

[10] Darwish, Kareem, Hamdy Mubarak, and Ahmed Abdelali. "Arabic diacritization: Stats, rules, and hacks." Proceedings of the Third Arabic Natural Language Processing Workshop. 2017.

[11] Hifny, Yasser. "Hybrid LSTM/MaxEnt Networks for Arabic Syntactic Diacritics Restoration." IEEE Signal Processing Letters 25.10 (2018): 1515-1519.

[12] https://github.com/nawarhalabi/Arabic-Phonetiser