Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the process of converting a speech signal into its corresponding text. The quality of an ASR system is measured by how close its recognized word sequences are to those produced by human transcribers. More formally, ASR quality is measured by the Word Error Rate (WER): the word-level edit distance between the automatically generated hypothesis and the ground-truth human transcription, normalized by the length of the transcription.
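To make the metric concrete, here is a minimal sketch of a WER computation using the standard edit-distance (Levenshtein) recurrence applied at the word level; the function name and the example sentences are purely illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution or match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("ali walks to his school", "ali walks to the school"))  # 0.2
```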

Traditionally, ASR systems are split into four components: the acoustic model, the pronunciation dictionary, the language model, and the search decoder. Since ASR output hypotheses need to adhere to the statistical structure of language, the language model ensures that the output sequence matches what is likely to be said. For example, words like "school" or "work" have a higher probability than "oil" or "yellow" of being the next word in the sequence "Ali walks to his ...". The pronunciation dictionary decomposes words into small units of sound known as phonemes. The acoustic model represents the mapping between the audio signal, with its temporal (time-related) and spectral (frequency-related) characteristics, and the phonemes of the language. Each model assigns probabilities to the different choices it makes; the decoder then searches over all these alternatives, weighing their probabilities to come up with the best output hypothesis (see the sketch after the figure below). A very good starting point for learning about ASR is the HTK Book [1].
[Figure: ASR system components.]
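To illustrate how the decoder weighs the models' scores, the toy sketch below combines acoustic-model and language-model log-probabilities for candidate last words of "Ali walks to his ..."; all numbers and names here are invented for illustration, not taken from a real system.

```python
# Hypothetical log-probabilities for candidate last words. In a real
# system the acoustic scores come from the acoustic model plus the
# pronunciation dictionary, and the LM scores from the language model.
acoustic_logp = {"school": -3.5, "work": -4.0, "oil": -3.8, "yellow": -5.0}
lm_logp       = {"school": -1.2, "work": -1.5, "oil": -7.0, "yellow": -8.0}

LM_WEIGHT = 1.0  # real decoders scale the LM score before combining

def decode(candidates):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda w: acoustic_logp[w] + LM_WEIGHT * lm_logp[w])

print(decode(acoustic_logp))  # "school": the LM penalizes "oil" and "yellow"
```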
Different statistical modeling techniques have been used for the different components of the ASR system. For acoustic models, Hidden Markov Models (HMMs) with Gaussian Mixture Model (GMM) state representations [2] were used, as well as Neural Networks [3]. With the Deep Learning revolution, Neural Networks [4] got a boost in performance by going deeper [5], having a large set of acoustic units in their output (built by combining phonemes into context-dependent units) [6], and training on very large volumes of data [7]. There was a similar trend in language modeling, where n-gram language models [8] were dethroned by recurrent neural network language models [9].
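For intuition about the HMM side of the classical pipeline, here is a minimal sketch of the forward algorithm, which computes the likelihood an HMM assigns to a sequence of acoustic observations. The two-state model and all its numbers are toy values; real acoustic models emit continuous feature vectors through GMMs (or take state posteriors from a DNN) rather than discrete symbols.

```python
import numpy as np

# Toy 2-state HMM over 2 discrete observation symbols (values invented).
initial    = np.array([0.8, 0.2])            # P(state at t=0)
transition = np.array([[0.7, 0.3],
                       [0.1, 0.9]])          # P(next state | current state)
emission   = np.array([[0.6, 0.4],
                       [0.2, 0.8]])          # P(observation | state)

def forward(observations):
    """Return P(observations | model) via the forward recursion."""
    alpha = initial * emission[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

print(forward([0, 1, 1]))  # likelihood of the observation sequence
```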
More recently, the research community has been moving towards a more holistic approach that combines all four components into one end-to-end ASR system, where the input is the acoustic signal representation and the output is the word sequence, without building four distinct components [10, 11, 12].
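As a rough illustration of this end-to-end style, here is a minimal sketch in the spirit of the CTC-based approach of [10], assuming PyTorch; the architecture, feature dimension, and character inventory are placeholder choices, not the exact models from the cited papers.

```python
import torch
import torch.nn as nn

# A recurrent encoder maps acoustic features directly to character
# probabilities, with no separate pronunciation dictionary or LM.
class EndToEndASR(nn.Module):
    def __init__(self, n_features=40, n_chars=29, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_chars)  # chars + CTC blank

    def forward(self, features):              # (batch, time, n_features)
        encoded, _ = self.encoder(features)
        return self.classifier(encoded)       # (batch, time, n_chars)

model = EndToEndASR()
ctc_loss = nn.CTCLoss(blank=0)
feats = torch.randn(4, 100, 40)               # a batch of 4 dummy utterances
log_probs = model(feats).log_softmax(-1).transpose(0, 1)  # (time, batch, chars)
targets = torch.randint(1, 29, (4, 12))       # dummy character targets
loss = ctc_loss(log_probs, targets,
                torch.full((4,), 100, dtype=torch.long),
                torch.full((4,), 12, dtype=torch.long))
```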
For building an ASR system in practice, you can also learn a lot from Kaldi [13], a speech recognition toolkit written in C++ and licensed under the Apache License v2.0.

[1] Steve Young et al., "The HTK Book", https://www.danielpovey.com/files/htkbook.pdf
[2] Mark Gales, Steve Young, "The Application of Hidden Markov Models in Speech Recognition", 2007, https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf
[3] Hervé Bourlard, Nelson Morgan, "Connectionist Speech Recognition: A Hybrid Approach", 1994
[4] Geoffrey Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups", 2012
[5] Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton, "Acoustic Modeling Using Deep Belief Networks", 2010
[6] George Dahl, Dong Yu, Li Deng, Alex Acero, "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition", 2010
[7] Frank Seide, Gang Li, Xie Chen, Dong Yu, "Feature Engineering in Context-Dependent Deep Neural Networks for Conversational Speech Transcription", 2011
[8] Andreas Stolcke, "SRILM - An Extensible Language Modeling Toolkit", 2002
[9] Tomas Mikolov et al., "RNNLM - Recurrent Neural Network Language Modeling Toolkit", 2010
[10] Alex Graves, Navdeep Jaitly, "Towards End-to-End Speech Recognition with Recurrent Neural Networks", 2014
[11] Dzmitry Bahdanau et al., "End-to-End Attention-based Large Vocabulary Speech Recognition", 2015
[12] William Chan et al., "Listen, Attend and Spell", 2015
[13] Kaldi, http://kaldi-asr.org/