Language Model

Language modeling aims at accurately estimating the probability distribution of word sequences or sentences produced in a natural language such as Arabic [1]. Having a way to estimate the relative likelihood of different word sequences is useful in many natural language processing applications, especially those that generate natural text, as in speech recognition. The goal of a speech recognizer is to match input speech sounds with word sequences. To accomplish this goal, the recognizer leverages the language model to distinguish between words and phrases that sound similar. These ambiguities are easier to resolve when evidence from the language model is combined with the pronunciation model and the acoustic model.
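As a concrete illustration (this is the standard noisy-channel formulation, not a detail taken from any of the cited systems), a speech recognizer searches for the word sequence W that best explains the acoustic input A by combining the acoustic likelihood with the language-model prior:

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid A)
        = \operatorname*{arg\,max}_{W} P(A \mid W)\, P(W)
```

Here P(A | W) is supplied by the acoustic and pronunciation models, while P(W) is the score provided by the language model.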

Language models rely heavily on the context, or history, to estimate the probability distribution. The context can be long or short, knowledge-rich or knowledge-poor. We may base the estimation on a single preceding word (e.g., a bigram model), or potentially on all words from the start of the passage preceding the word in question. Knowledge-rich models can incorporate information about morphology, syntax, or semantics to inform the estimation of the probability distribution of a word sequence, whereas knowledge-poor models rely solely on the words as they appear in the text. It is reasonable to state that current language modeling techniques can be split into two categories: count-based and continuous-space language models.
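To make the role of the context concrete, a knowledge-poor model factors the probability of a sequence with the chain rule and then truncates the history, for example to a single preceding word in the bigram case:

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
                   \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```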

The count-based approaches represent the traditional techniques and usually involve estimating n-gram probabilities, where the goal is to accurately predict the next word in a sequence of words. In a model that estimates probabilities for two-word sequences (bigrams), it is unclear whether a given bigram has a count of zero because it is not a valid sequence in the language, or because it is not in the training data. As the length of the modeled sequences grows, this sparsity problem becomes more severe. Of all possible combinations of 5-grams in a language, very few are likely to appear at all in a given text, and even fewer will repeat often enough to provide reliable frequency statistics. Therefore, as the language model tries to predict the next word, the challenge is to find appropriate, reliable estimates of word-sequence probabilities to enable the prediction. Approaches to this challenge are threefold: smoothing techniques are used to offset zero-probability sequences and spread probability mass across the model [2-4]; enhanced modeling techniques that incorporate machine learning or more complex algorithms are used to create models that can best incorporate additional linguistic information [5-6]; and, particularly for Arabic language modeling, morphological information is extracted and provided to the models in place of or in addition to lexical information [7-8].
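The sketch below is a deliberately minimal illustration of the count-based approach with smoothing, using simple add-one (Laplace) smoothing rather than the more refined methods of [2-4]; the toy corpus and function names are hypothetical.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(word, prev, unigrams, bigrams):
    """Add-one smoothed P(word | prev): unseen bigrams get a small, non-zero mass."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus; in practice the counts would come from a large training text.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_lm(corpus)
print(bigram_prob("cat", "the", unigrams, bigrams))   # seen bigram
print(bigram_prob("bird", "the", unigrams, bigrams))  # unseen bigram, still > 0
```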

Continuous-space language modeling approaches are based on the use of neural networks to estimate the probability distribution of a word sequence [9-10]. These approaches, also referred to as neural language models, are based on feed-forward neural networks [9] or recurrent neural networks [11-13] that achieved state-of-the-art performance. Recently, a new technique based on transformers (BERT) has started to be explored for language modeling as well [16]. Initially, the feed-forward neural network based LM efficiently tackled the problem of data sparsity, but not necessarily the problem of context, since it uses a fixed-length context. Every word in the vocabulary is associated with a distributed word feature vector, and the joint probability of a word sequence is expressed as a function of the feature vectors of the words in that sequence [9-10].
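
As a rough sketch of this fixed-context, feed-forward architecture (in the spirit of the feed-forward models cited above, but with arbitrary, illustrative layer sizes and no training loop), the following PyTorch module embeds each context word, concatenates the embeddings, and outputs a distribution over the vocabulary:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Fixed-context neural LM: embed the context words, concatenate, predict the next word."""
    def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the preceding words
        emb = self.embed(context_ids).flatten(start_dim=1)  # (batch, context_size * embed_dim)
        h = torch.tanh(self.hidden(emb))
        return self.out(h)                                  # next-word logits

model = FeedForwardLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 3)))             # a batch of 8 three-word contexts
next_word_probs = torch.softmax(logits, dim=-1)             # distribution over the next word
```
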
The recurrent neural network based LM was able, to a certain degree, to address the problem of limited context. It does not use a fixed-length context, as its internal memory is able to remember important things about the input it has received. In this type of architecture, neurons with input from recurrent connections are assumed to represent short-term memory, which enables the network to better leverage the history or context [9, 14, 15]. Subsequent research has focused on sub-word modeling and corpus-level modeling based on the recurrent neural network and its variants, such as the long short-term memory network (LSTM) [15]. However, very long training times and the need for large amounts of data remain the main limitations. It is also reasonable to say that sub-word modeling and large-context language models are still interesting challenges to solve, which is very important for a language such as Arabic [17].
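
For contrast, here is a minimal recurrent (LSTM-based) sketch that carries its history in the recurrent state rather than in a fixed window; the layer sizes and names are again illustrative assumptions, not taken from any of the cited systems:

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Recurrent LM: the LSTM state summarizes the full history seen so far."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len); state carries memory across calls if provided
        emb = self.embed(token_ids)
        outputs, state = self.lstm(emb, state)
        return self.out(outputs), state      # per-position next-word logits

model = LSTMLanguageModel(vocab_size=10000)
logits, state = model(torch.randint(0, 10000, (8, 20)))   # 8 sequences of 20 tokens
```
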
The reader can also refer to the toolkits listed in [18-22] as a starting point for building their own language models.

[1] I. Zitouni (Ed.), Natural language processing of Semitic languages, theory and applications of natural language processing, Chapter 5. Springer, Berlin, Heidelberg (2014)
[2] Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 181–184.
[3] Ciprian Chelba and Johan Schalkwyk, 2013. Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search, pages 197–229. Springer, New York
[4] Stanley Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August.
[5] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, "Class-based n-gram models of natural language," Computational Linguistics, vol. 18, no. 4, pp. 467-479, 1992.
[6] R. A. Solsona, E. Fosler-Lussier, H. J. Kuo, A. Potamianos and I. Zitouni, "Adaptive language models for spoken dialogue systems," 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, 2002, pp. I-37-I-40. doi: 10.1109/ICASSP.2002.5743648
[7] G. Choueiter, D. Povey, S. F. Chen and G. Zweig, "Morpheme-Based Language Modeling for Arabic LVCSR," 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, Toulouse, 2006, pp. I-I. doi: 10.1109/ICASSP.2006.1660205
[8] K. Kirchhoff, D. Vergyri, J. Bilmes, K. Duh, A. Stolcke, "Morphology-based language modeling for conversational Arabic speech recognition," Computer Speech & Language, vol. 20, no. 4, pp. 589-608, Oct. 2006.
[9] Mikolov, T. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[10] W. De Mulder, S. Bethard, M. F. Moens. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech & Language, vol. 30, no. 1, pp. 61-98, March 2015.
[11] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-Aware Neural Language Models. CoRR, abs/1508.06615.
[12] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[13] Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, pp. 1045-1048, 2010.
[14] Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. Trans. Audio, Speech and Lang. Proc. 23, 3 (March 2015), 517-529. DOI: https://doi.org/10.1109/TASLP.2015.2400218
[15] S. Yousfi, S.A. Berrani, C. Garcia. Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos. Pattern Recognition. Vol. 64 pp. 245-254 April 2017.
[16] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv e-prints.
[17] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint, 1602.02410, 2016. arxiv.org/abs/1602.02410.
[18] CMU Statistical Language Modeling Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit.html
[19] HTK Toolkit: http://htk.eng.cam.ac.uk/download.shtml
[20] SRILM - The SRI Language Modeling Toolkit: http://www.speech.sri.com/projects/srilm/
[21] Stanford CoreNLP – Natural language software: https://stanfordnlp.github.io/CoreNLP/
[22] The Berkeley NLP Group: http://nlp.cs.berkeley.edu/software.shtml