Arabic Dialect Identification

The task of dialect identification (DID) is a special case of the more general problem of language identification (LID). LID refers to the process of automatically identifying the language class for a given speech segment or text document. The Arabic language has several spoken dialects. There are four major dialects of Arabic, namely Egyptian, Gulf, Levantine and North African, in addition to Modern Standard Arabic (MSA), which is the official language in Arabic-speaking countries.

Arabic dialect identification is arguably a more challenging problem than LID, since it requires distinguishing between dialects within the same language class. Automatically identifying the dialect from the speech signal is therefore an interesting research problem, both in its own right and as a means of improving automatic speech recognition (ASR) [1].

For example, the word "tomato" has 10 lexical variations and 15 phonological variations across dialects [Fig-1].

Approaches to Arabic dialect identification (ADI) are closely related to those of language recognition. These include Gaussian mixture models, the phonotactic approach and phone recognition [2], the i-vector approach combined with dimensionality reduction [3] and, more recently, deep learning techniques [4-7]. Arabic dialect identification has also been closely associated with improving dialectal Arabic ASR; interesting work has been done in the context of the GALE project [8] and in a recent thesis [9]. In spite of these advances, Arabic dialect recognition remains a challenging problem, and several special sessions and contests have been organized around the subject [10]. These include good pointers to many techniques and data sets. In addition, several repositories [11-13] are a good starting point for building an experimental setup.
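To make the phonotactic idea more concrete, the following is a minimal sketch, not the method of any cited system. It assumes each utterance has already been converted to a sequence of phone-like symbols, trains one add-one-smoothed bigram model per dialect, and picks the dialect whose model gives the sequence the highest log-likelihood. The labels dialect_A and dialect_B, the toy symbol sequences, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def train_bigram_model(sequences):
    """Count symbol bigrams and their left contexts for one dialect."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def log_likelihood(seq, model, vocab_size):
    """Log-probability of seq under a bigram model with add-one smoothing."""
    bigrams, contexts, _ = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        total += math.log(prob)
    return total

def identify(seq, models):
    """Return the label of the model that scores seq highest."""
    # Use one shared vocabulary size so scores are comparable across models.
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda name: log_likelihood(seq, models[name], vocab_size))

# Hypothetical toy data: abstract symbols, not real phone transcriptions.
TOY_TRAIN = {
    "dialect_A": [["g", "a", "l"], ["g", "a", "l", "b"], ["g", "u", "l"]],
    "dialect_B": [["q", "a", "l"], ["q", "a", "l", "b"], ["q", "u", "l"]],
}
models = {name: train_bigram_model(seqs) for name, seqs in TOY_TRAIN.items()}

prediction_a = identify(["g", "a", "l", "a"], models)
prediction_b = identify(["q", "u", "l", "b"], models)
print(prediction_a, prediction_b)
```

In a real system the symbol sequences would come from a phone recognizer and the n-gram models would be trained on far more data; the sketch only illustrates the scoring step.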

[1] A. Ali, et al. "Automatic dialect detection in Arabic broadcast speech." in Interspeech 2016.

[2] M. A. Zissman, “A comparison of four approaches to automatic language identification of telephone speech,” in IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, Jan 1996.

[3] N. Dehak, P.A. Torres-Carrasquillo, D. Reynolds and R. Dehak, “Language recognition via i-vectors and dimensionality reduction,” in Interspeech 2011.

[4] O. Ghahabi, A. Bonafonte, J. Hernando and A. Moreno, “Deep neural networks for i-vector language identification of short utterances in cars,” in Interspeech 2016.

[5] S. Shon, A. Ali, and J. Glass. "MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge." Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.

[6] M. Najafian, et al. "Exploiting convolutional neural networks for phonotactic based dialect identification." in ICASSP 2018.

[7] S. Shon, A. Ali, and J. Glass. "Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition." Proc. Odyssey 2018 The Speaker and Language Recognition Workshop. 2018.

[8] F. Biadsy, J. Hirschberg and N. Habash, “Spoken Arabic dialect identification using phonotactic modeling,” in Proceedings of EACL workshop on computational approaches to Semitic languages, 2009.

[9] A. Ali. Multi-dialect Arabic broadcast speech recognition. PhD thesis, The University of Edinburgh, 2018.

[10] M. Zampieri, et al. "Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign." Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects. Association for Computational Linguistics, 2018.

[11] https://github.com/qcri/dialectID/

[12] https://github.com/swshon/dialectID_e2e

[13] https://github.com/swshon/dialectID_siam

[Fig-1] Bouamor, et al. "The MADAR Arabic dialect corpus and lexicon." in LREC 2018. http://madar.camel-lab.com