
Broadcast Arabic Speech Recognition: MGB-5

MGB-5 evaluates ASR and dialect identification techniques using YouTube recordings. The Moroccan ASR task comprises 14 hours of speech extracted from 93 YouTube videos distributed across seven genres. The Fine-grained Arabic Dialect Identification (ADI) task is to classify speech from YouTube into one of 17 dialects (ADI17), using more than 3,000 hours of data.

In addition to the 1,200 hours used in 2016 from Aljazeera TV programs, MGB-5 explores multi-genre data: comedy, cooking, cultural, environment, family-kids, fashion, movies-drama, sports and science talks (TEDx).

Moroccan Arabic Automatic Speech Recognition

The MGB-5 Arabic data comprises 14 hours of Moroccan Arabic speech extracted from 93 YouTube videos distributed across seven genres: comedy, cooking, family/children, fashion, drama, sports, and science clips. We assume that the MGB-5 data is not enough by itself to build robust speech recognition systems, but that it could be useful for adaptation and for hyper-parameter tuning of models built using the MGB-2 data. Therefore, we suggest reusing the MGB-2 training data in this challenge and treating the provided in-domain data as (supervised) adaptation data.

Given that dialectal Arabic does not have a clearly defined orthography, different people tend to write the same word in slightly different forms. Therefore, instead of imposing strict guidelines to enforce a standardized orthography, the challenge allows variations in spelling. Multiple transcriptions were produced, with transcribers free to write the transcripts as they deemed correct. Every file has been segmented and transcribed by four different Moroccan annotators.
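Because each file comes with four transcriptions, a recognizer's output can be compared against more than one reference. The sketch below is only an illustration of that idea, not the challenge's official scoring: it computes a plain word error rate (WER) against each reference and reports the best and the average. The function names `edit_distance` and `wer` and the placeholder tokens (`w1`, `w2`, ...) are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis against a single reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical data: four annotators' transcripts of the same segment,
# written with placeholder tokens rather than real Moroccan Arabic.
references = [
    "w1 w2 w3 w4",
    "w1 w2 w3 w4b",   # spelling variant of the last word
    "w1 w2b w3 w4",   # spelling variant of the second word
    "w1 w2 w4",       # one annotator omitted a word
]
hypothesis = "w1 w2 w3 w4"

scores = [wer(r, hypothesis) for r in references]
best_wer = min(scores)
average_wer = sum(scores) / len(scores)
print(f"best WER: {best_wer:.3f}, average WER: {average_wer:.3f}")
```

Taking the minimum rewards a hypothesis that matches any one annotator's spelling, while the average penalizes it for every variant it does not match; which of the two is more appropriate depends on the evaluation goal.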

The 93 YouTube clips have been manually labeled for speech and non-speech segments. About 12 minutes from each program were selected for transcription. The resulting speech segments were then distributed into training, development and test sets as follows:
  • Training data: 10.2 hours from 69 programs
  • Development data: 1.8 hours from 10 programs
  • Testing data: 2.0 hours from 14 programs
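As a quick sanity check, the split figures listed above can be summed to confirm that they add up to the stated 14 hours and 93 programs. This is a minimal sketch; the dictionary layout and variable names are choices made for this example only.

```python
# Figures copied from the split list above: (hours, number of programs).
splits = {
    "train": (10.2, 69),
    "dev": (1.8, 10),
    "test": (2.0, 14),
}

# Round to one decimal to avoid floating-point noise in the sum.
total_hours = round(sum(hours for hours, _ in splits.values()), 1)
total_programs = sum(programs for _, programs in splits.values())

for name, (hours, programs) in splits.items():
    share = 100 * hours / total_hours
    print(f"{name:5s} {hours:4.1f} h  {programs:2d} programs  {share:4.1f}% of hours")

print(total_hours, total_programs)  # 14.0 93
```

By duration, the training set accounts for roughly 73% of the transcribed speech, with about 13% for development and 14% for testing.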

In addition to the 14 transcribed hours, the full programs are also provided, amounting to 48 hours for the 93 programs. This data can be used for in-domain speech or genre adaptation.

You can find samples here: audio, segmentation, transcription in Arabic and transcription in Buckwalter.
You can find the MGB-5 ASR baseline system here.

Dataset for Moroccan ASR

MGB-5 data distribution across the three classes, given as duration in hours / number of programs (roughly 12 minutes each). * marks the duration of the complete recordings, including speech and non-speech segments.

Fine-grained Arabic Dialect Identification (ADI)

The ADI task is to classify speech from YouTube into one of 17 dialects (ADI17). Previous studies on Arabic dialect identification from the audio signal have been limited to 5 dialect classes by the lack of speech corpora. To enable a fine-grained analysis of Arabic dialect speech, we collected Arabic dialect data from YouTube.

For the Train set, about 3,000 hours of Arabic dialect speech from 17 countries in the Arab world was collected from YouTube. Since we labeled the speech according to the country of the YouTube channel it came from, the dataset might contain some labeling errors. For this reason, the ADI task has two sub-tracks: a supervised learning track and an unsupervised track. Participants are therefore free to use the Train set labels or to ignore them.

For the Dev and Test sets, about 280 hours of speech data was collected from YouTube. After automatic speaker linking and dialect labeling by human annotators, we selected 57 hours of speech to use as the Dev and Test sets for performance evaluation. The test set is divided into three sub-categories by segment duration: short (under 5 sec), medium (between 5 sec and 20 sec), and long (over 20 sec).
You can find the ADI17 baseline system here.
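The three duration sub-categories described above can be expressed as a small bucketing function, which is handy for reporting results separately per category. This is a minimal sketch: the function name `duration_bucket` is hypothetical, and treating exactly 5 sec and exactly 20 sec as "medium" is an assumption, since the text does not say how the boundaries are handled.

```python
def duration_bucket(seconds: float) -> str:
    """Map a segment duration in seconds to 'short', 'medium' or 'long'."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5.0:
        return "short"    # under 5 sec
    if seconds <= 20.0:
        return "medium"   # 5 to 20 sec (boundaries assumed inclusive)
    return "long"         # over 20 sec


# Hypothetical segment durations, for illustration only.
example_durations = [3.2, 5.0, 12.7, 20.0, 45.1]
buckets = [duration_bucket(d) for d in example_durations]
print(buckets)  # ['short', 'medium', 'medium', 'medium', 'long']
```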

Arabic Dialect Identification for 17 countries (ADI17) Dataset

For more details about MGB-5, visit here