MGB-3 – Arabic Speech

Broadcast Arabic Speech Recognition: MGB-3

Download MGB-3

MGB-3 comprises 16 hours of multi-genre data collected from different YouTube channels, all of it manually transcribed. The chosen Arabic dialect for MGB-3 is Egyptian. Because dialectal Arabic has no standard orthographic rules, each program has been transcribed by four different transcribers using this transcription guideline. The MGB-3 data is split into three groups: adaptation, development, and evaluation. The evaluation data was shared at the evaluation.

Data

Egyptian broadcast data collected from YouTube.

This year, we collected about 80 programs from different YouTube channels. The first 12 minutes of each program have been transcribed and released. This sums to roughly 16 hours in total, divided as follows:

Adaptation: 12 minutes * 24 programs.

Development: 12 minutes * 24 programs.

Evaluation: 12 minutes * 31 programs.
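
The split sizes listed above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than part of the official release; the names `SPLITS` and `split_hours` are illustrative assumptions, and the program counts and per-program duration are taken directly from the list.

```python
# Split sizes as listed above: (number of programs, minutes transcribed per program).
# The dictionary and function names are illustrative, not part of the MGB-3 release.
SPLITS = {
    "adaptation": (24, 12),
    "development": (24, 12),
    "evaluation": (31, 12),
}


def split_hours(splits):
    """Return the hours of transcribed audio in each split."""
    return {
        name: programs * minutes / 60
        for name, (programs, minutes) in splits.items()
    }


hours = split_hours(SPLITS)
total_programs = sum(programs for programs, _ in SPLITS.values())
total_hours = sum(hours.values())

for name, h in hours.items():
    print(f"{name}: {h:.1f} h")
print(f"total: {total_programs} programs, {total_hours:.1f} h")
```

The three splits add up to 79 programs and 15.8 hours, which matches the "about 80 programs" and "roughly 16 hours" stated above.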

All programs have been transcribed by four different annotators to explore the non-orthographic nature of dialectal Arabic.
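
One way to quantify how much several transcriptions of the same audio differ is to compute a word error rate (WER) between every pair of transcribers and average the result. The sketch below illustrates that idea only; it is an assumption, not necessarily the measure used in the challenge. The function names are hypothetical, and the example strings are toy placeholder tokens, not MGB-3 data.

```python
from itertools import combinations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """WER of `hyp` against `ref`, both whitespace-tokenized strings."""
    ref_words, hyp_words = ref.split(), hyp.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def pairwise_disagreement(transcripts):
    """Mean WER over all transcriber pairs, averaged over both directions."""
    pairs = list(combinations(transcripts, 2))
    if not pairs:
        raise ValueError("need at least two transcripts")
    scores = [(wer(a, b) + wer(b, a)) / 2 for a, b in pairs]
    return sum(scores) / len(scores)


# Toy placeholder transcripts from four hypothetical transcribers.
toy = ["a b c d", "a b c d", "a x c d", "a b c"]
print(f"mean pairwise WER: {pairwise_disagreement(toy):.3f}")
```

Averaging the WER in both directions keeps the score symmetric, since no single transcriber is treated as the reference.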

Challenge Overview: Speech data

*A. Ali, S. Renals, S. Vogel, “Speech Recognition Challenge in the Wild: Arabic MGB-3”, ASRU 2017

Inter-annotator disagreement

*A. Ali, S. Renals, S. Vogel, “Speech Recognition Challenge in the Wild: Arabic MGB-3”, ASRU 2017

For more details about MGB-3, visit here