
Broadcast Arabic Speech Recognition: MGB-3

MGB-3 comprises 16 hours of multi-genre data collected from different YouTube channels, all of it manually transcribed. The Arabic dialect chosen for MGB-3 is Egyptian. Because dialectal Arabic has no standardized orthographic rules, each program has been transcribed by four different transcribers following a shared transcription guideline. The MGB-3 data is split into three groups: adaptation, development, and evaluation. The evaluation data was shared at evaluation time.

Data

Egyptian broadcast data collected from YouTube.
This year, we collected about 80 programs from different YouTube channels. The first 12 minutes of each program have been transcribed and released. This amounts to roughly 16 hours in total, divided as follows:
Adaptation: 12 minutes * 24 programs.
Development: 12 minutes * 24 programs.
Evaluation: 12 minutes * 31 programs.
All programs have been transcribed by four different annotators to explore the effect of dialectal Arabic having no standard orthography.
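
As a quick sanity check on the figures above, the split can be added up directly. The sketch below is ours, not part of the MGB-3 release; the names `PROGRAMS`, `MINUTES_PER_PROGRAM` and `split_summary` are illustrative only, and the numbers are copied from the list above.

```python
# Program counts per split and minutes transcribed per program,
# copied from the list above. All names here are illustrative.
MINUTES_PER_PROGRAM = 12
PROGRAMS = {"adaptation": 24, "development": 24, "evaluation": 31}


def split_summary(programs, minutes_per_program):
    """Return total programs, total minutes, and total hours for the split."""
    total_programs = sum(programs.values())
    total_minutes = total_programs * minutes_per_program
    return total_programs, total_minutes, total_minutes / 60


total_programs, total_minutes, total_hours = split_summary(
    PROGRAMS, MINUTES_PER_PROGRAM
)

for name, count in PROGRAMS.items():
    # Per-split duration in minutes.
    print(f"{name}: {count} programs, {count * MINUTES_PER_PROGRAM} minutes")
print(f"total: {total_programs} programs, {total_hours:.1f} hours")
```

The three splits add up to 79 programs and 948 minutes, which is 15.8 hours, consistent with "about 80 programs" and "roughly 16 hours".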

Challenge Overview: Speech data

*A. Ali, S. Renals, S. Vogel, “Speech Recognition Challenge in the Wild: Arabic MGB-3”, ASRU 2017 

Inter-annotator disagreement

*A. Ali, S. Renals, S. Vogel, “Speech Recognition Challenge in the Wild: Arabic MGB-3”, ASRU 2017 
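
To make the idea of disagreement between four transcribers concrete, here is a minimal sketch that computes a word-level error rate between every pair of transcripts of the same segment. This is an assumption on our part about one reasonable way to measure disagreement; it is not necessarily the metric behind the figure above. The function names and the placeholder tokens in `example` are hypothetical and are not MGB-3 data.

```python
from itertools import combinations


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of `hypothesis` against `reference`."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_edit_distance(ref, hyp) / len(ref)


def pairwise_disagreement(transcripts):
    """Average the WER in both directions for every pair of transcribers."""
    scores = {}
    for (a, text_a), (b, text_b) in combinations(sorted(transcripts.items()), 2):
        scores[(a, b)] = (wer(text_a, text_b) + wer(text_b, text_a)) / 2
    return scores


# Hypothetical placeholder transcripts of one segment (not real data).
example = {
    "t1": "a b c d",
    "t2": "a b c d",
    "t3": "a x c d",
    "t4": "a b c",
}
scores = pairwise_disagreement(example)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Averaging both directions makes the score symmetric, since no single transcriber is treated as the reference. With four transcribers this gives six pairwise scores per segment.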