MGB-2 – Arabic Speech

Broadcast Arabic Speech Recognition: MGB-2

Download MGB-2

The MGB-2 corpus consists of about 1,200 hours of Aljazeera TV programs that were manually captioned with no timing information. The QCRI Arabic ASR system was used to recognize all programs, and its output was used to align the manual captions and produce speech segments for training speech recognition systems. More than 20 hours of programs from 2015 were transcribed verbatim and manually segmented.
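The idea of using ASR output to give untimed captions their timing can be illustrated with a small sketch. This is not the QCRI pipeline, whose details are in the challenge paper; it is a simplified stand-in that assumes the ASR output is a list of (word, start, end) tuples and uses Python's standard difflib to find exact word matches. The function names, the 0.5-second gap threshold, and the English placeholder tokens in the usage note are all hypothetical.

```python
from difflib import SequenceMatcher


def align_captions(asr_words, caption_words):
    """Transfer ASR word timings onto caption words that match exactly.

    asr_words: list of (word, start_sec, end_sec) tuples from a recognizer.
    caption_words: list of caption tokens with no timing information.
    Returns a list of (caption_word, start_sec, end_sec) for matched words;
    caption words with no ASR match are dropped in this simplified sketch.
    """
    asr_tokens = [word for word, _, _ in asr_words]
    matcher = SequenceMatcher(a=asr_tokens, b=caption_words, autojunk=False)
    timed = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = asr_words[block.a + k]
            timed.append((caption_words[block.b + k], start, end))
    return timed


def to_segments(timed, max_gap=0.5):
    """Group timed words into segments, splitting where the pause between
    consecutive matched words exceeds max_gap seconds (assumed threshold)."""
    segments = []
    for word, start, end in timed:
        if segments and start - segments[-1]["end"] <= max_gap:
            segments[-1]["words"].append(word)
            segments[-1]["end"] = end
        else:
            segments.append({"start": start, "end": end, "words": [word]})
    return segments
```

For example, calling `to_segments(align_captions(asr, captions))` with `asr = [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("uh", 1.0, 1.1), ("today", 2.0, 2.4), ("news", 2.5, 2.9)]` and `captions = ["hello", "world", "today", "the", "news"]` yields two segments, one per stretch of matched speech. A production aligner would also have to handle unmatched caption words and recognition errors, which this sketch simply skips.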

Data provided includes:

Approximately 1,200 hours of Arabic broadcast data, obtained from about 4,000 programs broadcast on the Aljazeera Arabic TV channel over a span of 10 years, from 2005 until September 2015.
Time-aligned transcriptions produced by lightly supervised alignment of the human transcription of each whole episode; the quality of the human transcription varies.
More than 110 million words of text from the Aljazeera.net website, collected between 2004 and 2011.

Metadata for each program includes the title, genre tag, and date/time of transmission. The original set of data for this period contained about 1,500 hours of audio, obtained from all shows; we have removed programs with damaged aligned transcriptions. The aligned, segmented transcription is shared along with the original raw transcription (which has no timing information).

This data is split into a development set of 10 hours and an evaluation set of the same size. Both sets were released as part of the 2016 MGB challenge.

MGB-2 genre and domain statistics

*A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak, S. Renals, and Y. Zhang, “The MGB-2 Challenge: Arabic multi-dialect broadcast media recognition,” in SLT, 2016.

Sample of speech recognition

For more details about MGB-2, visit here