Broadcast Arabic Speech Recognition: MGB-2
- MGB-2 comprises about 1,200 hours of Aljazeera TV programs that were manually captioned without timing information. The QCRI Arabic ASR system was used to recognize all programs, and its output was used to align the manual captions and produce speech segments for training speech recognition systems. More than 20 hours from 2015 programs were transcribed verbatim and manually segmented.
Data provided includes:
- Approximately 1,200 hours of Arabic broadcast data, obtained from about 4,000 programmes broadcast on the Aljazeera Arabic TV channel over a span of 10 years, from 2005 until September 2015.
- Time-aligned transcription produced by lightly supervised alignment, based on whole-episode human transcription of varying quality.
- More than 110 million words from the Aljazeera.net website, collected between 2004 and 2011.
The verbatim-transcribed data (more than 20 hours from 2015 programs) is split into a development set of 10 hours and a similar evaluation set of 10 hours. Both the development and evaluation data were released in the 2016 MGB challenge.
MGB-2 genre and domain statistics
*A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak, S. Renals, and Y. Zhang, “The MGB-2 Challenge: Arabic multi-dialect broadcast media recognition,” in SLT, 2016.
Sample of speech recognition
The QCRI Arabic ASR system was used to recognize all MGB-2 programs. The ASR output was used to align the manual captions and produce speech segments for training speech recognition systems. More than 20 hours from 2015 programs were transcribed verbatim and manually segmented.
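The idea of using ASR output to put timings on untimed captions can be illustrated with a minimal sketch. This is not the QCRI pipeline; it is a simplified illustration that assumes the ASR output is available as (word, start, end) tuples and uses Python's standard difflib.SequenceMatcher to find caption words that also appear, in order, in the ASR output. The function names, the tuple format, and the Latin placeholder tokens are all hypothetical.

```python
from difflib import SequenceMatcher


def align_caption_to_asr(caption_words, asr_words):
    """Copy ASR word timings onto caption words that match the ASR output.

    caption_words: list of str (untimed manual caption tokens).
    asr_words: list of (word, start_sec, end_sec) tuples (assumed format).
    Returns a list of (caption_word, start, end); caption words with no
    matching ASR word keep None timings.
    """
    asr_tokens = [word for word, _, _ in asr_words]
    matcher = SequenceMatcher(a=caption_words, b=asr_tokens, autojunk=False)
    timed = [(word, None, None) for word in caption_words]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = asr_words[block.b + k]
            timed[block.a + k] = (caption_words[block.a + k], start, end)
    return timed


def matched_ratio(timed):
    """Fraction of caption words that received a timing (rough quality score)."""
    if not timed:
        return 0.0
    return sum(1 for _, start, _ in timed if start is not None) / len(timed)


# Hypothetical example: the ASR misrecognized the second caption word.
caption = ["hello", "world", "today", "news"]
asr = [("hello", 0.0, 0.4), ("word", 0.4, 0.8),
       ("today", 0.8, 1.2), ("news", 1.2, 1.6)]
timed = align_caption_to_asr(caption, asr)
ratio = matched_ratio(timed)
```

In this sketch, the matched ratio plays the role of a per-segment confidence: segments where few caption words match the ASR output could be excluded from training data, which is one way to cope with captions of varying quality.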
For more details about MGB-2, visit here