CALM: the Corpus al-Logha al-Musriya
A two-million-word corpus of Egyptian Arabic
CALM contains transcripts from 65 movies (comprising 655,858 word tokens), 88 scripted television programs (396,734 word tokens), and internet texts (1,092,442 word tokens). Some of the content has been annotated, and annotation is ongoing.
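As a quick check on the figures above, the three token counts can be summed to confirm the "two-million-word" description and to see how much each source contributes. This is a minimal sketch: the dictionary keys are illustrative labels, not names used in the corpus files, and the counts are copied from the description above.

```python
# Token counts as reported in the corpus description.
# The keys are illustrative labels, not file or directory names from CALM.
token_counts = {
    "movies": 655_858,
    "television": 396_734,
    "internet": 1_092_442,
}

# Overall size of the corpus in word tokens.
total = sum(token_counts.values())

# Fraction of the corpus contributed by each source.
shares = {name: count / total for name, count in token_counts.items()}

print(f"total tokens: {total:,}")  # total tokens: 2,145,034
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The internet texts account for roughly half of the corpus, with the movie and television transcripts making up the remainder.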
Download the following files by right-clicking on each:
For more information, consult these resources:
the WACL paper
the thesis
or contact the developer directly: Michael Grant White (mgrantwhite at gmail.com).