CALM: the Corpus al-Logha al-Musriya

 A two-million-word corpus of Egyptian Arabic

 

CALM contains transcripts of 65 movies (655,858 word tokens) and 88 scripted television programs (396,734 word tokens), together with internet texts (1,092,442 word tokens), for a total of 2,145,034 word tokens. Part of the corpus has been annotated, and annotation is ongoing.

 

Download the following files by right-clicking on each:

blog_corpus.zip

movie_corpus.zip

annotated_blog.zip

annotated_movie.zip
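
Once an archive is downloaded, a rough token count can be used to check it. The following is only a minimal sketch: the internal layout of the archives is not described on this page, so it assumes that each archive holds UTF-8 plain-text files and that a word token is any whitespace-separated string. The function name count_tokens is hypothetical, and the counts it produces may differ from the figures quoted above if the corpus was tokenized differently.

```python
import zipfile
from collections import Counter


def count_tokens(archive, encoding="utf-8"):
    """Count whitespace-separated tokens across all files in a zip archive.

    `archive` may be a path (e.g. "movie_corpus.zip") or a file-like object.
    Assumption: members are plain text in the given encoding.
    Returns (total_token_count, Counter mapping token -> frequency).
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # Undecodable bytes are replaced rather than raising an error.
            text = zf.read(info).decode(encoding, errors="replace")
            counts.update(text.split())
    return sum(counts.values()), counts
```

For example, count_tokens("movie_corpus.zip") would return the total number of tokens together with a frequency table, provided the file has been saved in the working directory and the assumptions above hold.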

 

For more information, consult these resources:

or contact the developer directly: Michael Grant White (mgrantwhite at gmail.com).