A Frequency Dictionary of Arabic Newsprint

A Frequency Dictionary of Arabic Newsprint (FDAN) is a prototype frequency dictionary of Modern Standard Arabic (MSA) based on the Linguistic Data Consortium’s Arabic Treebank Corpus (ATC). Four volumes of the ATC, representing a corpus of some 800,000 words, were used to create the frequency count. Discrepancies among the tagging formats of the individual volumes were resolved using regular expressions. The frequency count was then created using a relational database. The 2,000 most common lemmas were included as lexical entries in FDAN, together with relevant grammatical and phonemic information. Sample phrases from the ATC were selected using Parkinson’s arabiCorpus.
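The pipeline described above (regular-expression normalization, then a frequency count in a relational database) can be illustrated with a minimal Python sketch. It is not the thesis's actual code: the two tag formats, the sample tokens, and the names `normalize` and `top_lemmas` are hypothetical, and an in-memory SQLite database stands in for whatever database was really used.

```python
import re
import sqlite3

# Hypothetical tokens in two invented tagging formats ("lemma/POS" and
# "lemma_N|POS"). The real ATC volume formats are not reproduced here.
RAW_TOKENS = [
    "kitAb/NOUN",
    "kitAb_1|NOUN",
    "fiy/PREP",
    "kitAb/NOUN",
    "fiy_1|PREP",
    "qAl/VERB",
]

# One pattern that accepts both invented formats.
TOKEN_PATTERN = re.compile(r"(?P<lemma>[^/|_]+)(?:_\d+)?[/|](?P<pos>\w+)")


def normalize(token):
    """Reduce a token in either invented format to a (lemma, pos) pair."""
    match = TOKEN_PATTERN.fullmatch(token)
    if match is None:
        raise ValueError(f"unrecognized tag format: {token!r}")
    return match.group("lemma"), match.group("pos")


def top_lemmas(tokens, limit):
    """Count lemma frequencies in SQLite and return the `limit` most common."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (lemma TEXT, pos TEXT)")
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?)",
        [normalize(t) for t in tokens],
    )
    rows = conn.execute(
        "SELECT lemma, pos, COUNT(*) AS freq FROM tokens "
        "GROUP BY lemma, pos ORDER BY freq DESC, lemma ASC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return rows
```

In this sketch, `top_lemmas(RAW_TOKENS, 2)` returns the two most frequent lemma and part-of-speech pairs with their counts. For a dictionary like FDAN the limit would be 2,000.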

Thesis Author: Mouritsen, Stephen C.

Year Completed: 2007

Committee Members: Deryle W. Lonsdale, Dilworth Parkinson

Thesis Chair: Mark Edward Davies