Simple ratios:
With linguistic data:
Words with frequency of 80 in the BNC:
Example: garnished
2. Which register?
· Try some very colloquial phrases -- like "so not", "he's all worried", etc. -- in BNC or COCA.
· Main issue: what "genre" is it? How to distinguish genres?
3. What dialect (geographical)?
· UK, AU, NZ, but what about US? .EDU? .US? .COM?
4. What types of queries?
Possibly for frequency of words and phrases, but what about:
· Part of speech
· Lemma
· Collocates
Other "non-corpus corpora"
B. Newspaper corpus (New York Times)
Look for "new words" in the Oxford English Dictionary
Meaningful comparisons
· Can't just give raw frequency for one feature in Corpus1 vs. raw frequency in Corpus2
· Have to do one of the following:
o "Normalize" frequency (e.g. per million words) in the two corpora, OR
o Compare two features in each corpus (e.g. at hospital vs. at the hospital)
· To calculate per million: (FREQ/SIZE) × 1,000,000
BNC has 100 million words and COCA had 425 million words
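
The two ways of making a comparison meaningful can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function names `per_million` and `feature_ratio` are made up here, the corpus sizes are the rounded figures given above, and the counts for "at hospital" and "at the hospital" are placeholder numbers, not real corpus results. The only frequency taken from the notes is the value of 80 in the BNC.

```python
# Rounded corpus sizes as stated in the notes (in words).
BNC_SIZE = 100_000_000
COCA_SIZE = 425_000_000


def per_million(freq: int, corpus_size: int) -> float:
    """Normalize a raw frequency: (FREQ / SIZE) x 1,000,000."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return freq * 1_000_000 / corpus_size


def feature_ratio(freq_a: int, freq_b: int) -> float:
    """Ratio of two features counted in the SAME corpus.

    Corpus size cancels out, so ratios from different corpora
    can be compared directly without normalizing.
    """
    if freq_b == 0:
        raise ValueError("freq_b must be non-zero")
    return freq_a / freq_b


# Option 1: normalize. A word with raw frequency 80 in the BNC.
bnc_rate = per_million(80, BNC_SIZE)

# Raw COCA frequency that would correspond to the same per-million rate.
same_rate_in_coca = bnc_rate * COCA_SIZE / 1_000_000

# Option 2: compare two features within each corpus.
# PLACEHOLDER counts, invented purely to show the call.
ratio_corpus1 = feature_ratio(300, 100)  # "at hospital" vs "at the hospital"
ratio_corpus2 = feature_ratio(50, 1000)

print(f"BNC rate per million: {bnc_rate:.2f}")
print(f"Equivalent raw COCA frequency: {same_rate_in_coca:.0f}")
print(f"Ratios: {ratio_corpus1:.2f} vs {ratio_corpus2:.2f}")
```

Because COCA is 4.25 times the size of the BNC at these figures, a raw COCA frequency has to be about 4.25 times larger than a BNC frequency to represent the same rate, which is why raw counts from the two corpora cannot be compared directly.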