1. Size of the web

Simple ratios:

3        9
4        ??

With linguistic data:

(A) Freq in BNC        (C) Freq in Google
(B) Size of BNC        (D) Size of Google

Formula: D = (B x C) / A

Words with frequency of 80 in the BNC:

annealing, appraise, archivists, asthmatic, attractor, backswing, bedsit, blameless, boatman, bogwood, botanists, buggered, buggery, burgundy, calmness, caramel, castigated, clawing, clench, clings, coachman, colectomy, collectivist, compensations, congressman, conjunctions, conquerors, contaminants, contemplates, contrive, controversially, countervailing, cruisers, deathbed, decays, diplomatically, domineering, dominoes, eccentricities, empties, eugenics, exigencies, ferociously, forested, frontline, garnished, glade, gnawed, gradation, grieved, gulping, handheld, headroom, heathen, heaviness, hideously, holocaust, icily, impressionistic, inimitable, innuendo, irradiated, joyfully, kayak, keg, landowning, largesse, latched, lexis, luminosity, lumped, luv, lysosomal, malfunction, meaty, medreses, memorably, mesenchyme, mettle, misinterpretation, national, neoplasia, netball, neutrophil, nibble, obliges, overseer, participates, pate, pelmet, perfused, personification, po, polluter, predilection, promontory, purist, quays, rabies, racehorse, raindrops, reciprocated, redoubtable, refunds, resurrect, reversals, ri, sacrosanct, sepsis, sheathed, shopfloor, slackened, slats, slimmer, smithy, snubbed, sporty, staid, steers, strangling, suns, swallows, sympathize, tans, tarpaulin, unbelief, unfailing, ungainly, unselfish, unsubstantiated, untied, variceal, warlike, weaned, whoop, wicketkeeper, woodworm

Example: garnished

(A) 80                 (C) 1,750,000
(B) 100,000,000        (D) 2,187,500,000,000 (~2.2 trillion)

Formula: (B x C) / A = (100,000,000 x 1,750,000) / 80 = 2,187,500,000,000
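The proportion above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool used in the original exercise; the function name `estimate_web_size` and its parameter names are made up for this example.

```python
def estimate_web_size(freq_in_corpus: int, corpus_size: int, freq_on_web: int) -> float:
    """Estimate D = (B x C) / A.

    A = freq_in_corpus (e.g. frequency in the BNC)
    B = corpus_size    (e.g. size of the BNC in words)
    C = freq_on_web    (e.g. hit count reported by Google)
    """
    if freq_in_corpus <= 0:
        raise ValueError("the word must occur at least once in the reference corpus")
    return corpus_size * freq_on_web / freq_in_corpus


# The "garnished" example: A = 80, B = 100,000,000, C = 1,750,000
estimate = estimate_web_size(freq_in_corpus=80,
                             corpus_size=100_000_000,
                             freq_on_web=1_750_000)
print(f"{estimate:,.0f}")  # 2,187,500,000,000
```

The estimate assumes that the word is used at the same rate on the web as in the BNC, which is exactly what the questions in the following sections call into doubt.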

2. Which register?

·         Try some very colloquial phrases, such as "so not" or "he's all worried", in the BNC or COCA.

·         Main issue: what "genre" is a given web text? How can genres be distinguished?

3. What dialect (geographical)?

·         Country domains work for UK, AU, NZ, but what about US? .EDU? .US? .COM?

4. What types of queries?

Possibly for frequency of words and phrases, but what about:

·         Part of speech

·         Lemma

·         Collocates


Other "non-corpus corpora"

A. General Conference

B. Newspaper corpus (New York Times)

Look for "new words" in the Oxford English Dictionary

TIME PERIOD      Millions of words
1850-1899        39
1900-1949        127
1950-1999        158
TOTAL            323

Google books

Google newspapers

Google historical newspapers

SpeechWars
 


Meaningful comparisons

·         Can't just give raw frequency for one feature in Corpus1 vs raw frequency in Corpus2

·         Have to do one of the following:

o    "Normalize" frequency (e.g. per million words) in the two corpora, OR

o    Compare two features in each corpus (e.g. at hospital vs. at the hospital)

·         To calculate per million: FREQ / SIZE, with SIZE in millions of words (e.g. 767 / 425 = 1.80)

             BNC                     COCA
             #       Per million     #       Per million
snuck        11      0.11            767     1.80
sneaked      132     1.32            830     1.95

The BNC has 100 million words; COCA had 425 million words.
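Both options for a meaningful comparison can be sketched in Python using the counts from the table. This is a minimal illustration; the names `per_million` and `share`, and the dictionary layout, are assumptions made for this example rather than part of any corpus interface.

```python
# Corpus sizes in words and raw counts, taken from the table above.
CORPUS_SIZES = {"BNC": 100_000_000, "COCA": 425_000_000}
RAW_COUNTS = {
    "snuck":   {"BNC": 11,  "COCA": 767},
    "sneaked": {"BNC": 132, "COCA": 830},
}


def per_million(freq: int, corpus_size: int) -> float:
    """Option 1: normalize a raw frequency to occurrences per million words."""
    return freq / corpus_size * 1_000_000


def share(freq_a: int, freq_b: int) -> float:
    """Option 2: proportion of feature A out of A + B within a single corpus."""
    return freq_a / (freq_a + freq_b)


for word, counts in RAW_COUNTS.items():
    for corpus, freq in counts.items():
        rate = per_million(freq, CORPUS_SIZES[corpus])
        print(f"{word:8} {corpus:5} {freq:5d} {rate:6.2f} per million")

for corpus in CORPUS_SIZES:
    snuck_share = share(RAW_COUNTS["snuck"][corpus], RAW_COUNTS["sneaked"][corpus])
    print(f"{corpus}: snuck is {snuck_share:.1%} of snuck + sneaked")
```

The second option does not need the corpus sizes at all, which makes it useful when the size of a collection (such as the web) is unknown.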