Corpora
The following corpora are available for BYU research and educational use. In addition to these corpora, BYU is an institutional member of the Linguistic Data Consortium and has had subscriptions for the years 1998 and 2002-2007.
For access or further information, contact Deryle Lonsdale at (801) 422-4067, or via email at lonz@byu.edu.
Corpora available for BYU research and educational use:
- LDC93S1 TIMIT Acoustic-Phonetic Continuous Speech Corpus
- LDC93T1 ACL/DCI (text: North American English)
- LDC94S17 OGI Multilanguage Corpus
- LDC94T5 ECI Multilingual Text
- LDC96L14 CELEX2 (lexical: English, Dutch, German)
- LDC96S46 CALLFRIEND American English-Non-Southern Dialect
- LDC96S47 CALLFRIEND American English-Southern Dialect
- LDC96S48 CALLFRIEND Canadian French
- LDC96S49 CALLFRIEND Egyptian Arabic
- LDC96S50 CALLFRIEND Farsi
- LDC96S51 CALLFRIEND German
- LDC96S52 CALLFRIEND Hindi
- LDC96S53 CALLFRIEND Japanese
- LDC96S54 CALLFRIEND Korean
- LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
- LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
- LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
- LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
- LDC96S59 CALLFRIEND Tamil
- LDC96S60 CALLFRIEND Vietnamese
- LDC97T22 1996 English Broadcast News Transcripts (Hub-4)
- LDC98T29 1997 Spanish Broadcast News Transcripts (Hub-4NE)
- LDC98L21 COMLEX English Syntax Lexicon
- LDC96T11 COMLEX Syntax Text Corpus Version 2.0
- LDC98T26 Hub-5 Mandarin Transcripts
- LDC98T27 Hub-5 Spanish Transcripts
- LDC98T30 North American News Text Supplement
- LDC98T32 JURIS
- LDC98S72 Taiwanese Putonghua Speech and Transcripts
- LDC99T42 Treebank-3
- LDC2001T11 Chinese Treebank Version 2.0
- LDC2002L27 Chinese-English Translation Lexicon Version 3.0
- LDC2002S22 1997 HUB5 Arabic Evaluation
- LDC2002T39 1997 HUB5 Arabic Transcripts
- LDC2002S56 2000 Communicator Evaluation
- LDC2002T01 Multiple-Translation Chinese Corpus
- LDC2002S37 Callhome Egyptian Arabic Speech Supplement
- LDC2002T38 Callhome Egyptian Arabic Transcripts Supplement
- LDC2002S02 West Point Arabic Speech Corpus
- LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
- LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
- LDC2003T12 Arabic Gigaword
- LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
- LDC2003T06 Arabic Treebank: Part 1 v 2.0
- LDC2003S06 Santa Barbara Corpus of Spoken American English Part-II
- LDC2003T09 Chinese Gigaword
- LDC2004L02 Buckwalter Arabic Analyzer
- LDC2004T02 Arabic Treebank: Part 2 v 2.0
- LDC2004T11 Arabic Treebank: Part 3 v 1.0
- LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
- LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
- LDC2005T30 Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
- LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
- LDC2005T10 Chinese English News Magazine Parallel Text
- LDC2005T14 Chinese Gigaword Second Edition
- LDC2005T06 Chinese News Translation Text Part 1
- LDC2005S16 RT-04 MDE Training Data Speech
- LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
- LDC2006S35 CSLU: Multilanguage Telephone Speech Version 1.2
- LDC2006S26 CSLU: Speaker Recognition Version 1.1
- LDC2006T04 Multiple-Translation Chinese (MTC) Part 4
Corpora available at BYU (from other resources):
- CETEMPublico (text: European Portuguese)
- TELRI East Meets West (text & speech: 20 European languages)
- the EMILLE corpus (text, bitext: South Asian languages)
- French WordNet
- CUWORD Cantonese (speech)
- BDLEX
- MULTEXT JOC
- ARCADE ROMANSEVAL
- FarsDat (Farsi speech corpus)
In addition, BYU is an institutional member of the Linguistic Data Consortium, and has had subscriptions for the years 1998 and 2002-2007.
