This is a synopsis of sample projects carried out by past students in Lonsdale's BYU NLP class:

Retooling AML for language identification: This project implemented the core engine in the AML language modeling system for the purposes of language identification. It extended the AML engine with the ability to handle free-form data, multiple data files, dynamic outcome files, new configuration options, and the choice of several statistical algorithms to analyse the results. The implementation was developed in Perl, trained on current news reports about the Kosovo crisis, and was successful in discriminating between short samples of various languages' texts, even involving typologically similar languages (e.g. English vs. German or French vs. Spanish).

Automatic word sense disambiguation using a WordNet-based algorithm: This project involved determining the word sense of polysemous and homonymous nouns by examining their occurrence in verbal contexts. Resources employed include the Brown Corpus, Princeton's WordNet, and a modified semantic similarity measuring algorithm used in other NLP applications. The implementation was developed in C++ and included a component of Folio Infobase code that was isolated into a single C++ class.

Topic division by content clustering: This project performed segmentation of text into clusters of related content-specific subsections. The project was implemented in C++ and used various clustering algorithms for content analysis, as well as Quattro Pro to analyze the results. Resources used included several hundred Wall Street Journal articles concatenated together with no distinguishable delineation, and the Qtag part-of-speech tagging engine.

New approaches to dialogue simulation: This project produced a dialogue simulation engine whose core leverages a genetic algorithm (from MIT's GAlib) to determine the most appropriate response to an input. This was a large-scale implementation that takes text input, parses it using CMU's Link Grammar parser, consults Princeton's WordNet for lexical-semantic information, determines the relevant speech act, performs a GA-based search for the best response, and sends the output to a text-to-speech system using Microsoft's SAPI SDK and speech engine. The code was written in C++ and seamlessly integrates several APIs for the resources mentioned above.

Automatic dataset generation for analogical modeling: This project produced a tool that allows users to more easily specify and encode datasets for Skousen's AML engine, especially when they can be generated by the output of PC-Kimmo, a two-level finite-state machine development environment. The system, which is written in C,  permits user specification of requisite patterns, mapping of relevant data items to the AML data representation, and frequency-based handling of ambiguity, all using regular expressions and other widely-used techniques.

Sentence boundary detection by analogy: This project involved developing a sentence-boundary detector for English and comparing various techniques, including memory-based processing (TIMBL), analogical modeling, and rule-based approaches. The primary result was implemented in Perl and involves various routines that take into account linguistic, metalinguistic, and formatting information. A substantial corpus of Wall Street Journal documents was used in developing, training, and testing the system.

An LFG parser for German: This project involved developing a phrase-structure grammar for German using Stanford's Lexical-Functional Grammar Workbench. This included developing morphological formation rules, lexicon entries, and LFG-based descriptions of syntactic constituency for the German language, as well as encoding them in the metalinguistic formalism used by the workbench.

A family relationships parser: This project resulted in an engine that reasons about family relationships based on partial or incomplete information, performing forward inferencing from a set of facts and predicates extracted from natural text. It is implemented in Prolog, is based on the Gazdar/Mellish parser, and draws conclusions in the domain from basic natural-language factual statements.
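
Forward inferencing of the kind described above can be illustrated with a minimal sketch. It is written in Python rather than the project's Prolog, and the fact format, the two rules, and the names `forward_chain`, `parent`, `sibling`, and `grandparent` are illustrative assumptions, not the project's actual predicates.

```python
def forward_chain(facts):
    """Apply the rules repeatedly until no new facts appear (a fixed point)."""
    known = set(facts)
    while True:
        new = set()
        parents = [(a, b) for (rel, a, b) in known if rel == "parent"]
        for (p1, c1) in parents:
            for (p2, c2) in parents:
                # Assumed rule 1: two different children of one parent are siblings.
                if p1 == p2 and c1 != c2:
                    new.add(("sibling", c1, c2))
                # Assumed rule 2: a parent of a parent is a grandparent.
                if c1 == p2:
                    new.add(("grandparent", p1, c2))
        if new <= known:
            return known
        known |= new


# Hypothetical facts, as they might be extracted from simple statements.
facts = {
    ("parent", "Ann", "Bob"),
    ("parent", "Ann", "Cal"),
    ("parent", "Bob", "Dee"),
}
derived = forward_chain(facts)
```

With these facts, the derived set gains the sibling relation between Bob and Cal and the grandparent relation between Ann and Dee, neither of which was stated directly.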

Hypernym semantic hierarchies for paragraph-scope text classification: This project resulted in an approach that allows individual paragraphs in a document to be classified for semantic content based on lexical semantic relationships such as hypernymy and other hierarchical relationships between concepts. Processing involved using Qtag, a Java-based part-of-speech tagger, Princeton's WordNet, geometrical distance measure algorithms implemented in C++, and postprocessing/GUI output using Folio Views technology.

Adapting instance-based methods for classifying natural language text: This project combined and refined traditional text classification methods such as instance-based approaches, prototype representation, and distance metric calculations within a system designed to classify documents. Over 3000 documents from the Wall Street Journal were used as training data by the system, which was implemented in C++.

Modifications to the Cangjie input system for Chinese characters: This project involved developing a new method for Chinese character keyboard input which performs context-dependent, trigram-based disambiguation of characters, thus obviating the need for frequent interaction and therefore enabling touch-typing or "typing blind". This approach is implemented in Perl/Tk, employs lookup tables and regular expressions for pattern matching, and produces (and leverages) a strokewise decomposition of characters for both recognition and generation.

Text annotation manager: This project implements a tagging capability for Microsoft Word documents using Visual Basic for Applications (VBA) automation.  This provides the functionality to include markup in a document, while at the same time hiding the markup from the user. Markup can include bookmarks, comments, special key sequences, bitext alignment equivalency range delimitation, and contextually-driven machine translation output postediting.

Crosslingual text classification: Japanese web-page classification with an English classifier: This project involved plugging a machine translation (MT) system into a text classification framework so that comparable results can be attained for Japanese without any Japanese training data. The feasibility of using an MT system in a text classification problem rests on some important assumptions about how text classification algorithms work and how current state-of-the-art MT systems perform. The approach involves three different classifiers: a decision tree, Naïve Bayes, and a boosted decision tree.

Web-based KWIC listings for Arabic: This project resulted in a key-word-in-context listing facility that runs on the Web for displaying Arabic words. It formats data for an 800x600 dpi resolution and uses the Parkinson method for romanization of Arabic in the input boxes for search keywords and parameters. It was written in Perl and runs on an Apache server. Unicode was chosen for displaying the Arabic text, which entailed translation from the hexadecimal codes used by Parkinson.
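
A key-word-in-context listing can be sketched in a few lines. The following is a minimal illustration in Python (the project itself used Perl), assuming whitespace tokenization and a fixed context window. The function names `kwic` and `format_kwic` are hypothetical, and romanization and Arabic display are not modeled.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each keyword match."""
    tokens = text.split()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def format_kwic(rows):
    """Right-align the left contexts so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in rows]
```

Calling `format_kwic(kwic(text, "the", width=2))` yields one line per occurrence, with the keyword in the same column on every line.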

The linguists’ APR: an automatic pattern recognizer for morphology: This project was developed to ease the acquisition of morphological knowledge, by providing a simple interface for any speaker of the language to train it.  It uses directed graphs to match morphological patterns, and then compacts them into a table which is in a standard form. More user-friendly acquisition of data for other morphological engines such as PC-Kimmo is a result of this project.

Foreign language recognition system: This project involved building a system that identifies a language from a spoken sample, based solely on the acoustic qualities of the wave, rather than phonetic qualities. The process is extremely processor-intensive, and hence it involves a networked system and distributed processing over several computers. Inspired by text categorization approaches, the same method of searching for and weighting features in a text was extended to features in sound waves. A vector file of features is built from sound waves, one from each language to be recognized, and the vector file undergoes several statistical and neural network processing stages. Various acoustic signal and statistical toolkits were used, as well as Java's AudioInputStreams, SoundSegment, and SoundWave classes.

Recipe Central NLP: This project took recipes stored in the popular MasterCook program and extracted them into the Cogito knowledge database, to enable various computer-assisted recipe and cooking tasks. The Cogito knowledge center allows for information organization in engineering and other applications to a degree that previously required a heterogeneous approach. Relevant libraries include: an instance library, a language library, a classification library, and an attribute library.

Language reconstruction via genetic programming: This project used a genetic algorithm/machine learning approach to historical reconstruction, modeling how several daughter languages could have "descended" over time from a common ancestor language. Based on MIT's GAlib for genetic-algorithm-based programming, the system takes a population of words related to a source word. For each generation, the system uses a non-fixed single-point crossover system, constructing novel combinations from the two fittest members of the surviving population. Then a mutation operator randomly selects some percentage of the population and introduces random change. The above process repeats until the stopping criterion is reached.
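
The generational loop described above can be sketched as follows. This is a plain-Python illustration rather than GAlib (a C++ library), and several parts are assumptions not stated in the description: the fitness function (position-wise match against a hypothetical target form), the alphabet, the mutation rate, keeping the two fittest members unchanged, and all function names.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def fitness(word, target):
    """Hypothetical fitness: number of positions where word matches target."""
    return sum(a == b for a, b in zip(word, target))


def crossover(a, b, rng):
    """Single-point crossover at a randomly chosen (non-fixed) cut point."""
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]


def mutate(word, rng):
    """Replace one randomly chosen character with a random letter."""
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def evolve(population, target, generations=200, mutation_rate=0.2, seed=0):
    """Assumes at least three words, each at least two characters long."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, target), reverse=True)
        if pop[0] == target:            # stopping criterion
            break
        # Recombine the two fittest members; the child replaces the least fit.
        pop[-1] = crossover(pop[0], pop[1], rng)
        # Mutate a random share of the rest; the two fittest stay unchanged.
        for i in range(2, len(pop)):
            if rng.random() < mutation_rate:
                pop[i] = mutate(pop[i], rng)
    return max(pop, key=lambda w: fitness(w, target))
```

Because the two fittest words are never mutated in this sketch, the best fitness in the population cannot decrease from one generation to the next. Whether the original system preserved its best members in this way is not stated.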

Extracting predicate-argument structures from news headlines: This project presented a way of doing information extraction of semantic content from newspaper headlines. The system has three components: (i) the Link-Grammar parser, a dependency-based syntactic parser implemented in the C language, (ii) the Lingua::LinkParser module written in Perl, and (iii) a library of regular expressions. Headlines are extracted from the HTML source code of popular newspaper webpages and sent through the Link-Grammar parser. Then the Lingua::LinkParser module extracts the linkage information and outputs it to a Perl regular expression module to determine what kind of information to output. Information is generated in a predicate-argument structure to either the screen or a specified file.

Spanish dependency parser: This project produced a version of the Link Grammar Parser for the Spanish language. Various knowledge sources were developed for Spanish: a grammar file containing rules for linking words together, several lexicon files containing lexical information, and syntactic and morphological constituency files. The system is capable of processing a wide array of syntactic constructions in Spanish.

Word sense disambiguation through analogical modeling: This project was the first attempt to perform word sense disambiguation using Skousen's analogical language modeling technique. Two types of featural information were used in the exemplar instance vectors: collocational features (including part-of-speech and lexical information) and cooccurrence features (including content word lexical semantic information). Using publicly available sources including Senseval 2, WordNet and Semcor, a vector building engine written in Perl created input exemplars for the system. Several experiments with varying feature vector composition were performed and compared with psycholinguistic experiments describing results from human subjects.

Classification of Russian documents: This project built a text categorization engine for handling Russian documents. The first step involved localizing the campus computing environment for working with Cyrillic text and programs. Then a part-of-speech tagger was built for Russian, which performed word lookup of some 32,000 tagged words using a directed acyclic word graph. Then a finite-state machine was implemented to perform a shallow parse on the text, extracting all of the noun phrases. The terminology and vocabulary could thus be compared between documents for assessment of topical content. The system was used to classify and discriminate between such Russian documents as novels (e.g. "Crime and Punishment"), conference talks (translations of LDS General Conference), and web documents discussing historical topics (e.g. Vladivostok in past wars).

Forced alignment of Conference proceedings: This project combined several tools and resources to assure forced alignment of LDS General Conference talks. The resulting index of marked-up information maps English source audio files of conference talks with their transcripts, both available for download from a web site. After appropriate audio file conversions, the input is sent through the Sphinx-3.3 decoder to generate an output transcription. A text transcription is generated based on language models and a pronunciation dictionary provided with the system. An alternative method, forced alignment, used the Sphinx-II batch method and computed phoneme-level mappings that were time-aligned based on utterance-specific language models and pronunciation dictionaries custom-built from the corresponding transcripts using the Sphinx Knowledge Base Tool. The resulting index and markup will be useful in training of (simultaneous) interpreters.

Parsing Latin with PC-PATR: This project built and implemented a morphosyntactic parser for Latin. The PC-Parse package was used for the linguistic processing engine, including PC-Kimmo (a finite-state morphological processor) and PC-PATR (a context-free unification-based parser). A 16,000-term public domain dictionary was integrated into the morphological processor's lexicon substructure, and several two-level and word-grammar rules were used along with some three dozen features. The syntactic parser takes results from the morphological processor and, based on context-free rules, parses out whole utterances, allowing for Latin's very loose word order. The resulting engine was deployed on the web using the Apache Tomcat server and some custom-written Java graphical interface code. A recursive tree-drawing algorithm graphs the parsed results as a PNG image.

Portuguese Link Grammar parser: This project implemented a Link Grammar parser for Portuguese. Portuguese-specific knowledge sources were developed: word-linkage constraints in a grammar file, and lexicon files containing word-specific information. Using a wider inventory of links than the English system does, the parser supports several types of syntactic structures in Portuguese.

A two-level morphology engine for Mongolian: This project involved development of a morphology processor for Khalkha Mongolian, an agglutinating language in the Altaic family. A lossless Romanization scheme was developed for system input/output, and a lexicon was developed from a large-scale bilingual dictionary. The two-level rule component assures treatment of such inflectional phenomena as vowel epenthesis, use of palatalization symbols, vowel harmony, and various other morphophonological processes. A test set of over 200 manually-entered cases was developed to test the coverage of the system.

A morphosyntactic parser for Persian (Farsi): This project developed a robust parser for Farsi, an Indo-Iranian language. A Romanization scheme was developed for system input/output and lexical representation. A lexicon was built from a variety of web-based and hand-developed sources. The morphological parsing is done via a PC-Kimmo implementation (previously developed by the same student). Results from the morphological parse are then sent to a custom version of the Link Grammar parser whose English knowledge sources have been entirely replaced by Farsi linguistic information (link types, link directionality, lexical categories). The two systems (the morphological parser and the syntactic link grammar parser) were then integrated (via custom code written in C and Tcl) into the goal-directed, intelligent-agent-based, Soar cognitive modeling and machine learning system. This project formed the basis for work presented by the student at an international conference on Iranian linguistics.

Word sense disambiguation using naïve Bayesian classification, decision trees, and analogical modeling: This project investigated the word-sense disambiguation problem in the context of several different machine learning approaches. Taking several hundred (and sometimes thousand) WordNet-sense-annotated instances of various senses of the word "hard" from running text, vectors of differing types were created with a custom-developed toolkit. These vectors were used as development and test data for a variety of machine learning systems. Comparative results were obtained, tabulated and discussed.

Using the Levenshtein Edit Distance to perform a textual analysis of Ch’olti’: This project developed a technique for mapping non-standard written forms to standardized forms, given that both inputs include the exact same words with variations in spelling and possibly with differing word breaks and punctuation. One algorithm handled extraction of a sentence from the written form. Also used was a modified version of the Levenshtein Edit Distance that assigns different scores for matches based on the likelihood of occurrence. The results were very encouraging, even when entire sentences were absent from one source or the other, suggesting that the modified algorithm is particularly apt for the task.
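
A modified edit distance of this general kind can be sketched by making the substitution cost depend on the character pair. The Python below is only an illustration: the project's actual scoring scheme is not given, so the `variant_cost` function and the `LIKELY_VARIANTS` table of cheap substitutions are hypothetical.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance where substitution cost may depend on the character pair."""
    if sub_cost is None:
        # Default: the standard Levenshtein unit costs.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical table: spelling variants assumed to be likely get a low cost.
LIKELY_VARIANTS = {frozenset("cq"): 0.2, frozenset("uv"): 0.2}


def variant_cost(x, y):
    """Zero for identical characters, a reduced cost for listed variants."""
    if x == y:
        return 0.0
    return LIKELY_VARIANTS.get(frozenset((x, y)), 1.0)
```

With the default costs this reduces to the standard Levenshtein distance. Passing `variant_cost` makes forms that differ only in the listed variant spellings score as much closer than forms that differ in arbitrary characters.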

A Chinese part-of-speech extractor: This project developed a program that searches Mandarin Chinese pinyin (Romanization of Chinese characters) text for words of a user-defined part of speech (POS) and returns the words that meet the criteria. This was achieved by implementing a Chinese version of the Link Grammar and integrating it with the Lingua::LinkParser Perl module. Certain types of words are retrieved by the Chinese POS Extractor program by applying a regular expression to the "link labels", which are returned by the Link Grammar parser when a sentence is successfully parsed. The Chinese POS Extractor consists of a dictionary file, several lexicon files (both composed in the Link Grammar format), and a Perl script that reads sentences in pinyin, searches the input, and returns the words that match the regular expression.

Automatic Music Generation through the use of N-Grams: The automatic generation of aesthetically pleasing melodies and harmonies is still an open problem, one which has a number of correlations to topics in the field of natural language processing. This project explores the generation of melodic and harmonic ideas by using a selection technique based on probability of n-gram occurrence in a corpus of MIDI files. Sample MIDI files are provided for evaluation and comparison. This project became the groundwork for a successful National Science Foundation Graduate Fellowship application.
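
Selection by n-gram probability can be illustrated with a bigram model over note sequences. The Python sketch below makes several assumptions: melodies are plain lists of MIDI pitch numbers rather than parsed MIDI files, only bigrams are used, harmony is ignored, and the function names `train_bigrams` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(melodies):
    """Count how often each note follows each other note in the corpus."""
    counts = defaultdict(Counter)
    for melody in melodies:
        for prev, nxt in zip(melody, melody[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample a melody, picking each next note in proportion to its bigram count."""
    rng = random.Random(seed)
    melody = [start]
    while len(melody) < length:
        followers = counts.get(melody[-1])
        if not followers:       # dead end: no observed continuation
            break
        notes = list(followers)
        weights = [followers[n] for n in notes]
        melody.append(rng.choices(notes, weights=weights)[0])
    return melody
```

Every transition in a generated melody is one that occurs in the training corpus, and more frequent transitions are chosen more often. Longer n-grams would capture more context at the cost of needing more data.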

A Portuguese morphology engine: Many tools have been developed for morphological analysis and generation. A few of these, such as the PC-KIMMO engine, have enjoyed widespread use. Recently, Xerox has released new tools based on finite state networks, which have ready application to morphological processing. This project leverages the features in two of these – xfst and lexc. These tools were used in this project to perform generation of Portuguese noun inflection and analysis/generation of Portuguese verb inflection.

Spoken language identification: This project resulted in the development of acoustic feature-based specifications that are useful in recognizing which language is being spoken over the telephone within 3, 10, or 30 seconds. It involved analyzing telephone speech corpora, locating interesting phonological and suprasegmental properties that give a clue to which language is being spoken (e.g. tonal contours, filled pauses, diphthongs, unique vowels and consonants), and extracting the relevant and appropriate acoustic data. A featural encoding of this data was compiled into a form that could be used by a maximum entropy classifier that combined a language-independent phoneme recognition implementation of the Sphinx recognizer with n-gram language models built with the CMU-Cambridge language modeling toolkit. The OGI telephone speech corpus and NIST speech language data evaluation corpus were used in this task, which addressed several languages including English, Spanish, Chinese, Korean, and Tamil.

Automatic Message Identification in Vervet Monkeys Using Machine Learning: The purpose of this project was to create a system that automatically identifies the 7 distinct messages identified in the Talkbank Ethology Corpus: Field Recordings of vervet monkey calls. This was accomplished using Perl, Bash, and Praat scripts to output feature vectors for use in the TIMBL machine learning system. Over 30 sets of feature vectors were created and tested. The results were written up and presented at the special LACUS conference on animal communication in Kentucky. The project was part of the student's undergraduate ORCA project.

Sphinx with CRS: This project integrated CMU’s Sphinx 4 speech recognition system with the Cisco Customer Response System (CRS). Cisco CRS version 4.0(4)sr1 is used on the BYU campus. Due to expense, the speech recognition software was not included in the purchase. Sphinx 4 is a free, open-source speech recognition system written in Java, and CRS allows custom Java classes to be integrated. This project investigated and demonstrated the usefulness of integrating Sphinx with CRS.

Allomorphic Prediction by Analogy of the Morphemes +ance and +ence: This project developed a predictive modeling system to show, by analogy, which nominal allomorph suffix (ance or ence) is used in English. A custom corpus of over 1400 words was created, covering every instance of these morphemes in the OED. The technique used was primarily orthographic analogy, modeled in Royal Skousen’s Analogical Modeling program and also in TIMBL. As a check, a smaller corpus from the BNC was used to test the results obtained from the OED corpus.

Implementation of Named Entity Recognition and Coreference Resolution: This project involved implementations of named entity recognition and coreference resolution. The named entity work began with the creation of a corpus from the Harold B. Lee Library’s Mormon Missionary Diary collection. The diaries, encoded in the TEI-lite mark-up standard, were processed into a corpus of files tagged for personal names, place names, organization names, dates, and parts of speech. A maximum entropy classifier was used to identify named entities in the test files based on features extracted from the training set. The coreference resolution work used the MUC-7 corpus developed for this task and implemented the Luo Bell Tree (Luo et al., 2004) algorithm to identify tokens which corefer.

Sphinx at 10 Feet: Speech Recognition in the Living Room: This project involved the integration of the Sphinx speech recognition engine into a PC-based DVD annotation application, written in Visual Basic 6 (due to the need to integrate a legacy COM-based DVD controller). The addition of speech control allowed the application to be repurposed for home use, overcoming the difficulties inherent in the input-intensive segmentation process.

Combinatory Categorial Grammar over the Penn Chinese Treebank: This project implemented a combinatory categorial grammar (CCG) parser for Chinese, using the OpenCCG framework. Rules and lexical items were developed using the Penn Chinese Treebank for reference lexical and syntactic data. Focus was placed on modeling a variety of different Chinese syntactic phenomena.

Forced Alignment for Elicited Imitation: Elicited imitation (EI) is an effective speech-based testing protocol for evaluating second language oral proficiency. This project involved grading EI recordings automatically using the Sphinx speech recognizer. Forced alignment, a technique to align speech content (words, syllables, phonemes) with transcripts of the expected content, was shown to be effective in assessing proficiency, even when the speakers have noticeably foreign-accented English.

Generating Mnemonic Stories for Chinese Characters: Learning written Chinese characters is a daunting task, and using mnemonic stories for them has proven helpful. This project developed automatic generation of mnemonic stories for Chinese characters that exhibit: high phonetic similarity to a Mandarin target; phonetic and semantic distinctiveness from other sound names; phonetic and semantic distinctiveness from certain other words; and memorability. It involved use of Perl, CEDICT for Chinese tone syllables, CELEX for English monosyllabic lexical items, PC-Kimmo for mapping pinyin to English, Levenshtein edit distance and WordNet for calculating phonetic and semantic distance, and Heisig's Japanese character primitives.