Running head: Issues in modeling
Article type: Full length
Brigham Young University
Department of Linguistics and English Language
2129 JKHB
Provo, Utah
(801) 422-7452
Exemplar-based models of language assert that linguistic processing involves analogy to past linguistic experiences stored in the mental lexicon. This study explores how three factors influence the predictions made by exemplar-based simulations of linguistic processing. Three questions are posed as they relate to such simulations: 1) Is type frequency or token frequency a better predictor of outcomes? 2) What is the optimal way of aligning the variables in the database so that the most relevant analogs are found? 3) Are there significant differences between representing variables as phonemes versus representing them in terms of distinctive features?
Spanish stress assignment and English past tense formation served as the linguistic phenomena on which these issues were tested. The results suggest that type frequency is a better predictor of outcomes, although simulations using token frequency were most successful when only middle frequency words were included. Several methods for aligning variables in the analogical database are discussed. The dual-alignment method has advantages for the English past tense task, but not in predicting Spanish stress. In the Spanish task, strict phonemic representation of words demonstrated no advantage over featural representation. However, phonemic representation produced better results than distinctive features in predicting the English past tense.
Key words: Analogical Modeling of Language, distinctive features versus phonemic representation, English past tense, Spanish stress assignment, Tilburg Memory-based Learner, token frequency, type frequency, variable alignment.
1. Introduction. In recent decades, as computers have become more accessible, and as their processing speed and memory capacity have increased dramatically, many language researchers have turned their attention to computational methods in order to model linguistic processes. The most widely utilized model is arguably connectionism (e.g. McClelland, 1988; Rumelhart and McClelland, 1986). However, connectionism is not the sole computational model linguists have at their disposal. Several other models have been developed that may be classified under the rubric of exemplar-based or analogical models, for example, Nosofsky's Generalized Context Model (Nosofsky, 1990), Pierrehumbert's exemplar model (Pierrehumbert, 2001), the Tilburg Memory-based Learner (Daelemans et al., 2001), and Analogical Modeling of Language (Skousen, 1989, 1992). Exemplar models have been applied to investigate a wide variety of linguistic phenomena such as word recognition (Goldinger, 1996), Arabic and German plural formation (Nakisa, Plunkett and Hahn, 2001), linking elements in Dutch noun compounds (Krott et al., 2002), phonological alternations in Turkish stems (Rytting, 2000), Dutch stress assignment (Gillis et al., 1993), Italian verb conjugations (Eddington, 2002), and phonotactic knowledge in Arabic and English (Frisch et al., 2001).
The literature on connectionism is replete with discussions of how different network configurations, training sets, and input variables affect the outcome produced (e.g. MacWhinney et al., 1989). However, little attention has been paid to such important issues as they relate to analogical models. The purpose of the present paper is to fill this gap, and to explore how various factors influence exemplar-based models. In particular, I will attempt to answer three questions as they relate to analogical simulations: 1) Is type frequency or token frequency a better predictor of outcomes? 2) What is the optimal way of aligning the variables in the dataset so that the most relevant analogs are found? 3) Are there significant differences between representing variables as phonemes versus representing them in terms of distinctive features? All of these issues relate to the question of how to develop a database that most closely represents how words are stored in the mental lexicon of language speakers, and how these words affect linguistic processing. Previous research demonstrates that the actual algorithm used to run a simulation does not affect the outcome a great deal (Daelemans et al., 1994; Daelemans, 2002; Eddington, 2002a; Krott et al., 2002). However, my own experience has shown that altering the contents of the dataset on which the simulation is run has profound consequences. Therefore, it is important to determine how to construct optimal datasets for exemplar modeling. The three issues mentioned above will be discussed as they relate to two phenomena, English past tense formation and Spanish stress assignment.
2. Exemplar-based Models. In the traditional rule-based approach, linguistic processing is thought to involve gleaning generalizations from the input data, and codifying these into rules which are then used in subsequent processing. Connectionism, on the other hand, may be classified as a sort of prototype model. The input given to the network results in certain patterns of representation of varying strengths being formed among the interconnected nodes of the network. Once the network is trained, processing relies on the patterns encoded in the network to produce the outcome. In contrast, exemplar- or memory-based models are founded on the idea that no sort of rule or prototypical representation needs to be generalized from the data and stored as a unit or entity separate from the data. Instead, generalizations exist within the stored lexical items themselves. Accordingly, linguistic processing is a matter of lexical access, and analogy to existing patterns found among the lexical items.
Of course, exemplar models require vast amounts of storage space if individual tokens of speech are retained in the mental lexicon. Nevertheless, there is evidence for such massive storage (Alegre and Gordon, 1999; Baayen, Dijkstra and Schreuder, 1997; Bybee, 1995, 1998; Goldinger, 1997; Manelis and Tharp, 1977; Palmeri, Goldinger and Pisoni, 1993; Sereno and Jongman, 1997). It also appears that storage is not limited to the unpredictable features of speech, but that it includes redundant, detailed phonetic information about individual word tokens (Brown and McNeill, 1966; Bybee, 1994; Pisoni, 1997). Storage may even go beyond individual words and encompass recurrent word combinations as well as entire phrases (Bod, 1998; Bybee, 1998; Pawley and Syder, 1983).
Two exemplar-based algorithms were employed to help answer the questions posed in the introduction: Analogical Modeling of Language (AM) and the modified value difference metric algorithm incorporated in the Tilburg Memory-based Learner (TiMBL). It must be noted at this point that the accuracy and performance differences between these two algorithms are not the focus of the present paper. Readers who are interested in this topic may consult the relevant literature (Daelemans et al., 1994; Daelemans, 2002; Eddington, 2002a; Krott et al., 2002). It should also be noted that the present paper does not, and cannot, explore all possible combinations of frequency, variable alignment, and featural versus phonemic representation.
One crucial element of all exemplar-based models is a database of words (or other linguistic variables) that serves as a kind of approximation of the mental lexicon of language speakers. If one's goal is to study consonant spirantization, the database would contain instances of words or word combinations in which spirantization has or has not occurred. One of the variables would indicate whether the database item is an example of spirantization or not. A study designed to predict the part of speech of words based on their phonological structure would contain phonological representations of nouns, verbs, adjectives, etc., along with a variable specifying the part of speech of each entry. Of course, the database should always be based on naturally occurring language data. The goal of exemplar modeling is to use a database representing a speaker's prior language experience, to predict linguistic behavior. For example, to predict the part of speech of a word whose part of speech is unknown, the task of the algorithm would be to compare the test word to the words in the database whose part of speech is known, and to extrapolate or analogize the word's part of speech based on database items that bear similarities to the word in question.
2.1. Memory-based Language Processing. One memory-based algorithm used in the present study is found in the family of algorithms incorporated in the Tilburg Memory-based Learner (henceforth TiMBL). TiMBL is an expansion of the algorithm developed by Aha et al. (1991), and a detailed description of the algorithm is found in Daelemans et al. (2001). In essence, TiMBL takes an input and determines which items in a database of exemplars are the most similar to the input form. These are known as the nearest neighbors of the input. During the training session, the model stores in memory a series of variables which represent instances of words. The words are stored along with their behavior (e.g. the type of past tense taken, or an indication of which syllable is stressed). In the case that the same word is encountered more than once in the database, a count is kept of how often each word is associated with a given behavior. During the testing phase, when an input is presented, the model searches for it in the database and applies the behavior that it has been assigned in the majority of cases. If the word is not found in the database, a similarity algorithm is used to find the most similar item(s), that is, its nearest neighbor(s). The behavior of the nearest neighbor is then applied to the word in question. If two or more items are equidistant from the word in question, the most frequent behavior of the tied items is applied to the word in question. In the algorithm utilized in the present simulations, the similarity between the values of a variable is precalculated and used to adjust the search for nearest neighbors accordingly. This precalculation allows certain values to be regarded as more similar to each other than other values.
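The nearest-neighbor procedure just described can be illustrated with a minimal sketch. It is not TiMBL's code: the function names, the toy training items, and the treatment of unseen values are assumptions made for illustration. The distance between two values of a variable is taken here to be the summed difference between the behavior distributions associated with each value, which is one common formulation of a modified value difference metric.

```python
from collections import Counter, defaultdict

# Toy exemplars (hypothetical): (final nucleus, final coda) -> stressed syllable
TRAIN = [
    (("a", "-"), "penult"), (("o", "-"), "penult"), (("e", "-"), "penult"),
    (("e", "s"), "penult"), (("o", "r"), "final"), (("a", "l"), "final"),
]

def class_distributions(train):
    """Estimate P(behavior | value) for every value of every variable position."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for features, behavior in train:
        for pos, value in enumerate(features):
            counts[pos][value][behavior] += 1
    return {
        pos: {v: {b: n / sum(c.values()) for b, n in c.items()}
              for v, c in by_value.items()}
        for pos, by_value in counts.items()
    }

def value_distance(dists, pos, a, b):
    """Precalculable distance between two values of one variable."""
    if a == b:
        return 0.0
    pa, pb = dists[pos].get(a), dists[pos].get(b)
    if pa is None or pb is None:
        return 2.0  # assumption: a value never seen in training is maximally distant
    return sum(abs(pa.get(x, 0.0) - pb.get(x, 0.0)) for x in set(pa) | set(pb))

def predict(train, item):
    """Apply the most frequent behavior among the nearest (possibly tied) neighbors."""
    dists = class_distributions(train)
    scored = [
        (sum(value_distance(dists, pos, a, b)
             for pos, (a, b) in enumerate(zip(item, features))), behavior)
        for features, behavior in train
    ]
    best = min(d for d, _ in scored)
    votes = Counter(b for d, b in scored if abs(d - best) < 1e-9)
    return votes.most_common(1)[0][0]
```

In this sketch an exact match simply has distance zero, so the separate lookup step described above is folded into the neighbor search. Because /r/ and /l/ are both associated only with final stress in the toy data, they count as identical for the purposes of the search, which is the effect the precalculated similarities are meant to achieve.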
2.2. Analogical Modeling of Language Processes. Another memory-based model is found in Analogical Modeling of Language (hereafter AM; see Skousen, 1992 for an in-depth explanation of the AM algorithm, and Eddington, 2000a for a succinct description of AM's functioning). AM also conducts a search of a database looking for words similar to the input word. In AM, the search begins with the entries most similar to the input word whose behavior is being predicted, and then extends to less similar entries. The members of the database are grouped into sets called subcontexts whose members share similarities with the input form. For example, in determining the past tense behavior of the English nonce verb kive, one subcontext would be comprised of all database items ending in /v/, another would contain those that end in /aiv/, another all items whose final syllable begins with /k/, another all items whose final syllable begins with /k/ and ends in /v/, and so forth until all possible combinations of all variables are explored.
One derived property that results from dividing the database in this manner is that of proximity. Database items that share more features with the nonce input kive will appear in more subcontexts and will therefore have a higher likelihood of influencing the probability that kive will be assigned a given past tense form. Gang effects also fall out of this architecture. Groups of similar items that display the same behavior will increase their chances of influencing the input form.
Heterogeneity is another important property of AM. It suggests that a word in the database cannot be chosen as an analog if there are intervening words with a different behavior that are more similar to the input item. Calculating heterogeneity involves determining disagreements. A disagreement occurs when one member of a subcontext has a behavior that is different from the behavior of another member of the same subcontext. For example, drive and thrive share a final /aiv/, but form their past tense in a different manner (drove, thrived). As a result, when they appear in the same subcontext, they disagree in terms of what type of past tense they take. Under certain conditions, the analogical influence of the members of a subcontext that contains disagreements will be reduced or eliminated. AM's output is given in terms of the statistical probability that one or more behaviors will apply to the input word.
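The grouping of database items, the proximity effect, and the counting of disagreements can be sketched as follows. This is a simplified illustration under stated assumptions, not the AM algorithm: it enumerates every combination of variables, collects the items that match kive on those variables, and counts pairwise disagreements, but it does not implement AM's test for heterogeneity or its probability calculation. The three-variable coding (onset, nucleus, coda) and the four database entries are hypothetical.

```python
from itertools import combinations

# Hypothetical coding of the final syllable: (onset, nucleus, coda)
DATABASE = [
    ("drive",  ("dr",  "ai", "v"), "vowel change"),
    ("thrive", ("thr", "ai", "v"), "regular"),
    ("love",   ("l",   "u",  "v"), "regular"),
    ("kick",   ("k",   "i",  "k"), "regular"),
]
KIVE = ("k", "ai", "v")

def contexts(n_vars):
    """Yield every non-empty combination of variable positions."""
    for size in range(1, n_vars + 1):
        yield from combinations(range(n_vars), size)

def members(database, test_item, positions):
    """Database entries that match the test item on all the given positions."""
    return [e for e in database
            if all(e[1][p] == test_item[p] for p in positions)]

def context_counts(database, test_item):
    """Number of groupings in which each database word appears."""
    counts = {word: 0 for word, _, _ in database}
    for positions in contexts(len(test_item)):
        for word, _, _ in members(database, test_item, positions):
            counts[word] += 1
    return counts

def disagreements(group):
    """Number of pairs within a group whose behaviors differ."""
    behaviors = [behavior for _, _, behavior in group]
    return sum(1 for x, y in combinations(behaviors, 2) if x != y)
```

In this toy database, drive and thrive match kive on two variables and therefore appear in three groupings each, while love and kick match on one variable and appear in only one. The grouping defined by a shared /aiv/ contains drive and thrive, which disagree in their past tense behavior.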
Having introduced the two algorithms, I dedicate the remainder of the paper to applying them to the questions posed in the introduction.
3. Spanish Stress Assignment. Spanish stress assignment is a fairly predictable phenomenon. With the exception of gerunds followed by two clitic pronouns (haciéndoselo 'doing it for him/her') stress falls on one of the final three syllables of a word. Words ending in a vowel or /s/ are more commonly stressed on the penultimate syllable, while those that end in a consonant other than /s/ usually receive final stress. This generalization holds for about 87% of Spanish words (Eddington, 2000b).
The databases for the Spanish stress assignment simulations are based on the 4970 most frequent Spanish words taken from a frequency dictionary (Alameda and Cuetos, 1995). Two databases were created. The type database contained only one entry for each word, while the token database contained multiple instances of each word according to the word's frequency in the dictionary. For example, one instance of the word abuela 'grandmother' appears in the type database, while 158 instances appear in the token database since the frequency of abuela is 158 in the frequency dictionary. The phonemic information in the final three syllables of each word was included as variables. Variables indicating the tense and person of each verbal form were also included. For example, in Table 1, '3' indicates third person singular, and 'pt' indicates preterit tense. Dashes are used in place of spaces to keep the syllables aligned, and '0' indicates the absence of a syllable when in the nucleus column, and a non-verb when in the tense column.
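The difference between the two databases can be made concrete with a short sketch. The function name and the second entry are hypothetical, and each word is represented here only by its orthographic form and stress pattern rather than by the full set of variables shown in Table 1; only the frequency of abuela comes from the text.

```python
# (word, stressed syllable, frequency in the dictionary)
ENTRIES = [
    ("abuela", "penult", 158),  # frequency given in the text
    ("jabalí", "final", 3),     # hypothetical frequency, for illustration only
]

def build_databases(entries):
    """Return a type database (one row per word) and a token database
    (one row per occurrence, i.e. each word repeated by its frequency)."""
    type_db = [(word, stress) for word, stress, _ in entries]
    token_db = [(word, stress)
                for word, stress, freq in entries
                for _ in range(freq)]
    return type_db, token_db

type_db, token_db = build_databases(ENTRIES)
```

With these two entries the type database has two rows, while the token database has 161 rows, 158 of which are abuela.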
3.1. Spanish Stress: Type and Token Frequency. There is a plethora of psycholinguistic evidence demonstrating that both type and token frequency play a part in language processing (e.g. Allen, McNeal and Kvak, 1992; Kelliher and Henderson, 1990; MacKay, 1982; Scarborough, Cortese and Scarborough, 1977). However, one issue that has been left unanswered in the exemplar literature is the question of whether analogies are made on the basis of type or token frequency. The type and token databases of Spanish were designed to help answer this question, but some measure of performance was needed. One measure of generalization performance is cross-validation (Breiman et al., 1984). This consists of dividing a database into ten groups. Each group is extracted from the remaining nine and its members serve as the test cases, while the members of the remaining nine groups comprise the training set from which analogs are chosen. In this way, each item in the database serves as a test case once, and as a possible source of analogical influence nine times. One advantage of cross-validation is that it does not allow exact matches to be found. That is, no word in the test set will find the exact same word in the dataset; therefore, the behavior of the words must be determined by analogy with similar words. If every test word finds the identical word in the dataset, that is the psychological equivalent of remembering the word (along with its stress pattern). In that case, the success rate would be an uninteresting 100%.
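The partitioning involved in ten-fold cross-validation can be sketched as follows. The function name is hypothetical, and the random shuffling with a fixed seed is an assumption; the text does not say how items were assigned to groups.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (training set, test set) pairs so that every item is a test
    case exactly once and a potential analog in the other k - 1 rounds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment to groups
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With 4970 items, each round has 497 test items and 4473 training items.
splits = list(k_fold_splits(range(4970)))
```

Because the test group is removed from the training set in each round, a test word can never find itself among the candidate analogs, provided each word occurs only once in the database being split.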
A ten-fold cross-validation was performed on the type database by dividing it into ten sets of 497 items. TiMBL's modified value difference metric was used to determine the nearest neighbor(2) of each test item. The stress of the nearest neighbor was used to predict the stress of each test item. In this simulation, stress was successfully predicted in an average of 95.72% of cases across the ten type datasets (range: 93.56%-97.18%). The ten token databases were then formed by multiplying the instances of the words in each type dataset by the word's frequency in Alameda and Cuetos (1995). However, even though the datasets were augmented to reflect token frequency, the stress placement of each of the 4970 words was predicted only once, not multiple times according to their token frequency. Again, a ten-fold cross-validation was performed, yielding an average success rate of 93.58% (range: 91.35%-95.77%). These data suggest that type frequency is a better predictor of stress assignment than token frequency (χ²(1) = 21.12, p < .001).
The fact that type frequency outperforms token frequency is not unique to Spanish stress assignment. I have found similar effects in simulations of Italian conjugation classes and Spanish gender assignment (Eddington, 2002b, 2002c). Processing of Dutch morphology also appears to involve type frequency over token frequency (Bertram, Baayen and Schreuder, 2000; de Jong, Schreuder and Baayen, 2000). Bybee (2001) suggests that type frequency is more important than token frequency in cases involving productivity. Perhaps the computational method of treating each test item as if it were a new and previously unencountered word is tantamount to testing the productivity of the Spanish stress patterns.
Another question of frequency which needs to be examined involves the frequencies at which the optimal analogs may be found. In a study of phonotactics in English, Bailey and Hahn (2001) observed that subjects' ratings of wordlikeness appeared to be influenced more by the token frequency of medium frequency words; extremely high and extremely low frequency words did not exert much influence. The present token database of Spanish words only represents high to medium frequency Spanish words. An extremely large database of Spanish words would need to be constructed in order to cover the entire frequency range. Nevertheless, if the token database is divided into two equal parts based on frequency, the most frequent part would contain high frequency words while the less frequent part could be considered representative of middle frequency words.
The most desirable way of testing the high and medium token frequency databases is to use all 4970 words as test items, and to test these against the high and middle frequency databases. In this way, it can be seen if there are differences in the results produced with the two databases of differing frequencies. The difficulty with this is structuring the simulation so that no test item finds its exact counterpart in the training database. This required a bit of manipulation. First, the entire high frequency database was used as the training set on which the stress placement of the middle frequency words was determined, and vice versa. Next, a five-fold cross-validation was performed on each database so that the high frequency words could be tested against the high frequency database, and the middle frequency words could be tested against the middle frequency database without encountering undesirable exact matches. The average success rate for the simulations was 92.43% using the high frequency database, and 94.06% using the middle frequency database.(3) This demonstrates that the middle frequency database is a better set on which to analogize (χ²(1) = 9.778, p < .005). Bailey and Hahn found the same in their study of phonotactics, but also discovered that low frequency words do not exert much influence either. Since the database of Spanish words used in the present study does not contain low frequency words, their effect on stress placement could not be ascertained.
3.2. Spanish Stress: Variable Alignment. One question that arises when converting language data into computer-readable format is how to correctly codify the data. Consider the way the words in the stress assignment simulation were encoded in Table 1. Monstruo 'monster' contains eight phonemes, but these are compressed into only five variables. In this encoding, all phonemes that fall into a syllable onset or coda combine to form a single variable; ns and trw each form one variable. This means that the ns in the coda of the penultimate syllable of mons.truo will be counted as similar to the ns in cons.truir 'to build', but no similarity will be found to the s of ras.go 'trait', nor to the n of can.to 'chant'. For such partial similarities to be recognized, the members of the onsets and codas must be counted as separate variables. An additional question involves where glides belong. The current encoding places them in the onset or coda, but it could be argued that they belong in the nucleus. Accordingly, the final nucleus of monstruo should contain wo instead of a simple o.
To answer these questions, I modified the original database in several ways that are described below, and compared the performance of each encoding. The comparison of these encodings was done using a leave-one-out method, which is another measure of performance (Weiss and Kulikowski, 1991). This consists of removing each word from the database one at a time. The word that has been extracted becomes the test item, while the remainder of the items serve as the training set from which analogies are drawn. In this way, the stress placement of each word is calculated only once. A leave-one-out simulation was not possible with the token database because it would have resulted in multiple predictions being made for the same word. That, in turn, would not have allowed the results of the type and token simulations to be evaluated on equal grounds. Kohavi (1995) reports that cross-validation may have advantages over the leave-one-out method when one's goal is to determine the superiority of one computational model over another. However, the focus of the present paper is not model selection, but the evaluation of datasets with differing characteristics.
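The leave-one-out procedure can be sketched as follows. The function names and the toy database are hypothetical, and a simple count of matching variables stands in for the similarity metric actually used in the simulations.

```python
from collections import Counter

# Hypothetical exemplars: (final nucleus, final coda) -> stressed syllable
DATABASE = [
    (("a", "-"), "penult"), (("o", "-"), "penult"), (("e", "-"), "penult"),
    (("i", "-"), "penult"), (("u", "-"), "final"),
    (("a", "r"), "final"), (("o", "r"), "final"), (("e", "r"), "final"),
]

def overlap_predict(train, features):
    """Stand-in predictor: majority behavior of the items sharing the most variables."""
    scored = [(sum(a == b for a, b in zip(features, f)), label) for f, label in train]
    best = max(score for score, _ in scored)
    votes = Counter(label for score, label in scored if score == best)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(database, predict):
    """Remove each item in turn, predict it from the rest, and report the
    proportion of correct predictions."""
    correct = 0
    for i, (features, label) in enumerate(database):
        train = database[:i] + database[i + 1:]  # every item except the test item
        if predict(train, features) == label:
            correct += 1
    return correct / len(database)

accuracy = leave_one_out_accuracy(DATABASE, overlap_predict)
```

In the toy database the single exceptional item (u with no coda and final stress) is mispredicted, since its closest analogs all take penultimate stress, while the other seven items are predicted correctly.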
The question of whether glides should be included in the nucleus may be answered by reencoding the data to reflect this. Therefore, the original alignment of the phonemes in monstruo (m/o/ns/trw/o) was changed to (m/o/ns/tr/wo). I will refer to the latter as the glide-in-the-nucleus alignment, and to the former as the no-glide-in-the-nucleus alignment. The same changes were made in every word in the type database containing a glide. The success rates are as follows:
No-glide-in-the-nucleus (cross-validation method) 95.72%
No-glide-in-the-nucleus (leave-one-out method) 95.96%
Glide-in-the-nucleus (leave-one-out method) 95.89%
The first thing that must be noted is that there is no statistical difference between the cross-validation and leave-one-out methods of measuring performance (χ²(1) = 0.348, p < .25). Second, whether the glide is placed in the nucleus or in the onset or coda does not appear to be a factor that influences stress assignment (χ²(1) = 0.022, p < .75).
The next question to address is whether any benefit results from considering the individual members of a consonant cluster in an onset or coda as separate variables. This entails recoding words such as monstruo from (m/o/ns/trw/o) into something along the lines of (m/o/n/s/t/r/w/o). However, this encoding is not adequate because it does not address the problem of how to correctly align the members of the onsets and codas. Consider the final syllables of the words monstruo, filtro 'filter', and contínuo 'continuum' ([trwo], [tro] and [nwo]). The ideal alignment would show that contínuo shares w and o with monstruo, and filtro shares t, r, and o with monstruo. If we arrange the variables so that the first consonants in the onsets are aligned, the phonemes that monstruo and filtro have in common are correctly identified, in that they appear as variables in the same column:
(1) t r w o monstruo
    t r - o filtro
    n w - o contínuo
However, the w of contínuo and monstruo belong to different variables, which means that the similarity between the words will not be identified by the analogical algorithm. The other possibility is to align the phonemes starting from the nucleus and working toward the left.
(2) t r w o monstruo
    - t r o filtro
    - n w o contínuo
This yields an alignment that highlights the fact that contínuo shares w and o with monstruo, but fails to capture the t and r that monstruo and filtro have in common. The best resolution to this paradox, in my view, is to encode the data so that both alignments are represented at the same time. I will refer to this as dual-alignment:
(3) t r w t r w o monstruo
    t r - - t r o filtro
    n w - - n w o contínuo
The dual-alignment database was created by separating the members of each onset and coda into separate variables and aligning them as in (3). Although the dual-alignment appeared to be an intuitively more correct way to encode the words, it produced no significant change in the outcome when a leave-one-out simulation was performed (χ²(1) = 0.690, p < .5).
No-glide-in-the-nucleus 95.96%
Dual-alignment 95.61%
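The dual-alignment encoding of an onset can be sketched as follows. The function names and the fixed width of three slots are assumptions made for illustration; the output for filtro and contínuo reproduces the rows shown above.

```python
def left_align(segments, width, pad="-"):
    """Fill slots from the left edge, padding on the right."""
    return list(segments) + [pad] * (width - len(segments))

def right_align(segments, width, pad="-"):
    """Fill slots from the right edge (next to the nucleus), padding on the left."""
    return [pad] * (width - len(segments)) + list(segments)

def dual_align(onset, nucleus, width=3):
    """Encode the onset twice, once per alignment, followed by the nucleus."""
    return left_align(onset, width) + right_align(onset, width) + [nucleus]

def shared_columns(a, b, pad="-"):
    """Number of columns in which two encodings hold the same phoneme."""
    return sum(1 for x, y in zip(a, b) if x == y and x != pad)

monstruo = dual_align(["t", "r", "w"], "o")
filtro = dual_align(["t", "r"], "o")     # t r - - t r o
continuo = dual_align(["n", "w"], "o")   # n w - - n w o
```

Under this encoding monstruo and filtro share three columns (t, r and o), and monstruo and contínuo share two (w and o), so both similarities are visible to an algorithm that compares variables column by column.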
The three different alignments experimented with to this point failed to demonstrate any real differing effect on the performance of the model. The reason why this is so may be found by inspecting the feature permutation. In the course of running a simulation, TiMBL ranks the variables in terms of how much each one contributes to making the predictions. This is called the feature permutation. In the no-glide-in-the-nucleus simulation, the five most important variables were: 1) the phoneme or absence of phoneme in the coda of the final syllable; few Spanish words have complex codas in the final syllable; 2) the variable indicating the tense of verbal forms, 3) the variable indicating the person of verbal forms; none of the three alignments altered the morphological variables; 4) the nucleus of the final syllable; few Spanish words contain glides in the final syllable; 5) the nucleus of the penultimate syllable. What becomes obvious is that the different alignments manipulate the variables that are least relevant to stress assignment.
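How a ranking of variables by their contribution might be computed can be sketched with information gain, that is, the reduction in uncertainty about the outcome once the value of a variable is known. This is an illustrative assumption: the text does not state which measure underlies the ranking reported by TiMBL, and the toy database and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical exemplars: (final onset, final coda) -> stressed syllable
DATABASE = [
    (("t", "-"), "penult"), (("t", "r"), "final"),
    (("m", "-"), "penult"), (("m", "r"), "final"),
]

def entropy(labels):
    """Uncertainty (in bits) about the behavior in a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(database, pos):
    """Entropy of the behaviors minus the entropy remaining once the
    value of the variable at position pos is known."""
    labels = [label for _, label in database]
    by_value = defaultdict(list)
    for features, label in database:
        by_value[features[pos]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in by_value.values())
    return entropy(labels) - remainder

def rank_variables(database):
    """Variable positions ordered from most to least informative."""
    n_vars = len(database[0][0])
    return sorted(range(n_vars), key=lambda p: information_gain(database, p), reverse=True)
```

In the toy data the coda fully determines stress while the onset tells nothing, so the coda is ranked first. Recoding a variable that sits at the bottom of such a ranking would be expected to have little effect on the predictions.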
3.3. Spanish Stress: Features versus Phonemes. One objection that could be made to the databases used in the simulations is that they use phonemes as variables. What this means is that if one word contains an /m/ in a certain context, and another an /n/, the two values will be viewed as having nothing in common. That is, the phonemic representation of variables does not allow the algorithm to see that the two phonemes share nasality. A case could be made that the use of distinctive features would improve the results of the simulations. To test this, I modified the dual-alignment database so that the only variables were the onset, nucleus, and rime of the final syllable. No morphological variables were included. A leave-one-out analysis yielded a success rate of 89.05%. The phonemes in this database were then converted into series of 17 binary features.(4) Features that are irrelevant for consonants were marked with a '0' in the database, and features that are not pertinent to vowels were marked in the same fashion. The resulting success rate of 89.03% is virtually identical to the simulation using phonemic representation.
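The contrast between the two representations can be sketched as follows. The three features and their values are a hypothetical fragment chosen for illustration; they are not the 17 binary features used in the simulations.

```python
# Hypothetical feature fragment: (nasal, labial, voiced)
FEATURE_TABLE = {
    "m": ("+", "+", "+"),
    "n": ("+", "-", "+"),
    "p": ("-", "+", "-"),
}

def to_features(phonemes):
    """Replace each phoneme variable with its series of feature variables."""
    encoded = []
    for phoneme in phonemes:
        encoded.extend(FEATURE_TABLE[phoneme])
    return tuple(encoded)

def matching_variables(a, b):
    """Number of variables on which two encodings agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

phonemic = matching_variables(("m",), ("n",))                           # no match
featural = matching_variables(to_features(("m",)), to_features(("n",)))  # two of three match
```

As phonemes, /m/ and /n/ have nothing in common; as feature series, they agree on nasality and voicing and differ only in place of articulation.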
One objection that could be made to comparing featural and phonemic representations is that the two databases are radically different; as a result, the task of predicting stress is essentially redefined. It may be that one algorithm (either AM's or one of TiMBL's) produces better results when the database contains phonemic representations, but another algorithm may prove more adept at processing binary features. While this may be true, if both the algorithm and the dataset are modified, it becomes impossible to tell whether the differing outcomes are due to changes in the algorithm, changes in the dataset configuration, or a combination of both. In the present study, only TiMBL's modified value difference metric was used so that issues of dataset representation were not confounded with issues involving algorithm differences.
To summarize thus far, the different variable alignments, and the use of binary features in place of phonemes, produced no difference in the predictive ability of the exemplar model. This is most likely due to the fact that the specific consonants and consonant clusters which appear in onsets and codas play an insignificant role in stress assignment. As far as type versus token frequency is concerned, however, a database reflecting type frequency is a much better predictor of stress assignment than one based on token frequency, although a token database consisting of middle frequency items is not ineffective.
4. English Past Tense. The remainder of the present study describes a number of simulations that predict the English past tense. The goal of these simulations is again to evaluate the role of frequency, variable alignment, and phonemic representation as they relate to exemplar models of language processing. The English past tense has occupied a central role in the debate on language processing at least since Rumelhart and McClelland's controversial study in 1986. A survey of the literature on this debate, besides being extremely lengthy, would fall beyond the scope of the present paper. Instead of entering the debate, the past tense is simply used as a test case against which analogical databases with differing characteristics may be tested.
4.1. English Past Tense: Type and Token Frequency. The verbs utilized in this study were the same 2179 English verbs and their corresponding past tense forms used in the study by MacWhinney and Leinbach (1991). These include all verbs from Francis and Kučera's English frequency dictionary (1982) as well as many extant verbs not found in that sample. Several English verbs allow two past tense forms (dived/dove), and in these cases, each alternative was included in the database. In the initial database, the present tense forms were encoded with the same variables used in Derwing and Skousen (1994). This includes the phonemes of the final two syllables, along with an indication of whether the final syllable is stressed or stressless.
The first few past tense simulations were carried out with AM's algorithm. AM was chosen because it gives the outcome in terms of the probability that one outcome or another will be applied. This sort of output has the advantage of being interpretable in two different ways. One interpretation, termed 'selection by plurality' (Skousen, 1989), involves considering the behavior with the highest probability to apply. This sort of winner-take-all output is that produced by connectionist networks, as well as the TiMBL simulations discussed previously. With AM's 'random selection', on the other hand, one considers the degree to which two or more outcomes are predicted. This more fine-grained output is important when comparing the model to the results of psycholinguistic experiments, which usually entail some degree of variability.
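The two ways of interpreting a probabilistic output can be sketched as follows. The probabilities are hypothetical, and treating random selection as drawing an outcome in proportion to its predicted probability is an assumption made for this illustration.

```python
import random

# Hypothetical predicted probabilities for one nonce verb
PREDICTED = {"regular": 0.7, "vowel change": 0.3}

def selection_by_plurality(probabilities):
    """Winner-take-all: the behavior with the highest probability applies."""
    return max(probabilities, key=probabilities.get)

def random_selection(probabilities, rng):
    """Draw one behavior in proportion to its predicted probability."""
    outcomes = list(probabilities)
    weights = [probabilities[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [random_selection(PREDICTED, rng) for _ in range(10000)]
regular_share = draws.count("regular") / len(draws)
```

Selection by plurality always returns the regular outcome for this verb, whereas repeated random selection returns the vowel-change outcome in roughly three of every ten draws, which is the kind of variability that can be compared with the proportions of responses given by experimental subjects.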
A leave-one-out simulation was run using all 2179 items in the database. Selection by plurality yielded a success rate of 90.32%. A token database was constructed by multiplying each item in the type database by the number of times it occurs in Francis and Kučera (1982).(5) In this way, a series of variables representing an item with a frequency of 15 appeared 15 times in the token database. When the 2179 verbs were tested against this database, a success rate of 81.37% resulted. The token simulation performed significantly worse than the simulation using types (χ²(1) = 61.628, p < .001).
One advantage of the past tense database is that, although it does not contain every English verb, it does represent a continuum spanning high frequency verbs (e.g. go, have) to very low frequency verbs (slay, flog). Therefore, the ability of an analogical database comprised of middle frequency words to outperform high and low frequency words could be assessed. In order to test this, the token and type frequency databases were divided in three ways. The first was to divide them in half according to frequency so that the verbs with the highest token frequency appeared in one database and those with the lowest token frequency in another. Another division eliminated the most frequent and least frequent fourths of the database, so that only the middle frequency items remained. The 2179 words were tested against these type and token databases. An option was set in the AM algorithm so that when a test item encountered an exact match in the training set, the influence of that item on the outcome was eliminated. The success rates are given in Table 2.
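The division of a database into frequency bands can be sketched as follows. The function name and the frequency counts are hypothetical; only the verbs go, have, slay and flog are taken from the text as examples of the two ends of the continuum.

```python
# (verb, token frequency); the counts are invented for illustration
VERBS = [
    ("go", 900), ("have", 1200), ("make", 500), ("walk", 120),
    ("jump", 40), ("thrive", 9), ("slay", 2), ("flog", 1),
]

def frequency_bands(entries):
    """Split entries into a high half, a low half, and a middle band that
    drops the most frequent and least frequent fourths."""
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    n = len(ranked)
    half, quarter = n // 2, n // 4
    return {
        "high": ranked[:half],
        "low": ranked[half:],
        "middle": ranked[quarter:n - quarter],
    }

bands = frequency_bands(VERBS)
```

Each band can then serve as a type database directly, or be expanded into a token database by repeating each verb according to its frequency, before the full set of verbs is tested against it.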
As far as token frequency is concerned, the middle frequency verbs appear to constitute a better set of items on which to analogize, as Bailey and Hahn (2001) have reported. Nevertheless, the highest success rate among the token simulations, that of the middle frequency database (89.86%), is statistically equivalent to the success rate obtained using the entire type frequency database (90.32%; χ²(1) = 0.080, p < .9). Of course, it should not be surprising that the most frequent past tense verbs are the poorest set from which to choose analogs; they contain the most verbs with irregular past tense forms. In like manner, the low frequency database contains few irregular items, so verbs with irregular past tense forms find few correct analogies to other irregular items. When the highest frequency words are eliminated, the remaining words do not exhibit such radical frequency differences among themselves, which means that, to a certain extent, the resulting token database is more similar to a type database. It may be for these reasons that the middle frequency token database provides the best success rate.
Drawing both test items and training items from the same database is a common practice in natural language processing tasks of the sort reported on herein. However, most theories of language processing hold that irregular past tense forms must be stored in memory. Therefore, treating them as novel items in a simulation may be problematic (Ling and Marinov, 1993). One way to avoid this potential problem is to use nonce words in place of existing words. Prasada and Pinker (1993) conducted a study in which they elicited the past tense forms of 60 nonce words. Ten of the nonce words were designed to be highly similar to extant regular verbs (prototypical), ten were somewhat similar to regular verbs (intermediate), and ten were very dissimilar to regular verbs (distant). The remaining 30 nonce verbs were arranged according to how similar they were to extant verbs with irregular past tense forms. In the experiment, they measured how often the subjects produced one of the regular past tense suffixes [-d, -t, -əd] and how often an irregular form involving some sort of vowel change was produced. A number of simulations were run in order to predict the type of past tense form that would be given to Prasada and Pinker's nonce verbs. The output of the simulations was modified to reflect the sort of output reported by Prasada and Pinker. The simulations differed according to which database was used to draw analogies from, as seen in Table 2.
The results of the leave-one-out simulations and the nonce word study produced very similar outcomes (compare Table 2 and Table 3).
In each case, the type databases outperformed the token databases, and in both cases, the most successful simulation utilized the type database that contained all 2179 verbs (see Figure 1).(6) Simulations done with the highest frequency words underperformed all others. In the leave-one-out simulations using token databases, the middle frequency items appear to provide the best pool of possible analogs. In the nonce simulations, the token databases containing the middle frequency words also outperformed the high frequency database and the database containing all 2179 items. However, the low frequency database performed as well as the middle frequency database.
A number of things may be concluded from these simulations. 1) Since the outcomes of the nonce study and the leave-one-out study differ little, the concern that it is problematic to use the same database items as both test and training sets appears unwarranted; 2) Simulations using type frequency outperform those that use tokens; 3) Eliminating the most frequent items (and possibly the least frequent items as well) leads to better performance when token frequency is used. Whether this holds true for modeling phenomena besides the English past tense is a matter that needs to be investigated further.
4.3. English Past Tense: Features versus Phonemes. In the Spanish stress simulation, it was seen that a feature representation of the phonemes provided no significant advantage over a purely phonemic representation, but this does not necessarily mean that features cannot yield better results in modeling other phenomena. Skousen gives the following argument against treating the distinctive features of different phonemes as independent variables (1989, p. 53): Bought differs from bit in one phoneme [I~ɔ], and these vowels differ in terms of three features. [I] is unround, high, and front, while [ɔ] is round, low, and back. Mid and beet differ in three phonemes, yet from a featural standpoint they differ in only three features, just as bought and bit do: /m/ is nasal while /b/ is non-nasal; /I/ is lax while /i/ is tense; /d/ is voiced while /t/ is voiceless. In other words, it may be the case that the use of features results in skewed measures of similarity.
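Skousen's argument can be made concrete by counting mismatches under each representation. The sketch below uses a reduced, hypothetical set of six binary features covering only the contrasts named above (height is collapsed into a single feature so that the two vowels differ in three features); it is not the 17-feature set used in the simulations.

```python
# Hypothetical reduced feature matrix: (nasal, voiced, tense, round, high, back).
FEATURES = {
    "b": (0, 1, 0, 0, 0, 0),
    "m": (1, 1, 0, 0, 0, 0),
    "t": (0, 0, 0, 0, 0, 0),
    "d": (0, 1, 0, 0, 0, 0),
    "I": (0, 1, 0, 0, 1, 0),   # lax, unround, high, front
    "i": (0, 1, 1, 0, 1, 0),   # tense, unround, high, front
    "ɔ": (0, 1, 0, 1, 0, 1),   # round, non-high, back
}

def phoneme_mismatches(word_a, word_b):
    """Count positions at which two equally long words differ in phoneme."""
    return sum(a != b for a, b in zip(word_a, word_b))

def feature_mismatches(word_a, word_b):
    """Count differing feature values across all aligned positions."""
    return sum(
        fa != fb
        for a, b in zip(word_a, word_b)
        for fa, fb in zip(FEATURES[a], FEATURES[b])
    )

bought, bit = ["b", "ɔ", "t"], ["b", "I", "t"]
mid, beet = ["m", "I", "d"], ["b", "i", "t"]

phonemic = (phoneme_mismatches(bought, bit), phoneme_mismatches(mid, beet))
featural = (feature_mismatches(bought, bit), feature_mismatches(mid, beet))
```

With phonemes as variables the two pairs are one and three mismatches apart, whereas with features as independent variables both pairs are three mismatches apart, which is the skew in similarity that Skousen describes.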
In order to test phonemic versus featural representation, TiMBL's algorithm was used simply because AM could not handle the number of variables required. A leave-one-out simulation on the above-mentioned database yielded a success rate of 91.55% when the variables were phonemic.(7) The phonemes were then converted into series of 17 binary features.(8) With binary feature representation, the leave-one-out analysis significantly underperformed the phonemic representation by correctly assigning the past tense to only 90.13% of the verbs (χ²(1) = 5.533, p < .025).(9)
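The general shape of a leave-one-out test with a nearest neighbor classifier can be sketched as below. This is a generic illustration using an unweighted overlap metric, not TiMBL's implementation; the three-slot encoding, the rough ASCII transcriptions, and the seven-verb database are hypothetical.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of variable positions at which two items differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(test_vars, training):
    """Majority outcome among the training items at the smallest distance."""
    best = min(overlap_distance(test_vars, v) for v, _ in training)
    votes = Counter(
        outcome for v, outcome in training
        if overlap_distance(test_vars, v) == best
    )
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(database):
    """Test each item against a training set made of all the other items."""
    correct = 0
    for i, (variables, outcome) in enumerate(database):
        training = database[:i] + database[i + 1:]
        if classify(variables, training) == outcome:
            correct += 1
    return correct / len(database)

# Hypothetical toy database: (onset, nucleus, coda) and past tense class.
database = [
    (("w", "O", "k"), "regular"),       # walk
    (("t", "O", "k"), "regular"),       # talk
    (("b", "O", "k"), "regular"),       # balk
    (("l", "U", "k"), "regular"),       # look
    (("s", "I", "N"), "vowel_change"),  # sing
    (("r", "I", "N"), "vowel_change"),  # ring
    (("t", "e", "k"), "vowel_change"),  # take
]

accuracy = leave_one_out_accuracy(database)
```

In this toy database only take is misclassified, because its nearest neighbor (talk) is regular; the same logic applies whether the variables are phonemes or feature values.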
4.4. English Past Tense: Variable Alignment. To this point, the databases utilized in the past tense simulations have been encoded using the variable alignment exemplified in Table 4, which I will refer to as the no-syllable-boundary alignment.
This alignment, which was taken from Derwing and Skousen's (1994) study of the English past tense, centers on the nuclei of the final two syllables (variables 5 and 10). Any phonemes appearing within two slots before or after the nuclei are included as variables, regardless of whether these phonemes belong to the same syllable as the nucleus. Cases in which no phoneme appears are marked with '0'. In all of the alignments discussed below, variable 1 indicates the verb's final phoneme, and variable 2 whether the verb's stress falls on the final syllable (F) or not (N). One possible objection to the no-syllable-boundary alignment is that it encodes some phonemes twice in some verbs (as in the case of the /s/ of transform and the /bj/ of distribute), and only once in other verbs.
The lumped-cluster alignment provides an encoding that respects syllable boundaries and syllable constituents. For example, in Table 5 variable 8 contains the onset of the penult syllable, variable 7 the nucleus, and variable 6 the coda. Empty syllable positions appear with '0'.
Although the lumped-cluster alignment addresses the problem of respecting syllable constituents and boundaries, it may be problematic in that the members of a consonant cluster are not separate variables. For example, variable 8 contains tr for transform and str for distribute. Although both words share tr, the algorithms will treat the two variables as having nothing in common. It also fails to represent the fact that tally and transform both begin with the same phoneme/variable t.
One way of separating the consonant clusters into separate variables appears in Table 6. Here, assignment of variables begins with the nucleus and moves outward, incorporating phonemes belonging to the onset and coda of the syllable.
This alignment faithfully represents the fact that distribute and transform share tr, but does not demonstrate that both tally and transform share word-initial t. In order to represent both of these similarities, all of the consonants in onsets and codas must be encoded twice: once left-justified and once right-justified, as in Table 7.
This dual-alignment is the only way of encoding the variables so that all possible similarities between the members of codas and onsets are captured.
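The double encoding of a single onset or coda can be sketched as follows. The function name and parameters are hypothetical; the three-slot width, the '-' padding symbol, and the '0' marker for an empty constituent follow the pattern visible in Table 7.

```python
def dual_align(cluster, width=3, pad="-", empty="0"):
    """Encode a consonant cluster left-justified, then right-justified."""
    symbols = list(cluster) if cluster else [empty]
    if len(symbols) > width:
        raise ValueError("cluster is longer than the available slots")
    padding = [pad] * (width - len(symbols))
    left_justified = symbols + padding
    right_justified = padding + symbols
    return left_justified + right_justified

onset_transform = dual_align(["t", "r"])        # t r - - t r
onset_distribute = dual_align(["s", "t", "r"])  # s t r s t r
onset_tally = dual_align(["t"])                 # t - - - - t

# Positions at which two encodings share the same (non-padding) phoneme.
def shared_slots(a, b, pad="-"):
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y and x != pad]

tally_vs_transform = shared_slots(onset_tally, onset_transform)
distribute_vs_transform = shared_slots(onset_distribute, onset_transform)
```

The left-justified half aligns the initial t of tally and transform, while the right-justified half aligns the tr shared by distribute and transform, so both similarities are available to the algorithm at once.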
In order to test the adequacy of each of these four alignments, four leave-one-out simulations were performed with type frequency represented. AM's algorithm was applied to this task with the following success rates, none of which differ significantly from another (χ²(1) = 1.450, p < .75):
No-syllable-boundary alignment 90.31%
Lumped-consonant-cluster alignment 91.28%
Separate-consonant-cluster alignment 91.19%
Dual-alignment 91.14%
These results are reminiscent of the nonsignificant differences that the alignments in the Spanish stress assignment produced. Nevertheless, it is important to establish not only whether these alignments are equally optimal from a database-internal perspective, but also whether they are equal when measured in terms of actual language processing. To this end, the four alignments were used to predict the past tense forms of nonce words devised by Albright and Hayes (2001).
In two different experiments, Albright and Hayes asked subjects to provide past tenses for 58 nonce words. They calculated the percentage of responses in which a particular past tense form was provided. This involved determining how often a regular past tense form, or one or more irregular past tense forms, was given. For example, the past tense of spling was given as splinged by 51.4% of the subjects, as splung by 32.4%, and as splang by 10.8% of the subjects. These data were correlated with AM's predicted probability that each past tense form would occur. Analogies were made with databases using each of the four variable alignments with the following results, all of which demonstrate a significant positive correlation with the subjects' responses (p < .005, two-tailed):
No-syllable-boundary alignment .856
Lumped-consonant-cluster alignment .848
Separate-consonant-cluster alignment .877
Dual-alignment .886
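The comparison behind these figures is a correlation between observed response rates and predicted probabilities. A minimal sketch of that calculation follows; the observed rates are the spling figures quoted above, while the predicted probabilities are hypothetical placeholders rather than AM's actual output, and the study itself pooled the forms of all 58 nonce words.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Observed proportions for splinged, splung, splang (from the text).
observed = [0.514, 0.324, 0.108]

# Hypothetical model probabilities for the same three forms.
predicted = [0.55, 0.30, 0.15]

r = pearson_r(observed, predicted)
```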
The least successful alignment is the one in which consonant clusters in codas and onsets are not treated as separate variables, but lumped together. The dual-alignment achieves the highest correlation since it allows more similarities to be found between words. That is, it is the only alignment that shows that sing and string both begin with /s/, and at the same time highlights the /r/ that string and ring have in common.
The fact that the best alignment occurs when all syllabic constituents are represented and when syllable boundaries are respected may indicate that words are encoded syllabically in the mental lexicon. There is evidence to support the notion that syllable structure plays a role in language processing (Carreiras, Alvarez, and de Vega, 1993; Costa and Sebastián-Gallés, 1998; Levelt and Wheeldon, 1995; Perea and Carreiras, 1998). However, such evidence is based on languages such as Dutch, French, and Spanish. In other studies, syllables do not appear to be a significant factor in processing English (Cutler et al., 1983, 1986).
5. Conclusions. The purpose of this study was to evaluate the role of frequency, variable alignment, and phonemic representation in analogical simulations. In the simulations which were carried out, databases based on type frequency yielded better results than those based on token frequency in predicting both Spanish stress assignment and English past tense forms. However, the simulations also suggest that when token frequency is used, the middle frequency items provide the best set from which to draw analogies.
Another question that was addressed was whether phonemic or featural representation is optimal. In the Spanish stress assignment task, both representations produced statistically similar results, while a featural representation of English verbs performed significantly worse than did a straight phonemic representation.
A number of different variable alignments were tried on the Spanish data, none of which performed significantly better than the others. This may be due to the fact that the variables which are most important to Spanish stress assignment were not affected much by the different alignments. As far as predicting the English past tense is concerned, however, comparisons with the nonce word task suggest that it is better to consider the individual members of onsets and codas as separate variables. The dual-alignment proved most adept and may have some advantages over the others.
The results of the simulations reported in the present paper must be construed as being relevant only for the tasks to which they were applied. For instance, the fact that token frequency underperformed type frequency in simulations of Spanish stress and English past tense formation does not necessarily indicate that token frequency will never play an important role in predicting other linguistic behaviors analogically. Further investigation into other linguistic phenomena as well as into other languages is warranted before any broad generalizations may be made.
Aha, D. W., D. Kibler and M. K. Albert. "Instance-based Learning Algorithms". Machine Learning, 6, (1991), pp. 37-66.
Alameda, J. R. and F. Cuetos. Diccionario de frecuencias de las unidades lingüísticas del castellano. Oviedo, Spain, University of Oviedo Press, 1995.
Albright, A. and B. Hayes. "Rules vs. Analogy in English Past Tenses: A Computational/Experimental Study." Manuscript, UCLA, 2001. (http://www.linguistics.ucla.edu/people/hayes/rulesvsanalogy)
Alegre, M. and P. Gordon. "Frequency Effects and the Representational Status of Regular Inflections". Journal of Memory and Language, 40, (1999), pp. 41-61.
Allen, P., M. McNeal and D. Kvak. "Perhaps the Lexicon is Coded as a Function of Word Frequency." Journal of Memory and Language, 31, (1992), pp. 826-44.
Baayen, H. R., T. Dijkstra and R. Schreuder. "Singulars and Plurals in Dutch: Evidence for a Parallel Dual-route Model". Journal of Memory and Language, 37, (1997), pp. 94-117.
Bailey, T. and U. Hahn. "Determinants of Wordlikeness: Phonotactics or Lexical Neighborhoods?" Journal of Memory and Language, 44, (2001), pp. 568-591.
Breiman, L., J. H. Friedman, R. A. Olshen and C. J. Stone. Classification and Regression Trees. Belmont, CA, Wadsworth, 1984.
Bertram, R., R. H. Baayen and R. Schreuder. "Effects of Family Size for Complex Words." Journal of Memory and Language, 42, (2000), pp. 390-405.
Brown, R. and D. McNeill. "The 'Tip of the Tongue' Phenomenon". Journal of Verbal Learning and Verbal Behavior, 5, (1966), pp. 325-337.
Bod, R. Beyond Grammar. Stanford, CA, CSLI, 1998.
Bybee, J. "A View of Phonology from a Cognitive and Functional Perspective". Cognitive Linguistics, 5, (1994), pp. 285-305.
---. "Regular Morphology and the Lexicon". Language and Cognitive Processes, 10, (1995), pp. 425-55.
---. "The Emergent Lexicon". In Proceedings of the Chicago Linguistic Society, vol. 34. Eds. M. Gruber, C. D. Higgins, K. S. Olson and T. Wysocki, Chicago, Chicago Linguistic Society, 1998, pp. 421-435.
---. Phonology and Language Use. Cambridge, Cambridge University Press, 2001.
Carreiras, M., C. J. Alvarez, and M. de Vega. "Syllable Frequency and Visual Word Recognition in Spanish." Journal of Memory and Language, 32, (1993), pp. 766-780.
Costa, A. and N. Sebastián. "Abstract Phonological Structure in Language Production: Evidence from Spanish." Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, (1998), pp. 886-903.
Cutler, A., J. Mehler, D. Norris, and J. Seguí. "A Language-specific Comprehension Strategy". Nature, 304, (1983), pp. 159-160.
---. "The Syllable's Differing Role in the Segmentation of French and English." Journal of Memory and Language, 25, (1986), pp. 385-400.
Daelemans, W. "A Comparison of Analogical Modeling to Memory-based Language Processing". In Analogical Modeling: An Exemplar-based Approach to Language. Eds. R. Skousen, D. Lonsdale and D. B. Parkinson, Amsterdam, John Benjamins, 2002, in press.
Daelemans, W., S. Gillis and G. Durieux. "Skousen's Analogical Modeling Algorithm: A Comparison with Lazy Learning". In Proceedings of the International Conference on New Methods in Language Processing. Ed. D. Jones, Manchester, UMIST, 1994, pp. 1-7.
Daelemans, W., J. Zavrel, K. van der Sloot and A. van den Bosch. "TiMBL: Tilburg Memory-based Learner, version 4.1 Reference Guide". Induction of Linguistic Knowledge Technical Report, ILK 01-04. Tilburg, Netherlands, ILK Research Group, Tilburg University, 2001. Available at http://ilk.kub.nl/.
De Jong, N., R. Schreuder and H. Baayen. "The Morphological Family Size Effect and Morphology." Language and Cognitive Processes, 15, (2000), pp. 329-365.
Derwing, B. L. and R. Skousen. "Productivity and the English Past Tense: Testing Skousen's Analogy Model." In The Reality of Linguistic Rules. Eds. S. D. Lima, R. L. Corrigan and G. K. Iverson, Amsterdam, Benjamins, 1994, pp. 193-218.
Eddington, D. "Analogy and the Dual-route Model of Morphology". Lingua, 110, (2000a), pp. 281-298.
---. "Spanish Stress Assignment within the Analogical Modeling of Language". Language, 76, (2000b), pp. 92-109.
---. "A Comparison of Two Models: Tilburg Memory-based Learner Versus Analogical Modeling of Language". In Analogical Modeling: An Exemplar-based Approach to Language. Eds. R. Skousen, D. Lonsdale and D. B. Parkinson, Amsterdam, John Benjamins, 2002a, in press.
---. "Dissociation in Italian Conjugations: A Single-route Account". Brain and Language, 2002b, in press.
---. "Spanish Gender Assignment in an Analogical Framework". Journal of Quantitative Linguistics, (2002c), in press.
Francis, N. W. and H. Kučera. Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin, 1982.
Frisch, S. A., N. R. Large, B. Zawaydeh and D. B. Pisoni, "Emergent Phonotactic Generalizations in English and Arabic". In Frequency and the Emergence of Linguistic Structure. Eds. J. Bybee and P. Hooper, Amsterdam, Benjamins, 2001, pp. 159-179.
Gillis, S., W. Daelemans, G. Durieux and A. van den Bosch. "Learnability and Markedness: Dutch Stress Assignment". In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, Hillsdale, N.J., Erlbaum, 1993, pp. 452-457.
Goldinger, S.D. "Words and Voices: Episodic Traces in Spoken Word Identification and Recognition in Memory". Journal of Experimental Psychology: Learning, Memory, and Cognition, 22 (1996), pp. 1166-1183.
---. "Words and Voices: Perception and Production in an Episodic Lexicon". In Talker Variability in Speech Processing. Eds. K. Johnson and J. W. Mullennix, San Diego, Academic, 1997, pp. 33-65.
Kelliher, S. and L. Henderson. "Morphologically Based Frequency Effects in the Recognition of Irregularly Inflected Verbs." British Journal of Psychology, 81, (1990), pp. 527-539.
Kohavi, R. "A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection". In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, vol. 2, San Mateo, Morgan Kaufmann, 1995, pp. 1137-1145.
Krott, A., R. Schreuder and R. H. Baayen. "Analogical Hierarchy: Exemplar-based Modeling of Linkers in Dutch Noun-Noun Compounds". In Analogical Modeling: An Exemplar-based Approach to Language. Eds. R. Skousen, D. Lonsdale and D. B. Parkinson, Amsterdam, John Benjamins, 2002, in press.
Levelt, W. and L. Wheeldon. "Do Speakers Have Access to a Mental Syllabary?" In Cognition on Cognition. Eds. J. Mehler and S. Franck, Cambridge, MA, MIT Press, 1995, pp. 301-331.
Ling, C. X. and M. Marinov. "Answering the Connectionist Challenge: A Symbolic Model of Learning the Past Tenses of English Verbs." Cognition, 49, (1993), pp. 235-290.
MacKay, D. G. "The Problems of Flexibility, Fluency, and Speed-Accuracy Trade-off in Skilled Behavior." Psychological Review, 89, (1982), pp. 60-94.
MacWhinney, B. and J. Leinbach. "Implementations are not Conceptualizations: Revising the Verb Learning Model". Cognition, 29, (1991), pp. 121-157.
MacWhinney, B., J. Leinbach, R. Taraban and J. McDonald. "Language Learning: Rules or Cues?" Journal of Memory and Language, 28 (1989), pp. 255-277.
Manelis, L. and D. A. Tharp. "The Processing of Affixed Words". Memory and Cognition, 5 , (1977), pp. 690-695.
McClelland, J. L., "Connectionist Models and Psychological Evidence". Journal of Memory and Language, 27 (1988), pp. 107-123.
Nakisa, R. C., K. Plunkett and U. Hahn. "A Cross-linguistic Comparison of Single and Dual-route Models of Inflectional Morphology". In Cognitive Models of Language Acquisition. Eds. P. Broeder and J. Murre, Cambridge, MA, MIT Press, 2000.
Nosofsky, R. M., "Relations Between Exemplar Similarity and Likelihood Models of Classification". Journal of Mathematical Psychology, 34 (1990), pp. 393-418.
Palmeri, T. J., S. D. Goldinger and D. B. Pisoni. "Episodic Encoding of Voice Attributes and Recognition memory for Spoken Words". Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, (1993), pp. 309-28.
Pawley, A. and F. Hodgetts Syder. "Two Puzzles for Linguistic Theory: Nativelike Selection and Nativelike Fluency". Language and Communication, Eds. J. C. Richards and R. W. Smith, London, Longman, 1993, pp. 191-225.
Perea, M. and M. Carreiras. "Effects of Syllable Frequency and Syllable Neighborhood Frequency in Visual Word Recognition." Journal of Experimental Psychology: Human Perception and Performance, 24, (1998), pp. 134-144.
Pierrehumbert, J. "Exemplar Dynamics: Word Frequency, Lenition and Contrast". In Frequency and the Emergence of Linguistic Structure. Eds. J. Bybee and P. Hooper, Amsterdam, Benjamins, 2001, pp. 137-158.
Pisoni, D. "Some Thoughts on 'Normalization' in Speech Perception". In Talker Variability in Speech Processing. Eds. K. Johnson and J. W. Mullennix, San Diego, Academic, 1997, pp. 9-32.
Prasada, S. and S. Pinker. "Generalisation of Regular and Irregular Morphological Patterns." Language and Cognitive Processes, 8, (1993), pp. 1-56.
Rytting, C. A. "An Empirical Test of Analogical Modeling: The /k/ ~ Ø Alternation". In Lacus Forum XVII: The Lexicon. Eds. A. K. Melby and A. R. Lommel, Fullerton, CA, Linguistic Association of Canada and the United States, 2000, pp. 73-84.
Rumelhart, D. E. and J. L. McClelland, "On Learning the Past Tense of English Verbs". In Parallel Distributed Processing, vol. 2. Eds. J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, Cambridge, Mass., The MIT Press, 1986, pp. 216-271.
Scarborough, D. L., C. Cortese and H. S. Scarborough. "Frequency and Repetition Effects in Lexical Memory." Journal of Experimental Psychology, 3, (1977), pp. 1-17.
Sereno, J. A., and A. Jongman. "Processing of English Inflectional Morphology". Memory and Cognition, 25, (1997), pp. 425-37.
Skousen, R. Analogical Modeling of Language. Dordrecht, Kluwer Academic, 1989.
---. Analogy and Structure. Dordrecht, Kluwer Academic, 1992.
Weiss, S. M. and C. A. Kulikowski. Computer Systems that Learn. San Mateo, Morgan Kaufmann, 1991.
| Variables | ||||||||||||
| Word | Stress | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| monstruo | Penult | - | 0 | - | 0 | - | m | o | ns | trw | o | - |
| resbaló | Final | 3 | pt | rr | e | s | b | a | - | l | o | - |
Table 2. Success rates (%) of the leave-one-out past tense simulations.
| | Token Frequency | Type Frequency |
| All 2179 verbs | 81.37 | 90.32 |
| Most frequent 1090 verbs | 77.65 | 87.65 |
| Least frequent 1089 verbs | 88.39 | 89.22 |
| Middle frequency 1090 verbs | 89.86 | 89.58 |
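Table 3. Correlations between the simulations' predictions and subjects' responses to Prasada and Pinker's nonce verbs.
| | Token Frequency | Type Frequency |
| All 2179 verbs | .896 | .996 |
| Most frequent 1090 verbs | .938 | .982 |
| Least frequent 1089 verbs | .987 | .989 |
| Middle frequency 1090 verbs | .987 | .993 |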
Table 4. No-syllable-boundary alignment.
| | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| transform | t | r | æ | n | s | s | f | o | r | m | F | m |
| distribute | t | r | I | b | j | b | j | u | t | 0 | N | t |
| tally | 0 | t | æ | l | i | æ | l | i | 0 | - | N | i |
Table 5. Lumped-cluster alignment.
| | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| transform | tr | æ | ns | f | o | rm | F | m |
| distribute | str | I | 0 | bj | u | t | N | t |
| tally | t | æ | 0 | l | i | 0 | N | i |
Table 6. Separate-consonant-cluster alignment.
| | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| transform | 0 | t | r | æ | n | s | 0 | - | 0 | f | o | r | m | 0 | F | m |
| distribute | s | t | r | I | 0 | - | - | 0 | b | j | u | t | 0 | - | N | t |
| tally | - | 0 | t | æ | 0 | - | - | - | 0 | l | i | 0 | - | - | N | i |
Table 7. Dual-alignment.
| | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| transform | t | r | - | - | t | r | æ | n | s | - | - | n | s | f | - | - | - | - | f | o | r | m | - | - | r | m | F | m |
| distribute | s | t | r | s | t | r | I | 0 | - | - | - | - | 0 | b | j | - | - | b | j | u | t | - | - | - | - | t | N | t |
| tally | t | - | - | - | - | t | æ | 0 | - | - | - | - | 0 | l | - | - | - | - | l | i | 0 | - | - | - | - | 0 | N | i |

1. I am indebted to Steven Chandler and Walter Daelemans for their input and critique of this paper.
2. A number of simulations were run in which either one or three nearest neighbors were used to determine behavior. However, in each case the simulation with one nearest neighbor produced slightly better results than the one with three, which is why the results with three are not reported.
3. In calculating success rates, care was taken to weight the results of each test set in proportion to the number of test cases it contained.
4. Sonorant, consonantal, syllabic, continuant, voiced, aspirated, nasal, labial, anterior, coronal, strident, lateral, high, low, back, rounded, tense.
5. The frequency of words not found in Francis and Kučera was set at one for the purposes of the present studies.
6. The results of this simulation are more highly correlated with the subjects' responses than those of the simulation reported on in Eddington (2000a), most likely because of the larger database (2179 items versus 848 items in the previous study).
7. With this same database AM correctly predicted 90.32% of the past tense forms, which is somewhat less successful than TiMBL's 91.55% success rate (χ²(1) = 4.130, p < .05).
8. Sonorant, consonantal, syllabic, continuant, voiced, aspirated, nasal, labial, anterior, coronal, strident, lateral, high, low, back, rounded, tense. Consonantal features not relevant to vowels were marked with a '0', as were vocalic features not relevant to consonants.
9. A number of other simulations were run in TiMBL using various algorithms (overlap, no weighting, k=1, 3, 5; overlap, gain ratio weighting, k=1, 3, 5; overlap, chi-squared weighting, k=1, 3, 5; overlap, shared variance weighting, k=1, 3, 5; modified value difference metric, information gain weighting, k=1, 3, 5). In all of these simulations, the phonemic and the feature databases were used, and in all simulations the phonemic database outperformed the feature database.