Researchers dealing with quantitative aspects of language are often asked how many words a typical native speaker knows. The question is raised not only by lay people but also by colleagues from various disciplines related to language processing, development, acquisition, and education. The answer usually starts with a deep sigh, followed by the explanation that the number depends on how a word is defined. As a result, in the literature one finds estimates going from less than 10 thousand to over 200 thousand (see below). In this paper, we try to give practical answers for American English depending on the definition of a word, the language input an individual is exposed to, and the age of the individual.
In this text, we will need a number of terms related to words. For the readers’ ease we summarize them here.
Word types versus word tokens. Word types refer to the different word forms observed in a corpus; word tokens refer to the total number of words in a corpus. If the corpus consists of the sentence “The cat on the roof meowed helplessly: meow meeooow mee-ee-ooow,” then it has nine word types (the, cat, on, roof, meowed, helplessly, meow, meeooow, and mee-ee-ooow) and 10 word tokens (given that the word type “the” was observed twice). Somewhat surprisingly, in some word counts the words “The” and “the” are considered as two different word types because of the capital letter in the first token of “The.” If such practice is followed, the number of word types reported nearly doubles.
Alphabetical word type. This is a word type consisting only of letters. In the example above, mee-ee-ooow would be deleted. The cleaning needed to get to alphabetical word types also involves eliminating the distinction between uppercase and lowercase letters. So, the words “GREAT” and “great” are the same alphabetical type.
Lemma. A lemma is the uninflected word from which all inflected words are derived. In most analyses, the set of lemmas is limited to alphabetical word types that are seen by the English community as existing words (e.g., they are mentioned in a dictionary or a group of people on the web use them with a consistent meaning). In general, lemmas exclude proper nouns (names of people, places, …). Lemmatization also involves correcting spelling errors and standardizing spelling variants. In the small corpus example we are using, there are six lemmas (the, cat, on, roof, meow, and helplessly).
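
To make the terminology concrete, the counts for the one-sentence example corpus can be reproduced with a few lines of Python. This is only an illustrative sketch, not the procedure used for any published word count: the tokenization rule (split on whitespace, strip surrounding punctuation) and the tiny hand-made lookup table `LEMMA_OF` are assumptions introduced here, with meeooow treated as a spelling variant of meow.

```python
from collections import Counter

corpus = "The cat on the roof meowed helplessly: meow meeooow mee-ee-ooow"

# Tokens: split on whitespace and strip surrounding punctuation.
# Internal hyphens are kept, so "mee-ee-ooow" stays one token.
tokens = [t.strip(".,:;!?\"'") for t in corpus.split()]
tokens = [t for t in tokens if t]

# Word types, ignoring the uppercase/lowercase distinction.
types = Counter(t.lower() for t in tokens)

# Word types when "The" and "the" are counted separately.
case_sensitive_types = set(tokens)

# Alphabetical word types: types consisting only of letters.
alphabetical_types = {t for t in types if t.isalpha()}

# Lemmas: a hypothetical hand-made lookup that undoes inflection
# ("meowed") and standardizes a spelling variant ("meeooow").
LEMMA_OF = {"meowed": "meow", "meeooow": "meow"}
lemmas = {LEMMA_OF.get(t, t) for t in alphabetical_types}

print(len(tokens))                # 10 word tokens
print(len(types))                 # 9 word types
print(len(case_sensitive_types))  # 10 if "The" and "the" differ
print(len(alphabetical_types))    # 8 alphabetical word types
print(sorted(lemmas))             # 6 lemmas
```

On a realistic corpus the lookup table would have to be replaced by a proper lemmatizer and a reference word list; the sketch only shows how each definition narrows the count.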