The size of vocabulary to set as a goal.

General discussion about learning languages
tarvos
Black Belt - 2nd Dan
Posts: 2889
Joined: Sun Jul 26, 2015 11:13 am
Location: The Lowlands
Languages: Native: NL, EN
Professional: ES, RU
Speak well: DE, FR, RO, EO, SV
Speak reasonably: IT, ZH, PT, NO, EL, CZ
Need improvement: PO, IS, HE, JP, KO, HU, FI
Passive: AF, DK, LAT
Dabbled in: BRT, ZH (SH), BG, EUS, ZH (CAN), and a whole lot more.
Language Log: http://how-to-learn-any-language.com/fo ... PN=1&TPN=1
x 6094

Re: The size of vocabulary to set as a goal.

Postby tarvos » Wed Jan 13, 2021 9:25 pm

Interpreters make specific glossary lists that they review for specific assignments and projects, but even they don't remember all the necessary vocabulary - they look it up beforehand if they have to, and that's how they prepare (sometimes you have to be creative on the fly). I would say I have a pretty outstanding vocabulary in Spanish (and that's not empty bluster - I can back that up), and there are still plenty of words I don't know or that are local to a certain region and that I would have to learn.

And you know what? That's absolutely FINE.

Vocabulary is more of a problem for me in Finnish and Mandarin, but even in those languages I get by pretty well with what I have, which isn't enough for true mastery but certainly does the job.
5 x
I hope your world is kind.

Is a girl.

Montmorency
Brown Belt
Posts: 1035
Joined: Tue Oct 06, 2015 3:01 pm
Location: Oxfordshire, UK
Languages: English (Native)
Maintaining: German (active skills lapsed somewhat).
Studying: Welsh (advanced beginner/intermediate);
Dabbling/Beginner: Czech

Back-burner: Spanish (intermediate) Norwegian (bit more than beginner) Danish (beginner).

Have studied: Latin, French, Italian, Dutch; OT Hebrew (briefly) NT Greek (briefly).
Language Log: viewtopic.php?f=15&t=1429
x 1184

Re: The size of vocabulary to set as a goal.

Postby Montmorency » Wed Jan 13, 2021 11:32 pm

Iversen wrote: My take on this is that the better you know a language, the higher the percentage of words you could mobilize in a relevant situation.


And a corollary of this is (probably) that the more frequently you actually use that language, the higher the percentage of words you could mobilise in a relevant situation.

(My spell-checker has encouraged me to British-ise your "mobilize". ;) )
1 x

AcademiaNut
White Belt
Posts: 47
Joined: Mon Jan 04, 2021 9:54 pm
Location: U.S.A.
Languages: English (N).
Spanish (beginner), French (beginner).
Medium interest: Latin, Dutch, German.
Mild interest: Japanese, Danish, Swedish, Portuguese, Greek, Hawaiian.
x 32

Re: The size of vocabulary to set as a goal.

Postby AcademiaNut » Thu Jan 14, 2021 12:45 am

ryanheise wrote: I've updated the script to display the word position in the frequency list (e.g. "10" means "10th most frequent word"). It will show a blank if the word is not in the corpus (which again is clipped off at 10,000 words due to resource constraints on the server).


Awesome! Thanks.
OK, here we go with the unrefined results. This is going to take some space, so I'll show the raw results for only the first text sample.
I chose 3 samples of text, in what I believe to be increasing order of vocabulary difficulty. The 3 samples were taken from:

(1) "Go Dog Go"
http://iepclass.weebly.com/uploads/1/1/ ... astman.pdf
(2) "Alice's Adventures in Wonderland," Chapter IV
https://www.owleyes.org/text/alices-adv ... chapter-iv
(3) U.S. Constitution Preamble
https://constitutioncenter.org/interact ... n/preamble

My hypothesis was that:
(1) The mean would increase with each text sample, since that would indicate increasingly rare vocabulary, on average, with each successive piece of text.
(2) The "Alice's Adventures in Wonderland" excerpt would have the highest variance since, as I noted yesterday, it has an unusual mixture of very common words with very rare words.
(3) The mean would be less reliable if the variance were high, since variance flags anomalous excerpts.
(4) After filtering out all the very common words of certain predetermined sets (I chose articles and conjunctions), whose choice of inclusion is extremely limited, the variance would decrease, since the only words being analyzed would be those over which the author has a choice of selection. As a result, the means would be more clearly separated.

The results below showed that the first 3 parts of my hypothesis were right on target, though I'm pressed for time so I might have to check the 4th part tomorrow, where I filter out all articles and conjunctions. Anyone is free to beat me to it, if they feel like it.
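A minimal Python sketch of the filtering step in part (4), assuming the tool output has already been turned into (word, rank) pairs. The stop-word set and the function names are illustrative assumptions; the post does not list the exact words that were removed.

```python
from statistics import mean, variance

# Assumed closed-class words to drop: articles plus a few common conjunctions.
# This set is an illustration only, not the list actually used in the thread.
ARTICLES_AND_CONJUNCTIONS = {"a", "an", "the", "and", "or", "but", "nor"}


def drop_closed_class(pairs, stop_words=ARTICLES_AND_CONJUNCTIONS):
    """Keep only the (word, rank) pairs whose word is not in stop_words."""
    return [(word, rank) for word, rank in pairs if word.lower() not in stop_words]


def mean_and_variance(pairs):
    """Return (mean, sample variance) of the ranks; needs at least two pairs."""
    ranks = [rank for _, rank in pairs]
    return mean(ranks), variance(ranks)


# A few pairs taken from the "Go Dog Go" statistics list.
pairs = [("Now", 84), ("all", 32), ("the", 1), ("dogs", 2631), ("And", 3), ("tree", 964)]
filtered = drop_closed_class(pairs)
print("unfiltered:", mean_and_variance(pairs))
print("filtered:  ", mean_and_variance(filtered))
```

Dropping words with very low ranks necessarily raises the mean of what is left, so the filtered and unfiltered means are not directly comparable.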

Here is the procedure I used for each excerpt:
1. Copy and paste the excerpt into your online tool:
https://www.ryanheise.com/languages/eng ... ension.cgi
2. Fix any spelling anomalies. (E.g., I changed "defence" to "defense" since I was getting a spelling warning for that word.)
3. Set the vocabulary size at 10,000, which looks like your maximum possible value, and submit.
4. Copy and paste the resulting outputted text and list of statistics into another file.
5. Delete the words of that statistics list to leave only the values behind, *except* substitute 10,001 for any word that had no statistic due to it being rarer than any of the 10,000 words in your list.
6. Copy this list of integers into the online variance calculator at:
https://www.calculatorsoup.com/calculat ... ulator.php
7. Choose "Sample" and [Calculate], then copy the statistical summary into a separate file.
8. Repeat this entire procedure three times in total, once for each excerpt.
9. Rearrange the 3 resulting statistical summaries so that each type of statistic shows all 3 text results together for ease of comparison.
10. Round off any decimal values to produce integers in the final report.
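The manual cleaning and calculator steps can also be scripted. The following is a rough Python sketch, not the procedure actually used: it assumes each line of the statistics list is either "word rank" or a bare word with no rank, and the names `OUT_OF_LIST_RANK`, `ranks_from_lines`, and `summarize` are made up for illustration.

```python
from statistics import mean, stdev, variance

OUT_OF_LIST_RANK = 10_001  # stand-in rank for words missing from the 10,000-word list


def ranks_from_lines(lines):
    """Convert lines such as 'dogs 2631' into integer ranks.

    A line holding a word with no number after it is treated as out of list.
    """
    ranks = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) >= 2 and parts[-1].isdigit():
            ranks.append(int(parts[-1]))
        else:
            ranks.append(OUT_OF_LIST_RANK)
    return ranks


def summarize(ranks):
    """Sample statistics, corresponding to the calculator's "Sample" setting."""
    return {
        "n": len(ranks),
        "mean": mean(ranks),
        "variance": variance(ranks),  # divides by n - 1
        "stdev": stdev(ranks),
    }


lines = ["the 1", "dogs 2631", "whiskers", "tree 964"]
print(summarize(ranks_from_lines(lines)))
```

Scripting the substitution of 10,001 removes the most error-prone manual step, since a forgotten substitution changes both the count and the mean.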

(1)
SUBMITTED TEXT:
Now all the dogs get out. And now look where those dogs are going! To the tree! To the tree! Up the tree! Up the tree! Up they go to the top of the tree. Why? Will they work there? Will they play there? What is up there on top of that tree? A dog party! A big dog party! Big dogs, little dogs,
RETURNED TEXT:
Now all the dogs get out. And now look where those dogs are going! To the tree! To the tree! Up the tree! Up the tree! Up they go to the top of the tree. Why? Will they work there? Will they play there? What is up there on top of that tree? A dog party! A big dog party! Big dogs, little dogs,
OUTPUTTED STATISTICS LIST:
Now 84
all 32
the 1
dogs 2631
get 147
out 59
And 3
now 84
look 232
where 92
those 97
dogs 2631
are 22
going 218
To 4
the 1
tree 964
To 4
the 1
tree 964
Up 63
the 1
tree 964
Up 63
the 1
tree 964
Up 63
they 29
go 141
to 4
the 1
top 608
of 2
the 1
tree 964
Why 202
Will 45
they 29
work 108
there 39
Will 45
they 29
play 552
there 39
What 49
is 8
up 63
there 39
on 15
top 608
of 2
that 7
tree 964
A 6
dog 8665
party 406
A 6
big 580
dog 8665
party 406
Big 580
dogs 2631
little 121
dogs 2631
CLEANED AND MODIFIED STATISTICS LIST:
84
32
1
2631
147
59
3
84
232
92
97
2631
22
218
4
1
964
4
1
964
63
1
964
63
1
964
63
29
141
4
1
608
2
1
964
202
45
29
108
39
45
29
552
39
49
8
63
39
15
608
2
7
964
6
8665
406
6
580
8665
406
580
2631
121
2631
CALCULATOR OUTPUT:
Variance s² = 2557945.9
Standard Deviation s = 1599.358
Count n = 64
Mean x̄ = 620
Sum of Squares SS = 161150590

(2)
SUBMITTED TEXT:
It was the White Rabbit, trotting slowly back again, and looking anxiously about as it went, as if it had lost something; and she heard it muttering to itself, “The Duchess! The Duchess! Oh my dear paws! Oh my fur and whiskers! She'll get me executed, as sure as ferrets are ferrets! Where can I have dropped them, I wonder!” Alice guessed in a moment that it was looking for the fan and the pair of white kid gloves, and she very good-naturedly began hunting about for them, but they were nowhere to be seen—everything seemed to have changed since her swim in the pool, and the great hall, with the glass table and the little door, had vanished completely.
RETURNED TEXT:
It was the White ?????? ?????? slowly back again and looking anxiously about as it went as if it had lost something and she heard it ?????? to itself The Duchess The Duchess Oh my dear ?????? Oh my fur and ?????? ?????? get me executed as sure as ?????? are ?????? Where can I have dropped them I wonder Alice ?????? in a moment that it was looking for the fan and the pair of white kid gloves and she very good ?????? began hunting about for them but they were ?????? to be seen everything seemed to have changed since her ?????? in the pool and the great hall with the glass table and the little door had vanished ??????
CALCULATOR OUTPUT:
Variance s² = 10967712
Standard Deviation s = 3311.7536
Count n = 122
Mean x̄ = 1860.9344
Sum of Squares SS = 1327093100

(3)
SUBMITTED TEXT:
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
RETURNED TEXT:
We the People of the United States in Order to form a more perfect Union establish Justice insure domestic ?????? provide for the common defense promote the general Welfare and secure the Blessings of Liberty to ourselves and our Posterity do ?????? and establish this Constitution for the United States of America
CALCULATOR OUTPUT:
Variance s² = 7314431
Standard Deviation s = 2704.5205
Count n = 52
Mean x̄ = 1363.25
Sum of Squares SS = 373035980

COMPARISON:

variance:
(1) 2,557,945.9
(2) 10,967,712
(3) 7,314,431
variance rounded:
(1) 2,557,946
(2) 10,967,712
(3) 7,314,431

mean:
(1) 620
(2) 1860.9344
(3) 1363.25
mean rounded:
(1) 620
(2) 1861
(3) 1363


Analysis: The mean did increase progressively as expected *except* on (2), but note that (2) had a very high variance, which means that the mean is unreliable.

Those "mean rounded" values will be the comprehensibility measures you asked for, except they are unrefined. Caveat: This procedure assumes that comprehensibility is not affected by the placement or interaction of words. I believe you mentioned later that how words are combined affects comprehensibility, so if that is the measure you want, that will be *much* harder to model.

I hope I didn't make any typos in all this, since a good part of the work involved manual cleaning of the text, which tends to be error-prone. At any rate, with all the information provided above, these results will presumably be repeatable.

P.S. (1-14-2021): I ran a set of filtered word excerpts today, but the variance increased in 3 of the 4 texts I tried, contrary to my prediction. I then realized that while the removed material will help to lower the variance, the increased percentage of rare words with high ranking values will simultaneously help to raise the variance. The trend toward higher values won over the trend for lower values. However, if the values were normalized, the variances might in fact be seen to have followed my prediction. I don't have time to fool with this anymore today, but I hope to look up how to normalize unbounded numbers under these conditions, then apply the appropriate adjustments and report back.
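One possible normalization, offered purely as an assumption and not something settled in this thread, is to take the logarithm of each rank before computing the spread. That compresses the long tail, so a single word at rank 8665 or 10,001 no longer dominates the variance. A small Python sketch with hypothetical function names:

```python
import math
from statistics import mean, variance


def log_ranks(ranks):
    """Base-10 logarithm of each rank (all ranks must be 1 or greater)."""
    return [math.log10(rank) for rank in ranks]


def spread_summary(ranks):
    """Compare the raw variance with the mean and variance on a log scale."""
    logs = log_ranks(ranks)
    return {
        "raw_variance": variance(ranks),
        "log_mean": mean(logs),
        "log_variance": variance(logs),
    }


# Ranks taken from the "Go Dog Go" list: the, of, tree, dog.
print(spread_summary([1, 2, 964, 8665]))
```

Whether a log scale is the right choice here is an open question; it is simply one common way to tame values that span several orders of magnitude.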
Last edited by AcademiaNut on Fri Jan 15, 2021 1:27 am, edited 3 times in total.
0 x


Re: The size of vocabulary to set as a goal.

Postby AcademiaNut » Thu Jan 14, 2021 2:02 am

ryanheise wrote: You are correct that there is no objective measure of word importance, because what's important to you might not be important to me.


I've been brainstorming on that problem, too, and I also considered the degrees of separation solution you mentioned, à la Kevin Bacon. What I believe is the key problem is that "importance" is a concept (represented by a token), whereas frequency is a mathematical statistic that doesn't care about what those tokens mean. That's the essence of semantics. Computers are seriously at a loss when it comes to understanding meaning. Still, I believe the problem is surmountable, especially if a certain little-known trick of A.I. is used. To use this trick in your system would require that your entire list (or is it a database?) be restructured into a network representation, probably a special type of semantic net. The hurdle then would be to populate the values of the nodes of that network. This would either require painstaking manual data entry, or unsupervised learning. Machine learning is very popular now with all the interest in Deep Learning, so I believe that would be the way to go, unless the demo is very small.

You mentioned that importance depends on the person, and I agree, but more specifically it depends on the goal. "Important" in a social sense would involve words like "hello" and "please" and "thank you," but "important" in a survival sense would mean words like "caution" and "danger" and "stop," and "important" in a job-seeking sense would mean words like "resume" and "manager" and "interview." This suggests that all that would be needed for the database to understand which concepts are important in which context would be merely to provide training samples in the form of text describing scenarios in those contexts, combined with clever methods of set intersection and maybe representation. Such changes in focus within the conversation that tiia mentioned could then be handled, so that if the topic switched to engineering, the system would automatically shift the context so that civil engineering would become relevant and important. Altogether, I expect such an effort would be the size of a small, commercial R&D project, though. If you ever decide to pursue such a project, let me know by PM, since I might be able to help out.
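To make the idea of context-dependent importance concrete, here is a very small Python sketch. It is only an assumption about how such a system might start, far short of a semantic net: it counts words in one sample text per context and uses set intersection to discard words shared by every context. The function names and sample texts are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def context_keywords(samples, top_n=3):
    """Map each context to its most frequent words not shared by all contexts.

    samples: dict of context name -> sample text; at least two contexts.
    """
    if len(samples) < 2:
        raise ValueError("need at least two contexts to compare")
    counts = {ctx: Counter(tokenize(text)) for ctx, text in samples.items()}
    # Words present in every context carry no context-specific signal.
    shared = set.intersection(*(set(counter) for counter in counts.values()))
    keywords = {}
    for ctx, counter in counts.items():
        specific = [word for word, _ in counter.most_common() if word not in shared]
        keywords[ctx] = specific[:top_n]
    return keywords


samples = {
    "social": "Hello! Please come in. Thank you, and hello to you too.",
    "survival": "Danger! Stop! Caution, you must stop. Danger ahead.",
}
print(context_keywords(samples))
```

A real system would need far more text per context and some notion of relatedness between words, which is where the network representation would come in.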
0 x


Re: The size of vocabulary to set as a goal.

Postby AcademiaNut » Thu Jan 14, 2021 2:25 am

Cainntear wrote: I disagree strongly with the assertion in bold (my emphasis). These topics are interdependent. A script attempts at some level to reflect pronunciation, and a good script informs the learner of the pronunciation, and grammar is difficult to divorce from vocabulary -- how do you reason about the structures related to "and" and "or" without the words themselves? Can you discuss articles without reference to "the" and "a"? This may seem like a purely academic debate, but I feel a lot of teaching, interventions and learning aids fail because they attempt to compartmentalise things too far and end up making things unclear and often meaningless.


I agree to some extent. Your example in my other thread about how the meaning of the transitive verb (whether it involved a cognizant benefit to the indirect object) in a Pattern #8 sentence determined whether that sentence would be proper grammar or not is a very good example of the influence of words on grammar. Also, remember the conversation about how Tesnière corrected Chomsky's oversight on how verbs should be connected to nouns. Still, I believe such connections tend to be subtle and uncommon. In operating systems terminology such systems are said to be "loosely coupled." As a result, I would still tend to mentally categorize the language parts to be learned into those four loosely coupled topics.
0 x

sfuqua
Black Belt - 1st Dan
Posts: 1644
Joined: Sun Jul 19, 2015 5:05 am
Location: san jose, california
Languages: Bad English: native
Samoan: speak, but rusty
Tagalog: imperfect, but use all the time
Spanish: read
French: read some
Japanese: beginner, obsessively studying
Language Log: https://forum.language-learners.org/vie ... =15&t=9248
x 6314

Re: The size of vocabulary to set as a goal.

Postby sfuqua » Thu Jan 14, 2021 3:36 am

:o
I disagree with everybody.
Memorizing vocabulary lists is fun :D
Learning the words for a particular task, passage, or book is very effective.
I've never had much luck memorizing a big vocabulary list in order to increase general comprehension. I've tried. :lol:
Maybe I'm just too impatient. :shock:
4 x
荒海や佐渡によこたふ天の川

the rough sea / stretching out towards Sado / the Milky Way
Basho[1689]

Sometimes Japanese is just too much...

Cainntear
Black Belt - 3rd Dan
Posts: 3526
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8793

Re: The size of vocabulary to set as a goal.

Postby Cainntear » Thu Jan 14, 2021 12:07 pm

AcademiaNut wrote: Also, remember the conversation about how Tesnière corrected Chomsky's oversight on how verbs should be connected to nouns. Still, I believe such connections tend to be subtle and uncommon. In operating systems terminology such systems are said to be "loosely coupled." As a result, I would still tend to mentally categorize the language parts to be learned into those four loosely coupled topics.

That's a point of no small debate. Tesnière's main point in his valency grammars (later generalised into "dependency grammars") was that some words have obligatory grammatical links -- "to give", for example, always requires a direct object:
"I gave him it" or "I gave it to him", but never "*I gave to him" or "*I gave him".

Tesnière talked about (IIRC) "logical objects" in his grammars, to explain why "It was given to him" fulfilled the rules despite the grammatical direct object slot being empty, which was one of the major points in Chomsky's reasoning for grammar being meaningless.

More recent schools of thought have described this as "lexicogrammar" -- words are intrinsically linked to grammatical rules and the word isn't known until you know which rules apply. There's a school of thought in English teaching that goes even further and tries to treat all grammar as lexis -- effectively saying syntax doesn't exist, only collocation. I don't agree with that extreme interpretation, but dependency grammars and lexicogrammar certainly address deficiencies in other models, and I don't recall ever seeing any complaint about them that any other model better addresses.
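The valency idea can be pictured as a lookup table from a verb to the grammatical slots it must fill. The Python sketch below is a toy illustration only: the table, the slot names, and the function are hypothetical, the roles are assigned by hand rather than by a parser, and it covers nothing beyond the "give" example.

```python
# Toy valency table: each verb maps to the slots that must be filled.
# The entry for "give" follows the example in this post; it is not a real grammar.
REQUIRED_SLOTS = {
    "give": {"subject", "direct_object"},
}


def missing_slots(verb, filled_slots):
    """Return the required slots of `verb` that are absent from `filled_slots`."""
    return REQUIRED_SLOTS[verb] - set(filled_slots)


# Hand-labelled slots for the four example sentences.
examples = {
    "I gave him it": {"subject", "indirect_object", "direct_object"},
    "I gave it to him": {"subject", "direct_object", "indirect_object"},
    "*I gave him": {"subject", "indirect_object"},
    "*I gave to him": {"subject", "indirect_object"},
}

for sentence, slots in examples.items():
    gaps = missing_slots("give", slots)
    print(sentence, "->", "ok" if not gaps else f"missing {sorted(gaps)}")
```

In this picture, knowing the word "give" includes knowing its row in the table, which is the sense in which vocabulary and grammar are hard to separate.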
1 x

lusan
Green Belt
Posts: 463
Joined: Sat Aug 15, 2015 1:25 pm
Location: Greensboro, NC, USA
Languages: Spanish(Native)
English (Naïve)
French(Intermediate)
Italian(Intermediate)
Polish(In Alcatraz)
x 985

Re: The size of vocabulary to set as a goal.

Postby lusan » Thu Jan 14, 2021 4:13 pm

s_allard wrote: Learning a language is not about learning words; it’s about learning to put words together into meaningful units.

A very nice quote.
1 x
Italian, polish, and French dance
FSI Basic French Lessons : 10 / 24 17 of 24 goal


Re: The size of vocabulary to set as a goal.

Postby AcademiaNut » Fri Jan 15, 2021 12:56 am

s_allard wrote: I persist in believing that trying to determine the optimum vocabulary size for levels of proficiency is a waste of time for language learning purposes. More specifically, I would like to take issue with the above statement "...vocabulary seems to be the biggest hurdle to attaining proficiency in any given foreign language, by far."


In my experience, understanding vocabulary is *clearly* the most important component of language for *me*. The most common situation where I encounter a foreign language where I live is in Spanish broadcasts, especially from radio, recorded spiels for public transportation, phone recordings, and similar. Where my knowledge of Spanish vocabulary seems to be lacking the most is in verbs: I don't know anywhere near enough verbs, and I certainly can't readily conjugate all the verbs I know. However, when I hear such a recording, I can understand most of it by use of two things: (1) the Spanish words I happen to know, and (2) cognates of words I already know, especially since I know a lot of root words from Spanish, French, and Latin (all Romance languages).

For example, for (1), in a radio ad I might recognize words and phrases like "el presidente" and "para la gente," and for (2) it's easy to figure out that "votar" means "to vote," "electoral" has something to do with elections, and so on. Put (1) and (2) together and it's obvious that it's a political ad for voting for the named Mexican presidential candidate that I'm hearing. No verb knowledge or grammar knowledge was needed at all. Similarly, in any language, if you hear a sentence that has the root words for bite + careful + you + dog, it's pretty obvious which is the subject and which is the direct object, and therefore the overall meaning of the sentence. No verb knowledge or grammar knowledge was needed at all. That is *unquestionably* *my* most common experience with foreign languages, even if your experience differs.
0 x

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2370

Re: The size of vocabulary to set as a goal.

Postby s_allard » Fri Jan 15, 2021 8:07 am

AcademiaNut wrote:
s_allard wrote: I persist in believing that trying to determine the optimum vocabulary size for levels of proficiency is a waste of time for language learning purposes. More specifically, I would like to take issue with the above statement "...vocabulary seems to be the biggest hurdle to attaining proficiency in any given foreign language, by far."


In my experience, understanding vocabulary is *clearly* the most important component of language for *me*. The most common situation where I encounter a foreign language where I live is in Spanish broadcasts, especially from radio, recorded spiels for public transportation, phone recordings, and similar. Where my knowledge of Spanish vocabulary seems to be lacking the most is in verbs: I don't know anywhere near enough verbs, and I certainly can't readily conjugate all the verbs I know. However, when I hear such a recording, I can understand most of it by use of two things: (1) the Spanish words I happen to know, and (2) cognates of words I already know, especially since I know a lot of root words from Spanish, French, and Latin (all Romance languages).

For example, for (1), in a radio ad I might recognize words and phrases like "el presidente" and "para la gente," and for (2) it's easy to figure out that "votar" means "to vote," "electoral" has something to do with elections, and so on. Put (1) and (2) together and it's obvious that it's a political ad for voting for the named Mexican presidential candidate that I'm hearing. No verb knowledge or grammar knowledge was needed at all. Similarly, in any language, if you hear a sentence that has the root words for bite + careful + you + dog, it's pretty obvious which is the subject and which is the direct object, and therefore the overall meaning of the sentence. No verb knowledge or grammar knowledge was needed at all. That is *unquestionably* *my* most common experience with foreign languages, even if your experience differs.

The point here is well taken and raises a number of very interesting questions. Let me first point out that my own post refers specifically to "attaining proficiency in any given foreign language, by far." By proficiency, let's assume that we are talking about the ability to use the language. There are of course many different levels of proficiency, from complete beginner to very advanced. I suggest we use the CEFR scale A1 to C2 and let's call the absolute beginner A0.

In the situation described in the post our A0 has some knowledge of words in the target language (Spanish) and a knowledge of some words in related languages (cognates). With just these elements, the A0 beginner can "understand" some simple phrases heard or seen in real contexts.

This is a very common experience. Given just a few elements, the A0 can guess with some degree of accuracy some very simple meanings. And I should point out that a certain amount of grammatical knowledge is often necessary and must be guessed. I quote:

Similarly, in any language, if you hear a sentence that has the root words for bite + careful + you + dog, it's pretty obvious which is the subject and which is the direct object, and therefore the overall meaning of the sentence. No verb knowledge or grammar knowledge was needed at all.

Notice that this A0 beginner is able to distinguish between the subject, the direct object and the verb. I would hardly say that no verb knowledge or grammar knowledge was needed. Quite the contrary, a minimal amount of grammar is needed or has to be guessed. But it is true that at the A0 level the target language is a sort of haze in which certain words may be recognizable.

But when we talk about proficiency in a language, we're not talking about our A0 speaker guessing the meaning of some simple stuff. Let's move up the proficiency ladder. Can you go from A0 to A2 with no verb knowledge or grammar knowledge? Or let's say you get ambitious and you want to pass the B2 proficiency test in Spanish from the Instituto Cervantes: is it a question of how many words you know? Should you not bother studying the verb system and that dreadful subjunctive mood? Or not bother with grammatical gender? Or those awful direct and indirect pronouns that look alike? How about word order?

I don’t want to repeat what I said in previous posts but in my opinion attempting to learn a certain number of words a day as the principal strategy for attaining proficiency will not work. That said, I would be interested in seeing the results of an experiment along this line using a word frequency list for Spanish that can be easily found on the Internet.
1 x

