The statistical distribution of language difficulty

General discussion about learning languages
s_allard
Blue Belt
Posts: 839
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 1740

Re: The statistical distribution of language difficulty

Postby s_allard » Sun Aug 01, 2021 4:10 am

ryanheise wrote:
s_allard wrote:
ryanheise wrote:Since I already explained before why length isn't significant, this time I'll try a demonstration rather than an explanation.

The demonstration is very nice but


It's just that I already explained to you multiple times why length is not a factor, and you weren't receptive to that (you continue to comment on length without even acknowledging those previous points about length) so I decided to try a different tack.

I wasn't talking about the size of half of Shrek. What I pointed out is that in 95 minutes Shrek has more unique vocabulary than 34000 podcasts combined if I understand chart1 correctly.


I've also already explained to you multiple times why document vocabulary size (the number of unique words in a document) is not a factor in graph 1. Bonus points if you can find it and quote it. I think it is more considerate to rephrase your discussion of a previously explained topic by including a reference to that previous explanation as a starting point, rather than starting from zero each time. It takes a lot of effort for me to keep repeating the same explanation multiple times.

We would normally expect the podcasts to have some differences, however slight, in vocabulary between each other. One 30-minute podcast will probably use around 750 unique words but obviously the number of unique words in let’s say 20000 30-minute podcasts is much higher.


So the comment/explanation I would like you to go and find is the explanation of your confusion between document vocabulary size vs learner vocabulary size.
...


Well I give up. But I do have to thank you for all the effort in explaining and re-explaining things that are obviously beyond my grasp. Since I seem to be the only person who doesn’t understand some of the terminology and the methodology used here, I will just bow out and retreat to my academic cave. It’s kind of interesting that with 98% word coverage of this thread, I don’t understand most of it. On the other hand I find Paul Nation’s work on optimal receptive vocabulary size crystal clear although I have my own disagreements with him. Go figure, that’s just me.
0 x

Dragon27
Green Belt
Posts: 379
Joined: Tue Aug 25, 2015 6:40 am
Languages: Russian (N)
English - best foreign language
Polish, Spanish - passive advanced
Tatar, German, French - studying
x 780

Re: The statistical distribution of language difficulty

Postby Dragon27 » Sun Aug 01, 2021 5:43 am

luke wrote:Let me try another compound word. Beetlejuice. That was a weird movie. Beetle Juice. Juice for Beetles. Or is it Juice of Beetles? I don't remember beetles or juice in the movie though. It was a hard movie to sit through.

Yeah, I remember not giving it a second thought (just some weird nonsense compound word, English is full of them; didn't watch the movie, btw), until one day I stumbled on the name of the star Betelgeuse in English (I immediately recognized from its written form that it's the same star we call "Бетельгейзе" in Russian) and decided to look up its pronunciation. As soon as I did that I instantly realized what Beetlejuice means (and that it's not actually a compound word at all). Life is full of little discoveries.
Last edited by Dragon27 on Sun Aug 01, 2021 12:21 pm, edited 3 times in total.
1 x

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Sun Aug 01, 2021 12:11 pm

s_allard wrote:Since I seem to be the only person who doesn’t understand some of the terminology and the methodology used here, I will just bow out and retreat to my academic cave. It’s kind of interesting that with 98% word coverage of this thread, I don’t understand most of it.

I'm sure you know what the following words mean, but I'm going to poke them in here, because I think I've just clarified them to myself. Obviously, someone can look them up in a dictionary, but I'm not always good at turning a dictionary definition into a concept.

And please correct me if my "concepts" here are incorrect. I allow words to "grow" and "change" over time and situation. But if I'm clearly wrong, please correct me.

lemma: That's the term that means all conjugations of a verb collapse into a single unit. E.G., I am, he was, they were, I used to be, I will be, she is, all collapse into "to be".

It also simplifies plurals into singular. car, cars are just "car".

It may do a bit more, but that's the general idea.

Per my wikipedia-fu, this may be synonymous with a "headword" or "dictionary entry". (less sure about headword. That may be something else).

word family: Professor Arguelles uses this phrase to collapse words that share the same root into one, so it's a more general "collapse" than a lemma. E.G., acceptable, unacceptable, accepts, accepting, unaccepted, unacceptably, all collapse into "accept". And since "to accept" is a verb, this is more general than a lemma.

So for me, what I'm thinking about in the discussion (and Arguelles cites Nation, and Krashen :shock: ) is: do they mean lemma, or word family, or (not even sure if I got this term right) "word token", which I take to mean that "are, is, was" count as three "words"?
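To make the three counting units concrete, here's a toy sketch (the mappings are invented and tiny; real lemmatizers and word-family lists are vastly bigger):

```python
# Toy running text as a list of word tokens
tokens = ["accepts", "accepting", "unacceptable", "accept", "accepts"]

# Lemmatization collapses inflections only (invented minimal mapping)
lemma_of = {"accepts": "accept", "accepting": "accept"}

# Word families also collapse derivations (un-, -able, -ance, ...)
family_of = {**lemma_of, "unacceptable": "accept"}

forms = set(tokens)                             # unique word forms
lemmas = {lemma_of.get(t, t) for t in tokens}   # unique lemmas
families = {family_of.get(t, t) for t in tokens}  # unique word families

print(len(tokens), len(forms), len(lemmas), len(families))  # 5 4 2 1
```

Same text, four different "word counts" depending on which unit you pick, which is exactly why the terminology matters.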

And I'm not even trying to hold anyone to a particular definition. This may be a terrible sin, but I like to use "your definition" of whatever word you say. E.G., when you say "brother", you're not referring to "my brother", but rather "your brother", unless you're talking about "brotherhood", which still has me wanting to understand better what you mean by "brother". I.E., what makes a "brother"?

Oh and hello Iverson. I'd heard Professor Arguelles refer to your talk each time I listened to his lecture on Reading at the Polyglot Conference, but I just now spotted you in the audience. (I think :))
2 x

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Sun Aug 01, 2021 12:15 pm

Dragon27 wrote:
luke wrote:Beetlejuice. It was a hard movie to sit through.

Yeah, I remember not giving it a second thought (just some weird nonsense compound word, English is full of them; didn't watch the movie, btw), until one day I stumbled on the name of the star Betelgeuse in English (I immediately recognized from its written form that it's the same star we call "Бетельгейзе" in Russian) and decided to look up its pronunciation in English. As soon as I did that I instantly realized what Beetlejuice means (and that it's not actually a compound word at all). Life is full of little discoveries.

Amen brother.

And I haven't watched more than a few minutes of it either. :lol:
0 x

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Sun Aug 01, 2021 12:29 pm

ryanheise wrote:so the above analysis of Shrek whole vs Shrek Part 1 and Part 2 still tells you that length is not a factor.

Which also seems to be supported by the notion, and maybe even by Paul Nation, that longer works can be better for acquiring a subset of vocabulary. They don't so much add additional "words to learn", but they do "recycle" the words in the "document": the idea that you have to see a word in a variety of contexts to get a good handle on it.
0 x

s_allard
Blue Belt
Posts: 839
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 1740

Re: The statistical distribution of language difficulty

Postby s_allard » Sun Aug 01, 2021 12:55 pm

luke wrote:
s_allard wrote:Since I seem to be the only person who doesn’t understand some of the terminology and the methodology used here, I will just bow out and retreat to my academic cave. It’s kind of interesting that with 98% word coverage of this thread, I don’t understand most of it.

I'm sure you know what the following words mean, but I'm going to poke them in here, because I think I've just clarified them to myself. Obviously, someone can look them up in a dictionary, but I'm not always good at turning a dictionary definition into a concept.

And please correct me if my "concepts" here are incorrect. I allow words to "grow" and "change" over time and situation. But if I'm clearly wrong, please correct me.

lemma: That's the term that means all conjugations of a verb collapse into a single unit. E.G., I am, he was, they were, I used to be, I will be, she is, all collapse into "to be".

It also simplifies plurals into singular. car, cars are just "car".

It may do a bit more, but that's the general idea.

Per my wikipedia-fu, this may be synonymous with a "headword" or "dictionary entry". (less sure about headword. That may be something else).

word family: Professor Arguelles uses this phrase to collapse words that share the same root into one, so it's a more general "collapse" than a lemma. E.G., acceptable, unacceptable, accepts, accepting, unaccepted, unacceptably, all collapse into "accept". And since "to accept" is a verb, this is more general than a lemma.

So for me, what I'm thinking about in the discussion (and Arguelles cites Nation, and Krashen :shock: ) is: do they mean lemma, or word family, or (not even sure if I got this term right) "word token", which I take to mean that "are, is, was" count as three "words"?



Kudos for your excellent understanding of the terminology. I follow Nation and try to systematically use the term word families, although I may say just "words". Similarly, I use word tokens or word forms for all written units in a text.

It should be pointed out that there is a major problem in English with the handling of phrasal verbs where we have two or three units in what would be considered one word.

There are other issues such as polysemy where words have multiple meanings. There is also the major problem of idiomatic expressions.

It is also extremely important to keep in mind the distinction between word coverage and comprehension. Vocabulary studies are all about word coverage, i.e. the number of words you know relative to all the words in the text. Paul Nation speaks of 98% word coverage for unassisted comprehension.
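In code, word coverage is simply the known tokens divided by the total running tokens; a minimal sketch (the known-word set here is invented for illustration):

```python
def coverage(text_tokens, known_words):
    """Fraction of running tokens covered by the learner's known words."""
    known = sum(1 for t in text_tokens if t.lower() in known_words)
    return known / len(text_tokens)

tokens = "the cat sat on the mat".split()
# "sat" is unknown, so 5 of 6 running tokens are covered: 5/6 ≈ 0.83
print(coverage(tokens, {"the", "cat", "on", "mat"}))
```

Note that repeated tokens ("the" twice) each count toward coverage, which is why high coverage is much easier to reach than knowing a high proportion of the unique words.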

The elephant in the room here of course is that knowing the words and being able to understand the text are two different things. Words are in sentences and paragraphs, or they could be streams of sounds. What people actually see or hear and understand is a subject for another day.
2 x

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Sun Aug 01, 2021 2:26 pm

s_allard wrote:I follow Nation and try to systematically use the term word-families although I may say just words. Similarly I use word tokens or word forms for all written units in a text.

Even there, between two native speakers of English (although maybe French is strongest for you - not sure what language your mom usually used with you) - there can be trouble connecting the dots. You said "try to systematically use ... although I may just say", which brings up that we're not always precise in our language, and sometimes it's bothersome when someone is too pedantic. In computer languages, there's a term "language lawyer", which doesn't necessarily mean "pedantic", but does mean they know the rules of the language better than most need to. Obviously "lawyer" is related to those who parse human language and contracts for meaning and loopholes. There, courts or rivals decide who "wins". With computer languages, generally the computer decides, although there could be a bug in the compiler, or the governing body or company that makes the compiler can "change the rules".

Humans are a bit like individual "language compilers". They can add tricks (new features) and sometimes trick (try to deceive). "Trick", a simple common word with many usages, as you say below with "polysemy". "Polymorphic" is a computer language term that many others here would be better at introducing when rdearman or someone unveils the great "Computer languages vs human languages" debate. :lol:

s_allard wrote:It should be pointed out that there is a major problem in English with the handling of phrasal verbs where we have two or three units in what would be considered one word.

I don't know, but I'd imagine many languages have this feature, although they may use it in a different way, and English may be more liberal in this respect. Think reflexive verbs, direct and indirect objects, and phrases. In Spanish, as you surely know, "dar", "darse", "darse cuenta" are variations. With "me di cuenta", if you don't have all the parts, you don't know "who realized what", and it looks like a phrase. It's an idiom, and your point is that even knowing all three words still doesn't mean you have complete comprehension.

s_allard wrote:There are other issues such as polysemy where words have multiple meanings.

Yeah and that makes metrics less precise.

s_allard wrote:knowing the words and being able to understand the text are two different things.

Well said.
0 x

s_allard
Blue Belt
Posts: 839
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 1740

Re: The statistical distribution of language difficulty

Postby s_allard » Tue Aug 03, 2021 1:20 pm

luke wrote:
s_allard wrote:It should be pointed out that there is a major problem in English with the handling of phrasal verbs where we have two or three units in what would be considered one word.

I don't know, but I'd imagine many languages have this feature, although they may use it in a different way, and English may be more liberal in this respect. Think reflexive verbs, direct and indirect objects, and phrases. In Spanish, as you surely know, "dar", "darse", "darse cuenta" are variations. With "me di cuenta", if you don't have all the parts, you don't know "who realized what", and it looks like a phrase. It's an idiom, and your point is that even knowing all three words still doesn't mean you have complete comprehension.

s_allard wrote:There are other issues such as polysemy where words have multiple meanings.

Yeah and that makes metrics less precise.



And to make things even more complicated, one could add the figurative or metaphoric uses of language plus questions of cultural and historical references with proper nouns.

My own opinion of all this is that these frequency lists are rather blunt instruments for assessing readability and comprehension but we really don’t have much choice. They are certainly useful for the design of graded learning materials and dictionary making but for language hobbyists like us, I don’t see much use.

The simple reason is that a word in itself has little value; it's how the word is used in its various forms and in combination with other words that gives it meaning. Knowing that ser and haber are among the most common verbs in Spanish doesn't really tell us how to use them properly.

So the issue becomes: how does one acquire these word-form and usage combinations? This is the very subject of an excellent paper with the great title "How much input do you need to learn the most frequent 9,000 words?", by Paul Nation.

https://files.eric.ed.gov/fulltext/EJ1044345.pdf

Not surprisingly, Nation says that to acquire a large vocabulary you have to read widely and a lot. We all knew that but it’s good to see the science behind it.

All this said, I think that the OP's idea of being able to rate materials such as podcasts by difficulty is an interesting one. Wouldn't it be great to see such a rating (let's say B1 to C2, or even by vocabulary size) next to an article or a podcast? Of course, this is the whole point of graded learning materials.
3 x

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Tue Aug 03, 2021 2:30 pm

s_allard wrote:My own opinion of all this is that these frequency lists are rather blunt instruments for assessing readability and comprehension but we really don’t have much choice. They are certainly useful for the design of graded learning materials and dictionary making but for language hobbyists like us, I don’t see much use.

The simple reason is that a word in itself has little value.

But we have to start somewhere.

Thank you for reminding me of a passage I believe I read that said: En el principio, era el verbo (in the beginning, there was the word).
1 x

rpg
Orange Belt
Posts: 144
Joined: Fri Jul 21, 2017 2:21 pm
Languages: English (N), Spanish (B2), French (B1)
Language Log: https://forum.language-learners.org/vie ... =15&t=8368
x 428

Re: The statistical distribution of language difficulty

Postby rpg » Tue Aug 03, 2021 11:38 pm

ryanheise wrote:
How are you generating the frequency list in step 1, though? I got the impression you generated it from the same corpus that you're testing on, is that right? If so I don't think that's methodologically sound; I think the corpus should be independent (and ideally extremely large, of course).


The podcast corpus that I've built is already by this stage one of the largest spoken-English corpora in the world, containing 184 million tokens and 148 thousand unique lemmas. Once it becomes automated, this number is expected to increase multiple times over. Furthermore, it definitely covers the right type of language and words, in the frequency distributions that are relevant for the type of content being analysed, and the alternatives for this are slim pickings.

So the corpus should be extremely large, yes, but also importantly, it should cover the right type of language. You should not use the Wikipedia corpus to analyse movies, for instance, and you should not use a movie corpus to analyse fictional literature. We want to ensure that all of the words that we expect to find in podcast-type material are actually covered by this corpus with the right type of frequency distribution for this type of material.

One of the interesting things about spoken language corpora is that they have historically been difficult to build, because they typically involved a lot of manual transcription work. The corpora that are very large (such as my own) analyse existing transcripts, whether of TV shows, movies or, in my case, podcasts. Each of these has its own natural skew which can't really be avoided. It is slim pickings, but you have to ask yourself which of those three corpora would be the most useful if you're interested in comparing the difficulty of different podcasts. Movies, as we've seen, can have an entirely different character from podcasts because movies are often set in fantasy or fictional worlds and are not actually using the same set of words that we use here in the real world, discussing real things.

So I would say that yes, this corpus ticks the boxes that I needed for the project. Now, there will be an issue when extending this to other languages which don't have as much podcast content, as I won't be able to build as large a corpus. If I manage to find another spoken language corpus in that language larger than my own, I will use it until my own outgrows it, but at the same time, there may be even slimmer pickings in these other languages, and I may have no choice but to build my own corpus.


Thanks for the response! That makes sense. With 184 million tokens, let's say your most common word accounts for 5% (9.2 M occurrences); with a Zipf's law approximation, a word at the 10,000th position in the frequency list (a vocabulary size that covers almost all of the podcasts from your original post) gets around 920 occurrences--still large enough to be resilient, to some extent, to small biases.
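A quick sanity check of that back-of-envelope estimate, assuming a pure Zipf distribution where the frequency at rank r is the top word's frequency divided by r:

```python
corpus_tokens = 184_000_000
top_word_share = 0.05                        # assume the top word is ~5% of all tokens
top_count = corpus_tokens * top_word_share   # 9,200,000 occurrences

# Pure Zipf's law: count(rank r) ≈ count(rank 1) / r
rank = 10_000
print(round(top_count / rank))  # 920 occurrences at rank 10,000
```

Real frequency curves deviate from pure Zipf in the tail, so this is only an order-of-magnitude argument, but it supports the point that rank-10,000 words still have hundreds of observations in a corpus this size.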

My point (and I'm sure you're aware of this) was that with a smaller corpus you can get some distortionary effects. The words that are repeated in your podcast will also be higher in the frequency list because that same podcast was used to generate the frequency list (imagine the limiting case where your corpus was so small that it only contained that one podcast, for example, and then consider how it changes as the corpus increases). I do think it's a little conceptually cleaner even so to do some cross-validation: build your frequency list based on e.g. 80% of your podcasts and then use the other 20% as your test set to generate the chart. But I don't think the results would be much different because your corpus is pretty big.
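That 80/20 split could be sketched like this (a hypothetical helper, not anything from ryanheise's actual pipeline; documents would be shuffled before splitting so the test set isn't biased by ordering):

```python
import random

def split_corpus(docs, test_frac=0.2, seed=0):
    """Shuffle documents, then hold out test_frac for evaluation.
    Returns (train, test): train builds the frequency list,
    test is the held-out set used to generate the difficulty chart."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # fixed seed for reproducibility
    cut = int(len(docs) * (1 - test_frac))
    return docs[:cut], docs[cut:]

train, test = split_corpus(range(100))
print(len(train), len(test))  # 80 20
```

The key property is that no test podcast contributes to the frequency list it is scored against, which removes the self-reinforcement effect described above.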

The other thing is that I think the corpus that's most relevant for your typical language learner would be a mix between spoken and written language--almost all language learners learn from both types of source, I think. That's what the Routledge frequency dictionaries do too (mixing the two types), if I recall correctly. Obviously that brings its own complications for how you weight them, though.

Anyway I think this is really cool, thanks for sharing! I've been really interested in doing something very similar and then using the results to assign difficulty levels to particular texts (or in this case to particular podcasts) based on the vocab size required. Though I think that works a little better for written texts since they both contain a wider range of vocabulary and are free from some of the other variables of spoken language (speed/accent/audio quality/etc).
1 x
Super challenge 2020/21
French reading: 3935 / 5000      Spanish reading: 81 / 5000
French movies: 94 / 150       Spanish movies: 98 / 150

