The statistical distribution of language difficulty

ryanheise
Green Belt
Posts: 370
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1246

Re: The statistical distribution of language difficulty

Postby ryanheise » Sat Jul 31, 2021 2:15 pm

s_allard wrote:
ryanheise wrote:So slightly easier, but still on the difficult end (and looking at the vocabulary in Shrek compared to a typical conversational podcast, I am not too surprised by that.)


I’m really intrigued by the fact that the children’s movie Shrek requires more vocabulary than around 35000 adult podcasts most of which probably last 20 – 30 minutes each whereas Shrek lasts only 95 minutes.


Since I already explained before why length isn't significant, this time I'll try a demonstration rather than an explanation.

The experiment: I cut the movie Shrek into two halves.

Now, what do you expect the analysis to reveal? Do you expect that because each split is half the length that each split will be more comprehensible to people who have a smaller vocabulary?

Let's see. The original vocabulary size required for the whole was 11064.

After splitting it in two halves, the vocabulary size required for each half is:

1st half: 11869
2nd half: 10218

So the first half used slightly rarer words, but if we take the average of the two scores, we get something resembling the original number.

Now I'm not saying that we should get roughly the same vocabulary requirements in each half, but we should expect the average to come close to the original. The reason is that perhaps the first half of the movie was much more difficult to understand than the second half of the movie. So if we examine the whole, we'll get one figure, and if we examine each half, we'll get more localised figures.
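As a quick check of the arithmetic, here is a minimal sketch that averages the two half-movie figures quoted above and compares the result with the whole-movie figure. It uses a plain unweighted mean, which is an assumption made for illustration and not necessarily how the halves should be combined.

```python
# Figures quoted in the post above (learner vocabulary needed for 98% coverage).
whole = 11064
first_half, second_half = 11869, 10218

# Unweighted mean of the two halves (an illustrative choice).
mean_of_halves = (first_half + second_half) / 2
difference = whole - mean_of_halves
relative = abs(difference) / whole

print(mean_of_halves)      # 11043.5
print(difference)          # 20.5
print(f"{relative:.2%}")   # 0.19%
```

So the mean of the halves lands within about 0.2% of the figure for the whole movie.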

Sure, podcasts can be very conversational but isn’t a movie all conversational ?


(edit: someone else already made this point that movies are more carefully scripted, but I can't remember who, sorry!)

I know that when I'm trying to be eloquent, or if I'm writing a script for a talk or lecture, I will search for the best word to use even if it is a rarely used word, but if I'm speaking casually off the top of my head, the vocabulary that I tend to find right there at the top of my head are those words I use or hear frequently. A movie is carefully scripted.

The figure of 11094 words of required vocabulary reported here can be explained by differences of methodology that I will leave to the author to explain. But the question remains : what words of Shrek are so different from those of 35000 podcasts ?


You may have some idea from the words pasted above. If a movie is set in the middle ages with kings and queens, and uses words like "beset", do you expect that is what people talk about frequently in casual conversation?

The interesting tidbit that strikes me is that with 357 words one gets 80% coverage in Shrek. So few words for a lot of coverage and probably 0 comprehension. And to go from 95% to 98% coverage you have to more than double the number of words. Fascinating indeed.


Even at 98% coverage, it's not really a great way to determine comprehension.
2 x

s_allard
Blue Belt
Posts: 839
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 1740

Re: The statistical distribution of language difficulty

Postby s_allard » Sat Jul 31, 2021 5:27 pm

ryanheise wrote:
s_allard wrote:
ryanheise wrote:So slightly easier, but still on the difficult end (and looking at the vocabulary in Shrek compared to a typical conversational podcast, I am not too surprised by that.)


I’m really intrigued by the fact that the children’s movie Shrek requires more vocabulary than around 35000 adult podcasts most of which probably last 20 – 30 minutes each whereas Shrek lasts only 95 minutes.


Since I already explained before why length isn't significant, this time I'll try a demonstration rather than an explanation.

The experiment: I cut the movie Shrek into two halves.

Now, what do you expect the analysis to reveal? Do you expect that because each split is half the length that each split will be more comprehensible to people who have a smaller vocabulary?

Let's see. The original vocabulary size required for the whole was 11064.

After splitting it in two halves, the vocabulary size required for each half is:

1st half: 11869
2nd half: 10218

So the first half used slightly rarer words, but if we take the average of the two scores, we get something resembling the original number.

Now I'm not saying that we should get roughly the same vocabulary requirements in each half, but we should expect the average to come close to the original. The reason is that perhaps the first half of the movie was much more difficult to understand than the second half of the movie. So if we examine the whole, we'll get one figure, and if we examine each half, we'll get more localised figures.

The demonstration is very nice but I wasn't talking about the size of half of Shrek. What I pointed out is that in 95 minutes Shrek has more unique vocabulary than 34000 podcasts combined if I understand chart1 correctly. We would normally expect the podcasts to have some differences, however slight, in vocabulary between each other. One 30-minute podcast will probably use around 750 unique words but obviously the number of unique words in let’s say 20000 30-minute podcasts is much higher.

So it seems the 11064 words of Shrek are more than the unique words of all 34000 podcasts combined. I can’t believe that these podcasts contain only the small talk of people chatting about the same subjects.

I don’t think the vocabulary of Shrek is that exotic, since it’s aimed at children. When I looked at the script, excluding the stage directions, it didn’t seem very special. Paul Nation even gives an example of this in his article. There is certainly some special vocabulary in Shrek because we are dealing with an ogre but I don’t see how one movie for children can require more unique vocabulary than the unique vocabulary of 34000 podcasts combined.
0 x

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Sat Jul 31, 2021 6:21 pm

Dragon27 wrote:
Le Baron wrote:I can see how someone might falter over the compound, a 'housefly' could be anything up to and including a case of a flying house! :D

luke wrote:Yes, and same with 'horsefly'. A flying horse?

I don't know. From the way these compounds are formed I don't see how 'horsefly' can mean 'a flying horse' (at best, it could mean "the flight of a horse"). Besides, it certainly would be obvious that it shouldn't mean that from the context.

In my book, first exposure to "horsefly", if it comes from a conversation is going to generate more questions than answers. Knowing "horse" and "fly", even if I know that fly is sometimes a verb and sometimes an insect, I'm still processing.

Think the country. Barns. Horses. That's where a horsefly is likely to be seen or heard. Think spoken language.

"I saw a horsefly". "I saw a horse fly".

If I'd never heard of a horsefly, I'd be inclined to think they said "horse fly".

I've never seen a fly that looks like a horse.

I don't believe in flying horses. Why would they say that?

Is it some kind of cliche or saying?

I heard "until the cows come home" today, and that didn't have to do with cows. It was some kind of saying or something. Seemed like it meant "you have to wait a long time". But here, what does it mean? Who can I ask? Should I look it up on my phone?

And if I'm reading, horsefly. Is that an adverb? What's a horsef? Is it a typo? Maybe it's an irregular adverb? If something is done "horsefly", what does that mean in this context? Maybe it means "happy". It seems the horses around here are happy. But some are tired. Does it mean tired?

It doesn't make sense here.

Let's see. Policecar. Car for police. Cops need cars. Horsefly? Fly for horses? Why would a horse need a fly? Are they talking about "the environment", "ecosystems" and stuff? There's farmland all around. Maybe.

Let me try another compound word. Beetlejuice. That was a weird movie. Beetle Juice. Juice for Beetles. Or is it Juice of Beetles? I don't remember beetles or juice in the movie though. It was a hard movie to sit through. "Hard Days Night" by "The Beatles" was a good movie though. Wonder if I can get it on my phone now.

My phone doesn't get good reception here on the farm.

You can see how 1000.00 took me off course early in this discussion. :)
1 x

Lisa
Orange Belt
Posts: 155
Joined: Tue Jul 30, 2019 8:08 pm
Location: Oregon, United States
Languages: English (N) German (intermediate) Spanish (intermediate) French (beginner)
Language Log: https://forum.language-learners.org/vie ... 15&t=10854
x 465

Re: The statistical distribution of language difficulty

Postby Lisa » Sat Jul 31, 2021 7:44 pm

If you just see "horsefly" in a language test, you might not know what it means.
In a sentence, however... e.g., "Joe jumped when a horsefly bit him"... if you know what a "fly" is, seems like you could deduce easily that it was some kind of biting fly.
If you haven't seen, studied, or been bitten by a horsefly personally, that would be exactly as much comprehension as a native speaker. "Some kind of biting fly" is all I have as a definition for black fly, for example.
0 x

Le Baron
Blue Belt
Posts: 649
Joined: Mon Jan 18, 2021 5:14 pm
Location: The scullery
Languages: English (N), Nederlands, Français, Deutsch, Sranantongo (rusty), Esperanto.
Studying: Castellano, Indonesian, (Swahili - in storage). Dabbled in: Cantonese, Indonesian, Russian, Norwegian, Hawaiian.
x 1477

Re: The statistical distribution of language difficulty

Postby Le Baron » Sat Jul 31, 2021 8:24 pm

Lisa wrote:If you just see "horsefly" in a language test, you might not know what it means.
In a sentence, however... e.g., "Joe jumped when a horsefly bit him"... if you know what a "fly" is, seems like you could deduce easily that it was some kind of biting fly.
If you haven't seen, studied, or been bitten by a horsefly personally, that would be exactly as much comprehension as a native speaker. "Some kind of biting fly" is all I have as a definition for black fly, for example.


Come at it as a non-native speaker though; from the perspective of someone with a different alphabet even. '...when a horsefly bit him.' Depending on how far one has made it into the language, I can imagine from my own struggles in some languages how simple words can fool you. I might think 'do they mean some sort of thing on the horse? Like a strap or something or a horse-whip maybe and that it stung him when it hit?' I do this all the time with literary stuff at the beginning. I can't be the only one.
1 x
20 books for 2021: 16 / 20

luke
Green Belt
Posts: 427
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 1000

Re: The statistical distribution of language difficulty

Postby luke » Sat Jul 31, 2021 10:58 pm

Le Baron wrote:
Lisa wrote:In a sentence, however... e.g., "Joe jumped when a horsefly bit him"...

Come at it as a non-native speaker though; from the perspective of someone with a different alphabet even. '...when a horsefly bit him.' Depending on how far one has made it into the language, I can imagine from my own struggles in some languages how simple words can fool you. I might think 'do they mean some sort of thing on the horse? Like a strap or something or a horse-whip maybe and that it stung him when it hit?' I do this all the time with literary stuff at the beginning. I can't be the only one.

That's why I called you "my love" from the first time we spoke, even though neither of us are gay and you are married.

It gets me back to "appropriateness of level" in the whole language learning thing. How do we make things interesting and appropriate in difficulty (or perhaps better said, ease)?

These are individual questions. Everyone is different. We may have common characteristics. Some of us can be grouped and categorized. In so doing, perhaps understanding is increased, but perhaps something is lost.

I've just skimmed through a couple Helen Abadzi videos. Skimmed for time. For easy first exposure. Because she's kind enough to include powerpoint slides that hit her main points. Fortunate, knowing I can go back again and again. Oh, and she did a couple in Spanish, so later I can watch without feeling guilty or like a procrastinator.

What I picked up was her focus on automaticity. The fundamentals must be easy. They have to be mastered.

How do we do that? We all have our ways, even if notions like "fundamentals" and "mastery" and "automaticity" aren't ideas we use to conceptualize and we may or may not even know we're after.

Sometimes numbers help. Sometimes they're a crutch. Sometimes they're an obstacle. Example. I've put a "count the pages" progress bar in my signature. Can be a worthy goal. In some cases, perhaps number of pages read is a clear, indispensable metric on the way.

But, thinking now, it can be an obstacle if it's not measuring what is really needed right now. Continuing the example with "Cien años de soledad". What kind of idiot would be using that as a material when they haven't got the basics down yet. When I look in the mirror, I see a member of that set. The idiot.

But also the idiot savant. Right now, I need to continue listen/reading with Spanish/English (l/r), but that gives me 0 pages in the "pages read" metric. But it's perhaps what I should do in order to focus on the fundamentals, at least with respect to the goal of reading and enjoying that particular book.

Trying to bring this back to the actual topic. A tool, like ryanheise's difficulty and coverage rating of English podcasts, can help the learner with the broad goal of learning English through input with appropriately leveled material. It has broader implications, in that it can be applied to other languages and other input domains (videos, books, etc). In fact it seems to have grown out of similar investigations with books and reading and comprehension.

Just as some can pick apart Krashen for being a one-trick-pony, you can't deny that he moved the bar.

Similarly, I think ryanheise is moving the bar, whether incrementally, as most innovators do, or massively. It's still valuable. Not that anyone has argued against that fundamental point.

Anyone who wants to be a millionaire, I've got an idea for you. It's just a phrase. You can put it on T-shirts and coffee cups.

Life is complicated. Put on your big boy pants.

And the beauty of the world today is that I could find a website today and get a coffee cup or T-shirt that says exactly what I want it to say today.

So maybe it's not a million dollar idea.

But it's still a happy notion, a fundamental truth.
2 x

rpg
Orange Belt
Posts: 144
Joined: Fri Jul 21, 2017 2:21 pm
Languages: English (N), Spanish (B2), French (B1)
Language Log: https://forum.language-learners.org/vie ... =15&t=8368
x 428

Re: The statistical distribution of language difficulty

Postby rpg » Sun Aug 01, 2021 12:26 am

I haven't read all the text in this thread, but I just want to clarify the methodology here.

Here is what I believe the correct methodology for this is:

1) Generate a word frequency list.
2) For each podcast, find the smallest N such that if you knew the first N words in the frequency list, you would have 98% word coverage of the podcast.

I think this is what you are doing.

How are you generating the frequency list in step 1, though? I got the impression you generated it from the same corpus that you're testing on, is that right? If so I don't think that's methodologically sound; I think the corpus should be independent (and ideally extremely large, of course).
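The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `required_vocab_size` is a hypothetical helper name, the frequency list and tokens are made-up toy data, and real tooling would also need tokenisation, lemmatisation and a decision about proper nouns.

```python
def required_vocab_size(tokens, ranked_words, coverage=0.98):
    """Smallest N such that knowing the first N words of ranked_words
    covers at least `coverage` of the tokens. Returns None if the
    target cannot be reached with the words in the list."""
    total = len(tokens)
    if total == 0:
        return 0
    # 1-based rank of every word in the frequency list.
    rank = {word: i + 1 for i, word in enumerate(ranked_words)}
    # Ranks of the tokens that appear in the list, easiest first.
    # Tokens missing from the list can never be covered.
    token_ranks = sorted(rank[t] for t in tokens if t in rank)
    for covered, r in enumerate(token_ranks, start=1):
        if covered / total >= coverage:
            return r
    return None


# Toy frequency list (most frequent first) and a toy 10-token document.
ranked_words = ["the", "a", "ogre", "swamp", "beset"]
tokens = ["the", "ogre", "the", "swamp", "a", "the", "a", "the", "beset", "the"]

print(required_vocab_size(tokens, ranked_words, coverage=0.8))  # 3
print(required_vocab_size(tokens, ranked_words, coverage=0.9))  # 4
```

Note that the result depends on the ranks of the words used, not on how long the document is: one rare word that has to be known pushes N up to its rank.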
1 x
Super challenge 2020/21
French reading: 3935 / 5000      Spanish reading: 81 / 5000
French movies: 94 / 150       Spanish movies: 98 / 150

s_allard
Blue Belt
Posts: 839
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 1740

Re: The statistical distribution of language difficulty

Postby s_allard » Sun Aug 01, 2021 2:27 am

rpg wrote:I haven't read all the text in this thread, but I just want to clarify the methodology here.

Here is what I believe the correct methodology for this is:

1) Generate a word frequency list.
2) For each podcast, find the smallest N such that if you knew the first N words in the frequency list, you would have 98% word coverage of the podcast.

I think this is what you are doing.

How are you generating the frequency list in step 1, though? I got the impression you generated it from the same corpus that you're testing on, is that right? If so I don't think that's methodologically sound; I think the corpus should be independent (and ideally extremely large, of course).


Thank you. A very straightforward and simple observation. I'm glad somebody else said it so that I'm not the eternal bad boy in the thread. The methodology described here is of course what researchers like Paul Nation use.
0 x

ryanheise
Green Belt
Posts: 370
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1246

Re: The statistical distribution of language difficulty

Postby ryanheise » Sun Aug 01, 2021 2:35 am

s_allard wrote:
ryanheise wrote:Since I already explained before why length isn't significant, this time I'll try a demonstration rather than an explanation.

The demonstration is very nice but


It's just that I already explained to you multiple times why length is not a factor, and you weren't receptive to that (you continue to comment on length without even acknowledging those previous points about length) so I decided to try a different tack.

I wasn't talking about the size of half of Shrek. What I pointed out is that in 95 minutes Shrek has more unique vocabulary than 34000 podcasts combined if I understand chart1 correctly.


I've also already explained to you multiple times why document vocabulary size (the number of unique words in a document) is not a factor in graph 1. Bonus points if you can find it and quote it. I think it is more considerate for you to rephrase your discussion about a previously explained topic by including a reference to that previous explanation as a starting point, rather than starting from zero each time. It takes a lot of effort for me to keep repeating the same explanation multiple times.

We would normally expect the podcasts to have some differences, however slight, in vocabulary between each other. One 30-minute podcast will probably use around 750 unique words but obviously the number of unique words in let’s say 20000 30-minute podcasts is much higher.


So the comment/explanation I would like you to go and find is the explanation of your confusion between document vocabulary size vs learner vocabulary size.

Graph 1 plots learner vocabulary size on the x-axis:

Image

This other graph appearing in a later post plots document vocabulary size on the x-axis:

Image

They plot completely different things. Again, bonus points to you if you can find my previous comments about your confusion between document vs learner vocabulary size.

Length is a factor in that second graph, but it is not a factor in graph 1 as I've said multiple times before. So when you say "if I understand chart1 correctly", you do not understand chart 1 correctly now (even if maybe you did understand it in the past, you now appear to have forgotten the distinction between learner vocabulary and document vocabulary.)

The Shrek example is perfect since it isn't completely made up. It's a concrete example, just like you asked for. By splitting Shrek into two halves, we can see that length obviously has an effect on that second graph above, but not on that first graph.

* Shrek whole: 11064 learner vocabulary, 1170 document vocabulary, 11974 token count
* Shrek Part 1: 11869 learner vocabulary, 871 document vocabulary, 6566 token count(*)
* Shrek Part 2: 10218 learner vocabulary, 655 document vocabulary, 5408 token count

Where the learner vocabulary is how many words the learner has learnt from the graded vocabulary lists, document vocabulary is the number of unique words in the document, and token count is like the Microsoft Word count of the document, and where the vocabularies are lemmatised.
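To make the difference between token count and document vocabulary concrete, here is a small sketch that splits a script by line count and measures both. The script lines are made up (not the actual Shrek script), `tokenize`, `document_stats` and `split_by_line_count` are hypothetical helper names, and lowercasing stands in for real lemmatisation.

```python
import re


def tokenize(text):
    # Crude stand-in for tokenisation + lemmatisation:
    # lowercase the text and keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def document_stats(lines):
    tokens = [t for line in lines for t in tokenize(line)]
    return {
        "token_count": len(tokens),               # every word occurrence
        "document_vocabulary": len(set(tokens)),  # unique words only
    }


def split_by_line_count(lines):
    # Cut the script so that each part has (about) the same number of lines.
    mid = len(lines) // 2
    return lines[:mid], lines[mid:]


script = [
    "the ogre lives in the swamp",
    "the donkey talks to the ogre",
    "the dragon guards the castle",
    "the ogre and the donkey reach the castle",
]
part1, part2 = split_by_line_count(script)

print(document_stats(script))  # {'token_count': 25, 'document_vocabulary': 13}
print(document_stats(part1))   # {'token_count': 12, 'document_vocabulary': 8}
print(document_stats(part2))   # {'token_count': 13, 'document_vocabulary': 8}
```

As with the Shrek figures above, the token counts of the parts add up to the whole, while the document vocabularies do not, because the two parts share words. Both of these measures shrink when a document is cut; the learner vocabulary figure is computed differently and need not.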

If your question is how can this be so, then you'll find I've already explained the entire calculation in the past (the one you said wasn't really necessary, perhaps it was necessary after all?) By now I think I've written several explanations of that calculation, so if you find you now want to understand how it works, you have several explanations to choose from.

These numbers aren't directly comparable to Paul Nation's numbers since we count words in different ways, not just lemmas vs word families, we'd also have to look at the handling of proper nouns, etc. Despite those differences, it is still possible to make comparisons between two documents analysed by the same measurements, so the above analysis of Shrek whole vs Shrek Part 1 and Part 2 still tells you that length is not a factor on graph 1.

(*) edit: what I did above was cut the script by the line count so that part 1 and part 2 each had 396 lines of the script, and it ended up that there were slightly more tokens in the first half, although this doesn't have any significant effect on the results and doesn't change any of the above points.
Last edited by ryanheise on Sun Aug 01, 2021 6:04 am, edited 1 time in total.
2 x

ryanheise
Green Belt
Posts: 370
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1246

Re: The statistical distribution of language difficulty

Postby ryanheise » Sun Aug 01, 2021 3:48 am

rpg wrote:I haven't read all the text in this thread, but I just want to clarify the methodology here.

Here is what I believe the correct methodology for this is:

1) Generate a word frequency list.
2) For each podcast, find the smallest N such that if you knew the first N words in the frequency list, you would have 98% word coverage of the podcast.

I think this is what you are doing.


Correct.

How are you generating the frequency list in step 1, though? I got the impression you generated it from the same corpus that you're testing on, is that right? If so I don't think that's methodologically sound; I think the corpus should be independent (and ideally extremely large, of course).


The podcast corpus that I've built is already by this stage one of the largest spoken English corpora in the world, containing 184 million tokens, and 148 thousand unique lemmas. Once it becomes automated, this number is expected to increase multiple times over. Furthermore, it definitely covers the right type of language and words in the frequency distributions that are relevant for the type of content being analysed, and the alternatives for this are slim pickings.

So the corpus should be extremely large, yes, but also importantly, it should cover the right type of language. You should not use the Wikipedia corpus to analyse movies, for instance, and you should not use a movie corpus to analyse fictional literature. We want to ensure that all of the words that we expect to find in podcast-type material are actually covered by this corpus with the right type of frequency distribution for this type of material.
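A toy sketch of why the source of the frequency list matters: the same word can sit at a very different rank, or be missing entirely, depending on which corpus the list was built from. Both "corpora" below are tiny made-up strings chosen purely for illustration, and `frequency_ranks` is a hypothetical helper name; real frequency lists would come from millions of tokens.

```python
from collections import Counter


def frequency_ranks(corpus_tokens):
    """Map each word to its 1-based rank, most frequent first.
    Ties are broken alphabetically so the result is deterministic."""
    counts = Counter(corpus_tokens)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}


# Two made-up miniature corpora with different kinds of language.
podcast_like = "so yeah i think the market is like really good so yeah".split()
fantasy_like = "the king and the queen were beset by the dragon and the ogre".split()

podcast_ranks = frequency_ranks(podcast_like)
fantasy_ranks = frequency_ranks(fantasy_like)

for word in ["the", "yeah", "beset"]:
    # .get returns None when the corpus never saw the word at all.
    print(word, podcast_ranks.get(word), fantasy_ranks.get(word))
```

A word like "beset" has a rank in the fantasy-style list but does not appear in the conversational one at all, so the required vocabulary computed for a document containing it would differ depending on which list is used.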

One of the interesting things about spoken language corpora is that they have been historically difficult to build, because they typically involved a lot of manual work to transcribe. The corpora that are very large (such as my own) analyse existing transcripts, whether that be TV transcripts, movie transcripts or in my case, podcast transcripts. Each of these has its own natural skew which can't really be avoided. It is slim pickings, but you have to ask yourself which of those 3 corpora would be the most useful if you're interested in comparing the difficulty of different podcasts? Movies, as we've seen, can have an entirely different character than podcasts because movies are often set in fantasy or fictional worlds and are not actually using the same set of words that we use here in the real world, discussing real things.

So I would say that yes, this corpus ticks the boxes that I needed for the project. Now, there will be an issue when extending this to other languages which don't have as much podcast content, as I won't be able to build as large a corpus. If I manage to find another spoken language corpus in that language larger than my own, I will use it until my own outgrows it, but at the same time, there may be even slimmer pickings in these other languages, and I may have no choice but to build my own corpus.
9 x

