The statistical distribution of language difficulty

Cainntear
Black Belt - 3rd Dan
Posts: 3525
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French, Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8792
Contact:

Re: The statistical distribution of language difficulty

Postby Cainntear » Thu Jul 29, 2021 4:54 pm

ryanheise wrote:On the grammar side, I transform the document into grammatical symbols and repeat the same sort of analysis as above.

I do not currently factor in sentence length, word length, or anything like that. At least in my experience as an adult language learner, I find I am able to cope with long sentences in a foreign language, and it is primarily my familiarity with the words and the grammar in the sentence that dictates whether I will comprehend it. I can understand that children would struggle with sentence length, particularly when it results from recursive grammatical structures, and I think that is a matter of brain development. No doubt sentence length is a factor, but just a less significant one for me. I guess my formula was tailored to myself. (Perhaps I could do something to measure grammatical recursion rather than sentence length, though.)

I think you're probably right to do it that way.

Sentence length traditionally only serves as a rough proxy for the number of clauses in a sentence, and was used in older measures because it was quicker and easier.

I can't remember whether I used it for the Gaelic project. I definitely considered it, and as I recall, the clincher for me was coming across an interview transcript with a single sentence about 8 lines long, all joined with simple conjunctions like "and" and "then".

In any case, we were using a PoS tagger, so I had access to roughly 90%-accurate data on the conjunctions used and could measure complexity grammatically.
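
Roughly the kind of thing I mean, as a quick sketch in Python with spaCy (illustrative only, not the actual code from that project):

Code: Select all
import spacy

# Any spaCy model with a tagger and parser will do; this one is just for illustration.
nlp = spacy.load("en_core_web_sm")

def conjunction_profile(text):
    """For each sentence, count coordinating vs. subordinating conjunctions."""
    doc = nlp(text)
    profile = []
    for sent in doc.sents:
        coord = sum(1 for tok in sent if tok.pos_ == "CCONJ")   # "and", "but", "or"
        subord = sum(1 for tok in sent if tok.pos_ == "SCONJ")  # "because", "although", "if"
        profile.append({"length": len(sent), "coord": coord, "subord": subord})
    return profile

# An 8-line sentence strung together with "and" comes out long but with a high
# coord count and a low subord count, i.e. long without being grammatically complex.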
0 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Thu Jul 29, 2021 4:57 pm

s_allard wrote:I hate to be seen as nagging here but I really have a problem understanding just exactly how many documents I can understand with a vocabulary size of 5000 word types or families.

By the way, the fact that the lists are lemmatized changes much of the discussion.

When I look at the chart it’s difficult to see the exact number of podcasts that correspond to a 5000-word vocabulary size. It would seem to be about 8000 podcasts. As someone else has mentioned, I the language learner could choose from these 8000 documents and be sure that they are of my level, i.e. my vocabulary size will give 98% coverage of any randomly chosen document from the corresponding histogram.

Now that we are using lemmatized lists, I find these values more realistic but the fundamental problem remains. Don’t the 8000 documents each have to contain nearly identical words for my individual list to give me 98% coverage in all 8000 documents? How could it be otherwise?


Quite easily, actually. Let me illustrate with another example. Suppose we have a fictional language containing only 26 words: "a", "b", all the way up to "z", and just by chance, "a" happens to be the most frequent word, "b" the next most frequent, all the way down to "z", the least frequent of them all.

Now let's compare two fictional documents:

Document 1: a b c h x a b c d y
Document 2: w e f e h e f e f z

In fact, if you yourself possess a vocabulary of the 8 most frequent words, i.e. "a", "b", ... "h", and your target is 80% comprehension, then both of the above documents are comprehensible at 80% and yet have almost completely different words.

Document 1 uses these distinct words: a, b, c, d, h, x, y
Document 2 uses these distinct words: e, f, h, w, z

When it comes to Paul Nation's calculation it doesn't actually matter how many distinct words are used in a document. The only thing that matters is whether 80% (or whatever comprehension target you have) of the words in the document are in your vocabulary. So to Paul Nation's calculation, the two documents actually look like this:

Document 1: 1 1 1 1 0 1 1 1 1 0
Document 2: 0 1 1 1 1 1 1 1 1 0

where 1 means the word is within our vocabulary and 0 means it is not. After this analysis, in both cases, they are comprehensible at 80%.
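
If it helps to see the arithmetic mechanically, here is the same calculation as a tiny Python sketch (purely illustrative):

Code: Select all
# The learner knows the 8 most frequent words of the fictional language.
known = set("abcdefgh")

doc1 = "a b c h x a b c d y".split()
doc2 = "w e f e h e f e f z".split()

for name, doc in [("Document 1", doc1), ("Document 2", doc2)]:
    coverage = sum(1 for word in doc if word in known) / len(doc)
    print(name, f"{coverage:.0%}")   # both print 80%

# Both documents hit the 80% target even though their distinct words barely overlap.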

s_allard wrote:But if those podcasts have individual vocabulary differences - as I suspect - despite having identical vocabulary sizes, then it's a whole different matter.


Your mistake again is to think about the vocabulary size of the document when what is actually being considered is the vocabulary size of the learner. The vocabulary size of the document is not a factor in Paul Nation's calculation.

For example, consider this very short sentence from a computer science text book:

"Java subclasses imply subtypes."

Not comprehensible to most people, I would say. And yet the document vocabulary size is only 4.

Now consider the following document.

"I like playing with my friends and I like watching movies."

This is more comprehensible than the previous sentence, despite having a larger document vocabulary size of 9 words.

What matters in the comprehensibility estimation is not the vocabulary size of the document, but rather the vocabulary size of the learner. If the learner knows the top 5,000 most frequently used words in a language, they will be able to understand a "variety" of different documents that each use different subsets of that 5,000. A document will be classified into a bucket based on the least frequent word you need to know in order to understand the document at the target comprehension level. In my previous example, "h" was the least frequent word in the language that you needed to know in order to understand both documents at 80% comprehension, even though both documents used different subsets of the words "a" up to "h".
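
In code, the bucketing idea might look something like this rough sketch (illustrative only, not my actual implementation):

Code: Select all
def classify(doc_tokens, freq_ranked_words, target=0.98):
    """Return the rank of the least frequent word the learner must know
    to reach the target coverage for this document."""
    for n in range(1, len(freq_ranked_words) + 1):
        known = set(freq_ranked_words[:n])
        coverage = sum(1 for w in doc_tokens if w in known) / len(doc_tokens)
        if coverage >= target:
            return n
    return None  # target unreachable even with the full frequency list

freq = list("abcdefghijklmnopqrstuvwxyz")  # the 26-word fictional language
print(classify("a b c h x a b c d y".split(), freq, target=0.80))  # 8, i.e. "h"
print(classify("w e f e h e f e f z".split(), freq, target=0.80))  # 8, i.e. "h"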
Last edited by ryanheise on Fri Jul 30, 2021 1:57 am, edited 3 times in total.
4 x

User avatar
Le Baron
Black Belt - 3rd Dan
Posts: 3578
Joined: Mon Jan 18, 2021 5:14 pm
Location: Koude kikkerland
Languages: English (N), fr, nl, de, eo, Sranantongo,
Maintaining: es, swahili.
Language Log: https://forum.language-learners.org/vie ... 15&t=18796
x 9560

Re: The statistical distribution of language difficulty

Postby Le Baron » Thu Jul 29, 2021 5:49 pm

In my experience it's not merely the size of vocabulary/knowledge of words that makes a text more comfortable to read, but other factors too: types of sentence structure, length, and the use of known words in a way that is culturally recognised by (some) natives but a mystery to an L2 reader.

Often I know the words of a text, but if words even just outside the first 1000/2000 appear very frequently, e.g. several in one sentence and more in the sentences that follow, I slow down. The text becomes harder work. The same goes if the author is being very literary/stylistic.

There is more to reading texts than knowing words, even if that does help a great deal.
2 x
Pedantry is properly the over-rating of any kind of knowledge we pretend to.
- Jonathan Swift

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2369

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Jul 29, 2021 7:04 pm

ryanheise wrote:...
What matters in the comprehensibility estimation is not the vocabulary size of the document, but rather the vocabulary size of the learner. If the learner knows the top 5,000 most frequently used words in a language, they will be able to understand a "variety" of different documents that each use different subsets of that 5,000. A document will be classified into a bucket based on the least frequent word you need to know in order to understand the document at the target comprehension level. In my previous example, "h" was the least frequent word in the language that you needed to know in order to understand both documents at 80% comprehension, even though both documents used different subsets of the words "a" up to "h".


We seem to be making some progress here. I’ll admit that I have trouble understanding the rather artificial, abstract examples. I would prefer more realistic examples with real words and, as I mentioned earlier, I suggest we don't bother with Nation because it just complicates things.

Since we are now talking about lemmatized lists, vocabulary size is synonymous with vocabulary breadth or the number of different words. This is different from word token count. I agree that document length is irrelevant here.

Since we are using 98% coverage as necessary for full comprehension, I take this to mean that 98% of all the distinct words in the podcast must be in the list of words known by the listener, regardless of the length of the podcast. I the listener have a vocabulary of the 5000 most common words in the language. Podcast A has 3000 tokens and 2000 unique words, of which only 40 are not in my list. Podcast B has 8000 tokens and 3000 unique words, of which 60 are not in my list. Podcast C has 20000 tokens and 5000 unique words, of which 100 are not in my list. I understand all three podcasts fully. No problem.

Going back to the first chart, in which histograms do I put these three podcasts? They would all go in the histogram for the 5000-word family size if any of the words used are from the 4000 to 5000 word band of my vocabulary. These podcasts are of different durations, i.e. different total token counts. Here I will admit that I was mistaken in assuming that the podcasts were all of the same token size. (At the time, the lists were not lemmatized.)

I have asked a number of times: Don’t the podcasts within the same histogram or bucket have to have nearly identical vocabularies? The answer is no, as long as 98% of their vocabulary comes from the 5000-word list. But that said, I tend to think that podcasts of similar vocabulary breadth, i.e. similar numbers of unique words, would tend to have very similar sets of word tokens. For example, two podcasts that each use the same 5000 unique words from my vocabulary set would have nearly identical token vocabularies, with small differences in the two sets of 100 words that are not in my vocabulary.

So, having modified my understanding in light of the use of lemmatized lists, I will still say that the first chart is in my opinion very misleading because it does not take into account the cumulative effect as the number of known words increases. For example, at the 6000-word level, the « # of podcast episodes comprehensible at 98% » should increase, not decrease, and so forth for the other levels. What the chart does show is how many podcasts have a vocabulary of 5000 to 6000 unique words.

The same reasoning applies to the 5000-word user vocabulary size for podcasts with 4000 to 5000 unique words. This of course means that within a given histogram, the podcasts will tend to have similar vocabularies, with some variation.

Quite a fun discussion indeed.
0 x

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2369

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Jul 29, 2021 7:18 pm

Le Baron wrote:In my experience it's not merely the size of vocabulary/knowledge of words that makes a text more comfortable to read, but other factors too: types of sentence structure, length, and the use of known words in a way that is culturally recognised by (some) natives but a mystery to an L2 reader.

Often I know the words of a text, but if words even just outside the first 1000/2000 appear very frequently, e.g. several in one sentence and more in the sentences that follow, I slow down. The text becomes harder work. The same goes if the author is being very literary/stylistic.

There is more to reading texts than knowing words, even if that does help a great deal.


I agree with you wholeheartedly although you wouldn't believe it from all that heavy prose I just devoted to vocabulary size. The veterans of this forum and the old HTLAL know that I believe that counting words is basically a waste of time because the word is not a fundamental unit of meaning. I have become infamous for supposedly claiming that one could pass a C2 speaking exam using less than 500 words in any language. I have also been nearly tarred and feathered for saying that you only need about 300 words to start speaking a language. I now avoid these topics like the plague and prefer to demonstrate by actions.

Vocabulary is of course very important but the way I see it you acquire the words as you need them. I learn new words in English, French, Spanish and German nearly every day. I don't know how many I know and I really don't care. What I do care about is how to use the words I know well, and especially the idioms. Discourse is the real challenge: how to mobilize your thoughts and have the language come rolling out of your mouth. If only it were easy.
1 x

Beli Tsar
Green Belt
Posts: 384
Joined: Mon Oct 22, 2018 3:59 pm
Languages: English (N), Ancient Greek (intermediate reading), Latin (Beginner), Farsi (Beginner), Biblical Hebrew (Beginner)
Language Log: https://forum.language-learners.org/vie ... =15&t=9548
x 1294

Re: The statistical distribution of language difficulty

Postby Beli Tsar » Thu Jul 29, 2021 8:26 pm

s_allard wrote:I have asked a number of times: Don’t the podcasts within the same histogram or bucket have to have nearly identical vocabularies? The answer is no, as long as 98% of their vocabulary comes from the 5000-word list. But that said, I tend to think that podcasts of similar vocabulary breadth, i.e. similar numbers of unique words, would tend to have very similar sets of word tokens. For example, two podcasts that each use the same 5000 unique words from my vocabulary set would have nearly identical token vocabularies, with small differences in the two sets of 100 words that are not in my vocabulary.

This is simply not true, and the examples the OP and rdearman have given do make this clear. There is no requirement for the podcasts to cover all the words on the list, so they can differ both in the words that are in your vocabulary and in those that are not.

It's just the same as the way graded readers can be about radically different subjects while still sticking to a very narrow set of words - often much narrower than 5000!
6 x
: 0 / 50 1/2 Super Challenge - Latin Reading
: 0 / 50 1/2 Super Challenge - Latin 'Films'

s_allard
Blue Belt
Posts: 985
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2369

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Jul 29, 2021 11:56 pm

Beli Tsar wrote:
s_allard wrote:I have asked a number of times: Don’t the podcasts within the same histogram or bucket have to have nearly identical vocabularies? The answer is no, as long as 98% of their vocabulary comes from the 5000-word list. But that said, I tend to think that podcasts of similar vocabulary breadth, i.e. similar numbers of unique words, would tend to have very similar sets of word tokens. For example, two podcasts that each use the same 5000 unique words from my vocabulary set would have nearly identical token vocabularies, with small differences in the two sets of 100 words that are not in my vocabulary.

This is simply not true, and the examples the OP and rdearman have given do make this clear. There is no requirement for the podcasts to cover all the words on the list, so they can differ both in the words that are in your vocabulary and in those that are not.

It's just the same as the way graded readers can be about radically different subjects while still sticking to a very narrow set of words - often much narrower than 5000!


I think I said quite clearly in my own answer that podcasts in the same histogram generally do not need to have identical vocabularies. But what happens when multiple podcasts contain a very large proportion of the listener's vocabulary? Say a number of podcasts each contain 5000 unique words and I can fully comprehend each podcast with 98% word coverage. Doesn't that mean that each podcast has the same common 5000 words from my list, plus the small two percent of individual variation? What other words can there be?

I'm not saying that the documents are identical. That's a question of creativity. I'm saying that if a large number of words are shared with 98% coverage, the vocabularies will be similar by definition. I don't know if this helps but suppose the required word coverage for maximum comprehension is 100%. Given my 5000-word vocabulary, I find some podcasts that have a vocabulary of 5000 words each. No podcast can contain a word not on my list and every podcast contains all my words. It seems to me that by definition all the podcasts must therefore contain the same words.

Similarly, if some graded readers are required to use only the first 1000 word families of a beginner's list and nothing else, the length of the readers can vary and the subjects can be as different as the vocabulary allows, but the vocabularies of the readers will be very similar because they come from the same place.
0 x

User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3632

Re: The statistical distribution of language difficulty

Postby luke » Fri Jul 30, 2021 12:09 am

ryanheise wrote:Now let's compare two fictional documents:

Document 2: w e f e f e f e f z
Document 2 uses these distinct words: e, f, w, z
Document 2: 0 1 1 1 1 1 1 1 1 0

So, like if some, like podcaster, like actually used, like, certain words, like all the time, like, even like a word like "like" a lot, it could like, skew the numbers?

Thank you for brightening my day!

"Like" and "actually" are in the top 500, but perhaps because podcasters say things like "like my content" and not because they're teenagers who don't know how to, like, compose a sentence.

And with Beli Tsar's insightful, tentative conclusion:

Beli Tsar wrote:If your analysis is right, have you already demonstrated one valuable thing - that podcasts, because they are conversational, are actually a great place for language learners to start listening, much better than films? Perhaps we knew that already, ..., but might that be a plausible interpretation at least?

Perhaps ryanheise has spawned a new and innovative "podcast frequency dictionary" genre for language learners.

It's actually rather exciting.
3 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Fri Jul 30, 2021 4:40 am

s_allard wrote:
ryanheise wrote:...
What matters in the comprehensibility estimation is not the vocabulary size of the document, but rather the vocabulary size of the learner. If the learner knows the top 5,000 most frequently used words in a language, they will be able to understand a "variety" of different documents that each use different subsets of that 5,000. A document will be classified into a bucket based on the least frequent word you need to know in order to understand the document at the target comprehension level. In my previous example, "h" was the least frequent word in the language that you needed to know in order to understand both documents at 80% comprehension, even though both documents used different subsets of the words "a" up to "h".


We seem to be making some progress here. I’ll admit that I have trouble understanding the rather artificial, abstract examples. I would prefer more realistic examples with real words and, as I mentioned earlier, I suggest we don't bother with Nation because it just complicates things.


I will try to come up with a less abstract example for you, since it is actually critical to understand the Nation calculation if you want to understand graph 1.

Now, I can't give you an actual document from my corpus since such concrete examples are too large to do the calculation on by hand. That is why I tried to invent an artificial language with a smaller vocabulary, and I assumed you would have no problem with an artificial language since you are a language learner after all!

But OK. I'll now invent a new artificial language, but this time it will be a subset of English.

Let's suppose the complete language consists of these words, listed from most frequent to least frequent:

1. I
2. and
3. a
4. have/has
5. to
6. like/likes
7. on
8. my
9. read
10. book/books
11. black
12. red
13. cat
14. hat
15. button/buttons
16. philosophy


We can create vocabulary lists for language learners based on the above frequency list, by choosing arbitrary buckets or grades.

Bucket 1 / grade 1 vocabulary list: I, and
Bucket 2: a, have/has
Bucket 3: to, like/likes
Bucket 4: on, my
Bucket 5: read, book/books
Bucket 6: black, red
Bucket 7: cat, hat
Bucket 8: button/buttons, philosophy

The buckets are arbitrary; you can make each bucket larger if you want, say by having 4 buckets with a vocabulary list of 4 words in each.

Now, let's consider two documents:

Document 1: I have a black hat. I have a black button.
Document 2: My cat likes to read my red books on philosophy.

Now, if the target is 90% comprehensibility, then document 1 would be placed into bucket 7, because if the learner has acquired all vocabulary lists up to the one containing "hat", comprehension will look like this:

I have a black hat. I have a black ____.

You've got 90% of the document within your acquired vocabulary.

90% comprehension for document 2 would look like this:

My cat likes to read my red books on _________.

And once again that goes into bucket 7, because the learner needs to have learnt up to vocabulary list number 7, which includes the word "cat".

But despite these documents both going into the same bucket, that does not imply ANYTHING about the breadth of vocabulary actually used in each document, how many unique words it has, how many times a word is repeated in the document, or anything like that. As you can plainly see, these two documents are completely different in those respects. I have said this before, and other forum members have also pointed it out, but since you keep bringing up things such as document vocabulary size, unique tokens, etc., I just want to say it again: these things are not factors in the Paul Nation calculation as used in graph 1, and it gets us nowhere to keep talking about them if you truly want to understand what is depicted in graph 1.

Paul Nation's calculation is very basic (which is why I criticise it). It only looks at what is the most advanced word you need to know in each document. In document 1, that is "hat", and in document 2, that is "cat". Once the calculation finds that word, it pretty much ignores every other word in the document and says "hat". Right, that's bucket 7 for you. "cat" - right, that's bucket 7 for you, too.
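
For anyone who prefers code to prose, here is the same toy calculation as a short Python sketch (illustrative only, and the lemmatiser is a crude stand-in that only handles this example):

Code: Select all
freq = ["i", "and", "a", "have", "to", "like", "on", "my",
        "read", "book", "black", "red", "cat", "hat", "button", "philosophy"]
rank = {w: i + 1 for i, w in enumerate(freq)}           # 1 = most frequent

def lemma(word):
    """Crude lemmatiser for this toy language: strip trailing '.' and a plural/3rd-person 's'."""
    w = word.lower().strip(".")
    return w[:-1] if w not in rank and w.endswith("s") else w

def bucket(text, target=0.90, bucket_size=2):
    tokens = [lemma(w) for w in text.split()]
    for n in range(1, len(freq) + 1):                    # grow the vocabulary one rank at a time
        known = set(freq[:n])
        if sum(1 for t in tokens if t in known) / len(tokens) >= target:
            return (n + bucket_size - 1) // bucket_size  # rank of last needed word -> bucket
    return None

print(bucket("I have a black hat. I have a black button."))        # 7 ("hat" is rank 14)
print(bucket("My cat likes to read my red books on philosophy."))  # 7 ("cat" is rank 13)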
Last edited by ryanheise on Fri Jul 30, 2021 8:50 am, edited 1 time in total.
3 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Fri Jul 30, 2021 7:40 am

luke wrote:
ryanheise wrote:Now let's compare two fictional documents:

Document 2: w e f e f e f e f z
Document 2 uses these distinct words: e, f, w, z
Document 2: 0 1 1 1 1 1 1 1 1 0

So, like if some, like podcaster, like actually used, like, certain words, like all the time, like, even like a word like "like" a lot, it could like, skew the numbers?


If we look at the entire corpus, "like" is indeed a much more frequent word, although skew can actually be considered a feature. There are different corpora for scientific literature, for fictional literature, for spoken language, etc. If your interest is in spoken language, then you'll want a corpus built from actual spoken-language source material, and not, say, a corpus built from articles on Wikipedia, which may be skewed in a very different direction.

If we look at just a particular podcaster who uses "like" more often than the average podcaster, then my current difficulty formula will treat that like any other word whose frequency is inflated above the norm. I don't think this means the assigned difficulty score will be terribly off, though. "like" really is an easy word, and it gets counted as such in the proportion in which it occurs. But if the podcaster also uses difficult words, those get factored in as well. Since I do something closer to (but not exactly the same as) estimating how many sentences are comprehensible, a difficult word appearing in the same sentence as an easy word will eclipse the easy word. So, like, a podcaster expounding on, like, coherentism vs internalism vs externalism and other, like, schools of like, epistemology, is still going to be classified as difficult despite using the word "like" a lot.

Sure, we could refine this. It could be argued that nouns and verbs should be weighted more heavily than adverbs. However, a word like "like" has different meanings in different parts of speech: sometimes it's a verb, and other times it's more of an adverb. I currently do not separate frequency data for the different meanings of the same word, and there would be a cost associated with doing that. So it's about finding a balance between a scoring method that is good enough and also efficient enough to compute.
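
As a very rough sketch of that "hardest word in the sentence" idea (not my actual formula, which differs in the details):

Code: Select all
import re

def sentence_comprehensibility(text, rank, vocab_size):
    """Share of sentences whose least frequent word is still within the learner's
    vocabulary. rank maps each word to its frequency rank (1 = most frequent)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    ok = 0
    for s in sentences:
        words = re.findall(r"[a-z']+", s.lower())
        # The least frequent (highest-rank) word decides the whole sentence,
        # so a pile of "like"s cannot rescue a sentence containing "epistemology".
        hardest = max((rank.get(w, float("inf")) for w in words), default=0)
        if hardest <= vocab_size:
            ok += 1
    return ok / len(sentences)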

luke wrote:Thank you for brightening my day!

"Like" and "actually" are in the top 500, but perhaps because podcasters say things like "like my content" and not because they're teenagers who don't know how to, like, compose a sentence.


I just picked out a random podcast that used the word "like", and here is what I stumbled upon:

So he invited six other couples that we knew to the house for a dinner party in honor of my birthday with the purpose of celebrating me but also with sharing from their life, their experiences, their backgrounds, what, what it means to kind of cherish the time that I have now and how I can best like utilize who I am and my gifts to like, to really like, dive in. And like if there was ever a point to like, stand and applaud my husband, it was like this moment that he invited all of these different people and for most of them, their only connection was me.

:)

luke wrote:And with Beli Tsar's insightful, tentative conclusion:

Beli Tsar wrote:If your analysis is right, have you already demonstrated one valuable thing - that podcasts, because they are conversational, are actually a great place for language learners to start listening, much better than films? Perhaps we knew that already, ..., but might that be a plausible interpretation at least?

Perhaps ryanheise has spawned a new and innovative "podcast frequency dictionary" genre for language learners.

It's actually rather exciting.


Thanks Luke and Beli Tsar! Such comments really do help reaffirm my motivation to keep going, even though the amount of work so far has been daunting and the amount still to go is just as daunting. If I keep going, I'll get there eventually.
3 x

