The statistical distribution of language difficulty

General discussion about learning languages
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Wed Aug 04, 2021 2:55 am

rpg wrote:My point (and I'm sure you're aware of this) was that with a smaller corpus you can get some distortionary effects. The words that are repeated in your podcast will also be higher in the frequency list because that same podcast was used to generate the frequency list (imagine the limiting case where your corpus was so small that it only contained that one podcast, for example, and then consider how it changes as the corpus increases). I do think it's a little conceptually cleaner even still to do some cross-validation: build your frequency list based on eg 80% of your podcasts and then use the other 20% as your test set to generate the chart. But I don't think the results would be much different because your corpus is pretty big.

The other thing is that I think the corpus that's the most relevant for your typical language learner would be a mix between spoken and written language--almost all language learners learn from both types of source, I think. That's what the Routledge frequency dictionaries do too (mixing the two types), if I recall correctly. Obviously that brings its own complications for how you weigh them though.
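The 80/20 cross-validation rpg describes might look something like the following sketch (toy corpus and a hypothetical tokenised-document input format, not anyone's actual pipeline): the frequency list is built from the training split only, so coverage measured on the held-out split is not inflated by self-counting.

```python
import random
from collections import Counter

def build_frequency_list(docs, test_fraction=0.2, seed=0):
    """Shuffle documents, hold out a test split, and build the frequency
    list from the remaining training split only, so that coverage measured
    on the test split is not inflated by self-counting."""
    rng = random.Random(seed)
    docs = docs[:]
    rng.shuffle(docs)
    n_test = max(1, int(len(docs) * test_fraction))
    test_docs, train_docs = docs[:n_test], docs[n_test:]
    freq = Counter(tok for doc in train_docs for tok in doc)
    ranked = [word for word, _ in freq.most_common()]
    return ranked, test_docs

def coverage(ranked, doc, vocab_size):
    """Fraction of a document's tokens covered by the top-N ranked words."""
    known = set(ranked[:vocab_size])
    return sum(tok in known for tok in doc) / len(doc)

# Toy corpus of pre-tokenised "podcasts" (hypothetical input format)
docs = [["the", "cat", "sat"], ["the", "dog", "ran"],
        ["a", "cat", "ran"], ["the", "cat", "ran", "home"]]
ranked, test_docs = build_frequency_list(docs)
```

With a corpus as large as the one under discussion, the train-only and whole-corpus frequency lists would converge, which is why the results would barely differ.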


Now, the catch is that this difficulty score was developed from the perspective of a language learner who is on a mission to learn new words, so it is not just measuring how easy the known words are, but also how easy it would be to learn the unknown words. When I started developing this score, I was experimenting on myself as a language learner, and I found that it was easier to learn new words if they were repeated frequently within the episode I was listening to, almost regardless of their frequency in the global corpus. So the end result is that the score is less sensitive to the global frequency than you might think, and bias in any single document is actually not something that I want to completely avoid because I see it as a benefit to the learner. There are many rare words in cooking podcasts, but if someone is interested in cooking, then I would consider it good learning material as long as those words are of interest to the learner and they are repeated a lot within that material.
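A toy sketch of that principle (NOT the actual scoring formula, just an illustration of the weighting idea; the constants are invented): within-episode repetition dominates, and the global frequency rank contributes only a small correction.

```python
import math
from collections import Counter

def learnability(word, episode_tokens, global_rank, rank_weight=0.3):
    """Toy sketch (not the real score): an unknown word is easier to pick
    up when it repeats often within the episode; its global corpus rank
    contributes only a small correction."""
    repeats = Counter(episode_tokens)[word]
    local_ease = 1 - 1 / (1 + repeats)          # grows with repetition
    rank = global_rank.get(word, 50_000)        # unseen words count as rare
    global_ease = 1 / (1 + math.log1p(rank))    # decays slowly with rarity
    return (1 - rank_weight) * local_ease + rank_weight * global_ease
```

Under this kind of weighting, a rare cooking term repeated ten times in one episode scores as more learnable than a mid-frequency word that appears once, which matches the experience described above.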
1 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Wed Aug 04, 2021 12:30 pm

luke wrote:
s_allard wrote:My own opinion of all this is that these frequency lists are rather blunt instruments for assessing readability and comprehension but we really don’t have much choice. They are certainly useful for the design of graded learning materials and dictionary making but for language hobbyists like us, I don’t see much use.

The simple reason is that a word in itself has little value.

But we have to start somewhere.

Thank you for reminding me of a passage I believe I read that said: En el principio, era el verbo (in the beginning, there was the word).


It’s true we have to start somewhere. The problem is that we tend to get trapped in the idea that a language is a collection of words and that learning a language is tantamount to memorizing a bunch of words. So if we learn x number of words a day with y repetitions in an SRS app, we can count the number of months to learn the 6000 most common words and become « fluent » in the target language. If only it were so simple.

We see variations of this approach from time to time in this forum. It makes for interesting statistics but it does not work, it never has and never will. For a simple reason we all know so well : the word is not a unit of discourse. Vocabulary is a byproduct of the learning process. In the end you will have acquired that 6000-word vocabulary that you coveted and you will certainly have used dictionaries and even an SRS app along the way but you don’t start with just lists of words.
2 x

luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: The statistical distribution of language difficulty

Postby luke » Wed Aug 04, 2021 1:48 pm

s_allard wrote:
luke wrote:
s_allard wrote:The simple reason is that a word in itself has little value.

But we have to start somewhere.


It’s true we have to start somewhere. The problem is that we tend to get trapped in the idea that a language is a collection of words and that learning a language is tantamount to memorizing a bunch of words. So if we learn x number of words a day with y repetitions in an SRS app, we can count the number of months to learn the 6000 most common words and become « fluent » in the target language.

It makes for interesting statistics but it does not work, it never has and never will. For a simple reason we all know so well : the word is not a unit of discourse. Vocabulary is a byproduct of the learning process. In the end you will have acquired that 6000-word vocabulary that you coveted and you will certainly have used dictionaries and even an SRS app along the way but you don’t start with just lists of words.

Sometimes I find myself focusing too much on a particular detail or event. It generally gets me off track. I highlighted a bit of what you wrote and I hope in so doing that I haven't "gone off track".

I disagree with the first part that I bolded. Vocabulary is a necessary ingredient. And you're not saying it's not. But the bolded part seems similar to the mother of the bride criticizing her daughter for wanting a good wedding cake. And then the mother saying in a huff, "you think a good marriage is all about wedding cake and as long as the wedding cake is good, the marriage will be successful, well, you're wrong". And I'm thinking the bride is not so delusional but does want a good cake.

On the second bit, it sounds like you're in complete agreement and perhaps haven't realized it yet. Using podcasts can help listening, implicit grammar, expanding vocabulary knowledge, increasing "time on task", etc. Then the question is, "how do I find an appropriate podcast"? Whether one is asking the question simply for themselves and would be accomplished with less effort by just trial and error, or one is using another one of their "other" intelligences to come up with a general (although incomplete) solution. Individual learners can look for "general solutions" whether they have realized they ultimately need a "personal solution" or not.

And general solutions may be helpful to others (top down). And that doesn't discount the value of simply personalizing someone else's experience (bottom up).
2 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Thu Aug 05, 2021 12:50 pm

luke wrote:
s_allard wrote:
luke wrote:
s_allard wrote:The simple reason is that a word in itself has little value.

But we have to start somewhere.


It’s true we have to start somewhere. The problem is that we tend to get trapped in the idea that a language is a collection of words and that learning a language is tantamount to memorizing a bunch of words. So if we learn x number of words a day with y repetitions in an SRS app, we can count the number of months to learn the 6000 most common words and become « fluent » in the target language.

It makes for interesting statistics but it does not work, it never has and never will. For a simple reason we all know so well : the word is not a unit of discourse. Vocabulary is a byproduct of the learning process. In the end you will have acquired that 6000-word vocabulary that you coveted and you will certainly have used dictionaries and even an SRS app along the way but you don’t start with just lists of words.

Sometimes I find myself focusing too much on a particular detail or event. It generally gets me off track. I highlighted a bit of what you wrote and I hope in so doing that I haven't "gone off track".

I disagree with the first part that I bolded. Vocabulary is a necessary ingredient. And you're not saying it's not. But the bolded part seems similar to the mother of the bride criticizing her daughter for wanting a good wedding cake. And then the mother saying in a huff, "you think a good marriage is all about wedding cake and as long as the wedding cake is good, the marriage will be successful, well, you're wrong". And I'm thinking the bride is not so delusional but does want a good cake.

On the second bit, it sounds like you're in complete agreement and perhaps haven't realized it yet. Using podcasts can help listening, implicit grammar, expanding vocabulary knowledge, increasing "time on task", etc. Then the question is, "how do I find an appropriate podcast"? Whether one is asking the question simply for themselves and would be accomplished with less effort by just trial and error, or one is using another one of their "other" intelligences to come up with a general (although incomplete) solution. Individual learners can look for "general solutions" whether they have realized they ultimately need a "personal solution" or not.

And general solutions may be helpful to others (top down). And that doesn't discount the value of simply personalizing someone else's experience (bottom up).


An interesting response. I have to say that I didn’t really understand the paragraph containing the analogy with the bride and the wedding cake. Isn’t this an excellent example of exactly what we are talking about? I think I know 100% of the words in the paragraph but I don’t comprehend or understand how it relates to what I said about this focus on learning a specific number of words. I’m not saying that there is anything wrong. I’m only saying that I don’t get it, probably because the words used are very different from the words I used in my post. I would even say that it’s a discourse problem.

As for the second part, I hope that my original post did not give the impression that I thought using graded podcasts was useless. This is certainly not my idea. Let’s say there were a system whereby podcasts were classified into six categories : A1-A2, B1-B2 and C1-C2, referring to the CEFR system. And these podcasts came with a transcript, a translation and even a metalinguistic commentary. That would be fabulous. I could easily choose material based on my current or target level.

Now the question is : would this classification give the same results if we used only vocabulary size ? Something along the lines of six vocabulary sizes : 1000, 2000, 3000, 4000, 5000 and 6000 words. I’m not sure about this.

In passing, it should be pointed out that the CEFR system does not make any reference to vocabulary size. Vocabulary is obviously very important but it is use and not numbers that count. If vocabulary size were the most important determinant of language proficiency, we could simply replace language tests with vocabulary tests.

I certainly think podcasts are an excellent source of native materials, especially when accurate transcripts are available. I would also point out that the video material available freely on the Internet is also very useful and often comes with subtitles. Unfortunately, the automatically generated subtitles are usually of dubious quality.

To answer the question « What material (podcast or video) to choose ? » I think the answer is in two parts. First of all for many languages there exists on the internet a huge amount of material aimed specifically at learners.

Secondly, when you feel ready for so-called native materials, I think you could choose anything as long as the subject is somewhat compelling. Then you have to just slog it out with the language.

Since language is so repetitive, you will quickly pick up or notice the small number of key words that keep coming back. Then you have to add the subject-specific words. I loved to watch cooking shows in German and I now do so in Russian. It’s a great way to pick up food and cooking vocabulary.

Since I’m very interested in conversational language, I like to watch programs with interviews or lots of dialogues. Again there is a huge selection often with subtitles on youtube. What I find particularly interesting here is the use of everyday idiomatic language that I can attempt to use.

There is no lack of audio or video material for any topic or language genre under the sun. Finally, I would add that access to a tutor or some help is very important because many things will require clarifications or explanations.
2 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Thu Aug 05, 2021 3:31 pm

s_allard wrote:It’s kind of interesting that with 98% word coverage of this thread, I don’t understand most of it.


But your English ability is not the reason you didn't understand it.

In the above case, your knowledge of English is totally fine. It is your advanced English level that allowed you to engage with the discussion in the first place, to ask questions, to think deeply, to present objections, and even to misunderstand.

There is a distinction to be made between the difficulty of the language employed, and the difficulty of the concept being conveyed. If you attend an introductory lecture on discrete mathematics, you will find that many counter-intuitive concepts will be explained in plain and simple English, using a lot of ordinary vocabulary and grammar that is not in itself a roadblock. Yet, the concepts may still be difficult. The professor is not going to tell you that your English level is simply not high enough to engage with this material, but rather that you just need to ask more questions in class until you can understand it.

So what I'm interested in here is to approximately measure the language difficulty itself, in terms of the grammar and vocabulary that the learner needs to know in order to engage with the content. Sure, depending on the difficulty of the "subject matter", the learner may still have a lot of questions, and may need to think deeply, but that is a type of engagement that is first unlocked by their understanding of the language.

I have to say that I didn’t really understand the paragraph containing the analogy with the bride and the wedding cake. Isn’t this is an excellent example of exactly what we are talking about? I think I know 100% of the words in the paragraph but I don’t comprehend or understand how it relates to what I said about this focus on learning a specific number of words.


Another example where it's not your language ability at issue, it's just that you need to ask more questions, and then you should be able to resolve this misunderstanding.
3 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Sat Aug 07, 2021 5:41 pm

ryanheise wrote:
So what I'm interested in here is to approximately measure the language difficulty itself, in terms of the grammar and vocabulary that the learner needs to know in order to engage with the content. Sure, depending on the difficulty of the "subject matter", the learner may still have a lot of questions, and may need to think deeply, but that is a type of engagement that is first unlocked by their understanding of the language.


(Bold added by me)
This morning I was listening to a Mexican journalist speaking with two other journalists about Mexico’s performance, or rather lack thereof, at the 2020 Olympic games in Tokyo and I started thinking about my level of understanding and about how to measure the difficulty of this kind of spoken language for learners of Spanish. Obviously this material is about as native as one can get and is truly aimed at Mexican listeners.

I should point out that when dealing with conversational native material there is often a basic problem of figuring out exactly what was said, especially when people are speaking through web video applications like Zoom or Skype. Subtitles come in handy but are often not very accurate. Native listeners can fill in the gaps or decipher garbled speech but this is a challenge for learners.

The first item that comes to mind in the context of this thread is the vocabulary size required for 98% coverage of this spoken material. There are four major problems here. First, there is a large number of proper nouns. The journalist refers constantly to the names of athletes, cities and countries, results of previous competitions and certain governmental organizations.

Then there is the terminology of the various Olympic sports. Luckily I had picked up a lot of sports terminology from talking with my tutor.

And the third problem was the omnipresence of idioms some of which were purely Mexican. I’m always taken aback by how idiomatic native conversational language is. On a few occasions I had to do some sleuthing on the Internet to find explanations and then re-listen to the relevant parts of the interview. I think I will have listened to the whole interview at least twice.

This leads me to the question of how to measure the minimum vocabulary necessary for 98% coverage of this material. I know that one could put the material into a program, eliminate the proper nouns, count the number of lemmas and then use a Spanish word frequency list to determine the overall size of vocabulary needed for 98% coverage. The sports terminology will tend to skew the figure upward because many of those terms are not very frequent. In any case, I may end up with a figure like an 8000-word vocabulary being required for 98% coverage of this interview. But I will still have difficulty understanding the language because of the proper nouns and the idioms.
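The computation just described could be sketched like this (toy document and a made-up frequency list, purely for illustration): walk down the reference frequency list and count how many top-ranked words are needed before the target coverage of the document's tokens is reached.

```python
from collections import Counter

def vocab_size_for_coverage(tokens, freq_ranked, target=0.98):
    """How many of the top-ranked words from a reference frequency list are
    needed to cover `target` of the document's tokens. Tokens are assumed
    already lemmatised with proper nouns removed; words absent from the
    list are never covered."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for n, word in enumerate(freq_ranked, start=1):
        covered += counts.get(word, 0)
        if covered / total >= target:
            return n
    return None  # target unreachable with this list

# Made-up illustration
doc = ["el", "el", "el", "atleta", "gana", "el", "oro", "el", "el", "el"]
freq_list = ["el", "gana", "atleta", "oro"]
print(vocab_size_for_coverage(doc, freq_list, target=0.9))  # → 3
```

On real material, the rare sports terms sit far down `freq_list`, which is exactly what drags the answer up toward figures like 8000 words.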

And, finally, there is the issue of measurement of grammatical difficulty. But what constitutes grammatical difficulty? Spanish verb morphology is a major challenge for learners. How does one measure this difficulty ? For example, I’m always struck by how often the subjunctive mood and particularly the imperfect subjunctive is used in Spanish verbs. Should we compare the number of subjunctives to that of non-subjunctives ? And how do we go about measuring pronominal verb usage ?
Is there a way of measuring grammar coverage like we do with word coverage ?

All of this to say that we can’t really distinguish between understanding the language and understanding the subject matter. But in this thread are we really talking about understanding the subject matter ? I think we’re talking about coverage. They are not the same thing.
1 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Sun Aug 08, 2021 10:48 am

s_allard wrote:All of this to say that we can’t really distinguish between understanding the language and understanding the subject matter.


The introductory discrete mathematics lecture is an example where you may understand the plain and simple language used, but just not understand the subject matter because the concepts are counter-intuitive. I'm not going to stop you from conflating the two yourself (go ahead!) but for what I'm doing, it is essential for me to distinguish between the two because when selecting foreign language material, I don't want the discrete mathematics podcast to be filtered out even though it's difficult, if it turns out that the language itself wasn't the obstacle and it was only the brain twisting subject matter that was difficult. I love difficult subject matter. It doesn't bother me that I can't understand a difficult concept on first hearing, but at least I understand the language enough to be able to semantically understand every sentence, and that allows me to at least engage with the material and enjoy the journey to a deeper understanding of the subject matter.

I think we’re talking about coverage. They are not the same thing.


You're working within a framework of coverage, while I'm working within a framework of difficulty, and you seem to want to impose your framework onto me. I don't see the point of that.

When it comes to estimating difficulty, there is a broader range of factors beyond just coverage that can be considered an influence. Sentence length is one of the oldest factors used; more recent literature tends to look at vocabulary and grammar frequency data, but not simply in terms of coverage. Today's NLP techniques are a lot more advanced than simply counting what percentage of words in the document you know (i.e. graph 1). For example, it has already been discussed that you could measure the amount of grammatical recursion in a sentence. Because I'm dealing with audio, one other factor to be considered is the speech rate, or how many words per minute are being spoken. Almost none of these factors can be analysed in a more limited framework of coverage.
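A sketch of what extracting a few such factors might look like (illustrative only; the feature set is made up and this is not the project's real formula). Note that word frequency rank is used directly as a feature rather than via a coverage percentage:

```python
import math

def difficulty_features(sentences, duration_minutes, freq_rank):
    """Illustrative extraction of three of the factors named above:
    mean sentence length, speech rate in words per minute, and the mean
    log frequency rank of the words themselves (used directly, not as
    a coverage percentage)."""
    words = [w for s in sentences for w in s]
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "speech_rate_wpm": len(words) / duration_minutes,
        "mean_log_rank": sum(math.log1p(freq_rank.get(w, 100_000))
                             for w in words) / len(words),
    }
```

The speech-rate feature in particular has no coverage analogue at all: it comes from the audio timing, not from which words the learner knows.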

And the third problem was the omnipresence of idioms some of which were purely Mexican.


This is an interesting topic, and I found an article that gives a list here and explains that even in different Spanish-speaking countries, the idioms are different. This paper also looks into the issues for language learners. But while clearly an idiom is not simply the sum of its words, a difficulty score that treats an idiom as the sum of its words may end up being a close enough approximation of the truth. If idioms are truly frequent in Spanish, then there may turn out to be some rough correlation between the frequency of the idiom and the frequency of the idiom's rarest word. Since the frequency of the idiom cannot be greater than the frequency of any of its words, the worst that can happen is that we rate an idiom as easier to understand than it actually is, but then if you hypothesise that idioms are evenly distributed throughout all language, then it becomes more like background radiation or background noise, where on average it doesn't affect the results in a significant way. There are already many dampening effects in the difficulty score, such as the fact that documents do not consist only of idioms, and their contribution to the score is likely to be washed out to a significant degree by the other sentences within the document. But who knows? I feel that is enough theorising for now, I would like to get back to finishing the project and then we can see how it ends up working in practice.
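As a sketch of that approximation (the idiom and its per-million frequencies below are invented): since an idiom can occur no more often than its rarest constituent word, the minimum of the word frequencies is an upper bound on the idiom's own frequency, and possibly a usable proxy for it.

```python
def idiom_frequency_upper_bound(idiom_words, word_freq):
    """An idiom occurs no more often than its rarest constituent word, so
    the minimum of the word frequencies is an upper bound on the idiom's
    frequency. Frequencies here are made-up counts per million."""
    return min(word_freq.get(w, 0.0) for w in idiom_words)

# "tomar el pelo" (to pull someone's leg), with invented frequencies
freq = {"tomar": 310.0, "el": 28000.0, "pelo": 95.0}
print(idiom_frequency_upper_bound(["tomar", "el", "pelo"], freq))  # → 95.0
```

The error is one-sided: this can only rate an idiom as easier than it really is, which is the "close enough" trade-off described above.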
2 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: The statistical distribution of language difficulty

Postby s_allard » Sun Aug 08, 2021 12:55 pm

ryanheise wrote:

s_allard wrote:I think we’re talking about coverage. They are not the same thing.


You're working within a framework of coverage, while I'm working within a framework of difficulty, and you seem to want to impose your framework onto me. I don't see the point of that.


I certainly do not want to impose any framework on anyone. What I have pointed out repeatedly is that from the very beginning of this thread, the term comprehension has been used as synonymous with coverage. That’s what is written on the y-axis of chart 1. I don’t know of any researchers who speak of 98.08% comprehension.

With reference to the framework of difficulty, as I pointed out in my last post, I was looking specifically at the nature of the difficulty of listening to what I would call a Mexican podcast. I identified four problem areas that impact difficulty of understanding. The only one that has been addressed here is the treatment of idioms, as we see below.

ryanheise wrote:
s_allard wrote:And the third problem was the omnipresence of idioms some of which were purely Mexican.


This is an interesting topic, and I found an article that gives a list here and explains that even in different Spanish-speaking countries, the idioms are different. This paper also looks into the issues for language learners. But while clearly an idiom is not simply the sum of its words, a difficulty score that treats an idiom as the sum of its words may end up being a close enough approximation of the truth. If idioms are truly frequent in Spanish, then there may turn out to be some rough correlation between the frequency of the idiom and the frequency of the idiom's rarest word. Since the frequency of the idiom cannot be greater than the frequency of any of its words, the worst that can happen is that we rate an idiom as easier to understand than it actually is, but then if you hypothesise that idioms are evenly distributed throughout all language, then it becomes more like background radiation or background noise, where on average it doesn't affect the results in a significant way. There are already many dampening effects in the difficulty score, such as the fact that documents do not consist only of idioms, and their contribution to the score is likely to be washed out to a significant degree by the other sentences within the document. But who knows? I feel that is enough theorising for now, I would like to get back to finishing the project and then we can see how it ends up working in practice.


If I understand this paragraph correctly, the hypothesis is that idioms are not a problem and are simply to be analyzed in terms of their rarest parts. This is an excellent example of how coverage is being used to eliminate the more fundamental issue of comprehension. The problem isn’t the distribution or number of idioms in a text ; it is a problem of how to interpret groupings of words that take on special meaning.

Similarly, as I pointed out, the presence of proper nouns is a real-life problem that most studies simply ignore because in certain genres of text they are not that important. But in other kinds of material such as the example I gave – and I would think in many podcasts – they are key to understanding the content.
1 x

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: The statistical distribution of language difficulty

Postby ryanheise » Sun Aug 08, 2021 5:07 pm

s_allard wrote:What I have pointed out repeatedly is that from the very beginning of this thread, the term comprehension has been used as synonymous for coverage. That’s what is written on the y-axis of chart1.


If you take another look at chart 1 and the clarifications below it, do you still think that? (Try again after removing your "coverage"-tinted glasses 8-) )

With reference to the framework of difficulty, as I pointed out in my last post, I was looking specifically at the nature of the difficulty of listening to what I would call a Mexican podcast.


With reference to the framework of difficulty? These points were still with reference to your own framework of coverage because you only talked about points that can be reduced to coverage. In the process of reducing it to coverage, you are losing information. For example, one of your points was about grammatical difficulty, but then after squeezing it into your coverage framework you arrived at the idea of "grammatical coverage". So what information are you losing when you try to reduce difficulty to coverage here? Simple. I already gave you an example of how measuring the size of a recursive grammatical structure can be used as a difficulty factor. This simply does not fit into your coverage framework because it measures the complexity of the structures themselves rather than measuring the percentage of the document covered. I have since given more examples. In particular, since we are dealing with audio, another difficulty factor can be the speed of speech. Here, we are measuring the speed of the speech directly and using it as the factor itself, rather than measuring the percentage of the document covered. Even when it comes to using word frequency data, it is NOT used with reference to coverage. Rather, the frequency number for a word is directly transformed into a measure of difficulty.
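To make the recursion example concrete, a crude sketch (illustrative only, not the analysis actually run for the project): if a sentence's parse were available as nested lists, the amount of grammatical recursion could be measured as tree depth.

```python
def max_parse_depth(tree):
    """Depth of a constituency parse given as nested lists: a crude proxy
    for how much grammatical recursion a sentence contains. The parse
    format is hypothetical; any parser's tree could be walked this way."""
    if not isinstance(tree, list):
        return 0
    return 1 + max((max_parse_depth(child) for child in tree), default=0)

# "the cat [that the dog [that ran] chased] slept"
nested = ["the", "cat", ["that", "the", "dog", ["that", "ran"], "chased"], "slept"]
print(max_parse_depth(nested))  # → 3
```

The number measures the complexity of the structure itself; there is no percentage of a document being covered anywhere in the calculation, which is the point.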

I identified four problems areas that impact difficulty of understanding. The only one that has been addressed here is the treatment of idioms


These have all been discussed before aside from idioms.

Ok, just quickly:

1. Proper nouns: I currently treat proper nouns in a way similar to Nation, and I find this to be an amply sufficient approximation. There is no need to optimise this part of the algorithm further.
2. Domain-specific vocabulary: Such words will naturally be lower on the frequency scale, and since word frequency is already a factor, this is already factored in.
3. Idioms: answered in previous reply.
4. Grammar: answered in this reply.

, as we see below.

ryanheise wrote:
s_allard wrote:And the third problem was the omnipresence of idioms some of which were purely Mexican.


This is an interesting topic, and I found an article that gives a list here and explains that even in different Spanish-speaking countries, the idioms are different. This paper also looks into the issues for language learners. But while clearly an idiom is not simply the sum of its words, a difficulty score that treats an idiom as the sum of its words may end up being a close enough approximation of the truth. If idioms are truly frequent in Spanish, then there may turn out to be some rough correlation between the frequency of the idiom and the frequency of the idiom's rarest word. Since the frequency of the idiom cannot be greater than the frequency of any of its words, the worst that can happen is that we rate an idiom as easier to understand than it actually is, but then if you hypothesise that idioms are evenly distributed throughout all language, then it becomes more like background radiation or background noise, where on average it doesn't affect the results in a significant way. There are already many dampening effects in the difficulty score, such as the fact that documents do not consist only of idioms, and their contribution to the score is likely to be washed out to a significant degree by the other sentences within the document. But who knows? I feel that is enough theorising for now, I would like to get back to finishing the project and then we can see how it ends up working in practice.


If I understand this paragraph correctly, the hypothesis is that idioms are not a problem and are simply to be analyzed in terms of their rarest parts. This is an excellent example of how coverage is being used to eliminate the more fundamental issue of comprehension. The problem isn’t the distribution or number of idioms in a text ; it is a problem of how to interpret groupings of words that take on special meaning.


You missed the point. The difficulty score only needs to have a correlation to actual difficulty. It doesn't need to model the way difficulty actually works to achieve that correlation. So if there is a computationally inexpensive shortcut that saves months of processing time while sacrificing just a bit of accuracy, that is worth taking. Just to give you some idea of this, each time I make a modification to the formula and rerun the analysis, it currently takes 4 weeks of continuous CPU churning to produce the new podcast sort order. I do not want to make it more complicated so that it then takes 3 months to run. It doesn't need to be perfect, it just needs to fall in that sweet spot where it's good enough and actually computable in reasonable time.

And I note again your use of the "coverage"-tinted glasses above. If you understood the way the difficulty calculation works as per the explanation in a previous post, you would see that it is completely different from coverage.

Anyway, let's finish here since it really is enough theorising. The time is better spent finishing the job and putting it into practice.
1 x

luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: The statistical distribution of language difficulty

Postby luke » Sun Aug 08, 2021 5:29 pm

s_allard wrote:the presence of proper nouns ... are key to understanding the content.

You made some distinctions between when proper nouns matter a lot and when they may matter less. I'm just using that bit of quote to note that comprehension is not "all or nothing".

Example: a discussion that mentions some specific plants grown in a garden (in a wider discussion - I don't mean in a podcast on gardening). Knowing that certain words are plants from a garden is a certain level of comprehension. Today I was at a market that had fresh vegetables. I noticed a couple of roots whose names I'd heard and read before in the context of gardens. Even in translation they didn't mean a lot to me, but knowing they are plants from a garden has been enough to understand pretty well what the speaker/author's point was. (Dad wasn't helping in the garden).

Do I have to know the plant is a root before I understand? Do I have to taste the root before I understand? Do I have to know how easy or hard to grow the root might be? Do I need to know if there is any lore about medicinal properties of these particular roots? Do I have to have years of experience eating these roots as a staple as a child to understand?

Having personal answers to all those questions may enrich the story, but it isn't critical for getting the author/speaker's point.

One can make analogous questions for sports figures, politicians, companies, makes and models, etc. Knowing more details may be helpful, but in some settings they may not be that important to a particular speaker's point. For instance, in a motivational talk discussing some champion's "mental approach to the sport", understanding which specific sport they became a champion in may not be critical to understanding the athlete's focus on drive, determination, discipline, diet, etc. (again, assuming the talk is general and not specifically about "the game").
2 x

