A visualisation of all words in English language podcasts

General discussion about learning languages
User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: A visualisation of all words in English language podcasts

Postby ryanheise » Sat Aug 13, 2022 4:55 am

Cainntear wrote:Was just looking at your profile pic and tried to work out whether it was supposed to represent "Be the and I" or "Der ich und sein"...? :lol: :lol:


My profile pic is an amalgamation of my 3 hobbies, and so you can view the same image in 3 ways:

1. Language: It's a collection of letters from the Latin alphabet: I (red), Z (green), L (orange), J (blue)
2. Tetris: It's a tessellation of tetrominoes
3. Cubing: It's a patterned side of a 4x4 Rubik's Revenge

Image

But I liked your guess, actually. It gives me an idea because it's the sort of cryptic thing my puzzle-minded brain would have wanted to come up with, and it could be a good graphic design to represent the project itself... (using the Japanese frequency table instead of the English or German one, since its top words are each single characters):

Image
2 x

Cainntear
Black Belt - 3rd Dan
Posts: 3469
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8662
Contact:

Re: A visualisation of all words in English language podcasts

Postby Cainntear » Sat Aug 13, 2022 10:01 am

s_allard wrote:In summary, reducing listening difficulty or listenability to analyzing word frequencies of podcast transcripts makes for colourful and pretty graphics that tell us nothing that George Zipf had not discussed nearly 100 years ago.

Not strictly true. Many historical observations were academic curiosities at the time, with little practical, real-world utility, because the amount of time it took to do anything with them was unfeasibly long.

With the computational power and storage capacity of modern computers, fairly simple concepts like Zipf's law, Markov chains and Flesch-Kincaid readability scores can be put to use in real-world problems.
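To take the readability example: a Flesch-Kincaid grade level is just arithmetic over three counts (sentences, words, syllables), which is exactly why a computer can apply it to thousands of transcripts in seconds. A minimal sketch in Python -- the vowel-group syllable counter is a crude stand-in for the pronunciation dictionaries real tools use:

import re

def count_syllables(word):
    # Crude heuristic: count vowel groups; real tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

print(flesch_kincaid_grade("The cat sat on the mat. It was warm, and it slept all day."))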

The original Google Translate almost exclusively relied on n-grams (a later variation on Markov chains) and while the text it produced was obviously not human-written text, it was enough to give people some level of understanding that they wouldn't otherwise have had.

"Not perfect" is not the same as "not useful".

ryanheise is looking at a real-world problem: filtering through hours and hours of material and trying to make it more likely that the learner finds something on their own level. It doesn't have to do the job perfectly every time.

That's why he's talking about "proxy measures" and "correlation between vocabulary and understandability". A mathematically accurate, objective measure of understandability is probably impossible to define, and if such a thing is possible, we're a million miles from being able to do that.

The value of vocabulary as a proxy is that it is a much simpler measurement to make, so achieving even a marginally better measure is massively more expensive than the "good enough" proxy. So we use the good enough.


That's understandability, and here he's talking about vocabulary not as a proxy, but as a thing in and of itself. However, I wanted to go over that first because you brought up the valid question of phrasal units like multipart verbs. Yes, these are problematic.

So let's reframe the problem.

Checking the vocabulary requirements of a piece would ideally involve cataloguing all lexical items, including not only single-token items (aka "words") but also phrasal units (eg "pick up on [sthg]") and root morphemes in derived forms (eg "apply" in "reapply").

However, such a measure is very complex.
In the case of phrasal units you have to be able to determine that in the sentence "There's someone I have to pick up on my way home from work", the phrasal verb is "pick up" with "on" acting as a preposition of time, and not "pick up on" (to notice something and react to it).
With root morphemes, you're now forced to do an in-depth analysis of which derived forms are actively recognised as such by native speakers and which aren't -- we almost all intuitively recognise reapply=re+apply, but who processes "understand" as under+stand?

Not only does this make it extremely complex to define for English, but it also makes your algorithm less generalisable -- ryanheise's running the same thing for German and Japanese as for English.

What you've got, then, is a very difficult task where every improvement in output comes at the cost of a massive increase in complexity.

So what if there was a proxy measure for lexical items that gave ~75% accuracy with very low complexity? Wouldn't that be a useful thing to have? It should be unsurprising that word tokens act as a reasonable proxy for lexical items in many languages. How closely it approximates lexical content varies by language (it wouldn't be much use in an agglutinative language like Greenlandic, for example) and yes, Germanic multipart verbs do reduce its accuracy... but it's still useful.
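To give a sense of just how cheap the word-token proxy is, here's a minimal sketch in Python (the "known words" set is an invented stand-in for the top of a frequency list). Note that it deliberately treats a phrasal unit like "pick up on" as three separate tokens -- that is precisely the accuracy being traded away:

import re

def token_coverage(text, known_words):
    # Fraction of word tokens that fall inside the learner's assumed vocabulary.
    # Each token is treated independently, so "pick up on" counts as three items.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known_words) / len(tokens)

# Invented stand-in for the top of a frequency list.
known = {"the", "be", "to", "of", "and", "a", "in", "i", "have", "on", "that", "up", "my", "way", "home"}
print(token_coverage("There's someone I have to pick up on my way home from work", known))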


But that's not all, because the other side of the equation is what happens when the algorithm is wrong. If the algorithm keeps presenting the learner with a lexical unit which it assumes they know but they actually don't, they simply end up getting exposed to a new lexical unit. If Krashen's right, they learn it anyway because the text as a whole is likely to be within their "i+1" comprehensible range. Even if Krashen isn't right, exposure does have a role to play in acquisition anyway, and besides... the reader might well spot that they haven't understood and just look the term up in a dictionary.


On balance, then, it should be clear that it's a good thing, and as the saying goes: don't let the perfect be the enemy of the good.
4 x

Cainntear
Black Belt - 3rd Dan
Posts: 3469
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8662
Contact:

Re: A visualisation of all words in English language podcasts

Postby Cainntear » Sat Aug 13, 2022 10:09 am

And one other thing...
s_allard wrote:The elementary statistics in the post here are of course pretty much in line with all studies of word frequency in English and probably most languages and seems to be in line with Zipf’s law.

Perhaps your mistake here is looking at ryanheise's posts as attempts at "science", and you would be better considering them as "engineering"...?

He's describing a project he's in the process of implementing that applies existing principles -- engineering.

I mean... if someone came onto a civil construction forum and started discussing his plans for a new bridge over the Rio Grande, it's unlikely that people would complain that their materials and construction techniques were all things that had been applied to bridge construction before, isn't it?

Your line about Zipf's law feels a little like looking at a plan for a bridge, noticing a circular arch on it and pointing out that we've been building arches since the Roman Empire. If it's the right tool for the job, the engineer uses it.
3 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: A visualisation of all words in English language podcasts

Postby s_allard » Sat Aug 13, 2022 12:10 pm

To ryanheise, first of all, again let me wish you all the best for your health. This is the most important thing. I hope that our debate here does not add any more stress to what is probably a difficult situation. Remember that our discussion is in no way personal and from my perspective this is an opportunity to advance our understanding of some issues that are of interest to some of us.

That said, I’ll leave some of the more weighty issues for another post and now look at a relatively small but significant point. For me a prerequisite of good science is the accurate use of terminology. And here is where I have an issue with the title of the thread.

When I first saw the title I was intrigued: how can one visualize vocabulary, and how will this help us understand its workings? However, what I saw was really a pretty pie chart illustrating the frequency values of a small number of word lemmas, while the values of the other words were too small for them to appear on the chart.

George Kingsley Zipf demonstrated a long time ago that this kind of statistical distribution is fundamentally the same in all human languages. Zipf’s law states basically that in human languages there are a few very high-frequency words that account for most of the words in a sample text and many low-frequency words. English-language podcasts, with some minor specificities, share this fundamental feature with all the various corpora of English. The same goes for German and Japanese.
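The calculation behind such a chart is, of course, nothing more than counting and ranking. A minimal sketch in Python, with made-up data, of the cumulative coverage that Zipf’s law predicts:

from collections import Counter

def cumulative_coverage(tokens, top_n):
    # Share of all tokens accounted for by the top_n most frequent items.
    counts = Counter(tokens)
    top = sum(n for _, n in counts.most_common(top_n))
    return top / sum(counts.values())

# Toy data: even here a handful of items dominate, as Zipf's law predicts.
tokens = "the cat sat on the mat and the dog sat on the mat".split()
print(cumulative_coverage(tokens, 3))  # "the" plus two other items already cover over 60%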

I would have entitled the thread A visualization of word frequency distribution in English-language podcasts. Now at least I would know what to expect and not be disappointed.
0 x

Cainntear
Black Belt - 3rd Dan
Posts: 3469
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8662
Contact:

Re: A visualisation of all words in English language podcasts

Postby Cainntear » Sat Aug 13, 2022 12:53 pm

s_allard wrote:For me a prerequisite of good science is the accurate use of terminology. And here is where I have an issue with the title of the thread.

In the field of modern data, "visualisation" is used to mean any way of presenting large datasets visually to allow a user or researcher to gain some kind of insight into certain properties of the dataset. As such, there isn't anything inaccurate in his use of the term. Would "frequency distribution" have been more precise? Yes. Precision is related to accuracy, but it's not the same thing.

s_allard wrote:George Kingsley Zipf demonstrated a long time ago that this kind of statistical distribution is fundamentally the same in all human languages. Zipf’s law states basically that in human languages there are a few very high-frequency words that account for most of the words in a sample text and many low-frequency words. English-language podcasts, with some minor specificities, share this fundamental feature with all the various corpora of English. The same goes for German and Japanese.

Have you ever before seen Zipf's law presented as proportions of 100%, as opposed to frequencies relative to each other? I haven't, and looking at these graphs (and I'm personally not a fan of pie charts) I realise that my view of the frequency of "the" has been skewed by seeing it at the very top of the scale of line graphs and bar charts, so these charts have in fact given me an insight I wasn't expecting.

It's a shame they disappointed you, but you can't please all of the people all of the time.
2 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: A visualisation of all words in English language podcasts

Postby s_allard » Sat Aug 13, 2022 1:17 pm

Cainntear wrote:
ryanheise is looking at a real-world problem: filtering through hours and hours of material and trying to make it more likely that the learner finds something on their own level. It doesn't have to do the job perfectly every time.

That's why he's talking about "proxy measures" and "correlation between vocabulary and understandability". A mathematically accurate, objective measure of understandability is probably impossible to define, and if such a thing is possible, we're a million miles from being able to do that.

The value of vocabulary as a proxy is that it is a much simpler measurement to make, so achieving even a marginally better measure is massively more expensive than the "good enough" proxy. So we use the good enough.


That's understandability, and here he's talking about vocabulary not as a proxy, but as a thing in and of itself. However, I wanted to go over that first because you brought up the valid question of phrasal units like multipart verbs. Yes, these are problematic.




On balance, then, it should be clear that it's a good thing, and as the saying goes: don't let the perfect be the enemy of the good


I get that ryanheise is trying to develop some system that could automatically grade podcasts according to levels of difficulty. Again, there is a question of terminology. If we speak of determining listening lexical difficulty or lexical listenability, I have no problem supporting this line of research. Just don’t call it difficulty of understanding.

It just so happens that we have a data set that could be used to verify all of this. As has been mentioned in this forum, there is an excellent website called Dreaming Spanish https://www.dreamingspanish.com/browse that offers hundreds of videos of Spanish for four levels of learners: superbeginners, beginners, intermediates and advanced.

Using transcripts of these recordings and ryanheise’s approach of analyzing word frequency as a proxy, would we arrive at exactly the same classification of these recordings according to suitability for different learners?

I don’t have the answer, but I suspect the results are not that clear. My doubts are based on my observation, albeit anecdotal, that the frequency distribution of the vocabulary of these videos does not change much from one level to the other. Instead, what is really striking is the differences in speaking rate, use of visual aids, articulation, grammatical complexity, and presence of idioms and metaphors.

These are all things that don’t appear when one counts the words in a transcript. And these I suspect are the very elements that enter into the mind of the author, Pablo Ramon, as he makes these videos.

As an aside, I find the topic so interesting that I've started transcribing short excerpts of four videos to see what kind of word frequency distribution we have.
1 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: A visualisation of all words in English language podcasts

Postby ryanheise » Sat Aug 13, 2022 4:00 pm

s_allard wrote:It just so happens that we have a data set that could be used to verify all of this. As has been mentioned in this forum, there is an excellent website called Dreaming Spanish https://www.dreamingspanish.com/browse that offers hundreds of videos of Spanish for four levels of learners: superbeginners, beginners, intermediates and advanced.

Using transcripts of these recordings and ryanheise’s approach of analyzing word frequency as a proxy, would we arrive at exactly the same classification of these recordings according to suitability for different learners?


Spanish? No... I asked if anyone wanted to help with Spanish, but you didn't put your hand up.
Exact classification? No. I said many times I'm sacrificing accuracy in order to scale.

For English (which I have implemented), we did a test on the LearnWithPeter podcast, and the results were close to Peter's own rankings with some small differences that I explained as being due to the algorithm giving more weight to words repeated within a document. The two dials are:

1. word frequency within the corpus (indicating words you are likely to already know)
2. word frequency within the document (indicating words that may be easier to learn due to the repetition)

If I adjust the dials so that more weight is given to 1 and less weight is given to 2, then we get the same sequence as Peter's assessment:

Episode 12: 1816.3459101556728
Episode 14: 3443.5543054234563
Episode 15: 4017.5954645526554
Episode 8: 4568.508438045392
Episode 13: 5323.452068904534

Note that Peter rated Episode 12 as Low intermediate, episodes 14 and 15 as High intermediate, and episodes 8 and 13 as advanced.

I have since turned the dials back to how they were originally, because as a language learner, I actually like having unfamiliar words repeated within a document.
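Just to make the two dials concrete, here is a rough sketch of how a score along those lines could be combined. It is only an illustration of the idea, not the actual scoring code, and every name, weight and number in it is invented:

from collections import Counter

def difficulty_score(doc_tokens, corpus_freq, w_corpus=1.0, w_repeat=0.5):
    # Hypothetical two-dial score (illustrative only; lower = assumed easier):
    #   dial 1 -- corpus frequency: words rare in the corpus push difficulty up
    #   dial 2 -- in-document repetition: repeated words are discounted, since
    #             repetition makes an unfamiliar word easier to pick up
    doc_counts = Counter(doc_tokens)
    score = 0.0
    for word, n in doc_counts.items():
        rarity = 1.0 / (corpus_freq.get(word, 0) + 1)            # dial 1
        repetition_discount = 1.0 / (1.0 + w_repeat * (n - 1))   # dial 2
        score += w_corpus * rarity * n * repetition_discount
    return score

# Invented corpus frequencies for three words.
corpus_freq = {"the": 50000, "weather": 300, "photosynthesis": 4}
print(difficulty_score("the weather the photosynthesis photosynthesis".split(), corpus_freq))

Setting w_repeat to 0 switches the repetition discount off entirely, which is the kind of dial adjustment described above; the real weighting could of course take quite a different form.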

The main problem with my algorithm as far as English is concerned is not any of the things you keep bringing up, but rather the quality of the transcripts. That is where most of my effort is taking place currently. And also being slightly distracted by coming up with a logo graphic:

Image
1 x

Cainntear
Black Belt - 3rd Dan
Posts: 3469
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8662
Contact:

Re: A visualisation of all words in English language podcasts

Postby Cainntear » Sat Aug 13, 2022 4:15 pm

s_allard wrote:I get that ryanheise is trying to develop some system that could automatically grade podcasts according to levels of difficulty. Again, there is a question of terminology. If we speak of determining listening lexical difficulty or lexical listenability, I have no problem supporting this line of research. Just don’t call it difficulty of understanding.

This thread is specifically about vocabulary, as it says in the title that you were unhappy with a short while ago.

It was you who decided to talk about understanding in general with this statement:
s_allard wrote:But I would prefer to look at the bigger picture and ask the fundamental question : what makes something (text or recording) difficult to understand?

After that, ryanheise deliberately clarified that he was talking about vocabulary here.

You quoted ryanheise thusly to justify your change of topic (your emphasis)
ryanheise wrote:I found it interesting that these numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be easier to understand than general language.

Let me shift the emphasis and see what happens:
ryanheise wrote:I found it interesting that these numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be easier to understand than general language.

He did not present it as proof that podcasts are easier.

It is reasonable when working with data to propose hypotheses for further investigation, and ryanheise's plan is clearly to do more investigation based on more variables. Perhaps when he's finished he'll have demonstrated that podcasts are indeed simpler to understand than many other media, or maybe he'll find that increased simplicity in one measure corresponds to increased complexity in another and that it's a zero-sum game.
Neither you, nor I, nor he knows what his final conclusion will be, so it is unfair to attack him for not having proof of what has only ever been presented as a working hypothesis.

(Reordering points here for clarity)

s_allard wrote:I don’t have the answer, but I suspect the results are not that clear. My doubts are based on my observation, albeit anecdotal, that the frequency distribution of the vocabulary of these videos does not change much from one level to the other. Instead, what is really striking is the differences in speaking rate, use of visual aids, articulation, grammatical complexity, and presence of idioms and metaphors.

And as ryanheise has said here and on previous occasions, he is not looking for a single measure, but multiple variables -- dials that an individual learner can tweak to customise what they're getting.

When he is constantly making explicit that the topic under discussion is only one of many factors, the recurring criticism from you that it is only one of many factors is utterly unjustified.

It shows a deep lack of respect for anyone when your criticism of them is based on you not reading and/or understanding that they just said the exact thing that you are criticising them for not saying.

s_allard wrote:It just so happens that we have a data set that could be used to verify all of this. As has been mentioned in this forum, there is an excellent website called Dreaming Spanish https://www.dreamingspanish.com/browse that offers hundreds of videos of Spanish for four levels of learners: superbeginners, beginners, intermediates and advanced.
...
Using transcripts of these recordings and ryanheise’s approach of analyzing word frequency as a proxy, would we arrive at exactly the same classification of these recordings according to suitability for different learners?

This will not verify anything. Instead, it can be used to calibrate the system. And no, we would not arrive at exactly the same classification, because human judgement is not infallible, and sometimes there will be "intermediate" episodes that could just as easily have been part of the "advanced" series.

OK, so what do I mean by "calibrating" difficulty measures?

Well, first he establishes measures of multiple independent factors determining difficulty: speed, lexical diversity, lexical density, clause length, number of clauses in sentences, whatever.

Then he gives his system a pile of texts for which he has an expert difficulty rating (e.g. the Dreaming Spanish videos) and generates the individual measures.

Next, he feeds that into a basic algebraic solver which generates an equation like this:
a*speed + b*lex_div + c*lex_dens + ... = category *

The computer then substitutes in all the data for every single text provided and determines values for the coefficients a, b, c etc. and the upper and lower bounds of each category that put as many of the texts into the categories they were given as possible.

As I said, it will never give the exact same classification as a human, and that's OK.

(* in reality, the equation is more likely to be a1*2^speed + a2*speed^2 + a3*speed + a4*log speed + a5*1/speed + ... but most of the coefficients will solve to effectively zero; that's by-the-by, though...)
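To make the calibration step concrete, here is a minimal sketch (Python with numpy; every number in it is made up) of the kind of least-squares fit described above -- fit the coefficients against human-labelled texts, then place the category boundaries between the fitted scores:

import numpy as np

# One row per labelled text: [speed, lexical_diversity, lexical_density] (made-up numbers).
features = np.array([
    [2.1, 0.35, 0.48],
    [2.8, 0.42, 0.55],
    [3.4, 0.51, 0.60],
    [4.0, 0.58, 0.66],
])
# Human ratings coded as numbers: 0 = superbeginner ... 3 = advanced.
labels = np.array([0.0, 1.0, 2.0, 3.0])

# Least-squares fit of  a*speed + b*lex_div + c*lex_dens + d  to the category labels.
X = np.hstack([features, np.ones((len(features), 1))])  # extra column for the intercept d
coeffs, residuals, rank, _ = np.linalg.lstsq(X, labels, rcond=None)

fitted = X @ coeffs
# Category boundaries can then be placed between the fitted scores of adjacent levels,
# so a new text is graded by where its own score falls.
print(coeffs, fitted)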
0 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: A visualisation of all words in English language podcasts

Postby s_allard » Sat Aug 13, 2022 5:36 pm

ryanheise wrote:

For English (which I have implemented), we did a test on the LearnWithPeter podcast, and the results were close to Peter's own rankings with some small differences that I explained as being due to the algorithm giving more weight to words repeated within a document. The two dials are:

1. word frequency within the corpus (indicating words you are likely to already know)
2. word frequency within the document (indicating words that may be easier to learn due to the repetition)

If I adjust the dials so that more weight is given to 1 and less weight is given to 2, then we get the same sequence as Peter's assessment:

Episode 12: 1816.3459101556728
Episode 14: 3443.5543054234563
Episode 15: 4017.5954645526554
Episode 8: 4568.508438045392
Episode 13: 5323.452068904534

Note that Peter rated Episode 12 as Low intermediate, episodes 14 and 15 as High intermediate, and episodes 8 and 13 as advanced.

I have since turned the dials back to how they were originally, because as a language learner, I actually like having unfamiliar words repeated within a document.

The main problem with my algorithm as far as English is concerned is not any of the things you keep bringing up, but rather the quality of the transcripts. That is where most of my effort is taking place currently. And also being slightly distracted by coming up with a logo graphic:

Image


I really don’t want to rehash a tiresome debate that we had a while back, although I see the problems are the same. But I do remember that the same algorithm ranked the podcast of the prayer John 17 of the World English Bible as the second easiest podcast out of 40000. Here is just the first paragraph of that prayer:

Jesus said these things, and lifting up his eyes to heaven, he said, “Father, the time has come. Glorify your Son, that your Son may also glorify you; even as you gave him authority over all flesh, so he will give eternal life to all whom you have given him. This is eternal life, that they should know you, the only true God, and him whom you sent, Jesus Christ. I glorified you on the earth. I have accomplished the work which you have given me to do. Now, Father, glorify me with your own self with the glory which I had with you before the world existed.


If this is the second easiest out of 40000 podcasts, what does this say about all the others? The point here is that despite all this scientific-sounding statistical sleight-of-hand, the algorithm doesn’t really work for ranking difficulty of understanding. It may work for all sorts of other things, as this excerpt from the Bible shows, but not for understanding.
0 x

Cainntear
Black Belt - 3rd Dan
Posts: 3469
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8662
Contact:

Re: A visualisation of all words in English language podcasts

Postby Cainntear » Sat Aug 13, 2022 7:19 pm

s_allard wrote:I really don’t want to rehash a tiresome debate

Then please feel free to stop doing so.
3 x

