A visualisation of all words in English language podcasts

General discussion about learning languages
User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

A visualisation of all words in English language podcasts

Postby ryanheise » Wed Aug 10, 2022 11:12 am

Image

This is based on a corpus of 336 million tokens across 67,000 transcribed English language podcast episodes. Similar studies have been done on the Oxford English Corpus, but I found it interesting that the numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be easier to understand than general language. As one point of comparison:

* In the podcast corpus, 25% of all language is covered by 7 lemmas
* In the Oxford English Corpus, 25% of all language is covered by 10 lemmas
* In the COW English corpus, 25% of all language is covered by 13 lemmas

(News headline: "Learn these 7 words to understand 25% of all language!" ;-) It's actually understanding sentences that matters.)
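For the curious, coverage points like these are straightforward to compute from a lemma frequency list. A minimal sketch (the toy counts are invented for illustration, not taken from the real podcast corpus):

```python
from collections import Counter

def coverage_points(token_counts, targets=(0.25, 0.50, 0.75, 0.90)):
    """For each coverage target, count how many of the most frequent
    lemmas are needed to cover that share of all tokens."""
    total = sum(token_counts.values())
    results = {}
    cumulative = 0
    # Walk lemmas from most to least frequent, accumulating coverage.
    for i, (_lemma, count) in enumerate(token_counts.most_common(), start=1):
        cumulative += count
        for t in targets:
            if t not in results and cumulative / total >= t:
                results[t] = i
    return results

# Toy counts, invented for illustration -- not the real podcast corpus.
counts = Counter({"the": 50, "be": 30, "to": 20, "of": 15, "and": 10,
                  "a": 8, "in": 6, "have": 4, "it": 3, "you": 2})
print(coverage_points(counts))
```

Run over a real frequency list, the same walk yields the "7 lemmas for 25%" style figures quoted above.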

It's been a while since I did any work on this corpus due to health issues but I've finally gotten back on the horse and have started building similar corpora for other languages, which was the eventual goal. Here are the stats:

* 66,923 transcribed English episodes
* 14,706 transcribed German episodes
* 1,789 transcribed French episodes
* 2,079 transcribed Japanese episodes

This is basically how many episodes have transcripts, which is a helpful thing to know from a language learning perspective: it is how many podcast episodes you could listen to with the aid of transcripts.

The Japanese stat at the end is inflated because I transcribed all 1600 episodes of Nihongo con Teppei's 6 podcasts (with his permission). These were produced through automated speech-to-text, which is not perfect, so he is not ready to publish them officially himself, although he does not mind if I share the transcripts elsewhere. If there is interest, I'll make them available for download.

Here is also the pie chart for German which has the second-largest corpus:

Image

And Japanese (with kanji converted to hiragana in post processing):

Image

Previous posts in my journey:

The statistical distribution of language difficulty
SRS vs natural repetition
Japanese podcasts sorted by difficulty
14 x

User avatar
rdearman
Site Admin
Posts: 7231
Joined: Thu May 14, 2015 4:18 pm
Location: United Kingdom
Languages: English (N)
Language Log: viewtopic.php?f=15&t=1836
x 23127
Contact:

Re: A visualisation of all words in English language podcasts

Postby rdearman » Wed Aug 10, 2022 11:32 am

Welcome back. Nice to see your work again.
4 x
: 0 / 150 Read 150 books in 2024

My YouTube Channel
The Autodidactic Podcast
My Author's Newsletter

I post on this forum with mobile devices, so excuse short msgs and typos.

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: A visualisation of all words in English language podcasts

Postby s_allard » Thu Aug 11, 2022 2:23 am

ryanheise wrote:...
This is based on a corpus of 336 million tokens across 67,000 transcribed English language podcast episodes. Similar studies have been done on the Oxford English Corpus, but I found it interesting that the numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be easier to understand than general language. As one point of comparison:

* In the podcast corpus, 25% of all language is covered by 7 lemmas
* In the Oxford English Corpus, 25% of all language is covered by 10 lemmas
* In the COW English corpus, 25% of all language is covered by 13 lemmas

(News headline: "Learn these 7 words to understand 25% of all language!" ;-) It's actually understanding sentences that matters.)



Also a warm welcome back. I'm glad that the health issues are behind you. I would certainly like to see any interesting results. The elementary statistics in the post here are of course pretty much in line with all studies of word frequency in English (and probably most languages) and are consistent with Zipf's law.
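To make the Zipf connection concrete: if lemma frequency is proportional to 1/rank, the share of tokens covered by the top k lemmas out of a vocabulary of V is the ratio of harmonic numbers H(k)/H(V). A quick sketch of this idealized model (the 50,000-lemma vocabulary size is an assumption for illustration, not the actual corpus figure):

```python
def zipf_coverage(k, vocab_size):
    """Fraction of all tokens covered by the top-k lemmas under an
    idealized Zipf distribution (frequency proportional to 1/rank):
    the harmonic-number ratio H(k) / H(vocab_size)."""
    def harmonic(n):
        return sum(1.0 / r for r in range(1, n + 1))
    return harmonic(k) / harmonic(vocab_size)

# With a 50,000-lemma vocabulary, a handful of top words already
# covers a large share of tokens.
for k in (7, 10, 100):
    print(k, round(zipf_coverage(k, 50_000), 3))
```

Under these assumptions the top 7 lemmas land close to the 25% mark, which is why the podcast figure above is not in itself surprising.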

The issue I would like to raise, and on which there should be ample debate, is this idea that since 7 lemmas in the podcast corpus cover 25% of all language, podcasts should be easier to understand than more general language, where in the two corpora presented here the number of lemmas for 25% coverage is 10 and 13 respectively.

The idea here is that a text with fewer lemmas is easier to understand than a text with more lemmas, assuming texts of similar token length. I believe this is a fallacy for a fundamental reason. A smaller number of lemmas means a larger number of repetitions and inevitably a larger number of different meanings or uses. Thus the most common words are the ones used in many different ways, and this increases the difficulty of understanding.

Take for example that well-known phenomenon of phrasal verbs in English, formed by verb + particle. With the particle up we have forms like make up, do up, talk up, give up, write up, live up, etc. Something similar can be said for particles like down, by and over, among others. So we now have verbs like live up, live down, live by, live out, etc. These combinations can then be put into idioms that have their own distinct meanings. So the reader or listener has to sort out all the meanings of a relatively small number of verbs combined with an equally small number of particles.

I tend to argue that fewer lemmas can make for a more difficult text. But I would prefer to look at the bigger picture and ask the fundamental question: what makes something (a text or a recording) difficult to understand?

Referring only to spoken language, factors of difficulty include: the accent, the speaking rate, the nature of spoken language, sentence complexity, idioms and metaphorical usage, technical terms, dialect words, proper nouns, cultural and historical references, and less frequent words.

I see all this playing out when I listen to recordings with my Mexican Spanish tutor and discover whole levels of meaning that I never imagined existed. For example, geographical references tend to be meaningless to me and I have to look them up on a map. But there is more to it than just a name on a map. All of this is to say that measuring difficulty by the number of lemmas for a given level of coverage is, in my opinion, very inaccurate.
2 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: A visualisation of all words in English language podcasts

Postby ryanheise » Thu Aug 11, 2022 11:27 am

s_allard wrote:is this idea that since 7 lemmas in the podcast corpus cover 25% all language, the podcasts should be easier to understand than more general language where in the two corpora presented here the number of lemmas for 25% coverage is 10 and 13 respectively.


To clarify: Not just the 25% coverage point, but the 50%, 75%, 90% and 98% coverage points are all significantly skewed toward more common words. (Another data point: A vocabulary of 1028 words will on average give you 90% coverage.) But let's move on to your point.

The idea here is that a text with fewer lemmas is easier to understand than a text with more lemmas, assuming texts of similar token length.


To clarify: my statement was about the skew of the entire corpus, not about the number of lemmas in a given text. You have inadvertently set up a straw man, but let me help you turn it into something stronger. What you should say is not that "fewer lemmas in a text is easier", but rather that "lemmas that are found more frequently in the corpus are probably easier".

But anyway, let's move on to your point.

All of this to say that measuring difficulty by the number of lemmas for a given level of coverage is in my opinion very inaccurate.


By the way, lexical coverage is not related to the number of lemmas in a text. It is simply this: If a sentence has 10 tokens and you know 9 of them, then the lexical coverage is 90%. It says nothing about how common a lemma is within that sentence/text, and therefore says nothing about the amount of repetition of lemmas within a text. Whether a text has high repetition or low repetition, if you simply know 90% of the tokens, the coverage is still 90%. If you are more likely to learn the more common words first, and the rare words last, then the determining factor for coverage is not how many lemmas, but rather, whether those lemmas are found more frequently in the corpus.
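A minimal sketch of this definition of lexical coverage (the sentence and vocabulary below are made up for illustration):

```python
def lexical_coverage(tokens, known_vocab):
    """Share of running tokens the learner knows. Note this is a
    property of tokens, not of distinct lemmas: a highly repetitive
    text and a varied one can score identically."""
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in known_vocab)
    return known / len(tokens)

# Made-up sentence and vocabulary: 9 of the 10 tokens are known.
sentence = "the dog saw the cat and the cat saw it".split()
vocab = {"the", "dog", "cat", "and", "saw"}
print(lexical_coverage(sentence, vocab))  # 9/10 tokens known -> 0.9
```

Notice that "the" appears three times and "cat" twice, yet repetition plays no role in the score: only the token-by-token known/unknown tally matters.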

But anyway, let's move on to your point and let's take out the part about the number of lemmas in a text which is a misrepresentation, and instead go with something more general. If I understand correctly, you're saying that measuring difficulty by lexical coverage alone (or even more generally by "vocabulary alone") is in your opinion "very inaccurate".

Now, here is my response. How do you quantify "very"? Is it 100% inaccurate? 50% inaccurate? Or 25% inaccurate? If it is 25% inaccurate, that is not totally useless. On the contrary, it is immensely useful because such a simple proxy for difficulty can be computed at large scale, say, in a search engine that indexes millions of podcasts and needs to sort them efficiently.
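As an illustration of such a large-scale proxy, here is one possible scoring function. This is a hypothetical sketch, not the metric actually used by any search engine discussed in this thread: it scores a transcript by the mean log frequency rank of its tokens against a corpus-wide ranking.

```python
import math

def difficulty_score(tokens, corpus_rank, default_rank=100_000):
    """Crude difficulty proxy: mean log frequency rank of a transcript's
    tokens against a corpus-wide ranking (1 = most frequent lemma).
    Lemmas unseen in the corpus get a pessimistic default rank.
    Higher score means rarer vocabulary, i.e. a (proxied) harder text."""
    ranks = [corpus_rank.get(tok, default_rank) for tok in tokens]
    return sum(math.log(r) for r in ranks) / len(ranks)

# Hypothetical ranks and transcripts, purely for illustration.
rank = {"the": 1, "be": 2, "dog": 500, "ontology": 90_000}
easy = ["the", "dog", "be", "the"]
hard = ["the", "ontology", "be", "ontology"]
assert difficulty_score(easy, rank) < difficulty_score(hard, rank)
```

Because it is a single pass over the tokens, a score like this can be precomputed for millions of episodes and used as a cheap sort key, which is exactly the point about imperfect-but-scalable proxies.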

You point out that there is a long list of other factors which one might consider to give a more accurate assessment of difficulty, but how significant is each of those factors? How well does each factor alone predict difficulty? If it turns out that all factors have an equal weight of importance, then yes, that would be one scenario in which you are right. If there are 10 factors to consider and you only take into account one of them, then you're only going to be 10% accurate.

But let's have some science.

Schumacher, Eskenazi, Frishkoff and Collins-Thompson (2016). "Predicting the Relative Difficulty of Single Sentences With and Without Surrounding Context". Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

The pairwise prediction results indicate that a large proportion of the crowdsourced pair orderings can be decided using vocabulary features, due to the strong performance of the Age of Acquisition features. To identify the relative importance of vocabulary and syntax in our data, we reviewed each pair and judged whether the sentence's syntax or vocabulary, or the combination of both, were needed to correctly predict the more difficult sentence. For many pairs, either syntax or vocabulary could be used to correctly predict the more difficult sentence since each factor indicated the same sentence was more difficult. We found that 19% of pairs had only a vocabulary distinction, and 65% of pairs could be judged correctly either by vocabulary or syntax. Therefore, 84% of pairs could be judged using vocabulary, which explains the high performance of the Age of Acquisition features.
3 x

User avatar
luke
Brown Belt
Posts: 1243
Joined: Fri Aug 07, 2015 9:09 pm
Languages: English (N). Spanish (intermediate), Esperanto (B1), French (intermediate but rusting)
Language Log: https://forum.language-learners.org/vie ... 15&t=16948
x 3631

Re: A visualisation of all words in English language podcasts

Postby luke » Thu Aug 11, 2022 1:47 pm

I'm happy to see you back and on the mend too.

SpanishInput did an interesting video exploring the notion that learning words with greater contextual diversity can help more than learning merely frequent words. That is, "common" words can help learners even more than "frequent" words.

Pictures are helpful. SpanishInput provides some helpful diagrams around 2:00 into his video:
2 x
: 124 / 124 Cien años de soledad 20x
: 5479 / 5500 5500 pages - Reading
: 51 / 55 FSI Basic Spanish 3x
: 309 / 506 Camino a Macondo

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: A visualisation of all words in English language podcasts

Postby ryanheise » Thu Aug 11, 2022 3:08 pm

Thanks (although I'm not actually healed; these are chronic conditions with no cure, but I am at least able to work at a computer for longer than five minutes at a time now, which is a game changer for both my productivity and my mood. Next, I just need to recover my mobility.)

To avoid confusion between similar words: I am actually using "common" and "frequent" as synonyms, so the distinction is between whether words are frequent within a particular text vs frequent within the whole corpus.

(This thread is about the corpus as a whole, while the previous thread was about individual texts, which actually entails looking at both levels of frequency.)

I don't disagree with either point about words that are repeated within a text. As you pointed out, Luke, when the same word is repeated within a text in different contexts, I tend to agree that this is a boost to the learning process, because each context helps to strengthen your memory by adding another connection and bond. And as you said, s_allard, I also agree that the exception is prepositions, and homonyms in general, where each time you hear the word it could actually become harder to remember, because you would be trying to squeeze multiple meanings into the same sound box in your brain. So there's an upside and a downside.

But that's not something I lose sleep over (unless I happen to get dragged into a debate where I literally lose sleep over it :-) ) because they are a universal constant. It is easy to demonstrate that they are a constant: on average, there are just as many prepositions in advanced material as in beginner material. And if they are a constant by the law of averages, they can be eliminated from the equation.

I don't know enough about Spanish to know whether the Spanish "idiom" problem is the same sort of universal constant. It might not be, but I don't have the capacity right now to find out, which is why I am not currently building a similar corpus for Spanish. If, on the other hand, someone can answer these questions about Spanish for me, I wouldn't mind doing this work for Spanish too. Even if it turns out that the meanings of the component words of an idiom only help you guess a rough meaning of the idiom as a whole, that might be good enough for me to try these methods out on Spanish. If I ever get some free time in the future, I might go ahead with it anyway and let people judge for themselves whether my analysis produced sensible results. If Spanish is too problematic, I might try Italian next.

Every language has its own unique features that need their own special attention and solutions. With German, for example, it is the compound noun problem (so for the above analysis, I actually split the compound nouns to get a more meaningful and useful analysis for my purposes). For Japanese, it is the problem that there are several different ways to write the same word: in hiragana vs in kanji, and even within kanji there are sometimes multiple ways to write the same word (so for the above analysis, I converted all words into hiragana to get a more useful analysis).
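For the German case, a compound splitter can be sketched as a greedy dictionary match. This is a toy illustration with a hypothetical four-word lexicon, not the author's actual pipeline: a real splitter would also handle linking elements (Arbeit-s-markt), recurse into more than two parts, and use corpus frequencies to pick among candidate splits.

```python
def split_compound(word, lexicon, min_part=3):
    """Greedy longest-head split of a German compound into two known
    parts; returns the word unsplit if no valid split is found."""
    w = word.lower()
    if w in lexicon or len(w) < 2 * min_part:
        return [w]
    # Try the longest possible head first.
    for cut in range(len(w) - min_part, min_part - 1, -1):
        head, tail = w[:cut], w[cut:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [w]

lexicon = {"hand", "schuh", "haus", "tür"}
print(split_compound("Handschuh", lexicon))  # -> ['hand', 'schuh']
```

Splitting Handschuh into hand + schuh before counting is what stops each one-off compound from registering as a brand-new "rare lemma" in the frequency analysis.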
3 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: A visualisation of all words in English language podcasts

Postby ryanheise » Thu Aug 11, 2022 9:52 pm

Hi all, I didn't expect to be here tonight, but I am suddenly in hospital. I am sorry if I am unresponsive from this point on. Hopefully they can get me back to normal and I can return home; even then, I think I may limit my posts somewhat.
2 x

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305

Re: A visualisation of all words in English language podcasts

Postby s_allard » Thu Aug 11, 2022 10:27 pm

Sorry to hear the bad news. Get well soon.

My bold in the following quote (S. Allard)

ryanheise wrote:...
There have been similar studies done on the Oxford English Corpus, but I found it interesting that these numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be easier to understand than general language. As one point of comparison:

* In the podcast corpus, 25% of all language is covered by 7 lemmas
* In the Oxford English Corpus, 25% of all language is covered by 10 lemmas
* In the COW English corpus, 25% of all language is covered by 13 lemmas

(News headline: "Learn these 7 words to understand 25% of all language!" ;-) It's actually understanding sentences that matters.)


There is a lot to unpack here. Note that the passage in bold above says "easier to understand". This is important and the cause of much confusion here. For example, the academic article referred to and quoted from as some form of scientific foundation for the observations here is actually quite interesting once one gets through the typical gobbledygook. Here is the first paragraph of the article.

The reading difficulty, or readability, of a text is an estimate of linguistic complexity and is typically based on lexical and syntactic features, such as text length, word frequency, and grammatical complexity (Collins-Thompson and Callan, 2004; Schwarm and Ostendorf, 2005; Kidwell et al., 2011; Kanungo and Orr, 2009). Such estimates are often expressed as age- or grade-level measures and are useful for a range of educational and research applications. For example, instructors often wish to select stories or books that are appropriately matched to student grade level.

Notice the researchers use the term readability and do not refer to comprehension or understanding. In fact, unless I’m mistaken these two latter words are never used in the article. And rightly so because this study has nothing to do with understanding.

This is the problem we have here. First of all the OP refers to podcasts when in fact they are referring to transcripts of podcasts. In other words, they are looking at the readability of podcast transcripts.

On the other hand, if one is interested in understanding the spoken language of podcasts, one should develop methodologies that deal with the spoken word. Just as we have readability for written texts, we should have "listenability" for spoken language. To use transcripts as proxies for the spoken language and analyze them like any written text is a fundamental methodological error.

I introduced a number of factors that I believe enter into the difficulty of understanding spoken language such as we may find in podcasts. Can we measure the relative importance of each of these factors? It’s a good question, and I haven’t given it much thought.

My thinking is that anything that prevents the parsing of the language is a clear impediment to understanding. For example, just before writing this text I had to deal by telephone with a credit card support person from India. I had great difficulty understanding the person, and it took some time for me to get my problem resolved.

For an example of accent in podcasts, here is a link to podcasts by an Australian comedy duo Hamish and Andy.

https://www.listnr.com/podcasts/hamish-and-andy/hamish-andy-2018-ep-1

(You can listen to the first few seconds without signing up). When I listen to these podcasts first of all I have a very hard time figuring out what is being said. And then I don’t really get the humour. What is the problem? Where is the difficulty?

Obviously the difficulty lies with me. I am not used to the accent or to Australian culture. I simply cannot make out many of the words used. Of course, a transcript of the recording would help a lot, but the whole point is that the first challenge of understanding spoken language is simply figuring out what words are being used.

What about proper nouns such as the names of people, places, historical events and cultural artefacts? Let's say you hear a reference to a proper noun that you don't know. Can we simply say that it is just one word out of 200, and so represents 0.5% less understanding? Of course not.

Where I live, Québécois French speakers will say that they often have difficulty understanding films from France, and the same is true for the French watching Québécois films. The problem is of course the cultural references, and sometimes the sheer language itself. Something like swearing is totally different in the two countries.

In summary, reducing listening difficulty, or listenability, to analyzing word frequencies of podcast transcripts makes for colourful and pretty graphics that tell us nothing George Zipf had not discussed nearly 100 years ago.
1 x

Cainntear
Black Belt - 3rd Dan
Posts: 3469
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8665
Contact:

Re: A visualisation of all words in English language podcasts

Postby Cainntear » Fri Aug 12, 2022 3:03 pm

ryanheise wrote:...

Was just looking at your profile pic and tried to work out whether it was supposed to represent "Be the and I" or "Der ich und sein"...? :lol: :lol:
0 x

User avatar
ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681
Contact:

Re: A visualisation of all words in English language podcasts

Postby ryanheise » Fri Aug 12, 2022 3:26 pm

To update: I am back home now. Basically, my heartbeat went all over the place, unfortunately with atrial fibrillation, and I got no sleep during the overnight hospital stay. Please understand if I try to avoid debate-oriented comments. It is obviously difficult not to want to reply when you say something like I haven't contributed anything beyond what Zipf did 100 years ago. But I don't want you to get the wrong idea that I'm rudely ignoring you; it's just that I need to avoid certain risks that may cause me stress, since my heart seems to act up when I'm not perfectly calm.

Although, I will add some comments below (maybe the last time, though, then I would really like to move to a more productive discussion about how to improve the system rather than how to knock the system down.)

s_allard wrote:
ryanheise wrote:I found it interesting that these numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be easier to understand than general language.


There is a lot to unpack here. Note that the passage in bold above says “ easier to understand”. This is important and the cause of much confusion here. For example, that academic article referred to and quoted from as some form of scientific foundation for the observations here


To clarify, I was not using that conference paper to support the above conclusion. You were the one who actually brought up the old thread topic of assessing text difficulty, talking about how we need to consider a long list of factors, and I used this conference paper to show that it's not that simple, and the weight of each factor is critical. (Something you are yet to resolve in your own model.)

I base the above quoted statement not on that conference paper, but rather on the fact that there is a correlation between vocabulary and understandability. Let's say that for a document D, your understanding is X% and your vocabulary is V. Now, as we gradually make V smaller, it is likely that X will go down. At a corpus level of analysis, where I am observing a shift in the skew to the left, we're not interested in a super accurate measurement of an individual document, because things will average out over massive numbers of documents. You would need to come up with a very contrived podcast corpus for it to defy the averages! Imagine what corpus you would have to conceive of for it to be true that, over such a massive number of documents, documents become easier to understand the more rare words you introduce. Yes, it's possible to come up with an artificial corpus that exhibits this property, but it would not be true of actual language.

On the other hand, if one is interested in understanding the spoken language of podcasts, one should develop methodologies that deal with the spoken word. Just as we have readability for written texts, we should have "listenability" for spoken language. To use transcripts as proxies for the spoken language and analyze them like any written text is a fundamental methodological error.


A lot of this is arbitrary too, in relation to how one would choose to break down the factors upon which to filter in a search engine. If you were only going to have a single factor, say "listenability", then sure, you might incorporate these multiple factors into one. But when I first started trying to build a database of Japanese podcasts for learners years ago, I had conceived of these factors as separate dials on the search form. So, for example, you would be able to select a point on the difficulty scale, and then independently select a point on the speed scale.

This is because some learners who are quite good at listening may naturally be able to cope with higher speeds with ease, and speed won't hold them back as much as the actual difficulty of the language itself. Such a learner will bump the speed threshold up to the max, but then choose an appropriate level of difficulty for the language. In this model, the language is a separate thing from the speed. But if you wanted to merge them, you could do that in your own model if you ever decided to build a search engine yourself. It's an arbitrary choice; I have just chosen to do it this way. It has the added benefit that I can add features over time without needing to modify any existing metrics.

For an example of accent in podcasts, here is a link to podcasts by an Australian comedy duo Hamish and Andy.

https://www.listnr.com/podcasts/hamish-and-andy/hamish-andy-2018-ep-1

(You can listen to the first few seconds without signing up). When I listen to these podcasts first of all I have a very hard time figuring out what is being said. And then I don’t really get the humour. What is the problem? Where is the difficulty?

Obviously the difficulty lies with me. I am not used to the accent and to Australian culture. I simply cannot make out many of the words used. Of course if I had a transcript of the recording that would help a lot but the whole point is that the first challenge of understanding spoken language is simply figuring out what words are used.

What about proper nouns such as the names of people, places, historical events and cultural artefacts? Let's say you hear a reference to a proper noun that you don't know. Can we simply say that it is just one word out of 200, and so represents 0.5% less understanding? Of course not.


These are points that have been discussed in the previous thread, so I'll defer to that. The topic of this thread is different.

Cainntear wrote:
ryanheise wrote:...

Was just looking at your profile pic and tried to work out whether it was supposed to represent "Be the and I" or "Der ich und sein"...? :lol: :lol:


Ah, I get asked about that a lot, actually. Can anybody guess?
3 x

