A visualisation of all words in English language podcasts

s_allard
Blue Belt
Posts: 969
Joined: Sat Jul 25, 2015 3:01 pm
Location: Canada
Languages: French (N), English (N), Spanish (C2 Cert.), German (B2 Cert)
x 2305


Post by s_allard » Sun Aug 14, 2022 4:43 am

So let’s put the past behind us and look at how ryanheise’s system can be improved. Despite earlier misgivings, I’m willing to accept that we now have an algorithm that can rank podcast transcripts from easiest to hardest to understand. Furthermore, we now have a set of 40,000 English-language podcasts organized by order of difficulty. This could be of great use to students and teachers of a language.

In order to improve this system, my suggestion is to first revisit the fundamental question: what makes a podcast difficult to understand? Obviously, vocabulary is key, and ryanheise has pursued this angle at great length. But there is more to this than vocabulary.

In fact, we don’t have to go far to answer this question. Everybody here is confronted with the problem of not adequately understanding the spoken target language. In reality, the problem is not that the utterances are too difficult to understand; it’s that we do not yet have the level of proficiency necessary to understand the target language. As our linguistic skills improve, we obviously find things easier. The target language hasn’t changed. We have changed.

So what things can be done to make sure we have material appropriate for our level? I humbly submit two features that I think can make a huge difference and maybe could be incorporated into ryanheise’s algorithm. The first is speaking rate. Slower speech is definitely easier to decode than faster speech for obvious reasons. So let’s say we had a classification something like: very slow, slow, medium, fast native speed.
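To make the speaking-rate idea concrete, here is a minimal sketch in Python of how a transcript could be placed into the four bands suggested above. It assumes we know each episode's transcript and its duration in seconds. The function names and the words-per-minute thresholds are hypothetical, picked only for illustration, and this is not a description of how ryanheise's system measures speed.

```python
# Hypothetical upper bounds (words per minute) for each band.
# These cut-offs are illustrative assumptions, not measured values.
RATE_BANDS = [
    (100.0, "very slow"),
    (130.0, "slow"),
    (160.0, "medium"),
]


def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Average speaking rate: whitespace-separated words per minute."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    word_count = len(transcript.split())
    return word_count * 60.0 / duration_seconds


def classify_rate(wpm: float) -> str:
    """Return the first band whose upper bound is above wpm."""
    for upper_bound, label in RATE_BANDS:
        if wpm < upper_bound:
            return label
    # Anything at or above the last bound falls in the top band.
    return "fast native speed"
```

An average over the whole episode hides pauses, music and changes of speaker, so a real measure would probably need to look at shorter stretches of speech.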

The second thing is articulation. Clearly articulated speech, especially when combined with a slow speaking rate, makes a huge difference. We could imagine some system of classification like: clear, normal and less clear.

To get an idea of how all this works in practice, one only has to look at the site of Dreaming Spanish that I mentioned in a previous post. One sees easily that the videos are clearly differentiated by rate of speaking and clarity of articulation.

There are other things that could be discussed, like the presence of proper nouns, idioms and abstractions, but I think that with just the two elements I have proposed, ryanheise’s system would be more useful.

ryanheise
Green Belt
Posts: 459
Joined: Tue Jun 04, 2019 3:13 pm
Location: Australia
Languages: English (N), Japanese (beginner)
x 1681


Post by ryanheise » Sun Aug 14, 2022 9:55 am

s_allard, thank you for the change in tone. I appreciate it more than you can imagine.

I think if there were a way to filter for articulation, it would have to be by user-submitted ratings. This approach could also incorporate all of the other features that are not easy to analyse by machine, such as whether a podcast is an audio book, or a conversation, or a monologue.

Speed is already implemented as a separate attribute.

Proper nouns are currently excluded from analysis, by conveniently adopting Paul Nation's assumption that proper nouns have a minimal learning burden (though I'm aware that it's debatable). Maybe I am secretly just glad that excluding proper nouns from the equation conveniently saves on disk space and processing time, which is expensive for this project. I guess some day I could have an advanced filter option listing all of these extra things, such as the percentage of proper nouns.
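For a rough idea of what a "percentage of proper nouns" figure might look like, here is a small Python sketch. It relies on a crude capitalisation heuristic (a capitalised word that is not the first word of its sentence counts as a candidate), which is an assumption made purely for illustration. It is not real proper-noun tagging and not how the site works. The function name is hypothetical.

```python
import re


def proper_noun_percentage(transcript: str) -> float:
    """Rough share of words that look like proper nouns, in percent.

    Heuristic (an assumption, not real tagging): a word counts as a
    candidate if it starts with a capital letter, is not the first word
    of its sentence, and is not the pronoun "I" or one of its contractions.
    """
    total = 0
    candidates = 0
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z'’-]*", sentence)
        total += len(words)
        for word in words[1:]:  # skip the sentence-initial word
            if re.fullmatch(r"I(['’]\w+)?", word):
                continue  # "I", "I'm", "I'll" are not proper nouns
            if word[0].isupper():
                candidates += 1
    return 100.0 * candidates / total if total else 0.0
```

This misses proper nouns at the start of a sentence and over-counts capitalised titles or acronyms, so the number is only good for a quick comparison between episodes.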

Though, keep in mind that in a page of search results, you can click the play button, listen to a bit of it, and decide, against whatever other criteria you might have, whether it's really a good match for you. Even if there is no filter for proper nouns, you will still be able to just click on a few of the top results and make the judgement call yourself. Doing this for 69,000 episodes one by one is of course not feasible, but going through a list of the top 10 this way should be feasible, or is at least far superior to having to manually go through 69,000 episodes. When I use Google, I usually click on around 5 different results to open them up in new tabs, and then evaluate them all to pick which one best serves my purpose. The way that Google makes this more practical is to show snippets for each search result so that you can see at a glance whether it's really what you were looking for. I don't have snippets yet, but I feel I should include enough information in each search result for people to see at a glance whether it's relevant, how fast the speech is, how difficult the vocabulary is, and maybe(?) what percentage of proper nouns there are.

