s_allard wrote:I get that ryanheise is trying to develop some system that could automatically grade podcasts according to levels of difficulty. Again, there is a question of terminology. If we speak of determining listening lexical difficulty or lexical listenability, I have no problem supporting this line of research. Just don’t call it difficulty of understanding.
This thread is specifically about vocabulary, as it says in the title that you were unhappy with a short while ago.
It was you who decided to talk about understanding in general with this statement:
s_allard wrote:But I would prefer to look at the bigger picture and ask the fundamental question : what makes something (text or recording) difficult to understand?
After that, ryanheise explicitly clarified that he was talking about vocabulary here.
You quoted ryanheise thusly to justify your change of topic (your emphasis):
ryanheise wrote:I found it interesting that these numbers above are significantly more skewed toward easier words, suggesting (not surprisingly) that podcasts may be *easier to understand* than general language.
Let me shift the emphasis and see what happens:
ryanheise wrote:I found it interesting that these numbers above are significantly more skewed toward easier words, *suggesting* (not surprisingly) that podcasts *may be* easier to understand than general language.
He did not present it as proof that podcasts are easier.
It is reasonable when working with data to propose hypotheses for further investigation, and ryanheise's plan is clearly to do more investigation based on more variables. Perhaps when he's finished he'll have demonstrated that podcasts are indeed simpler to understand than many other media, or maybe he'll find that increased simplicity in one measure corresponds to increased complexity in another and that it's a zero-sum game.
Neither you, nor I, nor he knows what his final conclusion will be, so it is unfair to attack him for not having proof of what has only ever been presented as a working hypothesis.
(Reordering points here for clarity)
s_allard wrote:I don’t have the answer but I suspect the results are not that clear. My doubts are based on my observation, albeit anecdotal, that the frequency distribution of the vocabulary of these videos does not change much from one level to the other. Instead, what is really striking is the differences in speaking rate, use of visual aids, articulation, grammatical complexity, and presence of idioms and metaphors.
And as ryanheise has said here and on previous occasions, he is not looking for a single measure, but multiple variables -- dials that an individual learner can tweak to customise what they're getting.
When he is constantly making explicit that the topic under discussion is only one of many factors, the recurring criticism from you that it is only one of many factors is utterly unjustified.
It shows a deep lack of respect when your criticism of someone rests on not having read, or not having understood, that they just said the very thing you are criticising them for not saying.
s_allard wrote:It just so happens that we have a data set that could be used to verify all of this. As has been mentioned in this forum there is an excellent website called Dreaming Spanish https://www.dreamingspanish.com/browse that offers hundreds of videos of Spanish for four levels of learners: superbeginners, beginners, intermediates, and advanced.
...
Using transcripts of these recordings and ryanheise’s approach of analyzing word frequency as a proxy, would we arrive at exactly the same classification of these recordings according to suitability for different learners?
This will not verify anything. Instead, it can be used to calibrate the system. And no, we would not arrive at exactly the same classification, because human judgement is not infallible, and sometimes there will be "intermediate" episodes that could just as easily have been part of the "advanced" series.
OK, so what do I mean by "calibrating" difficulty measures?
Well, first he establishes measures of multiple independent factors determining difficulty: speed, lexical diversity, lexical density, clause length, number of clauses in sentences, whatever.
Then he gives his system a pile of texts for which he has an expert difficulty rating (e.g. the Dreaming Spanish videos) and generates the individual measures.
Next, he feeds that into a basic algebraic solver which generates an equation like this:
a*speed + b*lex_div + c*lex_dens + ... = category *
The computer then substitutes in all the data for every single text provided and determines values for the coefficients a, b, c, etc., and the upper and lower bounds of each category, such that as many of the texts as possible end up in the categories they were given.
As I said, it will never give the exact same classification as a human, and that's OK.
(* in reality, the equation is more likely to be a1*2^speed + a2*speed^2 + a3*speed + a4*log(speed) + a5*(1/speed) + ..., but most of the coefficients will solve to effectively zero; that's by-the-by...)
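To make the calibration idea concrete, here is a minimal sketch in Python using numpy. Everything in it is invented for illustration: the feature names (speed, lexical diversity, lexical density), the numbers, and the choice of a plain least-squares fit with rounding to the nearest category. A real system would likely use a proper ordinal regression and far more data, but the shape of the procedure is the same: fit coefficients to expert-rated texts, then score new texts.

```python
# Illustrative sketch only: fit a*speed + b*lex_div + c*lex_dens + d ~ category
# against expert labels, then use the fitted coefficients to score a new text.
# All feature values and labels below are made up for the example.
import numpy as np

# One row per expert-rated text: [speed, lexical diversity, lexical density]
features = np.array([
    [2.1, 0.35, 0.40],   # rated superbeginner
    [2.4, 0.40, 0.45],   # rated beginner
    [3.0, 0.50, 0.55],   # rated intermediate
    [3.6, 0.60, 0.62],   # rated advanced
    [2.2, 0.38, 0.42],   # rated superbeginner
    [3.4, 0.58, 0.60],   # rated advanced
])
# Expert ratings encoded 0..3 (superbeginner..advanced)
labels = np.array([0, 1, 2, 3, 0, 3], dtype=float)

# Append a column of ones so the fit includes an intercept term,
# then solve for the coefficients by least squares.
X = np.hstack([features, np.ones((len(features), 1))])
coeffs, *_ = np.linalg.lstsq(X, labels, rcond=None)

# Score a new, unrated text and round its score to the nearest category.
new_text = np.array([2.8, 0.48, 0.52, 1.0])  # hypothetical measurements
score = new_text @ coeffs
category = int(np.clip(round(float(score)), 0, 3))
```

The nonlinear terms in the footnote above would just become extra columns of X (2^speed, speed^2, log(speed), and so on); the solver handles them the same way, and the unhelpful ones get coefficients near zero.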