Another wall of text incoming, look out below!
After my last post, I started thinking that OP was probably more interested in comparing the latest book with the previous two editions, both by Francisco Javier Antón Martinez (for which audio is still available for download), than with SWT; but since the latest edition and SWT were what I had used, I offered what I could. The data-lover in me then decided to see what could be discerned among the books from the data I could get my hands on, so I grabbed the text of the lessons, exercises, and revision dialogues (when applicable) for each and did some basic analysis. Looking at just the words can drastically undersell the value of the grammatical/syntactic/structural exposure, as well as of the notes and explanations, but it's what can most easily be done, and it does seem to point to a few things clearly enough.
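For anyone curious, the "basic analysis" amounts to little more than tokenizing each lesson and tallying running words, unique word forms, and new forms per lesson. A minimal sketch (the two lesson strings here are made up for illustration, not actual book text):

```python
import re

# Hypothetical lesson texts; the real analysis used the full text of the
# lessons, exercises, and revision dialogues from each book.
lessons = [
    "Alberto va a Paris. Va en tren.",
    "El tren llega a Paris por la mañana.",
]

def tokenize(text):
    # Lowercase and keep runs of letters (including accented ones)
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

seen = set()
total_words = 0
for i, lesson in enumerate(lessons, start=1):
    tokens = tokenize(lesson)
    new_forms = set(tokens) - seen
    seen |= set(tokens)
    total_words += len(tokens)
    print(f"Lesson {i}: {len(tokens)} words, {len(new_forms)} new word forms")

print(f"Total: {total_words} words, {len(seen)} unique word forms")
```

From those raw counts, all the per-lesson and per-100-words figures below are simple divisions.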
I looked at the four beginner-level Assimil courses for Spanish. The French originals (the earliest edition of each generation) were the following, with the title of the first chapter in parentheses, since that can be quite helpful, especially when trying to identify a translation or second-hand audio:
Code: Select all
A. Chérel - L'espagnol sans peine - 1934 (Alberto va a Paris)
3h audio
112 lessons, 96 lessons with text, 96 lessons with new vocab
19544 total words, 204 per lesson with text
4117 unique word forms, 43 new per lesson with new vocab, 21 new per 100 words
2641 headwords*, 28 new per lesson with new vocab, 14 new per 100 words
78% coverage of the top 1k most frequent words**
39% coverage of the top 5k most frequent words**
Code: Select all
F. J. Antón Martinez - Le Nouvel Espagnol Sans Peine - 1981 (Un encuentro)
3h30m audio
109 lessons, 94 lessons with text, 94 lessons with new vocab
15368 total words, 163 per lesson with text
3404 unique word forms, 36 new per lesson with new vocab, 22 new per 100 words
2183 headwords*, 23 new per lesson with new vocab, 14 new per 100 words
75% coverage of the top 1k most frequent words**
36% coverage of the top 5k most frequent words**
Code: Select all
F. J. Antón Martinez - L'Espagnol - 2004 (Un aperitivo)
2h30m audio
100 lessons, 100 lessons with text, 86 lessons with new vocab
9707 total words, 97 per lesson with text
2422 unique word forms, 28 new per lesson with new vocab, 25 new per 100 words
1719 headwords*, 20 new per lesson with new vocab, 18 new per 100 words
66% coverage of the top 1k most frequent words**
28% coverage of the top 5k most frequent words**
Code: Select all
J. Córdoba - L'Espagnol - 2017 (Qué sorpresa)
3h audio
100 lessons, 100 lessons with text, 86 lessons with new vocab
16373 total words, 164 per lesson with text
3321 unique word forms, 39 new per lesson with new vocab, 20 new per 100 words
2163 headwords*, 25 new per lesson with new vocab, 13 new per 100 words
76% coverage of the top 1k most frequent words**
36% coverage of the top 5k most frequent words**
*
Actually stems, as determined in Python by NLTK's SnowballStemmer for Spanish (not the original Porter stemmer algorithm). It's an imperfect but consistent way of approximating headwords, though lemmas would be better. It under-counts what a proper lemmatization process would find, but from a comprehension standpoint it may better represent individual lexical units of meaning: if one already understands the adjective form of a given root/stem, the meaning of any other part of speech derived from the same root should be much clearer, even if it would have its own headword in the dictionary. This metric gives those roots.
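As a sketch of what that stemming step looks like (the word list here is illustrative), distinct inflected forms collapse to a shared stem, and counting unique stems gives the headword approximation:

```python
from nltk.stem.snowball import SnowballStemmer

# Spanish Snowball stemmer: purely algorithmic, no downloaded data needed.
# Counting unique stems approximates headwords (an under-count versus
# true lemmatization, as noted above).
stemmer = SnowballStemmer("spanish")

forms = ["hablar", "hablando", "hablamos", "encuentro"]
stems = {stemmer.stem(f) for f in forms}
print(stems)  # the three verb forms collapse to a single stem
```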
**
Based on the 2018 OpenSubtitles dataset. It isn't perfect, but it reflects spoken language (admittedly not everybody's goal) better than, say, text scraped from the internet, newspapers, or digitized public domain literature (skewed strongly in both age and style). The main critique of this dataset that I have heard is that many of the subtitles are translations, which therefore don't represent native speech as well as would be ideal. I believe the greatest value here lies in the relative comparison among books, for which this corpus forms a sufficient basis; anyone who takes issue with its biases is free to ignore the derived statistics. While there does seem to be a correlation between text length and coverage, as one might presume, I still think this is a decent metric of how ready one might be for native material after finishing such a course. I'm actually a little surprised by how well SWT holds up in this metric, given its age and literary leaning versus a movie-dialogue database heavily skewed towards contemporary speech.
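The coverage figures themselves are just set intersections between a course's vocabulary and the frequency list. A toy sketch (both sets here are tiny, made-up stand-ins for the real top-1k list and course stem sets):

```python
# Hypothetical stand-ins: in the real analysis, top_1k would be the top
# 1,000 words from the 2018 OpenSubtitles Spanish frequency list, and
# course_vocab the set of stems appearing in a given course.
top_1k = {"que", "de", "no", "la", "el", "es", "y", "en", "lo", "un"}
course_vocab = {"que", "no", "el", "es", "encuentro", "aperitivo"}

# Coverage: what fraction of the frequency list the course exposes you to
coverage = len(top_1k & course_vocab) / len(top_1k)
print(f"{coverage:.0%} coverage of the top {len(top_1k)} words")
```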
I initially ran some time-based metrics (minutes per lesson, words per minute, etc.) but decided they couldn't be computed reliably without actually analyzing the audio itself, so I left that as a potentially collaborative exercise for later. In particular, the calculated average words per minute didn't match the listening experience at all, so without some sort of silence clipping/normalization, any such metric would be more misleading than helpful (and, of course, naively clipping all silence would be just as misleading, since it would destroy the natural rhythm of the speech anyway). Perhaps the main takeaway is that recording duration should mostly be ignored, but I hope the length of the target text in words is at least a decent indicator for whatever one may intend to glean from the audio stats.
One metric I calculated but didn't list for each book, because it was nearly identical across them, is what percentage of the text falls within the top 1k vs 5k most frequent words (roughly the inverse of the listed statistics, but a decent measure of how well the text keeps to a subset of the language): for each book, about 70% of the text is within the top 1k most frequent words (72% for the 1981 edition), and 85% is within the top 5k. I would also note that some people around here (and on YouTube) may overstate the need for any one piece of learning material to conform strictly to some frequency list, especially since all native material, once finally consumable, will by definition conform to the frequency distribution of native material (at least within a given type of communication). So while I'm not in that extreme camp, I believe there truly are benefits to a beginner-level resource (one for a learner who still needs hand-holding to handle native content) staying within a common pool of vocabulary, in which already-known high-frequency words are chosen over downright infrequent words that can much more easily be assimilated via native content once the learner has a sufficient base. (I will forever know how to say hunchback in Swedish. Thanks, Assimil! LOL) While I'm not aware of any existing data on the topic, I would be shocked if assimilating grammatical patterns with well-known words weren't far more efficient than assimilating them with brand-new words, or ones seen only once or twice before. There obviously needs to be a balance between a text's ability to expand the reader's vocabulary and its ability to build grammatical skill.
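That inverse metric is computed over running tokens rather than unique words, which is why repeated high-frequency words push it up. A toy sketch (token list and frequency set are made up):

```python
# Hypothetical stand-ins: tokens would be every running word of a course's
# text, and top_1k the OpenSubtitles top-1,000 list.
top_1k = {"el", "tren", "va", "a", "en"}
tokens = ["el", "tren", "va", "a", "paris", "en", "tren", "rapido"]

# Share of the running text that stays within the frequency list;
# repeated words count every time they occur.
in_list = sum(1 for t in tokens if t in top_1k)
share = in_list / len(tokens)
print(f"{share:.0%} of the text is within the top-1k list")
```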
One obvious conclusion is that the 2004 edition is by far the most meager of them all. While it may well make a fine introduction to the language in its own right (I'll leave that judgement to those who have gone through it), it's a much smaller introduction than any of the others offer, all else being equal. It's still impressive how many new words the author was able to squeeze in per 100 words of text; depending on how the book is used (the lessons are shorter and can therefore be repeated more times per 30-minute session), that could be a plus or a minus. [Edit: Among the remainder we see a range of vocabulary density, with SWT not only covering more vocabulary in total but also being denser in vocabulary per lesson (made more so by the fact that the lesson notes themselves introduce related vocabulary not contained in the text or exercises, and hence not included in these numbers).]
Obviously none of this speaks to the learning curve, the amount of grammar covered, or how enjoyable the material may be to use, but for those who focus on vocabulary and want to know how the courses compare, we actually have some numbers! I must say, though, that the most interesting statistics came when I looked at the three advanced-level Spanish courses, and then at how well each complemented the lower-level courses when their vocabulary was combined. Spoiler alert: the latest advanced course, from 2015 by David Tarradas, in combination with SWT gives shockingly good coverage of core and advanced vocabulary while exposing the reader to a wide range of language, from contemporary to more dated and from colloquial to more formal and literary...the best of both worlds! I won't derail this thread with an analysis of the advanced courses, though.
Edit: Upon running the numbers again with some tweaks, I realized the numbers previously listed as coverage of the 5k most frequent words were actually for the 20k most frequent words! Sorry about that. I also distinguished the "per lesson" statistics by whether they are against the number of lessons with text or the number of lessons with new vocabulary, to better estimate target-language reading and new-vocabulary loads in a given study session. While this doesn't account for the fact that revision dialogues are somewhat shorter than regular lesson texts, I figured it was still a more accurate representation. The raw base numbers remain the same, so the original values (or any others) can still be calculated if desired. With this update, calculated numbers are rounded more consistently, so there may be a few minor variances from the original post apart from the ones intentionally changed as per above. I also added a note about vocab density, given the new text-lesson calculations.