Re: Coldrainwater's German Log
Posted: Sat Nov 21, 2020 1:07 am
In keeping with promises, I would like to share a small amount of data that I collected and analyzed. My goal was to look at books that I judged to be at differing reading levels and determine non-lemmatized distinct word counts and also to examine sentence length, since I have found that to be a significant factor in driving reading difficulty. The data below should be relatively reliable. If I find inaccuracies, I will update.
Method: Starting with the full text from Calibre for each book, I split the text by distinct word and sentence. I inserted the sentences into a local SQL Server database, which made it really easy for me to run a few quick and practical queries. If this is helpful and there is interest, I may look at additional texts going forward, and modify the metrics should these prove inadequate to explain the data.
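For anyone who wants to try something similar without a database, here is a rough Python sketch of the splitting step. It is not the SQL Server setup described above, just one possible approximation. The function names, the naive rule of ending a sentence at `.`, `!` or `?` followed by whitespace, and the choice to ignore case when counting distinct words are all my assumptions; abbreviations and quoted dialogue would need more careful handling.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace (assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(sentence):
    """Return runs of letters (umlauts and ß included); digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", sentence)


def distinct_word_count(sentences):
    """Count distinct surface forms, not lemmatized; case is ignored (an assumption)."""
    return len({w.lower() for s in sentences for w in words(s)})


def sentence_lengths(sentences):
    """Word count for each sentence, in order."""
    return [len(words(s)) for s in sentences]


sample = "Der Hund bellt. Die Katze schläft! Der Hund schläft?"
sample_sentences = split_sentences(sample)
print(len(sample_sentences), distinct_word_count(sample_sentences))
```

Because nothing is lemmatized, inflected forms such as "schläft" and "schlafen" count as separate words, which matches the "not lemmatized" counts in the lists.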
In order of increasing difficulty:
Die Schwingen der Dunkelheit (Erikson)
About 20% of the sentences have 16 words or more.
Average sentence length: 10 words
20364 sentences analyzed.
0 sentences have over 100 words per sentence.
4% of sentences have 30 or more words.
17745 distinct words (not lemmatized)
Das Parfum (Süskind)
About 20% of the sentences have 30 words or more.
Average sentence length: 20 words
3708 sentences analyzed.
15 sentences have over 100 words per sentence.
21% of the sentences have 30 or more words.
12948 distinct words (not lemmatized)
Der Zauberberg (Mann)
About 20% of the sentences have 38 or more words.
Average sentence length: 25 words
12255 sentences analyzed.
177 sentences have over 100 words per sentence.
31% of sentences have 30 or more words.
33669 distinct words (not lemmatized)
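The per-book figures above could be reproduced from a plain list of per-sentence word counts roughly as follows. This is only a sketch in Python rather than the SQL queries actually used; the function name `sentence_stats`, the dictionary keys, and the way the "about 20%" threshold is picked (the smallest length that still keeps at least a fifth of the sentences at or above it) are my assumptions.

```python
import math


def sentence_stats(lengths):
    """Summarize a list of per-sentence word counts."""
    if not lengths:
        raise ValueError("need at least one sentence")
    n = len(lengths)
    ordered = sorted(lengths)
    # Length such that at least 20% of sentences are this long or longer.
    top_fifth = ordered[n - max(1, math.ceil(0.2 * n))]
    return {
        "sentences": n,
        "average": sum(lengths) / n,
        "top_fifth_threshold": top_fifth,
        "share_30_plus": sum(1 for x in lengths if x >= 30) / n,
        "over_100": sum(1 for x in lengths if x > 100),
    }


# Made-up word counts for illustration only, not data from any of the books.
example = [5, 10, 10, 15, 20, 30, 35, 40, 101, 120]
print(sentence_stats(example))
```

The "over 100" count uses a strict comparison while the "30 or more" share is inclusive, mirroring the wording in the lists.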