How does German capitalization affect unique word count?

leosmith
Brown Belt
Posts: 1341
Joined: Thu Sep 29, 2016 10:06 pm
Location: Seattle
Languages: English (N)
Spanish (adv)
French (int)
German (int)
Japanese (int)
Korean (int)
Mandarin (int)
Portuguese (int)
Russian (int)
Swahili (int)
Tagalog (int)
Thai (int)
x 3098

How does German capitalization affect unique word count?

Postby leosmith » Wed Mar 17, 2021 2:25 am

I don't speak German, but I'm trying to figure out how to count unique German words in my reading tool.
1) We used to consider a capitalized word to be the same as a lower case word.
2) Recently we switched to a method that considers them to be separate words.
I assume that capitalization is required at the beginning of a sentence, which probably means neither of our methods was perfect. Is that correct? Which of the two would be more accurate? Do you have suggestions for alternate methods?
0 x
https://languagecrush.com/reading - try our free multi-language reading tool

白田龍
Orange Belt
Posts: 242
Joined: Wed Mar 21, 2018 6:54 pm
Languages: English, Portuguese, Spanish, Catalan, French, Persian, Arabic, Mandarin, Japanese.
x 444

Re: How does German capitalization affect unique word count?

Postby 白田龍 » Wed Mar 17, 2021 7:50 am

First count only the occurrences that do not start a sentence, then apply the capitalised/noncapitalised ratio to the word's sentence-initial count.

So if you found that word X is 97% noncapitalised in the middle of the sentence, for every 100 times you find it in sentence-initial position, you add 97 to x, and 3 to X.
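The ratio idea above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the sentence splitter is a naive regex, the helper names (`split_sentences`, `tokenize`, `ratio_counts`) are made up for this example, and a sentence-initial word with no mid-sentence evidence is simply counted as lowercase.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Runs of letters; [^\W\d_] matches Unicode letters such as ä, ö, ü, ß.
    return re.findall(r"[^\W\d_]+", sentence)

def ratio_counts(text):
    mid = Counter()      # case-sensitive counts for non-initial words
    initial = Counter()  # sentence-initial words, keyed by lowercase form
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        if not words:
            continue
        initial[words[0].lower()] += 1
        mid.update(words[1:])

    totals = {word: float(n) for word, n in mid.items()}
    for lower, n in initial.items():
        cap = lower[:1].upper() + lower[1:]
        seen = mid[lower] + mid[cap]
        # Fallback (an assumption): no mid-sentence evidence means lowercase.
        share_lower = mid[lower] / seen if seen else 1.0
        for form, amount in ((lower, n * share_lower), (cap, n * (1 - share_lower))):
            if amount > 0:
                totals[form] = totals.get(form, 0.0) + amount
    return totals
```

The totals come out fractional (for example 97 and 3 out of 100 sentence-initial hits), so a real tool would still have to decide how to round them or whether to keep them as estimates.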

But counting words in German is more complicated than that: there are cases, and agglutination... You'd be better off learning how to use some language-processing tools, but I can't help you there.
1 x

Pegasusangel
White Belt
Posts: 42
Joined: Wed Mar 10, 2021 9:04 am
Location: United states
Languages: Native:
English
Learning:
German, Icelandic
Maintaining:
Spanish, Japanese
Language Log: https://forum.language-learners.org/vie ... 15&t=16635
x 98

Re: How does German capitalization affect unique word count?

Postby Pegasusangel » Wed Mar 17, 2021 8:34 am

I'm not quite sure I understand the question, as I've never had to count German words. In German, though, what looks like one word can be two: Der Mann, for example, means (the man), or it can just be (man). Also, in German, nouns are capitalized. So Ich mag Katzen (I like cats) is how you would write it in German, or Der Bürgermeister ist in der Stadt (The mayor is in the city). I'm still not great at articles, so I may have written that wrong.
Now granted, I'm not the greatest in this language, and even with as much study as I've put in I could be wrong, but I would say that something like Das Mädchen (the girl) may only count as one word. I hope I understood correctly; anyone who needs to correct this or give a clearer explanation, please do. Much love!
1 x
alaart
Green Belt
Posts: 338
Joined: Sat Aug 03, 2019 6:58 am
Location: Kaoshiung
Languages: DE (N), EN
B1: NL, JP, PT (BR), ZH
A2: KR
A1: ES
Language Log: https://forum.language-learners.org/vie ... hp?t=10867
x 1027

Re: How does German capitalization affect unique word count?

Postby alaart » Wed Mar 17, 2021 8:47 am

Usually lower case words are always lower case, and upper case words (nouns and names) are always upper case. At the beginning of a sentence every word is written in upper case, but that doesn't change the meaning or the word.
There are words that exist in both lower and upper case and are in fact two different words, just as there are words that differ only by their article (e.g. das Band, die Band), but such words are not frequent.

I would not differentiate, in case of hybrid words that have both - both meanings could be listed, and the reader should probably be able to tell from context which is the right one. And since the noun is always in capital letters this is rather easy and you cannot confuse the two.

Examples: Der Vogel sagt: "Ich halle wenn ich in der Halle singe". (The bird says: When I sing in the hall there is echo). "Ich fische Fische." (I'm fishing fish)

If you want to use an algorithm, I'm not a programmer, but I guess that it should also be possible to predict the right word. I think spell correction in German in various software manages to do it reliably.
If you have a database for English that has vocabularies categorized into nouns and other words, then you could actually predict the noun word if the word is not in the sentence initial position. But that is all I can think of right now.
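As a rough sketch of that noun-lookup idea in Python: the noun set below is a tiny hypothetical stand-in for a real dictionary, and `normalize_initial` is a name made up for this illustration.

```python
# Hypothetical noun lexicon; a real tool would load this from a dictionary.
KNOWN_NOUNS = {"Halle", "Fische", "Schule", "Vogel"}

def normalize_initial(word, known_nouns=KNOWN_NOUNS):
    """Return the form to count for a sentence-initial word."""
    if word in known_nouns:
        # Listed as a noun: keep the capital letter.
        return word
    # Otherwise assume the capital only comes from the sentence start.
    return word[:1].lower() + word[1:]
```

A lookup like this still guesses wrong for hybrid words: a sentence-initial "Fische" could be the noun or a form of the verb, and only context can tell.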
2 x

Iversen
Black Belt - 4th Dan
Posts: 4768
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 14962

Re: How does German capitalization affect unique word count?

Postby Iversen » Wed Mar 17, 2021 9:27 am

Because there is always capitalization at the beginning of a sentence, you can't invert the rule that substantives are capitalized: in a concrete case a word after a full stop might be a substantive, but you can't be sure. The only safe conclusion is that a non-capitalized word isn't a substantive (at least in standardized texts); the inverse doesn't hold. In other words, you may compile a list of words that sometimes are and sometimes aren't capitalized and expect them to be two different words, but that will still be just a subset of the set of homonyms. Another subset would be substantives with different genders, but in a concrete text you might not be able to guess the gender except by using semantic criteria; for instance, the word could be in the plural, where there isn't any difference.

In my opinion you can just as well give up using capitalization as a criterion in your wordcounts. There simply isn't any foolproof non-semantic way to distinguish homonyms.
1 x

Doitsujin
Green Belt
Posts: 402
Joined: Sat Jul 18, 2015 6:21 pm
Languages: German (N)
x 801

Re: How does German capitalization affect unique word count?

Postby Doitsujin » Wed Mar 17, 2021 10:29 am

leosmith wrote: I assume that capitalization is required at the beginning of a sentence, which probably means neither of our methods was perfect.
Capitalization is indeed required at the beginning of German sentences. However, since German sentences rarely start with nouns, you could use the following algorithm:

1. Split German texts into sentences using the regex-based rules defined in this SRX* file. (Search for languagerulename="German".)
2. Change the first word of each sentence to lower case and remove punctuation characters.
3. Generate a case-sensitive word frequency list.

* Segmentation Rules eXchange (SRX) is a standard that was originally developed for CAT tools.
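
The three steps above might look something like this in Python. It is a minimal sketch, not the SRX-based approach itself: a simple regex stands in for the SRX segmentation rules (an assumption), and `frequency_list` is a name chosen for this example.

```python
import re
from collections import Counter

# Stand-in for SRX rules (an assumption): split after ., ! or ? plus whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Runs of Unicode letters; matching only letters also drops punctuation.
WORD = re.compile(r"[^\W\d_]+")

def frequency_list(text):
    counts = Counter()
    for sentence in SENTENCE_END.split(text.strip()):   # step 1
        words = WORD.findall(sentence)                   # step 2: no punctuation
        if not words:
            continue
        words[0] = words[0].lower()                      # step 2: first word
        counts.update(words)                             # step 3: case-sensitive
    return counts

counts = frequency_list(
    "Wir müssen nun zur Schule gehen. Nun müssen wir zur Schule gehen. "
    "Die Schule ist groß."
)
print(counts["wir"], counts["nun"], counts["Schule"])  # 2 2 3
```

A naive splitter like this will mis-split on abbreviations and will not treat the first word inside a quotation as sentence-initial, which is the kind of case proper segmentation rules are meant to handle. A noun that does start a sentence would also be counted in lowercase.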
2 x

Re: How does German capitalization affect unique word count?

Postby leosmith » Thu Mar 18, 2021 2:40 am

alaart wrote: Examples: Der Vogel sagt: "Ich halle wenn ich in der Halle singe". (The bird says: When I sing in the hall there is echo). "Ich fische Fische." (I'm fishing fish)

Thanks!
Der Vogel sagt: "Ich halle wenn ich in der Halle singe". = The correct count would be 9. We would get 8 words with method one, and 11 with method two.
Ich fische Fische. = The correct count would be 3. We would get 2 words with method one, and 3 with method two.
Assuming we are stuck with method one or two, which do you think would be more accurate overall?
(We aren't trying to count word families, so conjugations and such are all counted. For example, der and das are two unique words in our database.)
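For reference, the method-one and method-two numbers above can be reproduced with a few lines of Python. This is just an illustrative check, not our production code; the function names and the letter-only tokenizer are assumptions.

```python
import re

# Runs of Unicode letters, so punctuation and quotes are ignored.
WORD = re.compile(r"[^\W\d_]+")

def unique_case_insensitive(text):
    # Method one: "Halle" and "halle" are the same word.
    return len({w.lower() for w in WORD.findall(text)})

def unique_case_sensitive(text):
    # Method two: "Halle" and "halle" are separate words.
    return len(set(WORD.findall(text)))

examples = [
    'Der Vogel sagt: "Ich halle wenn ich in der Halle singe".',
    "Ich fische Fische.",
]
for sentence in examples:
    print(unique_case_insensitive(sentence), unique_case_sensitive(sentence))
# 8 11
# 2 3
```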
0 x

Re: How does German capitalization affect unique word count?

Postby alaart » Thu Mar 18, 2021 6:01 am

If you don't want to filter capitalization somewhat at the beginning of the sentence, you should definitely go with method one. Otherwise you would need to learn any word "double", just because of the capitalization rule. At some point in some book or situation any word will appear at the initial position. German word order is flexible (with changing nuances):

Wir müssen nun zur Schule gehen. (We have to go to school now)
Müssen wir nun zur Schule gehen? (Do we have to go to school now)
Nun müssen wir zur Schule gehen. (Now we have to go to school)
Zur Schule müssen wir nun gehen. (To the school is where we have to go)

And with some additional context:
Schule? Müssen wir nun zur Schule gehen? (School? Do we have to go to school now?)
Gehen? Ich will nicht zur Schule gehen. Ich will Bus fahren. (Going? I don't want to go to school. I want to take the bus.)

So, you could basically have any word at the initial sentence position.

Yes, you would miss the nuances of "halle" and "Halle" etc., but unless you are going to have an algorithm which can analyze the word type of the sentence-initial word, counting every capitalization as a second word is going to double the vocabulary database. For me that would potentially be annoying (hey, wait I know that word! - oh, where have I seen that word before, didn't I make some note?).
1 x

Re: How does German capitalization affect unique word count?

Postby leosmith » Fri Mar 19, 2021 12:10 am

alaart wrote: If you don't want to filter capitalization somewhat at the beginning of the sentence, you should definitely go with method one.

Thanks very much!
0 x

