low-frequency words that are unexpectedly frequent

trippingly
White Belt
Posts: 17
Joined: Mon Jun 04, 2018 10:06 pm
Languages: English (N)
Spanish (~C1)
French (~B1)
German (beginner)
Hindi (beginner)
x 34

Re: low-frequency words that are unexpectedly frequent

Postby trippingly » Sun Aug 12, 2018 9:50 pm

Hashimi wrote:Actually, I'm surprised by the contrary. For example, common words like "headache", "boyfriend", "airport", "bathroom", "motorbike", "underground", "midnight", "classroom", "bedroom", "girlfriend", "birthday", "timetable", "weekend", "upstairs", "suitcase", "motorcycle", "homework", "businessman", "website", and even "forever" are not in the list of the most frequent 25,000 words in the British National Corpus and the Corpus of Contemporary American English!


Now I understand why these common words are not on the list of the most frequent 25K words in the BNC-COCA. They are all treated as two-word compounds, so they were removed from the list!

I wonder if this is just an issue with the interface you're using to search these corpora? The quickest way I know to look up multiple words in COCA is using the analyze texts feature on Wordandphrase.info (kind of a front end for COCA maintained by the same linguist, Mark Davies). Copy-pasting the word list above, it appears that all of the words in the list are in the top 25000. Motorbike is the least frequent, with a rank of 23938. The rest are in the top 10000, with six (weekend, classroom, airport, bedroom, forever, bathroom) in the top 3000.

You can also search the BNC here and COCA here, but I find that interface less useful, because as far as I know you can only look up one word at a time, and (again AFAIK) you only get the word's frequency in the corpus, not its rank relative to other words. Looks like most of the words are in the BNC though, with the exception of website. Motorbike is slightly less frequent in the BNC than in COCA, so it's possible that it's not in the top 25000 in that corpus, but I think the rest probably are.
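
To make the frequency-versus-rank point concrete, here is a minimal sketch of how a rank falls out of raw counts once you have them for the whole word list. The counts below are invented purely for illustration (they are not BNC or COCA figures), and `rank_words` is a hypothetical helper, not something either corpus interface provides.

```python
def rank_words(freqs):
    """Map each word to its rank, where rank 1 is the most frequent word.

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(freqs, key=lambda w: (-freqs[w], w))
    return {word: position for position, word in enumerate(ordered, start=1)}


# Invented counts, for illustration only; NOT real corpus data.
toy_counts = {"the": 1000, "weekend": 40, "airport": 35, "motorbike": 2}

ranks = rank_words(toy_counts)
for word in ("weekend", "motorbike"):
    print(word, "count:", toy_counts[word], "rank:", ranks[word])
```

The catch is that a rank only means something relative to the full list, which is why a single-word frequency lookup can't give it to you.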
0 x

Hashimi
Yellow Belt
Posts: 66
Joined: Sun Jan 10, 2016 12:45 pm
x 92

Re: low-frequency words that are unexpectedly frequent

Postby Hashimi » Sun Aug 12, 2018 10:14 pm

trippingly wrote:I wonder if this is just an issue with the interface you're using to search these corpora? The quickest way I know to look up multiple words in COCA is using the analyze texts feature on Wordandphrase.info (kind of a front end for COCA maintained by the same linguist, Mark Davies). Copy-pasting the word list above, it appears that all of the words in the list are in the top 25000. Motorbike is the least frequent, with a rank of 23938. The rest are in the top 10000, with six (weekend, classroom, airport, bedroom, forever, bathroom) in the top 3000.

You can also search the BNC here and COCA here, but I find that interface less useful, because as far as I know you can only look up one word at a time, and (again AFAIK) you only get the word's frequency in the corpus, not its rank relative to other words. Looks like most of the words are in the BNC though, with the exception of website. Motorbike is slightly less frequent in the BNC than in COCA, so it's possible that it's not in the top 25000 in that corpus, but I think the rest probably are.


Try this one:

https://www.lextutor.ca/vp/comp/

https://www.victoria.ac.nz/lals/about/s ... -lists.pdf
0 x

trippingly
White Belt
Posts: 17
Joined: Mon Jun 04, 2018 10:06 pm
Languages: English (N)
Spanish (~C1)
French (~B1)
German (beginner)
Hindi (beginner)
x 34

Re: low-frequency words that are unexpectedly frequent

Postby trippingly » Mon Aug 13, 2018 3:21 am

Very interesting. So it looks like the list that's being searched there is derived from the BNC and COCA but does things a little differently than its sources.

In the BNC and COCA, the verb forms work, worked, and working are combined into a single lemma, but there are separate lemmas for work as a noun or working as an adjective or the derived noun worker. In this combined list, as far as I can tell, those are all included in a single word family.
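
Here is a rough sketch of that difference, assuming a tiny hand-made mapping. The `LEMMA` and `FAMILY` tables and all the counts are hypothetical and made up for illustration; they are not taken from the BNC, COCA, or the combined list.

```python
from collections import defaultdict

# Hypothetical mapping from (form, part of speech) to a lemma label.
LEMMA = {
    ("work", "verb"): "work (v)",
    ("worked", "verb"): "work (v)",
    ("working", "verb"): "work (v)",
    ("work", "noun"): "work (n)",
    ("working", "adj"): "working (adj)",
    ("worker", "noun"): "worker (n)",
}

# Hypothetical mapping from each lemma to one word family.
FAMILY = {lemma: "work" for lemma in set(LEMMA.values())}


def group_counts(token_counts, key_fn):
    """Sum token counts under whatever grouping key key_fn returns."""
    totals = defaultdict(int)
    for token, count in token_counts.items():
        totals[key_fn(token)] += count
    return dict(totals)


def by_lemma(token):
    return LEMMA[token]


def by_family(token):
    return FAMILY[LEMMA[token]]


# Invented counts, for illustration only.
toy_counts = {
    ("work", "verb"): 5,
    ("worked", "verb"): 3,
    ("working", "verb"): 2,
    ("work", "noun"): 4,
    ("working", "adj"): 1,
    ("worker", "noun"): 2,
}

print(group_counts(toy_counts, by_lemma))   # four separate lemma entries
print(group_counts(toy_counts, by_family))  # one combined family entry
```

Grouping by family gives one big number for the headword, so the family will rank higher than any of its individual lemmas would on their own.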

But then what do you do with a word like homework? Does it belong in the home family, the work family, or its own family? Looks like they've addressed that by putting "transparent compounds" such as homework in a separate list (which I don't see a way to search on that site) but including non-transparent compounds (network, say) in the main list.

My first thought was that the distinction between transparent and non-transparent compounds is pretty arbitrary, but I have to say, they've applied it pretty thoughtfully. For example, which of these words is a transparent compound: eyeball, eyelash, eyelid, eyebrow?

They say eyeball and eyelid are transparent, and eyelash and eyebrow aren't. And you can kind of see how ball and lid are being used in fairly prototypical ways in those compounds, whereas lash on its own means something like whip, so that's a bit further removed.

So I'd say they've done a good job making the distinction, but it still strikes me as arbitrary. Homework and housework are both counted as transparent, and I can see why, but I wouldn't say their meanings are totally predictable from the meanings of their parts. Their meanings could be switched, for example, and they would still seem reasonably transparent.

Anyway, I would guess that this list isn't really designed to tell you how common a specific word is. I don't think it's ideal for that. First, "transparent" compounds like homework aren't included (at least not when you search on that site). Second, if you want to know how common a word like stroller is, knowing how common the stroll family is doesn't really help. And third, with the BNC and COCA combined, you'd have no way of knowing that stroller is common in North American English but not in British English.

So for finding out how common a specific word is, I think one would be better off consulting the BNC or COCA (or Routledge, etc.) directly.

On the other hand, it seems like the purpose of this list might be to help learners decide what vocabulary to learn first, what gives you the most bang for your buck. Paul Nation, the linguist who compiled the list, is the author of a series of books along those lines. The list seems better suited for that.
3 x

白田龍
White Belt
Posts: 30
Joined: Wed Mar 21, 2018 6:54 pm
Languages: English, Spanish, Portuguese, French, Persian, Arabic, Mandarin.
x 27

Re: low-frequency words that are unexpectedly frequent

Postby 白田龍 » Mon Aug 13, 2018 3:31 pm

General (balanced) frequency lists are of limited use other than for selecting a core (about 3000?) of commonly used words, because the frequency distribution of words varies greatly from author to author and from genre to genre. If you want to use frequency lists to guide your vocabulary learning, you should have the frequencies counted from a corpus that closely reflects the genres and registers you are being exposed to; otherwise you will miss frequent words and learn a lot of useless ones.
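
As a rough sketch of what counting from your own corpus could look like, assuming plain English text and a very naive tokenizer (lowercased runs of letters and apostrophes, no lemmatization). The two sample sentences and the function name `word_frequencies` are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count word forms across an iterable of text strings.

    Naive assumption: a word is any run of a-z letters or apostrophes
    after lowercasing. This will not work for non-Latin scripts.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Stand-in for the texts you actually read; invented sample sentences.
my_reading = [
    "The enzyme binds the substrate.",
    "The substrate is released.",
]

freqs = word_frequencies(my_reading)
print(freqs.most_common(3))
```

In practice you would feed it the articles, books, or subtitles you are really working through, and a proper tokenizer or lemmatizer for the language in question would be needed.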
0 x

StringerBell
Yellow Belt
Posts: 62
Joined: Mon Jul 23, 2018 3:30 am
Languages: English (n)
Italian: ~ intermediate
Polish : ~ lower intermediate
x 156

Re: low-frequency words that are unexpectedly frequent

Postby StringerBell » Mon Aug 13, 2018 4:17 pm

白田龍 wrote:General (balanced) frequency lists are of limited use other than for selecting a core (about 3000?) of commonly used words, because the frequency distribution of words varies greatly from author to author and from genre to genre. If you want to use frequency lists to guide your vocabulary learning, you should have the frequencies counted from a corpus that closely reflects the genres and registers you are being exposed to; otherwise you will miss frequent words and learn a lot of useless ones.


I can't tell if this is directed at me or someone else... If it was a response meant for me, then it doesn't apply, because I already mentioned that I don't use frequency lists; I decide for myself which words are likely frequent enough to be worth focusing on.
0 x
Polish 1st goal: 1100 hours : 714 / 1100
Italian 1st goal: 730 hours : 730 / 730 COMPLETED! YAY!
Italian 2nd goal: read 150 articles/blog posts in 3 months : 26 / 100

Iversen
Black Belt - 1st Dan
Posts: 1762
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 3568

Re: low-frequency words that are unexpectedly frequent

Postby Iversen » Mon Aug 13, 2018 11:32 pm

People who define "headache", "boyfriend", "airport" etc. as two-word combinations may have inhaled too much coca or something ... and their results couldn't be taken seriously if that's how they do their counting. There are combinations which definitely are two words, like "port authority". But then their parts are mostly pronounced as two words, with a weak stress on each word. But even allowing for a grey zone, the words on Hashimi's list aren't inside it. I seriously hope that trippingly is right, i.e. that the words are accepted on the frequency list, which would indicate that they are seen as single words.

Apart from that: in my world there are a lot of scientific terms which would never reach the 25,000-word threshold in a general corpus, but that's because I read too much about science...
2 x

