low-frequency words that are unexpectedly frequent

StringerBell
Brown Belt
Posts: 1035
Joined: Mon Jul 23, 2018 3:30 am
Languages: English (n)
Italian
x 3289


Postby StringerBell » Sat Aug 11, 2018 9:13 pm

When I started learning Italian and Polish, I decided to focus on high-frequency words first and leave less frequent words for later. There were a few words that I initially decided to ignore because I figured they were a little too "specific" or infrequent, but then I was surprised to see how often they recurred in a variety of sources, so I ended up learning them without even trying.

For Italian, one example is il campanile (the bell tower).
For some reason, this word kept coming up over and over...in the Veleno podcast serial, in a general conversation podcast between two people discussing a trip abroad where they climbed a bell tower, in a travel blog post, in a tv show...it seems like almost everything I listen to/watch manages to find a way to mention this word.

In Polish, one example is na brzegu (on the shore).
Somehow, everything I read and listen to always manages to involve being on the shore. I originally thought I would ignore this one for a while, but I've seen it so many times at this point that it's one of my best-remembered vocab words.

Which words have surprised you with their frequency?
1 x
Season 4 Lucifer Italian transcripts I created: https://learnanylanguage.fandom.com/wik ... ranscripts

jeff_lindqvist
Black Belt - 3rd Dan
Posts: 3153
Joined: Sun Aug 16, 2015 9:52 pm
Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk
Language Log: viewtopic.php?f=15&t=2773
x 10540

Re: low-frequency words that are unexpectedly frequent

Postby jeff_lindqvist » Sat Aug 11, 2018 10:03 pm

Isn't this an example of the Baader-Meinhof phenomenon?

Dog-owners suddenly see other dog-owners everywhere, first-time parents see other parents pushing strollers, owners of red cars see an increased number of red cars on the road... (so they say - none of this applies to me).

One thing that has occurred to me is that written sources sometimes seem to change over time. All of a sudden you see something in a book/CD booklet/movie that you haven't seen before, despite actively looking for it. Explanations which are suddenly there, new actors on cast lists/guest musicians on recordings appear (although you had it all memorized). When you're ready for the information. :?
9 x
Leabhair/Greannáin léite as Gaeilge: 9 / 18
Ar an seastán oíche: Oileán an Órchiste
Duolingo - finished trees: sp/ga/de/fr/pt/it
Finnish with extra pain : 100 / 100

Llorg Blog - Wiki - Discord

BalancingAct
Orange Belt
Posts: 117
Joined: Thu Jan 12, 2017 6:37 am
Languages: Mandarin, Cantonese, English (Prof.), French (Adv. - Prof.), Italian (Adv.), German (Adv. receptive), Spanish (Int. receptive)
x 182

Re: low-frequency words that are unexpectedly frequent

Postby BalancingAct » Sun Aug 12, 2018 12:05 am

StringerBell wrote:For Italian, one example is il campanile (the bell tower).
Which words have surprised you with their frequency?


I frequently meet "campanello" (bell). In fact, I have just come across it in "un campanello d'allarme dell'aumento di un generalizzato disagio sociale" (alarm bell). Just the other day I saw it used as "door bell". I had previously thought that it would only mean the kind of bell in a church bell tower.
1 x


Re: low-frequency words that are unexpectedly frequent

Postby StringerBell » Sun Aug 12, 2018 12:48 pm

jeff_lindqvist wrote:Isn't this an example of the Baader-Meinhof phenomenon?

Dog-owners suddenly see other dog-owners everywhere, first-time parents see other parents pushing strollers, owners of red cars see an increased number of red cars on the road... (so they say - none of this applies to me).

One thing that has occurred to me is that written sources sometimes seem to change over time. All of a sudden you see something in a book/CD booklet/movie that you haven't seen before, despite actively looking for it. Explanations which are suddenly there, new actors on cast lists/guest musicians on recordings appear (although you had it all memorized). When you're ready for the information. :?


No, this isn't what I'm talking about. I do experience this phenomenon when I spend time learning a new expression or idiom; then it feels like I hear it everywhere. What I'm talking about here is different. As a beginner or early intermediate learner, I am confronted with a certain number of highly specific words that I choose not to spend time learning or caring about, because I assume I won't come across them in the near future. Yet a handful of them repeat so consistently that I automatically remember them just from seeing them so often.
1 x


Re: low-frequency words that are unexpectedly frequent

Postby StringerBell » Sun Aug 12, 2018 1:05 pm

Hashimi wrote:Actually, I'm surprised by the contrary. For example, common words like "headache", "boyfriend", "airport", "bathroom", "motorbike", "underground", "midnight", "classroom", "bedroom", "girlfriend", "birthday", "timetable", "weekend", "upstairs", "suitcase", "motorcycle", "homework", "businessman", "website", and even "forever" are not in the list of the most frequent 25,000 words in the British National Corpus and the Corpus of Contemporary American English!


I would consider most of those words to be high frequency. I don't use any "official" high-frequency word lists; I basically make my own judgement call when I come across a word, based on how likely I think I am to see it again soon or want to use it myself.

Because I've never used language learning books or programs and am 100% self-directed, I am constantly making the decision on what I consider to be essential vocabulary words and whether I want to learn/remember them. And since I listen to a massive amount of native (or "near" native) material very early on, the words that tend to repeat are true high frequency words.

The fact that these words you listed don't appear on "official" high frequency lists probably means those lists are garbage, because some of the very first things I learned to say and then subsequently use on a regular basis (and also repeatedly hear and read) are words like "headache", "bathroom", "bedroom", "suitcase", "website", and "boyfriend/girlfriend".

So I'm not really talking about words that aren't on a high-frequency list but then appear a lot because they are actually useful everyday words. I mean words that seem too highly specific to come up a lot but then actually do resurface consistently. This would be different for every person, since it depends on the individual input a person is using (someone else might only see "bell tower" or "on the shore" once, or even never early on, if they are watching, listening to, and reading different stuff than I am).
0 x

Adrianslont
Blue Belt
Posts: 827
Joined: Sun Aug 16, 2015 10:39 am
Location: Australia
Languages: English (N), Learning Indonesian and French
x 1936

Re: low-frequency words that are unexpectedly frequent

Postby Adrianslont » Sun Aug 12, 2018 2:14 pm

Hashimi wrote:Actually, I'm surprised by the contrary. For example, common words like "headache", "boyfriend", "airport", "bathroom", "motorbike", "underground", "midnight", "classroom", "bedroom", "girlfriend", "birthday", "timetable", "weekend", "upstairs", "suitcase", "motorcycle", "homework", "businessman", "website", and even "forever" are not in the list of the most frequent 25,000 words in the British National Corpus and the Corpus of Contemporary American English!
Yes, these words are high frequency. I know I used the Indonesian equivalents of 17/20 of them when I was in Indonesia recently - and most of those were used multiple times over three weeks. And I guess my Indonesian vocabulary is only a few thousand words - wild guess.

However, I’m not surprised that they don’t appear in the top 25,000 of those two corpora because I know how the corpora are made - they are 80-90% written text, depending on which one - including a large chunk of academic texts, newspapers etc.

And the spoken texts used are from tv.

If you want to see a list of the words that people use on a day to day basis you would need to find a corpus that was compiled from recordings made by following people around in their day to day activities. This, of course, doesn’t happen because it’s such a huge undertaking - you would need to follow how many people to make a 40 million word corpus? For how long? And then transcribe it all. And then there are ethical and permission issues about recording people.

It’s relatively - I stress, relatively - trivial to make a corpus from the written word. And it’s actually very simple to make your own specialist corpus and frequency lists from written sources such as ebooks and web sites. Look for the AntConc software. And Ant’s other software.

I think there is some value in corpora and frequency lists for learners but they will always lack a large chunk of frequently used day to day vocabulary because of the reasons described above.
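As a rough illustration of the do-it-yourself frequency list mentioned above, here is a minimal Python sketch. It is not how AntConc works internally; the function name `frequency_list`, the crude regex tokenizer, and the sample sentence are all assumptions made up for this example.

```python
import re
from collections import Counter

def frequency_list(text, top_n=None):
    """Return (word, count) pairs, most frequent first."""
    # Crude tokenizer (an assumption): lowercase the text and keep runs of
    # letters, allowing an apostrophe inside a word such as "don't".
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(tokens).most_common(top_n)

# Made-up sample text; in practice you would read an ebook or saved web page.
sample = "On the shore we saw the bell tower. The bell tower was near the shore."

for word, count in frequency_list(sample, top_n=5):
    print(word, count)
```

To run it on your own material, replace `sample` with the contents of a plain-text file. The resulting list only reflects whatever sources you feed it, which is the same limitation as the big corpora.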
0 x

devilyoudont
Blue Belt
Posts: 571
Joined: Tue Jun 26, 2018 1:34 am
Location: Philadelphia
Languages: EN (N), EO (C), JA (B), ES (A)
Language Log: https://forum.language-learners.org/vie ... 15&t=16424
x 1829

Re: low-frequency words that are unexpectedly frequent

Postby devilyoudont » Sun Aug 12, 2018 3:23 pm

I wonder if there will one day be like an Amazon Alexa corpus haha. Just what it hears people saying around the house.
3 x


Re: low-frequency words that are unexpectedly frequent

Postby Adrianslont » Sun Aug 12, 2018 5:52 pm

devilyoudont wrote:I wonder if there will one day be like an Amazon Alexa corpus haha. Just what it hears people saying around the house.

My cousin has Alexa. She mainly swears at it. :lol: I don’t think it understands her particular British accent. :roll:
2 x

jonm
Orange Belt
Posts: 202
Joined: Mon Jun 04, 2018 10:06 pm
Location: Massachusetts, USA
Languages: English (N)
Spanish (adv.)
Bangla (int.)
French (passive)
Language Log: https://forum.language-learners.org/vie ... =15&t=9402
x 667

Re: low-frequency words that are unexpectedly frequent

Postby jonm » Sun Aug 12, 2018 9:50 pm

Hashimi wrote:
Hashimi wrote:Actually, I'm surprised by the contrary. For example, common words like "headache", "boyfriend", "airport", "bathroom", "motorbike", "underground", "midnight", "classroom", "bedroom", "girlfriend", "birthday", "timetable", "weekend", "upstairs", "suitcase", "motorcycle", "homework", "businessman", "website", and even "forever" are not in the list of the most frequent 25,000 words in the British National Corpus and the Corpus of Contemporary American English!


Now I understand why these common words are not on the list of the most frequent 25K words in the BNC-COCA. They are all considered as two-word words so they removed them from the list!

I wonder if this is just an issue with the interface you're using to search these corpora? The quickest way I know to look up multiple words in COCA is using the analyze texts feature on Wordandphrase.info (kind of a front end for COCA maintained by the same linguist, Mark Davies). Copy-pasting the word list above, it appears that all of the words in the list are in the top 25000. Motorbike is the least frequent, with a rank of 23938. The rest are in the top 10000, with six (weekend, classroom, airport, bedroom, forever, bathroom) in the top 3000.

You can also search the BNC here and COCA here, but I find that interface less useful, because as far as I know you can only look up one word at a time, and (again AFAIK) you only get the word's frequency in the corpus, not its rank relative to other words. Looks like most of the words are in the BNC though, with the exception of website. Motorbike is slightly less frequent in the BNC than in COCA, so it's possible that it's not in the top 25000 in that corpus, but I think the rest probably are.
0 x


Re: low-frequency words that are unexpectedly frequent

Postby jonm » Mon Aug 13, 2018 3:21 am

Very interesting. So it looks like the list that's being searched there is derived from the BNC and COCA but does things a little differently than its sources.

In the BNC and COCA, the verb forms work, worked, and working are combined into a single lemma, but there are separate lemmas for work as a noun or working as an adjective or the derived noun worker. In this combined list, as far as I can tell, those are all included in a single word family.
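To make the difference concrete, here is a small Python sketch that counts the same toy tokens once by lemma and once by word family. The tokens, part-of-speech tags, and both lookup tables are hypothetical and hand-made for illustration; they are not taken from the BNC, COCA, or the combined list.

```python
from collections import Counter

# Toy, hand-tagged tokens as (word form, part of speech); purely illustrative.
tokens = [
    ("work", "verb"), ("worked", "verb"), ("working", "verb"),
    ("work", "noun"), ("working", "adj"),
    ("worker", "noun"), ("workers", "noun"),
]

# Hypothetical lemma table: verb forms share a lemma, but the noun,
# the adjective, and the derived noun each get their own.
LEMMA = {
    ("work", "verb"): "work (verb)",
    ("worked", "verb"): "work (verb)",
    ("working", "verb"): "work (verb)",
    ("work", "noun"): "work (noun)",
    ("working", "adj"): "working (adj)",
    ("worker", "noun"): "worker (noun)",
    ("workers", "noun"): "worker (noun)",
}

# Hypothetical word-family table: every form above falls into one family.
FAMILY = {form: "WORK" for form, _ in LEMMA}

lemma_counts = Counter(LEMMA[token] for token in tokens)
family_counts = Counter(FAMILY[form] for form, _ in tokens)

print(lemma_counts)   # four separate lemma entries
print(family_counts)  # a single family entry covering all seven tokens
```

With lemmas, worker keeps its own count; with families, it disappears into the total for the whole family.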

But then what do you do with a word like homework? Does it belong in the home family, the work family, or its own family? Looks like they've addressed that by putting "transparent compounds" such as homework in a separate list (which I don't see a way to search on that site) but including non-transparent compounds (network, say) in the main list.

My first thought was that the distinction between transparent and non-transparent compounds is pretty arbitrary, but I have to say, they've applied it pretty thoughtfully. For example, which of these words is a transparent compound: eyeball, eyelash, eyelid, eyebrow?

They say eyeball and eyelid are transparent, and eyelash and eyebrow aren't. And you can kind of see how ball and lid are being used in fairly prototypical ways in those compounds, whereas lash on its own means something like whip, so that's a bit further removed.

So I'd say they've done a good job making the distinction, but it still strikes me as arbitrary. Homework and housework are both counted as transparent, and I can see why, but I wouldn't say their meanings are totally predictable from the meanings of their parts. Their meanings could be switched, for example, and they would still seem reasonably transparent.

Anyway, I would guess that this list isn't really designed to tell you how common a specific word is. I don't think it's ideal for that. First, "transparent" compounds like homework aren't included (at least not when you search on that site). Second, if you want to know how common a word like stroller is, knowing how common the stroll family is doesn't really help. And third, with the BNC and COCA combined, you'd have no way of knowing that stroller is common in North American English but not in British English.

So for finding out how common a specific word is, I think one would be better off consulting the BNC or COCA (or Routledge, etc.) directly.

On the other hand, it seems like the purpose of this list might be to help learners decide what vocabulary to learn first, what gives you the most bang for your buck. Paul Nation, the linguist who compiled the list, is the author of a series of books along those lines. The list seems better suited for that.
2 x

