Language chunks to ease language activation

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Language chunks to ease language activation

Postby reineke » Thu Mar 30, 2017 4:03 am

---
Last edited by reineke on Fri Dec 27, 2019 3:10 am, edited 2 times in total.
2 x

jeffers
Blue Belt
Posts: 848
Joined: Sat Aug 22, 2015 4:12 pm
Location: UK
Languages: Speaks: English (N), Hindi (A2-B1)

Learning: The above, plus French (A2-B1), German (A1), Ancient Greek (?), Sanskrit (beginner)
Language Log: https://forum.language-learners.org/vie ... 15&t=19785
x 2786

Re: Language chunks to ease language activation

Postby jeffers » Thu Mar 30, 2017 9:24 am

This thread made me wonder if it would be feasible to write a program which finds chunks in texts. The obvious use would be to create a frequency list of chunks for any language, and use it much like people do with vocabulary frequency lists.

The complex part would be to find pairs of words used more than a set threshold. You would have to do something like check every word against every word, a problem which at first glance is going to grow quadratically as the text file increases [O(n^2) for you computer science boffins]. Once you have pairs, finding chunks of 3, chunks of 4, etc., would be quick because each subsequent set will be a fractional subset of the previous.
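
If a "chunk" is taken to mean words that sit next to each other, the pair search may not need to compare every word with every other word: a single pass over the text with a hash table counts each adjacent pair, so the work grows roughly linearly with the length of the text. A minimal Python sketch under that assumption follows. The function name, the whitespace tokenizer and the threshold are purely illustrative. It also shows the pruning idea for 3-word chunks.

```python
from collections import Counter

def frequent_chunks(text, threshold=2):
    """Count adjacent word pairs, then triples built only on frequent pairs."""
    words = text.lower().split()  # naive whitespace tokenizer (an assumption)

    # One pass over the text: every adjacent pair is counted once.
    pair_counts = Counter(zip(words, words[1:]))
    pairs = {p: c for p, c in pair_counts.items() if c >= threshold}

    # A triple can only reach the threshold if both pairs inside it do,
    # so only triples whose two inner pairs survived are counted.
    triple_counts = Counter(
        t for t in zip(words, words[1:], words[2:])
        if t[:2] in pairs and t[1:] in pairs
    )
    triples = {t: c for t, c in triple_counts.items() if c >= threshold}
    return pairs, triples

sample = "two years ago we met and two years ago we left"
pairs, triples = frequent_chunks(sample)
print(pairs)
print(triples)
```

Memory use depends on the number of distinct pairs rather than on every possible combination of words, which is probably why this stays manageable even for large texts.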

There could be "heuristic" solutions (i.e. approximations which shorten the processing). Off the top of my head, you could find the most common pairs in several small to medium-sized texts, then only search for these pairs in a massive corpus. Another approach might be to only look for pairs where both words are among the 2000 (or any other arbitrary number) most frequent words in the corpus.
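
The second heuristic, keeping only pairs whose words are both among the most frequent words, could be sketched roughly like this in Python. The function name, `top_n` and `threshold` are hypothetical, and the sketch assumes the text has already been split into a list of words.

```python
from collections import Counter

def pairs_among_top_words(words, top_n=2000, threshold=2):
    """Count adjacent pairs whose two words are both among the top_n most frequent."""
    word_counts = Counter(words)
    top_words = {w for w, _ in word_counts.most_common(top_n)}

    # Pairs containing a rare word are never stored, which keeps the table small.
    pair_counts = Counter(
        (a, b) for a, b in zip(words, words[1:])
        if a in top_words and b in top_words
    )
    return [(p, c) for p, c in pair_counts.most_common() if c >= threshold]

words = "not only the cat but not only the dog and not only a zebra".split()
result = pairs_among_top_words(words, top_n=3, threshold=2)
print(result)
```

With adjacent pairs the saving is mostly in memory, since the text is still read once from start to finish.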

I feel a summer project coming on.......
3 x
Le mieux est l'ennemi du bien (roughly, the perfect is the enemy of the good)

French SC Books: 0 / 5000 (0/5000 pp)
French SC Films: 0 / 9000 (0/9000 mins)

DaveBee
Blue Belt
Posts: 952
Joined: Wed Nov 02, 2016 8:49 pm
Location: UK
Languages: English (native). French (studying).
Language Log: https://forum.language-learners.org/vie ... =15&t=7466
x 1386

Re: Language chunks to ease language activation

Postby DaveBee » Thu Mar 30, 2017 10:09 am

reineke wrote:
neofight78 wrote:
I've encountered the effects of ignoring Russian grammar and the results are more often than not awful. A purely lexical approach without grammar or vocabulary study doesn't work in all situations for all people.


The lexical approach is about studying "grammaticalized lexis" i.e. it does away with the dichotomy of grammar and vocabulary as separate study subjects.

Anyone actively studying lexical chunks cannot possibly be ignoring vocabulary. While grammar plays second fiddle, it is hardly ignored. Paying attention to lexical chunks could prove especially useful while studying a highly inflected language like Russian. Nobody's forcing students, teachers or self-learners to adopt this approach. What's more, the lexical approach is apparently largely ignored:

"A quick glance at any commercially available EFL textbook reveals that a traditional grammar syllabus, the main object of Lewis's attack, is still alive and kicking, albeit more cleverly disguised."

https://www.google.com/amp/s/amp.thegua ... revolution

The author of Polish for Dummies, Daria Gabryanczyk, recommends learning Polish phrases rather than isolated words:
So, what’s the best approach when it comes to learning Polish? Simply relax, take things easy, don’t worry if you can’t always get the endings right, don’t look too much ahead but just let yourself gradually dive into Polish, follow simple tips your teacher gives you during your classes, take every opportunity to speak Polish and listen to the language and keep repeating full phrases. In the case of the Polish language, learning isolated words is not a good idea, as you might not know how to put them together. Therefore, especially at the beginning, you should always focus on memorising full phrases as this is the best way to learn Polish. And you will soon notice that the more Polish phrases you already know the easier it becomes for you to take in more and more and in no time you will realise it’s actually not that difficult.
3 x

reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Language chunks to ease language activation

Postby reineke » Thu Mar 30, 2017 2:39 pm

---
Last edited by reineke on Fri Dec 27, 2019 3:11 am, edited 1 time in total.
1 x

tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: Language chunks to ease language activation

Postby tommus » Sat Apr 01, 2017 12:49 am

jeffers wrote:This thread made me wonder if it would be feasible to write a program which finds chunks in texts.

Excellent idea. You inspired me to put together what turned out to be a rather simple Java program to find chunks. Here are some Dutch results.

Corpus: 2 years of daily Dutch news from NOS Journaal
Number of words: over 2 million
Condition: words longer than 3 letters
Processing time: less than a minute
2-word chunks, occurring at least 10 times each: 5,118
3-word chunks, occurring at least 10 times each: 327
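
For anyone who wants to try something similar, here is a rough Python sketch of one way such a count could be done. It is not the Java program used for these results. It assumes that words of 3 letters or fewer are dropped before chunks are formed, and the tokenizer and function name are guesses for illustration only.

```python
import re
from collections import Counter

def chunk_frequencies(text, n, min_len=4, min_count=10):
    """Return (count, chunk) tuples for n-word chunks, most frequent first."""
    # Assumption: a "word" is a run of letters; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Assumption: short words are removed first, so a chunk can bridge them.
    words = [w for w in words if len(w) >= min_len]

    # Slide a window of n words across the filtered list.
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return [(c, chunk) for chunk, c in counts.most_common() if c >= min_count]

text = "Het nieuws van vandaag. " * 3 + "Twee jaar geleden was het niet meer zo. " * 2
for count, chunk in chunk_frequencies(text, 2, min_count=2):
    print(count, chunk)
```

Note that this simple version lets chunks run across sentence boundaries, so a real run would probably want to split on sentence-ending punctuation first.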

Examples, showing frequency of occurrence, excluding proper names

Top 30 two-word chunks

1300 niet meer
756 jaar geleden
726 vorig jaar
709 heel veel
665 niet alleen
651 nieuws vandaag
650 steeds meer
604 veel mensen
532 veel meer
396 alleen maar
395 miljoen euro
337 niet voor
315 maar niet
315 niet goed
310 volgend jaar
309 mensen zijn
283 terug naar
276 twee jaar
276 vorige week
268 onder meer
266 miljard euro
262 ieder geval
256 deze week
254 daar zijn
254 helemaal niet
254 volgende week
243 laten zien
243 niet veel
229 afgelopen jaren
229 fijne avond

Top 30 three-word chunks

91 twee jaar geleden
80 heel veel mensen
73 mensen raakten gewond
63 twee weken geleden
60 steeds meer mensen
55 zich zorgen over
53 eerder deze week
53 voor veel mensen
47 zich grote zorgen
44 niet alleen voor
42 maakt zich zorgen
41 veel meer over
40 paar jaar geleden
40 vijf jaar geleden
39 maken zich zorgen
39 prettige avond verder
38 jaar geleden werd
37 maar niet iedereen
37 meeste plaatsen droog
36 goed nieuws voor
35 heel veel geld
35 over twee weken
35 vier jaar geleden
33 niet alleen maar
32 veel mensen zijn
31 vrij veel bewolking
29 paar weken geleden
28 eind vorig jaar
27 even terug naar
27 komen steeds meer
5 x
Dutch: 01 September -> 31 December 2020
Watch 1000 Dutch TV Series Videos : 40 / 1000

tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: Language chunks to ease language activation

Postby tommus » Sat Apr 01, 2017 12:50 pm

Here are links for the Dutch two- and three-word chunks and their frequencies from two years of NOS Journaal. They include proper names such as people and places. They may contain some chunks that are unique to journalism.

Corpus: 2 years of daily Dutch news from NOS Journaal
Number of words: over 2 million
Condition: words longer than 3 letters
Processing time: less than a minute

2-word chunks, occurring at least 2 times each: 56,833

3-word chunks, occurring at least 2 times each: 13,156
1 x

Cainntear
Black Belt - 3rd Dan
Posts: 3521
Joined: Thu Jul 30, 2015 11:04 am
Location: Scotland
Languages: English(N)
Advanced: French,Spanish, Scottish Gaelic
Intermediate: Italian, Catalan, Corsican
Basic: Welsh
Dabbling: Polish, Russian etc
x 8781

Re: Language chunks to ease language activation

Postby Cainntear » Sat Apr 01, 2017 1:20 pm

I realise this is quite an old post, but I think the point raised demands a response.
lusan wrote:I believe, as some research showed, that we remember best no more than 7 items at a time. Very long sentences fail in that respect.

The reason we talk about chunking is exactly that limit of 7 things at a time.

The idea is that by "chunking" multiple words together, we can see them as one thing, and then we can incorporate them into larger and larger constructs.

But it's a complex process, and the contents of the chunks usually still follow the grammatical rules of the language, so I don't see learning chunks as an alternative to learning grammar, but as a supplement. I'm sure you can learn some grammar through studying chunks, but grammar is a series of generalisable rules, and you can't learn a generalisable rule from only one or two examples.
0 x

tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: Language chunks to ease language activation

Postby tommus » Sat Apr 01, 2017 1:24 pm

Here are links for the Dutch two- and three-word chunks and their frequencies from several years of some Dutch TV series. They contain mainly conversational material.

Corpus: 2 years of Dutch TV series.
Number of words: over 278,000
Condition: words longer than 3 letters
Processing time: less than 10 seconds

2-word chunks, occurring at least 2 times each: 6,635

3-word chunks, occurring at least 2 times each: 1,121
0 x

tommus
Blue Belt
Posts: 957
Joined: Sat Jul 04, 2015 3:59 pm
Location: Kingston, ON, Canada
Languages: English (N), French (B2), Dutch (B2)
x 1937

Re: Language chunks to ease language activation

Postby tommus » Sat Apr 01, 2017 1:31 pm

If there is interest, I will process other corpus material in other languages into 2-word and 3-word chunks. What I need is a link to plain text material. It should include at least 200,000 words, and preferably 1 million words or more, in Latin script.
0 x

jeffers
Blue Belt
Posts: 848
Joined: Sat Aug 22, 2015 4:12 pm
Location: UK
Languages: Speaks: English (N), Hindi (A2-B1)

Learning: The above, plus French (A2-B1), German (A1), Ancient Greek (?), Sanskrit (beginner)
Language Log: https://forum.language-learners.org/vie ... 15&t=19785
x 2786

Re: Language chunks to ease language activation

Postby jeffers » Sat Apr 01, 2017 2:12 pm

tommus wrote:If there is interest, I will process other corpus material in other languages into 2-word and 3-word chunks. What I need is a link to plain text material. It should include at least 200,000 words, and preferably 1 million words or more, in Latin script.


Thanks for setting that up. I was probably over-complicating the process in my mind.

The only problem with your solution is restricting it to words longer than 3 letters (evidently to cut down processing time). I imagine a high proportion of interesting word groups contain words of one and two letters. The obvious example in French is "il y a", but we could imagine others ("What a crock!").
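
To illustrate the point, here is a small Python sketch (hypothetical function name, naive punctuation handling, made-up sample text) that counts 3-word chunks with and without a minimum word length:

```python
from collections import Counter

def top_trigrams(text, min_len=1):
    """Count 3-word chunks, keeping only words of at least min_len letters."""
    # Naive tokenizer (an assumption): split on spaces, strip common punctuation.
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    words = [w for w in words if len(w) >= min_len]
    return Counter(" ".join(t) for t in zip(words, words[1:], words[2:]))

french = "Il y a un chat. Il y a un chien. Il y a du pain."
print(top_trigrams(french, min_len=4).most_common(1))  # short words filtered out
print(top_trigrams(french, min_len=1).most_common(1))  # short words kept
```

With the filter in place "il y a" can never appear. Without it, the likely cost is less about processing time and more about a longer list with many function-word combinations to sift through.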
0 x