Making word lists

User avatar
sfuqua
Black Belt - 1st Dan
Posts: 1642
Joined: Sun Jul 19, 2015 5:05 am
Location: san jose, california
Languages: Bad English: native
Samoan: speak, but rusty
Tagalog: imperfect, but use all the time
Spanish: read
French: read some
Japanese: beginner, obsessively studying
Language Log: https://forum.language-learners.org/vie ... =15&t=9248
x 6299

Making word lists

Postby sfuqua » Fri Nov 17, 2017 6:51 pm

I know this is pretty easy in a Unix environment, but I'm trapped on a Windows machine with no Python...
I know I can do it with a spreadsheet too, but it will choke my machine if the file is big.
I own the slow, old Windows machine, so I could install software...

What is the easiest way to make a word list from a book-length text file that gives me the unique words in their original order of appearance?

The cat in the hat chased the rat in the hat -->

the
cat
in
hat
chased
rat

Not a frequency list, but a list of unique words in their order of appearance.
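For what it's worth, here is a minimal sketch of what I'm after in Python (assuming Python 3.7+, where plain dicts preserve insertion order; the regex is just one way to split words):

```python
import re

def unique_words(text):
    # Lowercase and split on runs of letters/apostrophes so punctuation
    # doesn't create duplicate "words" like "hat" and "hat.".
    words = re.findall(r"[a-z']+", text.lower())
    # dict.fromkeys keeps the first occurrence of each key, in order,
    # so this deduplicates while preserving order of appearance.
    return list(dict.fromkeys(words))

print("\n".join(unique_words("The cat in the hat chased the rat in the hat")))
# prints: the, cat, in, hat, chased, rat (one per line)
```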
0 x
荒海や佐渡によこたふ天の川

the rough sea / stretching out towards Sado / the Milky Way
Basho[1689]

Sometimes Japanese is just too much...


Re: Making word lists

Postby sfuqua » Fri Nov 17, 2017 9:14 pm

I may have fixed this already, but we'll see. I suspect I'm overlooking something easy.

I seem to have Python installed on this computer now; now to learn enough Python to write the program.
NLTK doesn't seem to like my Python 3.6.
0 x

Whodathunkitz
Green Belt
Posts: 416
Joined: Mon Dec 26, 2016 7:40 pm
Location: UK
Languages: English (N), Cebuano (basic spoken daily, best L2), Spanish (beginner, but can read), Esperanto (beginner and not maintained). Sometimes dabble with Dutch, Serbian, Slovak, Czech, German and Arabic.
Language Log: viewtopic.php?f=15&t=5133&start=30
x 315

Re: Making word lists

Postby Whodathunkitz » Fri Nov 17, 2017 9:26 pm

Notepad++ or Powershell plus regular expressions?

Possibly do a first pass swapping spaces for newlines (two characters, \r\n, on Windows, or vice versa) and then see whether a further regular expression might do it.

E.g. http://www.thescarms.com/dotnet/RegExUnique.aspx

On phone so can't test.
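A rough sketch of that two-pass idea in Python (a sketch only — the whitespace regex and the seen-set are my assumptions, not anything those tools do):

```python
import re

text = "the cat in the hat chased the rat in the hat"

# Pass 1: replace every run of whitespace with a newline,
# so there is one word per line.
one_per_line = re.sub(r"\s+", "\n", text.strip())

# Pass 2: keep each line only the first time it appears.
seen = set()
unique_lines = []
for line in one_per_line.splitlines():
    if line not in seen:
        seen.add(line)
        unique_lines.append(line)

print("\n".join(unique_lines))  # the, cat, in, hat, chased, rat
```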
1 x
2018 Cebuano SuperChallenge 1 May 2018-Dec 2019
: 150 / 600 SC days:
: 6 / 1250 Read (aim daily 2000 words):
: 299 / 9000 Video (aim daily 15 minutes):


Re: Making word lists

Postby sfuqua » Fri Nov 17, 2017 9:50 pm

I'm kind of surprised there isn't an online utility for it...
Maybe there is and I'm not searching correctly.
2 x

Whodathunkitz

Re: Making word lists

Postby Whodathunkitz » Fri Nov 17, 2017 10:30 pm

sfuqua wrote:I'm kind of surprised there isn't an online utility for it...
Maybe there is and I'm not searching correctly.


It's trivial except for preserving the order of first use.

I can understand why that's useful if you're going through a text.

It can be done, but it's a bit harder and there's probably less call for it.

Perhaps sort and unique just to get an idea of the numbers. Sort and unique with a count is easy (probably); re-sort in descending frequency and the common words will be the ones you encounter earliest.

But a few regular expressions should do it. Check out some of the regexp sites, especially the ones with a live, interactive tester.
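The sort-and-unique-with-a-count idea is nearly a one-liner with Python's collections.Counter, for instance (the whitespace split is a simplification):

```python
from collections import Counter

words = "the cat in the hat chased the rat in the hat".lower().split()
counts = Counter(words)

# most_common() returns (word, count) pairs in descending frequency,
# so the common words come first.
for word, n in counts.most_common():
    print(n, word)
# first line printed: 4 the
```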
1 x

User avatar
reineke
Black Belt - 3rd Dan
Posts: 3570
Joined: Wed Jan 06, 2016 7:34 pm
Languages: Fox (C4)
Language Log: https://forum.language-learners.org/vie ... =15&t=6979
x 6554

Re: Making word lists

Postby reineke » Fri Nov 17, 2017 10:38 pm

http://www.online-toolz.com/tools/string-functions.php#

https://www.freeformatter.com/string-utilities.html

Both seem to work. Paste your text and enter a space in the required field. I googled "online word string splitter".

Quoique
ce
détail
ne
touche
en
aucune
manière
au
fond
même
de
ce
que
nous
avons
à
raconter,
il
n'est
peut-être
pas
inutile,
ne
fût-ce
que
pour
être
exact
en
tout,
d'indiquer
ici
les
bruits
et
les
propos
qui
avaient
couru
sur
son
compte
au
moment
où
il
était
arrivé
dans
le
diocèse.
Vrai
ou
faux,
ce
qu'on
dit
des
hommes
tient
souvent
autant
de
place
dans
leur
vie
et
surtout
dans
leur
destinée
que
ce
qu'ils
font.
M.
Myriel
était
fils
d'un
conseiller
au
parlement
d'Aix
;
noblesse
de
robe.

V. Hugo

If you're still reading, let me mention TextSTAT, a simple program "for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files". The link is also in the "language tools section".

http://neon.niederlandistik.fu-berlin.de/en/textstat/
Last edited by reineke on Fri Nov 17, 2017 11:02 pm, edited 1 time in total.
5 x

User avatar
jeff_lindqvist
Black Belt - 3rd Dan
Posts: 3135
Joined: Sun Aug 16, 2015 9:52 pm
Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk
Language Log: viewtopic.php?f=15&t=2773
x 10462

Re: Making word lists

Postby jeff_lindqvist » Fri Nov 17, 2017 10:56 pm

5 x
Leabhair/Greannáin léite as Gaeilge: 9 / 18
Ar an seastán oíche: Oileán an Órchiste
Duolingo - finished trees: sp/ga/de/fr/pt/it
Finnish with extra pain : 100 / 100

Llorg Blog - Wiki - Discord

User avatar
Iversen
Black Belt - 4th Dan
Posts: 4768
Joined: Sun Jul 19, 2015 7:36 pm
Location: Denmark
Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more...
Language Log: viewtopic.php?f=15&t=1027
x 14962

Re: Making word lists

Postby Iversen » Fri Nov 17, 2017 11:25 pm

I used Word and Excel for a similar task. I analyzed a corpus of around 37,000 words from my own writings at HTLAL. The first thing I did was to import one passage of ordinary English text after another into Word. After a bit of manual revision to remove non-English passages and quotes, I replaced all spaces with the linebreak character, removed all empty lines and copied the result into Excel, which has an advanced filter that can find unique values.

The old Excel version I used allowed some 64K lines (now raised to more than a million), but I also discovered that the filter couldn't cope with all the 65,000 lines at once. No problem: I divided the lot into two parts, filtered each part separately, and then applied the filter once more to the combined results. That gave me a list of all the word forms in the corpus. I could also have done some statistics on a sorted list, but it didn't occur to me that the result could be of interest; I only wanted to know the number of different words I had used.

But there is probably some fancy software out there that can do a similar thing automatically...
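The split-filter-recombine trick generalizes, by the way: deduplicating each half and then deduplicating the concatenated results gives the same list as filtering everything in one pass. A small Python sketch of that property (the names and data are mine, not Excel's):

```python
def unique_in_order(items):
    # dict.fromkeys keeps first occurrences, in order (Python 3.7+).
    return list(dict.fromkeys(items))

words = "a b a c b d a c e".split()
half = len(words) // 2

# Filter each half separately, then filter the combined result...
merged = unique_in_order(unique_in_order(words[:half]) + unique_in_order(words[half:]))

# ...and it matches a single pass over the whole list.
assert merged == unique_in_order(words)
print(merged)  # ['a', 'b', 'c', 'd', 'e']
```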
2 x


Re: Making word lists

Postby sfuqua » Sat Nov 18, 2017 3:03 am

I found an online utility at https://www.tracemyip.org/tools/remove-duplicate-words-in-text/ that works OK on files that are 200,000 words long.
I'm going to fight with Python until I can do it by myself.

I'm plotting what to do the next few months, of course, and a word list in order of appearance is part of the plot.
2 x

mcthulhu
Orange Belt
Posts: 228
Joined: Sun Feb 26, 2017 4:01 pm
Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
x 590

Re: Making word lists

Postby mcthulhu » Sat Nov 18, 2017 4:41 am

One problem with the http://www.tracemyip.org tool, I think, is that it seems to split only on white space, not punctuation, so "three" and "three." would be treated as separate, unique words. If that's good enough, fine; otherwise you need either to remove all punctuation first, or to split on both white space and punctuation characters (which ones depends on what language the text is in). I've used regular expressions for this in the past, though the regular expression usually keeps expanding as I try to make it cover additional languages.

Likewise, "three" and "Three" are treated as separate words by that tool, so when you write your own script you might consider normalizing everything to lower case first. (Not to mention the question of whether you want a list of base forms, or whether different forms of the same word, like singular vs. plural, should be listed as separate "words". De-inflection would add another step.) Word tokenization isn't really as trivial as it seems.
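To illustrate the difference in Python (a toy example, not the tracemyip tool's actual behaviour):

```python
import re

text = "Three men saw three ships. Three!"

# A naive whitespace split is case-sensitive and keeps punctuation attached:
naive = sorted(set(text.split()))
print(naive)  # ['Three', 'Three!', 'men', 'saw', 'ships.', 'three']

# Lowercasing and splitting on runs of letters merges those variants:
tokens = re.findall(r"[a-z']+", text.lower())
print(sorted(set(tokens)))  # ['men', 'saw', 'ships', 'three']
```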

NLTK, with its supplied word tokenizers, actually is available for Python 3.6, however. If you use the Anaconda installer for 3.6, NLTK should be one of the packages installed along with it:

Code:

Python 3.6.1 |Anaconda custom (32-bit)| (default, May 11 2017, 14:16:49) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>>


I'm curious about why you need to keep the word list in order of appearance. For context, or so that a glossary will be ordered in such a way as to follow the text?

Jorkens automatically produces a word form list (not de-inflected) when a text is loaded into it. The list is sorted by descending frequency, but merely skipping the sorting step would not be difficult, if that would be useful to add as an option.
5 x

