I know this is pretty easy in a Unix environment, but I'm trapped on a Windows machine with no Python...
I know I can do it with a spreadsheet too, but it will choke my machine if the file is big.
I own the slow, old Windows machine, so I could install software.
What is the easiest way to make a word list of unique words from a book-length text file, one that gives me the unique words in their original order of appearance?
The cat in the hat chased the rat in the hat -->
the
cat
in
hat
chased
rat
Not a frequency list, but a list of unique words in their order of appearance.
Making word lists
- sfuqua
- Black Belt - 1st Dan
- Posts: 1642
- Joined: Sun Jul 19, 2015 5:05 am
- Location: san jose, california
- Languages: Bad English: native
Samoan: speak, but rusty
Tagalog: imperfect, but use all the time
Spanish: read
French: read some
Japanese: beginner, obsessively studying - Language Log: https://forum.language-learners.org/vie ... =15&t=9248
- x 6299
荒海や佐渡によこたふ天の川
the rough sea / stretching out towards Sado / the Milky Way
Basho[1689]
Sometimes Japanese is just too much...
- sfuqua
Re: Making word lists
I may have fixed this already, but we'll see; I suspect I'm overlooking something easy.
I seem to have Python installed on this computer now; now to learn enough Python to write the program.
NLTK doesn't seem to like my Python 3.6.
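For what it's worth, this task doesn't need NLTK at all; the standard library's re module is enough. A minimal sketch (the \w+ pattern and the lowercasing are my assumptions about what counts as a word):

```python
import re

def unique_words(text):
    """Return unique words in order of first appearance, case-folded."""
    seen = set()
    out = []
    # \w+ grabs runs of letters/digits/underscore; lower() merges "The"/"the"
    for word in re.findall(r"\w+", text.lower()):
        if word not in seen:
            seen.add(word)
            out.append(word)
    return out

# The example from the first post:
print(unique_words("The cat in the hat chased the rat in the hat"))
# → ['the', 'cat', 'in', 'hat', 'chased', 'rat']
```

To run it over a whole book, read the file and pass its contents to the function.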
-
- Green Belt
- Posts: 416
- Joined: Mon Dec 26, 2016 7:40 pm
- Location: UK
- Languages: English (N), Cebuano (basic spoken daily, best L2), Spanish (beginner, but can read), Esperanto (beginner and not maintained). Sometimes dabble with Dutch, Serbian, Slovak, Czech, German and Arabic.
- Language Log: viewtopic.php?f=15&t=5133&start=30
- x 315
Re: Making word lists
Notepad++ or PowerShell plus regular expressions?
Possibly do a first pass swapping spaces for newlines (two characters, \r\n, or vice versa) and then see if a further regular expression might do it.
E.g. http://www.thescarms.com/dotnet/RegExUnique.aspx
On phone so can't test.
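On a machine with Python available, the two passes described above can be sketched like this (the sample text and the lowercasing are my additions):

```python
import re

text = "The cat in the hat chased the rat in the hat"

# Pass 1: swap runs of whitespace for newlines, giving one word per line
one_per_line = re.sub(r"\s+", "\n", text.strip().lower())

# Pass 2: keep only the first occurrence of each line
seen = set()
unique = []
for w in one_per_line.split("\n"):
    if w not in seen:
        seen.add(w)
        unique.append(w)
print("\n".join(unique))
```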
2018 Cebuano SuperChallenge 1 May 2018-Dec 2019
: SC days:
: Read (aim daily 2000 words):
: Video (aim daily 15 minutes):
- sfuqua
Re: Making word lists
I'm kind of surprised there isn't an online utility for it...
Maybe there is and I'm not searching correctly.
-
- Green Belt
Re: Making word lists
sfuqua wrote:I'm kind of surprised there isn't an online utility for it...
Maybe there is and I'm not searching correctly.
It's trivial except for the ordering of first usage.
I can understand why that's useful if you're going through a text.
It can be done, but it's a bit harder and there's maybe less call for it.
Perhaps sort and unique just to get an idea of the numbers. Sort plus unique-with-a-count is easy (probably); re-sort in descending frequency and the common words will be encountered earlier.
But a few regular expressions should do it. Check out some of the regexp sites, especially those that have a live, working aspect.
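In Python, that sort/unique-with-a-count/re-sort pipeline collapses into collections.Counter (the sample sentence is borrowed from the first post):

```python
from collections import Counter

text = "the cat in the hat chased the rat in the hat"

# Counter does the unique-with-a-count step; most_common() returns
# the words already re-sorted in descending frequency.
counts = Counter(text.split())
for word, n in counts.most_common():
    print(n, word)
```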
- reineke
- Black Belt - 3rd Dan
- Posts: 3570
- Joined: Wed Jan 06, 2016 7:34 pm
- Languages: Fox (C4)
- Language Log: https://forum.language-learners.org/vie ... =15&t=6979
- x 6554
Re: Making word lists
http://www.online-toolz.com/tools/string-functions.php#
https://www.freeformatter.com/string-utilities.html
Both seem to work. Paste your text and press space in the required field. I googled "online word string splitter".
Quoique
ce
détail
ne
touche
en
aucune
manière
au
fond
même
de
ce
que
nous
avons
à
raconter,
il
n'est
peut-être
pas
inutile,
ne
fût-ce
que
pour
être
exact
en
tout,
d'indiquer
ici
les
bruits
et
les
propos
qui
avaient
couru
sur
son
compte
au
moment
où
il
était
arrivé
dans
le
diocèse.
Vrai
ou
faux,
ce
qu'on
dit
des
hommes
tient
souvent
autant
de
place
dans
leur
vie
et
surtout
dans
leur
destinée
que
ce
qu'ils
font.
M.
Myriel
était
fils
d'un
conseiller
au
parlement
d'Aix
;
noblesse
de
robe.
V. Hugo
If you're still reading, let me mention TextSTAT, a simple program "for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files". The link is also in the "language tools section".
http://neon.niederlandistik.fu-berlin.de/en/textstat/
Last edited by reineke on Fri Nov 17, 2017 11:02 pm, edited 1 time in total.
- jeff_lindqvist
- Black Belt - 3rd Dan
- Posts: 3135
- Joined: Sun Aug 16, 2015 9:52 pm
- Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk - Language Log: viewtopic.php?f=15&t=2773
- x 10462
Re: Making word lists
Python script:
https://www.dotnetperls.com/duplicates-python
Online resources:
http://www.esqsoft.com/tools/dedupe-list.htm
http://www.dedupelist.com/
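Alongside the linked script and online tools, one Python-native trick is worth noting: since dict preserves insertion order (guaranteed from Python 3.7, and true in CPython 3.6), de-duplicating while keeping the order of first appearance is a one-liner.

```python
# dict.fromkeys keeps the first occurrence of each key, in order
words = "the cat in the hat chased the rat in the hat".split()
unique = list(dict.fromkeys(words))
print(unique)  # → ['the', 'cat', 'in', 'hat', 'chased', 'rat']
```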
Leabhair/Greannáin léite as Gaeilge:
Ar an seastán oíche:Oileán an Órchiste
Duolingo - finished trees: sp/ga/de/fr/pt/it
Finnish with extra pain :
Llorg Blog - Wiki - Discord
- Iversen
- Black Belt - 4th Dan
- Posts: 4768
- Joined: Sun Jul 19, 2015 7:36 pm
- Location: Denmark
- Languages: Monolingual travels in Danish, English, German, Dutch, Swedish, French, Portuguese, Spanish, Catalan, Italian, Romanian and (part time) Esperanto
Ahem, not yet: Norwegian, Afrikaans, Platt, Scots, Russian, Serbian, Bulgarian, Albanian, Greek, Latin, Irish, Indonesian and a few more... - Language Log: viewtopic.php?f=15&t=1027
- x 14962
Re: Making word lists
I used Word and Excel for a similar task. I analyzed a corpus of around 37,000 words from my own writings at HTLAL, and the first thing I did was to import one passage after the other of ordinary English text into Word. After a bit of manual revision to remove non-English passages and quotes, I replaced all spaces with the sign for a line break, removed all empty lines and copied the result into Excel, where there is an advanced filter that can find unique values. The old Excel version I used allowed some 64K lines, which has since been raised to more than a million - but I also discovered that the filter couldn't cope with all 65,000 lines at once. No problem: I divided the lot into two parts, filtered each part separately and then applied the filter once again to the combined results. And then I had a list of all wordforms in the corpus. I could also have done some statistics on a sorted list if I had wanted to, but it didn't occur to me that the result could be of interest. I only wanted to know the number of different words I had used.
But there is probably some fancy software out there that can do a similar thing automatically...
- sfuqua
Re: Making word lists
I found an online utility at https://www.tracemyip.org/tools/remove-duplicate-words-in-text/ that works OK on files that are 200,000 words long.
I'm going to fight with Python until I can do it by myself.
I'm plotting what to do the next few months, of course, and a word list in order of appearance is part of the plot.
-
- Orange Belt
- Posts: 228
- Joined: Sun Feb 26, 2017 4:01 pm
- Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
- x 590
Re: Making word lists
One problem with your http://www.tracemyip.org tool, I think, is that it seems to split only on white space, not punctuation, so "three" and "three." would be treated as separate and unique words. If that's good enough, fine, but otherwise you need either to remove all punctuation first, or to split on both white space and punctuation characters (which ones would depend on what language the text is in). I've used regular expressions to do this in the past, though usually the regular expression keeps expanding as I try to make it cover additional languages, etc.
Likewise, "three" and "Three" are treated as separate words by that tool; so when you write your own script, you might consider whether to normalize everything to lower case first. (Not to mention the question of whether you want a list of base forms, or whether you want different forms of the same word, like singular vs. plural, listed as separate "words." De-inflection would add another step.) Word tokenization isn't really as trivial as it seems.
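To make the punctuation and case points concrete, here is a sketch of a regex tokenizer along those lines (the character class is my assumption and deliberately minimal; it would need extending for other languages):

```python
import re

def tokenize(text):
    # Match runs of letters only (no digits or underscore), allowing
    # internal apostrophes and hyphens for words like "n'est" and
    # "peut-être"; lowercase so "Three" and "three" merge.
    return [w.lower() for w in re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text)]

print(tokenize("Three men. three hats, peut-être n'est"))
# → ['three', 'men', 'three', 'hats', 'peut-être', "n'est"]
```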
NLTK, with its supplied word tokenizers, actually is available for Python 3.6, however. If you use the Anaconda installer for 3.6, NLTK should be one of the packages installed along with it:
Code:
Python 3.6.1 |Anaconda custom (32-bit)| (default, May 11 2017, 14:16:49) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>>
I'm curious about why you need to keep the word list in order of appearance. For context, or so that a glossary will be ordered in such a way as to follow the text?
Jorkens automatically produces a word form list (not de-inflected) when a text is loaded into it. The list is sorted by descending frequency, but merely skipping the sorting step would not be difficult, if that would be useful to add as an option.