sentence generator

All about language programs, courses, websites and other learning resources
guiguixx1
Orange Belt
Posts: 190
Joined: Sat Oct 10, 2015 6:10 pm
Location: Belgium
Languages: French (N), English (C2), Dutch (C1), Spanish (C1), Italian (B2), Esperanto (A2), Portuguese (B2), German (A2), Catalan (passively)
x 238

sentence generator

Postby guiguixx1 » Fri Jan 11, 2019 7:19 am

Hi all!

I am looking for a website/programme/whatever that can generate sentences from specific words that we put in a "database". For example, if I enter the 300 most frequent words in a language, I would like the website/programme to generate sentences using only those words (a bit like what Duolingo does). Does this exist?

Thanks a lot in advance!
0 x
Language learning and teaching website as a French teacher of Dutch and English: cameleondeslangues.be

zenmonkey
Black Belt - 2nd Dan
Posts: 2528
Joined: Sun Jul 26, 2015 7:21 pm
Location: California, Germany and France
Languages: Spanish, English, French trilingual - German (B2/C1) on/off study: Persian, Hebrew, Tibetan, Setswana.
Some knowledge of Italian, Portuguese, Ladino, Yiddish ...
Want to tackle Tzotzil, Nahuatl
Language Log: viewtopic.php?f=15&t=859
x 7032

Re: sentence generator

Postby zenmonkey » Fri Jan 11, 2019 8:49 am

1 x
I am a leaf on the wind, watch how I soar

白田龍
Orange Belt
Posts: 242
Joined: Wed Mar 21, 2018 6:54 pm
Languages: English, Portuguese, Spanish, Catalan, French, Persian, Arabic, Mandarin, Japanese.
x 444

Re: sentence generator

Postby 白田龍 » Fri Jan 11, 2019 2:20 pm

I have written a Python script that pulls from a corpus the sentences containing only words whose frequency rank is below a threshold, plus any words you select. It requires a corpus, i.e. a huge text (I usually get them from OPUS and/or by mass-downloading books and blogs), and a lemmatizer (it knows how to use the inflection dictionaries from lexiconista, and MeCab for Japanese). It can do without a lemmatizer, treating every orthographic form as an independent word, but this severely limits its usability.
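To make the idea concrete, here is a minimal sketch of that kind of filter, written for Python 3. It is not the actual script: the function names (tokenize, frequency_ranks, select_sentences) and the lemma_of dictionary are made up for illustration, and the tokenizer simply keeps runs of letters.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Lowercased runs of letters; digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", sentence.lower())


def frequency_ranks(sentences, lemma_of=None):
    """Map each lemma to its frequency rank (1 = most frequent)."""
    lemma_of = lemma_of or {}
    counts = Counter(
        lemma_of.get(word, word)
        for sentence in sentences
        for word in tokenize(sentence)
    )
    return {
        lemma: rank
        for rank, (lemma, _count) in enumerate(counts.most_common(), start=1)
    }


def select_sentences(sentences, ranks, max_rank, extra_words=(), lemma_of=None):
    """Keep sentences whose every lemma is within max_rank or in extra_words."""
    lemma_of = lemma_of or {}
    allowed = {w for w, r in ranks.items() if r <= max_rank} | set(extra_words)
    kept = []
    for sentence in sentences:
        lemmas = [lemma_of.get(word, word) for word in tokenize(sentence)]
        if lemmas and all(lemma in allowed for lemma in lemmas):
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    corpus = ["the cat sleeps", "the dog sleeps", "the cat eats fish"]
    lemma_of = {"sleeps": "sleep", "eats": "eat"}  # hypothetical lemma map
    ranks = frequency_ranks(corpus, lemma_of)
    print(select_sentences(corpus, ranks, max_rank=3, lemma_of=lemma_of))
```

Without a lemma map, "sleeps" and "sleep" would count as two unrelated words, which is why leaving out the lemmatizer shrinks the set of usable sentences.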

I can set up a version that works independently from my current project over the weekend, would you want any language in particular?
I have a usable library for French, Spanish, Catalan, Chinese, Japanese, Persian, (probably others) but I still need to parse the files.
Last edited by 白田龍 on Tue Jan 15, 2019 9:28 am, edited 1 time in total.
1 x


Re: sentence generator

Postby 白田龍 » Tue Jan 15, 2019 1:14 am

Finally I got it working here:

https://github.com/madokaK/corporaSRS

1 Install the latest version of Python 2 if you don't have it.
2 Place the file corpusSearch.py in a new folder.
3 Download the lemmatization list for your language here, and save it in the same folder (optional).
4 Create a sub folder called corpora.
5 Download the corpus you want from http://opus.nlpl.eu/OpenSubtitles-v2018.php in the Moses format and extract it into the corpora folder.
6 If you don't want translations, don't save the English parallel corpus, or edit corpusSearch.py and change line 3 to showTranslation = False. On line 4 you choose the maximum frequency rank allowed for forming the sentences.
7 Run corpusSearch.py
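
For anyone curious what the lemmatization list from step 3 is used for, here is a rough Python 3 sketch of loading one into a dictionary. It assumes the file has one tab-separated pair per line, with the lemma first and the inflected form second; check your file before relying on that. The name load_lemma_map is made up and is not taken from corpusSearch.py.

```python
def load_lemma_map(path, encoding="utf-8-sig"):
    """Read 'lemma<TAB>form' lines into a {form: lemma} dictionary.

    The column order is an assumption about the file layout.
    "utf-8-sig" also accepts files that start with a byte order mark.
    """
    lemma_of = {}
    with open(path, encoding=encoding) as handle:
        for line in handle:
            parts = line.rstrip("\r\n").split("\t")
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            lemma, form = parts
            # Keep the first lemma seen for an ambiguous form.
            lemma_of.setdefault(form.lower(), lemma.lower())
    return lemma_of
```

Looking a word up then becomes lemma_of.get(word, word), so unknown forms simply stand for themselves.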

The translation corpus always needs to be English. This is because the vast majority of the translations are to or from English, so other pairs would have twice as many errors. (The program just checks whether the file name ends with '.en', so you can work around this by renaming...) If you want to use an English corpus as the target, you will need to change its extension to something other than '.en'.
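
As an illustration of the '.en' check, a sketch along these lines would pick the target and translation files from the corpora folder and read them side by side. It relies on the Moses format being two plain-text files aligned line by line. The function names are hypothetical, it is Python 3, and it is not how corpusSearch.py is actually written; in particular it does not skip any index files the program may have created in the folder.

```python
import os


def find_corpus_files(folder, translation_ext=".en"):
    """Return (target_path, translation_path); either may be None."""
    target, translation = None, None
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        if name.endswith(translation_ext):
            translation = path  # treated as the English side
        elif target is None:
            target = path  # first other file is the language being studied
    return target, translation


def read_pairs(target_path, translation_path=None, encoding="utf-8"):
    """Yield (sentence, translation) tuples; translation is None if absent."""
    with open(target_path, encoding=encoding) as target:
        if translation_path is None:
            for line in target:
                yield line.rstrip("\n"), None
            return
        with open(translation_path, encoding=encoding) as translation:
            # Line N of one file is the translation of line N of the other.
            for t_line, e_line in zip(target, translation):
                yield t_line.rstrip("\n"), e_line.rstrip("\n")
```

This also shows why the renaming trick works: only the file extension decides which side is shown as the translation.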

Sorry it is messy, but it works. I have never properly learned programming, and I don't have much time to work on the project...

It should work for all languages in which words are separated by spaces (i.e. not Chinese or Japanese). I do have a version that works with Japanese, but I have not yet written a script that automates the processing of the files.

If you need to add your own custom wordlist, in addition to the top X most frequent words, I can add this feature later...

I can, if requested, upload corpora with sentences from books for French, Catalan, Spanish, Italian, Persian or English.

I can also quickly set up a lemmatization dictionary for Russian.

I will be adding support for Chinese, Japanese and Arabic later...

It still has some issues I need to fix... searching for compound words is not working as intended.

By default, only the first 300 MB of the corpus file are used; you can change this value on line 50 (delete the index files to re-index after you do this). Decreasing this value will make indexing and searching faster and prevent memory errors. Increasing it will let you find more results for low-frequency words and expressions.
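
For reference, capping the amount of corpus that gets read can be sketched like this in Python 3. The function name read_capped and the reading of "300 MB" as 300 * 1024 * 1024 bytes are assumptions for the example, not taken from the program.

```python
def read_capped(path, max_bytes=300 * 1024 * 1024, encoding="utf-8"):
    """Return the lines found in the first max_bytes of a file.

    If the cap falls in the middle of a line, that partial line is dropped.
    """
    with open(path, "rb") as handle:
        data = handle.read(max_bytes)
        if handle.read(1):  # there is more in the file: we stopped early
            cut = data.rfind(b"\n")
            data = data[:cut + 1] if cut != -1 else b""
    return data.decode(encoding, errors="replace").splitlines()
```

Note that with a translation corpus the two files would have to be cut at the same line number rather than the same byte count, otherwise the sentences and their translations would no longer line up.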

If a translation corpus is present, the memory usage is much larger.




1 x


Re: sentence generator

Postby guiguixx1 » Wed Jan 16, 2019 10:02 pm

Hi,

So sorry I haven't answered you earlier :/ :/

The language I'm interested in for my particular purpose is Dutch. I'll need to enter the first 100 words from this list: http://www.csgn.be/langues/5eNL/Zinnig%20Woord_revu.pdf

Thanks so much in advance for all you've already done. I have to go to bed but I'll try to have a look at your programme asap!
0 x


Re: sentence generator

Postby 白田龍 » Thu Jan 17, 2019 9:36 am

I have uploaded a Dutch lemmatization file. I have also made some fixes to the code, so re-download the program file if you need to.

You can now create a "knownWords.txt" file in the program folder, containing any words you want included in the sentence formation, each word on its own line.
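
Just to illustrate the file format, reading such a list takes only a few lines of Python 3. The function name load_known_words is made up, and lowercasing the words is an assumption; the program may treat case differently.

```python
import os


def load_known_words(path="knownWords.txt", encoding="utf-8"):
    """Return the set of words listed one per line; empty if no file exists."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding=encoding) as handle:
        # Blank lines are skipped; surrounding whitespace is stripped.
        return {line.strip().lower() for line in handle if line.strip()}
```

So for the Dutch list, the file would simply contain the 100 words, one per line, saved as UTF-8.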

If searching is too slow or you get out-of-memory errors, you can try reducing the maximum length of the corpus that is going to be processed, on line 50. After you do this you need to delete any files the program has created in the corpora folder.

There are a number of corpora for Dutch on http://opus.nlpl.eu/: select nl (Dutch) and en (English) in the form on the front page, then download one in the Moses format. Extract the contents of the zip file to the corpora subfolder of the folder where you saved the program file.
1 x

Random Review
Green Belt
Posts: 449
Joined: Tue Jul 21, 2015 8:41 pm
Location: UK/Spain/China
Languages: En (N), Es (int), De (pre-int), Pt (pre-int), Zh-CN (beg), El (beg), yid (beg)
Language Log: https://forum.language-learners.org/vie ... 75#p123375
x 919

Re: sentence generator

Postby Random Review » Thu Jan 17, 2019 10:39 pm

guiguixx1 wrote:Hi all!

I am looking for a website/programme/whatever that can generate sentences from specific words that we put in a "database". For example, if I enter the 300 most frequent words in a language, I would like the website/programme to generate sentences using only those words (a bit like what Duolingo does). Does this exist?

Thanks a lot in advance!


Have you read Arthur Cotton?
0 x
German input 100 hours by 30-06: 4 / 100
Spanish input 200 hours by 30-06: 0 / 200
German study 50 hours by 30-06: 3 / 100
Spanish study 200 hours by 30-06: 0 / 200
Spanish conversation 100 hours by 30-06: 0 / 100

