I'm wondering whether it's possible to bulk extract the example sentences, phrases, idioms, collocations etc. that are found in digital (i.e. ebook) and online dictionaries, for use in SRS. I tried converting a Kindle Collins dictionary but that didn't get far.
I've done enough tedious copying and pasting in my life to never want to go down that route again (it'd be much easier to just delete ones I don't want/need as and when they come up). Plus, just cherry-picking the ones that I stumble across isn't going to help me get out of my comfort zone and find the phrases that I don't know that I don't know.
Has anyone had any success with this before? Or any ideas on where to start?
How to batch extract sentences and phrases for use in Anki?
-
- White Belt
- Posts: 31
- Joined: Mon Oct 26, 2015 4:33 am
- Languages: British English (N); Italian (B1)
- x 29
- jeff_lindqvist
- Black Belt - 3rd Dan
- Posts: 3167
- Joined: Sun Aug 16, 2015 9:52 pm
- Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk
- Language Log: viewtopic.php?f=15&t=2773
- x 10597
Re: How to batch extract sentences and phrases for use in Anki?
You can always import (or copy/paste) the entire text, and then use a number of Find/Replace commands in a good word processor. For instance, Find any string ending with a period/full stop and a space (which is likely to be the end of a sentence), then Replace that with (the same) period and space PLUS a line break. Then you have every sentence on a separate line:
sentence 1 blablablabla.
sentence 2 blablablabla.
sentence 3 blablablabla.
You can also look for double line breaks and replace that with single ones.
When you're done, copy/paste that to a spreadsheet.
If you're going to use Anki, you need something on the back of the card, e.g. a translation. You can get somewhere with Google Translate, then copy/paste everything into another column in the spreadsheet. Now you have thousands of sentences in both languages. Save the file in a format that Anki can import (tab-separated or CSV).
Still not a one-click process, but easier than doing it sentence by sentence. Assume that you have a classic novel in a word document - all this could be done in a matter of minutes.
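The find/replace steps above can also be scripted. This is only a minimal Python sketch, assuming the text is already available as a plain string; the function name `split_sentences` and the sample text are made up for illustration, and the naive "period plus space" rule will still mis-split abbreviations.

```python
import re


def split_sentences(text):
    """Put each sentence on its own line, mimicking the find/replace steps."""
    # Step 1: a period followed by whitespace is treated as a sentence end,
    # so the whitespace is replaced with a line break.
    text = re.sub(r"\.\s+", ".\n", text)
    # Step 2: collapse runs of line breaks into single ones.
    text = re.sub(r"\n{2,}", "\n", text)
    # Drop surrounding whitespace and any empty lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = "Sentence one is here. Sentence two follows.\n\nSentence three ends it."
for sentence in split_sentences(sample):
    print(sentence)
```

The resulting lines can be pasted into the first column of a spreadsheet, just as with the word-processor route.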
1 x
Leabhair/Greannáin léite as Gaeilge:
Ar an seastán oíche:Oileán an Órchiste
Duolingo - finished trees: sp/ga/de/fr/pt/it
Finnish with extra pain :
Llorg Blog - Wiki - Discord
- arthaey
- Brown Belt
- Posts: 1080
- Joined: Sat Jul 18, 2015 9:11 pm
- Location: Seattle, WA, USA
- Languages:
EN (native);
ES (adv receptive, int productive);
FR (false beginner);
DE (lapsed beg);
ASL (lapsed beg);
HU (tourist)
- Language Log: viewtopic.php?f=15&t=3864&view=unread#unread
- x 1675
- Contact:
Re: How to batch extract sentences and phrases for use in Anki?
Not sure if the Incremental Reading addon will work for your needs, but give it a try.
0 x
Posts in: French • German • Hungarian • Spanish
NaNoWriMo: 10,000 words
Corrections welcome in any language; I prefer an informal register.
- rdearman
- Site Admin
- Posts: 7260
- Joined: Thu May 14, 2015 4:18 pm
- Location: United Kingdom
- Languages: English (N)
- Language Log: viewtopic.php?f=15&t=1836
- x 23316
- Contact:
Re: How to batch extract sentences and phrases for use in Anki?
Use calibre to convert the book to text.
https://manual.calibre-ebook.com/generated/en/ebook-convert.html
Then, if you have a Linux machine, or Bash installed on your Windows 10 machine, run:
Code: Select all
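# Break after ., ! or ? when followed by a space and a capital letter (GNU sed: \n in the replacement is a newline)
sed 's/\([.!?]\) \([[:upper:]]\)/\1\n\2/g' file.txt > sentence.csv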
You'll then have a CSV file with each sentence in the first column. You may have to clean it up in Excel (or your favourite spreadsheet program). Then, if you want Google translations of the sentences, upload the file to Google Sheets and add this formula to the second column:
Code: Select all
=GOOGLETRANSLATE(A1,"TL","NL")
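Where the formula has the form GOOGLETRANSLATE(text, [source_language, target_language]). Here "TL" and "NL" are placeholders: replace them with the two-letter codes for the language of the sentences and the language you want the translations in (e.g. "it" and "en").
Save it all as a tab-delimited file and import it into Anki.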
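For anyone without sed, roughly the same split can be done in Python. This is a sketch under assumptions, not a drop-in replacement: the function name `sentences_to_tsv` and the sample text are invented for illustration, and `[A-Z]` only covers unaccented capitals, whereas sed's `[[:upper:]]` follows the locale.

```python
import csv
import io
import re

# Same rule as the sed command: ., ! or ? followed by a space and a capital letter.
# Note: [A-Z] matches ASCII capitals only.
SENTENCE_BREAK = re.compile(r"([.!?]) (?=[A-Z])")


def sentences_to_tsv(text, out):
    """Write one sentence per row, leaving a second column empty for a translation."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in SENTENCE_BREAK.sub(r"\1\n", text).splitlines():
        if line.strip():
            writer.writerow([line.strip(), ""])


buffer = io.StringIO()
sentences_to_tsv("Is it far? No, e.g. a short walk. Let's go!", buffer)
print(buffer.getvalue())
```

Because a break is only inserted before a capital letter, "e.g. a short walk" stays in one piece here. To write a real file, pass `open("sentence.tsv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.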
11 x
: Read 150 books in 2024
My YouTube Channel
The Autodidactic Podcast
My Author's Newsletter
I post on this forum with mobile devices, so excuse short msgs and typos.
- MorkTheFiddle
- Black Belt - 2nd Dan
- Posts: 2141
- Joined: Sat Jul 18, 2015 8:59 pm
- Location: North Texas USA
- Languages: English (N). Read (only) French and Spanish. Studying Ancient Greek. Studying a bit of Latin. Once studied Old Norse. Dabbled in Catalan, Provençal and Italian.
- Language Log: https://forum.language-learners.org/vie ... 11#p133911
- x 4886
Re: How to batch extract sentences and phrases for use in Anki?
jeff_lindqvist wrote:You can always import (or copy/paste) the entire text, and then use a number of Find/Replace commands in a good word processor. For instance, Find any string ending with a period/full stop and a space (which is likely to be the end of a sentence), then Replace that with (the same) period and space PLUS a line break. Then you have every sentence on a separate line:
sentence 1 blablablabla.
sentence 2 blablablabla.
sentence 3 blablablabla.
Just a minor note. In addition to a period/full stop, question marks and exclamation marks must be replaced as well, as seen in rdearman's post for using bash to do this.
And even more minor, abbreviations ending with a period can throw things off: Mrs., Mr., Jr. and that sort of thing.
I replace such things with their non-period equivalents before anything else: replace Mrs. with Mrs and so on. You can go back later and put the periods back.
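The "remove the periods first, put them back later" trick can be sketched in Python. Assumptions to note: the abbreviation list is a short, English-only example, the function name `split_protecting_abbreviations` is made up, and the restore step cannot tell a title that was originally written without a period from one that had its period stripped.

```python
import re

# Hypothetical, English-only list; extend it for the language you are working with.
ABBREVIATIONS = ["Mrs.", "Mr.", "Jr.", "Dr."]


def split_protecting_abbreviations(text):
    """Split into sentences without breaking after the listed abbreviations."""
    # Strip the period from each abbreviation first (Mrs. -> Mrs).
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr[:-1])
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Put the periods back, matching whole words only.
    restored = []
    for sentence in sentences:
        for abbr in ABBREVIATIONS:
            sentence = re.sub(rf"\b{re.escape(abbr[:-1])}\b(?!\.)", abbr, sentence)
        restored.append(sentence)
    return restored


text = "Mrs. Smith met Mr. Jones Jr. at noon. They talked! Did Dr. Lee come?"
for sentence in split_protecting_abbreviations(text):
    print(sentence)
```

A sentence that genuinely ends in one of these abbreviations will be merged with the next one, so some manual clean-up is still to be expected.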
2 x
Many things which are false are transmitted from book to book, and gain credit in the world. -- attributed to Samuel Johnson
- jeff_lindqvist
- Black Belt - 3rd Dan
- Posts: 3167
- Joined: Sun Aug 16, 2015 9:52 pm
- Languages: sv, en
de, es
ga, eo
---
fi, yue, ro, tp, cy, kw, pt, sk
- Language Log: viewtopic.php?f=15&t=2773
- x 10597
Re: How to batch extract sentences and phrases for use in Anki?
MorkTheFiddle wrote:Just a minor note. In addition to a period/full stop, question marks and exclamation marks must be replaced as well, as seen in rdearman's post for using bash to do this.(...)
Of course you're right - usually you don't get away with just one find/replace. Each language has its own set of titles, abbreviations, punctuation, capitalization, spelling conventions etc. I once imported a set of Esperanto sentences and found that it didn't use the superscript (while my other content did). Anyway, make sure you know what to look for and make the find/replace tool act accordingly.
2 x
- rdearman
- Site Admin
- Posts: 7260
- Joined: Thu May 14, 2015 4:18 pm
- Location: United Kingdom
- Languages: English (N)
- Language Log: viewtopic.php?f=15&t=1836
- x 23316
- Contact:
Re: How to batch extract sentences and phrases for use in Anki?
Although my little script does take into account things like "e.g.", it isn't perfect. You'll still end up with some rubbish. But like emk always says, delete, delete, delete. This would mass-produce hundreds of potential cards, along with some bad stuff. 80/20 rule. This script is for English. I did a similar thing a while back at HTLAL.com to get all the unique characters used in a Chinese book. You can find it in my logs somewhere.
3 x
-
- Orange Belt
- Posts: 228
- Joined: Sun Feb 26, 2017 4:01 pm
- Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
- x 590
Re: How to batch extract sentences and phrases for use in Anki?
I have parsed dictionaries and dictionary query results before. It would depend entirely on the format of the data and how well-structured it is. Rather than copying and pasting a dictionary as one big blob of text and then running scripts to try to reformat it, I would look for the already labeled fields you want in the XML or HTML (e.g., any online dictionary) source, and then extract those fields.
Unless you need a particular dictionary, the dictionaries available for download at https://github.com/freedict/fd-dictionaries might be a place to start; the Freedict dictionaries are well-structured XML (TEI). A random sample entry from the French-English .tei file is:
<entry>
  <form>
    <orth>aigreur</orth>
    <pron>ɛgʀœʀ</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
    <gen>fem</gen>
  </gramGrp>
  <sense>
    <cit type="trans">
      <quote>tartness</quote>
    </cit>
  </sense>
</entry>
where it shouldn't take much work to extract exactly what you want into a tab-separated or .csv format.
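As a rough illustration of how little work that is, here is a Python sketch that pulls the headword and translation out of the sample entry using only the standard library. The function name `entry_to_row` is made up, and the sketch assumes the entry looks exactly like the sample; a full .tei file may declare an XML namespace, in which case the tag names in the paths would need the namespace added.

```python
import xml.etree.ElementTree as ET

# The sample entry from the post; a real file holds many such entries.
SAMPLE = """<entry>
  <form>
    <orth>aigreur</orth>
    <pron>ɛgʀœʀ</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
    <gen>fem</gen>
  </gramGrp>
  <sense>
    <cit type="trans">
      <quote>tartness</quote>
    </cit>
  </sense>
</entry>"""


def entry_to_row(entry):
    """Return (headword, translations) for one <entry> element."""
    headword = entry.findtext("form/orth", default="")
    # Collect every translation quote, in case a sense lists several.
    translations = [
        quote.text or ""
        for quote in entry.findall("sense/cit[@type='trans']/quote")
    ]
    return headword, "; ".join(translations)


headword, translation = entry_to_row(ET.fromstring(SAMPLE))
# One tab-separated line per entry, ready for an Anki import file.
print(f"{headword}\t{translation}")
```

The path strings passed to `findtext` and `findall` use the small XPath subset that ElementTree supports, so the same idea carries over to other labeled fields such as `form/pron`.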
Are you comfortable with XPath to parse HTML from a Web page? Bear in mind that while there are plenty of Web scraping tools around, Web sites with copyrighted data might object to having all their data scraped.
1 x
- emk
- Black Belt - 1st Dan
- Posts: 1708
- Joined: Sat Jul 18, 2015 12:07 pm
- Location: Vermont, USA
- Languages: English (N), French (B2+)
Badly neglected "just for fun" languages: Middle Egyptian, Spanish.
- Language Log: viewtopic.php?f=15&t=723
- x 6730
- Contact:
Re: How to batch extract sentences and phrases for use in Anki?
rdearman wrote:You'll still end up with some rubbish. But like emk always says, delete, delete, delete. This would mass produce hundreds of potential cards, but some bad stuff. 80/20 rule.
Yeah, I'm all for mass-production of cards, but you've gotta be ruthless about deletion. Anki (and other SRS software) can turn into a very efficient torture device if you don't delete aggressively.
Also, I would be very reluctant to learn a dictionary, unless you already read at a C1 level and it's a very small dictionary. I personally prefer words in context, so if I were going to do this, I'd pick sentences from interesting books. That has the advantage of focusing more on common words. If you must learn a dictionary, at least try a frequency dictionary.
If you can't find a frequency dictionary in electronic format for your language, I've got compressed CSVs of the 10,000 most frequent word forms for about 60 languages, though some Windows decompression programs may not like the compression.
Actually, I could make a pretty good tool for generating Anki decks, with bilingual examples and cards in frequency order, from all that OPUS subtitle data. Hmm. But I think I'll keep focusing on subtitles with audio, first.
4 x
-
- White Belt
- Posts: 31
- Joined: Mon Oct 26, 2015 4:33 am
- Languages: British English (N); Italian (B1)
- x 29
Re: How to batch extract sentences and phrases for use in Anki?
Thanks for the responses so far. I probably should have been clearer in my OP, but I'm looking to extract some specific phrases from a specific source. The Collins online and/or Kindle dictionary has example phrases/collocations/sentences etc. in each entry. I'm looking to extract these, rather than the headwords:
sphere [sfɪər] n (gen) sfera
his sphere of interest la sua sfera d'interessi
his sphere of activity il suo campo di attività
within a limited sphere in un ambito molto ristretto
sphere of influence sfera d'influenza
that's outside my sphere non rientra nelle mie competenze
(see http://www.wordreference.com/enit/sphere [collins tab], and https://www.collinsdictionary.com/dicti ... ian/sphere)
Any idea how to do this? Calibre can't seem to convert e-dictionaries so I'm stuck for ideas. I'm willing to learn something about programming if need be!
1 x