How to batch extract sentences and phrases for use in Anki?


Postby Haiku D'etat » Thu Feb 02, 2017 7:13 pm

I'm wondering whether it's possible to bulk extract the example sentences, phrases, idioms, collocations, etc. that are found in digital dictionaries (i.e. ebook dictionaries) and online dictionaries, for use in SRS. I tried converting a Kindle Collins dictionary, but that didn't get far.

I've done enough tedious copying and pasting in my life to never want to go down that route again (it'd be much easier to just delete the ones I don't want or need as and when they come up). Plus, just cherry-picking the ones that I stumble across isn't going to help me get out of my comfort zone and find the phrases that I don't know that I don't know.

Has anyone had any success with this before? Or any ideas on where to start?

Postby jeff_lindqvist » Thu Feb 02, 2017 9:27 pm

You can always import (or copy/paste) the entire text, and then use a number of Find/Replace commands in a good word processor. For instance, Find any string ending with a period/full stop and a space (which is likely to be the end of a sentence), then Replace that with (the same) period and space PLUS a line break. Then you have every sentence on a separate line:
sentence 1 blablablabla.
sentence 2 blablablabla.
sentence 3 blablablabla.

You can also look for double line breaks and replace them with single ones.

When you're done, copy/paste that to a spreadsheet.

If you're going to use Anki, you need something on the back of the card, e.g. a translation. You can get somewhere with Google Translate, then copy/paste everything into another column in the spreadsheet. Now you have thousands of sentences in both languages. Save the file in a format that Anki can import (tab-separated or CSV).

Still not a one-click process, but easier than doing it sentence by sentence. Assuming that you have a classic novel in a Word document, all this could be done in a matter of minutes.
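For anyone who would rather script the Find/Replace steps than do them in a word processor, here is a minimal Python sketch of the same idea. It collapses double line breaks, breaks the text after every full stop followed by a space, and writes a tab-separated file with an empty second column for the translation. The function names `split_sentences` and `to_anki_tsv` and the sample text are made up for illustration, and the rule is deliberately as naive as the one described above.

```python
import csv
import io
import re


def split_sentences(text):
    """Split text after a full stop followed by whitespace (the Find/Replace rule)."""
    # Collapse runs of whitespace, including double line breaks, into single spaces.
    flat = re.sub(r"\s+", " ", text).strip()
    # Break after every full stop that is followed by a space.
    parts = re.split(r"(?<=\.) ", flat)
    return [p for p in parts if p]


def to_anki_tsv(sentences):
    """Return tab-separated text: sentence in column 1, empty column 2 for the back."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for sentence in sentences:
        writer.writerow([sentence, ""])
    return buffer.getvalue()


sample = "It was late. The shop was closed.\n\nWe went home."
print(to_anki_tsv(split_sentences(sample)))
```

The output can be pasted into a spreadsheet or saved to a file and imported into Anki once the second column has been filled in.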

Postby arthaey » Thu Feb 02, 2017 9:50 pm

Not sure if the Incremental Reading add-on will work for your needs, but give it a try.

Postby rdearman » Fri Feb 03, 2017 3:10 pm

Use calibre to convert the book to text.
https://manual.calibre-ebook.com/generated/en/ebook-convert.html
Then, if you have a Linux machine, or bash installed on your Windows 10 machine, run:

# Start a new line after ., ! or ? when a space and a capital letter follow (GNU sed).
sed 's/\([.!?]\) \([[:upper:]]\)/\1\n\2/g' file.txt > sentence.csv


You'll then have a CSV file with each sentence in the first column. You may have to clean it up in Excel (or your favourite spreadsheet program). Then, if you want Google translations of the sentences, upload the file to Google Sheets and add this formula to the second column:

=GOOGLETRANSLATE(A1,"TL","NL")

The formula takes the form GOOGLETRANSLATE(text, [source_language, target_language]), so replace "TL" with the code of the language you are learning and "NL" with the code of your native language (e.g. "it" and "en").

Save it all as a tab-delimited file and import it into Anki.

Postby MorkTheFiddle » Sat Feb 04, 2017 10:42 pm

jeff_lindqvist wrote:You can always import (or copy/paste) the entire text, and then use a number of Find/Replace commands in a good word processor. For instance, Find any string ending with a period/full stop and a space (which is likely to be the end of a sentence), then Replace that with (the same) period and space PLUS a line break. Then you have every sentence on a separate line:
sentence 1 blablablabla.
sentence 2 blablablabla.
sentence 3 blablablabla.

Just a minor note. In addition to a period/full stop, question marks and exclamation marks must be replaced as well, as seen in rdearman's post for using bash to do this.
And even more minor, abbreviations ending with a period can throw things off: Mrs., Mr., Jr. and that sort of thing.
I replace such things with their non-period equivalents before anything else: replace Mrs. with Mrs and so on. You can go back later and put the periods back.
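A rough Python sketch of that abbreviation trick follows. Instead of deleting the period and putting it back by hand, it swaps the period for a placeholder character before splitting and restores it afterwards. That is a small variation on the approach above, not the same thing. The abbreviation list, the placeholder character, and the function names are all assumptions chosen for the example, and the capital-letter test only covers ASCII A-Z.

```python
import re

# Illustrative list only; every language needs its own set.
ABBREVIATIONS = ["Mrs.", "Mr.", "Jr."]
# ONE DOT LEADER, assumed not to occur anywhere in the source text.
PLACEHOLDER = "\u2024"


def protect(text):
    """Replace the period of each known abbreviation with the placeholder."""
    for abbr in ABBREVIATIONS:
        stem = abbr.rstrip(".")
        text = re.sub(r"\b" + re.escape(stem) + r"\.", stem + PLACEHOLDER, text)
    return text


def split_sentences(text):
    """Split after . ! or ? when a space and an ASCII capital letter follow."""
    protected = protect(text)
    parts = re.split(r"(?<=[.!?]) (?=[A-Z])", protected)
    # Put the real periods back into the abbreviations.
    return [p.replace(PLACEHOLDER, ".") for p in parts]


text = "Mrs. Dalloway said she would buy the flowers herself. Did Mr. Smith agree? Yes! He did."
for sentence in split_sentences(text):
    print(sentence)
```

Without the `protect` step, the same rule would wrongly start a new line after "Mrs." and "Mr.".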

Postby jeff_lindqvist » Sat Feb 04, 2017 11:51 pm

MorkTheFiddle wrote:Just a minor note. In addition to a period/full stop, question marks and exclamation marks must be replaced as well, as seen in rdearman's post for using bash to do this.(...)


Of course you're right - usually you don't get away with just one find/replace. Each language has its own set of titles, abbreviations, punctuation, capitalization, spelling conventions etc. I once imported a set of Esperanto sentences and found that it didn't use the diacritics (while my other content did). Anyway, make sure you know what to look for and make the find/replace tool act accordingly.

Postby rdearman » Sun Feb 05, 2017 10:37 am

Although my little script does take into account things like "e.g.", it isn't perfect. You'll still end up with some rubbish. But like emk always says: delete, delete, delete. This would mass-produce hundreds of potential cards, including some bad ones. 80/20 rule. This script is for English. I did a similar thing a while back at HTLAL.com to get all the unique characters used in a Chinese book. You can find it in my logs somewhere.

Postby mcthulhu » Mon Apr 03, 2017 12:35 pm

I have parsed dictionaries and dictionary query results before. It would depend entirely on the format of the data and how well-structured it is. Rather than copying and pasting a dictionary as one big blob of text and then running scripts to try to reformat it, I would look for the already labeled fields you want in the XML or HTML (e.g., any online dictionary) source, and then extract those fields.

Unless you need a particular dictionary, the dictionaries available for download at https://github.com/freedict/fd-dictionaries might be a place to start; the FreeDict dictionaries are well-structured XML (TEI). A random sample entry from the French-English .tei file is:

<entry>
  <form>
    <orth>aigreur</orth>
    <pron>ɛgʀœʀ</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
    <gen>fem</gen>
  </gramGrp>
  <sense>
    <cit type="trans">
      <quote>tartness</quote>
    </cit>
  </sense>
</entry>

where it shouldn't take much work to extract exactly what you want into a tab-separated or .csv format.
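As a rough illustration of how little work that is, here is a Python sketch using the standard library's xml.etree.ElementTree on the sample entry above. It pulls out each headword (`orth`) and each translation (`quote` inside `cit type="trans"`) and prints them tab-separated. The helper names `local_name` and `extract_pairs` are made up for the example. Full dictionary files may declare an XML namespace, so the sketch compares tag names with any namespace prefix stripped; treat it as a starting point rather than a tested converter for every FreeDict file.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<entry>
  <form>
    <orth>aigreur</orth>
    <pron>ɛgʀœʀ</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
    <gen>fem</gen>
  </gramGrp>
  <sense>
    <cit type="trans">
      <quote>tartness</quote>
    </cit>
  </sense>
</entry>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so matching works with or without a namespace."""
    return tag.rsplit("}", 1)[-1]


def extract_pairs(root):
    """Yield (headword, translation) for every translation quote in every entry."""
    for entry in root.iter():
        if local_name(entry.tag) != "entry":
            continue
        headwords = [
            e.text for e in entry.iter()
            if local_name(e.tag) == "orth" and e.text
        ]
        for cit in entry.iter():
            if local_name(cit.tag) != "cit" or cit.get("type") != "trans":
                continue
            for quote in cit:
                if local_name(quote.tag) == "quote" and quote.text:
                    for headword in headwords:
                        yield headword, quote.text


root = ET.fromstring(SAMPLE)
for headword, translation in extract_pairs(root):
    print(f"{headword}\t{translation}")
```

For a whole downloaded file, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`.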

Are you comfortable with XPath to parse HTML from a Web page? Bear in mind that while there are plenty of Web scraping tools around, Web sites with copyrighted data might object to having all their data scraped.

Postby emk » Mon Apr 03, 2017 2:14 pm

rdearman wrote:You'll still end up with some rubbish. But like emk always says, delete, delete, delete. This would mass produce hundreds of potential cards, but some bad stuff. 80/20 rule.

Yeah, I'm all for mass-production of cards, but you've gotta be ruthless about deletion. Anki (and other SRS software) can turn into a very efficient torture device if you don't delete aggressively.

Also, I would be very reluctant to learn a dictionary, unless you already read at a C1 level and it's a very small dictionary. I personally prefer words in context, so if I were going to do this, I'd pick sentences from interesting books. That has the advantage of focusing more on common words. If you must learn a dictionary, at least try a frequency dictionary.

If you can't find a frequency dictionary in electronic format for your language, I've got compressed CSVs of the 10,000 most frequent word forms for about 60 languages, though some Windows decompression programs may not like the compression.

Actually, I could make a pretty good tool for generating Anki decks, with bilingual examples and cards in frequency order, from all that OPUS subtitle data. Hmm. But I think I'll keep focusing on subtitles with audio, first.

Postby Haiku D'etat » Sat Apr 08, 2017 3:01 pm

Thanks for the responses so far. I probably should have been clearer in my OP, but I'm looking to extract some specific phrases from a specific source. The Collins online and/or Kindle dictionary has example phrases, collocations, sentences, etc. in each entry. I'm looking to extract these, rather than the headwords:

sphere [sfɪər] n (gen) sfera
his sphere of interest la sua sfera d'interessi
his sphere of activity il suo campo di attività
within a limited sphere in un ambito molto ristretto
sphere of influence sfera d'influenza
that's outside my sphere non rientra nelle mie competenze

(see http://www.wordreference.com/enit/sphere [collins tab], and https://www.collinsdictionary.com/dicti ... ian/sphere)

Any idea how to do this? Calibre can't seem to convert e-dictionaries so I'm stuck for ideas. I'm willing to learn something about programming if need be!