How to batch extract sentences and phrases for use in Anki?

mcthulhu
White Belt
Posts: 20
Joined: Sun Feb 26, 2017 4:01 pm
Languages: English (native); strong reading skills - Russian, Spanish, French, Italian, German, Serbo-Croatian, Macedonian, Bulgarian, Slovene, Farsi; fair reading skills - Polish, Czech, Dutch, Esperanto, Portuguese; beginner/rusty - Swedish, Norwegian, Danish
x 28

Postby mcthulhu » Sun Apr 09, 2017 1:08 pm

To repeat my question - Are you comfortable with XPath to parse HTML from a Web page?

Your more specific requirement narrows it down a lot. I haven't looked at that particular dictionary, but this is how you would parse a regular WordReference dictionary entry to identify definitions of the headword:

Code:

// Find every row whose FrWrd headword equals `term`, then collect the text
// of the ToWrd (translation) cells in those rows.
var meanings = [];
var result = doc.evaluate("//tr/td[contains(@class, 'FrWrd')]/strong/text()[1][.='" + term + "']//ancestor::tr/td[contains(@class, 'ToWrd')]/text()",
                          doc.documentElement, null,
                          XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
if (result) {
  for (var i = 0, len = result.snapshotLength; i < len; i++) {
    meanings.push(result.snapshotItem(i).textContent.trim());
  }
}

where "term" is the word being searched on, and "meanings" is a variable that collects the definitions.
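
One caveat with building the expression by string concatenation: a search term that contains an apostrophe (common in Italian or French, e.g. "l'amico") would end the quoted string early and break the XPath. The sketch below shows one possible way around that. The names xpathLiteral and buildMeaningsXPath are made up for this illustration and are not part of any library; the expression itself is the same one as in the snippet in this post.

```javascript
// Hypothetical helper (not from the original snippet): quote an arbitrary
// string as an XPath 1.0 string literal.
function xpathLiteral(text) {
  if (text.indexOf("'") === -1) {
    return "'" + text + "'";        // no apostrophe: single quotes are safe
  }
  if (text.indexOf('"') === -1) {
    return '"' + text + '"';        // no double quote: double quotes are safe
  }
  // Both quote types present: split on ' and glue the pieces with concat().
  var parts = text.split("'").map(function (part) {
    return "'" + part + "'";
  });
  return "concat(" + parts.join(", \"'\", ") + ")";
}

// Build the WordReference expression with the term safely quoted.
function buildMeaningsXPath(term) {
  return "//tr/td[contains(@class, 'FrWrd')]/strong/text()[1][.=" +
         xpathLiteral(term) +
         "]//ancestor::tr/td[contains(@class, 'ToWrd')]/text()";
}
```

The string returned by buildMeaningsXPath(term) would then be passed to doc.evaluate in place of the concatenated expression.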

The HTML in your case is very simple and I don't think you even need to worry about tables. If you right-click on a page element in your browser and select Inspect Element (I'm using Firefox at the moment), you can see the HTML for what you are looking at (or Ctrl+U to see the whole Web page). The information you want is all contained in pairs of span tags, with class="phrase" and class="example translation" used to distinguish the spans for the original phrase and the translation.
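
As a rough illustration of how those two classes could be used, the sketch below walks the spans in document order and pairs each phrase with the translation that follows it. It assumes the translation span comes after its phrase span as a separate element rather than being nested inside it; I have not verified that against the real pages. The name extractPairs is made up for this example.

```javascript
// Pair each span.phrase with the next span.example.translation.
// `root` is anything that supports querySelectorAll, e.g. `document`.
// Assumption: a translation span follows its phrase span in document order.
function extractPairs(root) {
  var spans = root.querySelectorAll("span.phrase, span.example.translation");
  var pairs = [];
  var current = null;
  Array.from(spans).forEach(function (span) {
    if (span.classList.contains("phrase")) {
      // Start a new pair; the translation is filled in if one follows.
      current = { phrase: span.textContent.trim(), translation: "" };
      pairs.push(current);
    } else if (current && !current.translation) {
      current.translation = span.textContent.trim();
    }
  });
  return pairs;
}

// Only runs in a browser, where `document` exists.
if (typeof document !== "undefined") {
  console.log(extractPairs(document));
}
```

A phrase with no translation after it keeps an empty translation field, so gaps are easy to spot afterwards.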

Haiku D'etat
White Belt
Posts: 19
Joined: Mon Oct 26, 2015 4:33 am
Languages: British English (N); Italian (B1)
x 10

Postby Haiku D'etat » Mon Apr 10, 2017 6:37 pm

Thanks. I've googled XPath and I don't even understand the basic description of it - so I wouldn't know where to start! I managed to download every .html file from https://www.collinsdictionary.com/dicti ... an-english, and I've managed to convert the .azw dictionary to a single HTML file (although it either crashes or the formatting breaks down when I try to open it in Firefox - perhaps the file is too big?). I just don't know where to start, what software to use, etc. Any pointers on preliminary steps would be massively appreciated.

Postby mcthulhu » Tue Apr 11, 2017 12:50 am

OK. XPath might actually be overkill for this. What browser are you using, and can you use JavaScript, or Python?
With your "sphere" example page loaded in Firefox, if you go to Tools/Web Developer/Scratchpad, paste in the following code, and click "Run," what do you see?

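Code:

// Collect the text of every element with class "phrase" on the current page.
var phrases = document.getElementsByClassName("phrase");
var arr = Array.from(phrases);
var len = arr.length;
var results = [];
for (var i = 0; i < len; i++) {
  results.push(arr[i].textContent);
}
// Show all phrases in one dialog box, separated by ":" and a line break.
alert(results.join(":\n"));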
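
If the alert shows the phrases, a possible next step is to turn the collected text into tab-separated lines, which is a format Anki can import from a plain text file. This is only a sketch: toTsv is a made-up helper name, and the single-column rows in the browser part are just a placeholder for whatever fields you end up collecting.

```javascript
// Hypothetical helper: turn rows of fields into tab-separated text,
// one note per line. Runs of whitespace inside a field (including tabs
// and line breaks) are collapsed to single spaces so they cannot break
// the column layout.
function toTsv(rows) {
  return rows.map(function (fields) {
    return fields.map(function (field) {
      return String(field).replace(/\s+/g, " ").trim();
    }).join("\t");
  }).join("\n");
}

// Only runs in a browser, where `document` exists: one column per phrase.
if (typeof document !== "undefined") {
  var rows = Array.from(document.getElementsByClassName("phrase")).map(function (el) {
    return [el.textContent];
  });
  console.log(toTsv(rows));
}
```

The output can be copied from the console into a text file; with two fields per row (phrase and translation) each line would map onto the front and back of a card.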

Postby Haiku D'etat » Tue Apr 18, 2017 5:27 pm

Thank you, that's definitely a step in the right direction - the phrases neatly pop up in a little dialogue box. Unfortunately it loses the formatting, so I'd have to manually separate each Italian phrase from the English translation (my plan was to use a 'split by bold' command in Excel to separate them all at once). So how would I go about doing it as a batch? I have 80,000 HTML files that I've pulled from the Collins site. I have minimal knowledge of Python or Java, but if I know which aspects to focus on, I'm more than willing to take the time to learn, because if I could figure out a way of getting them in bulk, I'd have my Anki decks sorted for the next 5 years, haha.