Re: How to batch extract sentences and phrases for use in Anki?

Posted: Sun Apr 09, 2017 1:08 pm
by mcthulhu
To repeat my question: are you comfortable using XPath to parse the HTML of a web page?

Your more specific requirement narrows it down a lot. I haven't looked at that particular dictionary, but this is how you would parse a regular WordReference dictionary entry to identify definitions of the headword:

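Code:

// Assumes "doc" is the parsed entry page and "meanings" is an existing array.
// Note: a "term" containing a single quote would break this XPath string.
var xpath = "//tr/td[contains(@class, 'FrWrd')]/strong/text()[1][.='" + term + "']" +
            "/ancestor::tr/td[contains(@class, 'ToWrd')]/text()";
var result = doc.evaluate(xpath, doc.documentElement, null,
                          XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
// Copy the text of each matching translation cell into "meanings".
for (var i = 0, len = result.snapshotLength; i < len; i++) {
  meanings.push(result.snapshotItem(i).textContent.trim());
}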

where "term" is the word being searched on, and "meanings" is a variable that collects the definitions.
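One caveat with the snippet above: "term" is spliced straight into the XPath expression, so a headword containing an apostrophe (common in Italian or French) would produce an invalid expression. XPath 1.0 has no escape character inside string literals, so the usual workaround is to pick the other quote style or fall back to concat(). The helper below is only a sketch of that idea; the name xpathLiteral is made up for illustration and is not part of any library.

```javascript
// Hypothetical helper: turn any string into a valid XPath 1.0 string literal.
function xpathLiteral(s) {
  if (s.indexOf("'") === -1) {
    return "'" + s + "'";   // no single quotes: wrap in single quotes
  }
  if (s.indexOf('"') === -1) {
    return '"' + s + '"';   // no double quotes: wrap in double quotes
  }
  // Both kinds present: split on single quotes and rejoin the pieces with
  // concat(), inserting a double-quoted single quote between them.
  var parts = s.split("'").map(function (p) { return "'" + p + "'"; });
  return "concat(" + parts.join(", \"'\", ") + ")";
}

// Usage: build the comparison without worrying about quotes in the term.
var exampleTerm = "l'acqua";
var exampleTest = "[.=" + xpathLiteral(exampleTerm) + "]";
```

The result would replace the hand-written "[.='" + term + "']" part of the expression.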

The HTML in your case is very simple, and I don't think you even need to worry about tables. If you right-click on a page element in your browser and select Inspect Element (I'm using Firefox at the moment), you can see the HTML for what you are looking at (or press Ctrl+U to see the source of the whole web page). The information you want is all contained in pairs of span tags, with class="phrase" and class="example translation" used to distinguish the spans for the original phrase and the translation.
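To make the span pairing concrete, here is a rough sketch of how the two sets of spans could be collected and matched up in the browser. It assumes, without having checked the real pages, that every "phrase" span has exactly one "example translation" span and that both appear in the same order; the function name collectPairs is invented for this example.

```javascript
// Sketch: pair each "phrase" span with the "example translation" span at the
// same position. Assumes (unverified) that the two lists line up one-to-one.
function collectPairs(doc) {
  var phrases = Array.from(doc.getElementsByClassName("phrase"));
  // A space-separated argument matches elements that carry both classes.
  var translations = Array.from(doc.getElementsByClassName("example translation"));
  var pairs = [];
  // Stop at the shorter list so a missing translation cannot cause an error.
  var len = Math.min(phrases.length, translations.length);
  for (var i = 0; i < len; i++) {
    pairs.push({
      phrase: phrases[i].textContent.trim(),
      translation: translations[i].textContent.trim()
    });
  }
  return pairs;
}
```

With a dictionary page loaded, calling collectPairs(document) from the browser console should return an array of phrase/translation objects. If the counts of the two lists differ on some pages, the pairing by position would be wrong there, so it is worth spot-checking a few results.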

Re: How to batch extract sentences and phrases for use in Anki?

Posted: Mon Apr 10, 2017 6:37 pm
by Haiku D'etat
Thanks. I've googled XPath and I don't even understand the basic description of it, so I wouldn't know where to start! I managed to download every .html file from https://www.collinsdictionary.com/dicti ... an-english, and I've managed to convert the .azw dictionary to a single HTML file (although it either crashes or the formatting breaks down when I try to open it in Firefox; perhaps the file is too big?). I just don't know where to start, what software to use, etc. Any pointers on preliminary steps would be massively appreciated.

Re: How to batch extract sentences and phrases for use in Anki?

Posted: Tue Apr 11, 2017 12:50 am
by mcthulhu
OK. XPath might actually be overkill for this. What browser are you using, and can you use JavaScript, or Python?
With your "sphere" example page loaded in Firefox, if you go to Tools/Web Developer/Scratchpad, paste in the following code, and click "Run," what do you see?

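Code:

// Collect every element on the page whose class includes "phrase".
var phrases = document.getElementsByClassName("phrase");
var arr = Array.from(phrases);
var len = arr.length;
var results = [];
for (var i = 0; i < len; i++) {
  results.push(arr[i].textContent);
}
// Show all phrases in one dialog, separated by a colon and a newline.
alert(results.join(":\n"));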

Re: How to batch extract sentences and phrases for use in Anki?

Posted: Tue Apr 18, 2017 5:27 pm
by Haiku D'etat
Thank you, that's definitely a step in the right direction: the phrases neatly pop up in a little dialogue box. Unfortunately it loses the formatting, so I'd have to manually separate each Italian phrase from the English translation (my plan was to use a 'split by bold' command in Excel to separate them all at once). So how would I go about doing it as a batch? I have 80,000 HTML files that I've pulled from the Collins site. I have minimal knowledge of Python or Java, but if I know which aspects to focus on, I'm more than willing to take the time to learn, because if I could figure out a way of getting them in bulk, I'd have my Anki decks sorted for the next 5 years, haha.

Re: How to batch extract sentences and phrases for use in Anki?

Posted: Sat Apr 29, 2017 8:42 pm
by mcthulhu
Yes, that was meant as just a step or hint; I wasn't planning to write a whole script. You'd need another variable to hold the elements with the class "example translation," so that you'd have both languages extracted, and you would probably want to append the pairs to a file instead of using an alert() function. With the two extracted separately, manually separating them should not be an issue. Beyond that, your script would need to loop through the HTML pages one at a time, extracting and saving the data from each before loading the next. That could be done in either Python or JavaScript (Node.js would probably be best). For JavaScript, see Node.js's fs (file system) module for reading local files and directories, and DOMParser() for loading an HTML document into something easy to parse (DOMParser is a browser API, so in Node.js it has to come from a library such as jsdom). For Python, see the discussion at http://www.mzan.com/article/19385837-py ... data.shtml, which looks like a scraping problem similar to yours.
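As a rough illustration of the batch loop described above, here is a minimal Node.js sketch that uses only the built-in fs and path modules. To stay dependency-free it pulls the spans out with a regular expression rather than a real HTML parser, which is only defensible because the markup is reported to be very simple; a proper parser would be more robust. The function names, the tab-separated output (a format Anki's text import can read), and the assumptions that class attributes are double-quoted and that the target spans contain no nested span are all mine, not taken from the actual pages. HTML entities such as &amp; are left undecoded.

```javascript
const fs = require("fs");
const path = require("path");

// Return the text of every <span> whose class attribute equals className.
// Naive on purpose: assumes double-quoted attributes and no nested <span>.
function spanTexts(html, className) {
  const re = new RegExp('<span[^>]*class="' + className + '"[^>]*>([\\s\\S]*?)</span>', "g");
  const out = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    // Drop inner tags such as <b> and collapse runs of whitespace.
    out.push(m[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim());
  }
  return out;
}

// Read every .html file in inputDir and append "phrase<TAB>translation"
// lines to outFile. Returns the number of pairs written.
function extractAll(inputDir, outFile) {
  let count = 0;
  for (const name of fs.readdirSync(inputDir)) {
    if (path.extname(name).toLowerCase() !== ".html") continue;
    const html = fs.readFileSync(path.join(inputDir, name), "utf8");
    const phrases = spanTexts(html, "phrase");
    const translations = spanTexts(html, "example translation");
    // Pair by position; stop at the shorter list if the counts differ.
    const n = Math.min(phrases.length, translations.length);
    for (let i = 0; i < n; i++) {
      fs.appendFileSync(outFile, phrases[i] + "\t" + translations[i] + "\n");
      count++;
    }
  }
  return count;
}

// Example call (directory and file names are placeholders):
// extractAll("pages", "cards.tsv");
```

Appending one line at a time is slow but simple; for tens of thousands of files it might be worth collecting each file's lines and writing them in one call instead.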

I would keep WordReference's copyright notice in mind, though, and their page about licensing this dictionary data. Maybe they would even sell it to you in the format you want.