Re: How to batch extract sentences and phrases for use in Anki?
Posted: Sun Apr 09, 2017 1:08 pm
To repeat my question: are you comfortable using XPath to parse HTML from a Web page?
Your more specific requirement narrows it down a lot. I haven't looked at that particular dictionary, but this is how you would parse a regular WordReference dictionary entry to identify definitions of the headword:
Code: Select all
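// Find the FrWrd cell whose first text node equals the headword, then
// collect the text of the ToWrd cell(s) in the same table row.
// Note: a term containing an apostrophe will break this XPath string.
var result = doc.evaluate("//tr/td[contains(@class, 'FrWrd')]/strong/text()[1][.='" + term + "']//ancestor::tr/td[contains(@class, 'ToWrd')]/text()", doc.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
if (result) {
    for (var i = 0, len = result.snapshotLength; i < len; i++) {
        // Each snapshot item is a text node; trim the surrounding whitespace.
        meanings.push(result.snapshotItem(i).textContent.trim());
    }
}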
where "term" is the word being searched for, and "meanings" is an array that collects the definitions.
The HTML in your case is very simple, and I don't think you even need to worry about tables. If you right-click on a page element in your browser and select Inspect Element (I'm using Firefox at the moment), you can see the HTML for what you are looking at (or press Ctrl-U to see the source of the whole Web page). The information you want is all contained in pairs of span tags, with
Code: Select all
class="phrase"
and
Code: Select all
class="example translation"
used to distinguish the spans for the original phrase and the translation.
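If it helps, here is a rough sketch of how those two classes could be used to pull out phrase/translation pairs, written in the same style as the WordReference snippet. I haven't run it against your dictionary, so treat it as a starting point only. The function names (textsFor, extractPairs, toTabSeparated) are made up for illustration, and it assumes that every phrase span has a matching translation span and that both appear in the same order on the page. The last function joins each pair with a tab, since Anki can import tab-separated text files.

Code: Select all
```javascript
// Sketch only. Assumes each span with class="phrase" has a matching span
// with class="example translation", in the same order on the page.

// Return the trimmed text of every node matched by an XPath expression.
function textsFor(doc, xpath) {
    // 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
    var result = doc.evaluate(xpath, doc.documentElement, null, 7, null);
    var texts = [];
    for (var i = 0, len = result.snapshotLength; i < len; i++) {
        texts.push(result.snapshotItem(i).textContent.trim());
    }
    return texts;
}

// Pair up the original phrases and their translations by position.
function extractPairs(doc) {
    var phrases = textsFor(doc, "//span[@class='phrase']");
    var translations = textsFor(doc, "//span[@class='example translation']");
    var pairs = [];
    // Stop at the shorter list so we never read past the end of either one.
    var len = Math.min(phrases.length, translations.length);
    for (var i = 0; i < len; i++) {
        pairs.push({ phrase: phrases[i], translation: translations[i] });
    }
    return pairs;
}

// Build one tab-separated line per pair (phrase, then translation).
function toTabSeparated(pairs) {
    return pairs.map(function (p) {
        return p.phrase + "\t" + p.translation;
    }).join("\n");
}
```

In the browser console you would call it as toTabSeparated(extractPairs(document)) and save the output to a text file. If the two lists come back with different lengths, the page structure is not what this sketch assumes and the pairing would need to follow the actual nesting of the spans instead.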