I've tested it with the English, German and Italian versions of the first Harry Potter book and, as far as I can tell, the results are perfect.
(I've attached some aligned IT->DE and IT->EN paragraphs.)
Unfortunately, the library is not available on PyPI, so you'll have to install it manually.
FYI, when you run the tool for the first time, it'll download a 1.76GB(!) library to the user folder.
The package doesn't come with a built-in method for saving the results, but that's easy to fix.
I only have rudimentary Python skills, but if you add the save_sents code below print_sents in aligner.py, you'll be able to save the results as a tab-delimited file:
Code:
    # Methods of the aligner class in aligner.py; keep the indentation
    # consistent with the rest of the class.
    def print_sents(self):
        for bead in self.result:
            src_line = self._get_line(bead[0], self.src_sents)
            tgt_line = self._get_line(bead[1], self.tgt_sents)
            print(src_line + "\n" + tgt_line + "\n")

    ## ===============
    def save_sents(self, file_name="align.csv"):
        # Write one aligned pair per line: source, tab, target.
        with open(file_name, "w", encoding="utf-8") as f:
            for bead in self.result:
                src_line = self._get_line(bead[0], self.src_sents)
                tgt_line = self._get_line(bead[1], self.tgt_sents)
                f.write(src_line + "\t" + tgt_line + "\n")
        print("Sentences written to", file_name)
    ## ===============
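One caveat with a plain tab-delimited dump: if a sentence itself contains a tab or a quotation mark, spreadsheet tools may misread the row. Here's a rough sketch that lets Python's csv module handle the quoting instead. Note that write_pairs and read_pairs are hypothetical helper names of mine, not part of the aligner package, and I'm assuming the pairs are plain (source, target) string tuples built the same way as in save_sents.

```python
import csv


def write_pairs(pairs, file_name="align.tsv"):
    """Write (source, target) pairs as tab-delimited rows."""
    # newline="" lets the csv module control the line endings itself.
    with open(file_name, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(pairs)


def read_pairs(file_name="align.tsv"):
    """Read the rows back as a list of (source, target) tuples."""
    with open(file_name, encoding="utf-8", newline="") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]
```

Reading the file back with read_pairs is a quick way to check that every row still has exactly two columns.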
Since I didn't want to upload copyrighted material to some sketchy website, I also had to find a method for converting the source .html files to .txt files.
I finally decided to use a batch file that invoked pandoc.
I did the following:
1. I installed pandoc and removed the DRM from the source and target epub files.
2. I changed the file extension from .epub to .zip and extracted the folder that contained the .html files to the Desktop.
3. In that folder I created a text file with the following line:
Code:
for %%f in (*.html) do pandoc --wrap=none --from=html --to=plain --output="%%~nf.txt" "%%f"
(If the epub contains .xhtml files, simply change the file filter to *.xhtml.)
4. I then renamed the text file to convert_to_plain.cmd and double-clicked it.
(The name doesn't matter, but the file extension needs to be .bat or .cmd.)
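For anyone who'd rather do steps 2 to 4 in Python, here's a minimal sketch of the same idea. It assumes pandoc is installed and on the PATH and that the EPUB is already DRM-free. The function names (extract_html, pandoc_command, convert_epub) are made up for this example. Since an EPUB is just a ZIP archive, zipfile can open it without renaming, and the pandoc flags are the same ones used in the batch file.

```python
import subprocess
import zipfile
from pathlib import Path

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def extract_html(epub_path, out_dir):
    """Extract the HTML/XHTML members of a DRM-free EPUB into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # An EPUB is a ZIP archive, so zipfile can read it without renaming.
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(HTML_SUFFIXES):
                extracted.append(Path(archive.extract(name, out_dir)))
    return extracted


def pandoc_command(html_path):
    """Build the same pandoc call as the batch file, for one input file."""
    html_path = Path(html_path)
    txt_path = html_path.with_suffix(".txt")
    return [
        "pandoc", "--wrap=none", "--from=html", "--to=plain",
        f"--output={txt_path}", str(html_path),
    ]


def convert_epub(epub_path, out_dir):
    """Extract the HTML files and convert each one to plain text."""
    for html_path in extract_html(epub_path, out_dir):
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(pandoc_command(html_path), check=True)
```

Calling convert_epub("book.epub", "book_txt") should leave one .txt file next to each extracted .html or .xhtml file, which matches what the batch file does when it is run inside the extracted folder.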