Bertalign: multilingual Python sentence aligner

Post by Doitsujin » Sun Jan 21, 2024 10:36 am

Recently, I was looking for an offline sentence aligner and stumbled upon the Bertalign Python library.
I've tested it with the English, German and Italian versions of the first Harry Potter book and, as far as I can tell, the results are perfect.
(I've attached some aligned IT->DE and IT->EN paragraphs.)

Unfortunately, the library is not available on PyPI. :( I.e., you'll have to install it manually.
FYI, when you run the tool for the first time, it'll download a 1.76 GB(!) language model to the user folder.

The package doesn't come with a native method for saving the results, but that can be easily fixed.
I only have rudimentary Python skills, but if you add the save_sents method shown below right after print_sents in aligner.py, you'll be able to save the results as a tab-delimited file:

Code: Select all

    def print_sents(self):
        for bead in (self.result):
            src_line = self._get_line(bead[0], self.src_sents)
            tgt_line = self._get_line(bead[1], self.tgt_sents)
            print(src_line + "\n" + tgt_line + "\n")

    # ===============
    def save_sents(self, file_name="align.csv"):
        """Save the aligned sentence pairs as a tab-delimited UTF-8 file."""
        with open(file_name, "w", encoding="utf-8") as f:
            for bead in self.result:
                src_line = self._get_line(bead[0], self.src_sents)
                tgt_line = self._get_line(bead[1], self.tgt_sents)
                # One aligned pair per line: source, tab, target
                f.write(src_line + "\t" + tgt_line + "\n")
        print("Sentences written to", file_name)
    # ===============
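
If you'd rather not patch aligner.py, the same tab-delimited output can be produced outside the library. The following is only a minimal sketch using Python's standard csv module. The function names save_pairs and load_pairs and the file name align.tsv are my own placeholders and not part of Bertalign, and the sketch assumes you already have the aligned sentences as a list of (source, target) string pairs. The csv module quotes any field that itself contains a tab or a quotation mark, which plain string concatenation does not do.

```python
import csv


def save_pairs(pairs, file_name="align.tsv"):
    """Write (source, target) sentence pairs to a tab-delimited UTF-8 file.

    Hypothetical helper, not part of Bertalign.
    """
    # newline="" lets the csv module control the line endings itself
    with open(file_name, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(pairs)
    return file_name


def load_pairs(file_name="align.tsv"):
    """Read the pairs back as a list of (source, target) tuples."""
    with open(file_name, encoding="utf-8", newline="") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]


if __name__ == "__main__":
    # Made-up sample data, just to show the round trip
    sample = [("Buongiorno.", "Guten Morgen."), ("Grazie mille.", "Vielen Dank.")]
    save_pairs(sample)
    print(load_pairs())
```

Most spreadsheet programs can import the resulting file if you pick "tab" as the separator and UTF-8 as the encoding.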

Since I didn't want to upload copyrighted material to some sketchy website, I also had to find an offline method for converting the .html files inside the epubs to .txt files.
I finally decided to use a batch file that invokes pandoc.

I did the following:
1. I installed pandoc and removed the DRM from the source and target epub files.
2. I changed the file extension from .epub to .zip and extracted the folder that contained the .html files to the Desktop.
3. In that folder I created a text file with the following line:

Code: Select all

for %%f in (*.html) do pandoc --wrap=none --from=html --to=plain --output="%%~nf.txt" "%%f"

(If the epub contains .xhtml files, simply change the file filter to *.xhtml.)
4. I then renamed the text file to convert_to_plain.cmd and double-clicked it.
(The name doesn't matter, but the file extension needs to be .bat or .cmd.)
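
For anyone who isn't on Windows, steps 2 to 4 above can be sketched in Python instead of a batch file. This is a rough sketch under a few assumptions: the epub is already DRM-free, pandoc is installed and on the PATH, and the function names (extract_html, pandoc_command, convert_all) and the book.epub path are placeholders I made up. It passes the same pandoc options as the batch line above. Note that it puts all extracted files into one folder, so files with the same name in different subfolders of the epub would overwrite each other.

```python
import subprocess
import zipfile
from pathlib import Path

# File extensions treated as chapter files (assumption: adjust as needed)
HTML_SUFFIXES = {".html", ".xhtml", ".htm"}


def extract_html(epub_path, out_dir):
    """Copy the (X)HTML files out of an epub, which is just a zip archive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(epub_path) as zf:
        for info in zf.infolist():
            name = Path(info.filename)
            if not info.is_dir() and name.suffix.lower() in HTML_SUFFIXES:
                target = out_dir / name.name  # drop the archive's subfolders
                target.write_bytes(zf.read(info))
                extracted.append(target)
    return sorted(extracted)


def pandoc_command(html_file):
    """Build the pandoc call, with the same options as the batch file."""
    html_file = Path(html_file)
    return [
        "pandoc", "--wrap=none", "--from=html", "--to=plain",
        f"--output={html_file.with_suffix('.txt')}", str(html_file),
    ]


def convert_all(epub_path, out_dir):
    """Extract every chapter file and convert each one to plain text."""
    for html_file in extract_html(epub_path, out_dir):
        # check=True raises an error if pandoc fails on a file
        subprocess.run(pandoc_command(html_file), check=True)


if __name__ == "__main__":
    convert_all("book.epub", "book_txt")  # placeholder paths
```

The .txt files end up next to the extracted .html files in the output folder, one per chapter file.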