A tool to generate sentence-aligned bilingual texts, including novels


Postby kundalini » Tue Mar 19, 2024 2:11 am

Thanks to a post by emk, I recently discovered a Python program called Bertalign that can create bilingual texts that are aligned at the sentence level.

Here's an example from the Bertalign GitHub page:

两年以后,大兴安岭。
Two years later, the Greater Khingan Mountains

“顺山倒咧——”
“Tim-ber…”

随着这声嘹亮的号子,一棵如巴特农神庙的巨柱般高大的落叶松轰然倒下,叶文洁感到大地抖动了一下。
Following the loud chant, a large Dahurian larch, thick as the columns of the Parthenon, fell with a thump, and Ye Wenjie felt the earth quake.

她拿起斧头和短锯,开始去除巨大树身上的枝丫。
She picked up her ax and saw and began to clear the branches from the trunk.

每到这时,她总觉得自己是在为一个巨人整理遗体。
Every time she did this, she felt as though she were cleaning the corpse of a giant.

她甚至常常有这样的想象:这巨人就是自己的父亲。
Sometimes she even imagined the giant was her father.


If you want a quick demo, you can make a copy of Bertalign's Google Colab notebook and replace the src and tgt texts in it. In Python, text that spans multiple lines needs to be enclosed in a pair of triple quotation marks, so be sure to preserve them: """ start text ... end text """
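For anyone who hasn't seen the triple-quote syntax before, here is a minimal sketch in plain Python. The variable names src and tgt are the ones mentioned above, and the sample sentences come from the example earlier in this post. Nothing here calls Bertalign itself; it only shows how the multi-line strings are written.

```python
# Triple quotes let one string span several lines.
# A plain '...' or "..." string cannot contain a raw line break.
src = """两年以后,大兴安岭。
她拿起斧头和短锯,开始去除巨大树身上的枝丫。
"""

tgt = """Two years later, the Greater Khingan Mountains
She picked up her ax and saw and began to clear the branches from the trunk.
"""

# Each text is still a single string; the line breaks are kept inside it.
src_lines = src.splitlines()
tgt_lines = tgt.splitlines()
print(len(src_lines), "source lines,", len(tgt_lines), "target lines")
```

If you delete one of the triple quotation marks by accident, Python will report a syntax error, so check those first when a cell refuses to run.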

If you haven't used Google Colab before, it's an online platform to run code on Google's computers.

I made a copy of the Bertalign Colab notebook and modified it so that I could upload txt files instead of copying and pasting text into the cells. I also modified the code so that the merged and aligned txt file can be downloaded. If you want to try making your own interlaced bilingual text, follow the steps in the guide below.

1. Copy the notebook into your Google account. Go to Colab Notebook and make a copy of it by selecting File -> Save a copy in Drive. This saves a copy of the notebook in your Google Drive, so you need to be logged into a Google account.

2. Upload text files in txt format. You can upload both your source and target language files by first clicking on the folder icon to the left, then the file upload icon above it to the right. Beware that when the session times out, the files will be cleared, and you may have to repeat this process.

Image

If you need to convert other formats, such as epub, to txt, there are free websites that perform this, as well as desktop programs like Calibre.

3. Specify the location of the files you uploaded in step 2. In the notebook, look for the cell that contains the code below and modify the two lines to reflect your file names. Here, the two files are named comte_fr.txt and comte_en.txt. Leave everything else alone, including ../ and the quotation marks. With this setup, the French sentence comes first, followed by its matching English sentence.


src_path = '../comte_fr.txt'
tgt_path = '../comte_en.txt'


If you have two copies of Don Quijote in Spanish and English, whose file names are quijote_es.txt and quijote_en.txt, here's what the code would look like:


src_path = '../quijote_es.txt'
tgt_path = '../quijote_en.txt'


By the way, you do not need to include language codes such as es or en in your file name. Bertalign automatically detects the languages.
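To make it clearer what those two paths are for, here is a rough sketch of reading the two uploaded files into Python strings. This is not the notebook's actual code. The helper names read_text_file and load_pair are hypothetical, and I'm assuming both files are saved as UTF-8.

```python
from pathlib import Path


def read_text_file(path):
    """Return the whole file as one string, decoded as UTF-8 (an assumption)."""
    return Path(path).read_text(encoding="utf-8")


def load_pair(src_path, tgt_path):
    """Read the source and target files, failing early if either is missing.

    Hypothetical helper for illustration; not part of Bertalign or the notebook.
    """
    for p in (src_path, tgt_path):
        if not Path(p).is_file():
            raise FileNotFoundError(f"File not found, upload it first: {p}")
    return read_text_file(src_path), read_text_file(tgt_path)
```

With the file names from step 3, the call would be load_pair('../comte_fr.txt', '../comte_en.txt'). If the Colab session has timed out and cleared your uploads, a check like this fails right away with the missing file name.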

4. Run the code. You can run all the code in the notebook by selecting Runtime -> Run all.
You may be asked to Restart the session. If so, click the button as seen below, and simply run the code again in the notebook by selecting Runtime -> Run all.

Image

5. Download the file. After the program finishes aligning the two text files into a single file, you will be asked where you want to download the merged file, temporarily named output.txt. Then you're done.

The finished product looks like this:

Image
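For the curious, the interlaced layout of the merged file (source line, matching target line, blank line) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: I'm assuming the alignment is available as a list of pairs of sentence indices, and the names interlace, write_merged and beads are hypothetical rather than taken from the notebook.

```python
def interlace(src_sents, tgt_sents, beads):
    """Build the merged text: source line, target line, blank line, repeat.

    `beads` is assumed to be a list of (src_indices, tgt_indices) pairs.
    One pair may join several sentences on either side, or leave a side empty.
    Sentences are joined with a space, which suits European languages better
    than Chinese or Japanese.
    """
    blocks = []
    for src_idx, tgt_idx in beads:
        src_line = " ".join(src_sents[i] for i in src_idx)
        tgt_line = " ".join(tgt_sents[j] for j in tgt_idx)
        blocks.append(src_line + "\n" + tgt_line + "\n")
    return "\n".join(blocks)


def write_merged(path, src_sents, tgt_sents, beads):
    """Write the interlaced text to `path` as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(interlace(src_sents, tgt_sents, beads))


# Made-up sample: two French sentences are matched to one English sentence.
src_sents = ["Il pleut.", "Je reste.", "Ici."]
tgt_sents = ["It is raining.", "I am staying here."]
beads = [([0], [0]), ([1, 2], [1])]
print(interlace(src_sents, tgt_sents, beads))
```

The second pair shows why the alignment is not always one sentence to one sentence: a translator may merge or split sentences, and the aligner has to group them.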

At this point, I convert the txt file into an epub file, load it onto my ereader, and read it while listening to the audiobook version.

Image

Here's an aligned and merged copy of Notre Dame de Paris by Victor Hugo, made from a public domain translation:
https://drive.google.com/file/d/1T-uGrzTuOhHe5FcVnerPzfLSnB6sYBYA/view

Some caveats:

- The alignment depends on the quality of the translation. Bertalign tends to do well with faithful translations. With more liberal translations that don't preserve original sentences, it can produce artefacts, such as cutting off in the middle of a sentence.

- Colab isn't completely free. While testing out the code with several books, I've already used up my free usage allowance and will likely have to pay to align books in the future.
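On the first caveat, one rough way to spot likely misalignments is to scan the merged file for pairs whose two sides differ wildly in length. The sketch below is my own suggestion, not something Bertalign provides. It assumes the merged file uses the layout shown earlier (source line, target line, blank line), and the name flag_suspicious_pairs and the max_ratio threshold of 3.0 are arbitrary choices you would need to tune per language pair.

```python
def flag_suspicious_pairs(merged_text, max_ratio=3.0):
    """Return (pair_number, source, target) for pairs that look misaligned.

    A pair is flagged when one side is empty, or when the longer side has
    more than `max_ratio` times as many characters as the shorter side.
    This is a crude heuristic: it only points at places worth checking by eye.
    """
    flagged = []
    blocks = [b for b in merged_text.split("\n\n") if b.strip()]
    for number, block in enumerate(blocks, start=1):
        lines = block.strip("\n").split("\n")
        src = lines[0].strip()
        tgt = lines[1].strip() if len(lines) > 1 else ""
        if not src or not tgt:
            flagged.append((number, src, tgt))
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            flagged.append((number, src, tgt))
    return flagged


# Made-up sample: the second pair is lopsided and gets flagged.
sample = (
    "Bonjour.\nHello.\n\n"
    "Oui.\nYes, of course I will come with you to the market tomorrow.\n"
)
for number, src, tgt in flag_suspicious_pairs(sample):
    print(f"Check pair {number}: {src!r} / {tgt!r}")
```

Character counts differ a lot between, say, Chinese and English, so a threshold that works for French and English will flag far too much for other pairs.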

Despite some relatively minor limitations, I'm overjoyed to have found this tool, as it allows me to build bilingual texts that I find easy to read and enjoy. I'm happy to try to answer any questions! Conversely, if you have suggestions, I'm all ears as well.

If you want to produce similar texts with movie or TV show subtitles, the details are in the thread below. Among other things, it discusses how to download Netflix subtitles in batch.

https://forum.language-learners.org/viewtopic.php?f=19&t=16607
Last edited by kundalini on Tue Mar 19, 2024 5:39 pm, edited 1 time in total.


Postby MorkTheFiddle » Tue Mar 19, 2024 5:20 pm

Comprehensive and comprehensible instructions. Thank you for this. :)