Making a subtitle translator & merger (Linux)

Postby bedtime » Sat Mar 21, 2020 9:22 pm

The program I've made can translate an .srt file into another language and merge the two languages into one file so that the user can see both subtitles on the screen at once. This can be great for learning.

example.srt:

Code:

1
00:00:06,000 --> 00:00:12,074
Advertise your product or brand here contact www.OpenSubtitles.org today

2
00:00:26,424 --> 00:00:29,087
Writers basically take readers hostage.

3
00:00:29,297 --> 00:00:33,261
You're forcing someone to spend 5, 6, 7, 8 hours in your brain.

4
00:00:34,182 --> 00:00:35,763
People have less time now.

5
00:00:35,764 --> 00:00:35,804
People have less time now.

6
00:00:36,054 --> 00:00:38,686
10 years ago you had a minor bestseller.

7
00:00:38,687 --> 00:00:38,727
10 years ago you had a minor bestseller.

8
00:00:39,227 --> 00:00:41,269
Now people are saturated with info.

9
00:00:41,519 --> 00:00:43,351
They have every excuse not to read.
...


The new file with both subtitles is a single .ssa or .ass (SubStation Alpha: https://en.wikipedia.org/wiki/SubStation_Alpha) subtitle file, which VLC and other major video players can run.

example.ssa:

Code:

[Script Info]
ScriptType: v4.00+
Collisions: Normal
PlayDepth: 0
Timer: 100,0000
Video Aspect Ratio: 0
WrapStyle: 0
ScaledBorderAndShadow: no

[V4+ Styles]
Format: Name,Fontname,Fontsize,PrimaryColour,SecondaryColour,OutlineColour,BackColour,Bold,Italic,Underline,StrikeOut,ScaleX,ScaleY,Spacing,Angle,BorderStyle,Outline,Shadow,Alignment,MarginL,MarginR,MarginV,Encoding
Style: Default,Arial,10,&H00FFFFFF,&H00FFFFFF,&H00000000,&H00000000,-1,0,0,0,100,100,0,0,1,1,0,2,10,10,10,0
Style: Top,Arial,10,&H00F9FFFF,&H00FFFFFF,&H00000000,&H00000000,-1,0,0,0,100,100,0,0,1,1,0,8,10,10,10,0
Style: Mid,Arial,18,&H0000FFFF,&H00FFFFFF,&H00000000,&H00000000,-1,0,0,0,100,100,0,0,1,2,0,5,10,10,10,0
Style: Bot,Arial,18,&H00F9FFF9,&H00FFFFFF,&H00000000,&H00000000,-1,0,0,0,100,100,0,0,1,2,0,2,10,10,10,0

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:06.00,0:00:12.07,Top,,0000,0000,0000,,Advertise your product or brand here contact www.OpenSubtitles.org today
Dialogue: 0,0:00:06.00,0:00:12.07,Bot,,0000,0000,0000,,Annoncez votre produit ou votre marque ici, contactez www.OpenSubtitles.org dès aujourd'hui
Dialogue: 0,0:00:26.42,0:00:29.08,Top,,0000,0000,0000,,Writers basically take readers hostage.
Dialogue: 0,0:00:26.42,0:00:29.08,Bot,,0000,0000,0000,,Les écrivains prennent essentiellement les lecteurs en otage.
Dialogue: 0,0:00:29.29,0:00:33.26,Top,,0000,0000,0000,,You're forcing someone to spend 5, 6, 7, 8 hours in your brain.
Dialogue: 0,0:00:29.29,0:00:33.26,Bot,,0000,0000,0000,,Vous forcez quelqu'un à passer 5, 6, 7, 8 heures dans votre cerveau.
Dialogue: 0,0:00:34.18,0:00:35.76,Top,,0000,0000,0000,,People have less time now.
Dialogue: 0,0:00:34.18,0:00:35.76,Bot,,0000,0000,0000,,Les gens ont moins de temps maintenant.
Dialogue: 0,0:00:35.76,0:00:35.80,Top,,0000,0000,0000,,People have less time now.
Dialogue: 0,0:00:35.76,0:00:35.80,Bot,,0000,0000,0000,,Les gens ont moins de temps maintenant.
Dialogue: 0,0:00:36.05,0:00:38.68,Top,,0000,0000,0000,,10 years ago you had a minor bestseller.
Dialogue: 0,0:00:36.05,0:00:38.68,Bot,,0000,0000,0000,,Il y a 10 ans, vous aviez un best-seller mineur.
Dialogue: 0,0:00:38.68,0:00:38.72,Top,,0000,0000,0000,,10 years ago you had a minor bestseller.
Dialogue: 0,0:00:38.68,0:00:38.72,Bot,,0000,0000,0000,,Il y a 10 ans, vous aviez un best-seller mineur.
Dialogue: 0,0:00:39.22,0:00:41.26,Top,,0000,0000,0000,,Now people are saturated with info.
Dialogue: 0,0:00:39.22,0:00:41.26,Bot,,0000,0000,0000,,Maintenant, les gens sont saturés d'informations.
Dialogue: 0,0:00:41.51,0:00:43.35,Top,,0000,0000,0000,,They have every excuse not to read.
Dialogue: 0,0:00:41.51,0:00:43.35,Bot,,0000,0000,0000,,Ils ont toutes les excuses pour ne pas lire.
...
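
The timestamp and line layout in the two listings above can be reproduced with a few lines of POSIX sh. This is only a sketch inferred from the samples (SRT 00:00:12,074 becomes 0:00:12.07, so the last millisecond digit appears to be dropped rather than rounded). The function names srt_to_ass_time and ass_dialogue are made up for illustration and are not taken from the actual script.

```shell
#!/bin/sh
# Hypothetical helpers inferred from the sample files; not the author's code.

# Convert an SRT timestamp (HH:MM:SS,mmm) to the ASS form (H:MM:SS.cc).
srt_to_ass_time() {
    t=${1%?}        # drop the last millisecond digit: 00:00:12,07
    h=${t%%:*}      # hours field: 00
    rest=${t#*:}    # minutes, seconds, centiseconds: 00:12,07
    h=${h#0}        # strip one leading zero from the hours: 0
    printf '%s:%s.%s\n' "$h" "${rest%,*}" "${rest#*,}"
}

# Print one Dialogue event: ass_dialogue START END STYLE TEXT
ass_dialogue() {
    printf 'Dialogue: 0,%s,%s,%s,,0000,0000,0000,,%s\n' \
        "$(srt_to_ass_time "$1")" "$(srt_to_ass_time "$2")" "$3" "$4"
}
```

Calling ass_dialogue 00:00:26,424 00:00:29,087 Top 'Writers basically take readers hostage.' should print the same Top line as in example.ssa; calling it a second time with the Bot style and the translated text gives the matching bottom line.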


This is what it looks like in my VLC player:
Screenshot from 2020-02-28 17-44-341.png

You first need to give the program a source .srt file to translate from: it does not create .srt files, it only translates existing ones. Translations typically take 1-2 minutes.

The idea of the program is to be as simple and concise as possible, keeping maintenance low and reliability high. The program is a simple shell script with almost no dependencies (aside from extremely common command line tools that are included in nearly all Linux distributions). It has around 300 lines of code (compared with thousands in other projects), lives in one file, and requires no installation. It uses Google Translate for its translations. The program is in ALPHA stage, and I've little idea when or if it'll reach BETA. Error detection is incomplete, and there is little instruction; however, it has worked very well for the more than 100 .srt files I've put into it.

This program will never cost money, never be sold, never take on a copyright, and never ask for donations (nor do I want them). You may edit and distribute this program, but please notify me of bugs or code changes so we can all benefit. Again, if you try the program and it doesn't work, please let me know. Simply run the program with the debug flag (-d) and give me the *.log file.

Several options can be chosen at runtime, such as a very simple GUI mode, automatic source and target language detection, and the ability to swap the translation engine for any other CLI engine. There are lots of other small tweaking options, too.

Currently, my tests indicate that it can translate English into 102 languages:

Afrikaans Afrikaans [af]
Albanian Shqip [sq]
Amharic አማርኛ [am]
Arabic العربية [ar]
Armenian Հայերեն [hy]
Azerbaijani Azərbaycanca [az]
Basque Euskara [eu]
Belarusian беларуская [be]
Bengali বাংলা [bn]
Bosnian Bosanski [bs]
Bulgarian български [bg]
Catalan Català [ca]
Cebuano Cebuano [ceb]
Chichewa Nyanja [ny]
Chinese Simplified 简体中文 [zh-CN]
Chinese Traditional 正體中文 [zh-TW]
Corsican Corsu [co]
Croatian Hrvatski [hr]
Czech Čeština [cs]
Danish Dansk [da]
Dutch Nederlands [nl]
Esperanto Esperanto [eo]
Estonian Eesti [et]
Filipino Tagalog [tl]
Finnish Suomi [fi]
French Français [fr]
Frisian Frysk [fy]
Galician Galego [gl]
Georgian ქართული [ka]
German Deutsch [de]
Greek Ελληνικά [el]
Gujarati ગુજરાતી [gu]
Haitian Creole Kreyòl Ayisyen [ht]
Hausa Hausa [ha]
Hawaiian ʻŌlelo Hawaiʻi [haw]
Hebrew עִבְרִית [he]
Hmong Hmoob [hmn]
Hungarian Magyar [hu]
Icelandic Íslenska [is]
Igbo Igbo [ig]
Indonesian Bahasa Indonesia [id]
Irish Gaeilge [ga]
Italian Italiano [it]
Japanese 日本語 [ja]
Javanese Basa Jawa [jv]
Kannada ಕನ್ನಡ [kn]
Kazakh Қазақ тілі [kk]
Kinyarwanda Kinyarwanda [rw]
Korean 한국어 [ko]
Kurdish Kurdî [ku]
Kyrgyz Кыргызча [ky]
Lao ລາວ [lo]
Latin Latina [la]
Latvian Latviešu [lv]
Lithuanian Lietuvių [lt]
Luxembourgish Lëtzebuergesch [lb]
Macedonian Македонски [mk]
Malagasy Malagasy [mg]
Malay Bahasa Melayu [ms]
Malayalam മലയാളം [ml]
Maltese Malti [mt]
Maori Māori [mi]
Mongolian Монгол [mn]
Norwegian Norsk [no]
Pashto پښتو [ps]
Persian فارسی [fa]
Polish Polski [pl]
Portuguese Português [pt]
Romanian Română [ro]
Russian Русский [ru]
Samoan Gagana Sāmoa [sm]
Scots Gaelic Gàidhlig [gd]
Serbian (Cyrillic) српски [sr-Cyrl]
Serbian (Latin) srpski [sr-Latn]
Sesotho Sesotho [st]
Shona chiShona [sn]
Sindhi سنڌي [sd]
Sinhala සිංහල [si]
Slovak Slovenčina [sk]
Slovenian Slovenščina [sl]
Somali Soomaali [so]
Spanish Español [es]
Sundanese Basa Sunda [su]
Swahili Kiswahili [sw]
Swedish Svenska [sv]
Tajik Тоҷикӣ [tg]
Tamil தமிழ் [ta]
Tatar татарча [tt]
Telugu తెలుగు [te]
Thai ไทย [th]
Turkish Türkçe [tr]
Turkmen تۆرکمنچه‎ [tk]
Ukrainian Українська [uk]
Urdu اُردُو [ur]
Uyghur ئۇيغۇر تىلى [ug]
Uzbek Oʻzbek tili [uz]
Vietnamese Tiếng Việt [vi]
Welsh Cymraeg [cy]
Xhosa isiXhosa [xh]
Yiddish ייִדיש [yi]
Yoruba Yorùbá [yo]
Zulu isiZulu [zu]

The program will translate the following languages but very slowly (line by line):

Hindi हिन्दी [hi]
Khmer ភាសាខ្មែរ [km]
Marathi मराठी [mr]
Myanmar (Burmese) မြန်မာစာ [my]
Nepali नेपाली [ne]
Odia (Oriya) ଓଡ଼ିଆ [or]
Punjabi ਪੰਜਾਬੀ [pa]

Likely, it could translate the above languages back into English or those in the list, but I've yet to test that out; there are just too many combinations to test. It's been heavily tested in all combinations of English, French, and Russian.

The newest version of the code can be found here: https://gitlab.com/bedtime_/srtssa/-/tree/master

*** EDIT ***

The program now has the ability to translate and merge .txt, .pdf, and .epub (html and xhtml-based!) files into a dual-language .txt file, which can be read in a reader or made into flashcards. The text is formatted in such a way as to break down paragraphs and sentences to make reading easier. The file may be any size.

An epub file was translated into a text document here:
Screenshot from 2020-05-08 10-55-33.png
Last edited by bedtime on Sat May 16, 2020 10:53 pm, edited 18 times in total.

Postby rdearman » Sun Mar 22, 2020 10:05 am

Sed or awk would take care of the newlines in the original srt if you had a small preprocessing stage.

Also, why SH and not BASH ?
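
One possible reading of the preprocessing suggestion above: join the text lines of each cue into a single line before anything else touches the file. A rough awk sketch, assuming cues are separated by blank lines and that the first two lines of each cue are the index and the timing. join_cue_text is a hypothetical name, not part of the program.

```shell
#!/bin/sh
# Hypothetical preprocessing step: collapse multi-line cue text to one line.
# Assumes blank-line-separated cues: index, timing, then one or more text lines.
join_cue_text() {
    tr -d '\r' | awk '
        BEGIN { RS = ""; FS = "\n" }    # paragraph mode: one record per cue
        {
            text = $3
            for (i = 4; i <= NF; i++) text = text " " $i
            printf "%s\n%s\n%s\n\n", $1, $2, text
        }'
}
```

It would be used as a filter, for example join_cue_text < mysubs.srt > joined.srt, so the rest of a script can rely on exactly three lines per cue.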

Postby bedtime » Sun Mar 22, 2020 12:29 pm

rdearman wrote: Sed or awk would take care of the newlines in the original srt if you had a small, preprocessing stage.

Could you explain this to me?

rdearman wrote: Also, why SH and not BASH ?

I think it's just my knack for minimalism. I like the efficiency of SH.

...

I managed to make the program do all the work, so no user intervention is required. The program first takes the source .srt file, converts it to a special readable UTF type format (allows accents and special characters to be used), arranges all the text lines and removes junky characters, and does the translation.
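
The clean-up stage described above might look roughly like the following. This is a guess at the general shape only: the post does not say which encoding is targeted or which characters count as junk, so UTF-8, carriage-return removal and tag stripping are all assumptions, and normalise_srt is a made-up name.

```shell
#!/bin/sh
# Hypothetical clean-up filter; the real script may do this differently.
# $1: source encoding name as understood by iconv (e.g. ISO-8859-1)
normalise_srt() {
    iconv -f "$1" -t UTF-8 |    # re-encode so accented characters survive
        tr -d '\r' |            # drop DOS carriage returns
        sed 's/<[^>]*>//g'      # strip markup such as <i>...</i>
}
```

For example, normalise_srt ISO-8859-1 < mysubs.srt > clean.srt would produce a file that is safe to hand to the translation step.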

Code:

$ ./transmerge.sh en fr top mysubs.srt

The above would translate and merge an English .srt file into an English/French subtitle file, mysubs.ass. The source text, English in this case, would be displayed at the top of the screen. The font size and type, boldness, outline thickness, colour, shadow, position, and other parameters may be adjusted in the file.

I watched about 30 minutes of the movie Sibyl (2019) last night with English/French subs. The formatting was perfect. About 1/20 of the translations were off a little. It was a pleasure to be able to relax at the end of the night and benefit from all that work.
Screenshot from 2020-03-22 08-17-216.jpeg


*** EDIT ***

Find the updated version of the program here: https://gist.github.com/bathtime/8ae8303e870b2909c03f3b9332a4dd47

Postby bedtime » Thu Apr 09, 2020 3:07 pm

There is very good news! :)

A Russian programmer has joined the effort in making a program do what mine does. He is much more experienced at programming than I am, and he's already made a gitlab entry for it. His program will translate a movie in 3-10 minutes:

https://gitlab.com/nezabudka/srt_to_ass/-/blob/master/to.sh

If it is okay, I'll post the link to our conversations in case anyone here is interested. The forum is completely different from this one, so there is no competition:

https://www.unix.com/shell-programming-and-scripting/284008-combining-two-perl-commands-into-one.html

I've not yet been able to get his program working with all subtitles, but it is only the very beginning and he is working on it.

...

More good news!

I've made a new program that can translate just like the old program but in just a few minutes, as opposed to hours. It does this by sending out larger chunks of data at one time. A typical movie would take 3-10 minutes.
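
The speed-up described here comes from batching: many subtitle lines go out in one request instead of one request per line. A minimal sketch of that idea in sh and awk, assuming a size limit per request and a made-up ---- marker between batches. The real script may split its input quite differently, and chunk_lines is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical batching helper: chunk_lines MAX
# Reads lines on stdin and prints them back with a "----" line wherever the
# next line would push the current batch past MAX characters (newlines
# included). Some awk implementations count bytes rather than characters.
chunk_lines() {
    awk -v max="$1" '
        {
            need = length($0) + 1
            if (used > 0 && used + need > max) {
                print "----"    # start a new batch
                used = 0
            }
            print
            used += need
        }'
}
```

With a limit of 10, the three lines aaaa, bbbb and cccc come out as one batch holding the first two lines, the marker, and then a second batch holding cccc. A line longer than the limit still gets a batch of its own.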

It is only 132 lines of simple SH shell script, which is very small and easy to understand:

https://gist.github.com/bathtime/eac3cbd4c26e477f45f141b66cd00f84

For now, my program works with everything I've tested it on. I've translated between English, French, and Russian. Haven't tried any other languages.

Postby Querneus » Sat Apr 18, 2020 5:56 pm

This is definitely a nice tool. I am actually surprised that Google is now allowing simple calls like that to translate.googleapis.com again... I remember trying that when making a chatroom bot five or six years ago so that people in the chatroom could easily request translations, and I found it blocked all requests unless the request was (or imitated) a JavaScript request from a browser made on translate.google.com (the online Google Translate).

Your program only works on Linux (and maybe BSD or even Mac, I don't know), but it would be quite nice to make an equivalent with a GUI that'd work on Windows and Mac. However, I'm not going to try because I'm wary of getting into trouble with Google with a public tool like that. They tend to tolerate this sort of program better when it's harder to use (see also their historically greater tolerance for YouTube tools that require some knowledge of the command line or programming to be used, as opposed to easy programs).

Postby Axon » Sat Apr 18, 2020 5:59 pm

I've been working on a similar tool, using the googletrans Python library, and found it absolutely abysmal for Chinese - significantly worse than pasting the same text into Google's web translation service, and little better than simple dictionary lookups for each word. I wonder if this is doing anything differently or if it will run into the same problems that googletrans does.

Postby bedtime » Tue Apr 28, 2020 11:27 pm

Ser wrote:However, I'm not going to try because I'm wary of getting into trouble with Google with a public tool like that. They tend to tolerate this sort of programs better when they're harder to use (see also their historical greater tolerance for YouTube tools that require some knowledge of the command line or programming to be used, as opposed to easy programs).

Google Translate will only ban you if you contact it several times with rest periods of less than 10 seconds in between. Even if you do get banned, it's only for an hour or two. I've been banned dozens of times while making changes and testing this program, and Google always lets me back in. This only happens during tests. The code posted on GitHub should not get you banned unless you manually set the rest interval below the recommended value.

I've translated with this program continuously for over 12 hours with no problems. But let's say you do get banned after shortening the rest intervals: you can then switch to the alternate engine (param: -alt), which contacts Google a different way (I suspect using Google's API, but you would need 'translate-shell' to be installed for it to run). After a couple of hours you could switch back. The translations are exactly the same in any event.
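
The rest period between requests can be pictured as a simple throttled loop. The sketch below is not the script's actual code: translate_chunk is a stand-in that just echoes its input file, and the REST variable name is invented. Only the 10-second figure comes from the post.

```shell
#!/bin/sh
# Hypothetical throttled loop; translate_chunk stands in for the real request.
REST=${REST:-10}    # seconds to wait between requests (figure from the post)

translate_chunk() {
    # Placeholder: a real script would contact the translation engine here.
    cat "$1"
}

# translate_all FILE...: process each chunk file, resting between requests.
translate_all() {
    first=1
    for chunk in "$@"; do
        # Sleep before every request except the first.
        [ "$first" -eq 1 ] || sleep "$REST"
        first=0
        translate_chunk "$chunk"
    done
}
```

Keeping the delay in one variable makes it obvious what would be changed by anyone shortening the rest period, which is the situation the post warns about.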

Axon wrote: I've been working on a similar tool, using the googletrans Python library, and found it absolutely abysmal for Chinese - significantly worse than pasting the same text into Google's web translation service, and little better than simple dictionary lookups for each word. I wonder if this is doing anything differently or if it will run into the same problems that googletrans does.

I have no idea if this program can translate Chinese. It's been tested with English, French, and Russian only. Good news is that the program now accepts an alternate translation engine at runtime as a command line parameter (param: -engine ...); so, you could essentially use any CLI engine you want with the program to try it out.

Like so,

Code:

$ ./str2ass.sh -ch 1000 -d -engine 'trans -s %s -t %s -b %x' -g


Where %s and %t would drop in the source and target languages and %x the text. Super simple to use. ;)

https://gist.github.com/bathtime/eac3cb ... b66cd00f84

Postby Querneus » Wed Apr 29, 2020 12:09 am

bedtime wrote:
rdearman wrote: Sed or awk would take care of the newlines in the original srt if you had a small, preprocessing stage.

Could you explain this to me?

I think he might've misinterpreted something about your program. You constantly need newlines, it doesn't matter.

bedtime wrote: Google Translate will only ban if you contact it several times with rest periods of less than 10 seconds between. Even if you do get banned, it's only for an hour or two. I've been banned from it dozens of times when making changes and testing this program. Google always lets me back. This only happens during tests. The code posted on github should not get you banned unless you manually insert your own rest parameters to less than the recommended.

I don't remember what I did exactly years ago. Maybe I did something wrong.

bedtime wrote: I have no idea if this program can translate Chinese. It's been tested with English, French, and Russian only. Good news is that the program now accepts an alternate translation engine at runtime as a command line parameter (param: -engine ...); so, you could essentially use any CLI engine you want with the program to try it out.

What is this "trans" program that is called in the program with the -alt option? I don't have it by default in Fedora Linux (so your program would fail if I tried it), and it's not on the RedHat package manager either. Maybe it's a Debian/Ubuntu thing? Googling hasn't helped either. I imagine it is some kind of Google tool because it ostensibly calls the Google Translate API (you say the translations are the same).

By the way, options (also called flags) marked with hyphens (like -a or --alternative) are not normally called "parameters" if you're using them the conventional way (the getopt way). Parameters are bare strings, whether used with an option or directly on the program, as in the three parameters 'pretty trees.mp4', 0:00:40 and reducedvideo.mp4 in:

ffmpeg -i 'pretty trees.mp4' -ss 0:00:40 reducedvideo.mp4
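
In a POSIX sh script, the conventional split between options and bare parameters is usually handled with the getopts builtin. The sketch below is illustrative only: the option letters -d and -e and the variable names are invented, and getopts cannot parse multi-letter single-dash options, so this is not how the program under discussion reads its arguments.

```shell
#!/bin/sh
# Hypothetical option parsing with POSIX getopts; option letters are invented.
parse_args() {
    debug=0
    engine=default
    OPTIND=1                        # reset so the function can be called again
    while getopts 'de:' opt; do
        case $opt in
            d) debug=1 ;;           # option (flag) that takes no argument
            e) engine=$OPTARG ;;    # option followed by its own argument
            *) return 2 ;;          # unknown option: getopts already complained
        esac
    done
    shift $((OPTIND - 1))
    # Whatever is left are the bare strings given directly to the program
    # (POSIX calls them operands).
    src=$1 tgt=$2 file=$3
}
```

After parse_args -d -e 'trans -b' en fr mysubs.srt, debug is 1, engine holds trans -b, and src, tgt and file hold en, fr and mysubs.srt.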


https://gist.github.com/bathtime/eac3cbd4c26e477f45f141b66cd00f84

I think it would be good for you to edit the first post and insert the latest version in it, ideally at the top with a notice in bold.

Postby bedtime » Wed Apr 29, 2020 12:40 am

Just before I go to bed...

Querneus wrote: What is this "trans" program that is called in the program with the -alt option? I don't have it by default in Fedora Linux (so your program would fail if I tried it), and it's not on the RedHat package manager either. Maybe it's a Debian/Ubuntu thing? Googling hasn't helped either. I imagine it is some kind of Google tool because it ostensibly calls the Google Translate API (you say the translations are the same).

The package is called 'translate-shell' and is run by 'trans'. It can be installed in Fedora with 'dnf install translate-shell' (I don't remember if you need to add a repo or not). Anyways, it can also be found here: https://github.com/soimort/translate-shell
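
A script that depends on an optional tool like this can check for it before trying to use it. A small sketch, assuming nothing about how the real program handles a missing trans binary; have_cmd and choose_engine are hypothetical names, and the alt and default labels are invented return values.

```shell
#!/bin/sh
# Hypothetical guard: only pick the alternate engine when `trans` is installed.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

choose_engine() {
    if have_cmd trans; then
        echo alt
    else
        echo "trans not found; install translate-shell to use the alternate engine" >&2
        echo default
    fi
}
```

Failing early with a clear message is friendlier than letting the translation step break halfway through a file.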

Querneus wrote: By the way, options (also called flags) marked with hyphens (like -a or --alternative) are not normally called "parameters" if you're using them the conventional way (the getopt way). Parameters are bare strings, whether used with an option or directly on the program, as in the three parameters 'pretty trees.mp4', 0:00:40 and reducedvideo.mp4 in:

ffmpeg -i 'pretty trees.mp4' -ss 0:00:40 reducedvideo.mp4

Thank you. I'm still learning, as you can see. :oops:

Querneus wrote: I think it would be good for you to edit the first post and insert the latest version in it, ideally at the top with a notice in bold.

Good idea. Done. :)

Postby bedtime » Wed Apr 29, 2020 11:47 am

Axon wrote: I've been working on a similar tool, using the googletrans Python library, and found it absolutely abysmal for Chinese - significantly worse than pasting the same text into Google's web translation service, and little better than simple dictionary lookups for each word. I wonder if this is doing anything differently or if it will run into the same problems that googletrans does.

I've just added support for detecting and translating simplified and traditional Chinese. It appears to work fine as everything seems to line up well, but I don't know Chinese, so I cannot be sure. Below is a sample translation.

Also, could you give a link to your program? I've little knowledge in python but am still curious, and maybe someone else might know python here and be able to help.

(You will notice text that says Errors. While they are initially errors, 99.99% of them are fixed by the program and have no effect on the final translation.)