Re: substudy: A tool for making bilingual subtitles (MacOS X or Linux, command-line)
Posted: Sun Feb 12, 2017 3:33 pm
One more tiny piece is working! Using the sample sub/idx files so generously provided by Stefan:
Stefan wrote: I believe I've got .sub/.idx for 10 movies if it would be of interest?
…I was able to write a working sub/idx image decoder library in Rust. Sample output:
This was harder than I expected, because the 'sub' files are technically MPEG-2 Program Streams containing Packetized Elementary Stream (PES) packets. Those packets carry split-up pieces of subtitle packets, which in turn contain run-length-encoded image data and a series of control sequences. Decoding all of this proved rather fiddly.
If you're a Rust programmer, you can find documentation and example code at docs.rs.
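To give a flavor of the run-length-encoded image data mentioned above, here is a minimal sketch of decoding a single scan line. It assumes the commonly described DVD subpicture scheme: each pixel is a 2-bit palette index, and runs are stored as 4-, 8-, 12- or 16-bit codes built from nibbles, with a count of zero meaning "fill to the end of the line". The names `NibbleReader` and `decode_line` are made up for this example and are not the library's API.

```rust
/// Reads 4-bit nibbles from a byte slice, high nibble first.
/// (Hypothetical helper for this sketch.)
struct NibbleReader<'a> {
    data: &'a [u8],
    pos: usize, // position measured in nibbles
}

impl<'a> NibbleReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        NibbleReader { data, pos: 0 }
    }

    /// Returns the next nibble, or `None` if the data has run out.
    fn read(&mut self) -> Option<u16> {
        let byte = *self.data.get(self.pos / 2)?;
        let nibble = if self.pos % 2 == 0 { byte >> 4 } else { byte & 0x0f };
        self.pos += 1;
        Some(u16::from(nibble))
    }

    /// Skips a trailing half-byte so the next line starts on a byte boundary.
    fn align_to_byte(&mut self) {
        if self.pos % 2 == 1 {
            self.pos += 1;
        }
    }
}

/// Decodes one scan line of `width` pixels (assumed encoding, see above).
/// Each output value is a palette index in 0..=3.
fn decode_line(r: &mut NibbleReader<'_>, width: usize) -> Option<Vec<u8>> {
    let mut line = Vec::with_capacity(width);
    while line.len() < width {
        // A code grows one nibble at a time until its value is large
        // enough to be a complete 4-, 8-, 12- or 16-bit code.
        let mut code = r.read()?;
        if code < 0x4 {
            code = (code << 4) | r.read()?;
            if code < 0x10 {
                code = (code << 4) | r.read()?;
                if code < 0x40 {
                    code = (code << 4) | r.read()?;
                }
            }
        }
        let color = (code & 0x3) as u8; // low 2 bits: palette index
        let count = (code >> 2) as usize; // remaining bits: run length
        let remaining = width - line.len();
        // A run length of zero means "this color until the end of the line".
        let run = if count == 0 { remaining } else { count.min(remaining) };
        line.extend(std::iter::repeat(color).take(run));
    }
    r.align_to_byte();
    Some(line)
}
```

For instance, the bytes `0xD0 0x00 0x20` decode to three pixels of color 1 followed by color 2 up to the end of an 8-pixel line. A real decoder also has to reassemble the packets and handle the control sequences before it ever gets this far.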
This is the first step towards improved subtitle OCR: I can now parse the images into a standard format. The next challenge is breaking the image up into individual letters. This is tricky because:
- The "r" and the "i" are attached by the black shadow, but should be separate letters.
- The stem and the dot of the "i" are joined only by the black shadow, yet they should be treated as a single letter. And indeed, they might not always be joined at all.
- Short spaces indicate letter breaks, long spaces indicate word breaks.
- Subtitles may extend across more than one line.
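One possible starting point for these problems, sketched below, is to look only at the text fill color (ignoring the shadow) and split on empty pixel columns. This is an illustration under stated assumptions, not the approach the tool actually uses: it assumes a single line of text already converted to a grid of `bool` "ink" pixels, and the `Segment` type, the `segment_line` function and the `word_gap` threshold are all hypothetical.

```rust
/// One piece of a segmented text line. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum Segment {
    /// A letter occupying the pixel columns `start..end`.
    Letter { start: usize, end: usize },
    /// A gap wide enough to count as a space between words.
    WordBreak,
}

/// Splits a single line of text into letters and word breaks.
///
/// `ink[y][x]` is true where the pixel has the text fill color. Any run of
/// empty columns ends a letter; a run of at least `word_gap` empty columns
/// between two letters is also reported as a word break.
fn segment_line(ink: &[Vec<bool>], word_gap: usize) -> Vec<Segment> {
    let width = ink.iter().map(|row| row.len()).max().unwrap_or(0);

    // A column is occupied if any row has ink in it. Working per column
    // keeps the stem and the dot of an "i" together, since they share
    // the same columns even when they do not touch.
    let occupied: Vec<bool> = (0..width)
        .map(|x| ink.iter().any(|row| row.get(x).copied().unwrap_or(false)))
        .collect();

    let mut segments = Vec::new();
    let mut letter_start: Option<usize> = None;
    let mut gap = 0;
    for (x, &on) in occupied.iter().enumerate() {
        if on {
            if letter_start.is_none() {
                // A new letter begins; a long gap before it is a word break.
                if gap >= word_gap && !segments.is_empty() {
                    segments.push(Segment::WordBreak);
                }
                letter_start = Some(x);
            }
            gap = 0;
        } else {
            // The first empty column after a letter closes that letter.
            if let Some(start) = letter_start.take() {
                segments.push(Segment::Letter { start, end: x });
            }
            gap += 1;
        }
    }
    if let Some(start) = letter_start {
        segments.push(Segment::Letter { start, end: width });
    }
    segments
}
```

With a one-row image `#.#...##` and `word_gap` set to 3, this yields two one-column letters, a word break, and a two-column letter. It will still fail on letters whose fill pixels touch or overlap horizontally, and a multi-line subtitle would first need to be split into rows, for example by applying the same empty-run idea to pixel rows instead of columns.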