working on a parallel text epub generator

Post by **mcthulhu** » Sat Mar 27, 2021 3:23 am

Following up on a recent discussion here, I've been working on adding an option to Jorkens to convert Global Voices articles to parallel text epubs. Jorkens prompts for the URL of a Global Voices article, downloads it, looks for a version in your native language, downloads that, and then generates an epub book with the two languages merged in alternating paragraphs. I'd originally planned on a two-column view, but I now have the translation interlaced with the original. The translation is hidden by default (because I want to avoid looking at the reference version until I need to); when you hover over the original paragraph, the translation becomes visible just below it, in blue. I'm quite pleased with how nice this looks so far (at least to me), though I still have some tweaks to make before I'm ready to upload this to GitHub.
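For anyone curious how the hover-to-reveal interlacing could be done, here is a minimal sketch in Python. It is not the Jorkens code; the class names `orig` and `trans`, the `interlace` function, and the CSS rules are all assumptions made for illustration. It relies on the CSS adjacent sibling selector, so the translation paragraph must come immediately after its original.

```python
from html import escape

# Hypothetical stylesheet: the translation stays hidden until the reader
# hovers over the original paragraph that immediately precedes it.
HOVER_CSS = """\
p.trans { display: none; color: blue; }
p.orig:hover + p.trans { display: block; }
"""

def interlace(originals, translations):
    """Return body markup with each original paragraph followed by its translation."""
    if len(originals) != len(translations):
        # This sketch assumes the two versions are already aligned one-to-one.
        raise ValueError("paragraph counts differ; align the texts first")
    parts = []
    for src, tgt in zip(originals, translations):
        parts.append(f'<p class="orig">{escape(src)}</p>')
        parts.append(f'<p class="trans">{escape(tgt)}</p>')
    return "\n".join(parts)
```

The markup from `interlace` would go in a chapter's body and `HOVER_CSS` in the stylesheet the chapter links to. Whether a given epub reader honors `:hover` is something to test per reader.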

So far I've tested the generated epubs in another desktop epub reader, but haven't tried them on mobile yet. I'm not sure what the mobile equivalent of a hover should be (tap? swipe?), and I'm also unsure about conflicts with mobile epub readers' default behavior for those actions. I do want the epubs to be usable in other readers, anyway.

One possible improvement to this tool later on might be to download more data, since one article isn't a very long book. Maybe get the last 100 news articles in a given language for which the desired equivalents exist, and treat each article as a chapter in a book of parallel texts? It might take a while to build the book but that might be easier than making an epub for each article.
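A rough sketch of the article-per-chapter idea, in Python. The `Article` record and `build_chapters` function are hypothetical names for illustration, and downloading the articles is left out entirely. The only point is the filtering: keep articles that have a counterpart in the other language, and stop at the limit.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Article:  # hypothetical record, not a Jorkens type
    title: str
    original: List[str]               # paragraphs in the language being studied
    translation: Optional[List[str]]  # None when no equivalent version exists

def build_chapters(articles, limit=100):
    """Keep the first `limit` articles that have a translation, one chapter each."""
    chapters = []
    for art in articles:
        if art.translation is None:
            continue  # no equivalent available, so skip this article
        # zip() stops at the shorter list, so unmatched trailing paragraphs are dropped
        chapters.append((art.title, list(zip(art.original, art.translation))))
        if len(chapters) == limit:
            break
    return chapters
```

Each chapter comes back as a title plus a list of (original, translation) paragraph pairs, ready to be rendered however the single-article version does it.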

Another improvement would be to split paragraphs into sentences, try matching those between versions, and put pairs of span elements within a given paragraph. I might save this for a later version.
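As a very naive illustration of the sentence-pair idea (in Python, with made-up names `split_sentences` and `pair_spans`): split on sentence-ending punctuation and pair sentences by position, but only when both paragraphs have the same number of sentences. A real aligner would need to handle abbreviations, merged or split sentences, and languages with different punctuation, so treat this as a sketch under those assumptions.

```python
import re
from html import escape

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(paragraph):
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]

def pair_spans(orig_par, trans_par):
    """Return alternating orig/trans spans, or None if the sentence counts differ."""
    src = split_sentences(orig_par)
    tgt = split_sentences(trans_par)
    if len(src) != len(tgt):
        return None  # caller falls back to paragraph-level pairing
    pairs = (
        f'<span class="orig">{escape(s)}</span> <span class="trans">{escape(t)}</span>'
        for s, t in zip(src, tgt)
    )
    return " ".join(pairs)
```

Returning `None` on a mismatch keeps the paragraph-level pairing as a fallback, so a bad sentence split never produces a misaligned pair.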

And of course the plan is to generalize this eventually to produce parallel text epubs easily for other sources of parallel data as well, ideally any two versions of the same book in different languages.

Post by **Doitsujin** » Sat Mar 27, 2021 1:10 pm

You might want to consider adding an import filter for the Aglona PBO file format.
The format isn't documented; however, since .pbo files are XML files, you should be able to reverse-engineer it from this source code file.
BTW, if you're still looking for a suitable target format, maybe you could base it on the Doppeltext epub3 format.

Post by **mcthulhu** » Sun Mar 28, 2021 4:00 am

Doitsujin, thanks for the suggestions. I'd looked at Aglona Reader in the past, and I looked at it a little more today. It's an interesting tool, and there are some things I liked about the reader itself, like the multiple views, and some things I didn't like, like the use of a format that no other reader can handle. It seems like the preparation of parallel books in that non-standard format is a very labor-intensive process, about the same level of effort as NOVA Text Aligner, which is also a GUI for an entirely manual, sentence-by-sentence approach. That might be why there seem to be so few books available in that format, all old public-domain works, and in very few languages (three, I think). Am I mistaken?

I think the author might have been hoping for an army of volunteers to contribute books in this format, but that does not appear to have happened. I agree with him that hand-aligned parallel texts are of higher quality, but personally, I don't have the time, and a semi-automated aligner is time-consuming enough for me. Anyway, it would probably be possible to import .PBO files and convert them into epubs, but given the small number (as far as I can tell) of .PBO books available to convert, I'm not sure this would be worth the investment of time, at least right now. Maybe later. A higher priority for me might be importing the parallel novels on Farkas's site, which cover 15 languages.

The Doppeltext format, though a bit fancier and using clicks instead, looks very much like my first effort, which used HTML title attributes to display translations when hovering over the original text. It might be possible to keep this as one of several possible views, and let a reader choose among them.
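For comparison, the title-attribute approach can be sketched in a few lines of Python. The function name `with_title_tooltip` is made up for illustration; the one detail worth noting is that the translation has to be escaped for use inside an attribute value, including any double quotes.

```python
from html import escape

def with_title_tooltip(original, translation):
    """Wrap the original paragraph so the translation shows as a hover tooltip."""
    # quote=True also escapes " and ', which keeps the attribute value intact
    tip = escape(translation, quote=True)
    # Element content only needs &, < and > escaped
    body = escape(original, quote=False)
    return f'<p title="{tip}">{body}</p>'
```

How the tooltip is displayed (delay, font, line wrapping) is left to the reader application, which is one reason a styled hidden paragraph gives more control.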

Post by **mcthulhu** » Thu Apr 01, 2021 11:02 pm

This is now working, more or less, and the source code has been updated on GitHub.

Post by **EGP** » Fri Apr 02, 2021 12:38 am

You might be interested in something like AntPConc too if you like analysis.

Post by **mcthulhu** » Fri Apr 02, 2021 1:36 am

Thanks, EGP. I'm aware of AntPConc, though I've used AntConc more. I have both of them and they are very nice tools.

Post by **kundalini** » Fri Apr 02, 2021 11:45 am

I was able to get the new version up and running on my Mac. I could advance to the next page but not go backwards, though this isn't such a big deal. Looks great!

[Attachment: jorkens.png]
