Help required: FSI digitisation project

**Cainntear** » Tue Apr 07, 2020 9:00 pm

OK, so I mentioned a few weeks ago a little project I had in mind, and at first I thought it would take me a matter of days to get something online, but the stress of the last few weeks robbed me of focus, and I had a lot to deal with at home, so things stalled.

Anyhow, my plan is to do a staged digitisation of the FSI materials.

Phase 1 is just to finish the work that some others have started in splitting the materials into chapters/units so that it's easier to manage -- less scrolling through long PDFs and all that.

This will also make it easier to start splitting the work further, so that we can extract parts of it (e.g. dialogues, drills and vocab lists) that can then be used in other tools (Anki, LingQ etc).

I've put up a very scratchy web app.

On the opening screen you get to choose between two options. The second option, Preview phase 1 output, gives you a simple prototype of what I think we can get done collectively with very little effort indeed. You get drop-down lists to choose which course, which lesson within that course, and which audio file you want to use. At the moment, there's only one book in there -- the German Basic course, which I worked through while prototyping the systems.

The first option is where you guys come in: Contents capture (stage1)

All I'm asking people to do for me just now is identify a few simple bits of information from each PDF file:

1) Does it have a contents page?
If so:
2) what page number* does the contents page start on?
3) what page number* does it end on?

4) Does the main text have "continuous" numbers 1,2,3,4 etc? It's OK if there are additional pages i, ii, iii etc before page 1. I have seen a couple of PDFs which are a collection of printed units, where each unit has its own numbering: Unit 1: 1-1,1-2,1-3...; Unit 2: 2-1,2-2,2-3,...; etc -- this is not continuous numbering.
If so:
5) What is the number* of the page that is printed with the number 1? If there is no number 1, then calculate what it would be (for example, I came across one PDF with a single cover page followed by page 3, so page 1 was "0", even though there is no page zero).

* When I talk about the page numbers here, I mean the page number that your PDF reader gives it, not anything printed on the page itself.
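
In case the arithmetic for question 5 isn't obvious, here's a minimal sketch of it in Python. The function names are made up for illustration and aren't part of the web app; the only assumption is that the printed numbering is continuous, so any page where you can see both numbers will do.

```python
def pdf_page_of_printed_one(pdf_page: int, printed_number: int) -> int:
    """Given any page where both numbers are known, return the PDF-reader
    page number on which printed page 1 falls (or would fall)."""
    return pdf_page - printed_number + 1


def printed_to_pdf(printed_number: int, page_one_pdf: int) -> int:
    """Convert a printed page number to the PDF-reader page number."""
    return page_one_pdf + printed_number - 1


# The example from the post: one cover page (PDF page 1),
# then the next page (PDF page 2) is printed as "3".
page_one = pdf_page_of_printed_one(pdf_page=2, printed_number=3)
print(page_one)                     # 0 -- there is no real page zero
print(printed_to_pdf(3, page_one))  # 2
```

So the answer to type in for that file is 0, even though no such page exists in the PDF.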

So it's quite simple. Open up the page, click Contents capture (stage1), then click the button Load new PDF. Once the file is open, fill in the details by clicking the checkboxes and typing in the numbers. There is a box marked "comments", and if there's anything you think I should know or any reason why the particular file can't be processed, type it in there and I'll deal with it. Click Submit and wait for an alert box to pop up saying "data saved".

If there's an error, or "data saved" never appears, I'd appreciate it if you copy all the text from the screen above the PDF file and paste it into a private message for me to deal with.

When all of the files have been processed....ermmmm... nothing will happen... I think.
As in you'll click on "Load new PDF" and literally nothing will happen. I apologise for this -- I know my process is pretty scrappy, but it's deliberately quick and dirty because there's only a couple of hundred files and if I tried to get it perfect I'd end up working through them all just testing and debugging!

**Cainntear** » Tue Apr 07, 2020 9:13 pm

What comes next...?
Once I've got this information, the next step will be for me to extract the contents pages from all of the PDF files that have them.

I'm currently improving the function of another part of the web app that makes it relatively quick and straightforward to transcribe the contents pages into a semi-structured format. I'll ask people again to help with the process of transcribing the contents data, and I will use that data to split the files into units/chapters. These will then be uploaded in an improved version of my phase 1 output demo.
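
To give a rough idea of how the transcribed contents data could drive the split, here's a sketch in Python. It is not the actual code behind the web app: the function name, the shape of the contents list and the sample numbers are all hypothetical, and it assumes continuous numbering with every unit starting on a fresh page.

```python
from typing import List, Tuple


def unit_page_ranges(
    contents: List[Tuple[str, int]],  # (unit title, printed start page), in order
    page_one_pdf: int,                # PDF-reader page number of printed page 1
    last_pdf_page: int,               # last PDF-reader page of the main text
) -> List[Tuple[str, int, int]]:
    """Return (title, first PDF page, last PDF page) for each unit."""
    ranges = []
    for i, (title, printed_start) in enumerate(contents):
        start = page_one_pdf + printed_start - 1
        if i + 1 < len(contents):
            # A unit ends on the page before the next unit starts.
            next_start = page_one_pdf + contents[i + 1][1] - 1
            end = next_start - 1
        else:
            end = last_pdf_page
        ranges.append((title, start, end))
    return ranges


# Made-up contents page: printed page 1 sits on PDF page 9.
contents = [("Unit 1", 1), ("Unit 2", 15), ("Unit 3", 32)]
for title, first, last in unit_page_ranges(contents, page_one_pdf=9, last_pdf_page=60):
    print(title, first, last)
```

The resulting ranges are what a PDF library would need in order to cut each unit into its own file.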

For relatively little effort, we can get everything in a more structured setup, which will put us in a better place to start into the actual digitisation, so if everyone takes 10 minutes here and there to sort out a bit of the data, that would be very much appreciated.

All data collected and all files adapted/split/cut during the process will continue to be available free, as per the source material; it should be understood that anyone contributing to the project is willing for their work to be available under the same terms as the FSI materials themselves -- this means that other people will be legally permitted to use your work for any purpose, including commercial use.
