Every vocab list + audio from book2/50languages/Goethe-Verlag in Anki deck form

mokibao · Postby **mokibao** » Sun Mar 14, 2021 2:04 am

Book2, also known as 50languages or Goethe-Verlag, is a quite underrated website that provides a bunch of quality free resources. Among them are a comprehensive vocab list and a phrasebook, complete with audio recordings from native speakers. According to their website it should be enough to get you to A2 level.

One of the draws of the site is that it offers the same content in 50+ languages, which means you can practice any language from any other language. Too often people have to resort to courses in English or another more 'popular' language, and while it may be a way to refresh one of those, it does impede your L3 language acquisition.

The only problem is that you have to go through their website or app, which showers you with ads, prompts you to buy the pro version, etc. So I wrote a little script that downloads the wordlist (1904 words) and audio and wraps it into an Anki deck. You give it your source and target languages on the command line (say, 'en', 'fr', 'de', 'es', etc.) and it spits out an .apkg file that you can import and study however you want.

Theoretically I could do it for all 50*49*2 = 4900 combinations but it's probably best not to overload them with so many requests, lol. So I suggest you use the script for your own needs and then publish the resulting deck so people don't have to do it again. (You will need Python 3 with the requests, beautifulsoup and genanki libraries installed - huge shoutout to genanki's author for letting me make a deck from scratch without reading a word of the Anki manual).
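On the point about not overloading the site: a pause between requests is one way to stay polite. This is a minimal sketch, not part of the scripts in this post. `polite_fetch` and its parameters are hypothetical names, and the one-second default is an arbitrary assumption rather than a limit published by the site.

```python
import time
from typing import Callable, Iterable, List


def polite_fetch(urls: Iterable[str],
                 fetch: Callable[[str], bytes],
                 delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> List[bytes]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause before the very first request
        results.append(fetch(url))
    return results
```

With requests, `fetch` could be something like `lambda u: requests.get(u, timeout=30).content`. Passing `fetch` and `sleep` in as arguments keeps the helper easy to try out without touching the network.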

You will find attached the script with a couple of example decks (German for French speakers, Georgian for Lithuanian speakers, Portuguese for Japanese speakers). The code is very dirty but it does the job, the decks seem to work, no issue with special characters, and I'm of course open to improvements and feedback.

Link to script and decks: https://files.catbox.moe/2t61vj.zip (updated, somewhat more permanent link)

Code: Select all


#!/usr/bin/env python3

import glob
import os.path
import shutil
import sys

import genanki
import requests
from bs4 import BeautifulSoup

origin_language = sys.argv[1].upper()
target_language = sys.argv[2].upper()


url = "https://www.goethe-verlag.com/book2/_VOCAB"

target_url = f"{url}/{origin_language}/{origin_language}{target_language}/"


def pad_number(n):
    if n < 10:
        return "0" + str(n)
    else:
        return str(n)


my_model = genanki.Model(
    1091735104,
    "Simple Model with Media",
    fields=[
        {"name": "Question"},
        {"name": "Answer"},
        {"name": "MyMedia"},  # holds the [sound:...] tag for the audio file
    ],
    templates=[
        {
            "name": "Card 1",
            "qfmt": "{{Question}}",  # front of the card: source-language word
            "afmt": '{{FrontSide}}<hr id="answer">{{Answer}}<br>{{MyMedia}}',
        },
    ],
    css=""".card {
 font-family: arial;
 font-size: 20px;
 text-align: center;
 color: black;
 background-color: white;
}

.card1 { background-color: #FFFFFF; }
.card2 { background-color: #FFFFFF; }"""
)

my_deck = genanki.Deck(
    2059400111, f"Book2 {origin_language}-{target_language} (words)"
)

MAX_LESSONS = 42

for i in range(1, MAX_LESSONS + 1):
    r = requests.get(f"{target_url}{pad_number(i)}.HTM")  # target_url already ends with a slash
    soup = BeautifulSoup(r.content, "html.parser")
    words = str(soup.select("meta")[-1]).split('"')[1].split("| ")
    mp3s = [target_url + str(u).split('"')[1] for u in soup.select("source")]

    for w, m in zip(words, mp3s):

        filename = f"word_{origin_language}{target_language}_" + m.split("/")[-1]

        target_w = " - ".join(w.split(" - ")[:-1]) # necessary because some words have several translations in TL
        source_w = w.split(" - ")[-1]
        if not os.path.isfile(filename):
            dl_file = requests.get(m, stream=True)
            print(m)
            with open(filename, "wb") as out_file:
                shutil.copyfileobj(dl_file.raw, out_file)

        my_note = genanki.Note(
            model=my_model, fields=[source_w, target_w, f"[sound:{filename}]"]
        )

        my_deck.add_note(my_note)

my_package = genanki.Package(my_deck)
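# The pattern must use the same prefix as the filenames saved above, or no audio gets packaged.
my_package.media_files = glob.glob(f"word_{origin_language}{target_language}_*.mp3")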
my_package.write_to_file(f"book2_{origin_language}{target_language}_words.apkg")

Edit 1: improved the deck's appearance with some css

Edit 2: someone requested a way to also download the individual sentences. Because the logic is a bit different here is an additional script:

Code: Select all

#!/usr/bin/env python3

import glob
import os.path
import shutil
import sys

import genanki
import requests
from bs4 import BeautifulSoup

origin_language = sys.argv[1].upper()
target_language = sys.argv[2].upper()

url = "https://www.goethe-verlag.com/book2"

target_url = f"{url}/{origin_language}/{origin_language}{target_language}/{origin_language}{target_language}"

def pad_number(n):
    if n < 10:
        return "00" + str(n)
    elif n < 100:
        return "0" + str(n)
    else:
        return str(n)

my_model = genanki.Model(
    1091735104,
    "Simple Model with Media",
    fields=[
        {"name": "Question"},
        {"name": "Answer"},
        {"name": "MyMedia"},  # holds the [sound:...] tag for the audio file
    ],
    templates=[
        {
            "name": "Card 1",
            "qfmt": "{{Question}}",  # front of the card: source-language sentence
            "afmt": '{{FrontSide}}<hr id="answer">{{Answer}}<br>{{MyMedia}}',
        },
    ],
    css=""".card {
 font-family: arial;
 font-size: 20px;
 text-align: center;
 color: black;
 background-color: white;
}

.card1 { background-color: #FFFFFF; }
.card2 { background-color: #FFFFFF; }"""
)

my_deck = genanki.Deck(
    2059400110, f"Book2 {origin_language}-{target_language} (sentences)"
)

MIN_LESSON = 3 # 2 is the index page
MAX_LESSON = 102 # 103 is the youtube video

for i in range(MIN_LESSON, MAX_LESSON + 1):
    r = requests.get(f"{target_url}{pad_number(i)}.HTM") # no slash unlike vocab scraping
    soup = BeautifulSoup(r.content, "html.parser")

    # header
    header_l1_sentences = [t.text for t in soup.find_all("span", {"class": "Stil36"})]
    header_l2_sentences = [t.text for t in soup.find_all("span", {"class": "Stil46"})]
    l2_audio = [t.find_all("source")[0]["src"] for t in soup.find_all("audio")]

    body_l1_sentences = [t.text.strip() for t in soup.find_all("div", {"class": "Stil35"})][:18] # last element is some text about Alzheimer
    body_l2_sentences = [t.text.strip().split('\r\n\n')[1] for t in soup.find_all("div", {"class": "Stil45"})]

    l1_sentences = header_l1_sentences + body_l1_sentences
    l2_sentences = header_l2_sentences + body_l2_sentences

    for l1_s, l2_s, m in zip(l1_sentences, l2_sentences, l2_audio):

        filename = f"sentence_{origin_language}{target_language}_" + m.split("/")[-1]

        if not os.path.isfile(filename):
            dl_file = requests.get(m, stream=True)
            print(m)
            with open(filename, "wb") as out_file:
                shutil.copyfileobj(dl_file.raw, out_file)

        my_note = genanki.Note(
            model=my_model, fields=[l1_s, l2_s, f"[sound:{filename}]"]
        )

        my_deck.add_note(my_note)

my_package = genanki.Package(my_deck)
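# The pattern must use the same prefix as the filenames saved above, or no audio gets packaged.
my_package.media_files = glob.glob(f"sentence_{origin_language}{target_language}_*.mp3")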
my_package.write_to_file(f"book2_{origin_language}{target_language}_sentences.apkg")

It works very much the same way: save this into a file called booksentences2anki.py and supply your source and target languages like this (e.g. learning Modern Greek from Brazilian Portuguese):

Code: Select all

./booksentences2anki.py px el
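Both scripts read the two language codes straight from `sys.argv`, so running them without arguments ends in an IndexError. A small check with a usage message may make that friendlier. This is only a sketch; `parse_languages` is a hypothetical helper and not something the scripts contain.

```python
import sys
from typing import List, Tuple


def parse_languages(argv: List[str]) -> Tuple[str, str]:
    """Return (origin, target) in upper case, or exit with a usage hint."""
    if len(argv) != 3:
        name = argv[0] if argv else "script.py"
        raise SystemExit(f"usage: {name} SOURCE TARGET   (for example: px el)")
    origin, target = argv[1].upper(), argv[2].upper()
    if origin == target:
        raise SystemExit("source and target languages must be different")
    return origin, target


if __name__ == "__main__" and len(sys.argv) > 1:
    print(parse_languages(sys.argv))
```

On Windows, where the `./` form does not apply, `python booksentences2anki.py px el` should do the same thing.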

Edit 3: I changed the way it names the audio files and decks so you can run it consecutively for multiple languages and it doesn't trigger bugs in Anki due to having the same filenames etc.
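Related to this: each script hard-codes one deck ID, so decks for different language pairs share it. genanki's documentation asks for a unique ID per deck, so deriving one from the language pair may be worthwhile. A minimal sketch follows; `stable_id` and `audio_prefix` are hypothetical helpers, and whether Anki actually misbehaves when two imported decks share an ID is an untested assumption here.

```python
import hashlib


def stable_id(kind: str, origin: str, target: str) -> int:
    """Derive a repeatable ID in [2**30, 2**31) from deck kind and language pair."""
    key = f"book2:{kind}:{origin.upper()}:{target.upper()}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return (1 << 30) + int.from_bytes(digest[:4], "big") % (1 << 30)


def audio_prefix(kind: str, origin: str, target: str) -> str:
    """Filename prefix in the style the scripts use, e.g. word_PXDE_."""
    return f"{kind}_{origin.upper()}{target.upper()}_"
```

The result could replace the literal in `genanki.Deck(stable_id("words", origin_language, target_language), ...)`. Because the ID comes from a hash and not a random number, re-running the script for the same pair gives the same ID.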

Edit 4: fixed a bug where multiple translations in the target language for a single word in the source language would make the script fail
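To illustrate the fix in Edit 4: the word script treats everything before the last " - " as the target-language side and the final piece as the source word. The sketch below does the same split with `str.rpartition`; `split_entry` is a hypothetical name and the sample entries are made up, since the real ones come from the page's meta tag.

```python
from typing import Tuple


def split_entry(entry: str) -> Tuple[str, str]:
    """Split one "translation(s) - word" entry at its last " - " separator."""
    head, _, tail = entry.rpartition(" - ")
    return head, tail


# Made-up entries for illustration only.
print(split_entry("der Hund - dog"))              # ('der Hund', 'dog')
print(split_entry("das Auto - der Wagen - car"))  # ('das Auto - der Wagen', 'car')
```

Splitting only at the last separator is what keeps several target-language translations together in one field.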

Postby **jeff_lindqvist** » Sun Mar 14, 2021 7:37 am

Now, THIS is helpful! I saw a topic on reddit where someone had posted more than 80 decks (from another source, I guess). I thought it could be interesting to grab the content from the 50languages website, so I spent some time last night searching for clever ways to do that. I was going to post a question here, but now I don't have to.

Thanks again. It'll take some time to fully grasp what your script is doing, and how to use that (I think I've lost most of my rudimentary Python skills).

BriWe · Postby **BriWe** » Wed Apr 07, 2021 5:44 pm

mokibao wrote: So I suggest you use the script for your own needs and then publish the resulting deck so people don't have to do it again.

This is amazing!!! Great idea and super nice that you're sharing the code. I'm dying to try, just one tiny problem: I'm too ignorant in Python. But I'm doing my best here... I already spent quite a while just learning how to install the libraries.

Could you be even more kind and let me know where I insert the source and target languages? (in my case PX/DE) :lol:

The rest I think I can figure out myself.

jeff_lindqvist wrote: I saw a topic on Reddit where someone had posted more than 80 decks (from another source, I guess). I thought it could be interesting to grab the content from the 50languages website, so I spent some time last night searching for clever ways to do that. I was going to post a question here, but now I don't have to.

I really liked the xefjords project, but 50languages has really good content and I also ended up here searching for better ways to use their material.

mokibao · Postby **mokibao** » Wed Apr 07, 2021 7:48 pm

Here you go. I ran:

Code: Select all

./book2anki.py px de

(If it spouts an error (too many requests or something), just try again: it won't start over and will check which files are already present. There's probably a more graceful way to handle it, but I sort of threw it together hastily and it does the job.)
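One possibly more graceful way to handle this is to retry a failed download a few times with a growing pause, while keeping the "skip files that already exist" check. This is a sketch under assumptions: `download_with_retry` and its parameters are hypothetical, and it relies on the fact that requests' exceptions derive from `OSError`.

```python
import os
import time
from typing import Callable


def download_with_retry(url: str, filename: str,
                        fetch: Callable[[str], bytes],
                        attempts: int = 3,
                        base_delay: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if the file was downloaded, False if it already existed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if os.path.isfile(filename):
        return False  # same resume check the scripts use
    for attempt in range(attempts):
        try:
            data = fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # wait 2s, 4s, 8s, ...
        else:
            tmp = filename + ".part"
            with open(tmp, "wb") as out_file:
                out_file.write(data)
            os.replace(tmp, filename)  # only complete files get the final name
            return True
    return False
```

For `fetch` to notice HTTP errors such as "too many requests", it would need to call `raise_for_status()` on the response before returning `response.content`. Writing to a `.part` file first means an interrupted download is not mistaken for a finished one on the next run.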

And here is the link to the deck: https://we.tl/t-qvP6MeNP14

Hash · Postby **Hash** » Fri Apr 09, 2021 9:54 am

Thank you mokibao for this great script!

mokibao · Postby **mokibao** » Fri Apr 09, 2021 5:59 pm

I updated the OP to add a script that scrapes the sentences with audio (2000 total) and puts that into an Anki deck. I also added some CSS to make the deck display look nicer.

BriWe · Postby **BriWe** » Fri Apr 09, 2021 6:47 pm

mokibao wrote:Here you go.
(...)

And here is the link to the deck: https://we.tl/t-qvP6MeNP14

Thanks!! I checked the deck and everything looks fine. I had downloaded the mp3 zip yesterday, now I can work through it better.

Also, after your instructions I managed to get the code running, so you helped me learn two new things. I appreciate it :mrgreen:

bem · Postby **bem** » Sun May 02, 2021 5:52 am

mokibao wrote:I updated the OP to add a script that scrapes the sentences with audio (2000 total) and puts that into an Anki deck. I also added some CSS to make the deck display look nicer.

Nice work. I made a recognition card version of the script, and made some minor changes like the way numbers are displayed (not displaying the number in the l1/l2 cards) and eliminating duplicates in the sentences. I had hoped the duplicate sentences meant that there was some relationship between the vocab and the sentences so I could add the word and its definition on the back side, but there doesn't appear to be one. I also fixed the audio, since there was a globbing error in the original resulting in the audio never making it to the apkg.

https://gist.github.com/bemitc/72c1e527 ... b3d7199fc0

Unfortunately, the change to the number cards probably means the output of the script can't be freely distributed due to the license (cc by-nc-nd).
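The two technical points here, removing duplicate sentences and getting the glob pattern to match the saved audio, can be sketched as follows. This is not the code from the gist; `unique_pairs` and `media_pattern` are hypothetical helpers, and the filename is a made-up example in the naming style the scripts use.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple


def unique_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop repeated (l1, l2) sentence pairs while keeping first-seen order."""
    return list(dict.fromkeys(pairs))


def media_pattern(kind: str, origin: str, target: str) -> str:
    """Glob pattern for files saved as <kind>_<ORIGIN><TARGET>_<name>.mp3."""
    return f"{kind}_{origin}{target}_*.mp3"


# Hypothetical filename for illustration.
saved = "sentence_PXEL_0003.mp3"
print(fnmatchcase(saved, media_pattern("sentence", "PX", "EL")))  # True
print(fnmatchcase(saved, "sentence_EL_*.mp3"))                    # False
```

A pattern built from the target language alone never matches a filename that starts with both language codes, so no audio would be collected for the package.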

sp111 · Postby **sp111** » Sun Jun 13, 2021 1:10 am

bem wrote:Nice work. I made a recognition card version of the script
(...)

It's giving me this error:

Code: Select all

  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-3: character maps to <undefined>
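An error like this usually means Python on Windows is writing non-Latin text (to the console or to a file) with the cp1252 code page. The traceback is cut off, so the line that triggered it is unknown; the two workarounds below are guesses at a fix, not a confirmed one, and `force_utf8_stdout` and `write_text_utf8` are hypothetical helper names.

```python
import sys


def force_utf8_stdout() -> None:
    """Make print() emit UTF-8 instead of the console's default code page."""
    if hasattr(sys.stdout, "reconfigure"):  # available on Python 3.7+
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")


def write_text_utf8(path: str, text: str) -> None:
    """Write text with an explicit encoding rather than the locale default."""
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
```

Calling `force_utf8_stdout()` near the top of the script covers `print` calls, while passing `encoding="utf-8"` to `open` covers text files. Setting the environment variable `PYTHONUTF8=1` before running the script is another option on Python 3.7 and later.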

Canafro · Postby **Canafro** » Sat Jul 24, 2021 3:14 am

mokibao wrote:Book2, also known as 50languages or Goethe-Verlag, is a quite underrated website that provides a bunch of quality free resources.
(...)

Hello, I tried to run the script but couldn't. I have to admit that I don't have any knowledge of Python. Can you share a screenshot showing where the source and target languages are added to the script? I tried Spanish to English and it could not download.

Thank you in advance
