A Practical Yet Sloppy Frequency Dictionary?

zenmonkey · Postby **zenmonkey** » Sat May 07, 2022 5:51 pm

The discussion on the Routledge Italian Frequency Dictionary led me to see that there is also a Persian Frequency Dictionary, and I looked around on the Internet to see if I could find something useful for me. What I mean by useful is a tool, list, or database that helps me learn words and phrases based on frequency, so that I can focus, early in my learning path, on high-frequency words and at the same time have short, simple sentences with audio to work with.

And down the rabbit hole, here we go. First off, there is an interesting presentation on the work done to create the Routledge Persian Frequency Dictionary here:
https://www.fbcinc.com/e/LEARN/e/Middle ... janian.pdf

They describe their premise and scope and roughly outline the steps they carried out in normalizing, tokenizing, and lexical/morphological analysis. They also describe some of the limitations of their work and their hopes for the future.

I found this interesting but saw that the example sentences they produced are unfortunately not as simple as they could be. For example, the example sentence for the first word 'and' (و) is "This boy lived alone with his father and a special relationship existed between the two." Yikes! That's not very useful for me - I'd prefer something like the Assimil sentence "The boy and the father came".

So I thought perhaps the Tatoeba sentence database could provide a good basis for simple sentences. I downloaded the English-Persian sentence pairs, imported them into Excel, and just sorted them by length and word count. I'm assuming that sentences that are shorter in both English and Persian will tend to be simpler.
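A minimal sketch of that sorting step in Python rather than Excel. The function names are made up for illustration, and the two sentence pairs are the ones quoted further down in this post; a real run would load the full Tatoeba export instead.

```python
def simplicity_key(pair):
    """Sort key: fewer words first, then fewer characters, counted over both languages."""
    english, persian = pair
    word_count = len(english.split()) + len(persian.split())
    char_count = len(english) + len(persian)
    return (word_count, char_count)


def sort_by_simplicity(pairs):
    """Return (english, persian) pairs ordered from shortest to longest."""
    return sorted(pairs, key=simplicity_key)


pairs = [
    ("Laugh and be fat.", "بخندید و چاق باشید."),
    ("They argued.", "آنها جر و بحث کردند"),
]

for english, persian in sort_by_simplicity(pairs):
    print(english, "|", persian)
```

Short is only a proxy for simple, so a sort like this narrows the candidates rather than guaranteeing good ones.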

But I went off on a tangent, thinking about actually building my own frequency dictionary from these sentences. Or at least setting up a generic process for that.

There are a few software libraries out there to manage lemmas, etc., and I've played around a bit with one in the past. But I did find an excellent article on the topic of Persian - https://aclanthology.org/L18-1179.pdf - which describes a language toolkit, and the authors have made their toolkit public on GitHub: https://github.com/ICTRC/Parsivar

So I thought that perhaps the thing to do is normalize and tokenize my own corpus, build a database, etc... but that's really more than just a few hours of work. And I got lazy.
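For a sense of what the bare-bones version of that pipeline involves, here is a rough standard-library sketch. It is not Parsivar and does no lemmatization: it only maps Arabic yeh and kaf to their Persian forms, splits on anything that is not a letter, digit, or zero-width non-joiner, and counts the resulting wordforms. The function names are hypothetical.

```python
import re
from collections import Counter

# Arabic yeh/kaf are often typed in place of the Persian letters; unify them first.
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Split on anything that is not a word character or a zero-width non-joiner (U+200C).
# Limitation: combining diacritics are not word characters, so vocalized text would be split.
SEPARATORS = re.compile(r"[^\w\u200c]+")


def tokenize(sentence):
    """Normalize one sentence and return its wordforms."""
    normalized = sentence.translate(ARABIC_TO_PERSIAN)
    return [token for token in SEPARATORS.split(normalized) if token]


def wordform_frequencies(sentences):
    """Count how often each (unlemmatized) wordform occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


corpus = ["آنها جر و بحث کردند", "بخندید و چاق باشید."]
for wordform, count in wordform_frequencies(corpus).most_common(3):
    print(wordform, count)
```

Counting wordforms is the easy part; grouping inflected forms under one headword is where a real toolkit would be needed.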

Instead, I found that someone has already uploaded a Persian frequency dictionary onto AnkiWeb. Good enough, even if the sentence structure is not adequate for my needs - what I'm now building is much simpler. I'm just going to search for the first instance of the headword in the sorted Tatoeba sentences. I'm then putting this list of sentences back into Anki and using HyperTTS to create the audio.
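The lookup itself can be sketched in a few lines, assuming the pairs are already sorted from simplest to longest and that a headword counts as present only when it appears as a whole whitespace-separated token. The function name and punctuation set are assumptions for illustration.

```python
# Punctuation to strip from the edges of tokens (Latin and Persian marks).
PUNCTUATION = ".,!?:;،؟"


def first_sentence_with(headword, sorted_pairs):
    """Return the first (english, persian) pair whose Persian side contains the headword."""
    for english, persian in sorted_pairs:
        tokens = [token.strip(PUNCTUATION) for token in persian.split()]
        if headword in tokens:
            return english, persian
    return None  # no sentence in the list uses this headword


sorted_pairs = [
    ("They argued.", "آنها جر و بحث کردند"),
    ("Laugh and be fat.", "بخندید و چاق باشید."),
]

print(first_sentence_with("و", sorted_pairs))
```

Matching on whole tokens keeps و from matching inside longer words, but it cannot tell a standalone "and" from one that sits inside a fixed expression.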

So for the 'and' example the Tatoeba example sentence is

آنها جر و بحث کردند

Ha! That's not any better. :shock:

The translation of that into English is "They argued." I have no idea why there is an 'and' in that sentence.

As imperfect as this is, the next choice is ... بخندید و چاق باشید. - Laugh and be fat. That works perfectly.

So. Reimport into Anki, add sound with the TTS plug-in, and (after reformatting the cards) this frequency deck should look like this:

Screenshot 2022-05-07 at 19.49.04.png

It's sloppy, and I'm playing around with this some more. I'm posting to see what the collective hive mind has done in these areas.
I'm open to ideas...

Postby **rdearman** » Sat May 07, 2022 7:01 pm

I found some open-source French and Italian corpora at various universities, which were broken out by frequency. You might be able to find some at a university "in country".

Beli Tsar · Postby **Beli Tsar** » Sat May 07, 2022 7:21 pm

This looks extremely useful. Persian could do with a few good resources like this.

I used the Routledge dictionary, and while it is good in many ways, the example sentences aren't the only problem with it. The main issue was that it was so heavily skewed to news and politics: which is fine for those who like it, but not ideal for general use. I could talk about nuclear war, the evils of America, and sanctions, but not so much about anything human or practical.

Hilbert · Postby **Hilbert** » Sat May 07, 2022 9:45 pm

There are two Persian frequency lists here. The subtitle one has a lot of foreign place names in it and things like that. Perhaps you've already seen them.

https://www.reddit.com/r/farsi/comments/5uajqm/10k_most_common_word_lists_anki_packages/

You are using TTS to do the Persian audio? I've not found an adequate TTS program for Persian. Azure has beautiful natural sounding voices, but it mispronounces so many words and is really bad at figuring out where the ezafehs go.

zenmonkey · Postby **zenmonkey** » Sun May 08, 2022 10:17 am

Hilbert wrote:There are two Persian frequency lists here. The subtitle one has a lot of foreign place names in it and things like that. Perhaps you've already seen them.

https://www.reddit.com/r/farsi/comments/5uajqm/10k_most_common_word_lists_anki_packages/

You are using TTS to do the Persian audio? I've not found an adequate TTS program for Persian. Azure has beautiful natural sounding voices, but it mispronounces so many words and is really bad at figuring out where the ezafehs go.

Thanks for sharing those, will see what I can do with them. The Anki files that they come with are not very useful, just having the Persian word and frequency number. This gives you:

Screenshot 2022-05-08 at 11.59.24.png

Why would someone build an Anki package like that?

To answer your other question, I'm using the Azure AI for pronunciation and I found that while it does mispronounce some words, it's overall the best compromise between ease of use and quality. Searching and downloading Forvo files is just not practical and the quality is rarely much better.

For those words where I know that there is a sound issue, I use conditional formatting to turn off sound and turn on transliteration (if I have it). For ezafehs, if I find that one is missed and needs to be added, I add the marker and regenerate the sound file. But at this point, I'm not systematic about it. The TTS tool is better than my own pronunciation :lol:

In the past, I've used Audacity to chop up Assimil sound files, but that is just too time-consuming. I've also paid voice actors to do a couple of hundred words for me in various languages, but that was for my language apps.

If I had a truly great set of sentences (and they were open-source) I'd think of paying to build that, but right now I have too many projects.

Odair · Postby **Odair** » Sun May 08, 2022 1:21 pm

I have created a database of a corpus where each sentence is ranked according to the frequency of the least frequent wordform in it. It only considers unlemmatized wordforms because I could not find a good lemmatizer that works.
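A rough sketch of that ranking idea, not the actual database code: build a rank table from the corpus itself, then give each sentence the rank of its rarest wordform. Tokenization here is plain whitespace splitting and the function names are made up.

```python
from collections import Counter


def rank_table(sentences):
    """Map each wordform to its frequency rank (1 = most frequent)."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def sentence_rank(sentence, ranks):
    """A sentence is only as easy as its rarest wordform."""
    return max(ranks[token] for token in sentence.split())


def group_by_rarest_word(sentences):
    """Group sentences under the rank of their least frequent wordform."""
    ranks = rank_table(sentences)
    groups = {}
    for sentence in sentences:
        if sentence.split():  # skip empty lines
            groups.setdefault(sentence_rank(sentence, ranks), []).append(sentence)
    return ranks, groups


corpus = ["نه فقط شما و من", "نه تو و نه تو", "نه نه نه نه نه"]
ranks, groups = group_by_rarest_word(corpus)
for rank in sorted(groups):
    print(rank, groups[rank])
```

With this scheme, the only sentences available for the very top ranks are ones built entirely from the most common wordforms, which is why the earliest entries tend to be repetitive.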

Here is a sample from a corpus of novels written in standard Persian (i.e. in most of them the text is written in formal style, and the dialogue in colloquial style), up to the 8000th most frequent wordform:

https://drive.google.com/file/d/1q8XLxf ... sp=sharing

And here is one extracted from movie subtitles:

https://drive.google.com/file/d/1WBbNJo ... sp=sharing

Each set of example sentences begins with two numbers. The first number is the frequency of the word I tried to fetch sentences for, the second number is the number of example sentences in the database that met the criteria.

If you want more, I also have a corpus of novels in colloquial style and a corpus of science articles from which I could generate similar lists. I could also go deeper in the frequency list or show more examples for each word.

In the past I have generated a similar file with translations from a bilingual corpus, but this feature is currently nonfunctional.

zenmonkey · Postby **zenmonkey** » Sun May 08, 2022 2:33 pm

Odair wrote:I have created a database of a corpus where each sentence is ranked according to the frequency of the least frequent wordform in it. It only considers unlemmatized wordforms because I could not find a good lemmatizer that works.

Thanks for sharing, I'm not sure how you are using your output. Do you have a specific field that gives your headword or just sentences from the corpus?

One of the entries is this.

نه نه نه نه نه
تو اون نه نه نه نه
ما نه فقط به من
نه فقط شما و من
نه نه نه نه نه نه
نه نه با اون و نه با تو
باشه باشه نه نه نه باشه
نه نه نه نه نه نه نه نه
نه نه نه نه نه نه نه
و نه فقط اون رو
نه نه نه نه نه نه نه نه نه
نه نه نه نه نه نه نه نه نه نه
نه نه نه نه نه نه نه نه نه نه نه نه نه نه
نه تو و نه تو
باشه باشه نه نه نه نه

How are you using that output? I can guess that the term here is نه but not sure for the other entries.

zenmonkey · Postby **zenmonkey** » Sun May 08, 2022 3:22 pm

So here is my sloppy output.

The Anki deck can be directly downloaded here: https://ankiweb.net/shared/info/1907666022

The result is an Anki pack of 3000 cards from about 1500 Persian Tatoeba sentences with sound organized from a frequency dictionary (the Wikipedia corpus). Unfortunately, the Tatoeba database I pulled did not have sufficient sentences to deliver more than these 1500 entries.

I've added sound to all the entries so that we have the (word) (pause) (sentence) in Persian using the Azure TTS engine.
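One way to get the (word) (pause) (sentence) pattern from a cloud TTS engine is a single SSML request with a break element between the two parts. The sketch below only builds the SSML string and sends nothing; the voice name `fa-IR-DilaraNeural`, the 700 ms pause, and the function name are assumptions for illustration, not the settings used for this deck.

```python
from xml.sax.saxutils import escape


def word_pause_sentence_ssml(word, sentence, voice="fa-IR-DilaraNeural", pause_ms=700):
    """Build SSML that reads the headword, pauses, then reads the example sentence."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="fa-IR">'
        f'<voice name="{voice}">'
        f'{escape(word)}<break time="{pause_ms}ms"/>{escape(sentence)}'
        "</voice></speak>"
    )


print(word_pause_sentence_ssml("و", "بخندید و چاق باشید."))
```

Escaping the text matters because a stray `&` or `<` in a sentence would otherwise make the request invalid XML.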

Most of the work was done using just Excel and Google Sheets, so if someone wants my worksheets, I can share those.

Again, this is sloppy, so some of the sentences are unfortunately not the best quality. Do not hesitate to suspend or delete any sentences you don't like or don't find very good.

I have not "borrowed" the transliterations or sentences from the existing Routledge files due to copyright, but know that they exist, and if you prefer those longer, more academic sentences - it's easy to find them and create your own deck by adding TTS sound to those for personal use. For obvious reasons, I'm not going to share links to that work.

My cards look like this:

Screenshot 2022-05-08 at 17.11.19.png

Screenshot 2022-05-08 at 17.11.47.png

Comments and criticisms are welcome.

Note that I did not filter out the few people or place names, or the truly bad sentences from Tatoeba (I just ordered these to get the simplest sentence with the headword). There are a few things I think I can automate to make those sentences more relevant (like filtering by headword back-translation) if I think that the result would be significantly better.
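A back-translation filter could be sketched like this: accept a sentence for a headword only if its English translation also contains the headword's English gloss. The gloss is assumed to come from the frequency list, the function names are hypothetical, and the exact-word match would miss inflected forms such as "argued" for "argue".

```python
import re

PUNCTUATION = ".,!?:;،؟"


def translation_mentions(gloss, english):
    """True if the English gloss appears as a whole word in the English translation."""
    words = re.findall(r"[a-z']+", english.lower())
    return gloss.lower() in words


def first_confirmed_sentence(headword, gloss, sorted_pairs):
    """First pair where the headword is in the Persian side and the gloss is in the English side."""
    for english, persian in sorted_pairs:
        tokens = [token.strip(PUNCTUATION) for token in persian.split()]
        if headword in tokens and translation_mentions(gloss, english):
            return english, persian
    return None


sorted_pairs = [
    ("They argued.", "آنها جر و بحث کردند"),
    ("Laugh and be fat.", "بخندید و چاق باشید."),
]

print(first_confirmed_sentence("و", "and", sorted_pairs))
```

On the two sentences from the first post, this skips "They argued." for و and lands on "Laugh and be fat."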

Odair · Postby **Odair** » Sun May 08, 2022 5:21 pm

I have added translations and also added the words the examples are meant to illustrate.

https://drive.google.com/file/d/1vMI7C0 ... sp=sharing
https://drive.google.com/file/d/1H2AvOA ... sp=sharing

These are from the Opensubtitles and Mizan corpora I got at https://opus.nlpl.eu/

I would use it to learn vocabulary from the bottom up (Not from the very bottom where the sentences are messy because not enough words are allowed to find meaningful sentences). So every new word had example sentences that only contained words that I had already seen before.

But I have long abandoned this method as my reading level has improved. Reading real texts is a lot more effective than reading fragmented sentences.

zenmonkey · Postby **zenmonkey** » Sun May 08, 2022 5:38 pm

Odair wrote:I have added translations and also added the words the examples are meant to illustrate.

https://drive.google.com/file/d/1vMI7C0 ... sp=sharing
https://drive.google.com/file/d/1H2AvOA ... sp=sharing

These are from the Opensubtitles and Mizan corpora I got at https://opus.nlpl.eu/

I would use it to learn vocabulary from the bottom up (Not from the very bottom where the sentences are messy because not enough words are allowed to find meaningful sentences). So every new word had example sentences that only contained words that I had already seen before.

But I have long abandoned this method as my reading level has improved. Reading real texts is a lot more effective than reading fragmented sentences.

Wow, thanks, the TEP corpus looks large enough to run through a lemmatizer/tokenizer if I ever get around to using the code I listed up-thread. And thanks for the updated lists, I'll likely use them when I find a sentence in my current deck that doesn't satisfy me.
