And down the rabbit hole, here we go. First off, there is an interesting presentation on the work done to create the Routledge Persian Frequency Dictionary here:
https://www.fbcinc.com/e/LEARN/e/Middle ... janian.pdf
They describe their premise and scope and roughly outline the normalizing, tokenizing, and lexical/morphological analysis steps they carried out. They also describe some of the limitations of their work and their hopes for the future.
I found this interesting, but the example sentences they produced are unfortunately not as simple as they could be. For example, the example sentence for the first word 'and' (و) is "This boy lived alone with his father and a special relationship existed between the two." Yikes! That's not very useful for me - I'd prefer something like the Assimil sentence "The boy and the father came".
So I thought perhaps the Tatoeba sentence database could provide a good basis for simple sentences. I downloaded the English-Persian pairs, imported them into Excel, and sorted them by length and word count. I'm making the assumption that sentences that are shorter in both English and Persian will tend to be simpler.
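For anyone who would rather skip Excel, here is a minimal Python sketch of that sorting step. It assumes the sentence pairs are already loaded as (Persian, English) tuples; the function names and the exact sort key (combined word count first, combined character length as a tie-breaker) are my own choices for illustration, not necessarily what the spreadsheet did.

```python
def simplicity_key(pair):
    """Sort key: combined word count first, combined character length second."""
    fa, en = pair
    word_count = len(fa.split()) + len(en.split())
    char_count = len(fa) + len(en)
    return (word_count, char_count)


def sort_by_simplicity(pairs):
    """Return the pairs ordered from (presumably) simplest to most complex."""
    return sorted(pairs, key=simplicity_key)


# Two pairs taken from this post, used only as sample data.
pairs = [
    ("بخندید و چاق باشید.", "Laugh and be fat."),
    ("آنها جر و بحث کردند", "They argued."),
]

for fa, en in sort_by_simplicity(pairs):
    print(fa, "|", en)
```

Word count is a crude proxy for difficulty, so the ordering is only a starting point for picking sentences by hand.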
But then I went off on a tangent: why not build my own frequency dictionary out of these sentences? Or at least set up a generic process for doing that.
There are a few software libraries out there to manage lemmas etc., and I've played around a bit with one in the past. But I did find an excellent article on the topic for Persian - https://aclanthology.org/L18-1179.pdf - which describes a language toolkit, and the authors have made their toolkit public on GitHub: https://github.com/ICTRC/Parsivar
So I thought that perhaps the thing to do is normalize and tokenize my own corpus, build a database, etc... but that's really more than just a few hours of work. And I got lazy.
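For what it's worth, the bare-bones shape of that pipeline can be sketched in a few lines of standard-library Python. This is a crude stand-in and not the Parsivar API: the normalization only maps Arabic yeh and kaf to their Persian forms and strips punctuation, tokenizing is a plain whitespace split, and there is no stemming or lemmatization at all, so inflected forms are counted as separate words. All names here are hypothetical.

```python
import re
from collections import Counter

# Map Arabic yeh/kaf to the Persian code points (a very partial normalization).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Anything that is not a word character, whitespace, or the zero-width
# non-joiner (used inside Persian words) is treated as punctuation.
PUNCT = re.compile(r"[^\w\s\u200c]")


def tokenize(sentence):
    """Normalize a sentence and split it into whitespace-delimited tokens."""
    cleaned = PUNCT.sub(" ", sentence.translate(ARABIC_TO_PERSIAN))
    return cleaned.split()


def frequency_list(sentences):
    """Count surface forms across all sentences, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts.most_common()


# Sample data: the two Persian sentences quoted in this post.
sample = ["بخندید و چاق باشید.", "آنها جر و بحث کردند"]

for word, count in frequency_list(sample)[:3]:
    print(word, count)
```

A real toolkit would replace `tokenize` with proper normalization and stemming, which is where most of the work is.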
Instead, I found that someone has already uploaded a Persian frequency dictionary to AnkiWeb. Good enough, even if the sentence structure is not adequate for my needs - what I'm now building is much simpler. I'm just going to search for the first instance of each head word in the sorted Tatoeba sentences. Once I have this list of sentences, I'm putting them back into Anki and using HyperTTS to create the audio.
So for the 'and' example the Tatoeba example sentence is
آنها جر و بحث کردند
Ha! That's not any better. The translation of that into English is "They argued." I have no idea why there is an 'and' in that sentence.
As imperfect as this is, the next choice is ... بخندید و چاق باشید. - Laugh and be fat. That works perfectly.
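The lookup I'm doing by hand can be sketched like this, assuming the pairs are already sorted simplest-first as (Persian, English) tuples. The function names and the `skip` parameter are hypothetical. Matching on whole tokens rather than substrings keeps a one-letter head word like و from matching inside longer words, but it does not help with the جر و بحث case above, where و really is a separate token; `skip` is there to step to the next candidate when the first hit is no good.

```python
import re
from itertools import islice

# Treat everything except word characters, whitespace, and the zero-width
# non-joiner as punctuation to be stripped before splitting.
PUNCT = re.compile(r"[^\w\s\u200c]")


def tokens(sentence):
    """Split a sentence into tokens with punctuation removed."""
    return PUNCT.sub(" ", sentence).split()


def first_example(headword, sorted_pairs, skip=0):
    """Return the (skip+1)-th pair whose Persian side contains headword
    as a whole token, or None if there are not enough matches."""
    hits = (pair for pair in sorted_pairs if headword in tokens(pair[0]))
    return next(islice(hits, skip, None), None)


# Sample data from this post, in the order my sorted list produced them.
sorted_pairs = [
    ("آنها جر و بحث کردند", "They argued."),
    ("بخندید و چاق باشید.", "Laugh and be fat."),
]

print(first_example("و", sorted_pairs))          # first hit
print(first_example("و", sorted_pairs, skip=1))  # next candidate
```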
So: reimport into Anki, add sound with the TTS plugin, and (after reformatting the cards) this frequency deck should look like this:
It's sloppy, and I'm playing around with this some more. I'm posting to see what the collective hive mind has done in these areas.
I'm open to ideas...