I have a hobby project dienos-zodis.lt, which is a popular Wordle game alternative in Lithuanian. It has about 1000 daily players. The idea is simple: players need to guess a daily word in 7 tries, and with each guess, correct and incorrect letters and their placements in the word are revealed.
The game has two dictionaries: one for correct words (i.e., the ones players need to guess) and one for incorrect words, which the player can enter as a guess. The second list is necessary to make the game more challenging, as it requires players to enter actual words rather than random letter sequences.
Choosing the correct words for guessing is rather easy. Any dictionary does the job, as I need just 365 words per year and just the main word form.
But the incorrect word list is another story… One of the common complaints I received over the past year is that the incorrect word list is limited and does not accept otherwise correct words. For 2023, I decided to improve this situation. However, I had trouble finding a comprehensive list of Lithuanian words. Many dictionaries include only the main form of the word, but Lithuanian is a highly inflected language, so each word may have many different variants (e.g., vyras [a man] could be vyrą, vyro, vyrui, etc.). I was also unable to find a dictionary that I could easily scrape or obtain as text files. I needed to come up with a more creative solution to get all possible words.
I realized that the best database of all Lithuanian words would be a collection of books, if only I could get one and process it. Challenge accepted.
First, I needed to obtain a lot of books in digital format. Unfortunately, the easiest way to do this was through illegal means, such as pirating a LT book pack (~200 books) in EPUB format from a torrent website. I apologize to all authors and everyone else for doing this illegally. I promise that I used the books only for this analytical task and removed everything at the end.
Second, I needed to convert the EPUB format to TXT. I used the free Calibre program to do this, which was the easiest part. Now I had 183 LT books in TXT form:
Now the fun part – how do I process it? How to extract all words from all of the books and get the unique word list?
I used a Python script to loop through all the TXT books, split them into words, store the words in an array, and build a distinct list with a frequency distribution.
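What follows is only an illustrative sketch of that approach, not the script itself. It assumes the converted books are UTF-8 `.txt` files in a single folder; the function names, the folder layout, and the regular expression are all assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path

# Runs of letter-like characters; digits, underscores and punctuation
# act as separators. Works for Lithuanian letters such as ą, č, ė, š, ž.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text: str) -> Counter:
    """Lowercase the text and count every word found in it."""
    return Counter(WORD_RE.findall(text.lower()))


def count_words_in_folder(folder: str) -> Counter:
    """Sum word counts over every .txt file in `folder` (hypothetical layout)."""
    totals = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps going if a book contains badly encoded bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        totals.update(count_words(text))
    return totals
```

Calling `count_words_in_folder("books").most_common(20)` would then list the twenty most frequent words together with their counts, where `books` is a placeholder folder name.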
It took a couple of iterations to get to the final script: I had to fix the encoding and strip special characters. The result was simple, but exactly what I needed:
Next, I continued the process using Excel Power Query. I trimmed, lowercased, and created a rule engine to identify “suspicious” words, such as those with digits, those with the letters “x,” “w,” and “q,” which do not exist in Lithuanian, and others:
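The same kind of rule engine can be sketched in Python instead of Power Query. This is a minimal sketch under assumptions: the digit and x/w/q rules come from the description above, while the third rule (any character outside the 32-letter Lithuanian alphabet) and all names are illustrative additions, not the actual rule set.

```python
# The 32 letters of the Lithuanian alphabet, lowercase.
LT_ALPHABET = set("aąbcčdeęėfghiįyjklmnoprsštuųūvzž")
# Latin letters that are not part of the Lithuanian alphabet.
NON_LT_LETTERS = set("xwq")


def suspicion_reasons(word: str) -> list:
    """Return the reasons a word looks suspicious (empty list = looks fine)."""
    w = word.strip().lower()  # trim and lowercase first
    reasons = []
    if any(ch.isdigit() for ch in w):
        reasons.append("digit")
    if any(ch in NON_LT_LETTERS for ch in w):
        reasons.append("x/w/q")
    # Assumed extra rule: anything else outside the Lithuanian alphabet.
    if any(
        not ch.isdigit() and ch not in NON_LT_LETTERS and ch not in LT_ALPHABET
        for ch in w
    ):
        reasons.append("other character")
    return reasons
```

Returning a list of reasons rather than a single flag makes it easy to review each group of rejected words separately before deciding to drop it.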
And finally, I got a decent data set:
It is still not perfect, because some nonsensical, English, and misspelled words remain in it:
But because I also have a frequency distribution, I can choose the level of confidence I want. To keep only the most frequently used words, I can easily eliminate those that are used only once or twice (out of 10 million words in total!). If I want to make the game easier, I can include more.
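The frequency cut-off is a one-line filter. As a rough sketch, assuming the counts are held in a dictionary mapping each word to its number of occurrences (the function name and the `min_count` parameter are made up for illustration):

```python
def keep_frequent(counts: dict, min_count: int = 3) -> dict:
    """Keep only the words that occur at least `min_count` times."""
    return {word: n for word, n in counts.items() if n >= min_count}
```

With the default `min_count=3`, words seen only once or twice are dropped; lowering it to 1 or 2 lets more rare words through.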
What does the Lithuanian language look like in numbers?
Detailed analysis of the Lithuanian language might not be very interesting for non-Lithuanian readers, so I will leave just a general summary here. For a more detailed Lithuanian version, you can jump here.
So how many Lithuanian words are there? After all eliminations, I got 159 000 words. I eliminated a lot of real but infrequent words by dropping those used only 1-2 times, so the true number is likely somewhat higher. But if a word is used only once in 10 million, it’s probably safe to assume we can communicate just fine without it 🙂
How many different words do you need to communicate “good enough”?
Disclaimer. “Word” below will mean any word variant. Vyras, vyro, vyrui, vyrą would be 4 words, but in reality, it’s one word with 4 variants.
The answer depends on the “good enough” definition. Here is the cumulative frequency of the most used words:
The 1000 most used words cover 50% of all word occurrences in the text. 5000 words get you to 70%, and 10 000 words get you to 80%. To get to 90%, you will need 30 000 words.
What is the word frequency by word length? It depends on how you measure it: by total occurrences in the text (10 million words) or by the unique word list (159 000 words):
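Coverage figures like these come from sorting the words by frequency and accumulating their counts. A minimal sketch, assuming a dictionary of word counts as input; both function names are hypothetical:

```python
def coverage_of_top(counts: dict, n: int) -> float:
    """Share of all word occurrences covered by the n most frequent words."""
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:n]
    return sum(top) / total if total else 0.0


def words_needed(counts: dict, target: float):
    """Smallest number of top words whose combined share reaches `target`
    (a fraction between 0 and 1); None if there are no words at all."""
    total = sum(counts.values())
    running = 0
    for rank, n in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += n
        if running >= target * total:
            return rank
    return None
```

For example, `coverage_of_top(counts, 1000)` gives the share covered by the top 1000 words, and `words_needed(counts, 0.9)` answers the opposite question of how many words are needed to reach 90%.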
Most words in Lithuanian are 7-9 characters long, but most frequently used words are shorter and have 2-6 characters.
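The two ways of measuring differ only in what gets added up per word length: one unit per distinct word, or the word's full occurrence count. A small sketch, again assuming a dictionary of word counts and an illustrative function name:

```python
from collections import Counter


def length_distributions(counts: dict):
    """Return (by_unique, by_occurrence): for each word length, how many
    distinct words have it, and how many occurrences in the text have it."""
    by_unique = Counter()
    by_occurrence = Counter()
    for word, n in counts.items():
        by_unique[len(word)] += 1      # each distinct word counts once
        by_occurrence[len(word)] += n  # weighted by how often it appears
    return by_unique, by_occurrence
```

Short, very common words dominate `by_occurrence`, while `by_unique` is shaped by the many longer inflected forms that each appear only a handful of times.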
For my game dienos-zodis.lt, I needed a list of all 5-letter Lithuanian words, so I wrote a Python script and processed 183 Lithuanian books, 10 million words in total. I got an initial list of 390 000 distinct words. I dropped 8 000 containing digits or incorrect characters. Furthermore, I eliminated 220 000 words that were used only once or twice, as likely mistakes or just very rare words. I ended up with 159 000 Lithuanian words. For my game, I needed only 5-letter words, which brought my final list down to ~10 000 words. Here is what it looks like: