When I was writing Slydr, I found it difficult to get hold of decent word lists. A commercial company quoted me £500-odd per language for a 50k-word list, so I decided to build my own. If you decide to use them, please let me know what you are using them for. They're yours to use.
Note: I used public / free subtitles to generate these and, like most things, they will have errors. If you want me to create an updatable repository, I can put these on CodePlex and you are welcome to update them.
I would like to thank opensubtitles.org, as their subtitles form the basis of the word lists. I would also like to thank Tehran University for its Persian language corpus, which allowed me to build the Persian / Farsi word list.
Creative Commons – Attribution / ShareAlike 3.0 license applies to the use of the word lists.
While the subtitles are free, donations do motivate further work. If you would like to donate, please click the Donate button to pay using PayPal.
If you would like to create your own word lists, here’s something to get you started. Download FrequencyWordsHelper. When you run the app, it asks for a directory to scan and then for an output filename. Once you provide both, it scans the directory for all txt files and creates a word list from them. The app requires .NET Framework 4.5.
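If you would rather not run the app, the same idea can be sketched in a few lines of Python. This is not the FrequencyWordsHelper source, only a rough illustration of the approach: the function names, the tokenizer pattern, the lowercasing, the UTF-8 assumption and the non-recursive scan are all my own assumptions here, not a description of what the app actually does.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed tokenizer: runs of letters in any script, allowing inner apostrophes
# (so "don't" stays one word). Digits and punctuation are dropped.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_words(text):
    """Return a Counter of the lowercased words found in text."""
    return Counter(match.group(0).lower() for match in WORD_RE.finditer(text))


def build_frequency_list(directory, output_file):
    """Count words in every .txt file directly inside directory and
    write one 'word count' line per word, most frequent first."""
    totals = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        # Assumes UTF-8; errors="replace" keeps going on badly encoded files.
        text = path.read_text(encoding="utf-8", errors="replace")
        totals.update(count_words(text))

    # Sort by descending count, then alphabetically so the output is stable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    with open(output_file, "w", encoding="utf-8") as out:
        for word, count in ranked:
            out.write(f"{word} {count}\n")
    return totals
```

Calling build_frequency_list("subtitles", "en.lst") would scan a folder named subtitles and write the list to en.lst (both names are just examples). Keep the output file outside the scanned folder, or give it an extension other than .txt, so a second run does not count its own output.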
Format of the frequency lists:
word1 number1 (number1 is the number of occurrences of word1 across all files)
word2 number2 (number2 is the number of occurrences of word2 across all files)
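Reading a list back is straightforward. Here is a minimal Python sketch, assuming each line is a word, whitespace, then a whole-number count, and that the file is UTF-8. The function names are made up for illustration, and lines that do not fit the pattern are simply skipped.

```python
def parse_frequency_list(lines):
    """Parse 'word count' lines into a dict mapping word -> count.
    Blank or malformed lines are skipped."""
    frequencies = {}
    for line in lines:
        # The count is the last whitespace-separated field on the line.
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdecimal():
            continue
        word, count = parts
        # Add up counts in case a word appears on more than one line.
        frequencies[word] = frequencies.get(word, 0) + int(count)
    return frequencies


def top_words(frequencies, n):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(frequencies, key=lambda w: (-frequencies[w], w))[:n]
```

To load a file, pass the open file object straight in, for example parse_frequency_list(open("en.lst", encoding="utf-8")), where en.lst is a placeholder filename.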