A while ago I started working on a newer version of the word lists, based on the 2018 OpenSubtitles corpus. Downloading the data takes a while and can be frustrating: Microsoft Edge, for example, cannot continue a download in the background if you close the main window. I got distracted into writing a download manager.
Thanks to Hugo Lopez's comments on GitHub, I was reminded of the project and started on it a second time.
Based on comments from Hugo and others, I added language detection to the list generator and generated the word lists.
I have not sanitised the lists in any way.
50k word lists were created only if the overall word count exceeded 50k.
For each language I created an ignored file that lists the words identified as invalid. It is very likely that many words inside the ignored files are valid. Their frequency within the corpus is preserved should you wish to correct them.
With ja and fr, a few input files contained invalid words. I added a check to prevent the generator from crashing. When the check fired, the rest of that file was not processed.
From a given directory, only 1 file was processed. Any additional files in the input data were ignored.
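To make the notes above concrete, here is a rough Python sketch of those rules. It is not the generator's actual code. The helper names, the output file names, the `word count` line format and the `isalpha` validity test (a stand-in for language detection) are all assumptions for illustration. It also assumes that "only 1 file" means the first file in sorted order, that the 50k threshold applies to distinct words, and that bad input shows up as undecodable bytes.

```python
from collections import Counter
from pathlib import Path


def is_valid_word(token: str) -> bool:
    # Hypothetical stand-in for the language-detection step.
    return token.isalpha()


def pick_input_file(directory: Path):
    # Assumption: take the first file in sorted order, ignore the rest.
    files = sorted(p for p in directory.iterdir() if p.is_file())
    return files[0] if files else None


def count_file(path: Path, kept: Counter, ignored: Counter) -> bool:
    """Add one file's words to the counters.

    Returns False if the file had to be abandoned part-way through.
    """
    try:
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                for token in line.lower().split():
                    target = kept if is_valid_word(token) else ignored
                    target[token] += 1
    except UnicodeDecodeError:
        # Assumption: invalid input surfaces as a decoding error.
        # Skip the rest of this file instead of crashing.
        return False
    return True


def write_counts(path: Path, pairs) -> None:
    # One "word count" pair per line (assumed format).
    with path.open("w", encoding="utf-8") as handle:
        for word, count in pairs:
            handle.write(f"{word} {count}\n")


def write_lists(kept: Counter, ignored: Counter, out_dir: Path,
                lang: str, top_n: int = 50_000) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    ranked = kept.most_common()  # highest frequency first
    write_counts(out_dir / f"{lang}_full.txt", ranked)
    # The shorter list is written only when there are more than top_n words.
    if len(ranked) > top_n:
        write_counts(out_dir / f"{lang}_top.txt", ranked[:top_n])
    # Ignored words keep their counts so they can be restored later.
    write_counts(out_dir / f"{lang}_ignored.txt", ignored.most_common())
```

Because the ignored file keeps the counts, putting a wrongly rejected word back is just a matter of moving its line into the full list and re-sorting by count.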
The generated dataset can be found on GitHub, along with the code.
https://github.com/hermitdave/FrequencyWords
https://github.com/hermitdave/FrequencyWords/tree/master/content/2018