Word list by frequency based on Open Subtitle corpus 2018

A while ago I started working on a newer version of the word list, based on the 2018 OpenSubtitles corpus. Downloading the data takes a while and can be frustrating: Microsoft Edge, for example, cannot continue a download in the background if you close the main window. I got distracted into writing a download manager.

Thanks to Hugo Lopez’s comments on GitHub, I was reminded of the project and got started a second time.

Based on comments from Hugo and others, I added language detection to the list generator and generated the word lists.

I have not sanitised the lists in any way.
50k word lists were created only if the overall word count exceeded 50k.
I created an ignored file that lists the words identified as invalid. It is very likely that many words inside the ignored files are valid. Their frequency within the corpus is preserved should you wish to correct them.
With ja and fr, a few input files contained invalid words. I added a check to prevent the generator from crashing. Any file that triggered the check was not processed further.
From a given directory, only one file was processed. Any additional files in the input data were ignored.
The generated dataset can be found on GitHub along with the code.
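
To make the notes above concrete, here is a minimal Python sketch of how such a generator might split words into a kept list and an ignored list while preserving frequencies. It is not the code from the repository. The function name `build_word_lists`, the `is_valid` predicate (standing in for the language detection step) and the reading of "overall word count" as the number of distinct kept words are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable


def build_word_lists(
    lines: Iterable[str],
    is_valid: Callable[[str], bool],  # hypothetical stand-in for language detection
    top_n: int = 50_000,
):
    """Count words, then split them into kept and ignored lists."""
    counts = Counter()
    for line in lines:
        # Naive whitespace tokenisation; a real generator would need more care.
        counts.update(line.lower().split())

    kept, ignored = [], []
    # most_common() yields (word, count) pairs in descending frequency order.
    for word, count in counts.most_common():
        (kept if is_valid(word) else ignored).append((word, count))

    # Assumption: the top-N list is produced only when more than top_n
    # distinct valid words exist; otherwise it is skipped (None).
    top = kept[:top_n] if len(kept) > top_n else None
    return kept, ignored, top
```

Calling it with `str.isalpha` as the predicate, for example `build_word_lists(open_lines, str.isalpha)`, would send tokens containing digits or punctuation to the ignored list with their counts intact, so they can be restored later if they turn out to be valid.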


