Word list by frequency based on Open Subtitle corpus 2018

A while ago I started working on a newer version of the word list, based on the 2018 OpenSubtitles corpus. Downloading the data takes a while and can be frustrating: Microsoft Edge, for example, cannot continue downloading in the background should you close the main window. I got distracted into writing a download manager.

Thanks to Hugo Lopez's comments on GitHub, I was reminded of the project and got started a second time.

Based on comments from Hugo and others, I added language detection to the list generator and generated the word lists.

I have not sanitised the lists in any way.
50k word lists were created only if the overall word count exceeded 50k.
I created an ignored file that lists the words identified as invalid. It is very likely that many words inside the ignored files are valid. Their frequency within the corpus is preserved should you wish to correct them.
With ja and fr, a few input files contained invalid words. I added a check to prevent the generator from crashing. The affected file was not processed any further.
From a given directory, only one file was processed. Any additional files in the input data were ignored.
The generated dataset can be found on GitHub along with the code.

https://github.com/hermitdave/FrequencyWords
https://github.com/hermitdave/FrequencyWords/tree/master/content/2018
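
The generator itself lives in the repository linked above. What follows is not that code, only a minimal Python sketch of the rules described in the notes: count words, keep rejected words with their frequencies in a separate "ignored" list, and emit a 50k list only when there are more than 50k words to choose from. The names `build_word_lists` and `is_valid` are hypothetical, the whitespace tokenisation and lowercasing are simplifications, and reading "overall word count" as the number of distinct valid words is my assumption here.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

TOP_N = 50_000  # size of the truncated list mentioned in the notes


def build_word_lists(
    lines: Iterable[str],
    is_valid: Callable[[str], bool],
    top_n: int = TOP_N,
) -> Dict[str, List[Tuple[str, int]]]:
    """Count words and split them into 'full', 'ignored' and (maybe) 'top' lists.

    `is_valid` is a hypothetical stand-in for the language detection step.
    """
    valid = Counter()
    ignored = Counter()
    for line in lines:
        # Simplification: lowercase and split on whitespace only.
        for word in line.lower().split():
            # Rejected words keep their counts so they can be restored later.
            if is_valid(word):
                valid[word] += 1
            else:
                ignored[word] += 1

    result = {
        "full": valid.most_common(),
        "ignored": ignored.most_common(),
    }
    # Assumption: "overall word count" means the number of distinct valid words.
    if len(valid) > top_n:
        result["top"] = valid.most_common(top_n)
    return result
```

Calling `build_word_lists(lines, str.isalpha)` on an iterable of subtitle lines returns a dictionary of `(word, count)` pairs sorted by descending frequency. The `"top"` key is present only when the threshold is exceeded, which mirrors why some languages have no 50k list.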

One thought on “Word list by frequency based on Open Subtitle corpus 2018”

Hi Hermit!
I wanted to use the Italian/English frequency lists for a quick dictionary app:
https://italian.kejender.vercel.app/
I hope you like it!
Best Regards,
Perttu

Invoke IT Limited

Invoke IT Blog
