Frequency Word Lists

I will start working on the 2018 OpenSubtitles dataset soon. Watch this space.

Download: frequency word lists for the 2016 OpenSubtitles dataset, and the code used to generate them, are now publicly available.
Click here to go to the GitHub

Previous post and links to old data files

Go to skydrive download page

I originally created the word lists while trying to improve the dictionaries I used for my Windows Phone app, Slydr.

Of course there were commercial options, but I was quoted about £500 per language for a nice, cleaned word list. Being a cheap git, I decided to create my own.

If you decide to use it, please let me know what you are using it for. It's yours to use.

Note: I used public / free subtitles to generate these and, like most things, they will have errors.

I would like to thank opensubtitles.org, as their subtitles form the basis of the word lists. I would also like to thank Tehran University for the Persian language corpus, which allowed me to build the Persian / Farsi word list (2011 version).

Creative Commons – Attribution / ShareAlike 3.0 license applies to the use of the word lists.

While the subtitles are free, donations do motivate further work. If you would like to donate, please click the Donate button to donate via PayPal.

If you would like to create your own word lists, here’s something to get you started: download FrequencyWordsHelper. When you run the app, it asks for a directory to scan and then for an output filename. Once you provide both, it scans the directory for all .txt files and creates a word list from them. The app requires .NET Framework 4.5.

Download it from the Microsoft downloads page

Format of the frequency lists:
word1 number1 (number1 is the number of occurrences of word1 across all files)
word2 number2 (number2 is the number of occurrences of word2 across all files)
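
For readers who want to see the idea without the .NET app, here is a minimal Python sketch of what such a tool might do. It is not the FrequencyWordsHelper source: the function names `build_frequency_list` and `write_frequency_list`, the regular-expression tokenizer, the lowercasing and the recursive scan are all assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed tokenizer: runs of letters/digits, optionally joined by apostrophes,
# so "don't" stays one word. The real tool's rules may differ.
WORD_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def build_frequency_list(directory):
    """Count how often each word occurs across all .txt files under directory."""
    counts = Counter()
    for path in sorted(Path(directory).rglob("*.txt")):
        # Assumes UTF-8 input; undecodable bytes are dropped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def write_frequency_list(counts, output_file):
    """Write one 'word count' pair per line, most frequent word first."""
    with open(output_file, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            out.write(f"{word} {count}\n")
```

Calling `write_frequency_list(build_frequency_list("subtitles"), "en.txt")` would produce a file in the format above; `subtitles` and `en.txt` are placeholder names. Keep the output file outside the scanned directory, otherwise a later run would count the word list itself.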

Language	2011	2012
Arabic – ar	Download	Download
Bulgarian – bg	Download	Download
Czech – cs	Download	Download
Danish – da	Download	Download
German – de	Download	Download
Greek – el	Download	Download
English – en	Download	Download
Spanish – es	Download	Download
Estonian – et	Download	Download
Farsi – fa	Download	Download
Finnish – fi	Download	Download
French – fr	Download	Download
Hebrew – he	Download	Download
Croatian – hr	Download	Download
Hungarian – hu	Download	Download
Indonesian – id	Download	Download
Icelandic – is	Download	Download
Italian – it	Download	Download
Korean – ko	Download	Download
Lithuanian – lt	Download	Download
Latvian – lv	Download	Download
Macedonian – mk	Download	Download
Malay – ms	Download	Download
Dutch – nl	Download	Download
Norwegian – no	Download	Download
Polish – pl	Download	Download
Portuguese – pt	Download	Download
Portuguese Brazilian – pt-br	Download	Download
Romanian – ro	Download	Download
Russian – ru	Download	Download
Slovak – sk	Download	Download
Slovenian – sl	Download	Download
Albanian – sq	Download	Download
Serbian Cyrillic – sr-Cyrl	Download	Download
Serbian Latin – sr-Latn	Download	Download
Swedish – sv	Download	Download
Turkish – tr	Download	Download
Ukrainian – uk	Download	Download
Simplified Chinese – zh-CN	Download	Download

320 thoughts on “Frequency Word Lists”

Detroit Tigers Tickets says:

October 14, 2011 at 4:42 am

Thanks for that awesome posting. It saved MUCH time 🙂

Jcwf says:

October 15, 2011 at 3:41 pm

The Dutch Wiktionary is very grateful for this. See http://nl.wiktionary.org/wiki/WikiWoordenboek:Top_50000_%28ondertiteling%29

- Hermit Dave says:
  
  October 15, 2011 at 4:59 pm
  
  you are welcome and thank you. i stumbled upon wiktionary when i was looking for word lists and i know its not easy to find many free / extensive sources. since i only frequented English Wiktionary, i only posted links on the english version of the page.
  
anousakiirene says:

October 16, 2011 at 1:44 pm

many thanx! i downloaded the greek FL! ive made some of my own for research purposes… it would be interesting to compare the results!

- Hermit Dave says:
  
  October 16, 2011 at 3:40 pm
  
  glad to be of help. let me know how the lists compare. would be interesting to find out.
  
anousakiirene says:

October 16, 2011 at 4:48 pm

what other details are there available?? for example what is the size of the corpus?? im comparing your frequency counts with counts from here-> http://hnc.ilsp.gr/default.asp and my personal -small corpus

- Hermit Dave says:
  
  October 16, 2011 at 9:20 pm
  
  well my overall subtitle corpus for all languages was 53 GB compressed archive. I unfortunately deleted all except the original archive. Let me open it and i can give you an idea on the number of files at least. Based on my tests, frequency lists generated using decent amount of data should be comparable. I can assure you that there were a lot more entries than the 50k i used and provided for download
  
- Hermit Dave says:
  
  October 25, 2011 at 10:38 pm
  
  hello mate… i am redoing the word lists and here’s corpus info for greek.
  Total files: 31348
  Unique word count: 834515
  Total word count: 696415285225
  
ignatius says:

October 25, 2011 at 10:09 am

thank you for such a great work

Trond Trosterud says:

November 1, 2011 at 11:32 am

I suggest you lemmatize your wordlists, and not only present them as wordforms (group verb forms walks, walked under walk_V, and noun forms a walk, the walks under walk_N), and similarly for the other languages. Here is an overview of software to do so: http://en.wikipedia.org/wiki/Constraint_Grammar

- Hermit Dave says:
  
  November 1, 2011 at 11:38 am
  
  well i can do that but the word lists i consume are for a keyboard app and i need raw words to match user input. in fact when i started i came across a few word lists and i could not use them for my requirement, purely because, depending upon user input, i would want to show walked in my app, and lemmatised word lists would make loading a lot slower.
  
  maybe at some point once i am done with current work load, i will look into lemmatising the lists. thanks for writing
  
Robert in Arabia says:

November 2, 2011 at 3:56 pm

Wonderful!

grill says:

November 5, 2011 at 5:02 pm

Great! thanks for the share!
Arron

AML says:

November 7, 2011 at 8:20 pm

When I download Hebrew (or Arabic, or Farsi) it’s just gibberish. How to go from gibberish to Hebrew?

Excellent work, and I thank you once again if you can help with this issue.

- Hermit Dave says:
  
  November 7, 2011 at 10:58 pm
  
  how are you opening the files ? the files are saved in utf-8 format. I use notepad (the best so far) or visual studio to open the files
  
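
The gibberish is what UTF-8 bytes look like when a program decodes them with a single-byte code page. A small Python illustration follows; the sample word and the choice of Latin-1 as the wrong encoding are assumptions for the demo, since the actual code page depends on the viewer.

```python
word = "שלום"  # sample Hebrew word, chosen only for the demo

raw = word.encode("utf-8")        # the bytes as stored in the file
garbled = raw.decode("latin-1")   # what a viewer using the wrong encoding shows
restored = raw.decode("utf-8")    # what a UTF-8 aware editor shows

print(repr(garbled))
print(restored)
```

In Python, opening a list with `open(path, encoding="utf-8")` avoids the problem.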
AML says:

November 8, 2011 at 7:05 am

Ok, I got it to work with OpenOffice.
Thanks again!!!

Magyar says:

November 19, 2011 at 1:47 pm

That’s a fantastic work indeed. I am using it for studying Russian now. Learning the most frequent words helps you to understand the spoken languages much easier and faster. And it’s really fabulous there’re people who share that. Thanks.

Pingback: Most common words list for Korean « Acquiring Korean
- Hermit Dave says:
  
  December 5, 2011 at 10:32 pm
  
  i have a korean list somewhere… its just that i haven’t consumed it so i never bothered to look into it. will drop you a message when i upload it
  
  - Sêyivê Fréjus PADONOU says:
    
    September 18, 2012 at 7:28 am
    
    Hi,
    
    Thanks for this great job.Please can you do the same for japanese. I really love manga and I would like to learn japanese.
  - Hermit Dave says:
    
    September 19, 2012 at 7:52 am
    
    I unfortunately do not have a large Japanese list. if you have a Japanese repo i’d be more than happy to churn them out for you. In the mean time, I will also check for some public repo
  - Sêyivê Fréjus PADONOU says:
    
    September 23, 2012 at 3:09 am
    
    hi Dave. Sorry but I don’t. Anyway, I’ll let you know if I find some.
    
    Cheers ^_^
Anno says:

December 5, 2011 at 10:41 pm

Oh thank you so much! 🙂 I’d love to have the Korean list.^^

- Hermit Dave says:
  
  December 5, 2011 at 11:16 pm
  
  there.. i’ve added the links to Korean list i had.. it was small.. after initial clean up it was 14k or so words only. Not a very high quality frequency list i must admit – but at least something to start with.
  
Anno says:

December 6, 2011 at 3:18 am

Thank you! Is it small because there weren’t that many subtitle files in Korean?

- Hermit Dave says:
  
  December 6, 2011 at 6:20 am
  
  the size of word lists depends on how many subtitles are used (and their size and the unique word count in them). i have a log somewhere of how many files / words per language
  Total files: 15
  Unique word count: 19215
  Total word count: 369216225
  See how low the file count was. the higher the file / word count, the better the quality of the word list
  
  - Anno says:
    
    December 6, 2011 at 1:54 pm
    
    Oh yes, I see!^^ Quite low file count.
    
    Thank you though. It’s a good start. 🙂
  - Hermit Dave says:
    
    December 6, 2011 at 2:08 pm
    
    yes when you compare with say for example greek subtitle corpus (details which i posted earlier)..
    Total files: 31348
    Unique word count: 834515
    Total word count: 696415285225
    
    Glad to help.
Rich says:

December 11, 2011 at 5:57 pm

Is there a reason I get a list of numbers? I get the words in the native language but next to them numbers, am opening in notepad

- Hermit Dave says:
  
  December 11, 2011 at 6:09 pm
  
  well the format i used for word list is sort of generic one i found around. it goes like this
  word1 wordfrequency1
  word2 wordfrequency2
  word3 wordfrequency3
  word comes first, and is followed by word frequency which is a number, with space in between.
  
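
A minimal Python sketch of reading this format, assuming one `word count` pair per line with the count after the last space. The helper names are hypothetical, and `utf-8-sig` is used on the assumption that some files may start with a byte order mark (it also reads plain UTF-8).

```python
def parse_frequency_lines(lines):
    """Parse 'word count' lines into a list of (word, count) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        # Split on the last space: the word comes first, then the number.
        word, _, count = line.rpartition(" ")
        if not word or not count.isdigit():
            continue  # skip blank or malformed lines
        entries.append((word, int(count)))
    return entries


def load_frequency_list(path):
    """Read a word list file saved as UTF-8 (with or without a BOM)."""
    with open(path, encoding="utf-8-sig") as handle:
        return parse_frequency_lines(handle)
```

The numbers are what Notepad shows next to each word; they are part of the list, not a display error.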
hkcu says:

December 19, 2011 at 5:18 am

could you say the corpus info for english as well?

- Hermit Dave says:
  
  December 19, 2011 at 10:11 am
  
  Here you go. For english the details are:
  Total files: 106828
  Unique word count: 830854
  Total word count: 690318369316
  
  - hkcu says:
    
    December 19, 2011 at 6:24 pm
    
    thanx, dave.) but isnt that just a crazy figure? ~700 BILLION words, can it be true? @_@ as far as i know the largest corpus of all available is COCA, which is made up of only around 425 million words!! http://www.wordfrequency.info
  - Hermit Dave says:
    
    December 19, 2011 at 7:29 pm
    
    The unique word count is the number of distinct words. subtitles have a lot of errors; for example the words “i don’t” sometimes occur as idon’t, which would count as a unique word. however it would sit a lot further down the list than i and don’t, as those two words are used far more often. Total word count is just a count of all words in all subtitles. Each word could be repeated 1000s of times.
hkcu says:

December 20, 2011 at 4:49 am

dave, i meant the total words figure, of course, that makes up your corpus. it appears that this figure of 700 billion words is just incorrect, because a simple calculation such as “Total word count/Total files” gives 6 and a half million words in a single file, which, you must acknowledge, simply cant be true, especially for an average subtitle file that is supposed to be quite small in terms of the words contained in it.

6,5 mln. words per file would give a file of 10 Mb as minimum. so, again: this cant be a true figure — Total word count: 690,318,369,316.

judging by word occurrences in your corpus, its size cant exceed that of COCA.

top10 from yours

you 21953223
the 18609293
to 13815857
and 8134383
it 7913344
of 7131178
that 6534300
in 6010124
is 5924671
me 5619307

top10 from coca’s

the a 22038615
be v 12545825
and c 10741073
of i 10343885
a a 10144200
in i 6996437
to t 6332195
have v 4303955
to i 3856916
it p 3872477

so, it turns out the total words number of your corpus is slightly less than that of COCA, which means the total number of words in yours must be around 400,000,000.

can you recalculate the real figure?

- Hermit Dave says:
  
  December 20, 2011 at 7:17 am
  
  Okay i will redo the counts (plus remember that this is the first set of frequency lists i did). I have been tweaking a bit. I will do a count again and post it when i get to work.
  
- Hermit Dave says:
  
  December 20, 2011 at 1:54 pm
  
  you were correct. I re-ran the word list generator a couple of times and I found the mistake i made in computing the total count. The other details are correct however the total word count came up to 765703147 and not 690788712769
  
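
The words-per-file sanity check from this exchange is easy to reproduce. The sketch below uses only figures quoted in the thread; the variable and function names are made up for illustration. As a side observation, the figure first posted for English equals the square of the unique word count, which may hint at where the total went wrong, though the thread does not say what the mistake was.

```python
total_files = 106_828
unique_words = 830_854
reported_total = 690_318_369_316   # total word count first posted for English
corrected_total = 765_703_147      # total word count after the re-run


def words_per_file(total_words, files):
    """Average number of words per subtitle file."""
    return total_words / files


# Roughly 6.5 million words per file: implausible for a subtitle file.
print(round(words_per_file(reported_total, total_files)))
# Roughly 7,000 words per file: much closer to a real subtitle file.
print(round(words_per_file(corrected_total, total_files)))
# The first figure is exactly the unique word count squared.
print(unique_words ** 2 == reported_total)
```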
amacinho says:

December 20, 2011 at 6:24 pm

Thanks for sharing this data with us! I’m getting 404 errors for Estonian and Ukrainian files. Any chance you can upload them again? And the Arabic line is repeated on the first and third lines.

- Hermit Dave says:
  
  December 20, 2011 at 8:30 pm
  
  I will check estonian and latvian files in a little bit. What did you mean by arabic line is repeated in first and third lines ?
  
  aah i see what you mean.. i will fix that.
  
- Hermit Dave says:
  
  December 20, 2011 at 10:20 pm
  
  I have uploaded estonian and latvian files again. Thanks for pointing it out.
  
hkcu says:

December 20, 2011 at 7:04 pm

thank you, dave. this one indeed looks more realistic as it gives now the figure of about 7,000 words per file which is way closer to the truth.

and a special thank for the great job you’ve done as i’m one of many who also may be interested in such useful lists!!

ATB

- Hermit Dave says:
  
  December 20, 2011 at 8:27 pm
  
  not to worry mate.
  
hkcu says:

December 21, 2011 at 5:01 am

i was about to think the question had been closed, but yet another thought has just occurred to me, this time about the total number of files.)

this number of 107,000 files, as it seems to me, might also be incorrect… why? because the total number of subtitle files on http://www.opensubtitles.org is currently 1,593,684. but this figure seems not to refer to the number of UNIQUE files on the site, but rather to the number of all files available, while it is a well known fact that sometimes several or even a dozen files can be attached to a single movie. for example, 62 english subtitles are currently attached to the Avatar movie:
http://www.opensubtitles.org/ru/search/sublanguageid-eng/idmovie-19984
thus, all i want to say is, are you sure all the files making up your english corpus are really unique? the current figure of 107 thousand files seems dubious as it implies there are actually 107 thousand english-language movies, which is an enormous number.

- Hermit Dave says:
  
  December 21, 2011 at 12:05 pm
  
  that is true… based on my random scan i would say that most directories only had one file however there were a few directories with multiple files (repeating subtitles) and in at least one case i saw 10 entries. so yes there is a flaw however some subtitles are split and some are not.
  
hkcu says:

December 21, 2011 at 1:17 pm

well, as long as there are recurring files in the corpus it cant be fully authentic. moreover, this also means the total words figure should be reduced even further, so i think you have to solve this problem somehow for the sake of accuracy, although it might prove to be a difficult task this time, indeed…

- Hermit Dave says:
  
  December 21, 2011 at 1:33 pm
  
  true.. i do understand how a corpus is created. i might have to write my own directory scan mechanism – its just a problem of logic and time.. it takes ages to generate word lists… more than a day as i couldn’t be asked to multi thread it (overall data is more than 50GB compressed). i’ll try to update the files over the next few days depending upon what i am doing.
  
  thanks for making me look into this.
  
hkcu says:

December 21, 2011 at 5:15 pm

thank you.) i really like your idea of creating a corpus based on subtitles to movies.

i think if you do recreate the corpus it could affect the word frequency lists as well, so it is worth doing the job, especially since at the end of it you will get a more precise and accurate corpus!

ATB.

p.s. could you please notify people then by writing a message on here? it seems like i get an e-mail every time someone posts a message to comments which is useful 😉

- Hermit Dave says:
  
  December 21, 2011 at 7:06 pm
  
  I have reworked the code… i will give it a go later on today or tomorrow morning and it should be ready hopefully on friday.
  
hkcu says:

December 21, 2011 at 7:23 pm

it would be highly appreciated! thanx.)

- Hermit Dave says:
  
  December 24, 2011 at 3:06 pm
  
  my reworked frequency list builder works well.. managed to process all languages though i need to rework english to handle subtitle annoyances with don’t etc… its usually split into don ‘t as two words.. hopefully that will be done at some point this week.. right now both desktop and laptop busy downloading maps for another project
  
hkcu says:

December 24, 2011 at 5:43 pm

this is great news, dave!.. have you thought of finding some way to lemmatize your corpora, at least the english one? it would be even greater!.. i’m not an expert, but it seems like i’ve seen a piece of software for that purpose somewhere on the internet…

- Hermit Dave says:
  
  December 27, 2011 at 11:40 am
  
  I haven’t gotten around to reworking the english dictionary yet.. I have been asked about lemmatizing the corpus but so far i haven’t gone that route for 2 reasons. 1) i consume straightforward word lists and that’s why i build them in this manner. 2) i would need to look into it, and since i don’t need it for my own use it becomes low priority. Anyways been busy with christmas. I will try to get the english list sorted tomorrow and then probably upload raw lists.
  
hkcu says:

December 27, 2011 at 5:17 pm

i didnt quite understand what you really meant by your two reasons, but never mind, this present format will also do.

- Hermit Dave says:
  
  December 29, 2011 at 2:04 pm
  
  there… uploaded the lists and added a few more languages.
  
Sant Dafydd says:

December 29, 2011 at 4:01 pm

I couldn’t download the lists. They sound great. are they still active?

- Hermit Dave says:
  
  December 29, 2011 at 4:18 pm
  
  yes they are. what could you not download ? I just uploaded a new set today.
  
- Hermit Dave says:
  
  December 29, 2011 at 4:23 pm
  
  fixed.. thanks for pointing it out.
  
hkcu says:

December 29, 2011 at 4:56 pm

thanx! but what about the corpus info?

Total files:
Unique word count:
Total word count:

- Hermit Dave says:
  
  December 29, 2011 at 5:40 pm
  
  I will upload the log files and put them in the table tomorrow.
  
- Hermit Dave says:
  
  December 30, 2011 at 10:58 am
  
  Uploaded the logs. the format is
  Total files: 12601
  Unique word count: 401001
  Total word count: 64362991
  Overall word count: 91273545
  
  Total word count is the total count used for frequency list.
  Overall word count was the actual word count. some words have junk characters or a length of 1 and are ignored. Hence Total word count <= Overall word count.
  
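
The relationship between the log fields can be sketched in Python as follows. This is an illustration under assumptions, not the builder's code: the exact junk rule is not given in the thread, so `is_countable` simply rejects one-character tokens and tokens containing anything other than letters or apostrophes, and all names are hypothetical.

```python
from collections import Counter


def is_countable(token):
    """Assumed filter: drop one-character tokens and tokens with junk characters."""
    if len(token) < 2:
        return False
    return all(ch.isalpha() or ch == "'" for ch in token)


def corpus_stats(tokens):
    """Return unique, total (kept) and overall (all tokens seen) word counts."""
    overall = 0
    counts = Counter()
    for token in tokens:
        overall += 1
        if is_countable(token):
            counts[token] += 1
    return {
        "unique": len(counts),
        "total": sum(counts.values()),
        "overall": overall,
    }
```

Because every kept token is also counted in `overall`, `total` can never exceed `overall`, which matches the inequality in the log description.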
hkcu says:

December 29, 2011 at 4:58 pm

i guess i should have to attach such information on each language in advance!

Sant Dafydd says:

December 29, 2011 at 5:20 pm

Just tried them again after restart, they download fine. Thanks a lot, I’m going to use them with dictionaries to help study French and Spanish on my Kindle.

- Hermit Dave says:
  
  December 29, 2011 at 5:41 pm
  
  wish i could use my kindle for studies… it just doesn’t happen.. good luck with languages… i could do with learning spanish and italian.
  
Laiyth says:

January 3, 2012 at 5:51 am

Dear Dave,

Kindly explain to me the number next to the word, i assumed that it shows how common the specified word is used or how popular it is, no?

p.s. this list is a gift from heaven merci, gracias for these great lists

- Hermit Dave says:
  
  January 3, 2012 at 6:40 am
  
  Yes you are correct. The number denotes the number of times this word was used – its popularity in other words. You are welcome.
  
Luigi Assom says:

January 4, 2012 at 5:50 pm

Thank You Dave!

very useful repository.
However, I am analyzing the English corpus, first 10K words.
I found there is not the “I” pronoun in the first 10K entries, which made me think about some oddities.

Can you tell me the amount of words and sources you retrieved the corpus from?
I read is from opensubtitles, but how many movies have been processed? do you have some info about this sources?

I am asking this to see if I can compare the corpus with authoritative sources, such as published dictionaries (oxford) where words are sorted by frequency too.
thank you very much!!

And happy 2012 and “fatherhood :)”

- Hermit Dave says:
  
  January 4, 2012 at 8:16 pm
  
  Luigi,
  The word lists i have generated ignore 1 letter words like a and i. It’s difficult to validate a single char word across multiple languages unless you know the language or can spend time tuning the rules per language. I know a bit about it as i have done something similar for accents across various latin based european languages. If you really want one, i can generate a one off and email it to you.
  
  The corpus details are available in the log files. Most languages have a log file entry in the table.
  
Luigi Assom says:

January 4, 2012 at 6:21 pm

Dave,
I’ve got another question please.
I am looking at the Russian dataset, but it is not encoded to handle Cyrillic.
What should I do to see the Cyrillic charset?
thank you again!

- Hermit Dave says:
  
  January 4, 2012 at 8:18 pm
  
  Luigi,
  I use notepad. It can display all character sets without any issue – including Cyrillic.
  
  Happy new year and thanks
  
  Hermit Dave
  
  - Luigi Assom says:
    
    January 7, 2012 at 5:38 pm
    
    mmmm I am using a mac, I can’t see the font.. 😕
    I will try with another editor, your suggestion are anyway welcome 😀
  - Hermit Dave says:
    
    January 7, 2012 at 6:41 pm
    
    Sorry mate, I can’t really help you with a Mac. Try Vi.
Luigi Assom says:

January 7, 2012 at 5:36 pm

Thank you Dave,
yes, if you don’t mind I would like to ask you for a copy.
You can contact me at the email address I used to comment on your post.
Let me understand: is it you that constructed the word lists across opensubtitles movies?
If so, which genres of movies did you pick? I mean, I’d like to understand if that list can actually be taken to represent spoken EN on average.
I’d like to use it to compare with lists like this one:
http://books.google.it/books?id=J69KTr60yt8C&printsec=frontcover&dq=english+russian+10000+words+dictionary&hl=it&sa=X&ei=goIIT46lC8P74QSSy9SNCA&ved=0CDYQ6AEwAA#v=onepage&q=english%20russian%2010000%20words%20dictionary&f=false

Thank you!

- Hermit Dave says:
  
  January 7, 2012 at 6:30 pm
  
  Luigi,
  
  I found an extract someone had already done across various languages. I just consumed what i found – i think the resource was up to date with all movies across various genres. The concept of frequency lists dictates that it should be close if not representative of the actual usage. UK english usage is different from US English which is different from that in Canada, Australia, India etc etc.
  
  this word list is a general one that doesn’t represent the en-UK or en-US etc just english in general.
  
  sure you can compare the lists.
  
  Hermit
  
- Hermit Dave says:
  
  January 11, 2012 at 5:29 pm
  
  I have reworked the word lists and now they allow single character words.
  
  The files are available as zipped text files. there’s a 50K word list and then there’s the full word list.
  
Luigi Assom says:

January 7, 2012 at 5:43 pm

Hi Dave,
nevermind, i found the way to see it correctly.
Good!
Please, would you mind to let me know which other rules you adopted to construct the corpuses?
– exclude words with one char
– … ?

Do you also have an idea where i could find a digital resource of a russian (and other languages) with definitions of words, which I can import easily (a list in txt, csv, xml are perfect…, while pdf is not …) ?

thank you again for your work!
Luigi

- Hermit Dave says:
  
  January 7, 2012 at 6:43 pm
  
  Only other rule is – words with a hyphen are split into distinct words. See if you can hook into Wiktionary.
  
Nerdr says:

January 25, 2012 at 2:20 pm

Hi Hermit Dave, if you are still checking this, would it be possible to include all the files in a single zipped download?

- Hermit Dave says:
  
  January 25, 2012 at 4:00 pm
  
  all files as in all language word lists ?
  
Jonathan Strang (@jrstrang) says:

January 26, 2012 at 11:39 pm

This is wonderful! You just enabled me to help a linguistics student here in Vancouver! Thanks!!!

- Hermit Dave says:
  
  January 27, 2012 at 6:47 am
  
  you are welcome
  
Nerdr says:

January 27, 2012 at 11:47 am

@Hermit Dave – Yes please! All word lists for all languages as a single download/single zip file.

And if you have time, by column too please!

- Hermit Dave says:
  
  January 27, 2012 at 12:10 pm
  
  I unfortunately don’t have them locally (they are on the hosting server). If i get around to generating them again, i will zip them up in a single archive.
  
  Having said that i will try to generate torrent files, one that references all the 50k zips and another one that references all full zips. Once i generate these, i will update this page with the torrent files.
  
  I however did not get what you mean by “by column too” !! each column currently offers a 50k zip and a full zip for that language. Providing all language download in same column is confusing – rather it should be a single entry at the very top or possibly on top of the table itself.
  
Nerdr says:

January 27, 2012 at 1:25 pm

Can you download them with FTP then pack them?

For the columns, I mean to say all the 50k for every language, and all the full for every language, as two separate downloads like you described first, Not everything in one column! It is just an extra option for downloaders to choose…not urgent or important really 🙂

You can host the files on a hosting site if bandwidth is a problem. Multi upload is a good option here.

- Hermit Dave says:
  
  January 27, 2012 at 2:00 pm
  
  🙂 i have downloaded them off my smallbusiness live hosting account.. and packaged them up… took a lot of clicks.
  50k can be uploaded to my host, full one is about 80megs and is not allowed.. will have to upload to megaupload.. its been a while since i uploaded anything there.. will have to do it from home..
  
Nerdr says:

January 27, 2012 at 2:23 pm

Megaupload is down forever, FBI raid last week 🙂 multi upload still works.

I too just downloaded the files individually, so not to worry about it (sorry for the bandwidth use!). I’m surprised you have hosting limits on uploads. I suggest moving to a better provider.

- Hermit Dave says:
  
  January 27, 2012 at 2:30 pm
  
  oops… sorry know about megaupload… i’d be out of touch if i didn’t know that.. meant multiupload – i used to host Windows Mobile ROMs that i used to create there :)… for some reason it says at upload initializing..
  
  well i used to have an excellent package which would give me tons of bandwidth and allow me to host couple of gigs of data however i was not using it.. i dont even know if it still works (actually i will check in a bit).. eventually i moved my email hosting to microsoft live a while back and moved hosting there as well.. worse is wordpress.. they allow you to upload tons of things including movies but not zipped files..
  
K. says:

February 14, 2012 at 10:40 am

Thanks a bunch.

Eric says:

February 15, 2012 at 11:26 pm

Hey Dave,

Your wordlists are interesting. So they are completely composed of tokenizing movie subtitles? I am currently working on twitter research and am trying to set up lists to rate words in many of these languages. Our master-lists were generated by tokenizing google books but the tokenizer separates strings at apostrophes. This is a huge problem for our french word-list since words like c’est were appearing as two different words c’ and est which made the set unusable. Are the words on your french wordlist split by apostrophes?

- Hermit Dave says:
  
  February 16, 2012 at 6:28 am
  
  Yes they are very interesting. Gives you something to think about. For subtitles, i used ‘ ‘ and ‘-‘ to split the words. The french subtitles were in good condition. English however had, say, don’t written as don’ t, and my code would assume don’ and t are two different words. So i changed the logic for english to say that if the last char of a word is ’ then join it with the next one, and that worked perfectly. try something like that.
  
  No, my french and english words keep their apostrophes
  
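
The splitting rule described in this reply can be sketched in Python as follows. It is a reading of the description, not the original C# code: the function name `tokenize` and its flag are hypothetical, and a plain ASCII apostrophe is assumed.

```python
import re


def tokenize(line, join_trailing_apostrophe=False):
    """Split on spaces and hyphens; optionally rejoin pieces like "don'" + "t"."""
    pieces = [piece for piece in re.split(r"[ -]+", line) if piece]
    if not join_trailing_apostrophe:
        return pieces
    tokens = []
    for piece in pieces:
        # If the previous token ends with an apostrophe, glue this piece onto it.
        if tokens and tokens[-1].endswith("'"):
            tokens[-1] += piece
        else:
            tokens.append(piece)
    return tokens
```

With the flag off, `tokenize("i don' t know")` yields four tokens including `don'` and `t`; with it on, the two are rejoined into `don't`. French forms such as `c'est` pass through unchanged either way because the apostrophe is never a split character.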
David Carroll says:

February 16, 2012 at 10:25 am

Wow. Awesome. I’m definitely planning to use your Albanian word lists in developing an early grades reading test.

Frapy says:

February 27, 2012 at 5:21 pm

Thanks a lot for these. But I can’t download them, is it normal ?

- Hermit Dave says:
  
  February 27, 2012 at 5:33 pm
  
  not under usual circumstances. However i am moving my host (wordpress doesn’t allow files to be hosted) so they are not available for that reason. I should have a solution (a place to host the files soon). If you want, i can email them to you on your hotmail.fr address – which ones would you like ?
  
- Hermit Dave says:
  
  February 29, 2012 at 12:47 pm
  
  the wordlists have been uploaded to skydrive.
  
alarichall says:

February 27, 2012 at 6:05 pm

Thanks for these lists! Just so you know, I used the Icelandic one as the basis for this vocab-learning page: http://www.memrise.com/set/10015051/the-250-commonest-words-in-icelandic-speech/

- Hermit Dave says:
  
  February 27, 2012 at 6:35 pm
  
  glad to hear that. good work.
  
Dan says:

February 29, 2012 at 1:54 am

Hello, it is mentioned above that the lists are being moved to a different host. Is this still in progress? I can’t find a date on any of these entries, so I have no idea if this is an abandoned project or not. I am particularly interested in a Korean list, but the others as well.

Thank you,
2/28/2012

- Hermit Dave says:
  
  February 29, 2012 at 6:05 am
  
  I do get dates on comments. So you didn’t need to add it. I think it should be resolved by this weekend (4th March 2012- Sunday).
  
- Hermit Dave says:
  
  February 29, 2012 at 12:47 pm
  
  uploaded to skydrive. try downloading them now
  
Claudio Santori Spadini says:

February 29, 2012 at 12:00 pm

Where is the list going to be after you move the files somewhere else and delete this post?
I really need the lithuanian files

- Hermit Dave says:
  
  February 29, 2012 at 12:14 pm
  
  give me 2 mins.. i have uploaded them to my skydrive and am linking them again
  
- Hermit Dave says:
  
  February 29, 2012 at 12:46 pm
  
  uploaded to skydrive. give it a go
  
  - Dan says:
    
    February 29, 2012 at 8:24 pm
    
    Thank you, this is much appreciated.
hakan says:

March 11, 2012 at 6:15 pm

i am studying english and i translate the most common words in my language to english to check if i have missing word. that’s very useful really appreciate it

Andrey says:

March 13, 2012 at 2:30 pm

Thank you, I really appreciate the effort.

Aran Chandran says:

April 23, 2012 at 9:00 am

Hey these are awesome! I’m trying to pick up some basic Norwegian before my trip to Norway and Iceland.

Reply
- Hermit Dave says:
  
  April 23, 2012 at 12:08 pm
  
  have a great trip 🙂
  
  Reply
M. Reynolds says:

April 24, 2012 at 9:12 am

Hi! I need to make my own frequency lists out of some documents I have in Chinese. Is there any way you could share the lemmatizer or concordancer you used for Mandarin?

I really appreciate your help. I use your Chinese frequency list almost every day.

Reply
- Hermit Dave says:
  
  April 24, 2012 at 9:34 am
  
  well i have a bit of c# code that churns through files. what format are your files in? are they utf-8 / unicode text files, or are they xml files? i have two sets of routines: one deals with data in text files and the other with specialised xml files
  
  Reply
- carlfordham says:
  
  October 4, 2012 at 12:35 pm
  
  Did you really get the Chinese wordlist to work? I downloaded zh_50K.txt, but no matter what options I choose when opening it in Microsoft Word or Open Office (both on Mac), it just displays corrupted characters. Any way around this?
  
  Reply
  - Hermit Dave says:
    
    October 4, 2012 at 12:44 pm
    
    I just downloaded the 50k zip and, after uncompressing it, opened the file with notepad just fine. try that.
  - carlfordham says:
    
    October 4, 2012 at 12:53 pm
    
    Hmm, it displays fine when I open it in BBEdit, but that’s only a trial version. Hmm, I’ll just copy the contents into Office… yes, that does seem to work now. Thank you.
Akuini says:

May 3, 2012 at 3:27 am

Hi Dave,
I’m creating an MS Word add-in to optimize AutoCorrect. The add-in is meant to shorten typing, not to correct typographical errors. One of its functions is to shorten the typing of English words with suffixes.

For example: caref – careful, darkn – darkness, acceptg – accepting.

So I need a list of English words (probably around 4000 words) to create a database for the add-in. I intend to share the add-in with the public as freeware and open source. Can I get your permission to use the list of English words from your Word Frequency List?
Thank you.

Akuini

Reply
- Hermit Dave says:
  
  May 3, 2012 at 7:59 am
  
  of course.. go ahead.. let me know if there’s any way i can help. good luck either way.
  
  Reply
Helge says:

May 16, 2012 at 4:35 pm

Thanks for the great work!
I use your list as input for a password/passphrase generator.

If you’re interested in another comprehensive wordlist (en_UK) with frequency classification, I also found this one:
http://www.bckelk.ukfsn.org/words/wlist.zip

While this list is not as long (57k words), things such as slang, typos and abbreviations are omitted.

Reply
- Hermit Dave says:
  
  May 17, 2012 at 1:54 pm
  
  thats nice.. i did come across many english language resources. my work was mainly to get my hands on other languages.
  
  Reply
George says:

June 18, 2012 at 11:48 am

I’ve been using the German wordlist for some Psychology experiments. We needed emotionally neutral words for the task we designed, and the top few were a great starting point. Extremely helpful, thanks very, very much!

Reply
- Hermit Dave says:
  
  June 18, 2012 at 11:49 am
  
  glad to see it being used in so many different areas 🙂 you are most welcome
  
  Reply
Billy says:

June 21, 2012 at 12:37 pm

Is there any way that you could add an Azeri list?

Reply
- Hermit Dave says:
  
  June 21, 2012 at 12:56 pm
  
  i dont have Azeri in my subtitle repository. When i do a refresh, i will try and pull in all available languages.
  
  Sorry
  
  Reply
Mitch says:

July 24, 2012 at 4:38 pm

Hi. Well done for a great site. Such a great resource. I am developing a word game and plan to use the lists for foreign language versions. It’s a crowded market so don’t expect to make any money but enjoying doing it. I plan to explain about the source of the word list but if you have developed a more “formal” list of any languages except for English, then I would be interested in obtaining (and paying) for them. If not then, your excellent list is a fantastic fall back position for me. Regards. Mitch.

Reply
- Hermit Dave says:
  
  July 24, 2012 at 5:10 pm
  
  Mitch,
  
  the ones available are the last version I built. I have created a smaller subset that is slightly cleaner 🙂 I use 25-30K lists for Slydr and now for a word game called Wordastic
  
  I’d like to emphasize the slightly clean part… words with odd chars removed etc.. nothing massive.
  
  Reply
  - Mitch says:
    
    July 24, 2012 at 10:49 pm
    
    Hi Dave. Thanks for your prompt reply. I think I am doing more clean up than that so I will continue working on your files (except English). This might seem a bit mad but hey ho, you never know … if you ever need a list of words in alphabetical order, where the frequency has been deleted, virtually all the words contain at least three consonants, all accents/acutes/etc have been either replaced with ordinary caps or deleted, no dupes then I’m your man! Did I say a BIT mad! It’s what I need for my app.
    Once again, thanks for a great resource. If I publish my game I’ll let you know!
    Regards.
    Mitch.
  - Hermit Dave says:
    
    July 25, 2012 at 8:13 am
    
    absolutely thats the spirit.. good luck with your game.. what platform are you targeting?
vinny says:

August 1, 2012 at 3:54 am

Thanks, it’s great work. God bless you.

Reply
Prasad says:

September 19, 2012 at 5:16 am

Hi Hermit ,

Thanks for the wonderful collection of word lists from various languages!

I am using some of the English word lists as a marker against a dictionary word list in my word game to determine the difficulty level for individual words. It is an indirect use of your work. I would like to know how I can acknowledge/credit you?

Cheers ,
Prasad

Reply
- Hermit Dave says:
  
  September 19, 2012 at 7:53 am
  
  If you want to credit me.. just mentioning me here or @hermitdave on twitter will do 🙂
  
  Reply
Prasad says:

September 22, 2012 at 10:14 am

Hi Hermit,

Thanks for your prompt reply. The game is called “Code-Z” and it is now available in English on Android (http://market.android.com/search?q=pname%3Acom.codez). I have added your name and web site in the Credits/Acknowledgment section.

Pretty soon we will be rolling out a German version and I have used your wordlist as a difficulty marker for it too. I am planning for more languages. Will keep you updated.

Cheers ,
Prasad

Reply
Tony Kimball says:

September 24, 2012 at 12:18 am

The skydrive links appear broken 23 sep 2012.

Reply
Tony Kimball says:

September 24, 2012 at 12:20 am

However, some links work, and link to all the files, so – no matter.

Reply
albert says:

September 24, 2012 at 10:06 pm

There is something strange ( file: en.zip from line 180394)

makе 3
іntο 3
mergеr 3
саse 3
οwn 3
сhіnese 3
againѕt 3
ѕех 3
оther 3
аlwаys 3
ﬁrѕt 3
сοuld 3
tаlk 3
αуе 3
wе’ve 3
іѕn’t 3
mіght 3
сοmе 3
gоd 3
gіvе 3
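
The entries above look like ordinary English but contain look-alike letters from other alphabets (Cyrillic, Greek, a Latin ligature). A minimal Python sketch of one way to flag such entries follows; it is not the code used to build the lists, and the function names are made up for illustration. It relies on the fact that Unicode character names begin with the script name.

```python
import unicodedata


def scripts_in(word):
    """Return the set of script names (e.g. LATIN, CYRILLIC) used by the letters of a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER IE".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def has_foreign_letters(word, expected="LATIN"):
    """True when any letter of the word belongs to a script other than the expected one."""
    return bool(scripts_in(word) - {expected})


# "mak\u0435" ends in a Cyrillic letter that looks like a Latin "e".
for word in ["make", "mak\u0435"]:
    print(repr(word), sorted(scripts_in(word)), has_foreign_letters(word))
```

Checking against the expected script, rather than only looking for mixed scripts, also catches entries written entirely in look-alike Cyrillic letters.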

Reply
- Hermit Dave says:
  
  September 26, 2012 at 9:14 am
  
  I will check later
  
  Reply
Andrea says:

September 29, 2012 at 3:52 pm

Hi Dave,

thanks for the data, which could be really useful for me. Do you also have a document frequency list of the OpenSubtitles corpus?

Thanks!

Andrea

Reply
- Hermit Dave says:
  
  October 2, 2012 at 1:35 pm
  
  Andrea.. i am not sure i fully understand what you asked. I did document how many files were consumed and the unique / total word counts.. what other parameters are you looking for ?
  
  Reply
moon face says:

October 10, 2012 at 2:58 pm

Hi Hermit, thanks for these. I will be using a tweaked version of your French list on my website http://www.surfacelanguages.com before long. I hope that is ok and will credit you when I add it.

Cheers,
MF

Reply
El Presidente says:

October 13, 2012 at 10:47 am

Hi Hermit, sorry that it took so long to respond. What I actually need is a document frequency list: a frequency list of the documents where that specific word appears… do you think this is doable for you?

cheers!

am

Reply
- El Presidente says:
  
  October 13, 2012 at 10:53 am
  
  whoops, maybe I was not clear enough… in other words, a list where the frequency number corresponds to the number of documents where that word is to be found.
  
  tks++
  
  am
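
A small Python sketch of document frequency as described here: each document adds at most 1 to a word's count, no matter how often the word occurs in it. The tokenizer (a plain `\w+` regex, which splits on apostrophes) and the sample sentences are assumptions for illustration, not part of the published lists.

```python
from collections import Counter
import re

WORD_RE = re.compile(r"\w+")


def document_frequency(documents):
    """Map each word to the number of documents that contain it at least once."""
    df = Counter()
    for text in documents:
        # A set removes repeats, so each document contributes at most 1 per word.
        df.update(set(WORD_RE.findall(text.lower())))
    return df


docs = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]
df = document_frequency(docs)
print(df["the"], df["cat"], df["mat"])
```

With subtitle data, one subtitle file would play the role of one document.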
  
  Reply
me says:

November 9, 2012 at 7:05 pm

Using it to learn Spanish. Thank you!

Reply
Endre says:

November 12, 2012 at 5:32 pm

Dave, thanks for your work, it can be put to so many uses. I have recently learned that the creator of a smartphone keyboard (which all use wordlists for prediction/validation) used your lists (as one input among others). I have now noticed that there seems to be an unusually high number of words that are wrongly spelt in lower case instead of with capitals. In many languages this only affects proper nouns (which is bad enough), but for some languages which use capitals for regular nouns (like German), your list needs a lot of cleaning up before it can be relied on.
So my question is: do you do any processing that can cause this effect, or are all these errors really in the subtitle files?
Feel free to answer by e-mail if you like.

Best regards from Nuremberg,
Endre

Reply
- Hermit Dave says:
  
  November 13, 2012 at 10:01 am
  
  the repo i used is an open source repo. additionally, i have little knowledge of how capital letters are used in non-english languages. the two combined meant that i couldn’t know whether each subtitle creator used capitalisation correctly, nor could i judge that myself.
  
  for that reason, i force all words to lower case to build the frequency word list. i myself used it in Slydr (a keyboard-like app for Windows Phone) which i created last year. i can look further, but again, without language specific input i am helpless 😦
  
  Reply
Endre says:

November 14, 2012 at 8:47 pm

Thanks for the explanation, Dave. I guess it probably depends on what you actually want to do with the lists. If you want to use the frequency data alone (whatever they might be good for on their own) forcing lower case might be fine. However, if you want to re-use the actual words, I’m afraid that forcing all words to lower case might do more harm than good, particularly (but not only) in languages with extensive use of capitalisation like German.

All other corpus-based word lists I’ve come across so far leave the data untouched¹. After all, anyone can convert a list to lower case (and recompute frequencies if desired) with minimal effort. However, restoring the original state from a lower-case list is impossible without external sources (and difficult even *with* external help, for instance dictionaries or spellcheckers). So here’s an emphatic vote to leave the data unchanged, even if this might mean double entries for many words – which again would also carry potentially useful information, e.g. on the likelihood of occurrence of a particular lemma at the start of an utterance.

¹ I have seen one corpus-based list carry additional entries (with asterisks, e.g. That* or Man*) for upper-case occurrences of words whose dictionary form is lower-case, presumably where context indicated that the upper case was attributable to the position of the word (beginning of sentence or paragraph). Similarly, in your case one might argue that in occurrences where it’s reasonably likely that the capitalisation is due to the word’s position (rather than being a basic attribute of the word like in proper nouns), it makes sense to convert to lower case before processing (e.g. computing frequencies). This way, you might even provide added value to the users of the lists (who don’t have context information to make that distinction). In contrast, with indiscriminate lc conversion you do what any user can do if they want (so no real value added), but at the same time you corrupt the list for many uses.
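
To illustrate the point that a case-preserving list can always be folded to lower case afterwards, here is a rough Python sketch. It assumes the list was counted case-sensitively; the function name and the toy German counts are invented for the example.

```python
from collections import Counter, defaultdict


def fold_case(cased_counts):
    """Merge case variants: total frequency per lower-cased word, plus its most frequent spelling."""
    totals = Counter()
    variants = defaultdict(Counter)
    for word, count in cased_counts.items():
        key = word.lower()
        totals[key] += count
        variants[key][word] += count
    # For each lower-case key, keep the spelling that occurred most often.
    return {key: (variants[key].most_common(1)[0][0], totals[key]) for key in totals}


cased = Counter({"Haus": 40, "haus": 2, "das": 90, "Das": 30})
print(fold_case(cased))
```

Going the other way, from an all-lower-case list back to "Haus", is not possible without outside information, which is the argument made above.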

Reply
- Hermit Dave says:
  
  November 14, 2012 at 10:21 pm
  
  Endre,
  
  You make a fine point. It’s easy to rework the code to compute frequencies in lower case and persist the case-specific word. I do, however, need a few days. Thanks for persisting and making your case so clearly.
  
  Hermit
  
  Reply
Nick Bloom says:

November 22, 2012 at 7:03 pm

Hi Dave,
I’m still trying to open the Hebrew file with MS Word on OS X. I have tried 20 different encodings, including Unicode UTF-8, and I still get gibberish; the same happens with nearly all the files. Is there a quick fix? Also, concerning the English files, did you come across a resource with POS and/or IPA transcriptions? Kind regards, Nick

Reply
- Hermit Dave says:
  
  November 23, 2012 at 2:48 pm
  
  Have you tried TextEdit ? i believe that is the OS X text file editor ?
  
  Reply
Michael De Bois says:

November 30, 2012 at 5:17 pm

Thanks a lot for the lists, Hermit! Now, three brief questions:
1) What version of the opensubtitles corpora did you use? Did you use only this source for the lists?
2) In the end, did you use all the available translations for each movie or just one?
3) For some reason, the Hebrew list seems to have a high degree of dissimilarity with other equivalent lists from purely written language. Any idea?

best,
Michael

Reply
- Michael De Bois says:
  
  December 5, 2012 at 9:21 pm
  
  I found the difference – the Hebrew list does not have 1-character words! Perhaps you did not update it? I remember that at some point you were not including 1-char words. Am I right?
  
  Reply
  - Hermit Dave says:
    
    December 6, 2012 at 11:15 pm
    
    that is correct. I however tend to rebuild them at the same time and I am sure I did add single character entries after some discussion here. let me check it tomorrow and if required rerun the code again
Michael De Bois says:

December 7, 2012 at 5:31 pm

Thank you Dave, great job!

Reply
- Hermit Dave says:
  
  December 10, 2012 at 3:44 pm
  
  I have looked at the he.zip and he_50k.zip and i can see single character entries in there.
  
  Reply
Svetlin says:

December 11, 2012 at 5:12 pm

Hi Dave,
Thanks for the frequency lists! I am writing a little class term paper (totally nonbinding) and need to explain the source of the Bulgarian corpus (besides the info you put in the log text file). Do you know where you took the Bulgarian frequency lists from? Was it from the Bulgarian National Corpus (that is written) or did you also get info from a spoken corpus? How do I quote your frequency lists? Thanks for your help!

Best,
Svetlin

Reply
- Hermit Dave says:
  
  December 12, 2012 at 6:41 pm
  
  it was generated using bulgarian subtitles in xml format from opensubtitles.org… was hosted by someone else.. you can quote me by name and just link me to this blog or @hermitdave as on twitter
  
  Reply
- Hermit Dave says:
  
  December 12, 2012 at 6:41 pm
  
  subtitles belong to spoken corpora
  
  Reply
  - Marco Albanese says:
    
    December 12, 2012 at 7:07 pm
    
    Why do I see only number codes when I extract the file? Help me please!
    Thank you!
  - Hermit Dave says:
    
    December 13, 2012 at 10:10 am
    
    which file are you trying to open and what program are you using ? try notepad
  - Svetlin says:
    
    December 18, 2012 at 5:30 pm
    
    Thank you so much! Really appreciate your help!
    
    Svetlin
Nikola says:

December 15, 2012 at 12:40 pm

Thank you people.

Reply
Lukas says:

December 17, 2012 at 5:28 pm

Hi Hermit,

I have a question about the license.
I didn’t find your email-adress anywhere, so could you drop me a line?
You can also reach me on Twitter: @lukaskawerau

Would love to hear from you,
Lukas

Reply
sandover (@sandover) says:

January 7, 2013 at 7:46 pm

Using it to get ideas for naming my software project. Thanks!

Reply
sandrushba says:

January 24, 2013 at 11:54 am

Thanks a lot!

Reply
Nate says:

February 4, 2013 at 11:26 pm

Hi Dave. Awesome list. For each corpus did you only use subtitles for movies that were in their native language (i.e., only French film subtitles like Amélie for the French corpus)? Or did you also include subtitles that were translated from different languages?

Reply
- Hermit Dave says:
  
  February 5, 2013 at 6:20 am
  
  Each corpus uses open source / user created subtitles for movies. Only ones that feature xml output were used and only 1 set per movie
  
  Reply
Endre says:

February 5, 2013 at 9:27 am

The corpora include translated material (in fact in most languages other than English, an overwhelming majority of the corpus will consist of subtitles translated from English). This does introduce a certain skew that is particularly noticeable with names – while non-English corpora typically contain English names in dozens of variants, many names from the respective language are missing or underrepresented. All in all, the picture that the corpora give you represent the language as it is used in blockbusters in your local cineplex, *not* a more general picture of the language at large. That’s an inevitable consequence of the sample used and not a deficiency per se, just something to keep in mind.

Reply
Monika says:

February 19, 2013 at 2:01 pm

Hi Hermit,

How did you make these? Did you use a script for it? I ask because I would like to find, or build myself, such a list for specific purposes: narrow subjects like the most frequent words for nurses, lawyers, construction workers and other such groups. I would appreciate it very much if you could help me in any way to establish such lists, or give me some tools or advice on how to do it in a relatively easy, fast and cheap way.
Hoping to hear from you soon, I wish you all the best.

Monika

Reply
- Hermit Dave says:
  
  February 19, 2013 at 11:08 pm
  
  I created a program that could iterate through my data source. Any program that you want to use needs to be data specific. what is the source of your data ?
  
  Reply
  - Monika says:
    
    February 25, 2013 at 4:04 pm
    
    If I don’t find any ready-made lists, I will have to do it by hand, putting words and expressions into an Excel document. There is no single source. I have found a few books for teaching English as a second language for specific purposes (law, nursing or medicine, for instance); then I will use suitable dictionaries, books for students of law, medicine and nursing school, some websites where this kind of vocabulary is used, etc. It will take me weeks of hard work 😦 So I will probably get lists of a few thousand words for some of the disciplines. And then I have to check the frequency of every word or expression (maybe with Google search) to choose the most frequent ones. That is very hard work, so I am looking for any way to do it faster or more easily. So far I have not found a better idea. Do you have any, maybe?
  - Hermit Dave says:
    
    February 25, 2013 at 4:48 pm
    
    The way I have done is using files. I can share my solution with you which can scan all files in a directory and generate word list out of it. All you’d need to do is create relevant files and run the app. Let me know if that’s good enough.
  - Monika says:
    
    April 24, 2013 at 2:18 pm
    
    Hi Hermit,
    I’m so sorry to answer after such a long time, but I had so much work and other obligations… Yes, I think your proposal is very kind and your application is sufficient for my needs. I would appreciate it very much if you shared your application with me. I will then build a folder with the documents I have already found and run your application. It would save me a lot of time and hard work.
    Kind regards,
    Monika
  - Hermit Dave says:
    
    April 26, 2013 at 9:25 am
    
    Monika,
    
    I have created a small app for you for the purpose. Download it (details on the top of this page). Let me know how it works out
    
    Hermit
  - Monika says:
    
    April 29, 2013 at 8:57 pm
    
    Hello Hermit,
    thank you very much for helping me. I have finally managed with your application 🙂 It works 🙂 You’re an angel 🙂 Thank you. Tell me:
    1) Does it scan only one text document in the folder, or all .txt documents? It scanned only one of the .txt files in the chosen folder.
    2) Do you have any idea how to copy the results from one column (numbers and words in the .txt document are, let’s say, in one “column”) into two columns in an Excel document, with numbers in column A and words in column B?
    Regards,
    Monika
  - Hermit Dave says:
    
    April 29, 2013 at 9:13 pm
    
    Monika,
    
    I just ran a test with 3 English files and it processed them all – I did a random word check in the list. If in doubt, email them to me at hermitd at Hotmaildotcom
    
    To move columns etc, it’s easiest to open the file in Excel (delimited file option) and then cut / copy / paste as needed.
    
    Hermit
  - Monika says:
    
    April 30, 2013 at 11:12 am
    
    ok, I will try again with another folder and other files 🙂 It was quite hard for me the first time; I couldn’t understand at first what it was all about 🙂 Maybe the second time will be easier 🙂 I need to get used to this application. From today I will not have any access to a computer or the internet for 5 days, so I will let you know next week if I succeed this time 🙂 Thank you Hermit for helping me. You’re very kind to me.
  - Hermit Dave says:
    
    May 1, 2013 at 10:38 pm
    
    sorry its a simple program with a very simple code behind it. nothing fancy.. something I used for myself
  - Monika says:
    
    May 21, 2013 at 10:37 am
    
    Hello Hermit,
    
    thank you very much for your app.
    I have finally succeeded in running it. It worked this time: all the .txt files were scanned and I got a frequency list of all of them. It will facilitate my teaching job a lot!
    
    Would it be possible in any way to use it for PDF documents, as I have a lot of books in PDF format, or to make a frequency list from some websites? For instance, I would like to prepare frequency lists for my students from online journals like Le Figaro or Le Monde.
    
    Cheers,
    Monika
  - Hermit Dave says:
    
    May 21, 2013 at 11:55 am
    
    glad it worked. The problem with PDF is manifold: it can be text + image, or image only, etc. It’s difficult to work out. The easier solution is to extract the text from the PDF and operate on the extracted data.
    
    PDF readers have an option for saving contents to text files.
    
    Websites are easier but they are a different kettle of fish, as the code will have to deal with markup etc. it’s not difficult – just annoying, as it’s easy to break such a mechanism. Plus websites do not like screen scraping and move swiftly to block the IPs
    
    Hermit
Ruth says:

February 21, 2013 at 11:31 pm

Hello everyone, I have just downloaded a Serbian Latin file. Maybe I don’t know how to use it, since my list contains all the words (e.g. Mozda, Zasto…) but not their meanings -_- It is displayed like this:
“samo 178651
od 167786
bi 163893 ”
How can I get the meanings of these words? I don’t really trust certain sites that give several different definitions. Could you help me, please? Thank you.

Reply
- Hermit Dave says:
  
  February 22, 2013 at 6:53 am
  
  Hi, this list is, as stated, a frequency list. It just shows which words are used more often than others. Unfortunately, I don’t maintain a dictionary. Take a look at http://wiktionary.org/
  
  Reply
Steve Ridout says:

February 28, 2013 at 6:35 pm

Thanks a lot for this, it’s made it very simple for me to prioritise the most important words in the language learning site I’m working on.

(PS: If you’re interested it’s here: http://readlang.com, and I’ll be adding an attributions page to my site soon, it’s all a bit rough at the moment :))

Reply
- Hermit Dave says:
  
  February 28, 2013 at 7:21 pm
  
  Just had a look at the video.. Nice work I must say.. I am going to develop a language helper app over the next days
  
  Reply
Eric says:

February 28, 2013 at 9:29 pm

Hello, I get an error when I try to download any of the files. Is the server down?

Reply
- Hermit Dave says:
  
  February 28, 2013 at 9:46 pm
  
  nope I can download them just fine
  
  Reply
Jarvis says:

March 10, 2013 at 2:13 am

Hi Dave,

Thanks for your work. I downloaded the Simplified Chinese list but found that there are a lot of Traditional characters in it. Maybe the resources you use are a mixture of Traditional Chinese and Simplified Chinese? Not sure. But it’s very useful anyway.

Reply
- Hermit Dave says:
  
  March 11, 2013 at 9:23 am
  
  Jarvis,
  
  I have two sets of files: one was a common Chinese dictionary which had words in both Simplified and Traditional Chinese side by side, and the other was the subtitles, which I thought were in only one of the two (I can’t remember which). I used the subtitles word list and then the dictionary to build a dictionary for both.
  
  Unfortunately I don’t know either script well enough to notice the mixing of characters. I apologize.
  
  Reply
Taoufik says:

March 13, 2013 at 5:28 pm

Taoufik from Morocco
thank you for all these word list it was very very helpful for me.

Reply
Paul says:

March 13, 2013 at 11:08 pm

Hello. My name is Paul. Thanks for these wordlists! I will use the wordlist ‘en.zip’ for my research on the readability of English articles. But can I ask how you built these wordlists, so that I can build one myself if I need to (I may build a wordlist for ESL/EFL learners)?

Reply
- Hermit Dave says:
  
  March 14, 2013 at 8:30 am
  
  Paul,
  
  Assuming you have tons of relevant text material, building the list involves going through it and maintaining word and occurrence details.
  
  The larger the data set, the better the results.
  I am going to post a simple program later today that can do just that.
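
The process described here (go through the text, keep a running count per word) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C# program mentioned in this thread: the function names are hypothetical, words are lower-cased, and a "word" is taken to be a run of letters with optional inner apostrophes.

```python
from collections import Counter
from pathlib import Path
import re

# Runs of Unicode letters, optionally joined by a straight or curly apostrophe.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_words(directory):
    """Count word occurrences across every .txt file directly inside a directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(WORD_RE.findall(text.lower()))
    return counts


def write_word_list(counts, output_path):
    """Write 'word count' lines, most frequent first."""
    with open(output_path, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            out.write(f"{word} {count}\n")
```

Calling `write_word_list(count_words("my_texts"), "my_list.txt")` would produce a file in the same "word number" layout as the lists on this page; both paths are placeholders.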
  
  Reply
jan says:

March 26, 2013 at 5:18 pm

thank you very much! It is very hard to find a word list of Estonian vocabulary. I actually use yours to learn Estonian.

Reply
Monika says:

May 24, 2013 at 7:19 am

Hermit, thank you for your help. I appreciate it very much. I wish you all the best 🙂

Reply
Chuck says:

May 28, 2013 at 2:09 am

Thank you for these great lists – it is a fabulous resource!

Reply
Dino says:

June 5, 2013 at 11:59 pm

i don’t understand how this works 😦

Reply
- Dino says:
  
  June 6, 2013 at 7:47 pm
  
  ok now i know
  
  Reply
zeppo says:

July 14, 2013 at 2:42 am

Hi, Hermit Dave. I’m guessing your objective by using subtitles is to provide guidance to the study of spoken language. With this in mind, would it be possible to generate the frequency of combinations of words? For instance, in the previous sentence, if you searched for two word combinations, it would check the frequency of “would it”, “it be”, “be possible”, and so on. This could pick up on more frequently used phrases, idioms, etc that may have a meaning in their combination that isn’t revealed in a basic study of the individual words. It would really help in the study of idioms.
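
The idea of counting word combinations can be sketched in Python as follows. It assumes each subtitle line is treated as its own unit, so that combinations never span two lines, and that words are separated by whitespace; the function name and sample lines are made up for illustration.

```python
from collections import Counter


def ngram_counts(sentences, n=2):
    """Count runs of n consecutive words, never crossing a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # zip over n staggered copies of the word list yields each n-word window.
        counts.update(" ".join(gram) for gram in zip(*(words[i:] for i in range(n))))
    return counts


lines = ["would it be possible", "it would be nice", "would it be ok"]
print(ngram_counts(lines, 2).most_common(2))
```

Passing `n=3` or `n=4` gives three-word and four-word combinations with the same function.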

Reply
- Hermit Dave says:
  
  July 14, 2013 at 6:29 am
  
  This is perfectly possible. A small tweak to the generator should make this possible. Good idea. Let me muck around with that.
  
  Reply
  - Felipelipe says:
    
    November 12, 2014 at 11:36 pm
    
    Hi Hermit, did you manage to create this tweak to check the frequency of phrases?
  - Hermit Dave says:
    
    November 13, 2014 at 12:56 pm
    
    nope, I never got around to it 😦
zeppo says:

July 18, 2013 at 1:35 am

Thanks. You could do it for two consecutive words, three consecutive words, four consecutive words, etc. The combinations with the most hits would be the ones worth studying, particularly among the longer word combos, and those with few hits could be ignored. If the results can be imported into spreadsheet cells, those cells could then be programmed to “highlight” (i.e. the background changes to yellow, for instance) when their contents also appear in a second spreadsheet made up of common idioms (pulled from an idiom dictionary). Then the student could go down the list of highlighted results and study those idioms in order of frequency.
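
The "highlight" step could also be done outside a spreadsheet. The sketch below assumes phrase counts are already available as a dictionary and that an idiom list exists as plain strings; the names and numbers are invented for the example.

```python
def rank_idioms(phrase_counts, idioms):
    """Return (phrase, count) pairs for phrases found in the idiom list, most frequent first."""
    known = {idiom.lower() for idiom in idioms}
    hits = [(p, c) for p, c in phrase_counts.items() if p.lower() in known]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


phrase_counts = {"piece of cake": 12, "of the": 900, "break a leg": 30}
idioms = ["Break a leg", "Piece of cake", "Under the weather"]
print(rank_idioms(phrase_counts, idioms))
```

Frequent but uninteresting combinations such as "of the" drop out because they are not in the idiom list.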

Reply
Nadwah Onwi says:

July 24, 2013 at 7:16 pm

Hi Hermit Dave,
I just came across your website while writing up my master’s thesis on validating the vocabulary of spoken Malay for use by people with speech and language impairments. I have always struggled to find the high-frequency words in Malay, since few rigorous studies have used SPOKEN language as their source. Do you mind if I email you regarding the details of how you generated the list? I will definitely cite your work in my thesis, since your word list has a lot in common with my list. It would be worthwhile to discuss our findings and develop the knowledge together. Thank you for such an awesome job!

Reply
- Hermit Dave says:
  
  July 24, 2013 at 7:47 pm
  
  Of course. My email address is hermitd at hotmail dotcom. Look forward to hearing more about your work,
  
  Regards,
  
  Hermit
  
  Reply
John says:

July 26, 2013 at 10:01 pm

¡Que bueno! You’re the man. I’m using the Spanish list for a vocabulary to-do list. Muchas gracias. -John

Reply
bovinated says:

July 28, 2013 at 9:34 pm

Hello!

I used your word list in https://github.com/dw/cheatcodes/ , which is a simple function for mapping BitTorrent magnet URIs to spoken English. This was just a Sunday afternoon project, but I might try to improve it later (e.g. minimize the soundex/Levenshtein score of the chosen words).

Thanks for an excellent resource!

Reply
Seascent says:

August 15, 2013 at 6:56 pm

Hi!

I stumbled onto your word list when searching for list of most commonly used German words. However, when I tried opening it, it’s gibberish.

I use Mac and it opens by default in Text Edit.

Could you please advise? Thank you!

Reply
Seascent says:

August 15, 2013 at 6:58 pm

Oh, also what do you mean by the choices ’50K’, ‘Full’ and ‘Log’? Thanks again!

Reply
- Hermit Dave says:
  
  August 15, 2013 at 7:48 pm
  
  you need to open the file as UTF8 encoded text file.
  On windows I use Notepad and it can open it without any issues.
  
  50K files are word list containing top 50000 or 50K entries.
  Full files are full wordlists
  Log files are logs of word list generation containing a few metrics like total word count, unique word count and total number of files processed.
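
For anyone who would rather load a list in code than in a text editor, here is a minimal Python sketch that reads a file as UTF-8 in the "word number" layout described on this page. The function name is hypothetical, and skipping malformed lines is a choice made for this example.

```python
def read_word_list(path, limit=None):
    """Read 'word count' lines from a UTF-8 file into a list of (word, count) tuples."""
    entries = []
    # utf-8-sig also accepts files that start with a byte order mark.
    with open(path, encoding="utf-8-sig") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2 or not parts[1].isdecimal():
                continue  # skip blank or malformed lines
            entries.append((parts[0], int(parts[1])))
            if limit is not None and len(entries) >= limit:
                break
    return entries
```

For example, `read_word_list("de_50k.txt", limit=100)` would return the first 100 entries of a downloaded list; the file name is a placeholder.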
  
  Reply
Brian ONeill (@boneill42) says:

August 25, 2013 at 3:06 pm

Thank you so much! I’m using this in a book i’m writing on distributed processing w/ Storm. I’ll be sure to cite you as the source!

Reply
carl says:

October 14, 2013 at 3:30 pm

love this file! can’t wait to play with it!

Reply
rensbey says:

October 15, 2013 at 11:57 am

Hi there. Fantastic initiative and very useful. Thanks very much! About the Korean lists: it seems the ko-2011 wordlist has inadvertently been run interspersed with Russian (according to Google Translate auto-detection)? Would it be possible at all to re-run this without Russian co-mingled? Also, this list (ko.txt) doesn’t display properly in Notepad by default as a result (I think the Russian confuses it). It can be opened in a web browser or MS Word just fine (but still has the Russian entries). Also, the ko-2012 list appears to be missing from Skydrive. There is a kk-2012 list (no idea which language this is though). This is the only subtitle-based Korean list I’ve found on the Internet so far so I’m super keen to get a working file! Thanks again 🙂
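
One rough way to strip wrong-language entries from a list like this is to keep only words whose letters fall inside the expected Unicode range. The Python sketch below uses the Hangul Syllables block (U+AC00 to U+D7A3) for Korean; the function names and sample counts are made up, and a real clean-up would probably need more ranges than this one.

```python
# Hangul Syllables block; jamo and compatibility jamo are left out of this sketch.
HANGUL_RANGES = [(0xAC00, 0xD7A3)]


def in_ranges(ch, ranges):
    """True when the character's code point lies inside any (low, high) range."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def keep_script(entries, ranges):
    """Keep (word, count) entries whose letters all fall inside the given code point ranges."""
    kept = []
    for word, count in entries:
        letters = [ch for ch in word if ch.isalpha()]
        if letters and all(in_ranges(ch, ranges) for ch in letters):
            kept.append((word, count))
    return kept


# One Korean word, one Russian word and one Latin-script word.
entries = [("\uc548\ub155", 120), ("\u043f\u0440\u0438\u0432\u0435\u0442", 80), ("ok", 5)]
print(keep_script(entries, HANGUL_RANGES))
```

The same function works for other scripts if a different list of ranges is passed in.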

Reply
- Hermit Dave says:
  
  October 15, 2013 at 12:12 pm
  
  Ross,
  
  I have seen such wrong-language entries while checking larger lists like Arabic and Hebrew. Like Mandarin, those are easier to spot. Sadly, beyond hard-coding language-based character ranges, it is difficult to generate clean lists.
  
  I did, however, clean a few when I was consuming them. I simply rename the files to CSV, open them in Excel, sort alphabetically, cut out what I don’t need, and save the output again in whatever format I need.
  
  I will check my output repo at home for 2012’s Korean list. KK refers to Kazakh.
  
  http://www.loc.gov/standards/iso639-2/php/code_list.php
  
  Reply
  - rensbey says:
    
    October 15, 2013 at 12:32 pm
    
    OK – cool. I’ll simply employ the same technique for now and cleanse using Excel 🙂
    Ahhh so it was Kazakh! It even stumped Google, that one! Thanks again – you are doing the language learning communities of the world a massive favour.
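The character-range idea mentioned in this thread can be sketched as follows. This is not the author's code, only an illustration that assumes a word belongs to the Korean list when every character falls inside a Hangul Unicode block; the ranges chosen and the function names are assumptions.

```python
# Unicode blocks treated as "Korean" for this sketch (an assumption, not a
# complete definition): Hangul Syllables, Hangul Jamo, Hangul Compatibility Jamo.
HANGUL_RANGES = [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)]

def in_ranges(ch, ranges):
    """True if the character's code point falls inside any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)

def keep_word(word, ranges):
    """Keep a word only if it is non-empty and every character is in range."""
    return bool(word) and all(in_ranges(ch, ranges) for ch in word)

entries = [("안녕", 120), ("привет", 40), ("사랑", 95), ("hello", 10)]
cleaned = [(w, c) for w, c in entries if keep_word(w, HANGUL_RANGES)]
print(cleaned)
```

The same filter works for other languages by swapping in a different list of ranges, which is the per-language hard-coding the reply refers to.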
Sean says:

November 2, 2013 at 4:30 am

Hey Dave, great work!

I want to “lemmatize” these lists but I’d like to get the 53GB (as of a few years ago) files from opensubtitles.org. How did you accomplish this? Did you email them and ask nicely or what? I know that subtitles are a new paradigm in word frequency studies and I’d like to create my own. I’ll share them with you to post when I get done!

Best,
Sean

Reply
- Hermit Dave says:
  
  November 4, 2013 at 3:11 pm
  
  Sean,
  
  I thought I had blogged about it and commented previously. Anyway, it is at http://opus.lingfil.uu.se/ and you will see OpenSubtitles 2011 in there. Download it from there.
  
  Reply
Thomas Pronk says:

November 13, 2013 at 3:26 pm

Hey! I’m a PhD student studying cultural differences in drug use, and I’ll be using them in an Implicit Association Task. Thnx!

Reply
User:Saltmarsh on Wiktionary says:

November 16, 2013 at 7:07 pm

Thanks for your work: I will find the Greek frequency list useful on Wiktionary.

Reply
MP says:

November 20, 2013 at 9:06 pm

Thanks. I’m using this to supplement a Spanish word list I created using the EuroParl corpus. I just wanted to detect Spanish words (vs. English words) and unfortunately the EuroParl corpus is too formal for the type of data I’m using (it doesn’t contain common insults, for example). Subtitles are much more conversational, which is what I need.

Reply
Stefan Auer says:

January 6, 2014 at 3:28 pm

Hello!
Thank you for your word lists. They are very helpful for a project I am currently working on: it is an OCR project for the University of Applied Sciences in Salzburg and I am using them to determine the quality of an OCR framework. Currently I am using your German, English and Spanish word lists.
Greetings from Austria,
Stefan Auer, BSc

Reply
- Hermit Dave says:
  
  January 7, 2014 at 8:50 am
  
  That is absolutely awesome 🙂
  
  Reply
sneak a peek here says:

February 3, 2014 at 5:10 am

When I initially commented I clicked the “Notify me when new comments are added” checkbox and now each time a comment is added I get four emails with the same comment. Is there any way you can remove people from that service?
Many thanks!

Reply
- Hermit Dave says:
  
  February 12, 2014 at 9:50 am
  
  Sorry, I have no idea. Check whether there’s an unsubscribe option in the emails you get.
  
  Reply
Tim In Dublin says:

February 23, 2014 at 11:22 pm

Hey, thanks, this is great information and a great place to start for creating my own. I’m working in Scala and playing around with some custom word extractions this weekend. Question: how do you go about selecting subtitle files to download? Is there some kind of bulk option to grab a bunch of them, or did you find yourself trying to select specific movies? Thank you

Reply
- Hermit Dave says:
  
  March 2, 2014 at 10:39 pm
  
  Did I reply to this post? I used subtitles downloaded by someone else 🙂 Search for the open subtitle corpus 🙂
  
  Reply
Cam Morris says:

April 2, 2014 at 2:52 pm

Thanks for these lists. I’m building them into my open-source password tool, OWASP Passfault. They are great. I’m combining 2011 and 2012 and throwing out the #1 and #2 occurrences as outliers (they seem to have the most typos). Thanks for making them available!

Reply
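Combining two yearly lists as described in the comment above can be sketched like this. It assumes "throwing out the #1 and #2 occurrences" means dropping words seen only once or twice, and it applies that cut after merging; both are interpretations, and the function names are hypothetical.

```python
from collections import Counter

def merge_lists(*lists):
    """Sum the counts of each word across several (word, count) lists."""
    total = Counter()
    for entries in lists:
        for word, count in entries:
            total[word] += count
    return total

def drop_rare(counts, min_count=3):
    """Discard words whose combined count is below min_count (likely typos)."""
    return {w: c for w, c in counts.items() if c >= min_count}

list_2011 = [("the", 500), ("teh", 1), ("cat", 2)]
list_2012 = [("the", 700), ("cat", 4), ("dgo", 2)]

merged = merge_lists(list_2011, list_2012)
filtered = drop_rare(merged)
print(filtered)
```

Filtering after the merge keeps a word such as `cat` that is rare in one year but not in the other; filtering each list first would be stricter.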
P, J, or J says:

April 11, 2014 at 10:57 am

Great work! Is there a nice easy way to point this at a wikipedia in a language and get a word frequency list from that? I’m mainly interested in the small to mid sized languages with under 50,000 articles. From there it might be possible to create some games to verify words and improve some of the dictionaries for a few of these.

Reply
- Hermit Dave says:
  
  April 14, 2014 at 9:48 pm
  
  Many have done so and it could be done, but I needed spoken language and I found the subtitle source. Wikipedia does allow its content to be used. I am sure I came across a corpus generated from Wikipedia.
  
  Reply
Adam says:

May 1, 2014 at 11:32 am

I’m using your word lists so that I can decide which words are necessary for me to learn to have some fluency in a language, specifically Arabic, and likely French and Japanese in the future. I’m always worried about spending time on words that I won’t ever use.

Reply
Pingback: dodolK Language pack(Russian)怎么样|dodolK Language pack(Russian)好用吗|dodolK Language pack(Russian)用户评论 - 就要问
James Yao says:

May 30, 2014 at 11:46 pm

Thanks for being awesome!

Reply
Pingback: Naive Language Detector | The Tokenizer
Pietro says:

July 23, 2014 at 9:49 am

Hello,

I just downloaded the application. Is it possible to scan documents in a directory instead of having to open them one by one?
It seems a solution has already been provided but can’t find it.

Thanks in advance for your help.
Pietro

Reply
- Hermit Dave says:
  
  July 23, 2014 at 10:01 am
  
  It only supports text files, but yes, you can put them in a directory, and when you run the app you select the appropriate directory.
  
  Reply
  - Pietro says:
    
    July 23, 2014 at 10:06 am
    
    My concern is that it asks me to select an input file. Selection of directory is not allowed.
  - Hermit Dave says:
    
    July 23, 2014 at 10:52 am
    
    When you run the app and press the button “Build frequency list”, it:
    1) Asks you to select a directory (containing the text files)
    2) Shows a Save As dialog for you to type a name (and location) for the generated frequency list
    
    Once you do this, it should run the program to churn out the results
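For readers who cannot run the Windows app, the two steps above (pick a directory of text files, pick an output file) can be approximated with a short script. This is a sketch of the general idea and not a port of FrequencyWordsHelper: the tokenization rule (runs of letters, lower-cased) and all names are assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# A "word" here is any run of Unicode letters; digits, punctuation,
# apostrophes and hyphens all act as separators in this sketch.
WORD_RE = re.compile(r"[^\W\d_]+")

def build_frequency_list(directory):
    """Count words across every .txt file directly inside the directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8").lower()
        counts.update(WORD_RE.findall(text))
    return counts

def write_frequency_list(counts, output_path):
    """Write 'word count' lines, most frequent first, as UTF-8."""
    with open(output_path, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            out.write(f"{word} {count}\n")

# Example usage (hypothetical paths):
#     counts = build_frequency_list("subtitles_txt")
#     write_frequency_list(counts, "frequency.txt")
```

The output uses the same `word count` layout as the downloadable lists, so it can be compared with them directly.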
Christina says:

July 24, 2014 at 7:42 pm

Hi Dave,

I’m using your German list in my master’s thesis in Applied Linguistics and it’s very helpful! I’m wondering if you could provide the German corpus size?

Many thanks!

Reply
- Hermit Dave says:
  
  July 29, 2014 at 3:10 pm
  
  Hi Christina,
  
  Sorry for the late reply. Along with the word list, I package 2 additional log files:
  de-s.log contains a list of the files used along with the word count in each file.
  de.log contains summary info like total words and unique word count. Have a look at those.
  
  Reply
Pingback: Learn a New Language and Become Fluent in One Month » Kermit Jones
garth says:

August 20, 2014 at 5:44 pm

Thanks for posting these. I’ll be using them for language learning!

Reply
Olfert Rahbek says:

September 18, 2014 at 7:14 am

Hi there, great work. A couple of comments: it would be quite easy to clear the lists of the “pollution” from other character sets, e.g. Russian and Mandarin in English.
In general the algorithm does not seem to find words that include a “-”, so some important, frequent words will be missing, e.g. in Danish.
Of course, composite words and idioms would be the next great achievement. Do you know of any open sources that offer these, e.g. in English? Thanks!

Reply
- Hermit Dave says:
  
  September 18, 2014 at 10:41 am
  
  Hi,
  
  Yes, it is easy to clean the lists. When consuming them I tend to do it in Excel. It can be done programmatically as well, as long as I define the correct character range for each language.
  
  I do split words with “-” into two words when building the list. Languages like French use ‘ within words, and it is difficult to generate a list without fully understanding the language intricacies behind it.
  
  Hermit
  
  Reply
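The effect of splitting on “-” versus keeping hyphenated words can be illustrated with two tokenizers. Neither is taken from the author's generator; the regular expressions and the sample sentence are assumptions chosen to show the difference.

```python
import re
from collections import Counter

# Treats "-" as a separator, so "t-shirt" becomes "t" and "shirt".
SPLIT_ON_HYPHEN = re.compile(r"[^\W\d_]+")
# Keeps letter runs joined by single hyphens together as one word.
KEEP_HYPHEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

text = "en t-shirt og en e-mail"

split_counts = Counter(SPLIT_ON_HYPHEN.findall(text))
kept_counts = Counter(KEEP_HYPHEN.findall(text))

print(sorted(split_counts))
print(sorted(kept_counts))
```

With the first pattern the fragments `t` and `e` enter the list as words; with the second, `t-shirt` and `e-mail` survive intact. A similar alternation could be added for apostrophes, but as noted in the reply, the right rule depends on the language.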
Olfert Rahbek says:

September 21, 2014 at 1:37 pm

Thanks for your reply. I’d be happy to show you an example of the positive impact it would have if “-” combined words were kept as such. One other point: there seems to be a systematic quality issue with the input. In German and Dutch, for example, but also in other languages, there are many words with “ii” where the correct spelling is one i, or il, or li. They don’t feel like normal typos but rather like some interpretation error made by a computer. Can you comment? Thanks in advance, Olfert

Reply
Tom says:

October 22, 2014 at 10:42 am

Hi Hermit,

Just to say thanks a lot for creating the Greek frequency list – I am currently using it to set up a ‘Memrise’ vocab course. Do you have a longer list to hand by any chance? Or is it simple to generate using the app?

Cheers,

Tom.

Reply
- Hermit Dave says:
  
  November 6, 2014 at 5:19 pm
  
  Hi Tom,
  
  I was sure I replied to you. I have a small app linked to the post which can use data in text files. You can use that to generate your own lists.
  
  Hermit
  
  Reply
program says:

December 23, 2014 at 12:30 pm

Hello, is there a command switched version available for FrequencyWordsHelper?
Like: c:\fq\FrequencyWordsHelper.exe -i c:\txtfiles -o c:\results\cur.txt

Reply
- Hermit Dave says:
  
  January 21, 2015 at 12:37 pm
  
  no I never created a command line version of it
  
  Reply
awordtester says:

January 17, 2015 at 4:25 am

I downloaded your Ukrainian frequency list but I don’t recognize the words. I see they are in the Roman (English) alphabet not Cyrillic (Ukrainian). But I don’t recognize them in either language. Can you clue me in here?? Thanks!

Reply
- Hermit Dave says:
  
  January 21, 2015 at 12:34 pm
  
  Hey, sorry for the late reply. I just had a fellow Ukrainian look at the list and she said that most of it is fine. Can you try opening it with Notepad?
  
  Reply
Simon says:

January 20, 2015 at 4:25 am

Hi Dave,
You mentioned the quotes from a commercial source costing about £500 per language for a cleaned wordlist. Could you please tell me which source sells reliable word frequency lists? We are building a keyboard app and we need reliable word frequency lists in different languages, which we are happy to pay for. Please assist!

Reply
- Hermit Dave says:
  
  January 21, 2015 at 12:37 pm
  
  Simon,
  
  It has been about 3.5 years and I am afraid those comms are buried in my email inbox somewhere. Do a search online and get in touch with those individuals.
  
  Reply
Div says:

January 25, 2015 at 5:57 pm

Is there one for Latin, perhaps? 🙂

Reply
- Hermit Dave says:
  
  February 2, 2015 at 10:59 pm
  
  🙂
  
  Reply
Russell says:

January 25, 2015 at 10:19 pm

Thanks so much–this is awesome. I’m using the Hebrew 2012 list for my dissertation on morphology.
I hate to bother you, since you’ve already provided such an amazing service, but do you by any chance have a record of how many subtitles were included when you downloaded Hebrew 2012 from opensubtitles.org, or maybe the date when you downloaded it, since people add more there all the time? Thanks!

Reply
Pingback: An 80-20 System for Learning German Vocabulary That Really Works | FluentU German
Salomon says:

April 7, 2015 at 11:31 pm

i take word here and i launch revisions thanks to http://www.vocateacher.com , it is fucking efficient but sometimes I am waiting a long time before i can use a new word

Reply
Pingback: Italian Websites – Duolinguisto
Diana says:

April 14, 2015 at 2:17 pm

Thanks for posting these. We might use them to find words missing from our dictionaries. 🙂

Reply
Natawut Monaikul says:

April 14, 2015 at 8:03 pm

This is absolutely amazing. Thank you so much for compiling this! I’m going to use the Italian 2012 word list in my thesis. I’m looking to build a model of a bilingual mental lexicon (English-Italian) so I can analyze the structure of it (in search of a small-world network).

Reply
neri says:

April 15, 2015 at 8:52 pm

” however I was quoted about £500 per language for a nice / cleaned wordlist.”

holy shit who the fuck was quoting 500 per word list? where did you see it?

btw thanks for the wordlist, you are a god among men

Reply
Pingback: German Websites – Duolingistum
Pingback: DodolK Language pack(Russian) APK v1.0 Free Download
Kieran Maynard says:

May 27, 2015 at 3:56 am

Thanks for posting these. Really neat. The zh_cn-2012 list is really messed up, though. It’s useful only for the most common words, then it gets wonky and lists, for example, 物，哈 and 力 (actually rare as standalone words) as more frequent than 大家 (extremely common). Just be aware, for those using it.

Reply
Pingback: Notes on Learning a Language (Part 1) | Engineering Energy
Emre says:

July 18, 2015 at 1:18 am

thanks

Reply
Ahmet Akkök says:

July 28, 2015 at 3:26 pm

Hi Hermit,

Do you happen to have N-grams, like 2,3,4 words used in a row with freq?

Reply
- Hermit Dave says:
  
  August 21, 2015 at 7:24 am
  
  Sorry, I only created a simple list generator, no n-grams.
  
  Reply
Pingback: comment utiliser une liste de fréquence | apprendre une langue
danR says:

September 26, 2015 at 7:14 pm

Thanks. Still not seeing Korean 2012, though. This anomaly was mentioned by someone else already some years ago.

Reply
- Hermit Dave says:
  
  November 30, 2015 at 3:49 pm
  
  not sure what I can do for chrome to not block the zip archives.
  
  Reply
harish suvarna says:

November 5, 2015 at 7:15 pm

Hi Hermit,
Is it possible for you to license under creative commons 4.0 International or BSD or Apache 2.0?

-harish

Reply
Harish Suvarna says:

November 5, 2015 at 8:56 pm

Hi Hermit,
Is it possible for you to allow us to use the frequency lists using BSD or Apache license?
-harish

Reply
- Hermit Dave says:
  
  November 30, 2015 at 3:49 pm
  
  I used CC because the underlying data is based on CC. I am not well versed with licenses.
  
  Reply
  - Alex Gordon says:
    
    February 11, 2016 at 8:04 am
    
    These word lists are not copyrightable in the US, FWIW. Might be covered under EU database rights but it seems kind of borderline.
  - Hermit Dave says:
    
    April 8, 2016 at 7:20 pm
    
    I’m not a lawyer and I’m only going by the licensing of the content I consumed. Feel free to do whatever you think is right.
sylvestertheinvestor says:

November 29, 2015 at 1:00 pm

I don’t think your program works anymore (Windows 10?) does it? It asks you to select a folder, and a file and then nothing happens.

Reply
- Hermit Dave says:
  
  November 30, 2015 at 3:50 pm
  
  if you look in the destination folder you should see the frequency list generated by the app. Having said that I haven’t tested it in recent months
  
  Reply
  - sylvestertheinvestor says:
    
    December 1, 2015 at 12:33 am
    
    Just tried it on a different PC with Windows 7 and it hangs after selecting the output location, no output generated. I’m trying to add a neglected language to Memrise, so I’d really like to use it.
  - sylvestertheinvestor says:
    
    December 3, 2015 at 12:30 am
    
    I may have figured it out. I think you need to run it as administrator.
  - Sylvester says:
    
    December 4, 2015 at 6:49 am
    
    No, it just doesn’t like SRT files. If I use subedit to remove the formatting, then it works.
  - Hermit Dave says:
    
    December 4, 2015 at 3:21 pm
    
    aaah I think that generator is pretty basic.. just expects text and nothing else 🙂
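Since the generator expects plain text, SRT files need their cue numbers and timestamps removed first. The following is a rough sketch of that preprocessing step, assuming the common SubRip layout (index line, `-->` timing line, then dialogue); it is not what subedit does internally and the function name is hypothetical.

```python
import re

# Matches timing lines such as "00:00:01,000 --> 00:00:03,500".
TIMESTAMP_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
# Matches simple inline markup such as <i>...</i>.
TAG_RE = re.compile(r"<[^>]+>")

def srt_to_text(srt):
    """Return only the dialogue lines of an SRT document."""
    lines = []
    for line in srt.splitlines():
        stripped = line.strip()
        # Skip blank lines, cue index lines and timing lines.
        # Caveat: a dialogue line made only of digits is dropped as well.
        if not stripped or stripped.isdigit() or TIMESTAMP_RE.match(stripped):
            continue
        lines.append(TAG_RE.sub("", stripped))
    return "\n".join(lines)

sample = """1
00:00:01,000 --> 00:00:03,500
<i>Hello there.</i>

2
00:00:04,000 --> 00:00:06,000
How are you?
"""
print(srt_to_text(sample))
```

Saving the result as a `.txt` file gives input of the kind the app is described as accepting.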
Klancy Kennedy says:

December 7, 2015 at 5:47 am

you said you wanted to know how your list is used. I am an English teacher in China, trying to find resources to identify high frequency words to teach students. Basic conversation is lacking here, so mastery never happens even with simple words. High frequency words will give them the best start.

Reply
Dushan Savich says:

December 14, 2015 at 12:26 pm

Thank you so much for this. This saved me so much time, exactly what I needed.

Cheers!

Reply
Pingback: Common words | HowToSpeakPolish.com
Pingback: Apprendre une langue étrangère à tout âge, c’est possible ! | Décor Emotif
Sunhee Kim says:

April 8, 2016 at 6:42 pm

Hello, how would we cite this? I will be using the English 2012 version to match the word frequencies of stimulus sets.

Reply
- Hermit Dave says:
  
  April 8, 2016 at 7:18 pm
  
  For any citation you can just include the blog url.
  
  Reply
Pingback: Trouver Des Mots Dans Un Livre Et Combien De Fois.
Marek Pelc says:

April 12, 2016 at 9:19 am

Hi,
Your dictionaries are probably the most extensive on the internet.
Great work!

I used EN and PL wordlists for my android keyboard app. (It will soon hit the play store 🙂 )
I did some filtering to get rid of non-existent, faulty or bad words (swear words etc). There were masses of them and I’m not sure whether all of them have been removed.

Anyway output of that is available here:
https://github.com/mkpelc/frequencywordslists

I mentioned you in licence.txt 🙂

thanks!

Reply
Wayne says:

April 16, 2016 at 1:28 am

Hermit Dave,

Thank you for your work. I will expand some of the Russian listing. For personal use in a dictionary. I need to memorize meaningful words first. And if I clean up the dictionary well enough, I will share with others.

Do you know how large the corpus was for the 2012 Russian listing? That would allow me to quantify the value of the words better.

AND, an actual word count would be better than using a word processing word count. IF you don’t have it handy, I will set up linux for mac, and re-freshen my command line commando skills.

Thanks,

Wayne

Reply
Marek says:

April 19, 2016 at 5:41 pm

Hi,
I used your en and pl lists.
I did a lot of filtering because there were many non-existent words and swear words.
I put filtered wordlists here: https://github.com/mkpelc/frequencywordslists

I will use them in my android keyboard app.

thanks & cheers

Reply
Eos says:

April 25, 2016 at 12:11 am

Japanese Dictionary can be found here http://ftp.monash.edu.au/pub/nihongo/00INDEX.html
Look for JMDict or Edict.
I would love to see a lemma frequency list from the Japanese subtitles, and a comparison of native Japanese television series/movies vs foreign television series/movies. I expect there will be much more kanji in the native series.

Regards, Eos

Reply
Georgi 'Kaze' says:

May 1, 2016 at 7:25 pm

Hi Hermit,
thanks for sharing these, but my greediness goes far beyond your English list: the Google Books 1-grams that I ripped amount to 7+ million unique ones:
https://onedrive.live.com/redir?resid=8439CC8C71159665!133&authkey=!ANkU5SM69yh-VNM&ithint=file%2ctxt

The text file is created by my n-gram ripper Leprechaun from *Google Books corpus All Nodes* featuring 3,473,595 English books.
The n-grams (or arcs) were downloaded from:
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html

Hope, someone is going to share more refined-and-richer English wordlist, my greediness knows no limits.

Cheers

Reply
Antonios H says:

May 27, 2016 at 12:36 pm

Hi Hermit, do you know what is the encoding used for the Greek list of 50K most frequent words? I’m trying to import on excel 2011 on a mac and it shows up as gibberish… I see it normally on textEdit. Thanks very much for sharing your word lists!
A.

Reply
- Hermit Dave says:
  
  May 27, 2016 at 1:35 pm
  
  Hi Antonios,
  
  The files are saved as UTF-8 if I remember correctly.
  
  Hermit
  
  Reply
Jordan says:

June 3, 2016 at 7:55 am

Hi Hermit Dave,

Thanks for this awesome effort. … Apologies if you’ve said this elsewhere, but i was just curious how you went about compiling all the subtitle files in each language.. Did you write some sort of script to parse the .srt files stored on the opensubtitles server?

Cheers,
J

Reply
- Jordan says:
  
  June 3, 2016 at 8:21 am
  
  Also, i’ve taken a look at the Farsi list, and was just wondering why there’s such a large number of English words present in the list … am i reading/rendering the list incorrectly? … What were the source documents, if i may ask?
  
  Cheers,
  J
  
  Reply
  - Hermit Dave says:
    
    June 3, 2016 at 10:42 am
    
    Hi Jordan,
    
    I got a large dataset from Tehran university for Persian. I used that as the base for the wordlist as I did not have subtitle data.
    
    Hermit
- Hermit Dave says:
  
  June 3, 2016 at 10:43 am
  
  As I mentioned in the intro on the page, I used the XML subtitle dataset generated for the open corpus at http://opus.lingfil.uu.se/
  
  Reply
Espen Klem says:

June 17, 2016 at 1:10 pm

Using it to create a Swedish language file for a stopword module. It’s used to filter out words that hold little or no meaning in text analysis, and was originally created for search-index.

Reply
Espen Klem says:

June 20, 2016 at 8:22 am

And a Danish stopword list

Reply
vinny says:

August 6, 2016 at 7:24 pm

God-sent , using it for language learning.

Reply
Sane Yagi says:

August 8, 2016 at 10:46 pm

Thanks, Hermit. The Arabic list is excellent. The Frequency Word Helper is outstanding too. I am using it to compile a specialized frequency list. You are wonderful.

Reply
- Hermit Dave says:
  
  August 10, 2016 at 9:52 pm
  
  Glad it is helpful.. Started on a newer version recently.. should be out soon
  
  Reply
  - bilalzaiter says:
    
    May 17, 2018 at 7:43 pm
    
    Hey Hermit, the work is great and definitely helpful at many levels. You started working on a new Arabic version in 2016, as I read in one of your comments above. Any updates on this?
    Thanks
  - Hermit Dave says:
    
    May 17, 2018 at 8:08 pm
    
    You can find that 2016 Arabic frequency list here
    https://github.com/hermitdave/FrequencyWords/tree/master/content/2016/ar
  - bilalzaiter says:
    
    May 17, 2018 at 10:55 pm
    
    Sorry just noticed your link and actually your git links leading to the original corpus are useful too. Here is a link from 2006 on “The use of film subtitles to estimate word frequencies” . I do not know if you saw it before. http://sites.univ-provence.fr/veronis/pdf/2007-AppliedPsy.pdf but it may be helpful for some readers here. again thanks for this work.
Michael King says:

August 28, 2016 at 11:05 pm

Hi Hermit Dave,
Great stuff! I’ve been using Zh for learning Chinese, PtBr for learning Portuguese, and Es for comparing a Portuguese word’s frequency with the frequency of its Spanish counterpart (I learnt Spanish a while back).

Question: I have the transcripts for the first 10 episodes of a Brazilian TV series. I would like to generate a frequency list using these transcripts; I would then like to compare the transcript frequency list with the overall language frequency list. Do you have any advice for how I can go about doing this?

Sincerely,
Michael King

Reply
- Hermit Dave says:
  
  August 29, 2016 at 8:13 am
  
  Hey Michael,
  
  I recently published a new version of the word lists along with the source code on Github.
  
  If you email me (hermitd@hotmail.com) the transcripts I can take a look at how to use those. I did publish a simple app that can go through text files in a directory and generate a wordlist from them. It should be available on this page. You can use either.
  
  Reply
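One way to compare a transcript-based list with the overall language list, as asked above, is to normalize both to relative frequencies and look at the ratio per word. This is only a sketch of that idea: the threshold, the sample counts and the function names are made up for illustration.

```python
def relative_frequencies(counts):
    """Convert raw counts to shares of the total word count."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def overrepresented(transcript_counts, reference_counts, min_ratio=2.0):
    """Words whose share in the transcripts is at least min_ratio times
    their share in the reference list, strongest first."""
    t = relative_frequencies(transcript_counts)
    r = relative_frequencies(reference_counts)
    results = []
    for word, freq in t.items():
        ref = r.get(word)
        if ref is None:
            continue  # word is absent from the reference list
        ratio = freq / ref
        if ratio >= min_ratio:
            results.append((word, ratio))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Invented counts: a small transcript list and a larger reference list.
transcript = {"que": 50, "delegado": 30, "não": 20}
reference = {"que": 5000, "não": 4000, "delegado": 10, "casa": 990}

result = overrepresented(transcript, reference)
print(result)
```

Raw counts cannot be compared directly because the two corpora differ in size, which is why both sides are normalized first. Words missing from the reference list are skipped here, though they may be the most series-specific vocabulary and could be reported separately.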
Marek P says:

August 30, 2016 at 3:27 pm

Hi
I left a comment here twice but it never appeared. I’m trying once more.
I used 2 of your dictionaries in my keyboard app, Lumberjack Keyboards. You can find it on Google Play 😉
One comment from me is that there is massive noise from non-existent words in your dictionaries. Also many swear words.
I tried to do extensive filtering but I’m not sure whether I filtered out all the bad words.
If you want, I can send you a link to GitHub with the output of my filtering.

BR! And thanks for dictionaries 😉

Reply
- Hermit Dave says:
  
  August 30, 2016 at 4:39 pm
  
  Sure Marek, you can post the GitHub link here
  
  Reply
  - Marek P says:
    
    January 10, 2017 at 2:23 pm
    
    Here it is:
    https://github.com/mkpelc/frequencywordslists
jansegers says:

September 17, 2016 at 5:57 pm

Any chance you could do the same work based on simple.wikipedia.com? I myself wouldn’t know where to start. It would provide a nice ELT resource that is up-to-date.

Reply
- Hermit Dave says:
  
  January 11, 2017 at 3:06 pm
  
  I’d need to look at Markdown and find the time, but yes, it’s certainly possible.
  
  Reply
Joe Ramsey says:

April 21, 2017 at 11:41 am

I’ve wanted to create one of these for Japanese with romaji for a long time now. I’m not really sure what opensubtitles.org is or what you’re doing here.

Are you just downloading the subtitles in the target language of dozens and dozens of movies, then running all of it through a word frequency list to get the top 1000-2000 most used, then individually using an online dictionary to get the english translation?

Reply
- Hermit Dave says:
  
  April 21, 2017 at 12:38 pm
  
  Hi Joe,
  
  The latest version of the generated sets and the code to generate these can be found on GitHub
  https://github.com/hermitdave/FrequencyWords/
  
  On the repo and on this post, I mentioned in a few places that the subtitles were downloaded from another corpus effort
  http://opus.lingfil.uu.se/OpenSubtitles2016.php
  
  The one that I have in the repo is, I believe, most likely going to be Kanji. Is there a one-to-one mapping of characters? A while ago I did that with Serbian, as it can be written in two scripts.
  
  Reply
  - Yacouba says:
    
    February 11, 2019 at 12:58 pm
    
    Hi wonderful dave and others.
    I came across your frequency list system and it’s awesome!
    Thank you very much for that.
    I would like to contribute, first by learning how all of this works, but on Linux (not Windows), as I’m a bit addicted to the Linux command line. So what should I do if I want to create a frequency list in any language?
    Thank you very much again, and excuse my English (I’m a native French speaker and, by the way, oh! I wish I could replace all I know in French with the wonderful and meaningful English language! But fate! Fate!)
  - Hermit Dave says:
    
    February 11, 2019 at 1:37 pm
    
    Hi Yacouba, the latest version of the frequency list builder can be found on GitHub: https://github.com/hermitdave/FrequencyWords. Most of the logic is in Program.cs. It’s written in C#, and you should be able to compile it on Linux without significant issues (I haven’t tried) using the .NET Core SDK.
    Have a look at it. Please let me know if you have any issues.
    
    Hermit
Pingback: Vārdu biežums – Expeditiones Linguarum