Frequency Word Lists

Go to skydrive download page

When I was writing Slydr, I found it difficult to get hold of decent word lists. A commercial company quoted me £500-odd per language for a 50k word list, so I decided to write my own. If you decide to use it, please let me know what you are using it for. It’s yours to use.
Note: I used public / free subtitles to generate these and, like most things, they will have errors. If you want me to create an updatable repository, I can put these on CodePlex and you are welcome to update them.
I would like to thank opensubtitles.org, as their subtitles form the basis of the word lists. I would also like to thank Tehran University for the Persian Language corpus, which allowed me to build the Persian / Farsi word list.
Creative Commons – Attribution / ShareAlike 3.0 license applies to the use of the word lists.
While the subtitles are free, donations do motivate further work. If you would like to donate, please click the Donate button to donate using Paypal.
If you would like to create your own word lists, here’s something to get you started. Download FrequencyWordsHelper. When you run the app, it will ask for a directory to scan and then for an output filename. Once you provide both, it will scan the directory for all txt files and create a word list out of them. The app requires .NET Framework 4.5.
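For readers who would rather script this step, here is a minimal Python sketch of the same idea: scan a directory for txt files and write a "word count" list. It is not the code behind FrequencyWordsHelper (which is a .NET app). The function name `build_frequency_list`, the UTF-8 default and the whitespace tokenisation are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def build_frequency_list(directory, output_file, encoding="utf-8"):
    """Count words in every .txt file under `directory` and write 'word count' lines.

    Hypothetical helper: the real tool's tokenisation rules are not documented here.
    """
    counts = Counter()
    for path in sorted(Path(directory).rglob("*.txt")):
        text = path.read_text(encoding=encoding, errors="replace")
        # Naive tokenisation (an assumption): lower-case, then split on whitespace.
        counts.update(text.lower().split())
    # Most frequent first; ties are broken alphabetically so the output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    with open(output_file, "w", encoding="utf-8") as out:
        for word, count in ranked:
            out.write(f"{word} {count}\n")
    return ranked
```

Write the output file outside the scanned directory (or give it an extension other than .txt), otherwise a later run would count the previous list as input.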


Format of the frequency lists:
word1 number1 (number1 is the number of occurrences of word1 across all files)
word2 number2 (number2 is the number of occurrences of word2 across all files)
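A short sketch of reading this format back in Python. The function name `parse_frequency_lines` is made up for illustration; the only thing taken from the lists is the "word, space, count" layout described above.

```python
def parse_frequency_lines(lines):
    """Parse 'word count' lines into a list of (word, count) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last space so the count is always the final field.
        word, _, count = line.rpartition(" ")
        if not word or not count.isdigit():
            continue  # skip lines that do not match the expected layout
        entries.append((word, int(count)))
    return entries


entries = parse_frequency_lines(["you 21953223", "the 18609293"])
```

Each entry is a plain tuple, so the list can be passed straight to `dict()` when lookups by word are needed.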

Language 50K Full Log
Arabic – ar Download Download Download
Bulgarian – bg Download Download Download
Czech – cs Download Download Download
Danish – da Download Download Download
German – de Download Download Download
Greek – el Download Download Download
English – en Download Download Download
Spanish – es Download Download Download
Estonian – et Download Download Download
Farsi – fa Download Download Download
Finnish – fi Download Download Download
French – fr Download Download Download
Hebrew – he Download Download Download
Croatian – hr Download Download Download
Hungarian – hu Download Download Download
Indonesian – id Download Download Download
Icelandic – is Download Download Download
Italian – it Download Download Download
Korean – ko Download Download
Lithuanian – lt Download Download Download
Latvian – lv Download Download Download
Macedonian – mk Download Download Download
Malay – ms Download Download Download
Dutch – nl Download Download Download
Norwegian – no Download Download Download
Polish – pl Download Download Download
Portuguese – pt Download Download Download
Portuguese Brazilian – pt-br Download Download Download
Romanian – ro Download Download Download
Russian – ru Download Download Download
Slovak – sk Download Download Download
Slovenian – sl Download Download Download
Albanian – sq Download Download Download
Serbian Cyrillic – sr-Cyrl Download Download Download
Serbian Latin – sr-Latn Download Download Download
Swedish – sv Download Download Download
Turkish – tr Download Download Download
Ukrainian – uk Download Download Download
Simplified Chinese – zh-CN Download Download Download

192 thoughts on “Frequency Word Lists”

    • you are welcome and thank you. i stumbled upon wiktionary when i was looking for word lists and i know its not easy to find many free / extensive sources. since i only frequented English Wiktionary, i only posted links on the english version of the page.

    • well my overall subtitle corpus for all languages was a 53 GB compressed archive. I unfortunately deleted everything except the original archive. Let me open it and i can give you an idea of the number of files at least. Based on my tests, frequency lists generated using a decent amount of data should be comparable. I can assure you that there were a lot more entries than the 50k i used and provided for download.

    • well i can do that but the word lists i consume are for a keyboard app and i need raw words to match user input. in fact when i started i came across a few word lists and i could not use them for my requirement, purely because depending upon user input i would want to show a form like “walked” in my app, and lemmatised word lists would make loading a lot slower.

      maybe at some point once i am done with current work load, i will look into lemmatising the lists. thanks for writing

  1. When I download Hebrew (or Arabic, or Farsi) it’s just gibberish. How to go from gibberish to Hebrew?

    Excellent work, and I thank you once again if you can help with this issue.
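A likely cause of the gibberish is that the file is being decoded with the wrong character encoding. The sketch below assumes the lists are UTF-8 text (an assumption, not something stated in the post) and shows how the same bytes look when decoded with a single-byte codec versus UTF-8. The sample entry and the helper name `read_word_list` are made up for illustration.

```python
sample = "שלום 120"            # hypothetical entry: a Hebrew word and a count
raw = sample.encode("utf-8")   # the bytes as they would sit on disk

wrong = raw.decode("latin-1")  # a single-byte decoder turns them into gibberish
right = raw.decode("utf-8")    # the matching codec restores the Hebrew text


def read_word_list(path, encoding="utf-8-sig"):
    """Read a word list as text; 'utf-8-sig' also drops a leading byte-order mark."""
    with open(path, encoding=encoding) as handle:
        return handle.read().splitlines()
```

In an editor or word processor the equivalent step is to pick UTF-8 explicitly when opening the file rather than relying on the default encoding.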

  2. That’s a fantastic work indeed. I am using it for studying Russian now. Learning the most frequent words helps you to understand the spoken languages much easier and faster. And it’s really fabulous there’re people who share that. Thanks.

  3. Pingback: Most common words list for Korean « Acquiring Korean

      • Hi,

        Thanks for this great job.Please can you do the same for japanese. I really love manga and I would like to learn japanese.

      • I unfortunately do not have a large Japanese list. if you have a Japanese repo i’d be more than happy to churn them out for you. In the mean time, I will also check for some public repo

    • there.. i’ve added the links to Korean list i had.. it was small.. after initial clean up it was 14k or so words only. Not a very high quality frequency list i must admit – but at least something to start with.

    • the size of word lists depends on how many subtitles are used (and their size and the unique word count in them). i have a log somewhere of how many files / words per language
      Total files: 15
      Unique word count: 19215
      Total word count: 369216225
      See how low the file count was. The higher the file / word count, the better the quality of the word list.

  4. Is there a reason I get a list of numbers? I get the words in the native language but next to them numbers, am opening in notepad

    • well the format i used for word list is sort of generic one i found around. it goes like this
      word1 wordfrequency1
      word2 wordfrequency2
      word3 wordfrequency3
      word comes first, and is followed by word frequency which is a number, with space in between.

      • The unique word count is the count of unique occurrences. subtitles have a lot of errors, for example the words “i don’t” sometimes occur as idon’t, which would be a unique word. however its occurrence would be a lot further down the list than i and don’t, as those two words are used a lot more often. Total word count is just a count of all words in all subtitles. Each word could be repeated 1000s of times.

  5. dave, i meant the total words figure, of course, that makes up your corpus. it appears that this figure of 690 billion words is just incorrect, because a simple calculation such as “Total word count/Total files” gives 6 and a half million words in a single file, which, you must acknowledge, simply can’t be true, especially for an average subtitle file that is supposed to be quite small in terms of the words contained in it.

    6,5 mln. words per file would give a file of 10 Mb as minimum. so, again: this can’t be a true figure (Total word count: 690,318,369,316).

    judging by word occurrences in your corpus, its size can’t exceed that of COCA.

    top10 from yours

    you 21953223
    the 18609293
    to 13815857
    and 8134383
    it 7913344
    of 7131178
    that 6534300
    in 6010124
    is 5924671
    me 5619307

    top10 from coca’s

    the a 22,038,615
    be v 12545825
    and c 10741073
    of i 10343885
    a a 10144200
    in i 6996437
    to t 6332195
    have v 4303955
    to i 3856916
    it p 3872477

    so, it turns out the total words number of your corpus is slightly less than that of COCA, which means the total number of words in yours must be around 400,000,000 words.

    can you recalculate the real figure?

    • Okay i will redo the counts (plus remember that this is the first set of frequency lists i did). I have been tweaking a bit. I will do a count again and post it when i get to work.

    • you were correct. I re-ran the word list generator a couple of times and I found the mistake i made in computing the total count. The other details are correct however the total word count came up to 765703147 and not 690788712769

  6. Thanks for sharing this data with us! I’m getting 404 errors for Estonian and Ukrainian files. Any chance you can upload them again? And the Arabic line is repeated on the first and third lines.

  7. thank you, dave. this one indeed looks more realistic as it gives now the figure of about 7,000 words per file which is way closer to the truth.

    and a special thanks for the great job you’ve done, as i’m one of many who also may be interested in such useful lists!!

    ATB
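The sanity check used in this exchange is easy to reproduce. The sketch below plugs in the figures quoted in the comments (roughly 107,000 English files, the originally reported total and the corrected total); the function name `words_per_file` is just for illustration.

```python
def words_per_file(total_words, total_files):
    """Average number of words per subtitle file."""
    return total_words / total_files


total_files = 107_000               # approximate file count quoted in the thread
reported_total = 690_318_369_316    # the figure that looked wrong
corrected_total = 765_703_147       # the figure after the count was fixed

before = words_per_file(reported_total, total_files)   # millions of words per file
after = words_per_file(corrected_total, total_files)   # a few thousand words per file
```

An average in the millions per subtitle file is implausible, while a few thousand words per file is in line with what the commenter expected.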

  8. i was about to think the question had been closed, but yet another thought has just occurred to me, this time about the total number of files figure.)

    this number of 107,000 files, as it seems to me, might also be incorrect… why? because the total number of subtitle files on http://www.opensubtitles.org is currently 1,593,684. but this figure seems not to refer to the number of UNIQUE files on the site, but rather to the number of all possible files available, while it is a well known fact that sometimes several or even a dozen files can be attached to a single movie. for example, 62 english subtitles are currently attached to the Avatar movie:
    http://www.opensubtitles.org/ru/search/sublanguageid-eng/idmovie-19984
    thus, all i want to say is, are you sure all the files making up your english corpus are really unique? the current figure of 107 thousand files seems to be dubious as it implies there are actually 107 thousand english-speaking movies, which is an enormous number.

    • that is true… based on my random scan i would say that most directories only had one file however there were a few directories with multiple files (repeating subtitles) and in at least one case i saw 10 entries. so yes there is a flaw however some subtitles are split and some are not.

  9. well, as long as there are recurring files in the corpus it can’t be authentic to the fullest. moreover, this fact also means the total words figure should be lessened even further, so i think you have to solve this problem somehow for the sake of accuracy, although it might prove to be a not easy task this time, indeed…

    • true.. i do understand how a corpus is created. i might have to write my own directory scan mechanism – it’s just a problem of logic and time.. it takes ages to generate word lists… more than a day as i couldn’t be asked to multi thread it (overall data is more than 50GB compressed). i’ll try to update the files over the next few days depending upon what i am doing.

      thanks for making me look into this.

  10. thank you.) i really like your idea of creating a corpus based on subtitles to movies.

    i think if you do recreate the corpus it could affect the words frequency lists as well, so it worth doing the job, especially since at the end of it you will get a more precise and accurate corpus!

    ATB.

    p.s. could you please notify people then by writing a message on here? it seems like i get an e-mail every time someone posts a message to comments which is useful ;)

    • my reworked frequency list builder works well.. managed to process all languages though i need to rework english to handle subtitle annoyances with don’t etc… its usually split into don ‘t as two words.. hopefully that will be done at some point this week.. right now both desktop and laptop busy downloading maps for another project

  11. this is great news, dave!.. have you thought of finding some way to lemmatize your corpora, at least the english one? it would be even greater!.. i’m not an expert, but it seems like i’ve seen a piece of software for that purpose somewhere on the internet…

    • I haven’t gotten around to reworking the english dictionary yet.. I have been asked about lemmatizing the corpora but so far i haven’t gone that route for 2 reasons. 1) i consume straightforward word lists and that’s why i build them in this manner. 2) i need to look into it and since i didn’t need it for consumption it becomes low priority. Anyways been busy with christmas. I will try to get the english list sorted tomorrow and then probably upload raw lists.

    • Uploaded the logs. the format is
      Total files: 12601
      Unique word count: 401001
      Total word count: 64362991
      Overall word count: 91273545

      Total word count is the total count used for frequency list.
      Overall word count was the actual word count. some words have junk characters or a length of 1 and are ignored. Hence Total word count <= Overall word count.
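A rough sketch of how the three log figures relate to each other. The exact rule for "junk characters" is not given in the post, so the filter here (letters and apostrophes only) is an assumption, and `summarise_tokens` is a hypothetical name.

```python
from collections import Counter


def summarise_tokens(tokens):
    """Return unique, total and overall counts in the spirit of the log file."""
    overall = 0
    kept = Counter()
    for token in tokens:
        overall += 1                      # every token counts towards "overall"
        if len(token) <= 1:
            continue                      # single-character words are ignored
        # Assumed junk rule: anything that is not a letter or an apostrophe.
        if not all(ch.isalpha() or ch in "'’" for ch in token):
            continue
        kept[token] += 1
    return {
        "unique": len(kept),              # distinct words that survived the filter
        "total": sum(kept.values()),      # occurrences used for the frequency list
        "overall": overall,               # occurrences seen before filtering
    }


stats = summarise_tokens(["don't", "a", "the", "the", "x1", "%%"])
```

Because tokens are only ever dropped, never added, the total can never exceed the overall count.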

  12. Just tried them again after restart, they download fine. Thanks a lot, I’m going to use them with dictionaries to help study French and Spanish on my Kindle.

  13. Dear Dave,

    Kindly explain to me the number next to the word, i assumed that it shows how common the specified word is used or how popular it is, no?

    p.s. this list is a gift from heaven merci, gracias for these great lists

  14. Thank You Dave!

    very useful repository.
    However, I am analyzing the English corpus, first 10K words.
    I found there is not the “I” pronoun in the first 10K entries, which made me think about some oddities.

    Can you tell me the amount of words and sources you retrieved the corpus from?
    I read is from opensubtitles, but how many movies have been processed? do you have some info about this sources?

    I am asking this to see if I can compare the corpus with authoritative sources, such as published dictionaries (oxford) where words are sorted by frequency too.
    thank you very much!!

    And happy 2012 and “fatherhood :)

    • Luigi,
      The word lists i have generated ignore 1 letter words like a and i. Its difficult to validate a single char word across multiple languages unless you know the language or can spend time tuning the rules per language. I know a bit about it as i have done something similar for accents across various latin based european languages. If you really want one, i can generate a one off and email it to you.

      The details of corpus is available and you should check the log file. Most languages have a log file entry in the table.

  15. Dave,
    I’ve got another question please.
    I am looking at the Russian dataset, but it is not encoded to handle Cyrillic.
    What should i do to see the Cyrillic charset?
    thank you again!

  16. Thank you Dave,
    yes, if you don’t mind I would ask you a copy.
    You can contact me to the email address I wrote to comment your post..
    Let me understand: is it you that constructed the word lists across opensubtitles movies?
    If so, which movies did you pick up as genre? I mean, I’d like to understand if that list can be actually picked up to represent spoken EN on average.
    I’d like to use it to compare with lists like this one:
    http://books.google.it/books?id=J69KTr60yt8C&printsec=frontcover&dq=english+russian+10000+words+dictionary&hl=it&sa=X&ei=goIIT46lC8P74QSSy9SNCA&ved=0CDYQ6AEwAA#v=onepage&q=english%20russian%2010000%20words%20dictionary&f=false

    Thank you!

    • Luigi,

      I found an extract someone had already done across various languages. I just consumed what i found – i think the resource was up to date with all movies across various genres. The concept of frequency lists dictates that it should be close if not representative of the actual usage. UK english usage is different from US English which is different from that in Canada, Australia, India etc etc.

      this word list is a general one that doesn’t represent the en-UK or en-US etc just english in general.

      sure you can compare the lists.

      Hermit

    • I have reworked the word lists and now they allow single character words.

      The files are available as zipped text files. there’s a 50K word list and then there’s a full word list.

  17. Hi Dave,
    nevermind, i find the way to see it correctly.
    Good!
    Please, would you mind to let me know which other rules you adopted to construct the corpuses?
    - exclude words with one char
    - … ?

    Do you also have an idea where i could find a digital resource of a russian (and other languages) with definitions of words, which I can import easily (a list in txt, csv, xml are perfect…, while pdf is not …) ?

    thank you again for your work!
    Luigi

    • I unfortunately dont have them locally (they are on the hosting server). If i get around to generating them again, i will zip them up in a single archive.

      Having said that i will try to generate torrent files, one that references all the 50k zips and another one that references all the full zips. Once i generate these, i will update this page with the torrent files.

      I however did not get what you mean by “by column too” !! each column currently offers a 50k zip and a full zip for that language. Providing all language download in same column is confusing – rather it should be a single entry at the very top or possibly on top of the table itself.

  18. Can you down them with FTP then pack them?

    For the columns, I mean to say all the 50k for every language, and all the full for every language, as two separate downloads like you described first, Not everything in one column! It is just an extra option for downloaders to choose…not urgent or important really :)

    You can host the files on a hosting site if bandwidth is a problem. Multi upload is a good option here.

    • :) i have downloaded them off my smallbusiness live hosting account.. and packaged them up… took a lot of clicks.
      50k can be uploaded to my host, full one is about 80megs and is not allowed.. will have to upload to megaupload.. it’s been a while since i uploaded anything there.. will have to do it from home..

  19. Megaupload is down forever, FBI raid last week :) multi upload still works.

    I too just downloaded the files individually, so not to worry about it (sorry for the bandwidth use!). I’m surprised you have hosting limits on uploads. I suggest moving to a better provider.

    • oops… sorry know about megaupload… i’d be out of touch if i didn’t know that.. meant multiupload – i used to host Windows Mobile ROMs that i used to create there :) … for some reason it says at upload initializing..

      well i used to have an excellent package which would give me tons of bandwidth and allow me to host couple of gigs of data however i was not using it.. i dont even know if it still works (actually i will check in a bit).. eventually i moved my email hosting to microsoft live a while back and moved hosting there as well.. worse is wordpress.. they allow you to upload tons of things including movies but not zipped files..

  20. Pingback: 10 links « Pierre Rømër

  21. Hey Dave,

    Your wordlists are interesting. So they are completely composed of tokenizing movie subtitles? I am currently working on twitter research and am trying to set up lists to rate words in many of these languages. Our master-lists were generated by tokenizing google books but the tokenizer separates strings at apostrophes. This is a huge problem for our french word-list since words like c’est were appearing as two different words c’ and est which made the set unusable. Are the words on your french wordlist split by apostrophes?

    • Yes they are very interesting. Gives you something to think about. In the case of subtitles, i used ‘ ‘ and ‘-’ to split the words. The french subtitles were in good condition. English however had, say, don’t written as don’ t, and my code would assume don’ and t are two different words. So i changed the logic for english so that if the last char of a word is ‘ then it is joined with the next one, and that worked perfectly. try something like that.

      No, my french and english words keep their apostrophes
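The splitting rule described in this reply can be sketched as follows. This is an illustration in Python rather than the original code, and the function name `tokenize` and its `join_trailing_apostrophe` flag are hypothetical: words are split on spaces and hyphens, and a piece that ends in an apostrophe is glued to the piece after it.

```python
import re


def tokenize(text, join_trailing_apostrophe=True):
    """Split on spaces and hyphens; optionally rejoin pieces like "don'" + "t"."""
    parts = [piece for piece in re.split(r"[ \-]+", text) if piece]
    words = []
    i = 0
    while i < len(parts):
        word = parts[i]
        # A piece ending in an apostrophe is treated as a broken contraction.
        if (join_trailing_apostrophe
                and word.endswith(("'", "\u2019"))
                and i + 1 < len(parts)):
            word += parts[i + 1]
            i += 1
        words.append(word)
        i += 1
    return words


fixed = tokenize("I don\u2019 t know")
```

The rule is deliberately blunt: a genuine trailing apostrophe (a plural possessive followed by another word) would also be joined to its neighbour, so it suits languages and corpora where that pattern is rare.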

  22. Hello, mentioned above is the lists are being moved to a different host. Is this still in progress? I can’t find a date on any of these entries, so I have no idea if this is an abandoned project or not. I am particularly interested in a Korean list, but the others as well.

    Thank you,
    2/28/2012

  23. i am studying english and i translate the most common words in my language to english to check if i have missing word. that’s very useful really appreciate it

  24. Hi! I need to make my own frequency lists out of some documents I have in Chinese. Is there anyway you could share the lemmatizer or concordancer you used for Mandarin? 

    I really appreciate your help. I use your Chinese frequency list almost every day.

    • well i have a bit of c# code that churns through files. what format of files do you have ? are they utf-8 / unicode text files ? are they xml files. i have two sets routines, 1 deals with data in text files and another in specialised xml files

    • Did you really get the Chinese wordlist to work? I downloaded zh_50K.txt but no matter what options I choose open in Microsoft Word and Open Office (both on Mac) it just displays corrupted characters. Any way around this?

  25. Hi Dave,
    I’m creating a MS Word add-in to optimize the AutoCorrect. The add-in is useful to shorten typing, not to correct typographical errors. One part of its function is to shorten typing English words with suffix.

    For example: caref – careful, darkn – darkness, acceptg -accepting.

    So, I need a list of English words (probably around 4000 words) to create a database that will be used for the add-ins. I intend to share the add-in for the public, this is a freeware and an open source. Can I get your permission to use the list of English words from your Word Frequency List.
    Thank you.

    Akuini

  26. Pingback: Voyage au bout de la langue | Ressources pour apprendre le hongrois

  27. Thanks for the great work!
    I use your list as input for a password/passphrase generator.

    If you’re interested in another comprehensive wordlist (en_UK) with frequency classification, I also found this one:
    http://www.bckelk.ukfsn.org/words/wlist.zip

    While this list is not as long (57k words), things such as slang, typos and abbreviations are omitted.

  28. I’ve been using the German wordlist for some Psychology experiments. We needed emotionally neutral words for the task we designed, and the top few were a great starting point. Extremely helpful, thanks very, very much!

  29. Pingback: La recette pour apprendre une langue étrangère facilement

  30. Pingback: Towards the free and open way of learning Spanish « ma.juii.net

  31. Hi. Well done for a great site. Such a great resource. I am developing a word game and plan to use the lists for foreign language versions. It’s a crowded market so don’t expect to make any money but enjoying doing it. I plan to explain about the source of the word list but if you have developed a more “formal” list of any languages except for English, then I would be interested in obtaining (and paying) for them. If not then, your excellent list is a fantastic fall back position for me. Regards. Mitch.

    • Mitch,

      the ones available are the last version I built. I have created a smaller subset that is slightly cleaner :) I use 25-30K lists for Slydr and now for a word game called Wordastic

      I’d like to emphasize the slightly clean… remove words with odd chars etc.. nothing massive.

      • Hi Dave. Thanks for your prompt reply. I think I am doing more clean up than that so I will continue working on your files (except English). This might seem a bit mad but hey ho, you never know … if you ever need a list of words in alphabetical order, where the frequency has been deleted, virtually all the words contain at least three consonants, all accents/acutes/etc have been either replaced with ordinary caps or deleted, no dupes then I’m your man! Did I say a BIT mad! It’s what I need for my app.
        Once again, thanks for a great resource. If I publish my game I’ll let you know!
        Regards.
        Mitch.
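Part of the clean-up Mitch describes (drop the frequency, strip accents, remove duplicates, sort alphabetically) can be sketched with the standard library. The function names are made up for illustration, and the consonant rule and upper-casing are left out.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def plain_alphabetical(entries):
    """Turn (word, count) pairs into a sorted, de-duplicated list of bare words."""
    return sorted({strip_accents(word) for word, _count in entries})


words = plain_alphabetical([("café", 5), ("cafe", 3), ("über", 2)])
```

Letters that are not built from a base letter plus a combining mark (for example "ø" or "ł") pass through unchanged, so they would need an explicit replacement table.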

  32. Hi Hermit ,

    Thanks for the wonderful collection of word lists from various languages!

    I am using some of the English word lists as a marker against a dictionary word list in my word game to determine the difficulty level for individual words. It is an indirect use of your work. I would like to know how I can acknowledge/credit you?

    Cheers ,
    Prasad

  33. Hi Hermit,

    Thanks for your prompt reply. The game is called “Code-Z” and it is now available in English on Android (http://market.android.com/search?q=pname%3Acom.codez). I have added your name and web site in the Credits/Acknowledgment section.

    Pretty soon we will be rolling out a German version and I have used your wordlist as a difficulty marker for it too. I am planning for more languages. Will keep you updated.

    Cheers ,
    Prasad

  34. There is something strange ( file: en.zip from line 180394)

    makе 3
    іntο 3
    mergеr 3
    саse 3
    οwn 3
    сhіnese 3
    againѕt 3
    ѕех 3
    оther 3
    аlwаys 3
    firѕt 3
    сοuld 3
    tаlk 3
    αуе 3
    wе’ve 3
    іѕn’t 3
    mіght 3
    сοmе 3
    gоd 3
    gіvе 3
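The entries above look like English words, but several of their letters are look-alike characters from other alphabets (Cyrillic or Greek), which is why they show up as separate low-frequency entries. A minimal sketch for flagging such tokens, using Unicode character names to guess the script of each letter; the function names are hypothetical.

```python
import unicodedata


def scripts_in(word):
    """Return the set of script names (first word of each letter's Unicode name)."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def is_mixed_script(word):
    """True when a single token mixes letters from more than one script."""
    return len(scripts_in(word)) > 1


# "mak" followed by CYRILLIC SMALL LETTER IE, written as an escape for clarity.
suspicious = is_mixed_script("mak\u0435")
```

A token written entirely in look-alike letters from one script is not caught by this check; it only flags mixtures within one word.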

    • Andrea.. i am not sure i fully understand what you asked. I did document how many files were consumed and the unique / total word counts.. what other parameters are you looking for ?

  35. Hi Hermit, sorry if it took so long to respond. what I actually need is a document frequency list: a frequency list of the documents where that specific word appears… do you think this is doable for you to do?

    cheers!

    am

    • whoops, maybe I was not clear enough… in other words, a list where the frequency number corresponds to the number of documents where that word is to be found.

      tks++

      am
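What is being asked for here is usually called document frequency: each word is counted once per file, however often it appears inside that file. A minimal sketch, assuming each document is already available as a string and using plain whitespace tokenisation (both assumptions); the function name is made up.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for text in documents:
        # set() ensures a word is counted at most once per document.
        df.update(set(text.lower().split()))
    return df


df = document_frequency(["the cat the", "the dog"])
```

Here "the" appears three times in total but its document frequency is 2, because it occurs in two documents.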

  36. Dave, thanks for your work, it can be put to so many uses. I have recently learned that the creator of a smartphone keyboard (which all use wordlists for prediction/validation) used your lists (as one input among others). I have now noted that there seem to be an unusually high number of words that are falsely spelt in lower case instead of capitals. In many languages this only affects proper nouns (which is bad enough), but for some languages which use capitals for regular nouns (like German), your list needs a lot of cleaning up before it can be relied on.
    So my question is: do you do any processing that can cause this effect, or are all these errors really in the subtitle files?
    Feel free to answer by e-mail if you like.

    Best regards from Nuremberg,
    Endre

    • the repo i used is an open source repo. additionally i have little knowledge of how capital letters appear in non english languages. so the two combined meant that i wouldn’t know if i can rely on each creator using the capitalisation as required and whether i can actually understand that part.

      for that reason, i force all words to lower case to build the frequency word list. i myself used it in Slydr (keyboard like app in Windows Phone) i created last year. i can look further but again without language specific input, i am helpless :(

  37. Thanks for the explanation, Dave. I guess it probably depends on what you actually want to do with the lists. If you want to use the frequency data alone (whatever they might be good for on their own) forcing lower case might be fine. However, if you want to re-use the actual words, I’m afraid that forcing all words to lower case might do more harm than good, particularly (but not only) in languages with extensive use of capitalisation like German.

    All other corpus-based word lists l’ve come across so far leave the data untouched¹. After all, anyone can convert a list to lower case (and recompute frequencies if desired) with minimal effort. However, restoring the original state from a lower-case list is impossible without external sources (and difficult even *with* external help, for instance dictionaries or spellcheckers). So here’s an emphatic vote to leave the data unchanged, even if this might mean double entries for many words – which again would also carry potentially useful information, eg. on the likelihood of occurrence of a particular lemma at the start of an utterance.

    ¹ I have seen one corpus-based list carry additional entries (with asterisks, e.g. That* or Man*) for upper-case occurrences of words whose dictionary form is lower-case, presumably where context indicated that the upper case was attributable to the position of the word (beginning of sentence or paragraph). Similarly, in your case one might argue that in occurrences where it’s reasonably likely that the capitalisation is due to the word’s position (rather than being a basic attribute of the word like in proper nouns), it makes sense to convert to lower case before processing (e.g. computing frequencies). This way, you might even provide added value to the users of the lists (who don’t have context information to make that distinction). In contrast, with indiscriminate lc conversion you do what any user can do if they want (so no real value added), but at the same time you corrupt the list for many uses.

    • Endre,

      You make a fine point. its easy to rework to compute frequencies in lower case and persist in case specific word. I however need a few days. Thanks for persisting and pushing your logic in clear manner.

      Hermit
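One way to read "compute frequencies in lower case and persist the case-specific word" is sketched below: counts are summed per lower-cased key, and the spelling that occurred most often is kept as the entry. This is an interpretation for illustration, not the code that was eventually used, and `merge_case_variants` is a hypothetical name.

```python
from collections import defaultdict


def merge_case_variants(entries):
    """Merge (word, count) pairs that differ only by case.

    The summed count is reported against the most frequent original spelling.
    """
    totals = defaultdict(int)
    best = {}
    for word, count in entries:
        key = word.lower()
        totals[key] += count
        if key not in best or count > best[key][1]:
            best[key] = (word, count)
    merged = [(best[key][0], totals[key]) for key in totals]
    merged.sort(key=lambda item: (-item[1], item[0]))
    return merged


merged = merge_case_variants([("Haus", 90), ("haus", 10), ("und", 500), ("Und", 40)])
```

In this made-up German sample the noun keeps its capital while the conjunction stays lower case. The reverse operation is not possible from an all-lower-case list, which is the point made in the comment above.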

  38. Hi Dave,
    I’m still trying to open the Hebrew file with Ms word on OSX, trying 20 different encodings incl. unicode utf8 and I still get gibberish, the same with nearly all the files, is there a quick fix? Also concerning the english files, did you come across a resource with PoS and/or IPA translations? Kind regards, Nick

  39. Thanks a lot for the lists, Hermit! Now, three brief questions:
    1) What version of the opensubtitles corpora did you use? Did you use only this source for the lists?
    2) In the end, did you use all the available translations for each movie or just one?
    3) For some reason, the Hebrew list seem to have a high degree of dissimilarity with other equivalent lists from purely written language. Any idea?

    best,
    Michael

    • I found the difference – the Hebrew list does not have 1-character letters! Perhaps you did not actualize it? I remember that at some point you were not including 1-char words. Am I right?

      • that is correct. I however tend to rebuild them at the same time and I am sure I did add single character entries after some discussion here. let me check it tomorrow and if required rerun the code again

  40. Hi Dave,
    Thanks for the frequency lists! I am writing a little class term paper (totally nonbinding) and need to explain the source of the Bulgarian corpus (besides the info you put in the log text file). Do you know where you took the Bulgarian frequency lists from? Was it from the Bulgarian National Corpus (that is written) or did you also get info from a spoken corpus? How do I quote your frequency lists? Thanks for your help!

    Best,
    Svetlin

  41. Hi Hermit,

    I have a question about the license.
    I didn’t find your email-adress anywhere, so could you drop me a line?
    You can also reach me on Twitter: @lukaskawerau

    Would love to hear from you,
    Lukas

  42. Hi Dave. Awesome list. For each corpus did you only use subtitles for movies that were in their native language (i.e., only French film subtitles like Amélie for the French corpus)? Or did you also include subtitles that were translated from different languages?

  43. The corpora include translated material (in fact in most languages other than English, an overwhelming majority of the corpus will consist of subtitles translated from English). This does introduce a certain skew that is particularly noticeable with names – while non-English corpora typically contain English names in dozens of variants, many names from the respective language are missing or underrepresented. All in all, the picture that the corpora give you represent the language as it is used in blockbusters in your local cineplex, *not* a more general picture of the language at large. That’s an inevitable consequence of the sample used and not a deficiency per se, just something to keep in mind.

  44. Hi Hermit,

    How did you make these? Did you use a script? I ask because I would like to find, or build myself, such a list for specific purposes: narrow subjects such as the most frequent words for nurses, lawyers, construction workers and other such groups. I would very much appreciate any help in building such lists, or any tools or advice on how to do it relatively easily, quickly and cheaply, as
    Hoping to hear from you soon, I wish you all the best.

    Monika

      • If I don’t find any ready-made lists, I will have to do it by hand, putting words and expressions into an Excel document. There is no single source. I have found a few books for teaching English as a second language for specific purposes (law, nursing or medicine, for instance), and I will also use suitable dictionaries, textbooks for students of law, medicine and nursing school, some websites where this kind of vocabulary is used, etc. It will take me weeks of hard work :( So I will probably end up with lists of a few thousand words for some of the disciplines. Then I have to check the frequency of every word or expression (maybe with a Google search) to choose the most frequent ones. That is very hard work, so I am looking for any way to do it faster or more easily. So far I have not found a better idea. Do you have any, maybe?

      • The way I have done it is using files. I can share my solution with you, which scans all files in a directory and generates a word list from them. All you’d need to do is create the relevant files and run the app. Let me know if that’s good enough.

      • Hi Hermit,
        I’m so sorry to answer you after such a long time, but I had so much work and other obligations… Yes, I think your proposal is very kind and your application is sufficient for my needs. I would appreciate it very much if you shared your application with me. I will then build a folder with the documents I have already found and run your application. It would save me a lot of time and hard work.
        Kind regards,
        Monika

      • Hello Hermit,
        thank you very much for helping me. I have finally managed to use your application :) It works :) You’re an angel :) Thank you. Tell me:
        1) Does it scan only one text document in the folder, or all .txt documents in the folder? It scanned only one of the .txt files in the chosen folder.
        2) Do you have any idea how to split the results, which are in one “column” in the .txt document (words and numbers together), into 2 columns in an Excel document, with numbers in column A and words in column B?
        Regards,
        Monika

      • Monika,

        I just ran a test with 3 English files and it processed them all, based on a random word check in the list. If in doubt, email them to me: hermitd at Hotmaildotcom

        To move columns etc., it’s easiest to open the file in Excel (delimited file option) and then cut / copy / paste as needed.

        Hermit
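
For anyone who would rather script the column split than do it by hand in Excel, here is a minimal sketch in Python. It assumes the list is in the "word number" format described at the top of the page; the function name `word_list_to_csv` and the file names are made up for illustration and are not part of FrequencyWordsHelper.

```python
import csv


def word_list_to_csv(word_list_file, csv_file):
    """Turn 'word number' lines into a CSV with the number in column A
    and the word in column B. Returns how many rows were written."""
    rows_written = 0
    # utf-8-sig adds a byte order mark, which usually helps Excel
    # recognise the file as UTF-8 (an assumption about your Excel setup).
    with open(word_list_file, encoding="utf-8") as src, \
            open(csv_file, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            # Split on the LAST run of whitespace: word left, number right.
            parts = line.strip().rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            word, number = parts
            writer.writerow([number, word])
            rows_written += 1
    return rows_written
```

Calling `word_list_to_csv("en.txt", "en.csv")` and opening `en.csv` in Excel should give two separate columns without going through the delimited-file import.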

      • OK, I will try again with another folder and other files :) It was quite hard for me the first time; at first I couldn’t understand what it was all about :) Maybe the second time will be easier :) I need to get used to this application. From today I will not have any access to a computer or the internet for 5 days, so I will let you know next week if I succeed this time :) Thank you, Hermit, for helping me. You’re very kind to me.

      • Hello Hermit,

        thank you very much for your app.
        I have finally managed to run it. It worked this time: all the .txt files were scanned and I got a frequency list covering all of them. It will help my teaching a lot!

        Would it be possible in any way to use it for PDF documents, as I have a lot of books in PDF format, or to make a frequency list from some websites? For instance, I would like to prepare frequency lists for my students from online newspapers like Le Figaro or Le Monde.

        Cheers,
        Monika

      • Glad it worked. The problems with PDF are manifold: a file can be text plus images, images only, etc. It’s difficult to work out. The easier solution is to extract the text from the PDF and operate on the extracted data.

        PDF readers have an option for saving the contents as a text file.

        Websites are easier, but they are a different kettle of fish, as the code has to deal with markup etc. It’s not difficult, just annoying, as it’s easy to break such a mechanism. Plus, websites do not like screen scraping and move swiftly to block the IPs.

        Hermit
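
To give an idea of what "dealing with markup" involves, here is a rough sketch in Python using only the standard library's `html.parser`. It takes HTML you have already saved to disk or loaded into a string and keeps just the visible text, which could then be written to a .txt file and counted like any other. The names `TextExtractor` and `html_to_text` are invented for this example; this is not how FrequencyWordsHelper works, and real pages will need more care than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text between tags, ignoring <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

One known limitation of this sketch: because chunks are joined with spaces, a word that is split by an inline tag in the middle would be counted as two pieces.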

  45. Hello everyone, I have just downloaded a Serbian Latin file. Maybe I don’t know how to use it, because my list contains all the words (e.g. Mozda, Zasto…) but not their meanings -_- It is displayed like this:
    “samo 178651
    od 167786
    bi 163893 ”
    How can I get the meanings of these words? I don’t really trust certain sites that give several different definitions. Could you help me, please? Thank you.

  46. Thanks a lot for this, it’s made it very simple for me to prioritise the most important words in the language learning site I’m working on.

    (PS: If you’re interested it’s here: http://readlang.com, and I’ll be adding an attributions page to my site soon, it’s all a bit rough at the moment :) )

  47. Hi Dave,

    Thanks for your work. I downloaded the Simplified Chinese list but found there are a lot of Traditional characters in it. Maybe the resources you use are a mixture of Traditional Chinese and Simplified Chinese? Not sure. But it’s very useful anyway.

    • Jarvis,

      I have two sets of files: one was a common Chinese dictionary that had words in both Simplified and Traditional Chinese side by side, and the other was the subtitles, which I thought were in only one of the two, though I can’t remember which. I used the subtitles word list and then the dictionary to build a list for both.

      Unfortunately I don’t know either well enough to tell whether the characters are mixed. I apologize.

  48. Hello. My name is Paul. Thanks for these word lists! I will use the word list ‘en.zip’ for my research on the readability of English articles. May I ask how you built these word lists? I would like to be able to build one myself if needed (because I may build a word list for ESL/EFL learners).

    • Paul,

      Assuming you have tons of relevant text material, building the list involves going through it and keeping track of each word and its number of occurrences.

      The larger the data set, the better the results.
      I am going to post a simple program later today that can do just that.
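
In the meantime, the counting step can be sketched in a few lines of Python. This is only an illustration of the idea under some assumptions, not the .NET program itself: it assumes UTF-8 .txt files in a single folder, treats any run of letters (optionally with apostrophes inside) as a word, and lowercases everything. The function names and the `min_length` parameter are made up for this example.

```python
import re
from collections import Counter
from pathlib import Path

# A "word" here is a run of Unicode letters, optionally joined by
# apostrophes (e.g. don't). This tokenisation rule is an assumption.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def count_words(directory, min_length=1):
    """Count word occurrences across every .txt file in `directory`."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for word in WORD_RE.findall(text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts


def write_word_list(counts, output_file):
    """Write 'word number' lines, most frequent word first."""
    with open(output_file, "w", encoding="utf-8") as out:
        for word, number in counts.most_common():
            out.write(f"{word} {number}\n")
```

For example, `write_word_list(count_words("my_corpus"), "my_list.lst")` would produce a file in the same "word number" format as the lists on this page. Setting `min_length=2` drops single-character entries, which matters for languages where one-letter words are real words.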

  49. Thank you very much! It is very hard to find a word list of Estonian vocabulary. I actually use yours to learn Estonian.

  50. Pingback: ASETNIOP BLOG » One-Dimensional Keyboard Hack

