My name is Hermit. It really is! Hermit Dave and all. I work on, and am heavily opinionated about, the Microsoft stack. Federalist and a conservative in political inclination. Atheist, if you must know, and I don’t really care about religious inclinations as long as you don’t bug me about them. I prefer dry humour – as dry as it gets.
I do almost everything (blogging as well) for the wrong reasons. What can I say, people like me get their kicks in their own weird ways. You have been warned!
Did you see my update on the MSDN forum?
Yes, I did… however, I had my hands full, plus we already knew that there were some limitations.
Hello Hermit Dave,
My name is Kriya, I am a student at Reed College currently working on a psycholinguistics research project. As part of our project, I need to find the frequency of use of a certain set of Spanish words. We would like to use your site as a source of information, and I wanted to know if that was alright. If so, how would you like us to cite you? Thank you for all of your work!
Kriya and team
I think Hermit Dave should be just fine. Go ahead. Let me know if you need any input from my side.
Excellent! Thank you so much. Also, we have a quick question: when creating your frequencies, what are they out of? That is, if the word cat appears 4000 times, does that mean 4000 times out of 50000 words, or perhaps 100000, etc.
Okay, each word has an associated frequency. If the list had two words, cat with a value of 400 and dog with a value of 100, this would imply that the total number of words under consideration was 500, and the frequency of each word gives its relative usage. If you sum the frequencies of all the words, you get the overall count. These were generated from unique subtitles.
I have uploaded log files that contain general stats that might be helpful. There is one log file per language.
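To make the cat and dog arithmetic concrete, here is a minimal sketch in Python. The function name `relative_frequencies` and the in-memory dictionary are assumptions made for illustration only; they are not part of the published lists or of the program that produced them.

```python
def relative_frequencies(counts):
    """Turn raw word counts into shares of the overall total.

    `counts` maps each word to the number of times it was seen.
    (Hypothetical helper, for illustration only.)
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# The two-word example from the reply above.
counts = {"cat": 400, "dog": 100}

total = sum(counts.values())
shares = relative_frequencies(counts)

print(total)   # 500
print(shares)  # {'cat': 0.8, 'dog': 0.2}
```

So the answer to "out of how many" for any list is the sum of every value in that list.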
I would like to use subtitles for a linguistic project.
Can you give me some hint on how to obtain them from opensubtitles.org?
I saw their API, but it only allows 200 downloads a day. I would like to use subtitles from the same movie in many different languages.
I am not entirely sure. I managed to find myself a 50 GB repo of XML-based subs.
Checking OpenSubtitles right now: I moved on to http://opus.lingfil.uu.se/, which is the open parallel corpus. It has a link to download a 137 GB 2012 subtitle collection. Go crazy.
Hello Hermit David,
I’m Sana. I have just moved to the UK to work on a project on the Arabic language @ Univ, but to start I need to know the frequency of Arabic words. I was reading what Kriya was asking you and I have the same question. Also, is it possible to run your program on a Mac? I feel like an alien, having never used one, but I should learn.
Thank you so much !
Whilst I don’t have anything at present, I do have something that will compile my code into a Mac-compatible build. Will update when possible.
Hello Hermit Dave,
I’m a PhD student in Linguistics and I’m working on word frequencies in Lithuanian and Latvian. I would like to use your site as a source of information. Do you agree? I would also like to learn more about how your lists were built. From which kind of corpora did you extract the words? Were those texts originally written in, say, Latvian, or are they just translations from another language?
Thanks for your great work.
Greetings from Italy!
I used the OpenSubtitles corpus, the repository of which is available here: http://opus.lingfil.uu.se/OpenSubtitles2012.php (including a link to an earlier one).
The files are XML files that contain the words of movie dialogue in native languages. I created a program to select a single set of files for a given movie and to pull out the words.
The words are then added to a list, and a count of how many times each word occurs is maintained.
At the end of the process, an output file is created with each word and its associated frequency.
Feel free to consume this as you wish. Yes, you can cite me as the source for this.
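For anyone who wants to try something similar, the steps above can be sketched roughly as follows in Python. This is not the original program. It assumes each subtitle document wraps every token in a `<w>` element, that lowercasing words is acceptable, and that "a single set of files per movie" can be approximated by taking the first document listed for each movie. The names `pick_one_per_movie`, `count_words`, `format_frequency_list` and the inline `SAMPLE` data are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def pick_one_per_movie(docs_by_movie):
    """Keep one subtitle document per movie (assumption: simply the first)."""
    return [docs[0] for docs in docs_by_movie.values() if docs]


def count_words(xml_documents):
    """Count how often each word occurs across the given XML documents."""
    counts = Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        # Assumption: every token sits in its own <w> element.
        for token in root.iter("w"):
            word = (token.text or "").strip().lower()
            if word:
                counts[word] += 1
    return counts


def format_frequency_list(counts):
    """Render one 'word count' pair per line, most frequent first."""
    return "\n".join(f"{word} {count}" for word, count in counts.most_common())


# Tiny made-up sample: movie-1 has two copies of the same subtitles,
# and only the first copy is counted.
SAMPLE = {
    "movie-1": [
        "<document><s><w>The</w><w>cat</w><w>sat</w></s></document>",
        "<document><s><w>The</w><w>cat</w><w>sat</w></s></document>",
    ],
    "movie-2": ["<document><s><w>the</w><w>dog</w></s></document>"],
}

counts = count_words(pick_one_per_movie(SAMPLE))
print(format_frequency_list(counts))
```

Picking one document per movie is what keeps a popular film with many uploaded subtitle versions from inflating the counts. For a real corpus you would read the documents from disk and write the result to a file instead of printing it.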
Here’s to let you know I’m using your Hebrew list for an authorship analysis project. Cited of course.
Hi Hermit, I’ve been letting my students have a go at some of the western European word lists for practice with UNIX shell scripting and have done some curious and fun analyses of the English and Spanish word sets. Would be interested in sharing some of the results with you, but can’t find your email address. Cheers, David.
David, feel free to email me, my email is firstname.lastname@example.org