Why is nothing ever easy?
I managed to get Frequency List Builder up and running in no time. It can work pretty fast and gives me decent output.
What do I find? The input has errors. Of course, that is bound to happen. So what do I do? Well, I change my code to work around the common ones: ’il is actually ’ll, mr. is the same as mr, and so on.
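For the curious, here is roughly what that kind of workaround can look like. This is a minimal sketch in Python, not the actual Frequency List Builder code: the names `normalize` and `count_words` are made up for illustration, and it only handles the two cases mentioned above.

```python
from collections import Counter


def normalize(token: str) -> str:
    """Fold a raw token into the form that gets counted (hypothetical helper)."""
    # Lowercase and turn curly quotes into plain apostrophes.
    token = token.lower().replace("\u2018", "'").replace("\u2019", "'")
    # Drop trailing periods so "mr." and "mr" land in the same bucket.
    # Assumption: sentence-final periods should go as well.
    token = token.rstrip(".")
    # Repair the common typo 'il -> 'll (e.g. "i'il" -> "i'll").
    if token.endswith("'il"):
        token = token[:-3] + "'ll"
    return token


def count_words(text: str) -> Counter:
    """Count normalized, whitespace-separated tokens, skipping empty ones."""
    tokens = (normalize(raw) for raw in text.split())
    return Counter(tok for tok in tokens if tok)
```

Splitting on whitespace is itself an assumption that happens to hold for English and will not hold for every language.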
So what happens next? I test it and it's all beautiful. Except I have no idea how words are structured in non-English languages. Damn!
The XML wiki dumps are just a pile of XML/HTML mess. I’ll probably have to write more code to strip the unwanted data before I start looking for real data.
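A first stab at that stripping step might look something like the sketch below. It is a rough, regex-based cleanup in Python, assuming the page text has already been pulled out of the dump as a string. The name `strip_markup` and the choice of what counts as "unwanted" are my own assumptions, and real wiki markup has plenty of cases this will miss (tables, references and so on).

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML/XML tags such as <br />
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")     # innermost {{...}} templates
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")  # [[target|label]]


def strip_markup(raw: str) -> str:
    """Crudely reduce wiki/HTML markup to plain text (hypothetical helper)."""
    # Dumps often escape embedded HTML, e.g. &lt;br /&gt;.
    text = html.unescape(raw)
    # Remove templates from the inside out until nothing changes,
    # which copes with nesting like {{a|{{b}}}}.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE_RE.sub("", text)
    # Keep only the visible label of internal links.
    text = LINK_RE.sub(r"\1", text)
    # Replace tags with a space so neighbouring words do not fuse.
    text = TAG_RE.sub(" ", text)
    # Drop bold/italic quote markers.
    text = text.replace("'''", "").replace("''", "")
    # Collapse runs of whitespace.
    return " ".join(text.split())
```

The tag removal keeps whatever text sat between the tags, so footnote contents and the like would still leak into the word counts and need their own handling.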