Frequency Word List is driving me mad

Why is nothing ever easy?
I managed to get Frequency List Builder up and running in no time. It can work pretty fast and gives me decent output.

What do I find? The input has errors. Of course, that was bound to happen. So what do I do? Well, I change my code to work around the common ones: ’il is actually ’ll, mr. is the same as mr, and so on.
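For the curious, here is a rough Python sketch of that kind of workaround. It is not the actual Frequency List Builder code. The names `normalize` and `count_words` are made up for illustration, and the only rules in it are the two examples above.

```python
import re
from collections import Counter

# Plain and curly apostrophes, since the input mixes them.
APOS = "['\u2018\u2019]"

# A word: letters, optionally joined by apostrophes, with an optional trailing period.
WORD = re.compile(r"[^\W\d_]+(?:" + APOS + r"[^\W\d_]+)*\.?")

def normalize(token: str) -> str:
    """Apply the fix-ups (hypothetical rules taken from the two examples)."""
    token = token.lower()
    # Unify curly apostrophes into a plain one.
    token = re.sub("[\u2018\u2019]", "'", token)
    # A word ending in 'il is treated as a misread 'll.
    token = re.sub(r"'il$", "'ll", token)
    # Drop a trailing period so "mr." and "mr" count as one word.
    return token.rstrip(".")

def count_words(text: str) -> Counter:
    """Count normalized words in a piece of text."""
    return Counter(normalize(t) for t in WORD.findall(text))
```

Something like `count_words("Mr. Smith and mr Jones").most_common()` would then put `mr` at the top with a count of 2. Note that the ’il rule only makes sense for English: in French, "qu’il" is a perfectly good word and would get mangled.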

So what happens next? I test it and it's all beautiful. Except I have no idea how words are structured in non-English languages. Damn!

The XML wiki dumps are just a pile of XML/HTML mess. I’ll probably have to write more code to strip the unwanted data before I start looking for real data.
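A first pass at that stripping step might look something like the sketch below. It assumes the dump keeps the article body in `<text>` elements, and it only handles the simplest markup: leftover HTML tags, non-nested {{templates}} and [[links]]. The function names `clean` and `page_texts` are my own inventions, and real wiki markup is a lot messier than this.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

TAG = re.compile(r"<[^>]+>")                          # leftover HTML tags like <br/> or <ref>
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")              # simple, non-nested {{templates}}
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")  # [[target|label]] -> label

def clean(raw: str) -> str:
    """Strip the most common markup and collapse whitespace."""
    text = unescape(raw)             # decode entities such as &nbsp;
    text = TEMPLATE.sub(" ", text)
    text = LINK.sub(r"\1", text)
    text = TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def page_texts(source):
    """Yield the cleaned body of every <text> element in an XML dump.

    `source` is a file name or a binary file object. The dump is streamed,
    so the whole file never has to fit in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Ignore any XML namespace prefix on the tag name.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield clean(elem.text or "")
        elem.clear()  # free the element once it has been handled
```

Each string that comes out of `page_texts` could then be fed to the word counter. Nested templates, tables and reference blocks would still slip through, so this is a starting point and not a proper wiki parser.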


