11 May 2005

I broke my Word

MS Word that is.

Actually, I didn't break it, but I pushed it far enough to throw its word-count function off by almost 70,000. I've got this project that requires alphabetizing huge lists of words, and I challenged myself to write a Java program to do it for me (and it works).

Before I knew the actual word count of the test file, I opened it with Word, which told me there were 251,092 words in it. I ran the alphabetizer, and it ran for hours, long enough that I got frustrated and stopped it. To be sure it wasn't simply seizing up, I added a feature that prints onscreen the number of words alphabetized so far. Indeed, it took several hours before the number approached 250K. The rate of alphabetization slows down as the list of items to compare against grows.
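That kind of slowdown is what you would expect if each new word is compared against the already sorted list one item at a time. The original program isn't shown here, so that is only a guess. As a rough sketch under that assumption, the hypothetical `Alphabetizer` class below puts such an insert-one-at-a-time approach (with a progress counter) next to Java's built-in sort, which needs far fewer comparisons on a list this size. The class name, method names, and case-insensitive ordering are all illustrative choices, not the program from this post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class Alphabetizer {
    // Assumed "slow" approach: scan the sorted list from the start to find
    // where each new word belongs. The scan gets longer as the list grows,
    // so total work grows roughly with the square of the word count.
    static List<String> insertionSort(List<String> words, int reportEvery) {
        List<String> sorted = new ArrayList<>();
        for (String word : words) {
            int i = 0;
            while (i < sorted.size() && sorted.get(i).compareToIgnoreCase(word) <= 0) {
                i++;
            }
            sorted.add(i, word);
            // Progress counter, so a long run can be told apart from a hang.
            if (reportEvery > 0 && sorted.size() % reportEvery == 0) {
                System.out.println(sorted.size() + " words alphabetized");
            }
        }
        return sorted;
    }

    // Built-in alternative: sorts a copy using the same case-insensitive order.
    static List<String> librarySort(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);
        return sorted;
    }

    // Usage: java Alphabetizer input.txt output.txt (one word per line).
    public static void main(String[] args) throws IOException {
        List<String> words = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> sorted = librarySort(words);
        Files.write(Paths.get(args[1]), sorted, StandardCharsets.UTF_8);
        System.out.println(sorted.size() + " words alphabetized");
    }
}
```

Both methods return the same ordering; only the amount of comparing differs.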

But then it kept going, far beyond 250K, and finally stopped at 319,604 words. I briefly entertained the notion that I had reached the upper limit of Word's word-counting capacity. Maybe, but Word correctly counted the number of pages - 5608, which it took several minutes to tally. At 57 lines per page and 1 word per line, 319,604 words fill 5607 pages (319,599 lines) and leave 5 lines for page 5608, so the page count is exactly right.

In fact, the counting tool is pretty precise. I added two words to the file, and the word count reflected that. The issue is what counts as a word - a single apostrophe does, but anything in all caps does not. I therefore presume my file has around 70K all-caps words (319,604 - 251,092 = 68,512, to be exact).
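One way to check that presumption would be to count the all-caps lines directly and compare the result with the gap between the two totals (319,604 - 251,092 = 68,512). Here is a minimal sketch, assuming one word per line. The `CapsCounter` class is hypothetical, and its definition of "all caps" (at least one letter, no lowercase letters) is only a guess at what Word skips.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class CapsCounter {
    // Assumed definition: at least one letter and no lowercase letters.
    // A lone apostrophe has no letters, so it is not treated as all caps.
    static boolean isAllCaps(String word) {
        boolean sawLetter = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isLowerCase(c)) {
                return false;
            }
            if (Character.isLetter(c)) {
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    // Counts how many lines qualify as all caps under the definition above.
    static long countAllCaps(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            if (isAllCaps(line.trim())) {
                count++;
            }
        }
        return count;
    }

    // Usage: java CapsCounter words.txt
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        long caps = countAllCaps(lines);
        System.out.println(lines.size() + " lines, " + caps + " in all caps");
    }
}
```

If the all-caps count comes out near 68,512, the explanation holds; if not, Word is skipping something else too.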

Just out of curiosity, I decided to see how long it would take Word to alphabetize the lines in the same file. But Word says "The document is too big for word to handle".
