I have an array that contains 267,751 items. (It's the sowpods word list.) I populate the array by reading it from a text file. I need to search this array to see if a user-supplied permutation of letters is or is not a valid word.
I initially suspected Excel's own built-in functions would result in the fastest way to search it. And I came across this neat piece of code that's just one line long, using FILTER.
Very short and simple. And it works great. All search strings that are valid words are indicated as such, and all search strings that are not words return a value of 0.Please Login or Register to view this content.
However, it's terribly, terribly slow. Searching for just 5,000 possible words took almost 30 seconds.
I then tried the Excel VLOOKUP function. First question... does VLOOKUP work with internal arrays? I couldn't find a valid example and I couldn't get it to work myself. And so then I decided to simply list the contents of the array in Column A (267,751 rows) and use VLOOKUP with this column. (The normal way you might use VLOOKUP.)
The results also proved terribly slow.
MATCH wasn't much better. Using the MATCH function and using the list of words in Column A, MATCH was able to better the results of VLOOKUP by the tiniest of margins.
Because the array is already sorted, this gave me the idea to try a binary search. And so I wrote a little algorithm to do just that.
I remember a piece of code I wrote many years ago that sped up a binary search because you can pinpoint where to start looking.
When I read in the array initially, from the word list, I take a moment and index the first few letters. I'll give a quick example of what I mean. The word "spectacular" is at position 219,425 in the sowpods word list (and in my array). Normally you would just start at the middle of the array (position 133,876) and check to see if the word your searching is equal to, greater than, or less than the word at this position. If it's greater than, as in this case, you then split the difference and search again. And so my next search would be at index 200,813 (133,876 + 267,751) / 2 = 200,813.
And so on and so on. Eventually you either find the word (which doesn't take as many comparisons as you might think, but if you're familiar with binary searching you know this) or you've determined it doesn't exist in the word list at all.
By indexing the first few letters, I'm able to start my search much closer to where the word will be. (Again, if it's there at all.)
For example, the first "spe" word in the word list is the word "speak." I know from my index that the first "spe" word is at position 219,267.
The next indexed word is the word "sphacelate." (Because it's the first word after "speak that doesn't begin with "spe.") This is at position 219,831. If "spectacular" is going to be in this list, it must be in this range, and I begin looking for it at position 219,549. (219,267 + 219,831) / 2 = 219,549.
And I discovered that starting at position 219,549 and working from there, instead of starting at position 133,876 (the middle of the array) proved to be faster. (I don't recall how much faster. I do recall trying to improve the speed further and indexing the fourth letter of each word, instead of just the first three letters, but that gave me diminishing returns.)
Anyway, the difference with my own binary sort algorithm and Excel's FILTER function and Excel's VLOOKUP function and Excel's MATCH function was like night and day.
Night and day.
Using my binary search algorithm, I can search for 267,751 different items in less than 2 seconds. (I test the algorithm by looking for every word in the array, one at a time, just to prove the algorithm does find it.)
Compare this to the times mentioned above:
FILTER = 5,000 different words searched and found in 29 seconds. (Did not want to test 50,000 words.)
VLOOKUP = 50,000 different words search and found in 73 seconds.
MATCH = 50,000 different words searched and found in 69 seconds.
own binary sort = all 267,751 words searched and found in 1.77 seconds (1,770 milliseconds)
(The binary sort was fast enough I decided to time it in milliseconds)
Now, that's great but can I do better? Question #2: Is there an Excel trick or function that I'm not aware of that can give me even better results? It just seems to me Excel should have some kind of a built in function or trick that can search faster and out-perform a user algorithm.
(By the way, be careful when looking for the words "false" and "true" with a VLOOKUP function. )
Bookmarks