Some Bits in Few Pieces: Dictionary for WP - Approaches to String Lookup

Couple of months back I wrote a simple dictionary app for Windows Phone. The idea of the app was to make the list of words available offline so that we do not need a data connection to look up a definition.

There are few aspects that this app was designed around:
1) Words available locally
2) Show a list of matching words while typing
3) Show the definition instantly

Of these 1 and 3 was achieved easily using a hashtable of words, the value indexed to an offset in a huge file containing all the definitions. So as soon as we have a word, we look up the offset and seek to the definition in the large file. This happens almost instantly when the app is loaded on the phone.

The big problem was to solve 2. There was a list of over 87k words and to show a list of matching words, we have to search the list again after every keypress. Initially the app used a list of string for searching which worked well while running on the Phone Emulator but was extremely slow - up to 4 seconds for each keypress - while running on the phone. Below is the sequence of investigations that followed to improve the lookup performance.

1) Simple List - Lookup through direct sequential search
As mentioned the lookup is slow. But this has a benefit of having the least startup time as we can directly read all the strings into a list which was the fastest way to create the initial data structure.

The lookup took almost 3-4 seconds to get the list of matching words. As this was a sequential search, the matching words if present in the beginning of the list took comparativefly less time than the words towards end (so ab* loaded much faster than zy*).

Though this was the simplest way to store and had faster load times, the lookup responsiveness was simply unacceptable.

2) Trie - Lookup through DFS
Trie is a specialized tree data structure used specially for this kind of scenario with very quick string prefix matches, in constant time. List of all matching words with a given prefix can be generated by doing a depth first iteration over the node matching the prefix.

This was a great structure for prefix matches and perfect fit for the app scenario. The string searches were almost instant for words with 3 or more character but for less than 3 characters the number of words was so large that the depth first search took very very long, surprisingly even longer than the sequential searches.

The longer time spent for shorter words needed some improvement.

3) Trie & Sorted List - Search in Trie with Indexed Lookup in List
The original Trie was modified to store now two integers. One represented an index into the Sorted List with first matching substring, the second represented the count of words that actually matched that prefix. So to list all strings matching a prefix, we go to the location in the list using the index and iterate using the count.

This has great performance for lookup for all strings of any length and works amazingly well on the phone too. But this structure takes time to create, as we are doing a count of all matching substrings, the structure was created offline and exported to a file and read into the app.

The act of storing the structure in the file was the great drawback. The exported trie itself was almost 7MB in size (compared to rest of the app which was just 6MB) and the time taken to read and recreate the structure in the app was very long. The app startup took almost 12 seconds and in many cases it didn't start at all as WP throttles any app which takes near to 15 seconds to start.

I also did an iteration to load the structure in the background so that the ui thread doesn't get blocked and the app loads correctly. The structure was ready to use a few seconds after load and searches after that worked well without glitch. But the limited usability studies that I did (with my wife) clearly gave an indication that waiting for a few seconds after the app was loaded was not possible, as the primary intent of users is to quickly lookup a word instead of looking around the app then try to lookup.

I did couple more iteration by trying to store the structure differently - using BinaryWriter for serialized objects, XML output, custom string serialization - but each of them had the same result, loading the structure into the app took very long.

4) Character Indexed Array of Arrays
Though this sounds like a super specialized data structure, this was a collection of 26 arrays each representing a character from the alphabet. The arrays were created while loading the list of strings using first character of the string as index and the strings were stored in their respective array. For lookup the correct array was picked by the first character entered and a sequential search was done to generate the list of strings matching the prefixes.

The creation of the structure took almost same time as a single list. The look up was way faster than that of a single list. Though it was not almost instant as the case was with trie, it was near to half a second. Compared to a single list, this was a great improvement.

I came down to this structure taking the cue from the first single list, that the searches for strings early in the list were faster than those which were towards the end. So by reducing the number of strings to match, we decreased the surface area for search and thereby searches are now way faster than a single list.

For a full fledged computer system I'd have chosen option 3 but limited processing power and space for an app on a phone made it a non-ideal choice. Eventually the app was shipped with option 4, which given the circumstances, was the perfect solution that can work with acceptable User Experience.

Trie: http://en.wikipedia.org/wiki/Trie
Radix Tree: http://en.wikipedia.org/wiki/Radix_tree

The App: http://www.windowsphone.com/apps/8f4c22ee-a3bb-4236-9e12-3cb2c2637f42