[Data Structures] Dictionary
Suppose I have some two-letter words and their definitions. I need to design a dictionary which I can use to look up the definition of a word very quickly. Here we say the word is the key that addresses the definition.
There are 26 letters in English. So there are 26 * 26 = 676 possible two-letter words. To solve the problem, I can use an array of length 676. Each array index is a bucket that contains a definition. To insert a definition to the dictionary, I'll use a function called hashCode(), which maps each word to an integer from 0 to 675. Then I can use the hash code to index into the array and store the definition there. This enables us to look up the definition of a word in constant time.
Obviously English is not limited to two-letter words. In fact, the longest English word has length 45 (pneumonoultramicroscopicsilicovolcanoconiosis). To put such a word into our dictionary, we would have to create another array of length 26^45. This is obviously impossible. Even if we could create such a huge array, we would be wasting a lot of space because there are only around 200,000 English words in total. We need some better data structure.
Hash Tables
Hash tables are the most common way to implement dictionaries (not the only way - search trees).
Suppose we have n words. We are going to use a table of N buckets to store the definitions. A hash table can map a large set of keys into a small number of buckets by applying a compression function to each key's hash code. A straightforward compression function is this:
h(hashCode) = hashCode mod n;
With this compression function, several different keys could be hashed into one bucket. This problem is called collision. Collision is unavoidable if n > N.
To solve this problem, we use a technique called chaining. It's just a fancy word with a simple meaning. Instead of storing one definition per bucket, we make each bucket a reference to a linked list of definitions called a chain.
Another problem follows. If we store a list of definitions in each bucket, how do we know which definition maps to which key? The solution is simple, we can store both the key and the definition in the buckets. In other words, each list node stores a word and its definition together.
An example of a hash table looks like this:
A hash table has three standard operations:
- public Entry insert(key, value)
- Computes the key's hash code.
- Compress it to determine its bucket.
- Wraps key and value into an Entry object (an object that contains a reference to a word and a reference to a definition) and inserts it into the bucket.
- public Entry find(key)
- Hash the key and compress it to find bucket.
- Search the list for the Entry with the given key.
- If found, return the Entry. Otherwise, return null.
- public Entry remove(key)
- Same as find() except it removes the Entry and returns it.
What if a word has more than one definition?
This is reasonable because most English words have multiple definitions. What we can do is we can store multiple entries into the dictionary. So there will be multiple entries with the same key. But what if you call find() on this dictionary? In this case, a random entry will be returned. The dictionary will search the key and returns immediately when it finds a match. We could also add a findAll() method which returns all entries. The implementation shouldn't be hard.
Comments
Post a Comment