Raised This Month: $79 Target: $400
 19% 

[TUT]How to implement a hashmap


Post New Thread Reply   
 
Thread Tools Display Modes
Author Message
HamletEagle
AMX Mod X Plugin Approver
Join Date: Sep 2013
Location: Romania
Old 06-07-2020 , 11:06   [TUT]How to implement a hashmap
Reply With Quote #1

In this tutorial, we are going to implement our own hashmap data structure from scratch. We are going to learn:
  1. what it is
  2. one possible implementation
  3. why it is useful
  4. when it should be used

You used maps quite a lot already when working with tries. Tries are one implementation provided by AMXX for a map-like data structure.
By now you could be asking a very valid question: if we already have maps(tries) and we kinda know how to work with them why do we need this tutorial?
Answer:
1. Because it's useful to understand how it works(instead of just assuming "it works by magic") and the best way of really getting a feel about how something works is to build that thing yourself.
2. You probably heard people saying to use tries when you need to search for a specific entry, saying that tries are fast for retrieving data. You will understand "why" and make an informed decision about when to use one.
3. Bonus: this isn't AMXX or CS specific, this is a concept that's widely used in programming, it will help you in many many many scenarios outside AMXX scripting.

With that being said, let's see what it is: a map is an association between a key and a value(it maps a key to a value hence the name map).
Example 1: if you want to visit a person for the first time you need his address. Once you reach that address you know you reached that person's house. We can say the address is the key that tells you where to find the value(the house).
Example 2: you want to give players different weapon skins. You need a way to link the skins to the individual player, which is typically done by using their steamid. Here, the steamid is the key that identifies the skins for a particular player.

Considering our previous example, we will store in our map something like this:
Code:
steam1 -> models/new_weapons/new_ak47.mdl
steam2 -> models/new_weapons/new_m4a1.mdl
where x -> y means map key x to value y. Therefore, we must give to the player with the steamid "steam1" a new ak47 model, and to the player with steamid "steam2" a new m4a1 model.
When a player joins the server, we take his steamid and check the map. What value did we save for this steamid(key)? Then we take that value and give the model to the player.

Keys and values can be literally anything, they could be numbers, strings, or more complicated "data types"(like an enum). For now, let's consider our keys are numbers to get some intuition about how we could implement our hashmap in code.

Let's say we have the following key -> value pairs(they don't have to hold any meaning, we can map anything to anything and now we just feel like mapping 1 to 5 and 2 to 7):
Code:
1 -> 5
2 -> 7
3 -> 11
We need a way to store the fact that when we query the map with key = 1, it needs to respond with value = 5. Similarly, if we ask for key 3 it must return 11.
If we imagine these were not random numbers, but 1 2 3 are player indexes and 5 7 11 are their level for an XP System plugin the solution should be obvious. For ages, we used arrays like new PlayerData[33] to store information about a player.
We use the player "id" to index PlayerData array and find the corresponding value.
We are actually mapping from player ids to some value, which looks a lot like what a map should do.

We can write the following code:
PHP Code:
#include <amxmodx>

#define MAP_MAX_SIZE 33 

new numbersMap[MAP_MAX_SIZE]

public 
plugin_init()
{
    
//add some key -> value pairs in our map
    
addToMap(15)
    
addToMap(27)
    
addToMap(311)
    
    
//test to see if it works
    
server_print("key = 1 value = %d"getFromMap(1))
    
server_print("key = 2 value = %d"getFromMap(2))
    
server_print("key = 3 value = %d"getFromMap(3))
    
server_print("key = 25 value = %d"getFromMap(25))
}

addToMap(keyvalue)
{
    
numbersMap[key] = value
}

getFromMap(key)
{
    return 
numbersMap[key]

We are essentially doing:
PHP Code:
array[1] = 5
array[2] = 7
array[3] = 11 
but in a more general way, which will allow us to move to a true hashmap later on.

If we compile & run we get the following output:
Code:
key = 1 value = 5
key = 2 value = 7
key = 3 value = 11
key = 25 value = 0
The first 3 lines make perfect sense. We asked "what value is currently stored for key 1" and the map responded "the value is 5". Same for second and third line.
The last line is an attempt to ask about an invalid key(we never added the key 25 to our map). The answer was 0 because the map is initialized to 0 when created. That's fine, no need to worry about it for now.

We see it's really easy to implement a map for numbers. In fact, we do it in almost every plugin to store information about a player.
Unfortunately, when our keys are something else(not numbers) we can not do that anymore. This is because we can't use a string like "steamid1" to index into an array(array["steamid1"] is not valid).

But what if we could somehow convert from any "data type" like strings to integers? Take a string, do some magic and get a number from it. Then we could use our previous code for every kind of data. How cool is that?
The "thing" that we will use to convert from other "data types" to integers is called a hash function. A hash function takes something and outputs a number generated from that something.
Ideally, we want the output to be unique and you will soon see why.

Let's try to come up with a hash function that will allow us to use strings instead of number for keys. We will try to implement a map that allows us to have mappings like:
Code:
"first_key" -> 5
"second_key" -> 7
"third_key" -> 11
(notice how our keys are strings, not numbers)

We can't just do str_to_num because our strings do not have to represent valid numbers(like "13"), they can be anything so we need another hash function. There is no universal way of choosing a function, we need to look at our data and pick something.
Since we are working with strings, let's try to take the first character from the string('f', 's', 't' for our keys) and use that. Characters are actually ASCII numbers(http://www.asciitable.com/) and we like numbers because it's easy to index an array using them.
For example, 'f' has the ASCII code 102, 's' is 115, etc: we could use 102 and 115 to index our array.
The only problem is we only have 33 slots, 102/105 won't fit. A very simple solution is to use the modulo(%) operator. It will make sure our values are wrapped inside the interval [0-32](which are valid indexes because MAX_SIZE is 33).
If you are unsure why this works, check some elementary math resources, but the bottom line is: if x : y = c remainder r then r < y. Modulo gives us r so when we do index % 33 the remainder r is always smaller than 33.

Therefore, our hash function could look like
PHP Code:
stringHashFunction(string[])
{
    new 
firstChar string[0]  //take the first character from string
    
return firstChar MAP_MAX_SIZE //return the ascii value wrapped so it fits in the array

With this, we can change our code to:
PHP Code:
#include <amxmodx>

#define MAP_MAX_SIZE 33

new numbersMap[MAP_MAX_SIZE]

public 
plugin_init()
{
    
//add some key -> value pairs in our map
    
addToMap("first_key"5)
    
addToMap("second_key"7)
    
addToMap("third_key"11)
    
    
//test to see if it works
    
server_print("key = first_key value = %d"getFromMap("first_key"))
    
server_print("key = second_key value = %d"getFromMap("second_key"))
    
server_print("key = third_key value = %d"getFromMap("third_key"))
}

stringHashFunction(string[])
{
    new 
firstChar string[0]  //take the first character from string
    
return firstChar MAP_MAX_SIZE //return the ascii value wrapped so it fits in the array
}

addToMap(key[32], value)
{
    
//convert the string to a number(index)
    //keyHashCode will represent the slot where we add the data
    
new keyHashCode stringHashFunction(key)
    
numbersMap[keyHashCode] = value
}

getFromMap(key[32])
{
    
//convert the string to a number(index)
    //keyHashCode will represent the slot from where we retrieve the data
    
new keyHashCode stringHashFunction(key)
    return 
numbersMap[keyHashCode]

It's pretty much the same thing, except in addToMap and getFromMap we do an extra step. We call our hash function on the key to get a number. We use that number to index into our map.
For example, this is how it would work for adding the mapping "second_key" -> 7.
1.Take the first character from the key(which is 's')
2."Convert" it into a number and take the modulo to make sure the value is smaller than map size.
3.'s' = 115; 115 % 33 = 16. Then at index 16 in the array, we store our value(7), i.e numbersMap[16] = 7.

If you didn't quite understand what's the role of the hash function think about it this way::
You want to check-in at a hotel. You go to the front desk, tell them your name and the receptionist will give you a room where you will stay for the duration of the trip.
The receptionist is the hash function: a hash function gives us an index into the array where we can add our data and the receptionist gives you a room number where you can stay.

Great! Now we can use our simple map to store strings. Just like this, we could take any kind of data, apply a function we created over it, get an integer and then use it to index an array.
However, we still have a few issues we need to address:
  • we are not explicitly storing the key. If we wanted to print all (key, value) pairs we won't be able to. We only know the indexes and the values. In general, a hash function doesn't allow us to go back from index to the key. More precise:
    Code:
    index = hashFunction(key)
    key = hashFunctionInverse(index) does not generally hold(the hash function doesn't always have an inverse)
  • If we wanted to have strings as values(like in our second example where we map steam ids to weapon skins) we would need to change the definition of the map to something like new numbersMap[MAP_MAX_SIZE][MAX_VALUE_LEN]. Let's try to come up with something a bit more generic(as generic as pawn allows us to be anyway).

In order to fulfill both requirements, we will define a "pair" structure(with two members: key and value).
PHP Code:
enum _:pair 
{
    
key[128],
    
value[128]

The map will be an array of (key, value) pairs: we explicitly store the key and if we need to change the datatype we need to change the enum definition, not the map itself.
PHP Code:
new numbersMap[MAP_MAX_SIZE][pair
This also allows us to have more complex key/value pairs than just basic variables/strings. For example, if we had 2 keys for each value we could alter the pair enum to be:
PHP Code:
enum _:pair  //(key1, key2) -> value
{
    
key1[128],
    
key2[128],
    
value[128]

Our code will look like:
PHP Code:
#include <amxmodx>

#define MAP_MAX_SIZE 33

enum _:pair 
{
    
key[128],
    
value[128]
}

//an array of "pair" with size MAP_MAX_SIZE
new numbersMap[MAP_MAX_SIZE][pair]

public 
plugin_init()
{
    
//add some key -> value pairs in our map
    
addToMap("steam_id1""models/weapons/new_ak47.mdl")
    
addToMap("steam_id2""models/weapons/new_m4a1.mdl")
    
addToMap("steam_id3""models/weapons/new_deagle.mdl")
    
    
//test to see if it works
    
new model1[128]; getFromMap("steam_id1"model1);
    new 
model2[128]; getFromMap("steam_id2"model2);
    new 
model3[128]; getFromMap("steam_id3"model3);
    
    
server_print("key = steam_id1 value = %s"model1)
    
server_print("key = steam_id2 value = %s"model2)
    
server_print("key = steam_id3 value = %s"model3)
}

stringHashFunction(string[])
{
    new 
firstChar string[0
    return 
firstChar MAP_MAX_SIZE
}

addToMap(keyToAdd[128], valueToAdd[128])
{
    
//find the index where we need to add (keyToAdd, valueToAdd)
    
new keyHashCode stringHashFunction(keyToAdd)
    
    
//copy the key and value into the array 
    
copy(numbersMap[keyHashCode][key], charsmax(numbersMap[][key]), keyToAdd)
    
copy(numbersMap[keyHashCode][value], charsmax(numbersMap[][value]), valueToAdd)
}

getFromMap(keyToFind[128], outputValue[128])
{
    
//find the index where we inserted the key "keyToFind"
    
new keyHashCode stringHashFunction(keyToFind)
    
    
//copy the value from that index
    
copy(outputValuecharsmax(outputValue), numbersMap[keyHashCode][value])

You will notice now it is more similar with the trie natives: provide a key as a string and get the value back as a string.
This is also something you are likely to do in a plugin: use maps(trie) to save player customization. Let's try to run the code and see if it works:

Code:
key = steam_id1 value = models/weapons/new_deagle.mdl
key = steam_id2 value = models/weapons/new_deagle.mdl
key = steam_id3 value = models/weapons/new_deagle.mdl
Oops, this does not look right. steam_id1 should have a value of "models/weapons/new_ak47.mdl", steam_id2 a value of "models/weapons/new_m4a1.mdl". Instead, the hashmap returned models/weapons/new_deagle.mdl for every key. Can you figure out why?

To answer the question, manually execute the hash function on each key and check the output
PHP Code:
stringHashFunction("steam_id1") = 's' 33 16
stringHashFunction
("steam_id2") = 's' 33 16
stringHashFunction
("steam_id3") = 's' 33 16 
Our function generated the same index for all keys, each call to addToMap translated into copying the (key, value) pair at index 16. We lost the values for steam_id1, steam_id2 and only have the value for the last inserted key.
This is because the hash function only takes into account the first letter of the string, and all 3 keys start with 's'.

When a hash function generates the same index for different keys we say a "collision" happened. Two different keys ended up in the same array slot(this should not happen!).
Collisions are extremely bad because they cause us to lose data(we lost the values for the first 2 keys) and slow down our code.
Going back to our hotel analogy, you don't want the receptionist to give the same room to 2 different people, every person must have a different room.
Ideally, the hash function should generate different indexes for different keys, but most of the time this is not possible, so we need a way of dealing with the problem: we are going to look at a very simple and efficient solution called "linear probing".

Linear probing idea: start checking from hashFunc(key) until an empty index/an index with the same key is found. Write/update that index.
This is how it will work with our example keys: steam_id1, steam_id2, steam_id3.

Code:
Insert steam_id1: hashFunction(steam_id1) = 16. 
Is map[16] empty? 
Yes, insert steam_id1 key in map[16].

Insert steam_id2: hashFunction(steam_id2) = 16. 
Is map[16] empty? No, it contains steam_id1 key. 
Try the next index, 17. 
Is map[17] free?
Yes, insert steam_id2 key in map[17].

Insert steam_id3: hashFunction(steam_id3) = 16.
Is map[16] empty? No, it contains steam_id1 key. 
Try the next index, 17. 
Is map[17] empty? No, it contains steam_id2 key.
Try the next index, 18.
Is map[18] free?
Yes, insert steam_id3 key in map[18].
What happens if we want to insert steam_id1 again with a different value?
hashFunction(steam_id1) = 16.
Is map[16] empty? No, but it contains the key steam_id1 which is the same key we are trying to insert now. Since a map can not have duplicate keys(how would the map know which one you want to retrieve?) we will simply update the old value with the new value at index 16.

This is how this would work for the hotel analogy:
The receptionist wants to give you room 100. But firstly, he checks if it's free or someone is there already.
If the room is free, he gives it to you.
If someone else is already there, he will try to give you room 101 and repeat the process until a free room is found.

Just like that we fixed the collision problem and now we have all 3 pairs inside our map structure.
To implement this, we will have to modify addToMap and getFromMap and use a loop to find the first empty index/the first index that contains the same key.

PHP Code:
#include <amxmodx>

#define MAP_MAX_SIZE 33

enum _:pair 
{
    
key[128],
    
value[128]
}

new 
numbersMap[MAP_MAX_SIZE][pair]

public 
plugin_init()
{
    
//add some key -> value pairs in our map
    
addToMap("steam_id1""models/weapons/new_ak47.mdl")
    
addToMap("steam_id2""models/weapons/new_m4a1.mdl")
    
addToMap("steam_id3""models/weapons/new_deagle.mdl")
    
    
//test to see if it works
    
new model1[128]; getFromMap("steam_id1"model1);
    new 
model2[128]; getFromMap("steam_id2"model2);
    new 
model3[128]; getFromMap("steam_id3"model3);
    
    
server_print("key = steam_id1 value = %s"model1)
    
server_print("key = steam_id2 value = %s"model2)
    
server_print("key = steam_id3 value = %s"model3)
}

stringHashFunction(string[])
{
    new 
firstChar string[0
    return 
firstChar MAP_MAX_SIZE
}

addToMap(keyToAdd[128], valueToAdd[128])
{
    
//find the index where we need to add (keyToAdd, valueToAdd)
    
new keyHashCode stringHashFunction(keyToAdd)
    
    
//try to insert at most MAP_MAX_SIZE times 
    //if we fail to insert after MAP_MAX_SIZE tries then the map is full
    
for(new 1<= MAP_MAX_SIZEi++)
    {
        if(
numbersMap[keyHashCode][key][0] == EOS || //is this slot empty?
            
equal(numbersMap[keyHashCode][key], keyToAdd)) //does it contain the same key
        
{
            
//copy the key and value into the array 
            
copy(numbersMap[keyHashCode][key], charsmax(numbersMap[][key]), keyToAdd)
            
copy(numbersMap[keyHashCode][value], charsmax(numbersMap[][value]), valueToAdd)
            
            
//stop, we inserted the key and value
            
break
        }
        
        
//check next slot. We must do % to keep "keyHashCode" index < MAP_MAX_SIZE
        
keyHashCode = (keyHashCode 1) % MAP_MAX_SIZE
    
}
}

getFromMap(keyToFind[128], outputValue[128])
{
    
//find the index from where we start searching for "keyToFind"
    
new keyHashCode stringHashFunction(keyToFind)
    
    for(new 
1<= MAP_MAX_SIZEi++)
    {
        if(
equal(numbersMap[keyHashCode][key], keyToFind)) //our key is inserted at this slot
        
{
            
//copy the value from that index
            
copy(outputValuecharsmax(outputValue), numbersMap[keyHashCode][value])
            
            
//stop, we found the slot with our key
            
break
        }
        
        
//check next slot. We must do % to keep "keyHashCode" index < MAP_MAX_SIZE
        
keyHashCode = (keyHashCode 1) % MAP_MAX_SIZE
    
}

We just implemented the algorithm described above in code. We loop at most MAP_MAX_SIZE times because this guarantees we will check each slot(we have MAP_MAX_SIZE slots) in the worst case scenario.

Code:
key = steam_id1 value = models/weapons/new_ak47.mdl
key = steam_id2 value = models/weapons/new_m4a1.mdl
key = steam_id3 value = models/weapons/new_deagle.mdl
Finally, we have a working map implementation. The only issue left is what happens when the map is full. Right now, any insert will silently fail which is not good. You may want to return a value to signal when a failure occurs.
Something like this:
PHP Code:
bool:addToMap(keyToAdd[128], valueToAdd[128])
{
    
//find the index where we need to add (keyToAdd, valueToAdd)
    
new keyHashCode stringHashFunction(keyToAdd)
    new 
bool:succesfullyInserted false
    
    
//try to insert at most MAP_MAX_SIZE times 
    //if we fail to insert after MAP_MAX_SIZE tries then the map is full
    
for(new 1<= MAP_MAX_SIZEi++)
    {
        if(
numbersMap[keyHashCode][key][0] == EOS || //is this slot empty?
            
equal(numbersMap[keyHashCode][key], keyToAdd)) //does it contain the same key
        
{
            
//copy the key and value into the array 
            
copy(numbersMap[keyHashCode][key], charsmax(numbersMap[][key]), keyToAdd)
            
copy(numbersMap[keyHashCode][value], charsmax(numbersMap[][value]), valueToAdd)
            
            
//stop, we inserted the key and value
            
succesfullyInserted true
            
break
        }
        
        
//check next slot. We must do % to keep "keyHashCode" index < MAP_MAX_SIZE
        
keyHashCode = (keyHashCode 1) % MAP_MAX_SIZE
    
}
    
    return 
succesfullyInserted

Similarly, we can change getFromMap to return true/false if they key exists/doesn't exist.

PHP Code:
bool:getFromMap(keyToFind[128], outputValue[128])
{
    
//find the index from where we start searching for "keyToFind"
    
new keyHashCode stringHashFunction(keyToFind)
    new 
bool:succesfullyFound false
    
    
for(new 1<= MAP_MAX_SIZEi++)
    {
        if(
equal(numbersMap[keyHashCode][key], keyToFind)) //our key is inserted at this slot
        
{
            
//copy the value from that index
            
copy(outputValuecharsmax(outputValue), numbersMap[keyHashCode][value])
            
            
//stop, we found the slot with our key
            
succesfullyFound true
            
break
        }
        
        
//check next slot. We must do % to keep "keyHashCode" index < MAP_MAX_SIZE
        
keyHashCode = (keyHashCode 1) % MAP_MAX_SIZE
    
}
    
    return 
succesfullyFound

Another way to deal with a full hashmap is to use a dynamic array instead of a regular array to be able to resize it(when full or once the performance of the map degrades because of too many inserted elements).

Bechmarking our implementation:
For the end of this tutorial, let's do some benchmarking to find out how our implementation compares to the trie from amxx:
I'm going to use the following code which will generate 4096 random keys and values, insert them both in our map and in a trie and then test how fast we can retrieve them.
The testing code is not important, if you don't understand something do not worry.
(I stole the timing functions from somewhere, I don't remember who is the original author)

Spoiler


Output:
Code:
Insert for our map: 0 days, 0 hours, 0 minutes, 0 seconds and 526 milliseconds.
Get for our map: 0 days, 0 hours, 0 minutes, 0 seconds and 516 milliseconds.
Let's test the trie version too, before drawing any conclusions:
Spoiler


Code:
Insert for trie: 0 days, 0 hours, 0 minutes, 0 seconds and 4 milliseconds.
Get for trie: 0 days, 0 hours, 0 minutes, 0 seconds and 2 milliseconds.
Wow! That's really bad. Our version takes 526 milliseconds to insert vs 4 milliseconds for a trie and it takes 516 milliseconds to retrieve vs 2 milliseconds for a key.
But why? Is our implementation THAT bad? Did I lie when I said linear probing is fast and easy to implement? Well, no, and the explanation is actually simple.

Remember our example about collisions? It turns out the hash function is responsible for this poor performance. We used a hash function that takes the first character of the string and uses it as the slot for the key-value pair: every string that starts with the same letter will cause a collision.
All strings starting with 'a' will want to go into the same map slot.
All strings starting with 'b' will want to go into the same map slot.
....

We will have to run our collision resolution algorithm(the for loop) many times and this slows down our implementation. In order to get good performance we need a hash function that causes very few collisions.

Let's replace in the first code the hash function with this:
PHP Code:
stringHashFunction(string[])
{
    new 
hash 7
    
for (new 0strlen(string); i++) 
    {
        
hash hash 31 string[i]
    }

    return 
hash MAP_MAX_SIZE

If we test again we get:
Code:
Insert for our map: 0 days, 0 hours, 0 minutes, 0 seconds and 98 milliseconds.
Get for our map: 0 days, 0 hours, 0 minutes, 0 seconds and 95 milliseconds.
Just by changing the hash function, we got a code that 5 times faster. Sure, it's still not nearly as fast as a trie, but our goal here wasn't to beat tries. Our goal was to implement a decent hashmap.
There are many other techniques you can use to get even more speedup.

Then how do we choose a good hash function?
Firstly, it has to be fast, the function shouldn't go any crazy stuff that take a lot of time to run.
Secondly, it should have as few collisions as possible. There are very good articles around the web and this tutorial is already long, so if you want to read more please search.
If you can't be bothered to learn about good hash functions, then simply search "hash function for strings", "hash function for integers" and you'll get one.

One last thing:
You may have a very valid question: "I see you are using an array and doing some loops, so how is this any better than directly using an array and looping to find the data I need?".
If you do, please read the topic again and pay attention to the hash function, it's where all the magic happens.

Let's say you have an array of 4096 slots. What happens if you loop to find the data and it is in the last slot? You will have to check 4095 other items before finding the right one.
On the other hand, a map needs to do only one step in the ideal case: the needed data is in the slot returned by the hash function. With collisions, it may not be in that slot, but
you know it's in the next one or the next next one ...
With few collisions, you will loop very few times(let's say 5 or 10). 5 is much smaller than 4095, hence why maps are very good at finding data and why a good hash function is so important.

Conclusion:
Should you implement your own map in plugins? No, don't do that. Use tries, they will probably perform better than your implementation and they are tried and tested.
However, you should know how to implement a map if need be. Plus, knowing how it works on a deeper level helps you understand how/when you should use one.
__________________

Last edited by HamletEagle; 06-07-2020 at 11:15.
HamletEagle is offline
Napoleon_be
Veteran Member
Join Date: Jul 2011
Location: Belgium
Old 06-11-2020 , 07:11   Re: [TUT]How to implement a hashmap
Reply With Quote #2

Nice guide, will definitely be using this in the future.
__________________
Napoleon_be is offline
Send a message via Skype™ to Napoleon_be
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT -4. The time now is 06:59.


Powered by vBulletin®
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Theme made by Freecode