
View Full Version : nVault:


Anpheus
08-10-2005, 11:18
Please implement a changing hashtable size.

Given the method you currently employ, you create a large array full of objects when it would be far more efficient to start out with... oh, 1024 hash buckets as pointers? To check whether or not a value hashes to an existing one, simply take the hash & 1023, check whether that pointer is null, etc.

Far less memory will be used because empty entries in the hashtable cost nothing but a null pointer, the small amount of time required to dereference the pointer will be minimal, and it allows much faster growth.
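Something like this, roughly (a C++ sketch of the layout I mean, not nVault's actual code; the names and the djb2-style hash are just placeholders):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// A power-of-two array of bucket pointers, NULL until a key actually
// hashes there.  Empty slots cost one pointer each; collisions chain.
struct Bucket
{
    std::string key;
    std::string value;
    Bucket *next;        // chain for keys sharing a hash
};

class PtrHashTable
{
public:
    PtrHashTable(size_t size = 1024) : m_size(size)
    {
        m_buckets = new Bucket *[m_size];
        memset(m_buckets, 0, m_size * sizeof(Bucket *));
    }
    ~PtrHashTable()
    {
        for (size_t i = 0; i < m_size; i++)
        {
            Bucket *b = m_buckets[i];
            while (b)
            {
                Bucket *next = b->next;
                delete b;
                b = next;
            }
        }
        delete [] m_buckets;
    }
    void Insert(const std::string &key, const std::string &val)
    {
        size_t idx = Hash(key) & (m_size - 1);   // hash & 1023 for size 1024
        Bucket *b = new Bucket;
        b->key = key;
        b->value = val;
        b->next = m_buckets[idx];
        m_buckets[idx] = b;
    }
    const std::string *Find(const std::string &key) const
    {
        size_t idx = Hash(key) & (m_size - 1);
        for (Bucket *b = m_buckets[idx]; b; b = b->next)
            if (b->key == key)
                return &b->value;
        return NULL;     // null pointer, nothing allocated here
    }
private:
    static size_t Hash(const std::string &s)
    {
        size_t h = 5381;                 // djb2-style, just a placeholder
        for (size_t i = 0; i < s.size(); i++)
            h = h * 33 + (unsigned char)s[i];
        return h;
    }
    Bucket **m_buckets;
    size_t m_size;
};
```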


How to do this? Well, we can go over the properties of hashtable growth and the probability of two keys having the same hash all day, but I think Google (http://goog-sparsehash.sourceforge.net/) has a solution.


Edit: If you would like to know how flawed your current 2048-index table is in terms of collision probability:

Hash size: 11 bits (2048 possible values)
Keys Probability that two keys have the same hash
7 0.010
15 0.050
21 0.100
26 0.150
31 0.200
35 0.250
39 0.300
46 0.400
54 0.500
76 0.750
111 0.950
123 0.975
138 0.990
169 0.999
195 0.9999
218 0.99999
238 0.999999
257 0.9999999
275 0.99999999
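These rows follow the standard birthday-problem approximation, p ≈ 1 - e^(-k(k-1)/2n) for k keys in n buckets; a quick sketch to reproduce them:

```cpp
#include <cmath>

// Birthday-problem approximation: probability that at least two of
// `keys` keys land in the same bucket of a `buckets`-slot table.
double CollisionProb(double keys, double buckets)
{
    return 1.0 - exp(-keys * (keys - 1.0) / (2.0 * buckets));
}
// CollisionProb(54, 2048) comes out around 0.50, matching the table.
```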

On the other hand, if you used a full 16-bit hash (65536 words of memory, plus -possible- buckets at the end of valid pointers):

Hash size: 16 bits (65536 possible values)
Keys p
37 0.01
82 0.05
118 0.10
146 0.15
172 0.20
195 0.25
217 0.30
259 0.40
302 0.50
427 0.75
627 0.95
696 0.975
777 0.990
952 0.999
1099 0.9999
1229 0.99999
1346 0.999999
1454 0.9999999
1554 0.99999999

Ooh, that's probably not good.

But hey, you can always use a 24 or 32-bit hash and come up with a more efficient data structure:

Hash size: 32 bits (4294967296 possible values) or 24 bits (16777216 possible values)
a = Keys in 32 bit hash, b = Keys in 24 bit hash
a b p
9292 581 0.01
20991 1312 0.05
30084 1881 0.10
37364 2336 0.15
43782 2737 0.20
49711 3107 0.25
55352 3460 0.30
66242 4141 0.40
77163 4823 0.50
109125 6821 0.75
160416 10026 0.95
178010 11126 0.975
198893 12431 0.990
243593 15225 0.999
281277 17580 0.9999
314477 19655 0.99999
344492 21531 0.999999
372094 23256 0.9999999
397785 24862 0.99999999


You really need to come up with a way to scale your hash structure, regardless.

As a caveat, every hashtable has this problem: as keys are added, the probability of a collision climbs (from 1/1000, to 1/100, to 1/10... etc) very quickly.

Though a 64 bit hash sure would be nice.

Twilight Suzuka
08-10-2005, 14:45
I can generate horrible worst case scenarios with random strings and very little effort.

*sigh* Do I have to make a NEW vault....again...

Yuri
08-10-2005, 16:29
*sigh* Do I have to make a NEW vault....again...

Affirmative.

Anpheus
08-11-2005, 01:33
Some more statistics gleaned from several other functions (I don't feel like going into sources right now)... a 2, 3, 4...-collision is a case where a hash 'bucket' contains 2, 3, 4... keys. A 0-collision means there is no valid key that has this hash, and a 1-collision means there is only one valid key that has this hash.

0 and 1-collisions are optimal. Slightly above that can be considered acceptable some of the time; beyond that, you run into unacceptably large worst case scenarios.

Table size: 2048 buckets
Nr. Of Keys Inserted: 500
Nr. of 0-Collisions: 1604
Nr. of 1-Collisions: 392
Nr. of 2-Collisions: 48
Nr. of 3-Collisions: 4
Nr. Of Keys Inserted: 1000
Nr. of 0-Collisions: 1257
Nr. of 1-Collisions: 614
Nr. of 2-Collisions: 150
Nr. of 3-Collisions: 24
Nr. of 4-Collisions: 3
Nr. Of Keys Inserted: 2000
Nr. of 0-Collisions: 771
Nr. of 1-Collisions: 753
Nr. of 2-Collisions: 368
Nr. of 3-Collisions: 120
Nr. of 4-Collisions: 29
Nr. of 5-Collisions: 6
Nr. of 6-Collisions: 1
Nr. Of Keys Inserted: 5000
Nr. of 0-Collisions: 178
Nr. of 1-Collisions: 435
Nr. of 2-Collisions: 531
Nr. of 3-Collisions: 432
Nr. of 4-Collisions: 264
Nr. of 5-Collisions: 129
Nr. of 6-Collisions: 52
Nr. of 7-Collisions: 18
Nr. of 8-Collisions: 6
Nr. of 9-Collisions: 2
Nr. Of Keys Inserted: 10000
Nr. of 0-Collisions: 15
Nr. of 1-Collisions: 76
Nr. of 2-Collisions: 185
Nr. of 3-Collisions: 301
Nr. of 4-Collisions: 368
Nr. of 5-Collisions: 359
Nr. of 6-Collisions: 292
Nr. of 7-Collisions: 204
Nr. of 8-Collisions: 124
Nr. of 9-Collisions: 67
Nr. of 10+ Collisions: 57
Nr. Of Keys Inserted: 50000
Nr. of 0 - 9 Collisions: 0 ( 0.00%)
Nr. of 10 - 14 Collisions: 33 ( 1.61%)
Nr. of 15 - 19 Collisions: 293 (14.31%)
Nr. of 20 - 24 Collisions: 739 (36.08%)
Nr. of 25 - 29 Collisions: 672 (32.81%)
Nr. of 30 - 34 Collisions: 258 (12.60%)
Nr. of 35 - 39 Collisions: 46 ( 2.25%)
Nr. of 40 - 44 Collisions: 4 ( 0.20%)

This is what Bailopan's 50000 key test would look like if he used a hash just FIVE BITS longer:
Nr. Of Keys Inserted: 50000
Nr. of 0 - 9 Collisions: 65461 (99.89%)
Nr. of 10 - 14 Collisions: 75 ( 0.11%)
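The tallies above are easy to reproduce: throw N random hashes into the buckets and count how many buckets end up holding 0, 1, 2... keys. A sketch (rand() stands in for hashing random strings):

```cpp
#include <cstdlib>
#include <map>
#include <vector>

// Occupancy histogram: how many of `buckets` slots hold 0, 1, 2...
// of `keys` randomly distributed keys.  A k-collision in the post's
// terminology is a slot whose occupancy is k.
std::map<size_t, size_t> OccupancyHistogram(size_t keys, size_t buckets)
{
    std::vector<size_t> slot(buckets, 0);
    for (size_t i = 0; i < keys; i++)
        slot[rand() % buckets]++;
    std::map<size_t, size_t> histogram;   // occupancy -> number of slots
    for (size_t i = 0; i < buckets; i++)
        histogram[slot[i]]++;
    return histogram;
}
```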

BAILOPAN
08-11-2005, 03:04
/**
* This is a primitive, typical hash class.
* Design goals were: modular, easy to use, compact
*/

Key word -- "primitive". I wrote this quickly so I could get the API of the vault system down in about a day. Yes, it needs to be changed, and I'll consider other options for the future.

One of the IT directors here told me he had trouble porting Google's hash stuff to AMD64, and I don't have the time to look into it, which is why I did not consider using it.

Anpheus
08-11-2005, 17:04
I'd upgrade it.

Just look at the statistics.


Though you'll probably want to find a red-black tree implementation to build a map with 16-bit keys (24 or 32 bits would be much better).
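For reference, std::map is red-black-tree-backed in the common STL implementations, so it already gives O(log n) lookup with no collision behavior at all. A minimal sketch, assuming a 32-bit integer key as suggested above (names are illustrative only):

```cpp
#include <map>
#include <string>

// A red-black-tree map (via std::map) keyed on a 32-bit hash value.
typedef std::map<unsigned int, std::string> VaultMap;

// Return true and copy the value out if the key exists.
bool VaultGet(const VaultMap &vault, unsigned int key, std::string &out)
{
    VaultMap::const_iterator it = vault.find(key);
    if (it == vault.end())
        return false;
    out = it->second;
    return true;
}
```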