Anpheus
08-10-2005, 11:18
Please implement a changing hashtable size.
Given the method you currently employ, you create a large array full of objects when it would be far more efficient to start out with... oh, 1024 hash buckets stored as pointers? To check whether any value hashes to an existing one, simply take the hash & 1023 and check whether that pointer is null, etc.
Far less memory will be used, because empty entries in the hashtable cost nothing beyond the pointer array itself; the small amount of time required to dereference the pointer is minimal, and it allows much faster growth.
How to do this? Well, we can go over the properties of hashtable growth and the probability of two keys having the same hash all day, but I think Google (http://goog-sparsehash.sourceforge.net/) has a solution.
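A minimal sketch of the pointer-bucket idea in C (all names here are illustrative, not from any particular codebase): an array of 1024 bucket pointers, NULL when empty, with collisions chained off each bucket.

```c
#include <stdlib.h>

#define NBUCKETS 1024u  /* power of two, so (hash & (NBUCKETS - 1)) == hash % NBUCKETS */

struct entry {
    unsigned hash;
    int value;
    struct entry *next;   /* collision chain */
};

/* All NULL at start: empty slots cost only the pointer itself. */
static struct entry *buckets[NBUCKETS];

static void ht_put(unsigned hash, int value)
{
    unsigned i = hash & (NBUCKETS - 1);   /* i.e. hash & 1023 */
    struct entry *e = malloc(sizeof *e);
    e->hash = hash;
    e->value = value;
    e->next = buckets[i];                 /* prepend to the chain */
    buckets[i] = e;
}

static struct entry *ht_get(unsigned hash)
{
    for (struct entry *e = buckets[hash & (NBUCKETS - 1)]; e; e = e->next)
        if (e->hash == hash)
            return e;
    return NULL;   /* an empty bucket is just a null-pointer check */
}
```

The null check on the bucket pointer is the whole "does anything hash here yet" test the post describes.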
Edit: If you would like to know how flawed your current 2048-index table is in terms of collision probability:
Hash size: 11 bits (2048 possible values)
Keys Probability that two keys have the same hash
7 0.010
15 0.050
21 0.100
26 0.150
31 0.200
35 0.250
39 0.300
46 0.400
54 0.500
76 0.750
111 0.950
123 0.975
138 0.990
169 0.999
195 0.9999
218 0.99999
238 0.999999
257 0.9999999
275 0.99999999
On the other hand, if you used a full 16-bit hash (65536 words of memory, plus -possible- buckets at the end of valid pointers):
Hash size: 16 bits (65536 possible values)
Keys p
37 0.01
82 0.05
118 0.10
146 0.15
172 0.20
195 0.25
217 0.30
259 0.40
302 0.50
427 0.75
627 0.95
696 0.975
777 0.990
952 0.999
1099 0.9999
1229 0.99999
1346 0.999999
1454 0.9999999
1554 0.99999999
Ooh, that's probably not good.
But hey, you can always use a 24 or 32-bit hash and come up with a more efficient data structure:
Hash size: 32 bits (4294967296 possible values) or 24 bits (16777216 possible values)
a = Keys in 32 bit hash, b = Keys in 24 bit hash
a b p
9292 581 0.01
20991 1312 0.05
30084 1881 0.10
37364 2336 0.15
43782 2737 0.20
49711 3107 0.25
55352 3460 0.30
66242 4141 0.40
77163 4823 0.50
109125 6821 0.75
160416 10026 0.95
178010 11126 0.975
198893 12431 0.990
243593 15225 0.999
281277 17580 0.9999
314477 19655 0.99999
344492 21531 0.999999
372094 23256 0.9999999
397785 24862 0.99999999
You really need to come up with a way to scale your hash structure, regardless.
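One hypothetical way to scale it, assuming the chained-bucket layout above: keep the bucket count a power of two and double it (rehashing each chain) once the load factor hits 1, so the mask simply gains a bit. Names are illustrative.

```c
#include <stdlib.h>

struct node {
    unsigned hash;
    int value;
    struct node *next;
};

struct table {
    struct node **buckets;
    unsigned nbuckets;   /* always a power of two */
    unsigned count;
};

static void table_grow(struct table *t)
{
    unsigned newn = t->nbuckets * 2;
    struct node **nb = calloc(newn, sizeof *nb);
    for (unsigned i = 0; i < t->nbuckets; i++)
        for (struct node *e = t->buckets[i], *next; e; e = next) {
            next = e->next;
            unsigned j = e->hash & (newn - 1);   /* mask gains one bit */
            e->next = nb[j];
            nb[j] = e;
        }
    free(t->buckets);
    t->buckets = nb;
    t->nbuckets = newn;
}

static void table_put(struct table *t, unsigned hash, int value)
{
    if (t->count >= t->nbuckets)   /* load factor 1: double and rehash */
        table_grow(t);
    unsigned i = hash & (t->nbuckets - 1);
    struct node *e = malloc(sizeof *e);
    e->hash = hash;
    e->value = value;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    t->count++;
}
```

Because the table doubles, rehashing is amortized O(1) per insert, and the collision probabilities in the tables above reset every time the bucket count grows.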
As a caveat, every hashtable has this problem: as keys accumulate, the probability of a collision climbs very quickly (from 1/1000, to 1/100, to 1/10... etc), roughly once the key count approaches the square root of the table size.
Though a 64 bit hash sure would be nice.