Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

c - Size of the hash table

Let the size of the hash table to be static (I set it once). I want to set it according to the number of entries. Searching yielded that the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.

For simplicity, assume that the hash table will not accept any new entries and won't delete any.

The number of entries will be 200, 2000, 20000 and 2000000.

However, setting the size to 2*N seems too much to me. It isn't? Why? If it is, which is the size I should pick?

I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.

I using C and I want to build my own structure, for educating myself.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.

It certainly shouldn't. Probably this recommendation implies that load factor of 0.5 is good tradeoff, at least by default.

What comes to primality of size, it depends on collision resolution algorithm your choose. Some algorithms require prime table size (double hashing, quadratic hashing), others don't, and they could benefit from table size of power of 2, because it allows very cheap modulo operations. However, when closest "available table sizes" differ in 2 times, memory usage of hash table might be unreliable. So, even using linear hashing or separate chaining, you can choose non power of 2 size. In this case, in turn, it's worth to choose particulary prime size, because:

If you pick prime table size (either because algorithm requires this, or because you are not satisfied with memory usage unreliability implied by power-of-2 size), table slot computation (modulo by table size) could be combined with hashing. See this answer for more.

The point that table size of power of 2 is undesirable when hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have bad hash function, avalanching it and still using power-of-2 size would be faster that switching to prime table size, because a single integral division is still slower on modern CPUs that several of multimplications and shift operations, required by good avalanching functions, e. g. from MurmurHash3.

The entries will be 200, 2000, 20000 and 2000000.

I don't understand what did you mean by this.

However, setting the size to 2*N seems too much to me. It isn't? Why? If it is, which is the size I should pick?

The general rule is called space-time tradeoff: the more memory you allocate for hash table, the faster hash table operate. Here you can find some charts illustrating this. So, if you think that by assigning table size ~ 2 * N you would waste memory, you can freely choose smaller size, but be ready that operations on hash table will become slower on average.

I understand that we would like to avoid collisions. Also I understand that maybe there is no such thing as ideal size for the hash table, but I am looking for a starting point.

It's impossible to avoid collisions completely (remember birthday paradox? :) Certain ratio of collisions is an ordinary situation. This ratio affects only average operation speed, see the previous section.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...