Apologies in advance for the novel. Feel free to make suggestions or edit directly. --Chet
There is an overflow, but not an exception.
The danger doesn't come from losing accuracy, but from losing range. Let's use a ridiculous example, where "prime" is a large power of 2, with 8-bit unsigned numbers for brevity. Assume that (hash1 ^ hash2) is 255:
"prime": 1000 0000
(hash1 ^ hash2): 1111 1111
Showing the truncated digits in brackets, our result is:
product: [0111 1111] 1000 0000
But multiplying by 128 is the same as shifting left by 7 places. So whatever the value of (hash1 ^ hash2), the seven least-significant bits of the product will be zeros. If (hash1 ^ hash2) is odd (least significant bit = 1), then the result of multiplying by 128 will always be 128 (after truncating the higher digits). And if (hash1 ^ hash2) is even (LSB is 0), then the product will always be zero.
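Here's a minimal sketch that checks this exhaustively, using Java ints masked with 0xFF to stand in for the 8-bit unsigned values above (the class name is just for illustration):

    import java.util.Set;
    import java.util.TreeSet;

    public class PowerOfTwoDemo {
        public static void main(String[] args) {
            Set<Integer> results = new TreeSet<>();
            for (int x = 0; x < 256; x++) {
                // Multiply by 128 (a power of 2) and truncate to 8 bits.
                results.add((x * 128) & 0xFF);
            }
            System.out.println(results); // prints [0, 128]
        }
    }

All 256 inputs collapse into just two hash values.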
This extends to larger bit sizes. The general point is that if the lower bits of "prime" are zeros, you're doing a shift (or multiple shift + sum) operation that will give you zeros in the lower bits, and the range of the product will suffer.
But let's try making "prime
" odd, so that the least significant bit will always be 1. Think about decomposing this into shift / add operations. The unshifted value of (hash1 ^ hash2)
will always be one of the summands. The least significant bits that were shifted into guaranteed uselessness by an even "prime
" multiplier will now be set based on, at minimum, the bits from the original (hash1 ^ hash2)
value.
Now, let's consider a value of prime which is actually prime. If it's greater than 2, then we know it's odd, so the lower bits haven't been shifted into uselessness. And by choosing a sufficiently large prime, you get better distribution across the range of output values than you'd get with a smaller one.
Try some exercises with 16-bit multiplication using 8443 (0010 0000 1111 1011) and 59 (0000 0000 0011 1011). They're both prime, and the lower bits of 59 match the lower bits of 8443. For example, if hash1 and hash2 are both ASCII character values (0..255), then all of the results of (hash1 ^ hash2) * 59 will be <= 15045. This means that roughly 3/4 of the range of hash values (0..65535) for a 16-bit number goes unused.
But (hash1 ^ hash2) * 8443 is all over the map. It overflows if (hash1 ^ hash2) is as low as 8. It uses all 16 bits even for very small input values. There's much less clustering of hash values across the overall range, even if the input numbers are in a relatively small range.
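If you'd rather run that exercise than work it by hand, here's a sketch in Java, masking with 0xFFFF to simulate the 16-bit truncation (the class name is made up):

    import java.util.TreeSet;

    public class MultiplierSpread {
        public static void main(String[] args) {
            for (int prime : new int[] { 59, 8443 }) {
                TreeSet<Integer> hashes = new TreeSet<>();
                // (hash1 ^ hash2) of two ASCII values is in 0..255.
                for (int xor = 0; xor < 256; xor++) {
                    hashes.add((xor * prime) & 0xFFFF); // truncate to 16 bits
                }
                System.out.printf("prime %4d: max value %5d%n", prime, hashes.last());
            }
        }
    }

With 59 the largest result is 15045; with 8443 the results spread across nearly the whole 16-bit range.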
Assuming that the overflow is not a problem (the JVM doing an automatic cast), is it better to do a bitshift instead of a cast?
Most likely not. The JVM should translate the multiply into an efficient implementation for the host processor anyway. Integer multiplication should be implemented in hardware, and if not, the JVM is responsible for translating the operation into something reasonable for the CPU. It's very likely that integer multiplication is highly optimized already. If integer multiplication is faster on a given CPU when done as shift-and-add, the JVM should implement it that way. But it's less likely that the folks writing the JVM would care to watch for cases where multiple shift-and-add operations could have been combined into a single integer multiply.
I imagine the performance of the hashcode function varies significantly based on the complexity of the hashcode. Does the size of the prime multiplier not affect the performance?
No. The operations are the same when done in hardware regardless of the size, number of bits set, etc. It's probably a couple of clock cycles. It would vary depending on the specific CPU, but should be a constant-time operation regardless of the input values.
Is it better/smarter/faster to use multiple primes in a custom hashcode function instead of a single multiplier? If not, is there some other advantage?
Only if it reduces the possibility of collisions, and this depends on the numbers you're using. If your hash code depends on A and B and they're in the same range, you might consider using different primes or shifting one of the input values to reduce overlap between the bits. Since you're depending on their individual hash codes, and not their values directly, it's reasonable to assume that their hash codes provide good distribution, etc.
One factor that comes to mind is whether you want the hash code for (x, y) to be different from that for (y, x). If your hash function treats A and B in the same way, then hash(x, y) = hash(y, x). If that's what you want, then by all means use the same multiplier. If not, using a different multiplier would make sense.
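A quick sketch of the difference (the multipliers 31 and 17 are arbitrary primes, picked only for illustration):

    public class PairHash {
        // Symmetric: swapping the arguments gives the same hash.
        static int symmetric(int hashA, int hashB) {
            return 31 * (hashA ^ hashB);
        }

        // Order-sensitive: a different multiplier per input distinguishes
        // (x, y) from (y, x).
        static int ordered(int hashA, int hashB) {
            return 31 * hashA + 17 * hashB;
        }

        public static void main(String[] args) {
            System.out.println(symmetric(3, 5) == symmetric(5, 3)); // true
            System.out.println(ordered(3, 5) == ordered(5, 3));     // false
        }
    }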
How about something like long lhash = prime * (hash1 ^ hash2); then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here on SO, but it wasn't really explained why it was a good idea to do it like that.
Interesting question. In Java, longs are 64-bit and ints are 32-bit. So this generates a hash using twice as many bits as desired, and then derives the result from the high and low bits combined.
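Spelled out as a method, the idea might look like this (my sketch of it, not code from that other question). One subtlety the one-liner glosses over: one operand has to be cast to long before the multiply, otherwise Java does the multiplication in 32 bits and the high bits are lost before the widening to long happens.

    public class FoldedHash {
        static int hash(int prime, int hash1, int hash2) {
            // Cast first so the multiply happens in 64 bits; nothing overflows.
            long lhash = (long) prime * (hash1 ^ hash2);
            // Fold the high 32 bits into the low 32 bits. The unsigned shift
            // (>>>) avoids sign-extension if the product is ever negative.
            return (int) ((lhash >>> 32) ^ lhash);
        }
    }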
If multiplying a number n by a prime p, and the lowermost k bits of n are all zeros, then the lowermost k bits of the product n * p will also be all zeros. This is fairly easy to see -- if you're multiplying, say, n = 0011 0000 and p = 0011 1011, then the product can be expressed as the sum of two shift operations. Or,
00110000 * p = 00100000 * p + 00010000 * p
             = (p << 5) + (p << 4)
Taking p = 59 and using unsigned 8-bit ints and 16-bit longs, here are some examples.
64: 0011 1011 * 0100 0000 = [ 0000 1110 ] 1100 0000 (192)
128: 0011 1011 * 1000 0000 = [ 0001 1101 ] 1000 0000 (128)
192: 0011 1011 * 1100 0000 = [ 0010 1100 ] 0100 0000 (64)
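Those rows can be reproduced with a few lines of Java, masking with 0xFF to mimic the 8-bit truncation:

    public class TruncationDemo {
        public static void main(String[] args) {
            int p = 59;
            for (int n : new int[] { 64, 128, 192 }) {
                int product = n * p;            // the full 16-bit result
                int truncated = product & 0xFF; // what an 8-bit int keeps
                System.out.printf("%3d: %5d -> %3d%n", n, product, truncated);
            }
        }
    }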
By just dropping the high bits of the result, the range of the resulting hash value is limited whenever the low bits of the non-prime multiplicand are all zeros. Whether that's an issue in a specific context is, well, context-specific. But for a general hash function it's a good idea to avoid limiting the range of output values even when there are patterns in the input numbers.

And in security applications, it's even more critical to avoid anything that would let someone make inferences about the original value based on patterns in the output. Just taking the low bits reveals the exact values of some of the original bits. If we assume that the operation involved multiplying an input number by a large prime, then we know that the original number had as many trailing zeros as the hash output (because the prime's rightmost bit was 1).
By XORing the high bits with the low bits, there's less consistency in the output. And more importantly, it's much harder to make guesses about the input values based on this information. Based on how XOR works, a 1 bit in the output could mean the original low bit was 0 and the high bit was 1, or that the low bit was 1 and the high bit was 0.
64: 0011 1011 * 0100 0000 = 0000 1110 1100 0000 => 1100 1110 (206)
128: 0011 1011 * 1000 0000 = 0001 1101 1000 0000 => 1001 1101 (157)
192: 0011 1011 * 1100 0000 = 0010 1100 0100 0000 => 0110 1100 (108)
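And the same fold in code -- shift the high byte down, XOR it with the low byte, and mask to 8 bits:

    public class FoldDemo {
        public static void main(String[] args) {
            int p = 59;
            for (int n : new int[] { 64, 128, 192 }) {
                int product = n * p;                             // 16-bit result
                int folded = ((product >>> 8) ^ product) & 0xFF; // high byte ^ low byte
                System.out.printf("%3d -> %3d%n", n, folded);    // 206, 157, 108
            }
        }
    }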