Edited to correct statement re: 80386, which (to my surprise) did have a barrel shifter.
Happy to hear the 286 described as "modern" :-)
The 8086 ran a SHL AX, CL in 8 clocks + 4 clocks per bit shifted. So if CL = 255 this is a seriously slow instruction (8 + 4 × 255 = 1028 clocks)!
So the 286 did everybody a favour and limited the count by masking it to 0..31, which limits the instruction to at most 5 + 31 clocks. For 16-bit registers that is an interesting compromise.
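To put numbers on the difference, here is a minimal sketch in C of the two cost formulas quoted above (8 + 4n on the 8086, 5 + n with the count masked to 0..31 on the 286). The function names are made up for illustration, and the figures cover register operands only.

```c
#include <stdint.h>

/* Hypothetical helper: clocks for a register shift by CL on the 8086.
   The count is not masked, so all 8 bits of CL contribute. */
static unsigned shl_clocks_8086(uint8_t cl)
{
    return 8u + 4u * cl;
}

/* Hypothetical helper: clocks for a register shift by CL on the 286.
   The count is masked to its bottom 5 bits (0..31) before use. */
static unsigned shl_clocks_286(uint8_t cl)
{
    return 5u + (cl & 31u);
}
```

With CL = 255, shl_clocks_8086 gives 1028 and shl_clocks_286 gives 36.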
[I found the "80186/80188 80C186/80C188 Hardware Reference Manual" (order no. 270788-001), which says that this innovation appears there first. SHL et al ran 5+n clocks (for register operations), the same as on the 286. FWIW, the 186 also added PUSHA/POPA, PUSH immed., INS/OUTS, BOUND, ENTER/LEAVE, IMUL immed. and SHL/ROL etc. immed. I do not know why the 186 appears to be a non-person.]
For the 386 they kept the same mask, but that applies also to 32-bit register shifts. I found a copy of the "80386 Programmer's Reference Manual" (order no. 230985-001), which gives a clock count of 3 for all register shifts. The "Intel 80386 Hardware Reference Manual" (order no. 231732-002), section 2.4 "Execution Unit" says that the Execution Unit includes:
- The Data Unit contains the ALU, a file of eight 32-bit general-purpose registers, and a 64-bit barrel shifter (which performs multiple bit shifts in one clock).
So, I do not know why they did not mask 32-bit shifts to 0..63. At this point I can only suggest the cock-up theory of history.
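The effect of the 5-bit mask on the two register widths can be modelled in C. This is only a sketch of the behaviour described above, not of how the hardware is built, and the function names are invented for the example. Note that C itself leaves a shift by the full width or more undefined, so the mask is written out explicitly.

```c
#include <stdint.h>

/* Model of a 16-bit register shift with the count masked to 0..31.
   Counts 16..31 survive the mask and shift every bit out, giving zero. */
static uint16_t shl16_masked(uint16_t value, uint8_t cl)
{
    unsigned n = cl & 31u;
    return (uint16_t)((uint32_t)value << n);
}

/* Model of a 32-bit register shift with the same 5-bit mask.
   A count of 32 is masked to 0, so the value comes back unchanged. */
static uint32_t shl32_masked(uint32_t value, uint8_t cl)
{
    unsigned n = cl & 31u;
    return value << n;
}
```

So shl16_masked(0xFFFF, 16) is 0, which is what one would hope for, while shl32_masked(1, 32) is 1 rather than 0.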
I agree it is a shame that there isn't a (GPR) shift which returns zero for any count >= argument size. That would require the hardware to check for any bit set in the count beyond the bottom 6 (64-bit) or 5 (32-bit) bits, and return zero if there is one. As a compromise, perhaps it could check just bit 6 (or bit 5) of the count.
[I haven't tried it, but I suspect that using PSLLQ et al is hard work -- shuffling count and value to xmm and shuffling the result back again -- compared to testing the shift count and masking the result of a shift in some branch-free fashion.]
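One way the "test the count and mask the result" approach might look in C, for the 32-bit case. This is a sketch rather than a tuned implementation: the name shl32_zeroing is hypothetical, and whether the compiler really emits branch-free code (e.g. via SETcc or CMOVcc) depends on the compiler and its options.

```c
#include <stdint.h>

/* Shift left, returning zero for any count >= 32.
   The count is masked before the shift so the C shift itself is always
   well defined; the comparison result is then stretched into an
   all-ones or all-zeros mask and ANDed with the shifted value. */
static uint32_t shl32_zeroing(uint32_t value, uint32_t count)
{
    uint32_t keep = 0u - (uint32_t)(count < 32u); /* 0xFFFFFFFF if in range, else 0 */
    return (value << (count & 31u)) & keep;
}
```

For example, shl32_zeroing(1, 31) is 0x80000000, while shl32_zeroing(1, 32) and shl32_zeroing(1, 255) are both 0.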
Anyway... the reason for the behaviour appears to be history.