I'm going to pretend this wasn't a terrible trivial question, and actually discuss the interesting parts of doing this in assembly language (instead of letting a compiler optimize it for you).
In asm, you could do it like you would in any other language. However, if you're programming for a machine with vector instructions, you can and should use them. Compilers will usually do this for you, but in asm you have to do it yourself.
Since the main reason for writing code in asm is high performance, let's consider some of the issues:
Without vector instructions, it might or might not be a good idea to use a conditional-move to do the usual x=max(x, a[i])
. cmov
would introduce a loop-carried dependency, which might hurt performance more than the occasional branch mispredict. (google for more about this).
Branch mispredicts are probably not frequent when finding the max, unless your values are noisy but on average increasing. (e.g. a new max is seen every 1 to 10 elements would be near worst-case.) Otherwise, you probably tend to go for a long time either always seeing a new max or never seeing a new max.
x86 has vector min/max instructions that work like a cmp/cmov on a per-element basis.
So if your array is made up of 32bit signed integers, you could use start by loading the first 4 elements into a vector register (say xmm0
), then use add rsi, 16 / PMAXSD xmm0, [rsi]
inside a loop to do 4 packed x=max(x,src)
operations. PMAXSD
in English is: Packed(integer) Max of Signed DWord elements. See the links in the x86 wiki for reference guides. PMAXSD
is part of SSE4.1, so it's only supported on CPUs with that feature-bit.
If your array was made up of uint8_t
elements, you'd use PMINUB
(Packed(int) Min of Unsigned Byte elements). PMIN/MAXUB
and PMIN/MAXSW
are in SSE2, so they're baseline for x86-64 (and for x86-32 on OSes which require new enough hardware with SSE2 support).
After looping over the array (maybe using PALIGNR or PSRLDQ to handle the last non-multiple-of-16B bit of the array), you'll have 4 elements in your accumulator vector. Each one is the max of every 4th element, for four different offsets. To get the overal max, you need to horizontally find the max element in the vector. Do this by shuffling it (e.g. byte-shifting it right, so the high two elements move to the position of the low two elements), then and comparing it using PMAXSD
against the un-shuffled value. Then repeat the process, to get the max of the last two elements.
Now you can store that 32bit int to memory, or use movd
to transfer it directly from xmm0
to eax
as a function return value.
There's some room for improvement here, since even though pmaxsd
has a latency of one cycle (Intel Haswell for example), it has a throughput of 2 per cycle. So ideally we can sustain a throughput of two PMAX per clock, with memory operands. (Since Intel SnB and onward have two load ports, L1 cache can keep up with this.) We need to use multiple accumulators to allow parallel operation. (And then PMAX all the accumulators together at the end, before doing the horizontal operation).
;;; probably buggy, use at your own risk. edits welcome
global max_array
max_array:
; function args: int *rsi, uint64_t rdi
; requirements: src is aligned on a 16B boundary, size is a multiple of 32bytes (8 elements), and >=8 on entry
; TODO: support unaligned with some startup code, and a partial final iteration with some cleanup
lea rdx, [rsi + 4*rdi] ; end pointer
movdqa xmm0, [rsi] ; two accumulators
movdqa xmm1, [rsi + 16]
add rsi, 32
cmp rsi, rdx
jae .out ; early exit if we shouldn't run the loop even once. unsigned compare for addresses
.loop:
pmaxsd xmm0, [rsi]
pmaxsd xmm1, [rsi+16]
add rsi, 32
cmp rsi, rdx ;; loop is 4 uops on Intel, since this cmp/branch macro-fuses
jb .loop
.out:
;; TODO: cleanup code to handle any non-multiple-of-8 iterations.
pmaxsd xmm0, xmm1
movhlps xmm1, xmm0 ; xmm0 = { d, c, b, a}. xmm1 = { d, c, d, c }
pmaxsd xmm0, xmm1 ; xmm0 = { d, c, max(d,b), max(c, a) }
; if we were using AVX 3-operand instructions, we'd use PSRLDQ and another pmax because it's easy.
; do the final stage of horizontal MAX in integer registers, just for fun.
; pshufd/pmax to do the last level would be faster than this shld/cmp/cmov.
movq rax, xmm0 ; rax = { max(d,b), max(c,a) }
; two-reg shift to unpack rax into edx:eax (with garbage in the high half of both)
shld rdx, rax, 32 ; rax = unchanged (eax=max(c,a)), edx = max(d,b).
cmp edx, eax
cmovg eax, edx ; eax = max( max(c,a), max(d,b) )
ret
In theory, this runs at one iteration per clock on Intel SnB-family microarchitectures. 4 fused-domain uops per clock saturates the pipe, but unrolling more (and using more accumulators) just makes the cleanup code for a non-toy version of this a bigger headache.