The beginning of this answer is based on what Jester commented on. It just expands on that and explains the changes in some more detail. I did some additional changes too, two of which are also addressing mistakes in your sources.
First, this part:
An int
is 4 bytes but you use 8 all over your code. Use eax
instead of rax
.
These instructions in your example are accessing 8 bytes from the array each:
mov rax, [rdi]
cmp rax, [rdi]
mov rax, [rdi]
This is because rax
is a 64-bit register, so doing a full rax
load or compare with a memory operand accesses 8 bytes of memory. In NASM syntax, you are allowed to specify the memory operand's size explicitly, such as by writing the following:
mov rax, qword [rdi]
If you had done that you might have figured out earlier that you were accessing memory in 8-byte units (quadwords). Trying to access a doubleword explicitly while using rax
as the destination register would fail. The following line causes the error "mismatch in operand sizes" at assembly time:
mov rax, dword [rdi]
The following two lines are fine, and both load into rax
from a doubleword memory operand. The first one uses zero-extension (which is implicit in the AMD64 instruction set when writing to a 32-bit register part), the second one uses (explicit) sign extension:
mov eax, dword [rdi]
movsx rax, dword [rdi]
(A movzx
instruction from a dword memory operand to rax
does not exist, as it would be redundant with the mov
to eax.)
Later in your example, you're using rdi
as the address of a 4-byte wide type, advancing the array entry pointer by adding 4 to it:
add rdi, 4
This is correct for the int
type, but clashes with your use of quadwords as the memory operands' sizes.
Two more issues are given by Jester's comment:
Also do not use rbx
because that is a callee-saved register and it's pointless to copy from rsi
anyway. As before, you'd better use esi
because that's another int.
The rsi
problem is that the high 32 bits of the 64-bit rsi
might hold nonzero values depending on the ABI. If you aren't sure about whether a nonzero value is allowed, you should assume it is and that you should use the 32-bit value in esi
only.
The rbx
(or ebx
) problem is that rbx
needs to be preserved across function calls for the AMD64 psABI that is used by Linux, see Where is the x86-64 System V ABI documented? for the documentation of that ABI. In your simple test program changing rbx
may not cause any failure, but it easily would in a non-trivial context.
The next problem I found is with your initialisation of eax
. You wrote it this way:
; [rdi] is the first value in the array.
; We'll store the smallest value so far found
; in rax. The first value in the array is the
; smallest so far found, therefore we store it
; in rax.
mov rax, [rdi]
However, as evidenced by your loop flow-control logic, you allow for the caller to pass in a zero for the size argument. In that case, you should not access the array at all, because "the first value in the array" may not even exist or be initialised to anything at all. Logically, you should initialise the smallest-yet value with INT_MAX instead of the first array entry.
There is one more problem: You're using rsi
or esi
as an unsigned number, counting down to zero. However, in your declaration of the function you specified the type of the size
argument as int
, which is signed. I fixed that by changing the declaration to unsigned int
.
I did a few more optional changes to your program. I used NASM local labels for the "sub"-labels of your function, which is useful because you could re-use for example .loop
or .end
in other functions in the same source file if there were any to be added.
I also corrected one of the comments to note that we jump for eax being less than the array entry, and do not jump for eax being greater than or equal to the array entry. You could change this conditional jump to be jle
instead which would jump for equal comparisons too. Arguably one or the other may be preferred for clarity, or performance, but I don't have much of an answer as to which.
I also used dec esi
instead of sub esi, 1
which is not very much better but just sits better with me. In 32-bit mode, dec esi
is a single-byte instruction. Not so in 64-bit mode though; dec esi
is 2 bytes against sub esi, 1
being 3 bytes.
Further, I changed the initial check for esi
being zero from using cmp
to test
, which is slightly better, refer to Test whether a register is zero with CMP reg,0 vs OR reg,reg?
Finally, I changed the actual loop conditional to be at the end of the loop body, which means the loop uses one jump instruction less. The unconditional jump to the loop body's start is replaced by the conditional jump that checks for the while
condition. The test
at the beginning of the function is still needed to handle the zero-length-array possibility. Also, instead of using cmp
or test
again to check for zero in esi
, I just use the Zero Flag as already set up by the dec
instruction to check whether esi
was decremented to zero.
You could use ecx
or rcx
for the loop counter, but that would probably not be much of a win on modern CPUs. Would allow for slightly more compact code if you resorted to using the jrcxz
, jecxz
, or loop
instructions. They aren't recommended though, due to slower performance.
Instead of comparing to the dword [rdi]
and then, if less-or-equal than eax
, loading from the same memory dword, you could load the array entry's value into a register first and then use that as source for cmp
and mov
. This might be faster, but it does result in more opcode bytes.
A trick I otherwise use to advance the Destination Index (rdi
in 64-bit mode) by 4 is to use a single scasd
instruction, which modifies only the flags and the index register. This is a single-byte instruction instead of the 4-byte add rdi, 4
but is highly probably slower to run.
I uploaded a repo with your original source and my improvements to https://hg.ulukai.org/ecm/testsmal/file/2b8637ca416a/ (Taken under the CC BY-SA usage conditions of stackoverflow content.) I modified the C part and the test script too, but these are trivial and mostly irrelevant to your question. Here's the assembly source:
INT_MAX equ 7FFF_FFFFh
SECTION .text
global find_smallest_int
find_smallest_int:
; If the array is empty (size = 0) then we want to return
; without reading from the array at all. The value to return
; then logically should be the highest possible number for a
; 32-bit signed integer. This is called INT_MAX in the C
; header limits.h and for 32-bit int is equal to 7FFF_FFFFh.
;
; If the array is not empty, the first iteration will then
; always leave our result register equal to the value in
; the first array entry. This is either equal to INT_MAX
; again, or less than that.
mov eax, INT_MAX
; esi is the second argument to our function, which is
; declared as int find_smallest_int(int *, unsigned int).
; It represents the length of the array. We use this
; as a counter. rsi (and its part esi) need not be preserved
; across function calls for the AMD64 psABI that is used by
; Linux, see https://stackoverflow.com/a/40348010/738287
; Check for an initial zero value in esi. If this is found,
; skip the loop without any iteration (while x do y) and
; return eax as initialised to INT_MAX at the start.
test esi, esi
jz .end
.loop:
; If eax is smaller than dword [rdi], we'll jump down to the
; rest of the loop. Only if eax is bigger than or equal to
; the dword [rdi] will we reassign eax to that, to hold the
; new smallest-yet value.
cmp eax, dword [rdi]
jl .postassign
.assign:
; If we execute this code, it means eax was not less
; than dword [rdi]. Therefore, we can safely reassign
; eax to dword [rdi].
mov eax, dword [rdi]
.postassign:
; Set rdi to point to the next value in the array.
add rdi, 4
; Subtract one from our counter. This started as
; the number of elements in the array - when it
; gets to 0, we'll have looped through the entire thing.
dec esi
; Check to see if we've reached the end of the array.
; To do this, we use the Zero Flag as set by the prior
; dec instruction. If esi has reached zero yet (ZR) then
; we do not continue looping. In that case, we return the
; smallest value found yet (which is in eax at the moment).
;
; Else, we jump to the start of the loop to begin the next
; iteration.
jnz .loop
.end:
retn
Here's an alternative for the conditional jump within the loop body. The cmov
instruction seems to be supported by all AMD64 CPUs. It is a conditional move: If the condition is met it runs like a mov
-- it has no effect otherwise, with one exception: It may read the source operand (and thus might fault from the memory access). You can flip the condition you used for the branch around the mov
to get the condition for the cmov
instruction. (I came across this thread involving Linus Torvalds which indicates that the conditional jump solution may actually be better or no worse than cmov
. Make of that what you will.)
.loop:
; If eax is greater than or equal to dword [rdi], we'll
; reassign eax to that dword, the new smallest-yet value.
cmp eax, dword [rdi]
cmovge eax, dword [rdi]
; Set rdi to point to the next value in the array.
add rdi, 4