Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
494 views
in Technique[技术] by (71.8m points)

c - What is the actual relation between assembly, machine code, bytecode, and opcode?

What is the actual relation between assembly, machine code, bytecode, and opcode?

I have read most of the SO questions about assembly and machine code, such as this, but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.

The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.

The main things I am looking for are:

  1. a snippet of assembly code
  2. a snippet of machine code
  3. a mapping between the snippet of assembly and machine code (how to do that mapping, or at least some general examples, and how do you know how to do this, where is all this information on the web)
  4. how to interpret the machine code (like are opcodes somehow related, and where is all the information on the web about what all those numbers mean)

Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.

Relation Between Assembly and Machine Code

My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.

So when you compile some assembly such as this example.asm:

global main
section .text

main:
  call write

write:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, message
  mov rdx, length
  syscall

section .data
message: db 'Hello, world!', 0xa
length: equ $ - message

(compile it with nasm -f macho64 -o example.o example.asm). It outputs this example.o object file:

cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800 

(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:

cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this

But how did it go from assembly instructions, to those numbers? Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on (I am using x86-64 through NASM on OSX), and how each set of numbers maps to each assembly instruction?

I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.

Relation Between Assembly and Bytecode (or is it called "opcode"?)

So from my reading so far, assembly gets transformed into machine code as demonstrated above.

But now I get confused. I see people talk about bytecode, such as in this SO answer, showing stuff like this:

void myfunc(int a) {
  printf("%s", a);
}

The assembly for this function would look like this:

OP Params OpName     Description
13 82 6a  PushString 82 means string, 6a is the address of "%s"
                     So this function pushes a pointer to "%s" on the stack.
13 83 00  PushInt    83 means integer, 00 means the one on the top of the stack.
                     So this function gets the integer at the top of the stack,
                     And pushes it on the stack again
17 13 88 Call        1388 is printf, so this calls the printf function
03 02    Pop         This pops the two things we pushed back off the stack
02       Return      This returns to the calling code.

So then I get confused. Doing some digging, I can't tell if each of those 2-digit hex numbers like 13 82 6a are each, individually, called "opcodes", and the whole set of them is called "bytecode" as a catch-all term. In addition, I can't find a table that lists out all of these 2-digit hex numbers, and what their relation is to machine code, or assembly.

To summarize, I am very much looking forward to an example showing how assembly instructions map to machine code, and it's relation to bytecode and/or opcode. (I am not looking for how a compiler does this, just how the general mapping works). I think this would clarify it for not only myself but for many people down the road who are interested in learning more about the bare metal.

One other reason why this would be valuable to know is, so one can understand how the LLVM compiler generates machine code. Do they have some sort of "complete list" of 2-digit opcodes or machine code 4-digit sequences, and know exactly how that maps to any architecture-specific assembly? Where did they get that information from? An answer to this overall question would make it much clearer how LLVM implemented its code generation.

Update

Updating from @HansPassant's comment. I actually don't care what the actual distinctions are between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where are places to begin looking for the references that hold that information on the web), and are opcodes or bytecode used anywhere in that process? And if so how?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Yes, each architecture has an instruction set reference that gives how instructions are encoded. For x86, it's the Intel? 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l, we get:

 1                                  global main
 2                                  section .text
 3
 4                                  main:
 5 00000000 E800000000                call write
 6
 7                                  write:
 8 00000005 B804000002                mov rax, 0x2000004
 9 0000000A BF01000000                mov rdi, 1
10 0000000F 48BE-                     mov rsi, message
11 00000011 [0000000000000000]
12 00000019 BA0E000000                mov rdx, length
13 0000001E 0F05                      syscall
14
15                                  section .data
16 00000000 48656C6C6F2C20776F-     message: db 'Hello, world!', 0xa
17 00000009 726C64210A
18                                  length: equ $ - message

You can see the generated machine code in the third column (first is line number, second is address).

Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.

Opcode is generally considered to be the part of the machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. There B8 is the opcode, 04000002 is the immediate operand.

Bytecode is not typically used in the assembly context, it could be thought of as the machine code for a virtual machine.


For a walkthrough, x86 is a very complicated architecture. But your sample code happens to have a simple instruction, the syscall. So let's see how to turn that into machine code. Open the above mentioned reference pdf, and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done, those 2 bytes are the machine code. How do we turn it back? Go to Appendix A: Opcode map. Section A.1 tells us: For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns.. Okay so we skip the 0F and split the 05 into 0 and 5 and look that up in table A-3 in row #0, column #5. We find it is a syscall instruction.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...