EuOS Bytecode Format version 1.0 by Jeffrey Fielding (JJProg@cyberbury.net) -------------------------------------------------------------------------------- THE ADVANTAGES OF COMPILING EUPHORIA PROGRAMS In order for EuOS to run quickly, I believe that the programs need to be compiled. Since compiled programs won't reference variables and routines by name, EuOS will be able to find the variable or routine much faster, leading to much faster execution of the program. It would even be possible to write part of the interpreter in assembly since it would be relitavely simple. When the code is compiled, two files would be generated: a compiled version of the program and a translation file to translate indexes used in the bytecode version to reference parts of the program, allowing developers to distribute the compiled version of their program and still being able to debug it, without needing to distribute the source code. Another advantage to compiled versions is that they will be smaller. This could be useful in implementing an official EuOS programs archive on the internet, where users could quickly and easily download EuOS programs that were officially approved for stability and security. Finally, compiled programs can contain extra information that would not be so easily added to a source file, such as the version, name of the program, stability and security levels, a description, and the type of program (program, driver, library, shell, or daemon). THE BASIC STRUCTURE OF A COMPILED EUPHORIA PROGRAM Although Euphoria allows variables, routines, and code anywhere in the source file, the compiled version will have a strict format with data stored in blocks. This will allow programs to be loaded faster. All errors that can be determined when the program is compiled will be, so that EuOS will be able to continue the Euphoria idea of not allowing routines to be called before they are defined without using routine_id. The file will be divided into seven sections: the header, libraries, variables, exported variables, routines, exported routines, and code. The header contains data such as the name, version, and stability and security levels of the program. Libraries contains a list of libraries which should be loaded and accessible from the program. Variables includes all the variables in the program, including variables in routines. This allows routines to be called immediately without needing to setup variables. To avoid creating large numbers of variables, any variables that can be used more than once should be. For example: procedure a() integer i i = 5 ? i end procedure procedure b() integer j j = 6 ? j end procedure a() b() In the above case, i and j would both be referenced to one variable, so the variables section would only define one variable which would be used both times since i and j are never used together. Variables also contains temporary variables which are not defined by the programmer, but the compiler to do calculations in complex formulas and pass arguments that are never assigned to a variable. The routines section contains all the routines. There is no distinction between procedures, functions, or types. The code section is code that is automatically run when a new instance of the program is started. Finally, the two exported sections define what variables and routines are available to other programs. THE INSTRUCTION SET ARCHITECTURE Before I describe all the details of the compiled program format, I want do describe the instruction set. It will help to know how the program works before discussing the format of the program. You have probably noticed that the format is very basic - very much like an assembly program. That's because it is like an assembly program. It is designed to be very simple to allow the interpreter to run the program as fast as possible. This is probably very scary to anyone used to Euphoria's simplicity, but only compiler writers will have to deal with this stuff - EuOS Euphoria will be just like Euphoria, only it will be compiled (and better). I am also sure that compiler writers are rather scared by this, especially myself since I will probably be writing the first compiler, since I designed this in the first place. But hey - C compilers can do it, and they have to deal with a much lower-level instruction set. You'll see that the instruction set isn't all that low-level anyway. CALLING ROUTINES One very important thing to note is how routines are called, and how, if functions are not distinguished from procedures, does the interpreter check for errors with a function being called as a procedure or the other way around. Obviously, the compiler will find most of these errors before the program is compiled, but when using routine_ids this could generate some problems. Luckily, I've got it all figured out. Like on a processor, the interpreter has a stack. As with a processor, there are instructions for pushing data onto the stack and popping it off. In this instruction set, there are also instructions to check the size of the stack. This is what allows programs to check for the error described above. Simply check the stack before calling a routine, then check the stack after. If the stack is bigger by one, the routine returned something. If the stack is the same size, the routine was a procedure. Otherwise, there is something wrong with the program (and it's the compiler's fault unless Euphoria compilers start allowing inline "assembly"). If the compiler feels like it, it can decide not to use the stack for calling routines and instead use variables, but unless the compiler is really good, this could lead to really messy code that is difficult to debug. I suggest making this optional if it is supported at all. THE ADDRESS SPACE The address space is the index of variables and routines. Variables and routines have seperate address spaces, so routine 1 is seperate from variable 1. Since Euphoria sequences start at 1, so does the address space. The first routine defined in the routines section is 1 and so on. The first routine in the first library is n+1 where n is the number of routines in the program. The same goes for variables. LIBRARIES Another thing that can get messy is libraries. Libraries can be very messy, as anyone who has had problems with DLLs knows. I hope EuOS will change that, but unfortunately the fastest way to access libraries is rather messy, but don't worry. It's not that hard. There are two types of libraries, just as there are two types of every other program: single instance and multiple instance. Multiple instance libraries are the easiest. Your program includes a library and a new instance is started, and the library's routines and variables are added to your routines and variables. Since everything is indexed by numbers, it is now very simple. Say you have one routine, which would be indexed as routine 1. Now you include a libraray, and its first routine is indexed as routine 2 etc. If your program calls a routine that is not exported by the library, an error is generated. The compiler should get the indexes into a library from definition files generated by the compiler that compiled the library. Single instance libraries are a little more complicated for the interpreter. They appear the same to the program using them, but access is not as fast. The interpreter maps the library into the program's addess space, so everything is a reference, not an actual variable. One problem that may arise is that a library has some parts that would be best in a multiple instance library, and some parts that must be single instance. This can be solved by dividing the library into two parts, with the single instance library being included by the multiple instance library, which is included by the program. EUOS ASSEMBLY LANGUAGE I could just refer to everything as bytecodes, but I think everyone reading this, including myself, will be able to understand this better if I create a language that people can actually program in and understand, which can then be translated into bytecodes. Perhaps experts will be able to optomize what the compiler produces before assembling it. So here it goes... NOTE: Comments are like in Euphoria As you might expect, the assembly language has much the same format as the binary version, though it is easier to read. Again, the basic format is: .header .include .data .routines .code Notice that the exported lists are missing - this allows for much more readable code. Just precede a variable/routine definition by global to put it in the appropriate exported list. THE HEADER The header simply defines a bunch of constants that are read by EuOS to find out information about the program. At the moment, the following definitions are required in every program, though others may be added. name="Program name" -- The name of the program version=1.0 -- The version of the program (must be major.minor) stability=MEDIUM -- Possible values: UNSTABLE (Alpha-test), LOW (Beta-test), -- MEDIUM (unproven), HIGH (proven), ULTRA (crash-proof) type=PROGRAM -- PROGRAM, DAEMON, DRIVER, LIBRARY, or SHELL description="A program" -- Breif description id=3932583955 -- ID assigned by EuOS team to ensure security or -1 -- if no ID has been assigned THE .INCLUDE STATEMENT Format: .include file Includes a library. File is the definition file for the library. DATA Data simply contains a list of names of variables. Example: .data a b c d e f ROUTINES A section filled with routines like so: .routines ROUTINE A: -- Code here ROUTINE B: -- Code here CODE A section filled with inital code like so: .code -- Code here INSTRUCTION FORMAT Each instruction follows the following format: 0 arguments: instruction 1 argument: instruction arg1 2 arguments: instruction arg1, arg2 3 arguments: instruction arg1, arg2, arg3 There are several different types of arguments. Internally, they are represented by only two types - either a VAL or a VAR. A VAL is an object in the code (or one that is put in by the assembler). A VAR is the actual value of a variable referenced by the argument. TYPE EXAMPLE INTERNAL REPRESENTATION literal 5 VAL (5) variable var VAR (index of var) routine foo VAL (index of foo) label loop VAL (index of loop) variable index [var] VAL (index of var) routine index [foo] VAL (index of foo) label instruction [loop] VAL (index of loop) literal as a variable index [1] VAR (1) literal as a routine index [1] VAL (1) literal as a label (instruction) [1] VAL (1) Labels can be defined like so: label: Labels are only valid inside the current routine. You can jump to a label using jmp or cjmp. The internal value of a label is the instruction index of the next instruction. You could call jmp [1] to jump to the first instruction in the current routine. Note that instruction indexes are measured from the first instruction in the current routine. THE INSTRUCTIONS THE RESULT VARIABLE You can set which variable to store the result of some operations in by using setrv. setrv variable Result register = variable DATA MANIPULATION mov dest, source Puts data from source into dest movdi dest, source, index dest[index] = source movsi dest, source, index dest = source[index] add dest, value Adds value to dest sub dest, value Subtracts value from dest mul dest, value Multiplies dest by value div dest, value Divides dest by by value and dest, source dest = dest and source or dest, source dest = dest or source xor dest, source dest = dest xor source movlen dest, source dest = length(source) repeat dest, source dest = repeat(dest,source) append dest, source dest = append(dest,source) prepend dest, source dest = prepend(dest,source) concen dest, source dest &= source BUILT-IN TYPES isinteger val result variable = integer(val) isatom val result variable = atom(val) issequence val result variable = sequence(val) BITWISE LOGICAL INSTRUCTIONS andbits dest, source dest = and_bits(dest,source) orbits dest, source dest = or_bits(dest,source) xorbits dest, source dest = xor_bits(dest,source) notbits dest dest = not_bits(dest) MATH floor dest dest=floor(dest) sqrt desr dest=sqrt(dest) remainder dest, mod dest = remainder(dest,mod) rand dest, max dest = rand(max) sin dest dest = sin(dest) arcsin dest dest = arcsin(dest) cos dest dest = cos(dest) arccos dest dest = arccos(dest) tan dest dest = tan(dest) arctan dest dest = arctan(dest) log dest dest = log(dest) (base 10) power dest, power dest = power(dest,power) COMPARISON OF ATOMS (=,!=,<,>,<=,>=) ce a, b Result variable will be (a = b) cne a, b Result variable will be (a != b) cl a, b Result variable will be (a < b) cg a, b Result variable will be (a > b) cle a, b Result variable will be (a <= b) cge a, b Result variable will be (a >= b) cmp a, b Result variable will be compare(a,b) equ a, b Result variable will be equal(a,b) ROUTINES AND BRANCHES ret Returns from a routine call routine Calls a routine jmp label Jumps to label cjmp label, ifvar Jumps to label if ifvar != 0 GENERAL end Ends the program err errorid Ends the program with error id errorid. The position in the program, stack, and variables are saved and written to a file along with the error id, and sent to the programmer. THE STACK push val Pushes val onto the stack. pop var Pops the last value on the stack into var. ldss var Loads the stack size into var. OPERATING SYSTEM INTERFACE oscall call Calls operating system call # call. Arguments are poped off the stack, so you need to push them onto the stack from left to right. The result is pushed onto the stack.