7. Variable types

Choosing types for parameters and variables

In assembly language, data is generally untyped: the most that an instruction reveals about a value is its width in bits. For example, there is nothing in our sample file which indicates the type of the data held in D0, other than that it is a 32-bit quantity.

In contrast, in C all data must have a type. This can be one of the base types (such as char or long), or a pointer to a base type, or a more complex type such as a structure, or a pointer to a union. Relogix therefore has to choose types for the data held in registers or on the stack, and indeed for global data.

The analysis by which Relogix deduces variable types is very complex, because it needs to consider every usage of the variable, and how it relates to the usage of other variables, in order to determine the best choice. It starts by examining each place where a variable is referenced. For example, it might discover that:

Variable is used as a pointer to another variable, which is 32 bits wide;
Variable is used in a signed comparison;
Variable is used in a multiply instruction.

This usage information is used to create a set of 'type rules', which may contain links to rules for other variables (for example, variable A is a pointer to variable B, so choosing a type for variable B implies a type for variable A). In addition, the type information may have to be combined with information from other source files.
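The way one rule can link to another may be pictured with a minimal sketch. This is not Relogix's internal representation: `struct var` and `type_of` are hypothetical names introduced only for illustration, and the links are assumed to be acyclic. The point is simply that once a type is chosen for variable B, the type of any variable A that points to B follows from it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one variable in the sketch. */
struct var {
    const char *base_type;        /* chosen C type, or NULL if not yet chosen */
    const struct var *points_to;  /* variable this one points to, or NULL     */
};

/* Write the C type implied for v into buf (n bytes available). */
void type_of(const struct var *v, char *buf, size_t n)
{
    size_t len;

    assert(v != NULL && buf != NULL && n > 0);
    if (v->points_to == NULL) {
        /* Not a pointer: use the chosen base type, or "?" if none yet. */
        snprintf(buf, n, "%s", v->base_type != NULL ? v->base_type : "?");
        return;
    }
    /* A pointer: its type is the target's type followed by " *". */
    type_of(v->points_to, buf, n);
    len = strlen(buf);
    snprintf(buf + len, n - len, " *");
}
```

If B is given the base type `unsigned long` and A has `points_to` set to B, then `type_of` for A yields `unsigned long *`.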

Having created the type rules, Relogix then tries to solve them recursively. In doing so, it looks at a number of different possible types for the variable, and chooses the most likely. As it solves the rules, it may encounter contradictions (for example, the variable may be used as a pointer in one instruction, but be used as a signed integer in another); if so, it makes a choice based on a weighting attached to the various input rules.
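As a rough illustration of resolving a contradiction by weighting, the sketch below sums the weights of the evidence for each candidate type and picks the highest total. The candidate list, the weights, and the names `enum cand`, `struct evidence` and `choose_type` are all assumptions made for this example; the actual rules and weightings used by Relogix are not described here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate types for a 32-bit variable. */
enum cand { CAND_LONG, CAND_ULONG, CAND_POINTER, CAND_COUNT };

/* One observed usage: the candidate it supports and how strongly. */
struct evidence {
    enum cand supports;
    int weight;   /* hypothetical weighting; larger means stronger */
};

/* Return the candidate with the largest total weight.
   Ties go to the candidate that appears first in enum cand. */
enum cand choose_type(const struct evidence *ev, size_t n)
{
    int score[CAND_COUNT] = {0};
    int best = CAND_LONG;
    int c;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(ev[i].supports < CAND_COUNT);
        score[ev[i].supports] += ev[i].weight;
    }
    for (c = 1; c < CAND_COUNT; c++) {
        if (score[c] > score[best])
            best = c;
    }
    return (enum cand)best;
}
```

With one pointer usage of weight 5 and two integer usages of weight 2 each, the sketch picks the pointer type; raise the integer weights to 3 each and it picks `long` instead.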

In our example, it has selected long for the variable corresponding to the D0 register (ECX in the x86 version), because it is a 32-bit value used as an arithmetic quantity. For D1 (EAX in the x86 version), it could have chosen either long or unsigned long; it has chosen unsigned long because the register is compared against the value 0x80000000, which suggests that an unsigned value is more likely. Once that choice is made, Relogix can deduce the types of the variables corresponding to A0 and A1 (ESI and EDI in the x86 version) inside the loop, because they are pointers to the data which is placed in D1.
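The reasoning about 0x80000000 can be illustrated in C. The sketch below uses the fixed-width types from `<stdint.h>` rather than long, on the assumption that long may not be 32 bits wide on the machine where the example is compiled. The function names are hypothetical. It shows that 0x80000000 is a sensible limit for an unsigned 32-bit value, whereas for a signed value the same test is just a check of the sign bit, which would more naturally be written as a comparison with zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the same 32 bits as a signed value. */
int32_t as_signed(uint32_t bits)
{
    int32_t v;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Unsigned view: is the value at or above 0x80000000? */
int at_or_above_limit(uint32_t bits)
{
    return bits >= UINT32_C(0x80000000);
}

/* For every bit pattern, the unsigned test above gives the same answer
   as asking whether the signed view is negative. */
void check_equivalence(uint32_t bits)
{
    assert(at_or_above_limit(bits) == (as_signed(bits) < 0));
}
```

For example, the bit pattern 0x80000001 is above the limit when read as unsigned, but is a negative number when read as signed.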

Structures

The most interesting type shown in our example is that of A0 (or ESI) on entry to the function. Relogix sees that the code does two memory accesses using offsets from A0:

    MOVE.L    TOKCTR(A0),D0

and

    MOVE.L    TOKBUF(A0),A0

Such accesses are indicative of a possible structure in memory. In this case, the use of the names TOKCTR and TOKBUF provides further information, because Relogix can see the definition of these names in the source assembler file. They are defined as an incrementing series of equates, which is a possible indication of a structure definition. (Assembler OFFSET sections or MASM STRUCT directives are an alternative indication.) Relogix has combined all of these clues together, and has deduced two separate pieces of information from them:

(a) That the assembler equates TOKCHAR, TOKALT, TOKBUF and TOKCTR effectively constitute a structure definition, and
(b) That on entry to the subroutine movetok, A0 (ESI in the x86 version) points to an instance of this structure.
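The correspondence between an offset-based access such as TOKCTR(A0) and a structure member can be sketched in C with the standard offsetof macro. The structure below is the one Relogix generates for this example; `load_tokctr` is a hypothetical helper written for this illustration. Note that the offsets here are whatever the C compiler assigns, which need not match the values of the original assembler equates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct tok {
    char tokchar;
    char tokalt;
    unsigned long *tokbuf;
    long tokctr;
};

/* Read the tokctr field the way the assembler does: base address
   plus a fixed offset, with no knowledge of the structure type. */
long load_tokctr(const void *base)
{
    long value;

    assert(base != NULL);
    memcpy(&value,
           (const char *)base + offsetof(struct tok, tokctr),
           sizeof value);
    return value;
}
```

For any `struct tok t`, `load_tokctr(&t)` returns the same value as the typed access `t.tokctr`, which is the form the translated C code can use once the structure has been recognized.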

Using the type information in the function prototype

All this information has been used to finalize the function prototype:

void movetok (struct tok *tok_ptr, unsigned long *ptr);

and also to create, in the header file, the definition of the structure type:

#ifndef __TYPE_TOK_DEFINED
#define __TYPE_TOK_DEFINED
struct tok {
    char tokchar;
    char tokalt;
    unsigned long *tokbuf;
    long tokctr;
};
#endif

(This definition is protected by #ifndef because translation of another file may also give rise to a definition of the same structure.)
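The effect of the guard can be seen in a small sketch in which the same guarded definition appears twice in one compilation, as would happen if two translated headers were both included. The second copy is skipped by the preprocessor, so the structure is defined only once. `tok_count` is a hypothetical helper added only to show that the type is usable afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* First copy, as it might arrive from one translated header. */
#ifndef __TYPE_TOK_DEFINED
#define __TYPE_TOK_DEFINED
struct tok {
    char tokchar;
    char tokalt;
    unsigned long *tokbuf;
    long tokctr;
};
#endif

/* Second copy, from another header: the guard macro is already
   defined, so this text is skipped and no redefinition occurs. */
#ifndef __TYPE_TOK_DEFINED
#define __TYPE_TOK_DEFINED
struct tok {
    char tokchar;
    char tokalt;
    unsigned long *tokbuf;
    long tokctr;
};
#endif

/* Hypothetical helper: return the counter field of a token. */
long tok_count(const struct tok *t)
{
    assert(t != NULL);
    return t->tokctr;
}
```

Without the guard, the second definition would be rejected by the compiler as a redefinition of struct tok.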