Memory (or Memory Direct): The value is in memory. The address of the value is encoded in the instruction.
Base: The address of the value is in a register (the "base" register). The instruction specifies which register.
The MIPS instruction set is easier to program, use, and implement for several reasons:
every instruction is a standard length - for our family, 32 bits.
there is a large number (32) of uniform registers. For 31 of these registers their use is
restricted only by convention. (But in reality only 26 registers are generally available)
there are very few different instruction encodings
there are no condition codes
it can function in either big- or little-endian mode
the procedure calling convention is lightweight, enabling simple leaf procedures to be
called with very little overhead
the assembly language is easily optimized and its regularity (and large number of
registers) makes "smart" compilers easier to write. (In fact, early MIPS chips relied on
the compiler to generate code in a special way to avoid the necessity of pipeline
interlocks, hence the name: Microprocessor without Interlocked Pipeline Stages)
MIPS parameters
All MIPS instructions are 32 bits long. The MIPS assembler allows pseudoinstructions. All operations are performed
in registers on 32-bit quantities. The only instructions that access memory are loads and stores,
and these are the only instructions that indicate data size. Loads and stores must be properly
aligned. Thus, a load of a four-byte quantity must be done from an address that is a multiple of
four. Loads of small sizes (halfword or byte) have signed and unsigned versions.
Register classes
As indicated, MIPS has a convention for register usage. Registers are divided into register
classes:
special-use. These registers are reserved for a special purpose. This includes $0, $at,
$v0/$v1 (function result), the stack, frame, and global pointer registers, the return address
register, and the two kernel registers (10 total). Except for the kernel registers, the stack
pointer, and $0, you may use these registers for other uses if you obey the rules.
$a0-$a3 are argument registers.
$s0-$s7 are "saved" registers. These are usually used to "hold" a value for a long period
of time. Their use is covered later.
$t0-$t9 are "temporary" registers. Use these registers for calculations.
For example, if you want to load a 32-bit word from an address contained in register $t0,
placing the word in register $t1, the instruction would be
lw $t1,0($t0)
Thus, 0 is added to the contents of $t0 to get the address of the 32-bit quantity to load. If $t0 is
the base of an array, and the element of the array you want has a constant index like 1, you could
access the element (array[1]) by adjusting the constant. But the constant must be multiplied by
the size of the array element!
Thus, if the array is an integer array, and you want element #1 (array[1]), the instruction
would be
lw $t1,4($t0)
If the array is a character array and you want element #1, each element is one byte, so the
instruction is a load byte:
lb $t1,1($t0)
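Suppose you need to access array[i], where the array is an integer array, the address of the
start of the array (the base of the array) is in $t0, and the value of i is in $t2. The index must
first be multiplied by 4 (a left shift by 2) and added to the base, leaving the address of the
element in $t3. Here is what you do:
sll $t3,$t2,2
add $t3,$t3,$t0
lw $t1,0($t3)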
Note that storing to memory takes the same format, and the memory address is still the right-
hand operand. Thus, the equivalent store operation to that above is
sw $t1,0($t3)
Loads and stores must be aligned. This means that you can only load an N-byte quantity
(N=1,2,4,...) from an address that is a multiple of N bytes. Thus, while it is legal to load a byte
from any memory address, it is only legal to load a word from an address that is a multiple of 4.
Loading the base address of the array into the $t0 register in the example above is a bit tricky. An
instruction only has room for a 16-bit constant, and addresses are 32 bits! This means we must
initialize $t0 in two steps:
the upper 16 bits of the register are set with the lui instruction (load upper immediate)
the lower 16 bits of the register are set using an ori instruction OR an addi instruction.
Note that the addi instruction has a 16-bit signed constant. If the address wanted was
0x01008008, we could not simply use lui with 0x0100 followed by addi with 0x8008! This is what
would happen: the constant 0x8008 has its most-significant bit set, so it is sign-extended to
0xFFFF8008 before the add, and the register ends up holding 0x00FF8008.
There are two solutions to this. First, we could simply add one to the constant used in the lui
instruction (0x0101 instead of 0x0100), which compensates for the sign extension.
An alternative solution is to use an ori instruction instead of an addi. (The constant in andi and
ori is unsigned.)
This operation (loading a 32-bit address into a register) is so common that it has a
pseudoinstruction in MIPS assembler: la (for load address). This means the instruction la can be
used in MIPS but it is translated to the appropriate lui/ori combination. In our case this would be
la $t0, 0x1008008
As you see, loads from 32-bit addresses ("static" data) are cumbersome. To streamline such
loads, MIPS programs are laid out in memory so that "static data" (as opposed to data on the
stack) is in blocks of 64kB. A register is then initialized with an address that is in the middle of
the current block of data. Later, that register can be used as the base register, and loads and stores
can reference memory using a signed offset from it. If there is more than one 64kB block of data
for a program, the global pointer can be initialized each time a function is entered. If a function's
data is larger than 64kB, the global pointer will have to be manipulated more often.
In our programs, data will be very small. We could simply use a global pointer that points to the
beginning of our memory area. That way offsets are always positive.
Consider a number 2^N where 31 > N > 0. This number is represented in binary on our machine
by a word with a single bit set, bit #N. Thus 2^5 has bit 5 set (counting from 0) or 100000 in
binary. In decimal, of course, this is 32.
If we move this bit to the left one position, it becomes 2^6, or 64. Moving left one bit position
multiplies the number by 2. This is a left-shift operation. A left-shift by P positions multiplies the
value by 2^P.
Similarly, moving the bit one position to the right, divides it by 2. Thus, 2^6 (decimal 64)
becomes 2^5 (decimal 32) when right-shifted one position.
Unlike the left shift, this is integer division: it is not exact when bits are shifted off the
right side and lost. For example, the binary number 011 (decimal 3), when right-shifted one
position, becomes 001 (decimal 1), and the previous 2^0 bit is lost.
Right-shifting has another problem: what to do when we right-shift a number that has its most-
significant bit set? If we shift zero bits in from the left, the sign bit is no longer set! This would
be correct if the original number was unsigned. Let's look at an example using four-bit
numbers:
1100 in a four-bit unsigned number is decimal 12. If we right-shift this one position and
set the leftmost position to 0, the result is 0110, or 6 base 10. This is the correct answer.
However, if the original number was interpreted as signed, its original value would have
been -4. When we right-shift -4 by 1 position it shouldn't become 6! Instead we want to
set the [shifted-in] most-significant bit to 1. This would produce 1110, which, when
interpreted as a signed four-bit number is -2.
These two types of right shifts are called logical (when we treat the number as unsigned, and
shift 0 bits in) and arithmetic (when we treat the number as signed, and replicate the sign bit in
the bits shifted in). Since there is only one type of left-shift and it shifts 0 bits in, it is also called
logical.
...
Example:
Given two 16-bit unsigned non-zero numbers in $t0 and $t1, set $t2 so that $t2 has the number in
$t1 as its most significant 16 bits and the number in $t0 as its least significant 16 bits.
Which code sequence is correct:
(One of the answers is correct all of the time and one is correct some of the time.)
MIPS Instructions
All of and, or, xor and nor have R-type MIPS instructions where three registers are used:
All except nor also have immediate counterparts where the 16-bit immediate value is treated as
unsigned (not sign-extended) when the operation is performed. These are useful for creating 16-
bit ANDs, ORs and XORs.
The first 16 bits of the master data structure stored for every data object on Unix, called an inode,
comprise the object's mode. The mode includes both the filetype and the basic permissions.
Within these 16 bits, bit #8 (counting from 0) indicates whether the owner of the file can read it.
Thus, our data looks like
-------B--------
where B is the bit we want, and - indicates each bit that is 'in the way'.
Problem: if a file's mode is in $t0, set $t1 to 1 if the file's owner can read it, 0 otherwise.
There are several ways to solve this problem. Let's look at this graphically again. Here is the
operation we are interested in:
You may already see a simple solution to this problem, but we are going to take the long way
around and discuss the generally-useful idea of a mask.
A mask is a special bit pattern that is constructed so that an AND or OR operation can be applied
to a selected sequence of bits to isolate it. In our case, we want to construct a mask so that we
can extract our single bit, i.e., so that the operation
Once this operation is performed, we can simply shift B to the correct position:
In our case, where the file's mode is in $t0, the following sequence would be used
We indicated earlier that this may not be the easiest solution, though it is the most general. A
simpler solution would probably be
sll $t2,$t0,23
srl $t1,$t2,31
The concept of a mask is very useful and can be used to isolate and extract any data value. We
will see it again later. As one further example, suppose we wanted to isolate all the permissions
bits in the word in $t0, placing the isolated bits in $t2. Since the permissions take up a total of 12
bits, we would use
andi $t2,$t0,0x0fff
As discussed in an earlier section, R-type instructions must have room in the instruction
encoding for the following parts:
an opcode field
Since there are 32 registers, the register fields must have 5 bits. Similarly, since the maximum
shift amount is 31, the shift amount field must have 5 bits. This leaves 12 bits for the opcode. To
make the encoding for different instruction types more compatible, the opcode field was broken
into two 6-bit fields, called opcode and function. For R-type instructions, the function (funct)
field indicates the instruction and the opcode (op) field (which is 0 or 1 for an R-type
instruction) indicates to look in the funct field for the operation code.
Encoding Instructions using Instructions
Using our example instruction of add $t0, $t1, $t2 we will use our bitwise instructions
to accomplish two tasks:
1. Create the instruction word from its parts (this is, after all, what MARS must do when it
assembles the instruction!)
2. Take an existing R-type instruction and modify the rd field, setting it to the value currently in $t4.
Leave the remainder of the fields intact.
Both of these tasks will allow us to practice with masks and our bitwise operators.
Problem: Encode the add $t0, $t1, $t2 instruction, placing the result in $t0
If you remember from our earlier discussion, we had values for these constants:
Let's assume we have the following register assignments already. (We are doing the general
solution here rather than optimizing for 0-valued fields.)
li $t1,0 # opcode
li $t2,9 # rs
li $t3,10 # rt
li $t4,8 # rd
li $t5,0 # shamt
li $t6,32 # funct
Problem: Take an existing R-type instruction and modify the rd field, setting it to $t4.
Leave the remainder of the fields intact.
This involves OR-ing in the new value of rd. But first, the current instruction's rd field must be
zeroed. This is a common use of masks. Here are the steps involved:
1. Create mask that has 1's everywhere EXCEPT the rd field, where there are 0's
2. AND the existing instruction with the mask. This zeroes the rd field, leaving the remainder of the
instruction intact.
Assuming our existing instruction is in $t0 and the value of the new rd field is in $t7, here is the
code
Note: We create the complement of the mask, then complement it to get our correct mask (this is
the nor instruction). We can use a load immediate instruction: If we use it with a hexadecimal
constant 0xF800, Mars will realize that it cannot use an addi, as this will sign-extend the result,
and, instead it will substitute an ori instruction. We will just insert the native ori instruction to
show what it looks like. $zero comes in very handy here.
Decisions
If statements and basic branches
Addressing mode
Branch instructions are I-type. The label address is encoded in the 16-bit field of the I-type
instruction. 16 bits is insufficient for a full address. The field contains a relative address, which
is part of the branch instruction's new addressing mode - pc-relative.
The pc holds the address of the next instruction to be executed. If executing an instruction at
address N, pc is set to the address N+4 (the next instruction is 4 bytes forward). The branch
instruction's 16-bit constant contains a number that indicates how to adjust the pc - but it is not
measured in bytes. To give the branch a longer range, the adjustment is measured in words. Thus,
in the sequence
the bne instruction would hold the constant -4. This number would be multiplied by 4 (read: left-
shifted by 2), then added to the pc. Originally the pc had the address of the instruction whatever.
After the adjustment by 16 bytes, it has the address of label.
This use of word-offsets in a branch instruction enables branches to have a range of 65536
instructions, centered on the current instruction. This is 256kB of code - which, in most cases, is
much larger than the size of a function.
In cases where the branch target is too far away, a jump instruction must be used instead. The
jump instruction is simple
j address
The jump instruction uses the J-type format. A J-type instruction is simple: there are 6 bits of opcode followed
by a 26-bit constant. This 26-bit constant is again a 28-bit constant in disguise (it is a word
address and the low-order 2 zero-bits are suppressed), but the result is not PC-relative. Rather,
the 28-bit result replaces the low-order 28-bits of the current PC. This gives the jump instruction
the ability to jump anywhere within the current 256MB (2^28 byte) segment of code. Let's look
at an example. If the current PC is 0x20012C58:
bit  31-28 27-24 23-20 19-16 15-12 11-8  7-4   3-0
bin  0010  0000  0000  0001  0010  1100  0101  1000
hex  2     0     0     1     2     C     5     8
and we need to jump to the address 0x20053C58, we can't use a branch instruction because the
target PC is 0x41000 bytes (or 0x10400 words) away, which does not fit in a signed 16-bit word
offset. Instead, we must alter the PC using a jump instruction. The new PC must look like this
bit  31-28 27-24 23-20 19-16 15-12 11-8  7-4   3-0
bin  0010  0000  0000  0101  0011  1100  0101  1000
hex  2     0     0     5     3     C     5     8
where we have highlighted bits 2-27 of the target PC to indicate the bits encoded in the jump
instruction. The jump instruction would then be
j 0x0014F16
The most general jump instruction, which can jump anywhere, is the jr reg instruction. In this case, the new PC
is placed in a register and the jr reg instruction indicates to jump to the address in the register.
Here, of course, there is no adjustment of the address - it is a real [byte] address. (Of course the
low-order 2 bits of the address must be zero to avoid an alignment fault.)
In the problems section of this module, Problem One compares these three techniques of
branching.
Translating if-statements
simple example:
if (a < b) {
temp = b;
b = a;
a = temp;
}
Loop Statements
A simple loop for practice
int A[N], i, element, *elptr;
i=0;
while (i<N) {
elptr = &A[i];
element = *elptr;
if (element < 0) *elptr = -element;
i++;
}
If iptr is a pointer that points to an integer, and we want it to point to an integer i (that is, iptr
holds the address of i), the declaration of iptr is
int *iptr, i;
this means that iptr is a pointer to an int. (Note that the * only applies to iptr, not to i. i is just an
int.) The * in a declaration means 'is a pointer to'.
You initialize iptr to point to an integer using the addressof operator (&) like this
iptr = &i;
To use the address in iptr and get the integer it points to you use * in an assignment statement
like this
int j = *iptr;
In an assignment statement, * means dereference, or go through the pointer and get what it
points to.
Let's go ahead now and change this to our ugly version of code:
i=0;
loop: if (i>=N) goto loopdone;
elptr=&A[i];
element=*elptr;
if (element >= 0) goto loopskip;
*elptr = -element;
loopskip: i++;
goto loop;
loopdone:
MIPS code:
A moderately-complex loop
for (i=0; i<N; i++) {
    for (j=i+1; j<N; j++) {
        if (A[j] > A[i]) {
            temp=A[i];
            A[i] = A[j];
            A[j] = temp;
        }
    }
}
A is an integer array of N elements.
i=0;
nextouter: if (i >= N) goto alldone;
j=i+1;
nextinner: if (j >= N) goto incrouter;
if (A[j] <= A[i]) goto incrinner;
temp=A[i];
A[i] = A[j];
A[j] = temp;
incrinner: j=j+1;
goto nextinner;
incrouter: i=i+1;
goto nextouter;
alldone:
MIPS: registers:
$t0 will be i (since i and j are 'throw away', we will not save them in memory)
$t1 will be j
$t8 will be N
.data
A: .word 5,3,-3,54,2,-56,7,9,11,11
N: .word 10
Now the .text region with the initialization of our variables (above)
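.text
.globl main
main:
# $t8 will be N; $t9 will be &A[0]
la $t9, A
lw $t8, N
# i=0;
li $t0,0
# nextouter: if (i >= N) goto alldone;
nextouter:
bge $t0,$t8,alldone
# j=i+1;
addi $t1,$t0,1
# nextinner: if (j >= N) goto incrouter;
nextinner:
bge $t1,$t8,incrouter
# $t5 = &A[i]; $t3 = A[i]
sll $t5,$t0,2
add $t5,$t5,$t9
lw $t3,0($t5)
# $t6 = &A[j]; $t4 = A[j]
sll $t6,$t1,2
add $t6,$t6,$t9
lw $t4,0($t6)
# if (A[j] <= A[i]) goto incrinner;
ble $t4,$t3,incrinner
# temp=A[i];
move $t2,$t3
# A[i] = A[j];
sw $t4,0($t5)
# A[j] = temp;
sw $t2,0($t6)
# incrinner: j=j+1;
incrinner:
addi $t1,$t1,1
# goto nextinner;
b nextinner
# incrouter: i=i+1;
incrouter:
addi $t0,$t0,1
# goto nextouter;
b nextouter
alldone:
jr $ra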
Switch Statements
Decisions - Problems
The .text and .data directives control whether you are defining code (.text) or data (.data) for the
program. Instructions are placed in memory at significantly different addresses than data is.
There are two reasons for this separation:
want the addresses for data, and especially for text, to be contiguous and don't want to
'run out' of addresses as our program grows.
the protections set on these areas are different. Simply, we need to execute code and read
and write data. Conversely, we don't generally need (or want) to execute data, and we
want to limit the ability to read and write code.
To ensure that these regions do not overlap, and sufficient room is permitted for each, systems
adopt a convention for the initial addresses for each. These blocks of addresses are typically
called segments. On MARS the .text segment starts at 0x400000 and the .data segment starts at
0x10000000. This gives us a contiguous region of about 250MB of .text space.
The data region: starts at 0x10000000 and ends at 0x80000000 - an area of 1.75GB. But there are
several kinds of data. The first (and most important) division is between static data and stack
data.
Stack data is used for temporary data and arguments to support a chain of procedure calls. The
stack must be allowed to 'grow' as necessary to support longer call chains of possibly more
complex procedures. Data on the stack exists as long as the procedure that defined it is active.
Thus, the address of an item on the stack is determined when the procedure is called, and will
[probably] be different if the same procedure is called a second time.
Static data is used for data that resides at a fixed constant address and is independent of active
procedures. It consists of global variables, string data (such as that used for messages), and
dynamic data such as allocated by malloc (read: new()). This last type of data is still termed
static because it is valid as long as the programmer wants, and is independent of procedures: in
other words, this data does not go out of scope. This data must be freed either explicitly or by
garbage collection (if all references to it are deleted). Just like stack data, static data must be
allowed to grow.
Support for two types of data that must grow using a single block of memory is implemented by
starting each of the types on opposite ends of the address range. Then, one type of data grows
from lower to higher addresses (the static data) and the other (the stack data) grows from higher
to lower addresses:
The static data area aka data area is subdivided into three smaller segments: initialized data,
uninitialized data, and heap data.
initialized data - this data is defined in program modules, space is allocated explicitly for it, and
it is often given an initial value. Examples of initialized data are an array that is initialized, and a
message that has predefined textual data. When using MARS it also contains variables whose
size is known but which are not initialized (i.e. using a .space directive). This latter category
would normally be part of uninitialized data. The size of the initialized data region is known
when the program is created (compiled and linked, or, in the case of MARS, just assembled).
uninitialized data follows initialized data and usually includes arrays that have a size but that are
not initialized (which are part of initialized data in MARS). This portion of uninitialized data is
called bss. Just like initialized data, the size of bss is known when the program is created. On
MARS, both initialized and uninitialized data are in the .data segment.
The end of the (initialized + uninitialized) data area is constant and is the highest legal absolute
address available to the programmer when the program starts. (Remember, stack addresses are
usually allocated outside of the programmer's control and never use absolute addresses). It is
illegal to reference addresses past the end of the known data area. The addresses that follow this
will be used to allocate our third type of static data - heap data. To make heap addresses legal,
the size of the program must be changed by extending the data area. This is done using a system
call named sbrk() (read: s-break).
In a normal program, the memory allocator package malloc() calls sbrk() to get a block of
memory addresses 'made legal' (i.e., mapped into the address space), then manages the block
internally, doling out chunks of it as requested. We will call sbrk() directly each time we need a
piece of heap memory. We give sbrk() a size, and it returns an address to a piece of memory of
that size, which is allocated by simply extending the data area size by that amount. We will do
this briefly using the MARS sbrk syscall, then, later we will call our version of the function malloc,
which will call sbrk for us and ensure the memory it returns is not zeroed.
Suppose we need room for an array of N integers, named result. We would simply define result
as a pointer to an int, and the C code to allocate the array would be as follows:
Byte Order
When memory is viewed in hexadecimal, MARS displays the characters in strings strangely. This
is due to the byte order of the underlying machine and to the fact that all MARS segment
displays are shown using four-byte quantities.
In short, there are two ways to number bytes in a four-byte quantity (a 32-bit word). In each of
the two drawings below, a word is shown as it would appear in a register. The byte farthest to the
left is the most-significant byte. The byte farthest to the right is the least-significant byte. In the
first drawing, we have numbered the bytes 0-3, where 0 is the most-significant byte.
0 1 2 3
If the bytes in our word were numbered like this, and the address of the word when it is placed in
memory was 0x1000, then the most-significant byte would appear at address 0x1000 and the
least-significant byte would appear at address 0x1003. This is how most people (at least most of
those of us who grew up in a left-to-right reading world) would think of byte-ordering, and, in
fact, when a 32-bit value in the header of an Internet packet is transferred across the Internet, the
bytes that make up the value must be transferred in this order - most-significant-byte first or
MSB-first for short. (The format is also called big-endian because the big part of the value (the
most-significant part) is transferred first.)
The other way to number bytes in our word is to assign 0 to the least-significant byte like this
3 2 1 0
Thus, if the address of this word when stored in memory was 0x1000, the byte stored at 0x1000
would be the least-significant byte. This format is called least-significant-byte first (LSB-first or
little-endian). It is the use of LSB-first byte order that causes the confusion in the memory
dumps.
ASCII strings, whose underlying type is one byte long, are stored as you would expect. The first
byte of the string is stored at address 0, the second at address 1, etc. We can see this if the string
is placed in a file and a character dump of that file is shown using Linux:
If we dump this file character-by-character we see that the characters' addresses simply increase
as we would expect:
$ od -A x -tc botest
000000   E   n   t   e   r       t   h   e       i   n   t   e   g   e
000010   r       t   o       c   o   n   v   e   r   t       t   o
000020   h   e   x   :  \n
Let's look at the data character by character where the characters are displayed in hexadecimal:
$ od -A x -tx1 botest
000000 45 6e 74 65 72 20 74 68 65 20 69 6e 74 65 67 65
000010 72 20 74 6f 20 63 6f 6e 76 65 72 74 20 74 6f 20
000020 68 65 78 3a 0a
If we output the file word-by-word, we will see that the words are read in LSB-first order. This
means that the first character ('E', whose value is 0x45) is assumed to be the least-significant
byte of the first word. When the words are redisplayed as 32-bit quantities, the characters appear
swapped:
$ od -A x -tx4 botest
000000 65746e45 68742072 6e692065 65676574
000010 6f742072 6e6f6320 74726576 206f7420
000020 3a786568 0000000a
Let's compare each of these so you can see how they line up:
000000  E  n  t  e  r     t  h  e     i  n  t  e  g  e
000000 45 6e 74 65 72 20 74 68 65 20 69 6e 74 65 67 65
000000 65746e45 68742072 6e692065 65676574
The last line here is the way you will see the data displayed in the memory dump, although the
characters will appear in correct order if four bytes of this data are loaded as a word into a
register.
Characters
There are two common ways to represent a character string consisting of ASCII characters. The
one used by MARS is the same as the one used in C - a character string is a sequence of
characters followed by a null byte (a byte with value 0). The drawback is that you cannot know the
length of a string without counting the characters. In the second encoding of a character string, the first
byte of the string indicates its length.
Besides the use of the null byte (which is also called a 'nul') to end the string, we also need to
encode the line terminator character.
Putting this all together, we can encode our string. Let's look up the values:
M    o    m    \n
0x4d 0x6f 0x6d 0x0a
Look at the memory in the .data region. Assuming your label mom is word-aligned, you will find
a word with this value
0x0a6d6f4d
with the characters in backwards order. This is due to the byte order of the machine used. Since Intel machines
are LSB-first, the data, when output in 4-byte quantities, reverses the bytes of a character string
such as this. Whenever you are examining character data in Mars you must remember that the
bytes are reversed to avoid getting confused.
International Characters
Studying Unicode can be a bit confusing, since there is a difference between the characters
themselves and how they are transmitted. The problem is this - the computers of the world use a
byte stream to transmit character data. The base Unicode character set, which encodes all of the
most commonly-used languages of the world, is a 16-bit code. Inside a program, these 16-bit
Unicode characters are referred to as wide characters. This base Unicode character set has the
ASCII characters as a subset. But how can this new 16-bit code be transmitted as a sequence of
bytes so that its users and the large number of ASCII users can easily coexist?
The solution is to encode non-ASCII Unicode characters as a sequence of two or three bytes.
This encoding is called UTF-8. When transmitting UTF-8, ASCII Unicode characters are
transmitted as single-byte ASCII values. When a non-ASCII character must be transmitted the
unused most-significant bit in the character is used to signal a change in encoding, and the
encoded character is sent. The number of leading 1-bits in the first byte indicates how many
bytes are used for the encoded character.
Consider an example in French, where 'Mom' becomes 'Mère'. In this word, three of four characters are ASCII.
These do not require any special encoding - ASCII simply remains ASCII in UTF-8. The remaining
character, è, has the value 0xe8 in Unicode. Since this is greater than 0x7f it must be encoded as
a sequence of bytes. The encoding for UTF-8 indicates that any value up to 0x7ff can be
encoded in a sequence of two bytes. The most-significant bits of the first byte would be 110 and
the most-significant bits of the second byte would be 10. The value of the character is encoded
using the remaining bits (as indicated by the x's below):
first byte: 110xxxxx (the number of leading 1-bits indicates the number of bytes used to encode
the character)
second byte: 10xxxxxx
Let's apply this to our special character è. Its value is 0xe8, which is 00011101000
expressed in 11 bits. Thus, encoded using UTF-8 the bit patterns are
first byte: 11000011 (0xc3)
second byte: 10101000 (0xa8)
Let's try it in another language. In another popular language one character that can be used as an
expression of Mom has the value 0x5988. According to UTF-8, any 16-bit value can be
encoded using three bytes. The bits used to carry the value are the x's in
first byte: 1110xxxx (again, the number of leading 1-bits indicates the number of bytes used to
encode the character)
second byte: 10xxxxxx
third byte: 10xxxxxx
Our character has the 16-bit bit pattern 0101100110001000. This is encoded as
first byte: 11100101 (0xe5)
second byte: 10100110 (0xa6)
third byte: 10001000 (0x88)
Unicode also has a 32-bit variant, which easily fits all the characters known in all languages in
the world. Even allowing for the private use area of the Unicode value range, all characters can
fit in 24 bits. It can also be encoded using UTF-8 and requires a maximum of four bytes.
If a program understands Unicode it will convert incoming UTF-8 to wide characters for internal
manipulation. When it must send the data elsewhere, it will be encoded as UTF-8 again for
transmission. Since UTF-8 is an eight-bit encoding, there is never an issue with byte-order.
If a program does not understand Unicode it will not be able to translate any encoded Unicode
characters in the input stream. These characters may appear as ? or some other funny character in
the output if the data is later displayed.
Since most languages can be represented in 128 or 256 characters, many languages (or language
groups) have their own eight-bit code, possibly with ASCII as a subset. There is a standard set of
these codes (character sets) named ISO8859-X where X indicates the specific encoding. Of
course, data written in one character encoding must be read using the same encoding. You cannot
write some data encoded in ISO8859-X and expect to read it as UTF-8.
A word on Strings
We have been using strings for a while but have not specifically talked about them. Let's take a
moment and go over the basics. We will use character strings as our example, since C strings are
easy to understand.
A C string is simply a sequence of characters with a zero byte at the end. That is what is
generated in the .asciiz directive. We are used to the construct
Here welcome is a label attached to our string. We also know that to output this string we must
get its address in the appropriate register for the syscall like this
la $a0,welcome
Mars keeps a record of where each label is. At the time the .asciiz directive was encountered,
Mars was filling up a data area (the .data section) with data, sequentially adding data as it was
encountered. At any time, a counter indicated where in the data area we currently were (the "end"
of the .data area). When the welcome: label was encountered, the value of the counter was
recorded and attached to the label welcome in an internal table called the symbol table.
Immediately following, bytes in the .data section were initialized with the contents of the
string and the counter was incremented by its length.
Later, when the la instruction was encountered, the symbol table was consulted, and the address
of welcome was substituted for the label.
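The bookkeeping just described can be modeled in a few lines of C. This is only a sketch of the idea, not how Mars is actually implemented: the struct and function names are hypothetical, the table has a small fixed size, and only the counter is tracked (the string bytes themselves are not stored).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SYMBOLS 16

/* Hypothetical model of an assembler's .data bookkeeping. */
struct symbol { const char *name; uint32_t address; };

struct data_section {
    uint32_t counter;                  /* current "end" of the .data area */
    struct symbol table[MAX_SYMBOLS];  /* the symbol table */
    int count;                         /* number of labels recorded so far */
};

/* Record the current counter value under the given label name. */
void define_label(struct data_section *d, const char *name)
{
    assert(d->count < MAX_SYMBOLS);
    d->table[d->count].name = name;
    d->table[d->count].address = d->counter;
    d->count++;
}

/* Model .asciiz: the characters plus a terminating zero byte go into the
   section, so the counter advances by strlen(text) + 1. */
void emit_asciiz(struct data_section *d, const char *text)
{
    d->counter += (uint32_t)strlen(text) + 1;
}

/* Look a label up, as happens when la is assembled. Returns 0 if unknown. */
uint32_t lookup_label(const struct data_section *d, const char *name)
{
    for (int i = 0; i < d->count; i++)
        if (strcmp(d->table[i].name, name) == 0)
            return d->table[i].address;
    return 0;
}
```

Defining a label and then emitting an 18-character string leaves the counter 19 bytes past the label's recorded address, while a later lookup still returns the address recorded when the label was seen.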
For example, if the current end of the .data section was 0x10000040 when the .asciiz
directive was encountered, the address 0x10000040 would be entered in the symbol table for
welcome, 19 bytes (the string plus a null byte) would be initialized (0x10000040 through
0x10000052), and the new end of the .data section would be 0x10000053. Later, when the
la instruction was encountered, the address 0x10000040 would be substituted for the label
welcome. This would essentially become a load immediate instruction with a 32-bit constant
li $a0,0x10000040
which is itself a pseudoinstruction that the assembler expands into two real instructions
lui $at,0x1000
addiu $a0,$at,0x40
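To see where the two 16-bit immediates come from, the following C sketch splits a 32-bit constant for a lui/addiu pair. The function names are hypothetical. Because addiu sign-extends its immediate, the sketch bumps the upper half by one whenever bit 15 of the lower half is set (an assembler that uses ori for the second instruction would not need this adjustment, since ori zero-extends).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a 32-bit constant into the two 16-bit
   immediates for a lui/addiu pair. Adding 0x8000 before shifting raises the
   upper half by one exactly when bit 15 of the lower half is set. */
void split_for_lui_addiu(uint32_t value, uint16_t *upper, uint16_t *lower)
{
    assert(upper != NULL && lower != NULL);
    *lower = (uint16_t)(value & 0xFFFFu);
    *upper = (uint16_t)((value + 0x8000u) >> 16);
}

/* Recombine the way the two instructions would:
   (upper << 16) + sign_extend(lower), in 32-bit wraparound arithmetic. */
uint32_t combine_lui_addiu(uint16_t upper, uint16_t lower)
{
    uint32_t sign_extended = (lower & 0x8000u) ? (0xFFFF0000u | lower) : lower;
    return ((uint32_t)upper << 16) + sign_extended;
}
```

For 0x10000040 the halves are simply 0x1000 and 0x40, matching the instructions above. For a constant such as 0x10008000 the upper half becomes 0x1001 so that the sign-extended lower half brings the sum back to the intended value.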
Once the address of the string is in a register, you can perform the syscall or write code to
manipulate the string character by character. For example, to retrieve the first character of the
string and then advance to the next one:
lb $t0,0($a0)
addi $a0,$a0,1
and so on.
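The same character-by-character walk looks like this in C. It is a minimal sketch (the name count_chars is hypothetical, and it does essentially what the standard strlen does); the comments map each step back to the MIPS instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters before the zero byte, mirroring the lb / addi loop. */
size_t count_chars(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != '\0') {   /* lb   $t0,0($a0) and test for the zero byte */
        n++;
        s++;               /* addi $a0,$a0,1 */
    }
    return n;
}
```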
Let's take one more step and examine the C statement (using a global variable welcome):
char *welcome = "...";
Although this looks very similar, it is somewhat different. Again, a null-terminated ASCII string
must be initialized and its address recorded. This time, however, the address is used to initialize a
[pointer] variable with type char *. Let's look at the code that would be generated for Mars by
a compiler, then explain it
.Lwelcome: .asciiz "..."
welcome: .word .Lwelcome
Just like before, a label is generated (this time by the compiler) to attach to the string. The
[compiler-generated temporary] label (.Lwelcome) is entered into the symbol table with the
address of the string. Then, to keep track of the string in the high level language, a variable is
initialized with this address. The variable is a char *, or a pointer to char. When Mars
encounters the .word directive, it looks up the value of .Lwelcome and substitutes it. In our
example, the .word directive would become
.word 0x10000040
To get the address of the string into a register, we now load the contents of that word:
lw $a0,welcome
Note the difference between the la and lw. In the earlier case welcome was the label
attached to the string. In this case it was a data word that had been initialized with the
address of the string.
Once you have the address of the string in a register you manipulate it the same way.
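The la/lw distinction has a close analogue in C: an array name already stands for the address of its characters, while a pointer variable is a separate word in memory that holds that address. The sketch below uses hypothetical names and a placeholder string to show both.

```c
#include <assert.h>

/* A name attached directly to the characters (the .asciiz case). */
static const char greeting_array[] = "hello";

/* A separate word in memory holding the address of the characters
   (the .word case). */
static const char *greeting_pointer = "hello";

/* The array occupies the five characters plus the zero byte. */
static_assert(sizeof greeting_array == 6, "five characters plus the zero byte");

/* The array name is already the address: comparable to la. */
const char *address_from_array(void)   { return greeting_array; }

/* The pointer variable must be read from memory first: comparable to lw. */
const char *address_from_pointer(void) { return greeting_pointer; }
```

Either function returns an address that can then be used in exactly the same way.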
Note that the characters of a C string literal must be treated as constant characters. Modifying
the characters of a string initialized in this way is not allowed (the behavior is undefined), so
the pointer welcome above is better declared as const char *, or pointer to constant char. You
may very well get a fault if you modify the characters.
If you want to modify characters in a string you must use a character array and initialize it in
some way - by reading in a string or by copying a constant string to it.
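A minimal C sketch of the second option, copying a constant string into a writable character array, follows. The helper name copy_to_buffer is hypothetical, and the caller is assumed to supply the array and its size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a constant string into caller-supplied writable storage.
   Returns 0 on success, -1 if the buffer is too small. */
int copy_to_buffer(char *buffer, size_t size, const char *source)
{
    assert(buffer != NULL && source != NULL);
    size_t needed = strlen(source) + 1;   /* characters plus the zero byte */
    if (needed > size)
        return -1;
    memcpy(buffer, source, needed);
    return 0;
}
```

After a successful copy the characters live in the array, so a statement such as buffer[0] = 'M'; is perfectly legal.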
Other pointers
The initialization of integer pointers is exactly the same. The only difference is how you use and
increment the pointer: at that point you must know the size of the underlying data. If the register
$a0 has been initialized with the correct address:
for a string
lb $t0,0($a0)
addi $a0,$a0,1
for an array of integers
lw $t0,0($a0)
addi $a0,$a0,4
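In C the compiler picks the load size and the increment from the pointer's type. The sketch below (hypothetical function names, fixed-width types from stdint.h) walks a byte array and a word array with the same source-level p++.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum n bytes: each p++ advances the address by 1 (lb, then addi ...,1). */
int32_t sum_bytes(const int8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    int32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += *p++;
    return total;
}

/* Sum n words: each p++ advances the address by sizeof(int32_t), which is 4
   (lw, then addi ...,4). */
int32_t sum_words(const int32_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    int32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += *p++;
    return total;
}
```

In assembly there is no type to consult, so the programmer has to choose lb or lw and the matching increment by hand.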
passing parameters
allocating local storage. This storage may be deallocated when the procedure returns
A very important aspect of this implementation is that both sides of a procedure call are blind.
Other than the published interface (of parameters and return value), the caller and callee do not
know anything about each other. Procedures must be able to be translated separately (in separate
modules) and work when the modules are joined.
All of these capabilities are provided by the platform's procedure calling convention and
implemented using a stack. A stack is a block of memory with a movable pointer (the stack
pointer - $sp on MIPS). At any time, memory on one side of the stack pointer is in-use and
memory on the other side of the stack pointer is not yet allocated. The stack pointer itself points
to the last memory address that is in-use. This is called the top-of-stack.
Stacks are usually word- or double-word aligned for simplicity. This means that the stack pointer
must be moved by multiples of the alignment size - in the case of words, by four-byte
increments. Any data placed on the stack must be padded to keep proper alignment, so every item
occupies at least four bytes: scalars smaller than four bytes are converted to their four-byte
counterparts before being placed on the stack.
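The padding rule amounts to rounding each size up to the next multiple of the alignment. A small C sketch, assuming word (four-byte) alignment and a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

#define STACK_ALIGNMENT 4u   /* word alignment, as assumed in the text */

/* Round a size up to the next multiple of the stack alignment. */
size_t padded_size(size_t size)
{
    assert(size <= (size_t)-1 - (STACK_ALIGNMENT - 1));   /* guard against overflow */
    return (size + STACK_ALIGNMENT - 1) & ~(size_t)(STACK_ALIGNMENT - 1);
}
```

A one-byte char and a two-byte short each take four bytes of stack, and a five-byte structure takes eight.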
Although stack memory and other (static and dynamic) memory could be implemented as
completely separate memory areas (addresses), they are usually implemented using one large
block of addresses, with each starting at opposite ends. In this case (e.g., MIPS) memory is
allocated using increasing addresses and stack space is allocated using decreasing addresses (i.e.,
the stack grows downwards):
high address ---> stack (grows downward)
                  ...
                  dynamically allocated (grows upward)
                  ...
low address  ---> static memory: pre-allocated (globals)
(Note that on this implementation the top of stack is actually the lowest address in-use.)
Traditionally, placing an item on the stack is called pushing the item on the stack. In our
implementation, this involves the following steps
moving the stack pointer (by subtracting the size of the item)
storing the item on the stack (using a non-negative offset from the current stack pointer)
The inverse operation is called popping an item off the stack. The sequence is important:
retrieving the item from the stack (using a non-negative offset from the current stack pointer)
moving the stack pointer (by adding the size of the item)
This inverse sequence is very important to avoid referencing memory that is not in the active part
of the stack. No such reference should ever be made.
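Pushing, and popping as its inverse (read the item first, then move the pointer), can be modeled in C with an array standing in for stack memory. This is only a sketch: the struct, the 64-word size, and the function names are hypothetical, and the stack pointer is an array index rather than a byte address, so it moves by 1 where $sp would move by 4.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

/* Hypothetical model: the highest index is the "high address" end, and the
   stack grows toward index 0. */
struct stack {
    uint32_t memory[STACK_WORDS];
    int sp;   /* index of the last in-use word; STACK_WORDS when empty */
};

void stack_init(struct stack *s) { s->sp = STACK_WORDS; }

/* Push: move the stack pointer first, then store at offset 0. */
void push(struct stack *s, uint32_t value)
{
    assert(s->sp > 0);                   /* room left */
    s->sp -= 1;                          /* addi $sp,$sp,-4 */
    s->memory[s->sp] = value;            /* sw   value,0($sp) */
}

/* Pop: load from offset 0 first, then move the stack pointer. */
uint32_t pop(struct stack *s)
{
    assert(s->sp < STACK_WORDS);         /* not empty */
    uint32_t value = s->memory[s->sp];   /* lw   value,0($sp) */
    s->sp += 1;                          /* addi $sp,$sp,4 */
    return value;
}
```

In both functions every memory access happens at or above the current stack pointer, so nothing outside the active part of the stack is ever touched.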
Simple Procedures