
Software for Embedded Systems

CS424
KCS Murti, Chandra.kavuri@gmail.com

Embedded Processor Architectures

Overview
- Concepts associated with the processor within the SoC
- Reference processor: the Intel Atom, a 32-bit processor with a number of onboard peripherals
- A cost-effective SoC is the objective for embedded systems (ES)

General Purpose Registers


- Application binary interface (ABI): specification that describes which registers are used for what purpose
- Primary memory models: flat, segmented, and real-address (8086) mode
- Most environments have migrated to a linear 32-bit flat memory model
- 32-bit extended instruction pointer (EIP)
- Flags
- Stack pointer
- Base pointer

Flags

Basic IA-32 program Execution registers

IA32 basic execution environment

Basic execution environment


- Address space: linear address space of up to 4 GBytes; physical address space of up to 64 GBytes (2^36 bytes)
- Basic program execution registers: eight general-purpose registers, six segment registers, the EFLAGS register, and the EIP register
- x87 FPU registers
- MMX registers: support execution of single-instruction, multiple-data (SIMD) operations on 64-bit packed byte, word, and doubleword integers
- XMM registers: support execution of SIMD operations on 128-bit packed single-precision and double-precision floating-point values and on 128-bit packed integers
- Stack

IA32 memory models


- Flat memory model: memory appears to a program as a single, continuous address space (the linear address space).
- Segmented memory model: memory appears to a program as a group of independent address spaces called segments. Logical addresses are mapped to linear addresses internally.
- Real-address mode: the 8086 memory model.

Modes of Operation vs. Memory Model


- Protected mode: the processor can use any of the memory models. The memory model used depends on the design of the operating system or executive. When multitasking is implemented, individual tasks can use different memory models.
- Real-address mode: the processor supports only the real-address mode memory model.
- System management mode: the processor switches to a separate address space, called the system management RAM (SMRAM). Memory access is similar to real-address mode.
- 64-bit mode: segmentation is generally (but not completely) disabled, which creates a flat 64-bit linear address space.

Modes of operation
- Real-address mode: implements the programming environment of the Intel 8086 processor, with extensions. It is the default mode after power-up or a reset.
- Protected mode: the native state of the processor. It includes the ability to directly execute real-address mode 8086 software in a protected, multitasking environment.
- System management mode (SMM): the processor switches to a separate address space while saving the basic context of the currently running program or task. It provides an OS with a transparent mechanism for implementing platform-specific functions such as power management.
- Virtual-8086 mode: allows the processor to execute 8086 software in a protected, multitasking environment.

Special use of registers


Register  Special use
EAX       Accumulator for operands and results data
EBX       Pointer to data in the DS segment
ECX       Counter for string and loop operations
EDX       I/O pointer
ESI       Pointer to data in the segment pointed to by the DS register; source pointer for string operations
EDI       Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations
ESP       Stack pointer (in the SS segment)
EBP       Pointer to data on the stack (in the SS segment)

Segment Registers

Flat memory model

Segment memory model

Privilege Levels
- Privilege levels provide a mechanism that allows portions of the software to operate with differing levels of privilege.
- The current privilege level (CPL) is used by the system to control access to resources and the execution of certain instructions.
- The CPL is stored in the lowest 2 bits of the code segment (CS) register.
- The I/O privilege level (IOPL) is stored in bits 12 and 13 of the EFLAGS register.
- The highest privilege level is number zero.
- If the CPL is numerically less than or equal to the current IOPL, the privileged I/O operation is allowed.
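The bit positions above can be illustrated with a minimal C sketch. The function names are hypothetical, and the selector and EFLAGS values are passed in as plain integers rather than read from the processor, so this only demonstrates the field extraction and the CPL-versus-IOPL comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPL: bits 0-1 of the CS segment selector. */
unsigned cpl_from_cs(uint16_t cs) {
    return cs & 0x3u;
}

/* IOPL: bits 12-13 of the EFLAGS register. */
unsigned iopl_from_eflags(uint32_t eflags) {
    return (eflags >> 12) & 0x3u;
}

/* Lower number means more privilege, so I/O-sensitive instructions
   are allowed when CPL is numerically <= IOPL. */
bool io_allowed(uint16_t cs, uint32_t eflags) {
    return cpl_from_cs(cs) <= iopl_from_eflags(eflags);
}
```

For instance, a selector of 0x0023 carries CPL 3, and an EFLAGS value of 0x0202 carries IOPL 0, so in this model the I/O operation would be refused.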

Floating-Point Units
- Intel processors have two floating-point units: the x87 FPU and the SIMD floating-point unit.
- The x87 FPU operates on floating-point, integer, and binary coded decimal (BCD) operands.
- It supports 80-bit double extended-precision floating point.
- The Atom processor supports the Supplemental Streaming SIMD Extensions 3 (SSSE3) version of the SIMD instructions, which support integer, single-precision, and double-precision floating-point operands.

x87 FPU Execution Environment

Example x87 FPU Dot Product Computation

Processor Specifics
- Embedded systems need some form of processor version management.
- Establish exactly which features are supported on the particular processor version you are working with.
- Use the control and information registers.
- Use the CPUID instruction to obtain this information.

Application Binary Interface


- Embedded software is mostly developed in a high-level language (HLL), but we should be aware of how assembly and high-level languages interact.
- The ABI defines the following aspects of the code generated by compilation:
  - Data representation
  - Data alignment and packing
  - Stack alignment
  - Register usage
  - Function calling conventions
  - Relocation/relative addressing
  - Name mangling
- A key aspect of a calling convention is the creation of the stack frame.

ABI Register Calling Conventions: 32-Bit Linux/Windows

IA-32 register   32-bit Linux GNU / Windows
EAX              Scratch and return value
EBX              Callee save
ECX              Scratch
EDX              Scratch and return
ST0-ST7          Scratch; ST0 returns float
ESI              Callee save
EDI              Callee save
EBP              Callee save
XMM0-XMM7        Scratch registers
YMM0-YMM7        Scratch registers; 256 bit, on AVX-capable processors only

Calling conventions
- cdecl
  - Default calling convention
  - Requires the calling function to perform the stack cleanup
  - Supports functions with a variable number of arguments
- stdcall
  - Supports a fixed number of arguments for a function
  - Stack cleanup is performed by the called function
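A short C sketch can show why caller cleanup and variable argument counts go together. The function `sum_ints` is a hypothetical example written in standard C; on a 32-bit x86 compiler it would typically be compiled with the default cdecl convention, but the code itself does not depend on that.

```c
#include <stdarg.h>

/* Sums `count` int arguments. Only the caller knows how many arguments
   it passed, which is why caller-side stack cleanup (cdecl) suits
   variadic functions and callee cleanup (stdcall) does not. */
int sum_ints(int count, ...) {
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; ++i)
        total += va_arg(args, int);
    va_end(args);

    return total;
}
```

A call such as `sum_ints(3, 1, 2, 3)` passes four arguments, and under cdecl the calling code is the one that releases them after the function returns.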

Embedded systems generally use the C default, the cdecl convention

Processor Instructions

Operands
- Immediate operands: data encoded in the instruction itself. Ex: MOV EAX, 00
- Register operands: source and destination operands can be any of the following registers:
  - 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)
  - 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)
  - 8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, or DL)
  - Segment registers
  - EFLAGS register
  - MMX registers
  - Control registers (CR0 through CR4)
  - System table registers (such as the interrupt descriptor table register)
  - Debug registers
  - Machine-specific registers
- Memory operands: referenced by means of a segment selector and an offset.

Default Segment Selection Rules

Memory operands
- Source and destination operands in memory are referenced by means of a segment selector and an offset.
- Ex: MOV [EBX], EAX ; moves the value in EAX to the address pointed to by EBX
- The offset is computed from the following components:
  - Base: a value in a general-purpose register
  - Displacement: an 8-, 16-, or 32-bit immediate value
  - Index: a value in a general-purpose register
  - Scale factor: a value of 2, 4, or 8 that is multiplied by the index value
- Offset = Base Register + (Index Register x Scale) + Displacement Value. The segment selector supplies the segment base address to which this offset is added.
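The offset computation can be sketched in C as follows. The function name `effective_offset` is hypothetical, arithmetic wraps at 32 bits as it would in 32-bit addressing, and a scale of 1 is accepted here to represent an unscaled index.

```c
#include <assert.h>
#include <stdint.h>

/* offset = base + (index * scale) + displacement, modulo 2^32.
   A scale of 1 stands for "no scaling"; 2, 4, and 8 are the
   scale factors listed above. */
uint32_t effective_offset(uint32_t base, uint32_t index,
                          uint32_t scale, int32_t displacement) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint32_t)displacement;
}
```

With a base of 0x1000, an index of 3, a scale of 4, and a displacement of 8, the sketch yields 0x1000 + 12 + 8 = 0x1014, which is the kind of address an operand such as [EBX + ESI*4 + 8] would reference.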

Offset computation

Data Types

Basic

Bit field

Pointer

Instruction group for Atom processors


- General purpose
- x87 FPU
- x87 FPU and SIMD state management
- MMX technology
- SSE extensions
- SSE2 extensions
- SSE3 extensions
- SSSE3 extensions

General purpose
Same as x86. Some special instructions:
- CMOVE: conditional move
- BTS: bit test and set
- ENTER: high-level procedure entry
- LEAVE: high-level procedure exit

SIMD Execution model


Streaming SIMD Extensions (SSE) operate on sequential arrays of:
- Integers of 8, 16, or 32 bits
- Single-precision 32-bit floating-point data
- Double-precision 64-bit floating-point data
- Packed byte, word, and doubleword integers

Example
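/* Adds four packed single-precision floats: c[0..3] = a[0..3] + b[0..3].
   MSVC 32-bit inline assembly; a, b, and c must be 16-byte aligned for MOVAPS. */
void add(float *a, float *b, float *c)
{
    __asm {
        mov    eax, a                     ; eax = address of a
        mov    edx, b                     ; edx = address of b
        mov    ecx, c                     ; ecx = address of c
        movaps xmm0, XMMWORD PTR [eax]    ; load four floats from a
        addps  xmm0, XMMWORD PTR [edx]    ; add four floats from b
        movaps XMMWORD PTR [ecx], xmm0    ; store the four sums to c
    }
}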
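The same packed addition can also be written with compiler intrinsics instead of inline assembly. The sketch below is an alternative, not the slide's method: `add4` is a hypothetical name, it uses the SSE intrinsics from `<xmmintrin.h>` with unaligned loads and stores so that no alignment requirement applies, and it falls back to a scalar loop when the compiler is not targeting SSE.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* c[0..3] = a[0..3] + b[0..3] */
void add4(const float *a, const float *b, float *c) {
#if HAVE_SSE
    __m128 va = _mm_loadu_ps(a);            /* load four floats (MOVUPS) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* packed add (ADDPS), then store */
#else
    for (size_t i = 0; i < 4; ++i)          /* portable scalar fallback */
        c[i] = a[i] + b[i];
#endif
}
```

Intrinsics leave register allocation to the compiler, which is usually easier to maintain than hand-assigned registers in an `__asm` block.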

X87 FPU Instructions


Data Transfer Instructions
- FLD: load floating-point value
- FST: store floating-point value

Basic Arithmetic Instructions
- FADD: add floating-point
- FADDP: add floating-point and pop

Comparison Instructions
- FCOM: compare floating-point
- FCOMP: compare floating-point and pop

Transcendental Instructions
- FSIN: sine
- FCOS: cosine

Control Instructions
- FINCSTP: increment FPU register stack pointer
- FDECSTP: decrement FPU register stack pointer

x87 FPU Data Type Formats

MMX Instructions
- Four extensions provide single-instruction, multiple-data (SIMD) operations: MMX technology, SSE extensions, SSE2 extensions, and SSE3 extensions.
- MMX instructions operate on packed byte, word, doubleword, or quadword integers.
- They use the eight 64-bit MMX registers and the general-purpose registers.

Data Transfer Instructions
move doubleword and quadword operands between MMX registers and between MMX registers and memory.

Conversion Instructions
pack and unpack bytes, words, and doublewords

Arithmetic Instructions
perform packed integer arithmetic

Comparison Instructions
compare packed bytes, words, or doublewords

Logical Instructions
Shift and Rotate Instructions
State Management Instructions

SSE Instructions
- Streaming SIMD Extensions (SSE) enhance the performance of IA-32 processors for advanced 2-D and 3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing.
- They execute on Intel 64 and IA-32 processors that support SSE extensions.
- Support is detected using the CPUID instruction.
- Instruction groups:
  - SIMD single-precision floating-point instructions that operate on the XMM registers
  - MXCSR state management instructions
  - 64-bit SIMD integer instructions that operate on the MMX registers
  - Cacheability control (temporal and non-temporal), prefetch, and instruction ordering instructions

SSE Execution environment

SSE2 Instructions
- Operate on packed double-precision floating-point operands and on packed byte, word, doubleword, and quadword operands located in the XMM registers.
- Instruction groups:
  - Packed and scalar double-precision floating-point instructions
  - Packed single-precision floating-point conversion instructions
  - 128-bit SIMD integer instructions
  - Cacheability-control and instruction ordering instructions

SSE3 instructions
Accelerate the performance of Streaming SIMD Extensions (SSE), SSE2, and x87 FPU math capabilities:
- One x87 FPU instruction used in integer conversion
- One SIMD integer instruction that addresses unaligned data loads
- Two SIMD floating-point packed ADD/SUB instructions
- Four SIMD floating-point horizontal ADD/SUB instructions
- Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions
- Two thread synchronization instructions

Vertical data movement

Horizontal data movement

SSE4 instructions
Improve the performance of media, imaging, and 3D workloads:
- Dword multiply
- Floating-point dot product
- Streaming load hint
- Packed blending
- Floating-point round instructions with selectable rounding
- Insertion and extraction from XMM registers
- Packed integer format conversions
- String and text processing

Advanced Vector Extensions (AVX)


- Promotes the legacy 128-bit SIMD instruction sets that operate on the XMM register set to use a vector extension (VEX) prefix and to operate on 256-bit vector registers (YMM).
- 256-bit floating-point arithmetic processing enhancements
- 256-bit non-arithmetic instruction enhancements

Input/Output
- I/O is accessed through a separate I/O address space or through memory-mapped I/O.
- The I/O address space contains 64K individually addressable 8-bit I/O ports.
- I/O devices that respond like memory components can be accessed through the processor's physical-memory address space.
- When using memory-mapped I/O, caching of the address space mapped for I/O operations must be prevented using the memory type range registers (MTRRs).
- I/O privilege level: the IOPL field in the EFLAGS register controls access to the I/O address space by restricting the use of selected instructions.
- The IN and OUT instructions can be executed only if the current privilege level (CPL) of the currently executing program or task is numerically less than or equal to the IOPL.
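On the software side, memory-mapped registers are usually accessed in C through `volatile` pointers. The helpers below are a minimal sketch with hypothetical names; `volatile` only stops the compiler from caching or reordering away the accesses, and it does not replace the MTRR configuration mentioned above.

```c
#include <stdint.h>

/* volatile makes the compiler perform every access exactly as written. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t value) {
    *reg = value;
}

static inline uint32_t mmio_read32(const volatile uint32_t *reg) {
    return *reg;
}

/* Read-modify-write that sets the given bits in a device register. */
static inline void mmio_set_bits32(volatile uint32_t *reg, uint32_t mask) {
    mmio_write32(reg, mmio_read32(reg) | mask);
}
```

On real hardware the pointer would come from the platform's memory map; for experimentation the helpers can be pointed at an ordinary variable standing in for a register.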

Processor Identification
When the CPUID instruction is executed, selected information is returned in the EAX, EBX, ECX, and EDX registers.
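As a hedged sketch of how this is used for feature detection, the code below decodes a few feature flags from CPUID leaf 1. The struct and function names are hypothetical. The query helper relies on `__get_cpuid` from the GCC/Clang header `<cpuid.h>` and simply reports failure on other compilers or architectures.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_CPUID 1
#else
#define HAVE_CPUID 0
#endif

typedef struct {
    bool fpu, mmx, sse, sse2, sse3, ssse3;
} CpuFeatures;

/* Decode CPUID leaf 1 feature flags from the returned ECX and EDX values. */
CpuFeatures decode_features(uint32_t ecx, uint32_t edx) {
    CpuFeatures f;
    f.fpu   = (edx >> 0)  & 1u;   /* EDX bit 0:  x87 FPU on chip */
    f.mmx   = (edx >> 23) & 1u;   /* EDX bit 23: MMX */
    f.sse   = (edx >> 25) & 1u;   /* EDX bit 25: SSE */
    f.sse2  = (edx >> 26) & 1u;   /* EDX bit 26: SSE2 */
    f.sse3  = (ecx >> 0)  & 1u;   /* ECX bit 0:  SSE3 */
    f.ssse3 = (ecx >> 9)  & 1u;   /* ECX bit 9:  SSSE3 */
    return f;
}

/* Query the running processor; returns false where CPUID is unavailable. */
bool query_features(CpuFeatures *out) {
#if HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    *out = decode_features(ecx, edx);
    return true;
#else
    (void)out;
    return false;
#endif
}
```

Keeping the decoding separate from the query makes the bit layout testable on any host, including a cross-development machine that is not the embedded target.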

Privilege levels
- The segment-protection mechanism recognizes 4 privilege levels; greater numbers mean lesser privileges.
- The processor checks:
  - Current privilege level (CPL): privilege level of the currently executing program or task
  - Descriptor privilege level (DPL): privilege level of a segment or gate
  - Requested privilege level (RPL): override privilege level

Gate descriptors
Provide controlled access to code segments with different privilege levels:
- Call gates
- Trap gates
- Interrupt gates
- Task gates

Call gate descriptor

Call gate mechanism

CALLs

Stack Structure

Calls to Other Privilege Levels


- Modules in higher-privilege segments are accessed through a tightly controlled and protected interface called a gate.
- The segment selector provided in the CALL references a data structure called a call gate descriptor.
- The call gate descriptor provides the following:
  - Access rights information
  - The segment selector for the code segment of the called procedure
  - An offset into the code segment
- The processor switches to a new stack to execute the called procedure.
- The segment selectors and stack pointers for the privilege level 2, 1, and 0 stacks are stored in a system segment called the task state segment (TSS).

CALL and RET Operation Between Privilege Levels


On a CALL through a call gate to a more privileged level, the processor:
1. Performs an access rights check (privilege check).
2. Temporarily saves (internally) the SS, ESP, CS, and EIP registers.
3. Loads the segment selector and stack pointer for the new stack from the TSS and switches to the new stack.
4. Pushes the temporarily saved SS and ESP values for the calling procedure's stack onto the new stack.
5. Copies the parameters from the calling procedure's stack to the new stack. A value in the call gate descriptor determines how many parameters to copy.
6. Pushes the temporarily saved CS and EIP values for the calling procedure onto the new stack.
7. Loads the segment selector for the new code segment and the new instruction pointer from the call gate into the CS and EIP registers, respectively.
8. Begins execution of the called procedure at the new privilege level.
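The stack-related steps can be pictured with a toy C model. Everything here is illustrative: the `Stack` and `CallerState` types and `gate_call` are hypothetical, the stack is a small array that grows downward, the privilege check and the CS/EIP load are left out, and `params[0]` is assumed to be the parameter that ends up at the highest address.

```c
#include <stdint.h>

#define STACK_DWORDS 32

/* Toy downward-growing stack: sp indexes the most recently pushed dword. */
typedef struct {
    uint32_t mem[STACK_DWORDS];
    int sp;
} Stack;

typedef struct {
    uint32_t ss, esp, cs, eip;   /* caller state saved by the processor */
} CallerState;

static void push(Stack *s, uint32_t value) {
    s->mem[--s->sp] = value;
}

/* Builds the new (more privileged) stack in the order described above. */
void gate_call(Stack *new_stack, CallerState caller,
               const uint32_t *params, int param_count) {
    new_stack->sp = STACK_DWORDS;            /* stack pointer taken from the TSS */
    push(new_stack, caller.ss);              /* saved SS of the caller */
    push(new_stack, caller.esp);             /* saved ESP of the caller */
    for (int i = 0; i < param_count; ++i)    /* count comes from the call gate */
        push(new_stack, params[i]);
    push(new_stack, caller.cs);              /* return CS */
    push(new_stack, caller.eip);             /* return EIP, now on top */
}
```

After the call, the top of the new stack holds the return EIP and CS, followed by the copied parameters and then the caller's ESP and SS, which is what a far RET to the outer level consumes in reverse.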

Interrupts & Exceptions


- Interrupt: an asynchronous event triggered by an I/O device.
- Exception: a synchronous event generated by the processor.
- IA-32 defines 18 predefined interrupts and exceptions and 224 user-defined interrupts.
- The processor executes an implicit call to a handler procedure or to a handler task, similar to a procedure call to another protection level.
- The interrupt vector references an interrupt gate or a trap gate.
- Interrupt and exception handler routines can also be executed in a separate task through task gates.
- Sources:
  - External interrupts
  - Software-generated interrupts

Procedure calls to block structured languages


- ENTER (enter procedure) and LEAVE (leave procedure) instructions:
  - Simplify procedure entry and exit in compiler-generated code
  - Create and release stack frames for called procedures
  - Allow scope rules to be implemented
- ENTER instruction:
  - Creates a stack frame compatible with the scope rules
  - Has two operands: the number of bytes to be reserved on the stack for dynamic storage, and the lexical nesting level
  - Ex: ENTER 2048,3
- LEAVE instruction:
  - Reverses the action of the previous ENTER instruction
  - Releases all stack space allocated to the procedure

IA32 exceptions & interrupt sources

Exceptions & interrupt sources

Interrupt Descriptor Table (IDT)


- Associates each exception or interrupt vector with a gate descriptor
- Array of 8-byte descriptors
- The IDT may reside anywhere in the linear address space
- The processor locates the IDT using the IDTR register
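A minimal C sketch of one such 8-byte descriptor follows, using the 32-bit protected-mode interrupt gate layout. The type and function names are hypothetical, and the sketch only packs and unpacks fields; it does not load an IDT.

```c
#include <stdint.h>

/* 8-byte interrupt or trap gate descriptor as stored in the IDT. */
typedef struct {
    uint16_t offset_low;    /* handler offset, bits 0-15 */
    uint16_t selector;      /* code segment selector of the handler */
    uint8_t  reserved;      /* must be zero */
    uint8_t  type_attr;     /* present bit, DPL, and gate type */
    uint16_t offset_high;   /* handler offset, bits 16-31 */
} IdtGate;

#define GATE_INTERRUPT32 0x8Eu   /* present, DPL 0, 32-bit interrupt gate */

IdtGate make_gate(uint32_t handler, uint16_t selector, uint8_t type_attr) {
    IdtGate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = type_attr;
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}

/* Reassemble the 32-bit handler offset from its two halves. */
uint32_t gate_handler(const IdtGate *g) {
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* Byte offset of a vector's descriptor from the IDT base held in IDTR. */
uint32_t idt_entry_offset(uint8_t vector) {
    return (uint32_t)vector * 8u;
}
```

For example, the descriptor for vector 14 sits 112 bytes past the IDT base, and its handler offset is split across the first and last two bytes of the entry.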

Interrupt procedure call

Interrupt latency

Task Management
- A task is a unit of work that a processor can dispatch, execute, and suspend.
- The task execution space consists of a code segment, a stack segment, and one or more data segments.
- Task-state segment (TSS):
  - Specifies the segments that make up the task execution space
  - Provides a storage place for task state information
- Task register (TR): when a task is loaded into the processor for execution, the segment selector and descriptor attributes of its TSS are loaded into the TR.

Task State
- The task's current execution space
- The general-purpose registers
- The EFLAGS register
- The EIP register
- Control register CR3
- The task register
- The LDTR register
- The I/O map base address and I/O map (contained in the TSS)
- Stack pointers to the privilege 0, 1, and 2 stacks (contained in the TSS)
- Link to the previously executed task (contained in the TSS)

Executing a Task
A task is dispatched in one of the following ways:
- An explicit call to a task with the CALL instruction
- An explicit jump to a task with the JMP instruction
- An implicit call (by the processor) to an interrupt-handler task
- An implicit call to an exception-handler task
- A return (initiated with an IRET instruction) when the NT flag in the EFLAGS register is set

A task switch then occurs between the currently running task and the dispatched task:
- The execution environment of the currently executing task is saved in its TSS.
- Execution of that task is suspended.
- Tasks are not recursive.

Task management data structures


- Task-state segment (TSS): processor state information stored in a system segment
- Task-gate descriptor: provides an indirect, protected reference to a task
- TSS descriptor: the TSS, like all other segments, is defined by a segment descriptor
- Task register: holds the 16-bit segment selector and the entire segment descriptor for the TSS of the current task
- NT flag in the EFLAGS register

Task Linking
- Returns execution to the previous task
- Uses the previous task link field of the TSS
- Uses the NT flag in the EFLAGS register

IDT Gate Descriptors


The IDT may contain:
- Task-gate descriptors
- Interrupt-gate descriptors
- Trap-gate descriptors

These are very similar to call gates.

Interrupt Procedure Call

IDT

Interrupt tasks
The interrupt handler is accessed through a task gate.
- Advantages:
  - The entire context of the interrupted program or task is saved automatically.
  - A new TSS permits the handler to use a new privilege level 0 stack.
  - The handler can be further isolated from other tasks by giving it a separate address space.
- Disadvantages:
  - Higher interrupt latency
  - IA-32 architecture tasks are not re-entrant, so interrupts must be disabled in the handler.

System-Level Registers and Data Structures

Memory management
Memory management provides:
- Address translation: per-process translation of linear (virtual) addresses to physical addresses.
- Protection: privilege checking and read/write protection of memory.
- Cache control: different memory regions require different cacheability attributes.

Translation structures:
- Page directory entry: contains the page present indicator and the base address of a page table.
- Page table entry: contains the physical address of the 4-KB page.
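For 32-bit paging with 4-KB pages, the linear address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit page offset. The sketch below shows that split in C; the type and function names are hypothetical, and the page table entry is passed in as a plain value rather than read from memory.

```c
#include <stdint.h>

typedef struct {
    uint32_t dir_index;     /* bits 31-22: selects the page directory entry */
    uint32_t table_index;   /* bits 21-12: selects the page table entry */
    uint32_t page_offset;   /* bits 11-0:  byte offset within the 4-KB page */
} LinearParts;

LinearParts split_linear(uint32_t linear) {
    LinearParts p;
    p.dir_index   = linear >> 22;
    p.table_index = (linear >> 12) & 0x3FFu;
    p.page_offset = linear & 0xFFFu;
    return p;
}

/* Physical address = 4-KB aligned page base from the PTE + page offset. */
uint32_t physical_address(uint32_t pte, uint32_t linear) {
    return (pte & 0xFFFFF000u) | (linear & 0xFFFu);
}
```

For the linear address 0x12345678 this gives directory index 0x48, table index 0x345, and page offset 0x678.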

Linear address translation

Segmentation and paging

Directory descriptor

Bit     Name      Description
0       P         Present; 1 if the page descriptor is present and valid
1       R/W       Read/write; if 0, writes may not be allowed to the 4-MB region controlled by this entry
2       U/S       User/supervisor; if 0, accesses with CPL 3 are not allowed to the 4-MB region controlled by this entry
3       PWT       Page-level write-through; indirectly determines the memory type used to access the page table referenced by this entry
4       PCD       Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this entry
5       A         Accessed; indicates whether this entry has been used for linear-address translation
6       D         Ignored
7       PS        If CR4.PSE = 1, must be zero
8-11    Ignored   Should be zero
12-31   Addr      Physical address of the 4-KB aligned page table referenced by this entry
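The directory descriptor fields can be decoded with simple masks. The macro and function names below are hypothetical, and the entry is treated as a plain 32-bit value for a page directory entry that references a page table.

```c
#include <stdbool.h>
#include <stdint.h>

#define PDE_P         (1u << 0)     /* present */
#define PDE_RW        (1u << 1)     /* read/write */
#define PDE_US        (1u << 2)     /* user/supervisor */
#define PDE_PWT       (1u << 3)     /* page-level write-through */
#define PDE_PCD       (1u << 4)     /* page-level cache disable */
#define PDE_A         (1u << 5)     /* accessed */
#define PDE_PS        (1u << 7)     /* page size */
#define PDE_ADDR_MASK 0xFFFFF000u   /* bits 12-31: page table base */

bool pde_present(uint32_t pde) {
    return (pde & PDE_P) != 0;
}

/* True when this entry, taken alone, permits user-mode (CPL 3) writes. */
bool pde_allows_user_write(uint32_t pde) {
    const uint32_t need = PDE_P | PDE_RW | PDE_US;
    return (pde & need) == need;
}

/* Physical base address of the 4-KB aligned page table. */
uint32_t pde_table_base(uint32_t pde) {
    return pde & PDE_ADDR_MASK;
}
```

An entry value of 0x00123007 decodes as present, writable, and user-accessible, with its page table at physical address 0x00123000. The page table entry applies its own checks on top of these.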

Page table entry

MMU-Additional descriptors
- Write protection: a page fault is raised if an attempt is made to write to a protected page.
- Privilege: privilege levels can be set on pages (kernel space versus user space).
- Accessed: the accessed bit can be used to identify the age of a page table entry.
- Dirty: set when the page is written; used by the page-swapping algorithm.
- Modes of the MMU:
  - 32 bit
  - Physical address extension (PAE)
  - 64 bit
- Nominal page sizes: 4 KB, 2 MB, and 4 MB

Translation Caching
- Walking the translation tables for every single memory transaction would be very costly, so the translations are cached.
- Translation look-aside buffers (TLBs):
  - A TLB is constructed as a highly associative cache.
  - The virtual address is compared against all cache entries.
  - The TLB entry is used for the translation when a hit occurs.
- TLB structures for Atom:
  - Instruction, 4-KB pages: 32 entries, fully associative.
  - Instruction, large pages: 8 entries, four-way set associative.
  - Data, 4-KB pages: 16-entry-per-thread micro-TLB, fully associative; 64-entry DTLB, four-way set associative; 16-entry page directory entry cache, fully associative.
- Different processes have different translations, so the cached entries must be flushed on a process switch. Letting different processes live in the same linear address space avoids the flush, at the cost of security.

Cache in Atom
- Hierarchy:
  - 32-KB eight-way set associative L1 instruction cache
  - 24-KB six-way set associative L1 data cache
  - 512-KB eight-way set associative unified instruction and data L2 cache
- The cache line size is 64 bytes.
- Cache allocation policy:
  - Read-only allocate
  - Allocation on a write transaction
- Cache coherency: MESI
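The number of sets in each cache follows from size / (ways x line size). The C sketch below works through that arithmetic; the function names are hypothetical, and the set-index helper assumes the simple modulo mapping of line number to set.

```c
#include <stdint.h>

/* sets = cache size / (associativity * line size) */
uint32_t cache_sets(uint32_t size_bytes, uint32_t ways, uint32_t line_bytes) {
    return size_bytes / (ways * line_bytes);
}

/* Set selected by an address, assuming line number modulo set count. */
uint32_t cache_set_index(uint32_t addr, uint32_t sets, uint32_t line_bytes) {
    return (addr / line_bytes) % sets;
}
```

With 64-byte lines, the 24-KB six-way data cache and the 32-KB eight-way instruction cache both have 64 sets, and the 512-KB eight-way L2 cache has 1024 sets. The odd 24-KB size therefore comes from the six ways, not from an unusual set count.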

Six-Way Set Associative 24-K Data Cache

Cacheability
- Strong Uncacheable (UC): system memory locations are not cached.
- Uncacheable (UC-): like UC, but can be overridden by programming the MTRRs for the write-combining memory type.
- Write Combining (WC): system memory locations are not cached and coherency is not enforced. Writes may be delayed and combined in the write-combining buffer.
- Write-Through (WT): all writes are written to a cache line and through to system memory.
- Write-Back (WB): writes go to the cache, and the write to system memory is delayed.
- Write Protected (WP): writes are propagated to the system bus and cause the corresponding cache lines on all processors on the bus to be invalidated.

Micro-architecture
- Architecture: the contract between the platform and the software.
- Micro-architecture: a specific implementation that complies with the architecture.
  - Tuned to fulfill specific optimizations such as core speed and power
  - Necessary to know when tuning system performance

Atom Micro-architecture
- In-order, two-wide superscalar pipeline
- Block diagram abbreviations:
  - MS: micro sequencer
  - TLB: translation look-aside buffer (translates virtual to physical addresses)
  - ILD: instruction length decoder
  - AGU: address generation unit
  - BIU: bus interface unit

In a nutshell
- Power-efficient performance
- Single-micro-op instruction execution from decode to retirement
- Sixteen-stage, in-order pipeline
- Dual pipelines to enable decode, issue, execution, and retirement of two instructions per cycle
- 512-KB second-level cache with 8-way associativity
- Efficient hardware prefetchers to L1 and L2 (speculative loading)
- Two issue ports for dispatching SIMD instructions to execution units
- Single-cycle throughput for most 128-bit integer SIMD instructions
- Up to six floating-point operations per cycle
- Up to two 128-bit SIMD integer operations per cycle

Atom pipeline

References
- Modern Embedded Computing, Chap. 4
- Overview of Bluetooth Technology, Hongfeng Wang, Penn State
- More on USB can be found at www.usb.org
