
Christopher A. Wood caw4567@rit.edu

Code architecture and design
High-level source code changes
Compiler settings
Assembly tweaks

1. Measure performance
   Dynamic program analysis using a software profiler

2. Identify hotspots
   Portions of the code that consume the most CPU cycles and computation time

3. Identify cause of hotspots
   I/O overhead, inefficient algorithm, poor design?

4. Change the program
   Source code tweaks or design changes?
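Where a full profiler is not available, the measuring step can be approximated by timing a suspected hotspot directly. The following is a minimal sketch in C using the standard `clock()` function; `busy_work` and `time_call` are hypothetical names introduced only for illustration, and the loop count is an arbitrary assumption.

```c
#include <assert.h>
#include <time.h>

/* Volatile sink so the workload's result stays observable. */
static volatile unsigned long sink;

/* Hypothetical workload standing in for a suspected hotspot. */
void busy_work(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000UL; i++) {
        acc += i * i;
    }
    sink = acc;
}

/* Returns the average CPU time, in seconds, of one call to fn,
 * measured over `repeats` back-to-back calls. */
double time_call(void (*fn)(void), int repeats)
{
    assert(repeats > 0);
    clock_t start = clock();
    for (int i = 0; i < repeats; i++) {
        fn();
    }
    clock_t end = clock();
    return ((double)(end - start) / CLOCKS_PER_SEC) / repeats;
}
```

Calling `time_call(busy_work, 1000)` before and after a change gives a rough comparison. The resolution of `clock()` is platform dependent, so short functions need a large repeat count, and a real profiler remains the better tool for finding hotspots that were not already suspected.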

-Donald Knuth

Design changes tend to have the biggest impact on code performance
Analysis of the code architecture is the best starting point
Mathematical analysis
Understanding technological considerations
Parallelism

Change the scope of analysis (module- and global-based)

Data bandwidth performance
Keep data in devices that can be accessed faster

Arithmetic operation performance
Know your order of operations and the performance of mathematical functions
Think at the bit-level

Control flow
Software control flow structures (e.g. indirect function calls, switch statements, branches) perform differently
Be conscious of processor pipeline predictions

Memory usage
Especially important with embedded devices
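As one illustration of thinking at the bit level, the sketch below counts the set bits in a 32-bit word in two ways. The function names are hypothetical. The second version uses the well-known trick that `x & (x - 1)` clears the lowest set bit, so the loop runs once per set bit instead of once per bit position. Whether it is actually faster depends on the input data and the target CPU, so it should be measured.

```c
#include <stdint.h>

/* Straightforward version: inspects all 32 bit positions. */
unsigned count_bits_naive(uint32_t x)
{
    unsigned n = 0;
    for (int i = 0; i < 32; i++) {
        n += (x >> i) & 1u;
    }
    return n;
}

/* Bit-level version: x & (x - 1) clears the lowest set bit,
 * so the loop body executes once per set bit. */
unsigned count_bits_clear_lowest(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}
```

The second loop also has a data-dependent trip count, which ties back to the control flow point: a loop whose exit branch is hard to predict can cost more than its instruction count suggests.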

High-performance, dual-issue, superscalar 32-bit RISC CPU
Seven-stage, highly pipelined microarchitecture
Dual instruction fetch, decode, and out-of-order issue
Separate instruction and data cache arrays
Memory Management Unit (MMU) with separate instruction and data shadow TLBs

Soft processor core designed specifically for Xilinx FPGAs
Implemented using general-purpose memory and logic fabric of the FPGA
Versatile interconnect system to support embedded applications connected to the PLB, its primary I/O bus
User-configured memory aspects (cache size, pipeline depth, embedded peripherals, MMU, etc.)
Capable of hosting operating systems that require hardware support (e.g. page tables and address space protection in Linux)

Is it an option on the target platform?
Can portions of your algorithm be performed in parallel?
E.g. if your algorithm operates on bytes, you may be able to operate on 2, 4, or 8 of them simultaneously using word-based instructions provided by the CPU
Can other hardware components perform computations in parallel with the processor?
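The byte example can be sketched as follows: XOR-ing a buffer with a one-byte key, first one byte at a time and then four bytes at a time with 32-bit word operations. The function names and the XOR task itself are assumptions chosen for illustration. `memcpy` is used for the word loads and stores so the code does not depend on buffer alignment, and because the key is replicated into every byte of the word, the result does not depend on byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one byte per iteration. */
void xor_bytes(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= key;
    }
}

/* Word-based: four bytes per iteration, then a byte-wise tail. */
void xor_words(uint8_t *buf, size_t len, uint8_t key)
{
    /* Replicate the key into all four bytes of a 32-bit word. */
    uint32_t key4 = 0x01010101u * key;
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, buf + i, sizeof w);   /* alignment-safe load */
        w ^= key4;
        memcpy(buf + i, &w, sizeof w);   /* alignment-safe store */
    }
    for (; i < len; i++) {               /* remaining 0 to 3 bytes */
        buf[i] ^= key;
    }
}
```

Both functions produce identical buffers. An optimizing compiler may already vectorize the byte loop, so the hand-written version matters most on targets or compiler settings where that does not happen.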

Look at the software from both a source code and design perspective
Analyze the flow of data in your algorithm
High-level API usage
Code size!

Common misconceptions:
Improved hardware makes software optimization unimportant
Using tables always beats recalculating
Using C compilers makes it impossible to optimize code for performance
Globals are faster than locals
Using smaller data types is faster than larger ones

Powers of 2
Optimize loop overhead
Loop manipulation (rolling/unrolling/jamming)
Declare local functions as static
Pass by value and pass by reference
Unsigned vs. signed
Leverage early termination of if statements
Register usage (global variables aren't placed there)
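Two of the techniques above, powers of 2 and loop unrolling, can be sketched in C as follows. All names and the ring size are hypothetical. The first pair wraps an index into a buffer whose size is a power of two, where a bitwise AND can replace the modulo. The second pair sums an array with and without manual unrolling by four.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16u /* hypothetical size; must be a power of two */

/* Index wrap using the modulo operator. */
unsigned wrap_mod(unsigned i)
{
    return i % RING_SIZE;
}

/* Same result using a mask; valid only because RING_SIZE is a power of two. */
unsigned wrap_mask(unsigned i)
{
    return i & (RING_SIZE - 1u);
}

/* Rolled loop: one add and one loop test per element. */
uint32_t sum_rolled(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Unrolled by four: one loop test per four elements, plus a tail loop. */
uint32_t sum_unrolled4(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    }
    for (; i < n; i++) {
        s += v[i];
    }
    return s;
}
```

Many compilers perform both transformations themselves at higher optimization levels, particularly the mask substitution for unsigned operands and constant divisors, so the manual forms are worth keeping only where measurement shows a gain. Unrolling also increases code size, which matters on embedded targets.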

http://www.azillionmonkeys.com/qed/optimize.html
http://www.cs.ucsb.edu/~nagy/docs/MAEMostafa.pdf
http://www.codeproject.com/KB/cpp/C___Code_Optimization.aspx
http://developer.amd.com/documentation/articles/pages/6212004126.aspx
https://www01.ibm.com/chips/techlib/techlib.nsf/techdocs/2D417029AE3F3089872570F8006D4E99/$file/PowerPC440x6_um_29Sept10_pub.pdf
