
Efficient coding for ARM platforms


Chris Shore, ARM - December 4, 2012

I am sure many of you will be familiar with Donald Knuth’s oft-quoted sentiment:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

How many of you have seen the second sentence?

“Yet we should not pass up our opportunities in that critical 3%.”

Both parts are important. We all realize that optimization is a valuable activity but it only reaches maximum payback for minimum effort when we
apply the available effort carefully in exactly the right place.

The business of optimization


When writing, modifying, testing, debugging and finally optimizing code, the coding standard rules in the vast majority of cases. However, any
sensible coding standard will contain sufficient loopholes to allow you to choose performance over, say, readability in critical cases. Identifying
those critical cases is where we must first spend our time.

Never forget the 90/10 rule, which says that 90% of execution time is spent in 10% of the code. Before you start looking over the code with your
optimization spectacles on, you need to spend significant time identifying that 10%. Time spent here is truly well spent. Profilers and other tools are
invaluable here and can help pinpoint the trouble spots very quickly. But, as Rob Pike says:

“Bottlenecks occur in surprising places, so don’t try to second guess and put in a speed hack until you’ve proven that’s where the
bottleneck is.”

A many-dimensional problem
Coding is an activity which operates within an interlinked set of constraints.

The items around the outside (robustness, performance, security, etc.) are the issues we have to care about in our program, while the items
surrounding the code are the constraints within which we must operate. We must work within a given language, on the platform we have been
provided with by our esteemed hardware-designing colleagues, using prescribed tools, and so on.

The rest of this paper concentrates on the four constraints in the diagram: Language, Hardware, Tools and Platform.

Language
Remember that a short program (in terms of lines of code) is not necessarily a faster one. Writing complex expressions on a single line is no faster
than a series of sub-expressions using temporary variables. In some cases it may even be slower. But, in almost all cases, it will be significantly less
readable and maintainable. Always favor readability over conciseness.

Ambiguity and Flexibility


When trying to be clever, remember that the language has limitations and ambiguities, some of which are deliberate. The behavior of expressions
involving variables of char type, for instance, depends on whether your tools (sometimes also the ABI you are using) specify that they are signed or
unsigned. Be careful and be aware of the rules which apply in your case.
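As a minimal illustration (which branch runs depends entirely on your toolchain):

    #include <stdio.h>

    int main(void)
    {
        char c = 0x80;   /* implementation-defined: typically -128 if
                            char is signed, 128 if it is unsigned */
        if (c < 0)
            printf("char is signed on this toolchain\n");
        else
            printf("char is unsigned on this toolchain\n");
        return 0;
    }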

A more subtle case is the behavior when shifting a variable by a number of bits larger than its size. The result depends on a combination of the
compiler and the underlying hardware. Intel systems will generally shift by zero bits (leaving the value unchanged) while ARM systems will shift the
value right out of the storage item (leaving a zero result). The C language does not define the behavior; your environment does. The compiler will often
warn about this but its ability to do so is limited when the shift distance is not a compile-time constant.
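A minimal sketch of the hazard (the names are illustrative; the printed result is platform-dependent):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t x = 1;
        volatile uint32_t n = 32;   /* run-time value: no compiler warning */
        /* Undefined behavior in C when n >= 32: x86 typically masks the
           shift count and prints 1, while ARM uses the bottom byte of the
           register and prints 0. */
        printf("%u\n", (unsigned)(x << n));
        return 0;
    }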

The C language is gloriously, wonderfully flexible. Let’s face it – this is the reason most of us love it so! But its very flexibility leads to some
Achilles heels. Most notable is the concept of “pointer aliasing”, in which the compiler must assume that any pointer may address any data item
whose address is known. This restricts the freedom of the compiler to optimize effectively in many cases. You, as the programmer, can use your
meta-knowledge about what the program is trying to do to make the compiler’s job easier.
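Consider, for example, a function of this general shape (a representative sketch; the function and variable names are illustrative):

    void accumulate(int *dest, int *src)
    {
        *dest += *src;   /* the compiler must assume dest may alias src... */
        *dest += *src;   /* ...so *src has to be reloaded from memory here */
    }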

Writing this instead (again a sketch, using a temporary variable)…
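    void accumulate(int *dest, int *src)
    {
        int tmp = *src;  /* we know, as the compiler cannot, that dest
                            never aliases src, so load the input once */
        *dest += tmp;
        *dest += tmp;
    }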

…may not look so nice but makes it explicit that each of the input values needs to be loaded only once during the sequence. Many compilers support the
C99 “restrict” keyword, which allows the programmer to signal to the compiler that pointers do not reference other items. The use of temporary
variables in cases like this, though, is more portable and may, therefore, be preferable.


Help the compiler


Make intelligent use of “const” as well to indicate to the compiler that certain items are non-volatile. In embedded systems, this is crucial as sections
containing “static const” items can be placed in non-volatile memory. Forgetting to declare such items as “const” in your code may confuse both the
compiler and future readers and maintainers of your program. Within functions, marking data which you do not intend to modify as “const” assists
the compiler in its optimization task as it does not have to work out what your assumptions are – if it can!
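A short sketch of both uses (the table contents and function are illustrative):

    #include <stdint.h>

    /* "static const" data can be placed in a read-only section and
       hence in non-volatile (flash/ROM) memory. */
    static const uint16_t scale_table[4] = { 10, 100, 1000, 10000 };

    /* The const on the parameter promises that the function only
       reads through the pointer, which helps the optimizer. */
    uint32_t scale_sum(const uint16_t *data, int n)
    {
        uint32_t sum = 0;
        for (int i = 0; i < n; i++)
            sum += (uint32_t)data[i] * scale_table[i & 3];
        return sum;
    }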

Hardware
Turing machines are wonderful beasts. They may not be particularly capable but they are of infinite extent – principally in that they have infinite
storage. We are not so lucky in the real world – in many cases we can treat memory as essentially infinite but we must acknowledge that only a
relatively small part of it will be fast.

Turing machines also carry out their limited range of operations at a constant pace: everything takes the same amount of time. In real life, not all
operations take the same length of time.


Instruction set
ARM processors, unusually, support a range of instruction sets:

ARM – The original ARM instruction set, in which all instructions are 32-bit.
Thumb – In earlier cores (ARMv4T and ARMv5) all Thumb instructions were 16-bit, providing improved code density at the expense of some
loss in performance. Later processors (ARMv6T2 and ARMv7 onwards) use Thumb-2 technology to add 32-bit instructions, making a
complete instruction set which gives an excellent compromise between code density and performance.
NEON – A wide SIMD instruction set optionally supported on ARMv7-A processors. It is an excellent target for DSP and multimedia
algorithms.
VFP – ARM’s Vector Floating Point instruction set exists in several incarnations, sometimes integrated with NEON and sometimes on its own.

Make sure you are aware of the instruction sets available on the processor in your system and choose which to target for particular parts of your
application. In modern ARM systems supporting Thumb-2, Thumb is the instruction set of choice for the vast majority of code. ARM is often chosen
for hand-crafted assembly code and when compiling high-performance code sections. NEON is chosen for particular algorithms which benefit from
its SIMD vector-processing capability.

In systems which do not support Thumb-2, use ARM for performance and Thumb for code density. In such systems, it is common to compile
significant parts of a program in different instruction sets and combine them at link time into a single body of code.

Microarchitecture
Those of you who have been using ARM for a while will be familiar with the evolution of the pipelines over the generations. We have moved from a
simple three-stage pipeline in the ARM7 to a much more complex variable-length pipeline in the more recent Cortex-A9. The processors in between
have had varying lengths and structures. Since all have been in-order execution units, it has historically been crucial to optimize for the pipeline
structure when aiming to maximize instruction throughput. In general, the compiler takes care of this, provided that it is configured correctly for the
target processor. When coding in assembler, though, it is the programmer’s job to order instructions appropriately by hand. This has been one
area in which it has been possible for programmers to outdo the compiler.

In modern ARM cores (Cortex-A9 onwards) more advanced execution units make use of techniques such as out-of-order completion and register
renaming to greatly reduce the effects of the pipeline on throughput. This makes this kind of optimization much less important. On these processors,
optimized C/C++ code is generally a much better choice than assembler.

Working with branch prediction


All ARM processors since ARM10 have made use of branch prediction techniques to improve performance. The precise techniques employed vary
from processor to processor and include static, statistical and dynamic prediction, sometimes backed with return stacks, branch target caches and
branch target buffers. Generally, branch prediction is one of those things which “just works” – you turn it on and your code runs faster. However,
there are some things which either don’t predict or don’t predict well.

In the case of successful prediction, the execution time of a branch instruction can be reduced to four cycles (static prediction), one cycle (dynamic
prediction) and sometimes to zero cycles (branch folding). The cost of a mis-predict is dependent on the precise pipeline structure but will be at least
7 cycles.

Branches which are not PC-relative are inherently difficult or impossible to predict. Since the target address is unknown until the instruction reaches
the execute stage of the pipeline, the processor has no time in which to start fetching ahead from the destination. Some cores (from ARM11 onwards)
incorporate a return stack which allows them to predict return instructions as a special case but, in general, processors are unable to predict this kind of branch.

Also, branches which execute immediately after another branch are not predicted; of a pair of branches appearing in the same fetch slot in
memory, one will not be predicted.


Branch prediction works well for “standard” loops e.g. “for” or “while” loops in which the conditional branch is at the bottom of the loop. In these
cases, prediction will default to predicting that such a backwards branch is taken and will successfully predict on every iteration except the last. “if”
statements, on the other hand, incorporate forward branches (the branch to the “else” clause) and these will be predicted as “not taken”. This leads to
the simple advice that the less commonly executed clause should be placed in the “else” part of the construct. This might involve reversing the sense
of the test in the “if” and is one case where the programmer might sacrifice readability for the sake of performance.
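For example (a minimal sketch; the helper functions are illustrative and assumed to be defined elsewhere):

    void handle_error(int status);   /* rare path, assumed elsewhere   */
    void do_work(void);              /* common path, assumed elsewhere */

    void process(int status)
    {
        /* The common case sits in the "if" clause, so the forward
           branch to the "else" clause is correctly predicted
           not-taken on almost every call. */
        if (status == 0)
            do_work();
        else
            handle_error(status);
    }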

The fact that non-PC-relative branches cannot be predicted means that long branches, run-time-assigned function pointers and jump tables (including
shared library and DLL function tables) will, in general, not be predicted at all. Simply being aware of this helps the programmer make informed
decisions about using these techniques or avoiding them in critical sections.

Note that the definition of a “long” branch is dependent on the instruction set in use:


ARM – >32MB
Thumb – >2KB
Thumb with Thumb-2 – >16MB

When coding in the ARM instruction set, remember that almost all instructions can be conditionally executed and sequences of conditional
instructions can often eliminate the need for short, forward branches altogether. While the original Thumb instruction set does not include this
feature, Thumb-2 adds the “If-Then” instruction which allows for conditional blocks.
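For instance, a compiler will typically map this C onto a compare followed by conditional instructions (or a Thumb-2 IT block) rather than a forward branch:

    int max(int a, int b)
    {
        /* Typically compiles to CMP plus a conditional move rather
           than a conditional branch. */
        return (a > b) ? a : b;
    }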

Division and modulo


ARM processors do not, in general, have division hardware. This applies to all processors prior to ARMv7 and to many ARMv7-A processors
(hardware integer divide appeared first in the R and M profiles, and in the A profile only with later cores such as the Cortex-A15; the Cortex-A8
and Cortex-A9 do not have it). When coding for such processors, division (and the related modulo operation) should simply be avoided. The
compiler treats division by a known compile-time constant, by a power of two, and by 10 as special cases. Otherwise, a run-time library routine
is used, with a cost of between 20 and 140 cycles for a 32-bit by 32-bit division.

As shown below, many uses of modulo can easily be replaced with a simple test-and-reset construct.
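A minimal sketch of the idea for a wrapping counter (the names are illustrative):

    /* Equivalent to minutes = (minutes + 1) % 60 for values below 60,
       but compiles to a compare and a conditional operation instead
       of a division. */
    unsigned next_minute(unsigned minutes)
    {
        if (++minutes >= 60)
            minutes = 0;
        return minutes;
    }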

This is one very clear demonstration of the adage that “short code is not always fast code!”

Memory systems
One simple rule applies here:

“Keep it close and access it as little as possible”

As you can see from the diagram, access times increase rapidly the further away from the core you go to access an item in memory. At the extremes,
registers are available without penalty and external memory may take several hundred cycles to return the requested data.

When coding real-time systems, you should also take into account the fundamental indeterminacy involved with accessing caches and virtual
memory. Cache misses and page faults can add significant overhead.

So, cache what you can in registers. You can help the compiler do this by keeping the number and scope of local variables as small as possible. This
minimizes the need to spill variables to the stack during a function. Likewise, restrict the number of parameters passed to a function to four or fewer
(that many can be passed in registers with any excess requiring stack accesses).

Avoid, also, taking the address of a local variable (for instance, to pass it by reference to a function). This forces the compiler to put it in memory (on
the stack) and to keep its value updated, reducing the potential for caching the value in a register.
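A small sketch of the difference (the helper functions are illustrative):

    void add_via_pointer(int *total, int x);   /* assumed elsewhere */
    int  add_by_value(int total, int x);       /* assumed elsewhere */

    int sum_slow(const int *data, int n)
    {
        int total = 0;
        for (int i = 0; i < n; i++)
            add_via_pointer(&total, data[i]);  /* &total forces total
                                                  onto the stack */
        return total;
    }

    int sum_fast(const int *data, int n)
    {
        int total = 0;                         /* can stay in a register */
        for (int i = 0; i < n; i++)
            total = add_by_value(total, data[i]);
        return total;
    }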

Data sizes
ARM processors are 32-bit beasts and handle 32-bit items well. While data items may be compressed for storage in memory (to save space), they
should be cast to word-sized variables for processing. Code which performs arithmetic or logical operations on sub-word data items will generally
involve extra instructions (with costs in execution time, code size and instruction cache utilization) to continually re-normalize results. Contrast
this by noting that extending and truncating values in registers when loading from memory or storing to memory is often a “free” part of the
behavior of the relevant load and store instructions.
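A sketch of the pattern (the array size and names are illustrative):

    #include <stdint.h>

    static int16_t samples[1024];   /* small in storage: good for the
                                       data cache */

    int32_t average(void)
    {
        int32_t sum = 0;            /* word-sized for processing */
        for (int32_t i = 0; i < 1024; i++)
            sum += samples[i];      /* sign-extension is a free part
                                       of the load (LDRSH) */
        return sum / 1024;          /* power-of-two divide: one of the
                                       compiler's cheap special cases */
    }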

In general, small variables in storage make for better data cache utilization, while word-sized variables in computation make for better instruction cache utilization.


Writing for caches and TLBs


The total address space of your application can be huge, potentially up to 4GB. Clearly, this is orders of magnitude greater than what can be held in the
cache at any one time, and also greater than the total area of memory which can be addressed via the 32 address translation descriptors which can be cached in a
typical TLB.

The most common L1 cache configuration is 2 x 16KB. It makes sense to design data structures with this in mind. Obviously, small code and small data
sets cache much better than large ones, and data which is tightly packed in memory will cache better than sparse data.

Faced with the need to work with data sets which are frequently much larger than 16KB, we need to code in a way which respects cache access
patterns. Cache-friendly algorithms such as matrix tiling can be very effective when dealing with large arrays. Likewise, zero-copy algorithms should
be used wherever possible. The ARM instruction set also provides cache preload instructions which can be used carefully to ensure that data items
are loaded into the cache ahead of time, reducing latency.
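As a sketch, here is a tiled matrix transpose (the matrix and tile sizes are illustrative; the tile is chosen so the working set of both arrays sits comfortably inside a 16KB data cache):

    #define N     1024
    #define BLOCK 32      /* one 32x32 tile of ints is 4KB */

    void transpose_tiled(int dst[N][N], int src[N][N])
    {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                /* Both arrays are walked within one cache-sized tile
                   before moving on, instead of striding across the
                   whole matrix on every row. */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        dst[j][i] = src[i][j];
    }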

A 32-entry TLB can cache translation information for between 128KB (32 x 4KB pages) and 32MB (32 x 1MB sections). Under a typical operating
system, reality will be somewhere between these two. Data which is spread thinly over a large region of memory can easily thrash the TLB, negating
much of its benefit. Repopulating a single TLB entry typically takes two accesses to page tables in external memory.

Coding in a TLB-friendly manner involves similar considerations to those mentioned above for caches. Data items which are used together should be
placed close together in memory so that, wherever possible, they are covered by a single TLB entry, or by a set of entries which persist for a
reasonable length of time before being evicted.

Constants
C code uses lots of constants! In many cases, the programmer has a high degree of choice of the precise values and ranges involved.

Remember that the ARM instruction set can typically encode only small constants without having to resort to literal pools. A standard ARM constant
is encoded as “8 bits rotated right by an even number of bits”. Apart from some instructions added in Thumb-2 which allow 16-bit constants and
certain special forms of repeated 8-bit values, this is what you have to work with. Anything beyond that will have to be loaded from a literal pool (a
pool of constants which the compiler embeds in the code stream) using a PC-relative load instruction. This involves extra load instructions in the
code, extra space in the code segment (which must then be placed in the data cache as well as the instruction cache) and potential cache misses.

It makes sense to choose values for constants (and we include enumerated types in this) which fit in the ARM immediate encoding.
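For example (the values are illustrative):

    /* 0x00AB0000 is 0xAB rotated right by 16 bits: it fits the ARM
       immediate encoding and costs a single instruction. */
    #define GOOD_MASK  0x00AB0000

    /* 0x00012345 spreads over more than 8 significant bits and cannot
       be encoded as a rotated 8-bit value: it costs a literal-pool
       load (or a MOVW/MOVT pair with Thumb-2). */
    #define BAD_MASK   0x00012345

    /* Enumerators kept in the range 0..255 always encode directly. */
    enum state { IDLE = 0, BUSY = 1, DONE = 2 };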

Data structures
Many programs have a need to deal with packed, or unaligned, data in memory. Byte-oriented network protocols are a good example of this.
Accessing these items involves memory accesses which are potentially unaligned and, if there is no hardware support (as there is in modern ARM
processors), these accesses must be synthesized using multiple byte or word loads. Even when the hardware does support unaligned addressing, there
is still a performance penalty due to the need to make multiple accesses at bus level when accesses cross the bus transaction boundary. Such accesses
may also cross cache line and memory page boundaries, involving much larger overhead.

Declaring and using packed or unaligned data is relatively easy as most modern compilers support some variant of the “packed” attribute which
allows the programmer to flag all access to a particular item (or through a pointer) as potentially unaligned. Indiscriminate use of this will severely
affect performance, though. If a large amount of processing is required on unaligned data, you should consider “unpacking” it into an alternative
structure representation in which all elements are aligned, carrying out the processing and then packing it up again for storage or onward
transmission.
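A sketch of the unpacking pattern, using the gcc/armcc-style packed attribute (the wire format is illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Byte-oriented wire format: no padding, so 'length' sits at
       offset 1 and is unaligned. */
    struct __attribute__((packed)) wire_header {
        uint8_t  type;
        uint32_t length;
    };

    /* Natural, aligned layout used for processing. */
    struct header {
        uint8_t  type;
        uint32_t length;
    };

    void unpack_header(struct header *out, const uint8_t *buf)
    {
        struct wire_header w;
        memcpy(&w, buf, sizeof w);   /* copy out of the unaligned buffer */
        out->type   = w.type;
        out->length = w.length;      /* all further accesses are aligned */
    }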

Linked lists, and similar data structures, are well-known for caching poorly as they are often distributed across wide regions of memory. Consider
techniques for reducing this such as: storing lists in “slabs” of multiple items, storing an end pointer as well as a start pointer, maintaining a count of
entries, pre-pending instead of appending, storing the link and index information separately from the payload etc.


Control structures
Cascaded “if…else if” statements can be very bad for cache utilization. Starting from the top, each one whose condition fails must nevertheless be
loaded into the instruction cache and executed. For each test, an entire cache line may have to be loaded just to execute the two or three instructions
which make up the test itself. Go down three or four levels and cache utilization quickly falls to 20% or worse.


As mentioned with branches, it makes sense to put the common cases near the top of the chain. Even though branch prediction will be less effective,
a switch statement may also be a better alternative if the values to be tested lend themselves to this.
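For example (a sketch; the opcodes are illustrative):

    enum op { OP_ADD, OP_SUB, OP_MUL };

    int eval(enum op op, int a, int b)
    {
        /* Dense case values let the compiler emit a compact jump
           table or compare tree instead of a long cascaded chain of
           if...else if tests. */
        switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        default:     return 0;
        }
    }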

Tools
Developers using ARM-powered systems use a huge variety of tools! The vast majority, however, support in one way or another features which can
be very useful in increasing performance.

Functions which do not use global data, have no side-effects and whose return value depends only on their parameters should be declared as “pure”.
This allows the compiler to deduce that two calls to such a function with identical input parameters will always return the same value and second or
subsequent calls can be eliminated.
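With gcc-style tools the declaration looks something like this (a sketch; armcc spells it __pure, and the exact syntax varies from compiler to compiler):

    /* No side-effects; the result depends only on the argument. */
    __attribute__((pure)) int lookup(int key);

    int twice(int key)
    {
        /* The compiler may evaluate lookup(key) once and reuse the
           result for the second call. */
        return lookup(key) + lookup(key);
    }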

The “restrict” keyword allows the programmer to indicate that pointers do not overlap with areas or items addressed by other pointers. The
possibilities for optimization, particularly in loops which process arrays, are much increased by this.
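A typical use (a minimal sketch):

    /* The restrict qualifiers promise that c does not overlap a or b,
       so the compiler is free to reorder and combine loads and stores
       across loop iterations. */
    void vec_mul(int n, float * restrict c,
                 const float * restrict a, const float * restrict b)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }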

Almost all compilers support some variation on the “packed” keyword mentioned earlier for indicating items which may not be naturally aligned, or
for declaring data structures without the padding necessary to align individual items. The usage varies greatly from one tool to another, though. Note
that gcc does not support pointers to unaligned memory at all, so portable code targeting Linux should not use unaligned pointers.

While on the subject of differences between compilers, it is worth reminding programmers to know and take account of differences in the sign of
“char” types. MSVC++ treats the “char” type as signed by default, RVCT as unsigned. Any code which relies on or makes assumptions about the
sign of single byte variables and the range of values which they can store is potentially dangerous and should be examined very carefully.

Platform
Moving from the processor and the tools into the realm of the “platform” broadens the scope of the discussion considerably, well beyond what a
paper like this can cover in any useful depth. I will touch briefly on just three areas.

Power-saving
Power efficiency is a defining feature of much of the embedded space. Especially for devices such as mobile phones and tablets, battery life
is a key attribute. While circuit and hardware design, as well as battery technology, have a large bearing on battery life, the software is hugely
important too.

In general, code which is efficient in terms of execution performance will also be power-efficient. Reducing the number of instructions executed
reduces power consumption. More important is reducing memory accesses. Cache accesses are not hugely expensive in terms of power but accesses
to external memory are! In many cases they will be an order of magnitude more expensive. So, maximizing cache utilization and reducing external
memory use will have a large effect on power consumption. Code which processes data, held either in cache or registers, uses significantly less
power than code which moves data from one place to another in memory.

In terms of the power consumption of the chip and the wider platform, everything becomes implementation-specific very quickly. Remember that the
power-saving mechanisms used are not part of the ARM architecture and individual chip designers are free to use proprietary techniques. There is no
substitute for being aware of the capabilities of the chip and the platform and ensuring that they are used to maximum effect. In most cases, all of this
functionality will be accessed via the operating system, so you also need to be familiar with how that works. Be very conscientious about signaling to
the operating system any opportunities for reducing execution speed or shutting down processing elements, using whatever facilities it provides.


Multicore
Partition tasks carefully and sensibly when working with either asymmetric or symmetric multi-processing systems.

Beware of the phenomenon of “false sharing” in shared caches. ARM multi-core clusters incorporate a Snoop Control Unit which automatically
maintains coherency between the individual L1 data caches. Among other things, this moves dirty lines between caches to avoid having to clean and
refill them when common data is accessed by different cores within the cluster. Although transparent to the programmer, it is not without penalty
in terms of bandwidth, time and energy. If items which are not actually shared are placed in memory in such a way that they lie in the same cache line,
the cache line will bounce back and forth between the caches as the items are updated by different cores. No explicit sharing is involved but the line
will still be moved. This effect can be avoided by ensuring that items which are not explicitly shared are not located in the same cache line as items
which are, and that items which are actively shared by different cores are not placed in the same cache line as each other.
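A common way to arrange this is to pad per-core data out to cache-line boundaries (a sketch assuming 64-byte lines and gcc-style attributes; L1 line sizes vary between ARM cores):

    /* Each counter gets its own cache line, so updates from different
       cores no longer bounce a shared line between L1 caches. */
    struct padded_counter {
        unsigned long count;
        char pad[64 - sizeof(unsigned long)];
    } __attribute__((aligned(64)));

    static struct padded_counter per_core_count[4];   /* one per core */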

Inter-process communication is greatly affected by the partition of the overall application into separate tasks. In general, the partitioning scheme used
should minimize the need to communicate between cores as there will always be latency involved.

As ever, the operating system is key here and it is important to be aware of the facilities it provides to maximize the efficiency of task switches and to
identify periods when the processing requirement is reduced, so that some or all cores can be placed in low-power states.

Coprocessors
Many systems incorporate optional extensions to the architecture. The most common are NEON and floating point. When present, always take
maximum advantage of the possibilities these offer to improve data throughput. NEON, in particular, greatly improves processing of vectored data.


NEON can be accessed either via libraries (many standard libraries targeting NEON are available), by enabling automatic vectorization in the
compiler, or by coding directly in NEON assembler. Writing NEON assembler should, in general, be avoided unless the performance gains outweigh
the downsides of reduced portability and maintainability.

In some ARM processors (you need to read documentation carefully for your particular device), there is latency of several cycles to transfer values
from NEON/VFP registers to and from core registers. This means that mixing float and integer operations can sometimes be quite inefficient.
Additionally, there is often considerable latency involved in setting the condition codes based on values generated by NEON or VFP operations.

Conclusion
To go back to the beginning, your code should, of course, be guided by such considerations as readability, portability, maintainability and all that
good stuff in your coding standard. There will be occasions, though, when you will be tempted to break those rules in the name of increased
performance (for some value of “performance”). Remember an even older adage:

“Be good! If you can’t be good, be careful!”

About the author


Chris Shore is passionate about ARM technology and, as well as teaching ARM’s customers, regularly presents papers and workshops at engineering
conferences. Starting out as a software engineer in 1986, his career has included software project management, consultancy, engineering management
and marketing. Chris holds an MA in Computer Science from Cambridge University.

This paper was originally presented at Design East 2012.
