Apple A9

Computer Architecture Assignment

BSc in MIS (Special) Batch 04
GROUP MEMBERS
Savithri Nandadasa BSC-UGC-MIS-16-18 -018

Thakshila De Silva BSC-UGC-MIS-16-18 -004
INTRODUCTION
The Apple A9 is a 64-bit system-on-chip (SoC) designed by Apple Inc. Manufactured for Apple
by both Samsung and TSMC, it first appeared in the iPhone 6S and 6S Plus, which were
introduced on September 9, 2015. Apple states that it has 70% more CPU performance and
90% more graphics performance than its predecessor, the Apple A8. It is one of the most
powerful and energy-efficient mobile chips on the market, along with the Samsung Exynos 8890
and Qualcomm Snapdragon 820.
Apple describes the A9 in the iPhone 6s and iPhone 6s Plus as "the most advanced smartphone
chip in the world" and states that every chip it ships meets Apple's highest standards for
providing incredible performance and delivering great battery life, regardless of iPhone 6s
capacity, color, or model.
APPLE A9 SOC VERSIONS

Apple tends to invest more heavily in SoC development for the processors of the iPhone "S" models.
When Apple released the A7 SoC alongside the iPhone 5s in 2013, they pulled off something that
rocked the SoC industry. The Cyclone CPU core all but came out of nowhere, beating previous
estimates for the first ARMv8 64-bit phone SoCs (by any vendor) by roughly a year. As a result the
64-bit transition became a lot more important a lot sooner than anyone was expecting, and to this
date some of Apple's SoC competitors are still trying to recover from the shock of having to scramble
to go 64-bit sooner than they planned.
When the iPhone 6 launched, Apple finally reached the point where they were building SoCs on a
leading-edge manufacturing process, which at the time was TSMC's 20nm planar process.
The fact that Apple was building on a leading-edge process was important for two reasons:
1) It was a strong indicator of how serious they were about SoC production and how much
they were willing to spend in order to achieve the best possible performance.
2) It meant that Apple had finally completely climbed the ladder (so to speak) and wouldn't be
able to exceed the curve just by catching up on manufacturing technology. Post-A8, Apple
can only improve their performance by improving their architecture, building bigger chips,
and finally, jumping to newer manufacturing processes as they become available.
FinFET transistors are necessary because as transistors get smaller their leakage (wasted power)
goes up, and without FinFETs leakage would spiral out of control. In fact that's exactly what
happened on the 20nm nodes from Samsung and TSMC; both companies thought the leakage of
planar transistors could be adequately controlled at 20nm, only for leakage to be a bigger problem
than they expected. Due in large part to this, the 20nm SoCs released over the last 18
months have more often than not struggled with power consumption and heat, especially at higher
clock speeds. Apple is something of an exception here, with the 20nm A8 proving to be a solid SoC,
thanks in part to their wide CPU design allowing them to achieve good performance without using
high clock speeds that would exacerbate the problem.
ARM BIG.LITTLE ARCHITECTURE
ARM big.LITTLE processing is a power-optimization technology where high-performance ARM
CPU cores are combined with the most efficient ARM CPU cores to deliver peak-performance
capacity, higher sustained performance, and increased parallel processing performance, at
significantly lower average power. The latest big.LITTLE software and platforms can save 75%
of CPU energy in low to moderate performance scenarios, and can increase performance by
40% in highly threaded workloads. The underlying big.LITTLE software, big.LITTLE MP,
automatically and seamlessly moves workloads to the appropriate CPU core based on
performance needs. ARM big.LITTLE technology enables mobile SoCs to be designed for new
levels of peak performance, within the same all-day battery life users expect.
The A9 processor is a popular general-purpose choice for low-power or thermally constrained,
cost-sensitive 64-bit Apple devices.
The A9 implements the widely supported ARMv8-A AArch64 architecture with an efficient
microarchitecture:
- High-efficiency, dual-issue superscalar, out-of-order, dynamic-length pipeline (8 to 11 stages)
- Highly configurable L1 caches, and optional NEON and floating-point extensions
- Available as a single-processor configuration, or a scalable multi-core configuration with
  up to 4 coherent cores
Global Task Scheduling (GTS) gives the OS awareness of the big and LITTLE processors, and
the ability to schedule individual threads of execution on the appropriate CPU core based on
dynamic run-time behavior. ARM has developed a kernel-space patch set based on GTS called
big.LITTLE MP that keeps track of load history as each thread runs, and uses the history to
anticipate the performance needs of the thread the next time it runs.
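The load-history idea can be illustrated with a small sketch. This is not ARM's big.LITTLE MP code: the class name, the decay weight ALPHA, and both thresholds are assumptions made up for the example. It only shows how a decayed per-thread load history could drive the choice between a big and a LITTLE core.

```python
# Illustrative sketch only: not ARM's big.LITTLE MP implementation.
# ALPHA and the two thresholds are assumed values chosen for the example.
ALPHA = 0.5            # weight given to the most recent load sample
UP_THRESHOLD = 0.7     # tracked load above this moves a thread to a big core
DOWN_THRESHOLD = 0.3   # tracked load below this moves it back to a LITTLE core


class TrackedThread:
    def __init__(self, name):
        self.name = name
        self.load = 0.0          # decayed load history, between 0.0 and 1.0
        self.core = "LITTLE"     # new threads start on the efficient cluster

    def record_run(self, busy_fraction):
        """Fold the latest run (fraction of the time slice spent busy) into the history."""
        self.load = ALPHA * busy_fraction + (1 - ALPHA) * self.load

    def pick_core(self):
        """Choose the cluster for the next run from the tracked history."""
        if self.load > UP_THRESHOLD:
            self.core = "big"
        elif self.load < DOWN_THRESHOLD:
            self.core = "LITTLE"
        # Between the two thresholds the thread stays where it is.
        return self.core


def simulate(samples):
    """Return the core chosen after each observed run of a single thread."""
    thread = TrackedThread("demo")
    placements = []
    for busy in samples:
        thread.record_run(busy)
        placements.append(thread.pick_core())
    return placements
```

Using two thresholds rather than one keeps a thread from bouncing between clusters when its load hovers near a single cut-off.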
CPU ARCHITECTURE
Microarchitecture
The A9 microarchitecture is similar to the second-generation Cyclone microarchitecture (used in the
A8 chip). Some of the microarchitectural features are as follows:

Feature                     Value
Pipeline depth (stages)     16
Issue width                 6 micro-ops
Reorder buffer              192 micro-ops
Load latency                4 cycles
Number of integer pipes     4
Number of shifter ALUs      4
Load/store units            2
Integer pipe buffer size    48
About half of the performance boost over the A8 comes from the 1.85 GHz clock frequency. About a
quarter comes from the better memory subsystem (3x bigger caches). The remaining quarter comes
from microarchitectural tuning and the smaller technology node.
The CPU core in the A9 is Twister, the latest-generation ARMv8 AArch64 CPU core from Apple. With
Cyclone, Apple made a clear leap to the front of the ARM CPU development pack, and since then they
haven't looked back. Still, in the next year they will be facing ARM's own Cortex-A72 design along
with Qualcomm's own Kryo. As a result Apple needs to progress on the CPU performance front if only
to maintain their lead over other ARM vendors.
For the launch of the Apple A8 last year, Apple put together the Typhoon CPU core. Even though
Typhoon was for a non-S iPhone, Apple still managed to integrate some basic architectural
optimizations that put it ahead of Cyclone. This was important because Typhoon would only reach
1.4GHz in phones (likely a trade-off imposed by the temperamental 20nm process), and as a result
Apple needed their CPU architecture to carry the day.
FinFET process
However, with the iPhone 6s, all of the stars are coming into alignment for Apple. On the one hand,
as this is an iPhone S release, even more is expected of them on the architectural side of matters.
On the other hand, between the power benefits of the FinFET processes and Twister's place in Apple's
seeming 2-year cycle, Apple will get to run up the score twice: once with clock speed and once with
a more substantial architecture improvement.
In fact on the clock speed front this is the biggest jump in CPU frequencies since Swift in the A6,
where Apple went from an 800MHz ARM Cortex-A9 to the aforementioned custom Swift design at
1.3GHz. As a result Apple immediately gets to capitalize on a 450MHz (32.1%) clockspeed bump for
Twister in the A9 versus the Typhoon-powered A8. That large of a clock speed bump alone would be
enough to give Apple a sizable performance boost, especially as competing designs are already at
2GHz+ and are unlikely to shoot much higher due to power concerns.
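The clock speed figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the frequencies given in the text (1.4GHz for Typhoon, 1.85GHz for Twister, and 800MHz to 1.3GHz for the Swift transition); the function name is made up for the example.

```python
def clock_gain(old_ghz, new_ghz):
    """Return the clock increase as (MHz, percent of the old clock)."""
    delta_mhz = (new_ghz - old_ghz) * 1000
    percent = (new_ghz - old_ghz) / old_ghz * 100
    return round(delta_mhz), round(percent, 1)


# Typhoon (A8) to Twister (A9), using the frequencies quoted in the text.
a8_to_a9 = clock_gain(1.4, 1.85)

# Cortex-A9 at 800MHz to Swift (A6) at 1.3GHz, the earlier jump mentioned in the text.
a5_to_a6 = clock_gain(0.8, 1.3)

print("A8 to A9:", a8_to_a9)
print("Cortex-A9 to Swift:", a5_to_a6)
```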
Apple has always played it conservative with clock speeds in their CPU designs, favoring wide
CPUs that don't need to (or don't like to) clock higher, so an increase like this is a notable event
given the power costs that traditionally come with higher clock speeds. Based on the underlying
manufacturing technology this looks like Apple is cashing in their FinFET dividend, taking advantage
of the reduction in operating voltages in order to ratchet up the CPU frequency. This makes a great
deal of sense for Apple, as architectural improvements only get harder.
As for Twister's architecture, there's a story here as well. Relative to the Cyclone-to-Typhoon
transition, Typhoon-to-Twister is a larger architectural upgrade for Apple, as we'll see. At the same
time, however, it's not on the level of Swift-to-Cyclone, nor would we expect it to be. Apple's
architecture, for lack of a better word, should be stable for the moment, which means Apple has
plenty of room to optimize their designs without flipping the table and starting over.
So with that out of the way, let's start with a low-level look at Twister, and some of the attributes of
the CPU design.
In terms of execution width and reorder depth, we haven't found anything to indicate that Twister is
wider or deeper than Typhoon, so the issue width appears to still be 6 micro-ops while the
out-of-order execution reorder buffer remains at 192 micro-ops. A 6-wide design was and remains
atypically large for a 64-bit ARMv8 design, and this is one of those stable aspects that is likely not
to change anytime soon. As for the OoO reorder depth, contemporary experience is that deeper
OoO reorder windows eat more power, in which case this is something that Apple may want to hold
off on until they can't pick up performance gains elsewhere.
What's far more interesting is the branch misprediction penalty. While we don't have Apple's official
numbers for Twister (official numbers being where the 16-cycle figure and the 14-to-19 range
originate for Cyclone), our testing indicates that branch misprediction penalties are way down. The
average misprediction penalty is just 9 cycles, significantly lower than the official or average
misprediction penalties for Cyclone/Typhoon. Without more architectural information I don't want to
read into this too much (shorter penalties could imply a shorter pipeline); however, at a minimum this
means that Apple's performance just got a lot better whenever they do miss a branch.
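To give a feel for what a lower misprediction penalty means, the sketch below applies the standard estimate: stall cycles per instruction = branch fraction x misprediction rate x penalty. The branch fraction and misprediction rate are assumed values for illustration only; they are not measurements of any Apple core. Only the 16-cycle and 9-cycle penalties come from the text.

```python
def mispredict_stall_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """Average stall cycles added per instruction by branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload mix, for illustration only (not measured on any Apple core).
BRANCH_FRACTION = 0.20    # one instruction in five is a branch
MISPREDICT_RATE = 0.05    # one branch in twenty is mispredicted

# Penalties quoted in the text: 16 cycles (Cyclone figure), 9 cycles (Twister average).
older = mispredict_stall_cpi(BRANCH_FRACTION, MISPREDICT_RATE, 16)
twister = mispredict_stall_cpi(BRANCH_FRACTION, MISPREDICT_RATE, 9)

print(f"{older:.2f} vs {twister:.2f} stall cycles per instruction")
```

Under these assumed rates the stall cost drops from about 0.16 to about 0.09 cycles per instruction; the absolute numbers depend entirely on the assumed workload, but the ratio follows the penalties directly.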
Meanwhile the number of FP/NEON units, integer units, and load/store units is unchanged from
Typhoon, but the performance of those ALUs has shifted, both for integer and FP workloads. Twister
still retires up to 3 FP32 additions per cycle, but the latency has dropped from 4 cycles to 3 cycles,
which is all the more remarkable with Twister's clock speed boost (this brings the real-time latency
from ~2.9ns to ~1.6ns). In fact FP32 multiplication latency is down as well, from 5 cycles to 4 cycles.
Coupled with this, FP32 multiplication throughput on Twister is increased, indicating that it is now
capable of retiring 3 FP32 multiplications per cycle, as opposed to 2 under Typhoon. As a result
Twister should show some rather significant improvements in floating-point heavy workloads.
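The real-time latencies quoted above follow from dividing the cycle count by the clock frequency. A minimal sketch, using the 1.4GHz and 1.85GHz clocks and the cycle counts given in the text (the function and variable names are made up for the example):

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds at the given clock (GHz)."""
    return cycles / clock_ghz


TYPHOON_GHZ = 1.4    # A8 clock quoted in the text
TWISTER_GHZ = 1.85   # A9 clock quoted in the text

# (Typhoon latency, Twister latency) in nanoseconds.
fp32_add = (cycles_to_ns(4, TYPHOON_GHZ), cycles_to_ns(3, TWISTER_GHZ))
fp32_mul = (cycles_to_ns(5, TYPHOON_GHZ), cycles_to_ns(4, TWISTER_GHZ))

print(f"FP32 add: {fp32_add[0]:.1f}ns -> {fp32_add[1]:.1f}ns")
print(f"FP32 mul: {fp32_mul[0]:.1f}ns -> {fp32_mul[1]:.1f}ns")
```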
On the Integer side of matters on the other hand, things haven't changed nearly as much. Integer
throughput and latency remain unchanged for addition and multiplication. However the shifters,
which we rarely talk about, have been improved. All 4 integer pipelines can now also do shifts, up
from 2 on Typhoon. Shifters are an important type of ALU resource, however unlike basic arithmetic
operations it's a bit less obvious when it's in use, so while there will be performance benefits from
this change it's not as easy to predict where we'll see them.
Finally, looking at Twister's caches, while the L1 cache sizes remain untouched from Typhoon, Apple
has managed to pack in larger caches for both the L2 and L3. The size of the L2 cache in particular
has really ballooned, going from 1MB on Typhoon to 3MB on Twister. The benefit of growing this
cache is that Apple now can store much more in the way of data and instructions closer to the
Twister cores before going to L3, but the tradeoff is that cache access times typically go up a bit as it
takes longer to find something in the cache.
Cache organization
The A9 features an Apple-designed 64-bit 1.85 GHz ARMv8-A dual-core CPU called Twister. The A9
in the iPhone 6S has 2GB of LPDDR4 RAM included in the package. The A9 has a per-core L1
cache of 64 KB for data and 64 KB for instructions, an L2 cache of 3 MB shared by both CPU cores,
and a 4 MB L3 cache that services the entire SoC and acts as a victim cache.
If Apple were using an inclusive cache here, where all cache data is replicated at the lower
levels to allow for quick eviction at the upper levels, then Apple would have needed to increase
the L3 cache size by 2MB in the first place just to offset the larger L2 cache. So the effective
increase in the L3 cache size won't be quite as great. Otherwise I'm a bit surprised that Apple
has been able to pack in what amounts to 6MB more of SRAM on to A9 versus A8 despite the lack
of a full manufacturing node's increase in transistor density.
The shift from an inclusive cache to a victim cache allows the 4MB cache on A9 to still be
useful, despite the fact that it's now only slightly larger than the CPU's L2 cache. Of course
there are tradeoffs here (if you actually need something in the L3, it's more work to manage
moving data between L2 and L3), but at the same time this allows Apple to retain many of the
benefits of a cache without dedicating more space to an overall larger L3 cache.
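The behaviour of a victim cache can be sketched with a toy model. The class below is an illustration under simple assumptions (LRU replacement in both levels, capacities counted in cache lines, no other SoC clients using the L3); it is not a description of Apple's actual design. A line lives in either the L2 or the L3, never both: lines evicted from the L2 fall into the L3, and an L3 hit moves the line back up.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy L2 + victim-L3 model; capacities are in cache lines, not bytes."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()   # least recently used entry first
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        if addr in self.l2:
            self.l2.move_to_end(addr)      # refresh LRU position
            return "L2 hit"
        if addr in self.l3:
            del self.l3[addr]              # line moves up, it is not duplicated
            result = "L3 hit"
        else:
            result = "miss"
        self._fill_l2(addr)
        return result

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)   # evict the LRU line from L2
            if len(self.l3) >= self.l3_lines:
                self.l3.popitem(last=False)           # make room in L3
            self.l3[victim] = True                    # victim lands in L3
        self.l2[addr] = True
```

With capacities in a 3:4 ratio (mirroring the 3MB L2 and 4MB L3), the model keeps seven distinct lines across both levels, whereas an inclusive L3 of the same size could hold only four unique lines because it would duplicate the L2 contents.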
Memory management
The A9 features a custom storage solution, which uses an Apple-designed NVMe-based controller
that communicates over a PCIe connection. The iPhone 6s' NAND design is more akin to a PC-class
SSD than embedded flash memory common on mobile devices. This gives the phone a significant
storage performance advantage over competitors which often use eMMC or UFS to connect to their
flash memory.
The instruction set used in the A9 (ARMv8-A AArch64) has the following features:

- Has 31 general-purpose 64-bit registers.
- Has a dedicated stack pointer (SP) or zero register, depending on the instruction.
- The program counter (PC) is no longer directly accessible as a register.
- Instructions are still 32 bits long and mostly the same as A32 (with LDM/STM instructions
  and most conditional execution dropped).
  o Has paired loads/stores (in place of LDM/STM).
  o No predication for most instructions (except branches).
- Most instructions can take 32-bit or 64-bit arguments.
- Addresses are assumed to be 64-bit.
- Advanced SIMD (NEON) is enhanced.
  o Has 32 128-bit registers (up from 16), also accessible via VFPv4.
  o Supports double-precision floating point.
  o Fully IEEE 754 compliant.
  o AES encrypt/decrypt and SHA-1/SHA-2 hashing instructions also use these registers.
- A new exception system.
  o Fewer banked registers and modes.
- Memory translation from 48-bit virtual addresses, based on the existing Large Physical
  Address Extension (LPAE), which was designed to be easily extended to 64-bit.
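As an illustration of translation from 48-bit virtual addresses, the sketch below splits an address into table indices and a page offset. It assumes a 4 KB translation granule, under which the ARMv8-A format uses four table levels of 9 index bits each plus a 12-bit offset; the architecture also allows other granule sizes, and the text does not say which one the A9's software uses. The function and constant names are made up for the example.

```python
# Assumed layout: 4 KB granule, four table levels (9 + 9 + 9 + 9 + 12 = 48 bits).
PAGE_OFFSET_BITS = 12   # 4 KB pages
INDEX_BITS = 9          # 512 entries per translation table
LEVELS = 4
VA_BITS = PAGE_OFFSET_BITS + INDEX_BITS * LEVELS   # 48


def split_virtual_address(va):
    """Return ([level 0..3 table indices], page offset) for a 48-bit address."""
    if not 0 <= va < (1 << VA_BITS):
        raise ValueError("address does not fit in 48 bits")
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    for level in range(LEVELS):
        # Level 0 uses the most significant index bits, level 3 the least.
        shift = PAGE_OFFSET_BITS + INDEX_BITS * (LEVELS - 1 - level)
        indices.append((va >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


# Build an example address from known indices, then split it again.
example_va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0xABC
indices, offset = split_virtual_address(example_va)
print(indices, hex(offset))
```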
The other truly impressive aspect of the iPhone 6s this generation is the storage solution. The
iPhone's storage solution here is ahead of everything else in the industry for three clear
reasons. The first is the use of more advanced NAND organization. Although TLC NAND alone
is going to be clearly worse for performance than SLC or MLC NAND, the iPhone 6s uses SLC
caching in conjunction with TLC NAND to improve storage performance in the situations that
matter. The second is the use of PCI-Express to enable much higher bandwidths, which means
that the SLC cache can really stretch its legs to reach the high levels of bandwidth that it's
capable of. The third is the use of a custom storage controller with NVM Express, which helps to
realize the full benefits of PCI-Express. Overall, all of these things come together to make
noticeable differences in user experience. Probably the most obvious example here would be
iCloud backup and restore, along with app installs and updates. Burst photography and camera
speed are also improved as a result of better storage.
Performance
The A9 includes a new image processor, a feature originally introduced in the A5 and last
updated in the A7, with better temporal and spatial noise reduction as well as improved local
tone mapping. The A9 directly integrates an embedded M9 motion coprocessor, a feature
originally introduced with the A7 as a separate chip. In addition to servicing the accelerometer,
gyroscope, compass, and barometer, the M9 coprocessor can recognize Siri voice commands.
References
http://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/4
