
Lucian Codrescu

Sr. Director, Technology


Qualcomm Technologies, Inc.

Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications

Qualcomm Technologies, Inc. All Rights Reserved
Hexagon DSP processors in Snapdragon products

[Block diagram: Snapdragon 800. Blocks shown: Audio, Camera, Display, JPEG, Video, Other; Adreno GPU; four Krait CPUs with 2MB L2; Sensors, Misc., Connectivity; Hexagon aDSP; Multimedia Fabric, System Fabric, Fabric & Memory Controller; Modem with Hexagon mDSP; two LPDDR3 interfaces]

aDSP: Real-time media & sensor processing
mDSP: Dedicated modem processing
Expansion of Hexagon DSP use cases beyond audio

[Figure: use cases by Hexagon generation. Use-case labels: Voice, Audio; Image Enhancement (Camera, Still, Video); Computer Vision & Augmented Reality; Video; Sensors. Generation labels: HexagonV2/V3, HexagonV4 based products, HexagonV5 based products]

Hexagon DSP is evolving for use beyond voice and audio to computer vision, video and imaging features
The Hexagon DSP evolution
Generational improvements in performance and power efficiency driven by
both architecture and implementation

[Timeline figure, horizontal axis: Time]

Version   Process   Date
V1        65nm      Oct 2006
V2        65nm      Dec 2007
V3M       45nm      June 2009
V3C       45nm      Aug 2009
V3L       45nm      Nov 2009
V4M       28nm      Dec 2010
V4C       28nm      Dec 2010
V4L       28nm      Apr 2011
V5A       28nm      Dec 2012
V5H       28nm      Dec 2012
Key characteristics of
modem & multimedia applications

Requirements
  Require fixed real-time performance level (fps, Mbit/sec, etc.)
  Extremely aggressive power & area targets

Characteristics
  Mix of signal processing & control code
    For modem, Qualcomm does not use a split CPU/DSP architecture. All processing is done on Hexagon DSP
    Multimedia apps have significant control in the RTOS & frameworks
  Heavy L2$ misses
    Multimedia is data intensive
    Modem is code intensive

Hexagon DSP blends features targeted to modem & multimedia

VLIW
  Need multi-issue to meet performance
  Low complexity for Area & Power

Multi-Threading
  To reduce L2$ miss penalty without the need for a large L2
  Increases instructions/VLIW packet because compiler doesn't need to schedule latency

Innovate in ISA to maximize IPC
  More work/VLIW packet reduces energy/instruction
  Keep the pipelines full for MIPS/mm2
  Target both Signal Processing & Control code
VLIW: Area & power efficient multi-issue
Variable sized instruction packets (1 to 4 instructions per Packet)

Dual 64-bit execution units
  Standard 8/16/32/64bit data types
  SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
  Up to 8 16b MAC/cycle
  2 SP FMA/cycle

Dual 64-bit load/store units
  Also 32-bit ALU

Unified 32x32bit General Register File is best for compiler.
  No separate Address or Accum Regs
  Per-Thread

[Block diagram: Instruction Cache and Instruction Unit; two Data Units (Load/Store/ALU); two Execution Units (64-bit Vector); Data Cache; Register File per thread; L2 Cache / TCM; Device DDR Memory]

Maximizing the signal processing code work/packet
Example from inner loop of FFT: Executing 29 simple RISC ops in 1 cycle

{ R17:16 = MEMD(R0++M1)                 // 64-bit Load with post-update addressing
  MEMD(R6++M1) = R25:24                 // 64-bit Store with post-update addressing
  R20 = CMPY(R20, R8):<<1:rnd:sat       // Complex multiply with round and saturation
  R11:10 = VADDH(R11:10, R13:12)        // Vector 4x16-bit Add
}:endloop0                              // Zero-overhead loops: Dec count, Compare, Jump top
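To give a feel for how much work a single instruction in the packet above stands for, the following C sketch spells out a lane-wise 4x16-bit add of the kind VADDH performs on a 64-bit register pair. The function names are hypothetical, and wraparound (non-saturating) lane arithmetic is an assumption, since the slide does not state the overflow behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit lane `lane` (0 = least significant) from a 64-bit value. */
static uint16_t get_lane(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (uint16_t)(v >> (16 * lane));
}

/* Hypothetical scalar model of a 4x16-bit vector add.
 * Assumption: each lane wraps modulo 2^16 and never carries
 * into its neighbor. */
uint64_t vaddh_sketch(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t sum = (uint16_t)(get_lane(a, lane) + get_lane(b, lane));
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}
```

Written as scalar code, this one instruction already costs four extracts, four adds and four inserts, which is the kind of counting behind the slide's "simple RISC ops" comparison.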

Maximizing the control code work/packet
Hexagon DSP ISA improves control code efficiency over traditional VLIW

Example C code
void example(int *ptr, int val) {
  if (ptr!=0) {
    *ptr = *ptr + val + 2;
  }}

Traditional VLIW: Assembly Code
{ p0 = cmp.eq(r0,#0) }
{ if (!p0) r2=memw(r0)
  if (p0) jumpr:nt r31 }
{ r2 = add(r2,#2) }
{ r1 = add(r1,r2) }
{ memw(r0) = r1
  jumpr r31 }
Instr/Packet = 7 instr/5 packets = 1.4

Hexagon DSP: Dot-New Predication
{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r2 = add(r2,#2) }
{ r1 = add(r1,r2) }
{ memw(r0) = r1
  jumpr r31 }

Hexagon DSP: Compound ALU
{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r1 = add(r1,add(r2,#2)) }
{ memw(r0) = r1
  jumpr r31 }

Hexagon DSP: New-Value Store
{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r1 = add(r1,add(r2,#2))
  memw(r0) = r1.new
  jumpr r31 }
Instr/Packet = 7 instr/2 packets = 3.5
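The instructions-per-packet figures above can be reproduced with a few lines of C. This is only a sketch: the function name is hypothetical, a compound instruction is counted as 2 (the convention the deck states on its next slide), and the packet counts for the two intermediate variants (4 and 3) are read off the listings rather than stated on the slide.

```c
#include <assert.h>

/* Instructions per packet, counting each compound instruction as 2.
 * `simple` and `compound` are instruction counts; `packets` is the
 * number of VLIW packets the code occupies. */
double instr_per_packet(int simple, int compound, int packets)
{
    assert(packets > 0);
    return (double)(simple + 2 * compound) / (double)packets;
}
```

With these counts, `instr_per_packet(7, 0, 5)` gives the traditional VLIW figure of 1.4 and `instr_per_packet(5, 1, 2)` gives the new-value store figure of 3.5. Under the same assumptions the dot-new variant is `instr_per_packet(7, 0, 4)` = 1.75.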

High avg. instructions/packet for targeted use cases
Compound instructions count as 2

[Bar chart: Average Instructions/VLIW Packet (axis 0 to 5) for Computer Vision, Video, Imaging, Control, Audio]

Source: Qualcomm internal measurements
Programmer's view of Hexagon DSP HW multi-threading
Hexagon V5 includes three hardware threads
Architected to look like a multi-core with communication through shared memory

[Block diagram: Thread 0, Thread 1 and Thread 2, each drawn with two DU blocks, two XU blocks and its own Register File; Shared Instruction Cache; Shared Data Cache; L2 Cache / TCM]

Hexagon DSP V1-V4: Interleaved multi-threading
Simple round-robin thread scheduling
Number of threads matches execution pipe depth
(three threads, three execute stages)
All instructions complete before next packet dispatch
Compiler schedules for zero-latency which helps to increase
instructions/VLIW packet

Thread 0 Dispatch         Thread 1 Dispatch         Thread 2 Dispatch

T0: { Ld Ld Add Cmp }     T1: { St Ld Mpy Add }     T2: { Ld Add Jump }

                          T0: { Ld Ld Add Cmp }     T1: { St Ld Mpy Add }

                                                    T0: { Ld Ld Add Cmp }
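The round-robin scheme above can be sketched in a few lines of C. This is an illustrative model, not Qualcomm's implementation: the function names are hypothetical, and it assumes one packet dispatch per cycle with a thread's packet occupying one execute stage per cycle.

```c
#include <assert.h>

enum { NUM_THREADS = 3 };  /* three hardware threads, as on the slide */

/* Thread that dispatches a packet on `cycle` under simple round-robin. */
int imt_dispatch_thread(unsigned cycle)
{
    return (int)(cycle % NUM_THREADS);
}

/* Returns 1 if a packet that needs `execute_stages` cycles finishes
 * before the same thread's next dispatch slot, NUM_THREADS cycles later. */
int packet_done_before_next_dispatch(unsigned execute_stages)
{
    assert(execute_stages > 0);
    return execute_stages <= NUM_THREADS;
}
```

Because each thread gets a slot only every third cycle and the pipe has three execute stages, `packet_done_before_next_dispatch(3)` holds. In this model that is why the compiler can treat every instruction as if it had zero latency.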

Hexagon DSP V5: Dynamic HW multi-threading
Recover some performance when threads idle or stalled

Remove a thread from IMT rotation
  On L2 cache misses
  When in wait-for-interrupt or off mode

Additional forwarding to support 2-cycle packets
  VLIW packets with dependencies between long latency instructions will stall
  But many VLIW packets with simple instructions can complete in 2 processor clocks

[Bar charts: Coremarks/MHz (axis 0 to 8) and Dhrystone DMIPS/MHz (axis 0 to 5), each comparing IMT and DMT]
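Removing a thread from the rotation can be pictured as a round-robin pick that skips threads that are not ready. The C sketch below is a simplified model under stated assumptions: `dmt_next_thread` and the `ready` flags are hypothetical names, and the model ignores the 2-cycle packet constraint and any stall caused by long-latency dependencies.

```c
#include <assert.h>

enum { NUM_THREADS = 3 };  /* three hardware threads, as on the slides */

/* Pick the next thread to dispatch after `last`, in round-robin order,
 * skipping threads whose ready flag is 0 (for example, waiting on an
 * L2 miss or in wait-for-interrupt). Returns -1 if no thread is ready. */
int dmt_next_thread(int last, const int ready[NUM_THREADS])
{
    assert(last >= 0 && last < NUM_THREADS);
    for (int step = 1; step <= NUM_THREADS; step++) {
        int candidate = (last + step) % NUM_THREADS;
        if (ready[candidate]) {
            return candidate;
        }
    }
    return -1;
}
```

With all three threads ready this degenerates to plain IMT. With thread 1 stalled, threads 0 and 2 alternate, so each sees a dispatch slot every second cycle instead of every third.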

Source: Qualcomm internal measurements
Hexagon DSP instructions per cycle

[Bar chart: Average Instructions / Cycle (axis 0 to 5), series IPC_IMT and IPC_DMT, grouped into Single-Threaded Apps and Multi-Threaded Apps]

Source: Qualcomm internal measurements
Qualcomm Hexagon DSP architecture
Highly efficient mobile application processor designed for more performance per MHz
[Bar chart: DSP Performance per MHz, BDTImark2000/MHz (axis 0 to 20)]

                                 Mobile Competitor   Qualcomm HXGN V4 (1 thread)   Qualcomm HXGN V4 (3 threads)
Clock Rate (MHz)                 430-520             100-233                       300-700
DSP Performance (BDTImark2000)   4730-5720           1810-4220                     5440-12660*

* - Projected best case score for 3-threads
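The per-MHz values plotted in the chart are not listed on the slide, but they can be derived from the table by dividing each score by its clock rate. The helper below is a minimal sketch with a hypothetical name. Pairing the low end of each score range with the low end of the clock range (and high with high) is an assumption.

```c
#include <assert.h>

/* Benchmark score normalized by clock rate (score per MHz). */
double score_per_mhz(double score, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return score / clock_mhz;
}
```

Under that pairing, the competitor comes out at 4730/430 = 11.0 BDTImark2000/MHz, while Hexagon V4 comes out at 1810/100 = 18.1 for one thread and 5440/300, about 18.1, for the projected three-thread case.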


Source: BDTI - For more detailed information see www.BDTI.com. All scores © 2013 BDTI
Hexagon DSP Power Benefits

MP3 playback power for competitive smartphones
[Bar chart: Power (lower is better) for Competitor A, Qualcomm / Hexagon-based, Competitor B, Competitor C, Competitor D, Competitor E, Competitor F, Competitor G]

Power measured at the battery for various phones
Includes everything: DSP, CPU, memory, analog components, etc.
Source: Qualcomm internal measurements
Computer vision offload: ARM/Neon to Hexagon DSP
Augmented Reality Java App finding objects in image using FastCV Feature Detect
Comparison of Feature Detect run on:
  App CPU (ARM/Neon)
  App DSP (Hexagon)

[Bar charts comparing the two runs]
  CPU Utilization (%): 52% Less CPU
  Detection Time (%): 7% Less Time
  Total Device Power (%): 32% Less Power*

Source: Qualcomm internal measurements. * Power measured at the device battery
Hexagon DSP power for different thread utilizations
Excellent near-linear power scalability
(as threads go idle, power used by the thread is nearly eliminated)
Achieved through optimized clock tree design & clock gating
[Charts: "Dhrystone Power, IMT Mode" and "FIR Power, IMT Mode"; vertical axis 0% to 100%; series: Actual, Ideal]

Source: Qualcomm internal measurements
Hexagon DSP Software Development

Independent Algorithm Developers on Hexagon DSP

Announcing the Hexagon DSP SDK
See the Hexagon DSP SDK in action at Uplinq2013 (www.uplinq.com)

Visit http://developer.qualcomm.com for more information.
Thank you
For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog
© 2013 Qualcomm Technologies, Inc.
Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States
and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other
product and brand names may be trademarks or registered trademarks of their respective owners.
Hexagon is a product of Qualcomm Technologies, Inc.
