
Advanced Computer Architecture

Part II: Embedded Computing


Embedded Memory Systems

Paolo.Ienne@epfl.ch
EPFL I&C LAP
(Largely based on slides by P. R. Panda, IITD
and P. Marwedel, University of Dortmund)

Motivation
Memories are the limiting performance factor
System-on-Chip memories and SRAMs embedded in
FPGAs are fast (1-2 cycles access) but:
On-chip memory might not be enough
eDRAM or eFLASH may be coming into the picture

Memories are a key energy consumer on SoC


Memory systems in embedded systems can be
customised
Large design space to exploit for optimisation
SoC and FPGA technologies support irregular
memory systems to a good extent


Importance of Memory in SoCs


Some rough rule-of-thumb figures:

Area
50-70% of chip area may be memory

Performance
10-90% of system performance may be memory related

Power
25-40% of system power may be memory related


Things Are Only Getting Worse

[Figure: energy per access and access times vs. memory size, with sub-banking]

Applications are getting larger and larger
The energy cost of keeping access times low is very high

Source: Marwedel, 2007

Some More Recent Estimations

Cacheless monoprocessor:
  Processor Energy     29%
  Main Mem. Energy     71%

Multiprocessor with I- and D-caches:
  Proc. Energy         51.9%
  I-Cache Energy       28.1%
  D-Cache Energy       14.8%
  Main Mem. Energy      5.2%

Average of over 200 benchmarks

Source: Verma and Marwedel, Springer 2007

Outline
Memory data layout
Scratchpad memory
Custom memory architectures


Memory Data Layout


Problem statement
Optimise the placement of data in memory to
maximise cache effectiveness with minimal resources

Normally compilers place data following
language conventions and program order
Can they do any better if they see the complete
embedded application?

Similarity
This is reminiscent of the study of placement for DSP
variables (but the problem and strategy were different)


Array Layout and Data Cache

int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

[Figure: a, b, and c laid out consecutively in memory; a[i], b[i], and c[i] all map to the same line of the direct-mapped, 512-word data cache]

Problem: Every access leads to a cache miss!


Aliasing Example

Cache size C, line size M, array size N
Addresses and cache positions:
  a[i]:  address i,       cache line (i mod C) / M
  b[i]:  address i + N,   cache line ((i + N) mod C) / M
  c[i]:  address i + 2N,  cache line ((i + 2N) mod C) / M

If N = kC, all cache positions are identical!
C is normally a power of 2
N is often a power of 2, too

Solutions?
Set-associative cache (costly!)
Make C larger than N (?!)

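To make the aliasing concrete, here is a minimal sketch (assumptions: word addressing, the three arrays laid out back-to-back from address 0, and the slide's parameters C = 512, M = 4, N = 1024) that prints the cache line each access maps to; because N is a multiple of C, the three columns come out identical.

#include <stdio.h>

#define N 1024                    /* array size (words)  */
#define C 512                     /* cache size (words)  */
#define M 4                       /* line size (words)   */

/* direct-mapped cache: line index of a word address */
static unsigned cache_line(unsigned addr) { return (addr % C) / M; }

int main(void)
{
    for (unsigned i = 0; i < 8; i++) {
        unsigned a = i;           /* a[i] lives at address i      */
        unsigned b = i + N;       /* b[i] lives at address i + N  */
        unsigned c = i + 2 * N;   /* c[i] lives at address i + 2N */
        printf("i=%u  a->line %u  b->line %u  c->line %u\n",
               i, cache_line(a), cache_line(b), cache_line(c));
    }
    return 0;                     /* with N = kC the three line columns match */
}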

Energy Cost of Associativity

[Figure: energy per access for caches of increasing associativity]

Source: Banakar et al., IEEE 2002

A Solution: Array Padding

int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

[Figure: memory layout with an M-word DUMMY pad inserted after a and after b, so that a[i], b[i], and c[i] map to different lines of the direct-mapped, 512-word data cache]

Data alignment avoids cache conflicts

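One way to express such padding at the source level is sketched below; the one-line pad (M = 4 words) is an assumption, and the arrays are wrapped in a struct so the back-to-back layout is guaranteed (the C standard does not order independent globals). In practice the slide's transformation is applied by the compiler's data-layout pass rather than by hand.

#define N 1024
#define M 4                         /* assumed cache line size in words */

struct {
    int a[N];
    int pad1[M];                    /* shifts the base of b by one line  */
    int b[N];
    int pad2[M];                    /* shifts the base of c by two lines */
    int c[N];
} d;                                /* the pads are never accessed */

void vector_add(void)
{
    for (int i = 0; i < N; i++)
        d.c[i] = d.a[i] + d.b[i];   /* a[i], b[i], c[i] now map to different cache lines */
}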

Classic Transformation: Loop Blocking

Traverse the loop iteration space in blocks (or tiles) so
that all elements accessed at once fit in the cache

Original Code
for i = 1 to N
  for k = 1 to N
    r = X[i,k]
    for j = 1 to N
      Z[i,j] = r * Y[k,j]

Blocked Code
for kk = 1 to N step B
  for jj = 1 to N step B
    for i = 1 to N
      for k = kk to min(kk+B-1, N)
        r = X[i,k]
        for j = jj to min(jj+B-1, N)
          Z[i,j] = r * Y[k,j]

[Figure: the N x N iteration space traversed in B x B blocks]

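A runnable C rendering of the blocked code is sketched below, assuming the update accumulates into Z (the usual matrix-multiply kernel). B is a tuning parameter: it should be small enough that the B x B tile of Y and the row tiles of Z touched together fit in the cache.

#define N 256
#define B 32

double X[N][N], Y[N][N], Z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

void blocked_mm(void)
{
    for (int kk = 0; kk < N; kk += B)           /* tile the k loop */
        for (int jj = 0; jj < N; jj += B)       /* tile the j loop */
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min(kk + B, N); k++) {
                    double r = X[i][k];
                    for (int j = jj; j < min(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}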

Array Tiling Reduces Aliasing Too

Idea:
Split the array in blocks or tiles and group the tiles of
each array which are accessed at once
If the tiles are small enough, the set of tiles
accessed at once will fit into the cache
Since they are adjacent in data memory, they will not
conflict in the cache

Source: Panda et al., IEEE 2001

Real-World Example: FFT

double sigreal[2048];
...
le = le / 2;
for (i = j; i < 2048; i += 2*le)
{
    ... = sigreal[i];
    ... = sigreal[i + le];
    ...
    sigreal[i]      = ...;
    sigreal[i + le] = ...;
}

[Figure: 1st outer loop iteration: accesses at indices 0 and 1024 of array sigreal map to the same position in the 512-word cache]

Padded FFT

double sigreal[2048 + 16];
...
le = le / 2; le = le + le / 128;
for (i = j; i < 2048; i += 2*le)
{
    i = i + i / 128;
    ... = sigreal[i];
    ... = sigreal[i + le];
    ...
    sigreal[i]      = ...;
    sigreal[i + le] = ...;
}

Pads: ~1 cache line, every cache size

[Figure: 1st outer loop iteration: with the pads, indices 0 and 1032 of array sigreal map to different positions in the 512-word cache]

Padding: 15% speed-up on a Sparc5


Algorithms to Decide Data Layout


Need to make the following decisions:

Tile Size Computation
Largest possible tiles such that the working set fits the
cache

Pad Size Computation
Minimum pad size which eliminates aliasing

Interleaving of Tiled Arrays
Arrangement of multiple arrays so that there is no
aliasing among arrays and all working sets fit the
cache

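As one concrete reading of "minimum pad size which eliminates aliasing", the sketch below searches for the smallest pad that maps the base addresses of k back-to-back arrays onto distinct cache lines. The cache and line sizes are assumptions, and only base-address conflicts are checked, not the full tile-interleaving analysis of the algorithms above.

#include <stdbool.h>
#include <stdio.h>

#define CACHE_WORDS 512
#define LINE_WORDS  4

static unsigned line_of(unsigned addr) { return (addr % CACHE_WORDS) / LINE_WORDS; }

/* true if any two of the k array bases share a cache line,
   with each array of n words followed by pad words of padding */
static bool bases_conflict(unsigned n, unsigned pad, unsigned k)
{
    for (unsigned i = 0; i < k; i++)
        for (unsigned j = i + 1; j < k; j++)
            if (line_of(i * (n + pad)) == line_of(j * (n + pad)))
                return true;
    return false;
}

int main(void)
{
    unsigned n = 1024, k = 3;                  /* three 1024-word arrays */
    unsigned pad = 0;
    while (bases_conflict(n, pad, k))
        pad += LINE_WORDS;                     /* grow the pad one line at a time */
    printf("minimum pad: %u words\n", pad);    /* prints 4 for these parameters */
    return 0;
}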

Algorithms to Decide Data Layout

Matrix Multiplication (array sizes 35-350)

[Figure: comparison of TSS, ESS, LRW, and DAT across array sizes]

DAT uses fixed tile dimensions
Others use widely varying sizes

Outline
Memory data layout
Scratchpad memory
Custom memory architectures


Scratchpad Memory Idea

[Figure: the addressable memory space 0..N-1 is split: addresses 0..P-1 map to an on-chip scratchpad (1-cycle access); addresses P..N-1 map to off-chip memory (10-20 cycles), which the CPU reaches through an on-chip data cache (1 cycle)]

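Software typically reaches the scratchpad simply by placing selected objects in its address range (0..P-1 above) at link time. A minimal sketch for a GCC-style toolchain follows; the section name ".scratchpad" and the choice of arrays are illustrative assumptions, and the linker script must map that section onto the on-chip RAM.

int hist[256] __attribute__((section(".scratchpad")));  /* placed in the 1-cycle scratchpad        */
int image[512 * 512];                                   /* default section: cached off-chip memory */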

Scratchpad Memory Advantages

Architecturally visible, static, software-managed cache

Avoid aliasing problems
Decide explicitly which data will be reused
Avoid evicting useful data from the cache

Increase determinism
Data are always "in the cache" when needed

Save power
Avoid the energy cost of high associativity
Avoid caching irrelevant data
Avoid using 2 levels of the memory system for temporaries

Energy Cost of Hardware Caches

[Figure: energy per access [nJ] vs. memory size (256 to 16384) for a scratchpad and for 2-way set-associative caches covering 1 MB, 16 MB, and 4 GB address spaces]

Energy consumption in tags, comparators, and muxes is significant

Source: Banakar et al., IEEE 2002

Timing Predictability

Many embedded systems are real-time systems: computations must be
finished in a given amount of time
Most memory hierarchies (i.e., caches) for PC-like systems are designed for
good average-case, not for good worst-case behaviour
The worst-case execution time (WCET) can be larger than
without a cache

[Figure: G.721 using a unified cache on an ARM7TDMI]

Source: Marwedel, 2007

Scratchpad Memory

Embedded processor-based system
Processor core
Embedded memory
  Instruction and Data Cache
  Embedded SRAM
  Embedded DRAM
  Scratchpad Memory

Design problems
1. How much on-chip memory?
2. Partitioning of on-chip memory into cache and scratchpad?
3. Which variables/arrays in the scratchpad?

Goals
Improve performance
Save power


Architecture Exploration

Explore the design space exhaustively
Requires an algorithm to perform partitioning between
on- and off-chip memory

Algorithm Memory Explore
for On-chip Memory Size T (in powers of 2)
  for Cache Size C (in powers of 2, < T)
    SRAM Size S = T - C
    Data Partition (S)
    for Line Size L (in powers of 2, < C, < MaxLine)
      Estimate Memory Performance (T, C, S)
Select (T, C, S, L) which maximises the optimisation goals

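A sketch of the exploration loop above follows. The three helpers are assumed hooks, not a real API: data_partition assigns arrays to a scratchpad of size S, estimate_performance evaluates a (T, C, S, L) configuration, and record_best keeps the configuration that best meets the optimisation goals.

void   data_partition(unsigned S);
double estimate_performance(unsigned T, unsigned C, unsigned S, unsigned L);
void   record_best(unsigned T, unsigned C, unsigned S, unsigned L, double score);

#define MAX_ONCHIP (16u * 1024u)   /* largest on-chip size considered (words) */
#define MAX_LINE   64u             /* largest cache line size considered      */

void explore(void)
{
    for (unsigned T = 1024; T <= MAX_ONCHIP; T *= 2)       /* total on-chip memory */
        for (unsigned C = 128; C < T; C *= 2) {            /* cache portion        */
            unsigned S = T - C;                            /* scratchpad portion   */
            data_partition(S);
            for (unsigned L = 4; L <= MAX_LINE && L < C; L *= 2)
                record_best(T, C, S, L, estimate_performance(T, C, S, L));
        }
}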

Variation of On-chip Memory Allocation

[Example: Histogram]

Effect of different ratios of scratchpad/cache sizes
Total on-chip memory size = 2 KB


Variation of Total On-chip Memory

[Example: Histogram]

Effect of on-chip memory size


Data Partitioning (I)

procedure Histogram_Evaluation
  char BrightnessLevel[512][512];
  int  Hist[256];

  for (i = 0; i < 512; i++)
    for (j = 0; j < 512; j++) {
      /* for each pixel (i,j) in image */
      level = BrightnessLevel[i][j];
      Hist[level] += 1;
    }

BrightnessLevel: regular access -> off-chip + cache
Hist: irregular access -> scratchpad


Data Partitioning (II)

procedure Convolution
  int source[128][128], dest[128][128];
  int mask[4][4];

  for (all points x,y of source)
    new = 0;
    for (i scanning the mask horizontally)
      for (j scanning the mask vertically)
        new += source[x+i][y+j] * mask[i][j];
    dest[x][y] = new / norm;

mask: small -> scratchpad
source + dest: large and regular access -> off-chip + cache

[Figure: iterations (0,0) and (0,1) reuse the whole mask while sliding over source]

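A runnable C version of the convolution is sketched below; the loop bounds keep the 4x4 mask inside the image, and NORM is an assumed scaling constant. The small, heavily reused mask is the natural scratchpad candidate, while source and dest stream through the cache.

#define W    128
#define MW   4
#define NORM 16                    /* assumed normalisation constant */

int source[W][W], dest[W][W];
int mask[MW][MW];                  /* candidate for the scratchpad */

void convolve(void)
{
    for (int x = 0; x <= W - MW; x++)
        for (int y = 0; y <= W - MW; y++) {
            int acc = 0;
            for (int i = 0; i < MW; i++)          /* the whole mask is reused */
                for (int j = 0; j < MW; j++)      /* at every (x, y) position */
                    acc += source[x + i][y + j] * mask[i][j];
            dest[x][y] = acc / NORM;
        }
}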

Data Partitioning

Pre-Partitioning Scratchpad/Off-chip
Scalar variables and constants to the scratchpad
Large arrays to off-chip memory

Detailed Partitioning
Identify critical data for the scratchpad
Criteria:
  Life-times of arrays
  Access frequency of arrays
  Loop conflicts

Similar reasoning can be applied to code


Global Placement Optimization

Which objects (arrays, loops, etc.) should be
stored in the scratchpad?

Non-overlaying allocation
Gain gk and size sk for each object k
Maximise the gain G = Σ gk, respecting the
scratchpad capacity: Σ sk ≤ SSP
Solution: knapsack algorithm

Overlaying allocation
Moving objects back and forth
between hierarchy levels
Solution: more complex...

[Figure: candidate objects in main memory (for i {...}, for j {...}, while..., repeat..., function..., Array..., Int...) and a scratchpad memory of capacity SSP attached to the processor]

Source: Steinke et al., IEEE 2002

Integer Linear Programming

Symbols:
S(vark) = size of variable k
n(vark) = number of accesses to variable k
e(vark) = energy saved per access if vark is migrated
E(vark) = energy saved if variable vark is migrated (= e(vark) * n(vark))
x(vark) = decision variable:
          = 1 if variable k is migrated to the scratchpad, = 0 otherwise
K = set of variables
Similar definitions for functions Fi, with I the set of functions

Integer programming formulation:
Maximise    Σ(k ∈ K) x(vark) E(vark) + Σ(i ∈ I) x(Fi) E(Fi)
subject to the constraint
            Σ(i ∈ I) S(Fi) x(Fi) + Σ(k ∈ K) S(vark) x(vark) ≤ SSP

Source: Steinke et al., IEEE 2002
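Since the non-overlaying allocation is exactly a 0/1 knapsack, a small dynamic program can stand in for an ILP solver when only variables are considered; the sketch below uses size[k] and gain[k] for S(vark) and E(vark), and MAX_SPM is an assumed bound on the scratchpad capacity. Recovering which variables to migrate needs the usual backtracking pass over the table.

#include <string.h>

#define MAX_SPM 4096                    /* assumed bound on the capacity SSP */

static long best[MAX_SPM + 1];          /* best[c] = max energy gain with capacity c */

long allocate_spm(const unsigned size[], const long gain[], int n, int ssp)
{
    memset(best, 0, sizeof best);
    for (int k = 0; k < n; k++)                      /* each object used at most once */
        for (int c = ssp; c >= (int)size[k]; c--)
            if (best[c - size[k]] + gain[k] > best[c])
                best[c] = best[c - size[k]] + gain[k];
    return best[ssp];                                /* maximum total energy saved */
}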

Reduction in Energy and Runtime

[Figure: execution cycles (x100) and energy for the multi_sort benchmark (a mix of sorting algorithms)]

Numbers will change with technology, but the algorithms will remain unchanged

Source: Steinke et al., IEEE 2002

Outline
Memory data layout
Scratchpad memory
Custom memory architectures


Array-to-Memory Assignment and Memory Banking

Exploit the possibility of designing a memory system
not for general use (e.g., completely uniform) and
not out of standard components (e.g., off-the-shelf DRAMs)

Clustering of arrays in memories
Exploit features of eDRAMs
Trade off power/energy/area

Ad-hoc bit widths
E.g., a 32-bit word-addressed architecture
with specific 6-bit arrays

Multiple accesses per cycle
E.g., allow concurrent accesses by coprocessors
Multiple CPU buses (e.g., DSPs)

[Figure: the address space split across Bank #1, Bank #2 (small bitwidth), and Bank #3 (accessible at once)]

Memory Banking Motivation

(e)DRAMs

for (i = 0; i < 1000; i++)
{ A[i] += B[i] * C[2*i]; }

[Figure: A, B, and C in a single DRAM bank; the row address (Addr[15:8]) selects a page into the page buffer, the column address (Addr[7:0]) selects the data word]

Memory Banking Motivation

(e)DRAMs

for (i = 0; i < 1000; i++)
{ A[i] += B[i] * C[2*i]; }

[Figure: A[i], B[i], and C[2i] placed in three separate banks, each with its own row/column addressing and page buffer feeding the datapath]

Typical DRAM Tradeoffs Also in High-End Servers

DRAMs are complex objects:
Multiple interleaved DRAM banks in a system
Large premium for burst accesses
Tradeoff between leaving a page open (making neighbouring accesses
faster) and closing it (avoiding the precharge time for far accesses)
In servers, optimisations are usually dynamic, performed by the
memory controller subsystem and controlled by the BIOS/OS
In embedded computers, similar optimisations can be done
statically and in an application-specific way


Minimal Number of Banks to Remove Access Conflicts

[Figure: from the data-flow graph (DFG) and the schedule, build the conflict graph of the arrays and derive a minimal bank allocation]

Can be extended to model multiple simultaneous accesses
to the same array (→ multi-ported memories)

Source: modified from Panda et al., ACM 2001
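One simple way to turn the conflict graph into a bank count is greedy colouring, sketched below (this is an illustration, not the exact allocation algorithm of Panda et al.): conflict[i][j] marks arrays that the schedule accesses in the same cycle, so they cannot share a single-ported bank.

#include <stdbool.h>

#define MAX_ARRAYS 16

/* bank[i] receives the bank assigned to array i; returns the number of banks used */
int assign_banks(bool conflict[MAX_ARRAYS][MAX_ARRAYS], int n, int bank[])
{
    int banks_used = 0;
    for (int i = 0; i < n; i++) {
        bool taken[MAX_ARRAYS] = { false };
        for (int j = 0; j < i; j++)
            if (conflict[i][j])
                taken[bank[j]] = true;    /* bank already claimed by a conflicting array */
        int b = 0;
        while (taken[b]) b++;             /* lowest free bank */
        bank[i] = b;
        if (b + 1 > banks_used) banks_used = b + 1;
    }
    return banks_used;
}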

Memory Allocation Exploration

[Figure: useful exploration space of candidate memory allocations]

Source: Panda et al., ACM 2001


Summary

In SoCs and FPGAs the situation is different from general-purpose
computers
Different design space (fast memories, almost as fast as logic)
Fewer constraints to use standard components: any size possible,
more types of memory available (e.g., dual-port), etc.
More bandwidth exploitable (no pins)

Hence a different world where many more things are
possible (whereas in classic computing or conventional chip-based
embedded systems there is not much freedom)

Companion situation to the customisation of processors:
Optimisations tailored to the data cache
  Memory Data Layout
Memory architecture customised to a given application
  Scratchpad Memory
  Memory Banking


References

P. R. Panda et al., Data and Memory Optimization Techniques
for Embedded Systems, ACM Transactions on Design Automation
of Electronic Systems, 6(2):149-206, April 2001

M. Verma and P. Marwedel, Advanced Memory Optimization
Techniques for Low Power Embedded Processors, Springer, 2007

P. R. Panda (ed.), Memory Issues in Embedded Systems-on-Chip,
Kluwer Academic, 1999

IEEE Design & Test of Computers, Special Issue on Large
Embedded Memories, May-June 2001

