
Embedded memory hierarchy exploration based on

Magnetic RAM
Luís Vitório Cargnini, Lionel Torres, Raphael Martins Brum, Sophiane Senni, Gilles Sassatelli
LIRMM - UMR CNRS 5506 - University of Montpellier 2
161 Rue Ada, Montpellier, 34095, France
E-mail(s): {Torres,cargnini,brum,senni,sassatelli}@lirmm.fr

Abstract: SRAM, DRAM and FLASH are the three main technologies employed in the design of on-chip processor memories. However, manufacturing constraints on these technologies at the most advanced nodes compromise their further evolution. MRAM (Magnetic RAM) presents itself as an attractive alternative to these technologies, as it has reasonable timing and power characteristics. The latest state-of-the-art results demonstrate that MRAM access time can be less than 5 ns, with read/write energy per bit in the same order of magnitude as SRAM, and that it can evolve with the manufacturing process. One important feature of MRAM is its non-volatility, which allows new instant on/off policies to be defined and, above all, leakage current to be optimized. In this paper we demonstrate how MRAM can be used in the memory hierarchy of embedded systems. The main objective is to demonstrate the interest of using MRAM for Level-1 and Level-2 caches, and to better understand the architectural choices that minimize the impact of the higher write latency of MRAMs.

Index Terms: Semiconductors, VLSI, SoC, Memory, NVM, MRAM, Embedded Systems, Memory Hierarchy.

I. INTRODUCTION
SRAM is currently the de-facto technology for designing cache memories at Levels 1 and 2 of a processor's memory hierarchy. It is a fast, yet power-hungry, kind of memory. DRAM comes next in the hierarchy, serving as a larger but slower volatile memory; a further drawback is that the process used to build DRAM cells, and to keep their roughly 30 fF of storage capacitance, is highly complex at sub-micron nodes. Finally, in embedded systems, secondary storage is usually made of solid-state devices based on Flash memory.
Many obstacles threaten the continued scaling of these three technologies. From increasing leakage power to lithography issues, it has been estimated that, by 2018, SRAM, DRAM and Flash will likely have to be replaced if Moore's law is to hold [1]. This landscape has motivated the appearance of a number of non-volatile memory (NVM) technologies in the past years. Spin-Transfer Torque Magnetic RAM (STT-MRAM), Phase-Change RAM (PCM) and Resistive RAM (RRAM), among others, are considered by the ITRS as the most promising candidates to take over the mainstream market. Table I provides a quick comparison of these technologies. MRAM density (depending on the MRAM technology style) is around four to seven times higher than SRAM's, but its access time is between three and ten times longer. On the optimistic side, recent results from Toshiba [4] concerning perpendicular STT show an access time of approximately 4 ns and a read/write energy per bit almost equivalent to SRAM's; currently, however, this result has essentially been obtained at the device level.

Table I: Comparison of NVM technologies [1]-[3]

Technology | Min. cell size (F^2) | Endurance (cycles) | Read latency (ns) | Write latency (ns)
SRAM       | 150 | 10^16 | 2     | 2
STT-MRAM   | 20  | 10^12 | 5     | 5-30
pSTT-MRAM  | n/a | n/a   | 3     | 3
TAS-MRAM   | n/a | n/a   | 30    | 30
NAND       | 4   | 10^4  | 100E3 | 1E6
NOR        | 10  | 10^5  | 15    | 1E3
FeRAM      | 22  | 10^12 | 40    | 65
RRAM       | 30  | 10^5  | 100   | 100
PCM        | 4   | 10^12 | 12    | 100

Recall that in a Magnetic Tunnel Junction (MTJ), the information is stored as the magnetization direction of one of two ferromagnetic layers separated by a thin tunnel barrier; this layer is called the free layer. The other layer, called the reference or pinned layer, is designed in such a way that its magnetization is hard to reverse. Figure 1 shows a schematic view of a typical 1T-MTJ cell, the memory cell architecture of MRAM with STT writing (STT-MRAM).

Figure 1: Perpendicular STT-MRAM Principle [5]
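To make the read-out principle concrete, the toy Python sketch below maps the relative alignment of the two layers to the low/high resistance states sensed as logic values. The resistance and TMR numbers are illustrative assumptions, not measured device data.

    # Toy read-out model for an MTJ bit cell: parallel (P) alignment of the
    # free and reference layers gives a low resistance, anti-parallel (AP)
    # gives a high one. All numbers are illustrative assumptions.
    R_P = 2_000.0              # parallel-state resistance in ohms (assumed)
    TMR = 1.0                  # tunnel magnetoresistance ratio, ~100% (assumed)
    R_AP = R_P * (1.0 + TMR)   # definition of TMR: R_AP = R_P * (1 + TMR)
    R_REF = (R_P + R_AP) / 2   # mid-point reference used by the sense amplifier

    def read_bit(sensed_resistance: float) -> int:
        """Return the stored bit: AP (high resistance) reads as 1, P as 0."""
        return 1 if sensed_resistance > R_REF else 0

    assert read_bit(R_P) == 0 and read_bit(R_AP) == 1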


In this paper we propose an evaluation flow for embedded processors whose memory hierarchy is based on STT-MRAM. We will use this evaluation flow to demonstrate that, under certain conditions, STT-MRAM is quite competitive with SRAM in terms of application performance, with the added ability to optimize energy consumption.
II. METHODOLOGY: EVALUATION FLOW
In order to evaluate the impact of STT-MRAM applied to the memory hierarchy, and building on the previous work of [6], we propose the methodology flow depicted in Figure 2.

Concerning STT-MRAM, the memory simulator is based on a modified CACTI environment: the NVSim simulator [15]. NVSim is a circuit-level model simulator dedicated to NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash. NVSim has been successfully validated against industrial NVM prototypes, and it is expected to boost architecture-level NVM-related studies.
At the end of the flow we are thus able to provide, for a given application, a clear performance comparison between STT-MRAM and SRAM used within an embedded processor architecture.
Figure 2: Overall evaluation flow

The methodology is mainly based on a processor architecture simulator, gem5 [7]. The gem5 simulator currently supports a variety of ISAs such as Alpha, ARM, MIPS, Power, SPARC, and x86. The simulator's modularity allows these different ISAs to plug into the generic CPU models and the memory system without having to specialize one for the other. In our particular case we adopted the ARMv7 ISA available in gem5. Specifically, we assume an ABI compatible with the Cortex-A9, and our compiler generates binaries specifically for that target with regard to assembly and SIMD instructions.
The main interest of our approach is to determine the overall processor system architecture to use, including the memory hierarchy specifications and features: cache sizes and latencies for L1, L2, and main memory. We are able in this way to extract all the memory transactions: the number of L1 and L2 read/write accesses, cache hits and misses, among other parameters. The use of gem5 (a quasi-cycle-accurate simulator) allows us to evaluate different memory sizing strategies and cache policies, and to perform accurate performance analyses.
Our objective is to compare the use of SRAM caches versus STT-MRAM caches in the embedded processor memory hierarchy. For both SRAM and STT-MRAM, it is therefore necessary to obtain the electrical characteristics of these memories (read/write latency, access time, power consumption, and so on) to calibrate the gem5 simulator.
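As an illustration of this calibration step, the sketch below declares two gem5 L2 cache flavours whose latencies are the Table III figures converted to cycles at 1.5 GHz (70.1 ns is about 106 cycles, 18.8 ns about 29). Parameter names follow recent gem5 releases (older releases exposed a single hit_latency) and the MSHR values are assumptions, so treat it as a sketch of the approach rather than our exact configuration.

    from m5.objects import Cache

    # L2 bank calibrated with the MRAM timing of Table III: a 70.1 ns hit
    # at 1.5 GHz is ~106 cycles.
    class L2MramCache(Cache):
        size = "2MB"
        assoc = 8
        tag_latency = 106
        data_latency = 106
        response_latency = 106
        mshrs = 20            # outstanding-miss bookkeeping, values assumed
        tgts_per_mshr = 12

    # SRAM counterpart: 18.8 ns hit -> ~29 cycles at 1.5 GHz.
    class L2SramCache(L2MramCache):
        tag_latency = 29
        data_latency = 29
        response_latency = 29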
For SRAM performance we used the memory simulator CACTI [8]-[12]. CACTI is an integrated simulator which, based on a characterization of the technology node, is able to provide accurate information about cache and memory access time, cycle time, area, leakage, and dynamic power.
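At the end of the flow, the per-access energies reported by CACTI or NVSim are combined with the transaction counts extracted from gem5. A minimal sketch of this glue step follows; the stats.txt counter names are assumptions (they vary across gem5 versions and configurations), and the energies shown are the MRAM L2 values used later in Table IV.

    import re

    def read_stat(stats_path: str, key: str) -> float:
        """Fetch one counter from a gem5 stats.txt dump. The key names used
        below are assumptions; they vary across gem5 versions and configs."""
        pattern = re.compile(rf"^{re.escape(key)}\s+([\d.eE+-]+)")
        with open(stats_path) as f:
            for line in f:
                match = pattern.match(line)
                if match:
                    return float(match.group(1))
        raise KeyError(key)

    # Per-access energies as produced by NVSim/CACTI, in joules
    # (here: the MRAM L2 numbers of Table IV).
    E_READ, E_WRITE = 150.4e-12, 170.6e-9

    reads = read_stat("stats.txt", "system.l2.overall_accesses::total")
    writes = read_stat("stats.txt", "system.l2.writebacks::total")
    print(f"L2 dynamic energy: {reads * E_READ + writes * E_WRITE:.3f} J")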

III. EXPERIMENTAL SETUP

To evaluate this methodology flow, we propose herein a case study based on an x264 video encoder application. We considered a 32-bit RISC processor: dual-issue superscalar, out-of-order, speculative, with a dynamic-length pipeline (8-11 stages), which in the end is quite similar to the ARMv7 Cortex-A9 architecture. The clock frequency is fixed at 1.5 GHz, and a complete Linux operating system runs on top of the platform. The video to encode consists of 30 frames at 720p (1280x720).

As a first step, and to better understand the case study developed in the next section, we evaluated the impact of STT-MRAM for the L2 cache and compared its characteristics with those of a similar SRAM. Table II gives the initial architecture parameters for this case study.
Table II: Details regarding the architecture and the levels of the memory hierarchy.

Parameter | Features
Processor | 32-bit RISC processor, 8-11 stage pipeline, 2 instructions per cycle
L1 caches | 64 KB SRAM, 4-way set associative, 2 ns access latency, 32 bytes per cache line
L2 caches | 2 MB SRAM, 8-way set associative, 20 ns access latency, 32 bytes per cache line

A performance comparison between the SRAM and MRAM L2 cache memory banks at the 45 nm node is given in Table III.

Table III: Memory bank characteristics.

Field                    | SRAM      | MRAM
Technology node          | 45 nm     | 45 nm
Size                     | 2 MB      | 2 MB
Associativity            | 8         | 8
Total area               | 5.6 mm^2  | 2.2 mm^2
Data array area          | 5 mm^2    | 1.8 mm^2
Tag array area           | 0.63 mm^2 | 0.39 mm^2
Cache hit latency        | 18.8 ns   | 70.1 ns
Cache miss latency       | 2.9 ns    | 66.0 ns
Cache write latency      | 10.1 ns   | 75.1 ns
Hit dynamic energy       | 1.07 nJ   | 0.213 nJ
Miss dynamic energy      | 1.07 nJ   | 0.213 nJ
Write dynamic energy     | 0.03 nJ   | 0.22 nJ
Total leakage power      | 1326.7 mW | 26.5 mW
Data array leakage power | 1180.6 mW | 24.3 mW
Tag array leakage power  | 146.1 mW  | 2.2 mW
Hit (ns)                 | 18.8      | 70.8
Response (ns)            | 10.1      | 75.1
IV. SCENARIO A: L2 CACHE EXPLORATION FOR HIGH PERFORMANCE SYSTEM

Considering x264, the experimental results were obtained by executing the benchmark, together with the OS, on top of gem5, calibrated with the memory bank latencies of each technology given in Table III.

Observing Table IV, the total CPU time increases from 16.2 seconds to 17.1 seconds, which is not necessarily critical for such an application. Replacing the SRAM-based L2 memory bank by its MRAM counterpart, whose hit latency is about 2.7 times higher, causes an increase of only a few percent (about 5.5%) in total CPU time. The increased latency of the L2 cache thus has only a slight impact on CPU time.
It is clear that the major benefit of using STT-MRAM lies essentially in the leakage consumption. The gain is around 50 times when using the MRAM technology; we recall here that CMOS is only used for data decoding, so the memory array itself no longer leaks (the data are stored in the magnetic tunnel junctions).

Table IV: L2-cache dynamic energy estimation.

Metric                      | SRAM     | STT-MRAM
CPU time (s)                | 16.2     | 17.11
Write-back total (writes)   | 5879046  | 5944740
Overall accesses (reads)    | 21113987 | 22309621
Write energy per cache line | 22.8 nJ  | 170.6 nJ
Read energy per cache line  | 957.7 pJ | 150.4 pJ
Total write energy (J)      | 0.13     | 1.01
Total read energy (J)       | 0.020    | 0.0033
Table V: Static power consumption.

Metric             | SRAM      | MRAM
Execution time (s) | 16.2      | 17.1
Static power       | 1326.7 mW | 26.5 mW
Total energy (J)   | 21.49     | 0.45315

At the current state of the technology, MRAM consumes more dynamic energy than SRAM, as seen in Table IV (at least for our particular case, an x264 encoder, a workload presently found in virtually all embedded devices on the market). However, if we consider the total amount of energy as the sum of dynamic plus leakage energy, MRAM has the advantage, as shown in Table V. For write accesses we observe that MRAM takes about 7.5 times more dynamic energy than SRAM, while read operations on SRAM take about 6 times more energy than on MRAM; overall, MRAM consumes about 1.25 times more dynamic energy than SRAM for this specific application.
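The totals in Tables IV and V follow from straightforward energy bookkeeping, sketched below with the numbers taken from those tables. Note that the "Total energy" row of Table V accounts for the static term alone; as the sketch shows, adding the dynamic term of Table IV still leaves MRAM far ahead.

    # Energy bookkeeping behind Tables IV and V (all inputs copied from the
    # tables: static power in W, per-line energies in J, counts from gem5).
    def total_energy(t_exec, p_static, writes, e_write, reads, e_read):
        """Leakage plus dynamic energy, in joules."""
        return p_static * t_exec + writes * e_write + reads * e_read

    sram = total_energy(16.2, 1.3267, 5_879_046, 22.8e-9, 21_113_987, 957.7e-12)
    mram = total_energy(17.11, 0.0265, 5_944_740, 170.6e-9, 22_309_621, 150.4e-12)
    print(f"SRAM: {sram:.2f} J, MRAM: {mram:.2f} J")  # ~21.6 J vs ~1.5 J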
In [6], for example, a 2 MB L2 SRAM cache was replaced with an 8 MB L2 MRAM cache using roughly the same silicon footprint. In their particular case, the increase in cache size was not enough to compensate for the penalty due to the cache access delay. By employing write buffers and a novel cache access policy, they managed to achieve similar performance while reducing the power consumption of the overall application (comprising the whole memory hierarchy) by almost 74%.

They also present a hybrid MRAM/SRAM cache organization, having 31 sets implemented in MRAM and 1 set implemented in SRAM. The write-intensive data is kept in the SRAM part in order to mitigate the higher write delay. A method for determining which data is suitable for placement in the SRAM set is also discussed.
V. SCENARIO B: L1 CACHE EXPLORATION FOR LOW PERFORMANCE SYSTEM

With the same idea, we evaluate the use of MRAM in the Level-1 caches of microprocessors targeted at the embedded system domain. The target is a low performance system, where the constraints are different. Our goal is to determine whether replacing L1 SRAM caches by L1 MRAM caches, while keeping the same silicon footprint, is worthwhile.
The baseline configuration is quite simple. It consists of a single processor with a single cache level and a large external memory, an assumption that holds for many systems. Differently from our previous work in [16], we assume that the MRAM density is four times the SRAM's [17]. We are then comparing, for instance, a 4 KB SRAM-based cache with a 16 KB MRAM-based cache. For this set of experiments, we assume a latency of 3 clock cycles for each cache access, meaning that the processor stalls upon each cache request, waiting for the data to become available. We also assume a latency of 1000 cycles for the external memory to make the first word available, and 10 cycles for each subsequent word during burst reads [18]; a first-order model of the resulting miss penalty is sketched below. As shown in Figure 3, where a 1 KB SRAM cache is compared with its 4 KB MRAM counterpart, the latter shows performance comparable to the smaller, yet faster, SRAM.
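The sketch below makes the blocking-cache stall model concrete, assuming a 32-byte line of 32-bit words; it is a first-order illustration of the assumptions above, not the simulator itself, and the example miss rate is arbitrary.

    # First-order stall model for the Scenario B assumptions: 3-cycle cache
    # access, 1000 cycles to the first word of a line fill, 10 cycles per
    # subsequent word in the burst. Line and word sizes are assumptions.
    LINE_BYTES, WORD_BYTES = 32, 4

    def miss_penalty() -> int:
        words = LINE_BYTES // WORD_BYTES
        return 1000 + (words - 1) * 10     # 1070 cycles for an 8-word line

    def cpi(base_cpi: float, refs_per_instr: float, miss_rate: float) -> float:
        """CPI = base + blocking cache-access stalls + miss stalls."""
        return (base_cpi
                + refs_per_instr * 3       # every access stalls 3 cycles
                + refs_per_instr * miss_rate * miss_penalty())

    print(miss_penalty(), round(cpi(1.0, 0.3, 0.02), 2))   # 1070, 8.32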





Figure 3: Same-silicon-area MRAM versus SRAM L1 cache execution time comparison (CPI, lower is better): 1 KB SRAM versus 4 KB MRAM, over the MediaBench benchmarks.
In Figure 4, we set an MRAM L1 cache of 512 KB and compare it with a 128 KB SRAM L1 cache. It is shown that, for most benchmarks, the two are comparable in terms of performance.
In order to generalize this conclusion, let us define the CPI penalty as the relative increase in CPI caused by replacing an SRAM cache with an MRAM cache of the same silicon area:

    CPI_penalty = CPI_MRAM / CPI_SRAM - 1    (1)



Figure 4: Same-silicon-area MRAM versus SRAM L1 cache execution time comparison (CPI, lower is better): 128 KB SRAM versus 512 KB MRAM.

Based on the CPI penalty, Figure 5 shows the best-case, worst-case and average performance over the benchmark set as a function of the cache capacity. Provided our assumptions are valid, MRAM presents a CPI gain rather than a CPI penalty in most cases. Once the cache capacity is large enough to contain the whole benchmark data, however, the CPI gain turns into a penalty which can no longer be compensated unless specific techniques are employed.
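For concreteness, Equation (1) and the sign convention used in Figure 5 can be captured in a few lines; the CPI values below are illustrative only, not measured data.

    def cpi_penalty(cpi_mram: float, cpi_sram: float) -> float:
        """Eq. (1): relative CPI increase; negative values mean the MRAM
        configuration outperforms the same-area SRAM reference."""
        return cpi_mram / cpi_sram - 1.0

    print(f"{cpi_penalty(1.8, 2.4):+.0%}")  # -25%: a CPI gain (illustrative)
    print(f"{cpi_penalty(2.6, 2.4):+.0%}")  # +8%: a CPI penalty (illustrative)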
Figure 5: Overview of the CPI penalty (%): best case, worst case and average over the MediaBench benchmarks, as a function of cache capacity (SRAM:MRAM, in KB, from 1:4 up to 128:512); lower means better than the SRAM reference.

VI. CONCLUSION

We presented in this paper our working methodology for memory hierarchy evaluation, together with the results obtained to corroborate our assertions. We also investigated possible applications of new memory technologies that can evolve together with the advanced nodes used for embedded processors. The use of MRAM for Level-1 or Level-2 caches is being explored by several research groups, including ours. Current results indicate that it could be an attractive solution to address the rising power consumption observed in CMOS circuits. The use of eNVMs opens a new paradigm in the implementation of power-saving mechanisms, as non-volatility can be exploited to power off the devices whenever they are idle. In fact, we believe that many other architectural elements in digital systems could benefit from the recent advances in NVM technologies.

REFERENCES

[1] ITRS, "Emerging research devices," International Technology Roadmap for Semiconductors, Tech. Rep., Feb. 2012. [Online]. Available: http://www.itrs.net/Links/2011ITRS/2011Chapters/2011ERD.pdf
[2] W. Kim, S. I. Park, Z. Zhang, and Y. Yang-Liauw, "Forming-free nitrogen-doped AlOx RRAM with sub-µA programming current," in VLSI Technology, 2011 Symposium on, 2011. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5984614
[3] H. Yoda, S. Fujita, N. Shimomura, E. Kitagawa, K. Abe, K. Nomura, H. Noguchi, and J. Ito, "Progress of STT-MRAM technology and the effect on normally-off computing systems," in Electron Devices Meeting (IEDM), 2012 IEEE International, 2012, p. 11. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6479023
[4] E. Kitagawa and S. Fujita, "STT-MRAM cuts power use by 80%," eetimes.com. [Online]. Available: https://eetimes.com/design/memory-design/4412080/STT-MRAM-cuts-power-use-by-80-?pageNumber=2&Ecosystem=industrial-control
[5] P. Singer, "IEDM: Nanoelectronics provide a path beyond CMOS," ElectroIQ, Dec. 2012. [Online]. Available: http://www.electroiq.com/articles/sst/2012/12/iedm-nanoelectronics-provide-a-path-beyond-cmos.html
[6] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," 2009, pp. 239-249. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4798259
[7] N. Binkert, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, D. A. Wood, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, and T. Krishna, "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, Aug. 2011. [Online]. Available: http://portal.acm.org/citation.cfm?id=2024716.2024718&coll=DL&dl=ACM&CFID=219507219&CFTOKEN=22791001
[8] S. J. Wilton and N. P. Jouppi, "An enhanced access and cycle time model for on-chip caches," 1993. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.2142
[9] N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, "Memory modeling with CACTI," Processor and System-on-Chip ..., 2010. [Online]. Available: http://link.springer.com/chapter/10.1007/978-1-4419-6175-4_14
[10] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," published in International Symposium on Microarchitecture, Chicago, Dec. 2007, Tech. Rep., Apr. 2009.
[11] G. Reinman and N. P. Jouppi, "CACTI 2.0: An integrated cache timing, power, and area model," 250 University Avenue, Palo Alto, California 94301 USA, Tech. Rep., Feb. 2000.
[12] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," 250 University Avenue, Palo Alto, California 94301 USA, Tech. Rep., Aug. 2001.
[13] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, "CACTI 5.1," Hewlett-Packard Development Company, L.P., Tech. Rep., Apr. 2008.
[14] D. Tarjan, S. Thoziyoor, and N. P. Jouppi, "CACTI 4.0," Hewlett-Packard Development Company, L.P., Tech. Rep., Jun. 2006.
[15] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994-1007, Jul. 2012. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6218223
[16] W. Zhao, Y. Zhang, Y. Lakys, J.-O. Klein, D. Etiemble, D. Revelosona, C. Chappert, L. Torres, L. Cargnini, R. Brum, Y. Guillemenet, and G. Sassatelli, "Embedded MRAM for high-speed computing," in VLSI and System-on-Chip (VLSI-SoC), 2011 IEEE/IFIP 19th International Conference on, Oct. 2011, pp. 37-42.
[17] K. Mackay, "TAS, TAS+STT-MRAM and magnetic logic unit," Gardanne, Provence-Alpes-Côte d'Azur, France, Nov. 2011, property of Crocus Technology. Non-authorized publication.
[18] JC-42.3, "Double data rate (DDR) SDRAM standard," JEDEC Standard JESD79F, 2008. [Online]. Available: http://www.jedec.org/standards-documents/docs/jesd-79f
