Jian-Hong Lin, Yuan-Hao Chang, Jen-Wei Hsieh, Tei-Wei Kuo, Cheng-Chih Yang
Figure 6. Increase in the average branch number
4.2 Read Performance and Cache Miss Rate
Figure 7 shows the read performance of each game with differ-
ent cache sizes where the prediction graph was derived based
on ten traces of each game. Our approach achieved the best
performance for Raiden, due to its regular access patterns, but
the worst performance was observed for AOE II. When the
cache size was 2KB, the average read performance of AOE
II, TTD, and Raiden with the prefetch mechanism was 27.74
MB/s, 68.68 MB/s, and 94.98 MB/s, respectively. We must
point out that all of them were better than the read performance
of NOR (23.84 MB/s). Note that the lower the cache miss rate in prefetching, the higher the read performance. To resolve a cache miss, data accesses to NOR had to be redirected to NAND so that the missed data could be loaded from NAND into the cache. It was also shown that a 4KB cache was sufficient for the games under consideration, because the read performance became saturated once the cache size was no less than 4KB.
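The cost of redirecting misses to NAND can be made concrete with a simple back-of-the-envelope model (ours, not the paper's): each unit of data is served either from the SRAM cache or, on a miss, reloaded from NAND, and the per-unit service times are weighted by the miss rate. The bandwidth constants below are illustrative placeholders (Raiden's best figure and the worst-case figure from Table 3), not measured parameters of the scheme.

```python
# Illustrative throughput model, assuming two fixed service rates:
# cache hits are served at hit_mb_s and misses at miss_mb_s (NAND reload).
def effective_throughput(miss_rate, hit_mb_s=94.99, miss_mb_s=8.76):
    """Time-weighted (harmonic) mix: each MB costs 1/hit_mb_s seconds
    on a hit and 1/miss_mb_s seconds on a miss."""
    time_per_mb = (1 - miss_rate) / hit_mb_s + miss_rate / miss_mb_s
    return 1.0 / time_per_mb
```

Under this toy model even a modest miss rate drags the effective throughput toward the NAND reload rate, which is consistent with the observation that read performance tracks the cache miss rate.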
Figure 8 shows the read performance of the proposed ap-
proach for the three games with respect to different numbers of
traces, where the cache size was 4KB. The read performance
of each game was better than that of NOR even when only
two traces were used to generate a prediction graph. For example, the improvement ratios over NOR for AOE II, TTD, and Raiden were 24%, 216%, and 298%, respectively, when the number of traces for each game was 10 and the cache size was 4KB. When there were more than two traces, the read
performance of Raiden had almost no improvement because
the cache miss rate was almost zero. For AOE II, the read
Figure 7. The read performance with different cache sizes (10
traces)
performance improved slowly as the number of collected traces increased, because the access pattern of AOE II was highly random; increasing the number of collected traces for the prediction graph could not reduce the cache miss rate significantly. For TTD, good improvement was observed with the inclusion of two more traces. This was because the last two traces were, in fact, collected as players advanced in the game by clearing more stages. Furthermore, we summarize the read performance of the proposed scheme and other existing products in Table 3. It shows that the read performance of some specific applications with regular access patterns is even better than that of OneNAND. On the other hand, without our prediction mechanism, i.e., in the worst case of a 100% miss rate, requested data has to be read from NAND flash memory on each read request. Thus, it is impractical to use NAND to replace NOR without any prediction mechanism, because the read performance gap between the emulated NOR and NOR is too large.
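The quoted improvement ratios can be checked directly from the per-game throughputs and the raw NOR baseline (23.84 MB/s). Raiden's 4KB/10-trace figure of 94.99 MB/s is read off Figure 8 and may be rounded; the other two values appear in the text and Table 3.

```python
# Sanity-check of the quoted improvement ratios over raw NOR.
NOR = 23.84  # MB/s, read performance of NOR
games = {"AOE II": 29.57, "TTD": 75.24, "Raiden": 94.99}  # MB/s at 10 traces, 4KB cache
for name, mbps in games.items():
    improvement = (mbps - NOR) / NOR * 100
    print(f"{name}: {improvement:.0f}%")  # 24%, 216%, 298%
```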
Figure 8. The read performance with different numbers of
traces (4KB cache)
Figure 9 shows the cache miss rate. The miss rate was lower when more traces were used to construct the prediction graph. In the figure, when ten traces were used to
                AOE II   TTD     Raiden   Worst case   NOR     OneNAND [25]
Read (MB/s)     29.57    75.24   94.44    8.76         23.84   68

Table 3. Comparison of the read performance (10 traces and 4KB cache in our approach)
generate the prediction graph, the cache miss rate of Raiden was almost zero and that of TTD was lower than 5%, but that of AOE II could not be reduced effectively because of its unpredictable access patterns. Compared with the read performance shown in Figure 8, the read performance of a game was higher when its cache miss rate was lower.
Figure 9. Cache miss rate with different numbers of traces (with 4KB cache)
4.3 Main-memory Requirements
The major memory overhead of the prediction mechanism was the maintenance of the branch table. The more traces were used to create the prediction graph, the larger the branch table was, because more access patterns were learned. As shown in Figure 10, the table sizes of AOE II, TTD, and Raiden were only 39.83KB, 35.14KB, and 0.43KB, respectively, when ten traces were used to construct the graph of each game. In most embedded systems, the branch table of each game was still small enough to be stored in RAM. However, in this experiment, branch tables were stored in NAND flash memory and loaded into SRAM on demand. Figure 10 shows that the table size of Raiden stayed low as the number of traces increased, but the table sizes of AOE II and TTD kept growing because ten traces still could not cover all the access patterns of AOE II and TTD. However, as shown in Figure 9, the cache miss rate of TTD was already very low, so new traces were not needed to improve its cache hit ratio, while the cache miss rate of AOE II still could not be lowered even if more traces were collected.
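As a rough illustration of why the branch table grows with the number of traces, it can be thought of as a map from each logical block address (LBA) to the successor LBAs observed in the collected traces. The sketch below uses our own hypothetical layout and names; the paper does not specify a concrete data structure.

```python
from collections import defaultdict

def build_branch_table(traces):
    """traces: list of LBA sequences collected from program runs.
    Each entry maps an LBA to the set of successors seen in any trace;
    more traces can only add branches, so the table grows monotonically."""
    table = defaultdict(set)
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            table[cur].add(nxt)
    return table

def prefetch_candidates(table, lba):
    """On a read of `lba`, the recorded successors are the prefetch targets."""
    return sorted(table.get(lba, ()))
```

In this simplified view, a game with regular access patterns (like Raiden) yields few branches per node and a small table, while a highly random one (like AOE II) keeps adding new branches with every trace.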
4.4 Cache Pollution Rate
The cache pollution rate is the fraction of data that are prefetched but never referenced during program execution. The prefetching of unnecessary data represents overhead and might even decrease the read performance, because the prefetching activities
Figure 10. The size of branch table
of unnecessary data might delay the prefetching of useful data.
In addition, unnecessary data transfers lead to extra power consumption, which is critical to the design of embedded systems.
Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate was defined as follows:

    Cache pollution rate = 1 - N_SRAM2host / N_flash2SRAM
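A direct transcription of this definition, as a minimal sketch:

```python
def cache_pollution_rate(n_sram2host, n_flash2sram):
    """Fraction of the data moved from NAND flash to SRAM that the host
    never actually read: 1 - N_SRAM2host / N_flash2SRAM."""
    return 1.0 - n_sram2host / n_flash2sram

# e.g. if the host read 90 of the 100 units that were prefetched,
# the pollution rate is 10%.
```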
As shown in Figure 11, the cache pollution rate increased
as the number of traces for each game increased. That was
because more traces led to a larger number of branches per
branch node, and only one of the LBA links that follow a
given branch node was actually referenced by the program.
In summary, there was a trade-off between the prefetching
accuracy and the prefetching overhead, even though the cache
pollution rates were still lower than 10% in most cases.
Figure 11. The cache pollution rate (4KB cache)
5. Conclusions
This paper addresses the issue of replacing NOR with NAND, motivated by a strong market demand. Different from on-demand cache mechanisms proposed in previous work, we propose an efficient prediction mechanism with a limited memory-space requirement and an efficient implementation to improve the performance of programs stored in NAND. Binary code of programs is prefetched from NAND to the SRAM cache precisely and efficiently according to the prediction graph that
is constructed from the collected access patterns of program execution. A series of experiments was conducted based on realistic traces collected from three different types of popular games: AOE II, TTD, and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR in most cases, that the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and that the percentage of redundant prefetched data was lower than 10% in most cases.
For future research, we shall further extend the proposed mechanism to adjust the prediction graph on-line, to make the prediction mechanism adaptive to spatial and temporal changes of program executions. We shall also explore the predictability of data prefetching for programs that have high randomness in terms of access patterns.
References
[1] Flash Cache Memory Puts Robson in the Middle. Intel.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel Corporation.
[5] Flash-memory Translation Layer for NAND Flash (NFTL). M-Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification, http://developer.intel.com/. Technical report, Intel Corporation, Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives,
http:// www.microsoft.com/whdc/device/storage/hybrid.mspx.
Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 187-196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862-868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323-333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set Model. Communications of the ACM, 15(3):191-198, 1972.
[12] F. Douglis, R. Caceres, F. Kaashoek, K. Li, B. Marsh, and J. Tauber. Storage Alternatives for Mobile Computers. In Proceedings of the USENIX Operating System Design and Implementation, pages 25-37, 1994.
[13] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the power-hungry disk. In Proceedings of the 1994 Winter USENIX Conference, pages 292-306, 1994.
[14] DRAMeXchange. NAND Flash Contract Price,
http://www.dramexchange.com/, 03 2007.
[15] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and N. Chang. Demand Paging for OneNAND(TM) Flash eXecute-In-Place. CODES+ISSS, October 2006.
[16] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155-164, Jan 1995.
[17] J.-H. Lee, G.-H. Park, and S.-D. Kim. A new NAND-type flash memory package with smart buffer system for spatial and temporal localities. Journal of Systems Architecture, 51:111-123, 2004.
[18] B. Marsh, F. Douglis, and P. Krishnan. Flash Memory File Caching for Mobile Computers. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 451-460, 1994.
[19] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-aware demand paging on NAND flash-based embedded storages. ISLPED, August 2004.
[20] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-assisted demand paging for embedded systems with flash memory. EMSOFT, September 2004.
[21] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-efficient memory architecture design of NAND flash memory embedded systems. ICCD, 2003.
[22] Z. Paz. Alternatives to Using NAND Flash White Paper.
Technical report, M-Systems, August 2003.
[23] R. A. Quinnell. Meet Different Needs with NAND and NOR.
Technical report, TOSHIBA, September 2005.
[24] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash
Memory Data Sheet, 2003.
[25] Samsung Electronics. OneNAND Features and Performance, 11
2005.
[26] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit
OneNAND Flash Memory Data Sheet, 09 2006.
[27] M. Santarini. NAND versus NOR. Technical report, EDN,
October 2005.
[28] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST
Flash Memory Data Sheet, 2005.
[29] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND
Flash Memory, 2005.
[30] A. Tal. Two Technologies Compared: NOR vs. NAND White
Paper. Technical report, M-Systems, July 2003.
[31] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management
for the Flash Translation Layer in Embedded Systems. In
IEEE/ACM 2006 International Conference on Computer-Aided
Design (ICCAD), November 2006.
[32] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 86-97, 1994.
[33] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), pages 146-156, Apr 2003.
[34] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer for SmartMedia Card Systems. IEEE Transactions on Consumer Electronics, 50(1):192-197, February 2004.