
Author's Accepted Manuscript

Design of a Memristor-Based Look-Up Table (LUT) for Low-Energy Operation of FPGAs
T. Nandha Kumar, Haider A.F. Almurib, Fabrizio Lombardi
www.elsevier.com/locate/vlsi

PII: S0167-9260(16)00028-6
DOI: http://dx.doi.org/10.1016/j.vlsi.2016.02.005
Reference: VLSI1190

To appear in: Integration, the VLSI Journal


Received date: 17 November 2015
Revised date: 23 February 2016
Accepted date: 23 February 2016
Cite this article as: T. Nandha Kumar, Haider A.F. Almurib and Fabrizio Lombardi, Design of a Memristor-Based Look-Up Table (LUT) for Low-Energy Operation of FPGAs, Integration, the VLSI Journal, http://dx.doi.org/10.1016/j.vlsi.2016.02.005
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.

Design of a Memristor-Based
Look-Up Table (LUT) for
Low-Energy Operation of FPGAs
T. Nandha Kumar, Haider A.F. Almurib and Fabrizio Lombardi

Abstract: This paper presents a scheme for designing a memristor-based look-up table (LUT) in which the memristors are connected in rows and columns. Because the columns are isolated, the states of the unselected memristors in the proposed scheme are not affected by the WRITE/READ operations; therefore, the prevalent problems associated with nano crossbars (such as the write half-select and the sneak path currents) are not encountered. Extensive simulation results of the proposed scheme are presented for the WRITE and READ operations, and its performance is compared with previous LUT schemes using memristors as well as SRAMs. The proposed scheme is shown to be significantly better in terms of WRITE time and of energy dissipation for both memory operations (i.e. WRITE and READ); moreover, the READ delay is shown to be nearly independent of the LUT dimension. Simulations using benchmark circuits for FPGA implementation show that the proposed LUT also offers significant improvements at this level.
Index Terms: Memristor, non-volatile memory, sneak path, FPGA, LUT.
I. INTRODUCTION

Field Programmable Gate Arrays (FPGAs) have been widely utilized because they allow the fast hardware design and realization of digital systems at relatively low development cost and with good performance [1]. An FPGA consists of configurable resources; among them, Look-Up Tables (LUTs) are used to implement combinational logic circuits [2]. All configurable resources (inclusive of the LUTs) are controlled through the configuration bits stored in Static Random Access Memories (SRAMs) [2]. SRAM-based FPGAs are volatile, so they cannot retain the configuration bits when a power loss occurs. Non-Volatile (NV) flash memories are utilized as an alternative to store the configuration bits [3]; however, this technology requires a larger area and higher costs [4], and it incurs a high delay for retrieving the configuration bits and a substantial increase of leakage current in stand-by mode [5] (so causing additional power dissipation). To overcome these issues, NV memories (for LUT operation) made of so-called resistive elements (such as the memristor) have been proposed as storage elements. Memristor-based memories using nano crossbars have been extensively analyzed in the technical literature [8][9][11]-[17].
T. Nandha Kumar and Haider A.F. Almurib are with the Faculty of Engineering, The University of Nottingham, Malaysia (e-mail: nandhakumaar.t; haider.abbas@nottingham.edu.my).
Fabrizio Lombardi is with the Department of ECE, Northeastern University, Boston, MA 02115, USA (e-mail: lombardi@ece.neu.edu).
This manuscript is an extended version of [19] by the same authors.

These memories have been advocated as a potential replacement for conventional NV Flash Memories (NVFMs) as LUTs in an FPGA due to their higher density and lower power consumption [8]. However, these memories usually use nano crossbars, whose operation is affected by sneak path currents and the write half-select [13], i.e. after a few WRITE/READ operations, the memristances of the unselected memristors change, causing errors in the stored data.
To address these issues, this paper proposes a novel scheme in which the memristors are connected in rows and columns, but the columns are isolated. The sneak path current and write half-select problems are then prevented, because the memristances of the unselected memristors are unaffected. The proposed scheme also retains the advantages of [16]: no power dissipation in stand-by mode, no refresh pulse and no V/2 bias (so also lowering the number of power rails). The LUT scheme presented in this paper does not use a nano crossbar while still retaining the advantages of previous approaches, for example the utilization of the WRITE and READ schemes of [16] for improved performance. Extensive simulation results and a detailed comparative analysis (inclusive of circuit modeling) with previous works [11][12][16] show that the proposed scheme is significantly better in terms of WRITE time and of energy dissipation for both the WRITE and READ operations. Simulation is also extended from circuit-level to FPGA-level; an FPGA implementation using the proposed LUT shows a significant improvement in performance (delay and energy) compared to both SRAM-based and existing memristor-based (non-volatile) schemes, thus confirming its viability.
II. REVIEW AND PRELIMINARIES
a. Memristor
Circuit theory is defined by four variables: charge, flux, current and voltage. These four variables account for six possible (two-variable) combinations, of which only five (as relations) are known and well understood. The remaining relation, between flux and charge, implies a fourth fundamental circuit element; Chua theorized this element in 1971 [6] and named it the memristor. The memristor is characterized by its memristance function, given by the ratio of the change of flux to the change of charge. HP Labs achieved the first physical realization of a memristor using a titanium dioxide film sandwiched between two platinum electrodes [7]. In the HP implementation, there are two layers of film; one has a slight depletion of oxygen atoms while the other layer is non-depleted. The oxygen vacancies act as charge carriers, i.e. the depleted layer has a significantly lower resistance than the non-depleted layer. When an electric field is applied, the oxygen vacancies drift in the direction of the field, changing the boundary between the high-resistance and low-resistance layers. The application of a positive voltage at one end moves the oxygen ions to the other end of the film, thereby shifting the boundary between the doped and undoped film regions; the application of a voltage with opposite polarity reverses this phenomenon.
Let w(t) denote the length of the doped region (as a function of time) and D the total length of the titanium dioxide layer (memristor). Then w(t)/D is referred to as the Normalized State Parameter (NSP) [17]. When w(t) = D, NSP = 1 and the memristor is at its least resistance value (RON), i.e. the oxygen ions spread over the entire film. When w(t) = 0, NSP = 0 and the memristor is at its highest resistance value (ROFF), i.e. no doped region is present. The interval over which the resistance varies, from RON to ROFF, is defined as the so-called range of the memristor. The threshold value of the NSP is the value that allows the binary values (0, 1) of the stored data to be correctly distinguished; it is assumed to be 0.5. An updated version of the model in [18] is utilized in this paper because it has shown a close resemblance to the HP Labs implementation [7]. Moreover, throughout this manuscript and unless otherwise specified, the default values of the parameters are as follows. Memristor: RON = 100 Ohms; ROFF = 19k Ohms; D = 10 nm. Transistors: Low Power (LP) model of the Predictive Technology Model (PTM) [20], gate length = 32 nm and aspect ratio of 2; the on-state and off-state resistances of transistors T1 and T2 are 5.552k Ohms and 348.22G Ohms respectively. Accuracy of NSP: 0.1%. Simulator: LTSPICE IV.
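The relation between the NSP, the memristance and the stored bit can be illustrated with a short sketch. It assumes the simple linear relation M = RON x NSP + ROFF x (1 - NSP), which matches the two end points described above (NSP = 1 gives RON, NSP = 0 gives ROFF); it is not the updated model of [18] used for the simulations, and the function names are hypothetical.

```python
R_ON = 100.0         # least resistance in ohms (default value from the text)
R_OFF = 19e3         # highest resistance in ohms (default value from the text)
NSP_THRESHOLD = 0.5  # threshold separating logic 0 from logic 1


def memristance(nsp: float) -> float:
    """Memristance for NSP = w/D, assuming a linear interpolation (illustrative)."""
    if not 0.0 <= nsp <= 1.0:
        raise ValueError("NSP must lie in [0, 1]")
    return R_ON * nsp + R_OFF * (1.0 - nsp)


def stored_bit(nsp: float) -> int:
    """Logic value represented by an NSP.

    Treating values strictly above the threshold as 1 is an assumption;
    the text only states that the threshold is 0.5.
    """
    return 1 if nsp > NSP_THRESHOLD else 0
```

With these definitions, `memristance(1.0)` returns RON and `memristance(0.0)` returns ROFF, so a change in NSP that crosses 0.5 corresponds to a flipped stored bit.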
b. Non-Volatile Memory Schemes
The memristor is a potential candidate for replacing SRAM and NVFM [13] for storing the configuration bits of an FPGA. In recent years, new schemes have been proposed for FPGAs by using memristor-based memories [8]-[17]. A novel interconnect design for FPGAs has been proposed using only memristors and metal wires [8][9]; a hybrid design using SRAMs and memristors to implement the Non-Volatile LUT (NVLUT) of an FPGA has been reported in [10]. In addition to the NV feature, this NVLUT reduces power dissipation in the stand-by mode. However, it suffers from an increase in dynamic power dissipation, area and delay. [13] has presented a MOS-accessed memristor array, but this scheme also incurs substantial increases in area, current density and fabrication complexity.
To address these issues, [11]-[17] have reported NVLUT-based schemes using only memristors and nano crosswires. Although these schemes provide high device density and address some of the issues raised by [10], they still suffer from the so-called write half-select and sneak path current problems [13]. After a few WRITE/READ operations, the memristances (and the NSPs) of the unselected memristors change, thus resulting in erroneous stored data. The so-called V/2 bias with different two-step WRITE schemes has been proposed in [11]-[15] to address the write half-select problem. In [16], only a two-step WRITE scheme is utilized, unlike [11]-[15] in which the V/2 bias is utilized. The sneak path current problem has mostly been addressed in the technical literature by biasing all unselected rows to the same voltage as the selected columns, or by using a complex circuit-level solution with double sampling [13]; however, these schemes only partially alleviate the sneak path current problem and impose limitations on the array size. Sneak path currents have been substantially reduced in [16], because the V/2 bias is not utilized.
III. PROPOSED LUT DESIGN
The proposed LUT design utilizes a memristor-based scheme that is not a nano crossbar [19]; as shown later, its features are a fast WRITE time, a significantly reduced READ power dissipation (compared with previous schemes such as [11]-[17]) and no power dissipation in the stand-by mode. Moreover, the effects of write half-select and sneak path current [11]-[15] are not encountered in the proposed scheme, while better controllability of the NSPs of the unselected memristors is attained.
Fig. 1 shows the proposed memory block for a two-input LUT [19]; it consists of columns for the Bit Lines (BLs). The independent horizontal lines (i.e. the Word Lines or WLs) connect the memristors to the controller (every memristor has one terminal connected to a BL and the other to the controller), so the memristors on a row are not connected to each other but are isolated on a row-wise basis, i.e. the proposed scheme does not utilize a nano crossbar structure. The dimension of the LUT is given by the number of memristors connected to each BL; every BL is connected to ground through a MOS transistor.
The controller (Fig. 2) handles the data to be driven on the WLs (as per the input data); moreover, it switches the transistors on and off by controlling the gate signals (G1 and G2) and routes (through Select) the appropriate BL value to the output (Out). A and B denote the inputs of the LUT; Out is the output of the LUT. The logic values for the four address combinations AB (00, 01, 10 and 11) are stored in the memristors M11, M12, M21 and M22 respectively. The voltage requirements for the signals of the WRITE and READ operations of a selected memristor (in this case M11, without loss of generality and/or correctness) are shown in Table 1 [19].

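[Figure 1: schematic of the memory block. Labels: controller with inputs A, B, REn, WEn and Reset and supplies +Vdd and -Vdd; word lines WL11, WL12, WL21 and WL22; memristors M11, M12, M21 and M22; bit lines BL1 and BL2 with transistors T1 and T2 (gates G1 and G2); outputs Out1 and Out2 routed by Select to Out.]

Fig. 1 Proposed new scheme for a two-input LUT memory block implemented using memristors.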
Table 1
Voltage requirements for the WRITE operation on M11 and M21 and the READ operation on M11 using the proposed LUT.
Operation  WL11   WL12      WL21      WL22      BL1 voltage  BL2 voltage
Write 1    Vdd    Floating  Vdd       Floating  GND          Floating
Write 0    -Vdd   Floating  -Vdd      Floating  GND          Floating
Read       Vdd    Floating  Floating  Floating  GND          Floating

The circuit diagram of the proposed controller is shown in Fig. 2, while Table 2 shows the truth table of the functions of the controller block. The execution of the WRITE or READ operation is initiated by the controller using the inputs WEn and REn.
The input C is used to choose a particular BL, such that the memristors connected to it execute the WRITE operation. Unlike previous schemes, the WRITE operation is performed in parallel for all memristors connected to a single BL.
The READ operation is similar to [16]: depending on the values of A and B, the corresponding memristor is READ and the result appears at Out.
The WRITE and READ operations are treated in more detail next.
a) WRITE operation
The WRITE operation is asserted by WEn; this signal enables the simultaneous write to all memristors connected to a specific BL. As illustrated in Table 2 and Fig. 2, when C is 0 (1), the memristors connected to BL1 (BL2), i.e. M11 and M21 (M12 and M22), are made operative for the WRITE operation by turning on T1 (T2). Data is then written in the memristors according to the WRITE input (Table 1). Each BL is connected to ground through a MOS transistor (Fig. 1); thus, during the WRITE operation for the memristors connected to BL1, T1 is turned on while the memristors connected to BL2 are unaffected (T2 is turned off). Therefore, the memristors connected to BL1 operate independently of the memristors connected to BL2; moreover, unlike previous schemes, the proposed scheme does not require a V/2 bias to unselect a memristor, i.e. the proposed scheme does not suffer from the so-called write half-select problem. Additionally, the proposed scheme does not require a two-step writing scheme; instead, the WRITE operation is performed simultaneously across all memristors connected to a BL, which reduces the WRITE delay.
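The word-line drive described above can be sketched in a few lines for the two-input LUT. This is only an illustration of the selection logic: the supply value, the function names and the use of `None` for a floating line are assumptions, and the transistor states during READ are inferred from the memristor-to-BL connections rather than quoted from the controller design.

```python
VDD = 1.0        # supply voltage in volts; hypothetical value for illustration
FLOATING = None  # a word line left floating (its pass transistor is off)

WORD_LINES = ("WL11", "WL12", "WL21", "WL22")


def write_drive(c, d0, d1):
    """Column-parallel WRITE: C selects the bit line, D0/D1 are the data bits.

    A data bit of 1 drives +Vdd and a data bit of 0 drives -Vdd on the word
    lines of the selected column; the other word lines stay floating.
    """
    wl = dict.fromkeys(WORD_LINES, FLOATING)
    column = 1 if c == 0 else 2
    wl[f"WL1{column}"] = VDD if d0 else -VDD
    wl[f"WL2{column}"] = VDD if d1 else -VDD
    transistors = {"T1": c == 0, "T2": c == 1}  # True means turned on
    return wl, transistors


def read_drive(a, b):
    """READ of the memristor addressed by AB (00 -> M11, 01 -> M12, ...)."""
    wl = dict.fromkeys(WORD_LINES, FLOATING)
    wl[f"WL{a + 1}{b + 1}"] = VDD  # only the selected word line is driven
    transistors = {"T1": b == 0, "T2": b == 1}  # inferred: BL of the selected cell
    return wl, transistors
```

For example, `write_drive(0, 1, 0)` drives WL11 to +Vdd and WL21 to -Vdd with T1 on, which writes a 1 to M11 and a 0 to M21 in a single step while the memristors on BL2 remain untouched.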
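[Figure 2: controller schematic. Labels: supplies +Vdd and -Vdd; data inputs D0 and D1; inputs A, B, WEn, REn and Reset; pass transistors TG1, TG2, TG3 and TG4 driving WL11, WL12, WL21 and WL22; logic gates 1 to 4; outputs T1, T2 and Sel.]

Fig. 2 Circuit diagram of the controller for a two-input LUT.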


Table 2
WRITE and READ operations using proposed scheme
Operation  REn  WEn  C  A  B  T1  T2  M11(1) (00)  M12(2) (01)  M21(3) (10)  M22(4) (11)
Write      0    1    0  x  x  1   0   D0           z            D1           z
Write      0    1    1  x  x  0   1   z            D0           z            D1
Read       1    0    x  0  0  1   0   D0           z            z            z
Read       1    0    x  0  1  0   1   z            D0           z            z
Read       1    0    x  1  0  1   0   z            z            D1           z
Read       1    0    x  1  1  0   1   z            z            z            D1
Note: 0: Low; 1: High; x: don't care; z: floating (high resistance)
b) READ operation
The READ operation on a memristor depends on the values of A and B (Table 2), provided REn is asserted. For example (Table 2), when AB is 00, the READ operation is executed for M11. The remaining memristors connected to BL1 and BL2 are driven to a high resistance (floating), as the corresponding pass transistors (TG2, TG3 and TG4) are in the off state. The READ voltage (Table 1) is applied to WL11; then, by applying the appropriate sel(0) signal, the voltage across T1 is propagated to Out, i.e. the value stored for the input 00 is read out. This completes a READ operation on M11. Furthermore, as explained next, the output voltage difference between READ 0 (NSP = 0) and READ 1 (NSP = 1) in the proposed scheme is significantly greater than for [16].
Figs. 3(a) and 3(b) illustrate the circuit diagrams for the READ operation of M11 in a two-input LUT using [16] and the proposed scheme respectively. In the proposed scheme, the unselected WLs are connected to a high resistance (in the order of giga ohms), because the corresponding pass transistors are turned off; for [16], the unselected WLs are grounded. Figs. 4(a) and 4(b) show the equivalent circuit diagrams of [16] and the proposed scheme respectively.
[Figure 3: circuit diagrams for reading M11 with input Vin, memristors M11, M12, M21 and M22, outputs Out1 and Out2, T1 ON and T2 OFF. (a) Previous architecture. (b) Proposed new architecture, with the pass transistors TG2, TG3 and TG4 all OFF.]

Fig. 3 Example of reading the LUT (a) [16]; (b) proposed scheme.
[Figure 4: equivalent circuits for the READ of M11. (a) Previous architecture. (b) Proposed new architecture. Labels include VWL1, iWL1, iR11, iR12, R11, R12, R21, R22, R21 + RTG3, R22 + RTG4, RTG2, RT1-ON, RT2-OFF, V1 and V2.]

Fig. 4 Equivalent circuit of the example of reading the LUT (a) [16]; (b) proposed scheme.

In the scheme of [16], the output voltages V1 and V2 of Fig. 4(a) are given by

$V_1 = V_{WL1} \, \dfrac{R_{P1}}{R_{11} + R_{P1}}$    (1)

$V_2 = V_{WL1} \, \dfrac{R_{P2}}{R_{12} + R_{P2}}$    (2)

where $1/R_{P1} = 1/R_{T1\text{-}ON} + 1/R_{21}$ and $1/R_{P2} = 1/R_{T2\text{-}OFF} + 1/R_{22}$ are the parallel resistances between the unselected memristors and the transistors. The READ current iWL1 branches to the unselected WLs; this decreases the magnitude of the output voltage V1 across the load (RT1-ON) and also reduces the output voltage difference between the READ 0 and READ 1 operations. Moreover, when the LUT dimension is increased, the number of unselected memristors (that are connected in parallel to the output) increases, so decreasing the effective load resistance and V1. The worst case scenario occurs when the NSP of the unselected memristor is 1, i.e. at the smallest resistance value (RON).
Fig. 5 shows the difference of output voltages between the READ 1 and READ 0 operations (ΔV = V1(1) - V1(0)) for different values of RON and LUT sizes using [16] and the proposed scheme; as expected, with an increase in memory size, the difference in output voltage decreases due to the decrease in load resistance. Moreover, when RON increases (under a constant ROFF/RON ratio), the output voltage difference decreases; this is caused by the decrease in the amount of current flowing in the circuit.
[Figure 5: plot of the difference in output voltage [V] (axis from 0 to 0.8) against LUT memory size (2x2 to 8x8) for the previous architecture and the proposed new architecture, each with three series: RON = 100, ROFF = 19k; RON = 1k, ROFF = 190k; RON = 10k, ROFF = 1.9M.]

Fig. 5 Difference between READ 0 and 1 operations at different RON and LUT memory sizes under the worst case for [16] and the proposed scheme.

In the proposed scheme, RTG3 (the resistance of TG3) is high (in the order of a few giga ohms), so the effective resistance across RT1-ON remains constant; moreover, the READ current iWL1 does not branch and hence the voltage across the load is given by

$V_1 = V_{WL1} \, \dfrac{R_{T1\text{-}ON}}{R_{11} + R_{T1\text{-}ON}}$    (3)

where V2 = 0. Using the proposed scheme, the magnitude of V1 increases by nearly 150% with respect to [16], yielding a larger output voltage difference between the READ 0 and READ 1 operations. An increase in LUT dimension does not significantly affect the load resistance and hence the output voltage difference remains nearly constant (Fig. 5).
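The effect of the load resistance on the READ margin can be checked numerically with a simple voltage-divider sketch. The following assumptions are made and are not taken from the simulations in this paper: the selected memristor is at RON for READ 1 and at ROFF for READ 0; in the grounded-WL case the n-1 unselected memristors on the same bit line are at RON (worst case) and appear in parallel with the on-state transistor; in the isolated case an off pass transistor is treated as an open circuit; the READ voltage is 1 V. Function names are hypothetical.

```python
R_ON, R_OFF = 100.0, 19e3  # memristor limits in ohms (default values from the text)
R_T_ON = 5552.0            # on-state resistance of the bit-line transistor in ohms


def parallel(*resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


def delta_v_grounded(n, v_in=1.0):
    """READ 1 minus READ 0 output voltage when the unselected WLs are grounded.

    Worst case: the n-1 unselected memristors on the bit line are all at R_ON.
    """
    load = parallel(R_T_ON, *([R_ON] * (n - 1)))
    return v_in * (load / (R_ON + load) - load / (R_OFF + load))


def delta_v_isolated(v_in=1.0):
    """READ 1 minus READ 0 output voltage when the unselected WLs float."""
    load = R_T_ON  # the load does not depend on the LUT dimension
    return v_in * (load / (R_ON + load) - load / (R_OFF + load))
```

Under these assumptions the isolated case gives a margin of roughly 0.76 V at any size, whereas the grounded case gives roughly 0.49 V for n = 2 and shrinks further as n grows, which is the trend described for Fig. 5.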
The distinctive features of the proposed approach are analyzed next.
a) Sneak current path
As shown in Fig. 4(b), iWL1 does not branch to the unselected memristors (R12 and R22) that are connected to the other BL; therefore, the NSPs of these unselected memristors are not affected during the READ operation on R11. This is applicable to a LUT irrespective of its size.
Consider the unselected memristor (R21) that is connected to the same BL as the selected memristor (R11). R21 is unselected by turning off the pass transistor (TG3), which provides a very high resistance in its path. Therefore, the current through this path is infinitesimally small, so the NSP of the unselected memristor is not affected. The proposed scheme does not incur the so-called sneak path current problem. As the memristors in the rows are not connected, V/2 biasing is not required and therefore the proposed scheme does not incur the write half-select problem. It should be noted that in the proposed controller, the simultaneous enabling of WEn and REn is not possible.
b) Complexity analysis
Consider an n x n LUT; the proposed scheme requires n^2 WLs while [16] requires only n WLs. However, WLs are nano wires connected to the controller and therefore an increase in the number of WLs does not cause a significant overhead. More importantly, in terms of the number of inputs driving the controller, the proposed scheme needs one input (C) more than [16], i.e. n+1 input lines. Therefore, in terms of complexity of the input signals, the proposed scheme is not substantially different from [16]. The complexity of the controller design of the proposed scheme (Fig. 2) is nearly the same as [16], except that the number of pass transistors is now n^2 (versus n). The worst case delay (in terms of number of gates) of the control signals in both controllers is the same; however, the worst case delay of the data path in the proposed controller is less than in [16], because the proposed controller does not use multiplexers in the data path.
A comparative analysis of the layout area (only for the memory cells) of the proposed method and previous methods is pursued next. Let the length of a transistor be 2λ with an aspect ratio of 2; the layouts of a 2x2 LUT design for the proposed scheme and [16] are shown in Figs. 6(a) and 6(b) respectively. These layouts use two metal layers and the memristors are placed between metal 1 and metal 2. The layout area required for both the proposed method and [16] is 30λ x 35λ.
Next, the proposed layout is compared with those of the previous methods [11][12]. [11] requires a transistor per WL and a transistor per BL; for a 2x2 LUT, an additional area of at least 16λ^2 is required. [12] requires one transistor per memristor as well as a transistor per bit line; for a 2x2 LUT, an additional area of at least 32λ^2 is required. Therefore, the proposed method requires less layout area than [11] and [12].
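The additional-area figures can be reproduced with a short calculation, with areas expressed in units of λ^2. The interpretation used here is an assumption: the bit-line transistors are common to all schemes, so the extra transistors are one per word line for [11] (n of them in an n x n crossbar) and one per memristor for [12] (n^2 of them). Function names are hypothetical.

```python
def transistor_area(length=2, aspect_ratio=2):
    """Area of one transistor in units of lambda^2 (width = aspect_ratio * length)."""
    return length * (aspect_ratio * length)


def extra_area_per_word_line(n):
    """Extra area when one transistor is added per word line (n word lines)."""
    return n * transistor_area()


def extra_area_per_memristor(n):
    """Extra area when one transistor is added per memristor (n * n cells)."""
    return n * n * transistor_area()
```

For a 2x2 LUT, each transistor occupies 2λ x 4λ = 8λ^2, so this sketch gives 16λ^2 for the per-word-line case and 32λ^2 for the per-memristor case.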

Fig. 6(a) Layout of 2x2 LUT design of proposed scheme



Fig. 6(b) Layout of 2x2 LUT design scheme of [16]

c) NSP Variation analysis


The changes to the NSPs of the unselected memristors of a LUT during a WRITE operation are assessed by simulation at different sizes of the LUT. The simulation results are shown in Table 3. As expected, the NSPs of the unselected memristors in the proposed method do not change during either the WRITE 1 or the WRITE 0 operation; this condition has been verified for different sizes of LUTs (up to dimension 8). The NSPs of some of the unselected memristors in the designs of [11][12][16] change, and the magnitude of this change is such that the threshold value is often reached. Thus, in addition to preventing sneak path currents, the proposed method also preserves data integrity.
Table 3
Comparison of NSPs during WRITE operation
Column groups: WRITE 1 operation (No. of affected (unselected) memristors; worst case NSP among unselected memristors with initial value of 0) and WRITE 0 operation (No. of affected (unselected) memristors; worst case NSP among unselected memristors with initial value of 1), each reported for the proposed method, [16] and [11]&[12].
All series are listed for LUT sizes 2X2, 4X4, 6X6, 8X8.
Proposed method, No. of affected memristors (WRITE 1 and WRITE 0): 0, 0, 0, 0
Proposed method, worst case NSP with initial value of 0: 0, 0, 0, 0
Proposed method, worst case NSP with initial value of 1: 1, 1, 1, 1
[16] and [11]&[12], No. of affected memristors: 1, 3, 5, 7; 2, 6, 10, 14; 2, 6, 10, 14; 2, 6, 10, 14
[16] and [11]&[12], worst case NSP: 0.179, 0.242, 0.261, 0.269; 0.868, 0.838, 0.827, 0.812; 0.175, 0.173, 0.127, 0.173; 0.016, 0.048, 0.093, 0.140


IV. SIMULATION RESULTS


The proposed LUT has been designed at dimensions of 2, 4, 6 and 8 and simulated using LTSPICE. The following scenarios are simulated [16]. Scenario 1: WRITE 1 to all memristors. Scenario 2: WRITE 0 to all memristors. Scenario 3: WRITE 0 to a memristor while the NSPs of all memristors are initially 1. Scenario 4: WRITE 1 to a memristor while the NSPs of all memristors are initially 0. Scenario 5: READ 0 when the NSPs of all other memristors are 1. Scenario 6: READ 0 when the NSPs of all memristors are 0. Scenario 7: READ 1 when the NSP of only one memristor is 1 (i.e. the NSPs of all other memristors are 0). Scenario 8: READ 1 when the NSPs of all memristors are 1. The results are compared with the works of [11][12][16].
a) WRITE operation
LUTs of different dimensions using the proposed and previous schemes [11][12][16] are simulated for the WRITE operation by considering the four relevant scenarios (1 to 4). For each of these scenarios, the write delay, the energy dissipation and the Energy Delay Product (EDP) are found by simulation. Then, the average and worst case values for the WRITE operation at different array dimensions are calculated; the average and worst case delay, energy and EDP values are shown in Figs. 7 and 8.
The average and worst case WRITE delays for the proposed scheme are significantly less than for [11][12][16]. As the dimension of the LUT increases, the proposed scheme incurs a significantly smaller WRITE delay than the other schemes. This occurs because in the proposed scheme the WRITE operation is performed in parallel for the memristors connected to a BL; in addition, the average and worst case EDPs of the proposed scheme are significantly less than for [11][12][16]. Furthermore, the simulation results confirm that for the proposed scheme the NSPs of the unselected memristors are unchanged and the two-phase WRITE operation of [16] is not required.
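The averaging over scenarios can be sketched as follows. It is assumed here that the EDP is computed per scenario as delay times energy and then averaged, so that the average EDP generally differs from the product of the average delay and the average energy; the numbers in the usage note are hypothetical and the function name is illustrative.

```python
from statistics import mean


def summarize(scenarios):
    """Average and worst-case metrics from (delay, energy) pairs, one per scenario."""
    delays = [delay for delay, _ in scenarios]
    energies = [energy for _, energy in scenarios]
    edps = [delay * energy for delay, energy in scenarios]  # per-scenario EDP
    return {
        "avg_delay": mean(delays),
        "avg_energy": mean(energies),
        "avg_edp": mean(edps),
        "worst_delay": max(delays),
        "worst_energy": max(energies),
        "worst_edp": max(edps),
    }
```

For two hypothetical scenarios `[(100.0, 10.0), (200.0, 30.0)]`, the average delay is 150 and the average energy is 20, but the average EDP is 3500 rather than 150 x 20 = 3000.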

[Figure 7: three panels plotting (a) Delay [ns], (b) Energy [pJ] and (c) EDP [ns.pJ] against LUT size (2x2, 4x4, 6x6, 8x8) for the proposed scheme, the scheme of [16] and the scheme of [11] & [12].]

Fig. 7 Average WRITE performance vs LUT size; (a) Delay, (b) Energy, and (c) EDP.
[Figure 8: three panels plotting (a) Delay [ns], (b) Energy [pJ] and (c) EDP [ns.pJ] against LUT size (2x2, 4x4, 6x6, 8x8) for the proposed scheme, the scheme of [16] and the scheme of [11] & [12].]

Fig. 8 Worst-Case WRITE performance vs LUT size; (a) Delay, (b) Energy, and (c) EDP.

Next, simulation is performed to compare the WRITE performance of a volatile SRAM-based LUT with that of the proposed non-volatile scheme. The simulation results are shown in Table 4. As expected, the average delay and EDP of the SRAM-based LUTs are significantly less than those of the proposed scheme at all sizes; however, the volatile nature of an SRAM-based LUT restricts its application.
Table 4
Comparison of average WRITE delay, energy and EDP
LUT size  Aver. Delay (ns)     Aver. Energy (pJ)    Aver. EDP (pJ.ns)
          Proposed  SRAM       Proposed  SRAM       Proposed   SRAM
2x2       171.88    0.1031     26.78     0.0016     4912.2     0.000173
4x4       214.1     0.2062     91.08     0.0067     24437.94   0.001386
6x6       256.03    0.3093     198.23    0.0151     70132.36   0.004679
8x8       297.8     0.4124     348.32    0.0268     152715.5   0.011091

b) READ operation
Similar to the performance analysis of the WRITE operation, LUTs of different dimensions for the proposed and previous schemes
[11][12][16] are simulated for the READ operation by considering the four relevant scenarios (5 to 8). For each scenario, the read
delay, the energy dissipation and the Energy Delay Product (EDP) are found; then, the average and worst values at different array
dimensions are calculated. Figs. 9 and 10 show the plots for the average and worst case values of the delay, energy and EDP
respectively.
[Figure 9: three panels plotting (a) Delay [fs], (b) Energy [fJ] and (c) EDP [fs.fJ] against LUT size (2x2, 4x4, 6x6, 8x8) for the proposed scheme, the scheme of [16] and the scheme of [11] & [12].]

Fig. 9 Average READ performance vs LUT size; (a) Delay, (b) Energy, and (c) EDP.

The average and worst case READ delays remain constant and nearly independent of the LUT dimension in the proposed scheme. As explained previously, this is caused by the nearly constant value of the load resistance; in addition, the average and worst case READ delays of the proposed scheme are significantly lower than those of [11][12][16], thus confirming that the proposed scheme is capable of delivering a significantly faster READ operation, which is a very important feature for an FPGA. Also, the average and worst case EDPs of the proposed scheme are significantly less than in [11][12][16].

[Figure 10: three panels plotting (a) Delay [fs], (b) Energy [fJ] and (c) EDP [fs.fJ] against LUT size (2x2, 4x4, 6x6, 8x8) for the proposed scheme, the scheme of [16] and the scheme of [11] & [12].]

Fig. 10 Worst-Case READ performance vs LUT size; (a) Delay, (b) Energy, and (c) EDP.

The simulation results of the READ operation are also compared with those of SRAM-based LUTs of different sizes at 32nm; as shown in Table 5, the proposed scheme requires a significantly smaller READ delay and EDP than an SRAM-based LUT. This feature shows that the proposed design is suitable for FPGAs, because more READ operations are normally performed than WRITE operations [1].
Table 5
Comparison of average READ delay, energy and EDP
LUT size  Aver. Delay (fs)      Aver. Energy (fJ)    Aver. EDP (fJ.fs)
          Proposed  SRAM        Proposed  SRAM       Proposed  SRAM
2x2       35.310    320255      2.405     0.466      84.92     149313
4x4       35.310    1094964     4.744     1.718      167.52    1881392
6x6       35.310    3364084     7.083     9.3831     250.12    3154677
8x8       35.310    14922750    9.422     692.839    332.71    10339063187


c) Evaluation on benchmark circuits


The proposed LUT design is also evaluated for FPGA implementation and compared with the non-volatile LUT design of [16] using the ISCAS89 sequential benchmark circuits. [16] has already shown that its non-volatile LUT design outperforms an SRAM-based LUT design for FPGA implementation. As in [16], mapping is performed using the Xilinx Virtex4 FPGA (XC4VLX100) and the Virtex5 FPGA (XC5VLX220), replacing each LUT with the non-volatile version using the proposed memristor-based scheme. The interconnect routing of the benchmark circuits is kept the same as in the Xilinx FPGAs. The average delay and energy required by the LUT designs (proposed, [16] and SRAM) during the WRITE and READ operations are found for different benchmark circuits. The results are presented in Tables 6 and 7; the average delay and average EDP required by the proposed LUT design during the WRITE and READ operations are significantly less than for the scheme proposed in [16]. This results from the way the memristors are connected in the proposed scheme, which performs the WRITE operation on all the memristors connected to a BL simultaneously. In the READ operation, the proposed LUT scheme requires substantially less delay and energy than the LUT scheme of [16]. This is a vital feature for an FPGA, in which READ operations dominate once the FPGA is configured. Therefore, the proposed LUT scheme can significantly improve the performance of the FPGA.
Next, the same performance metrics that have been used in [16] to compare the non-volatile LUT design with a SRAM-based LUT
design are also used in this manuscript. The average performance of the proposed method is found by estimating the number of
consecutive READ operations required immediately following a WRITE, such that total delay time incurred by the proposed scheme
is equal to the delay incurred by the LUT scheme of [16]. Let TO,I,B denote the average delay for operation O (either READ, R or
WRITE, W) using scheme I (I=SRAM or I=MEM) for benchmark B; in a similar manner, it is possible to define E O,I,B for the average
EDP under the same cases. Let
TO,I,B= tW,I,B + KT tR,I,B

(4)

EO,I,B= eW,I,B + KE eR,I,B

(5)

Equations (4) and (5) denote the average performance metrics (delay and EDP) of a given scheme for a benchmark under a single
WRITE followed by $K_T$ or $K_E$ consecutive READ operations. For each benchmark circuit, $K_T$ and $K_E$ are the least integers for
which $T_{O,MEM,B}$ does not exceed $T_{O,SRAM,B}$ and $E_{O,MEM,B}$ does not exceed $E_{O,SRAM,B}$, respectively; this is the
break-even point beyond which the larger WRITE cost of the memristor-based LUT is offset by its smaller READ cost. The results for
$K_T$ for the different benchmark circuits when mapped to the Virtex4 and Virtex5 FPGAs are plotted in Fig. 11 (the data for the
SRAM-based FPGAs is from [16]).
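
As a rough illustration of how (4) and (5) lead to a break-even count, the sketch below solves t_W,MEM + K t_R,MEM <= t_W,SRAM + K t_R,SRAM for the least integer K. The helper name `break_even_reads` and the choice of the first row of Table 6 (benchmark 298 on the Virtex4) as input are assumptions made here for illustration only; the manuscript reports averages of KT and KE rather than per-benchmark values, so the numbers produced are indicative and are not results quoted from the paper.

```python
import math


def break_even_reads(w_mem, r_mem, w_sram, r_sram):
    """Least integer K with w_mem + K*r_mem <= w_sram + K*r_sram.

    Hypothetical helper (not from the paper). The arguments are the average
    WRITE and READ costs (delay in ns, or EDP in pJ.ns) of the memristor-based
    (MEM) and SRAM-based LUTs for one benchmark.
    """
    if w_mem <= w_sram:
        return 0  # MEM is already no worse after the WRITE alone
    if r_mem >= r_sram:
        raise ValueError("MEM READ cost must be below the SRAM READ cost to break even")
    # Rearranged inequality: K >= (w_mem - w_sram) / (r_sram - r_mem)
    return math.ceil((w_mem - w_sram) / (r_sram - r_mem))


# First row of Table 6 (Virtex4), values taken as listed in the table.
k_t = break_even_reads(w_mem=5105.18, r_mem=9.53e-4, w_sram=3.917, r_sram=17.168)
k_e = break_even_reads(w_mem=7.30e6, r_mem=8.64e-5, w_sram=0.38903, r_sram=0.452)
print(k_t, k_e)
```

For this row the delay break-even is reached after 298 consecutive READ operations and the EDP break-even after roughly 1.6x10^7, which is of the same order as the Virtex4 averages reported in the text.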

Fig. 11 $K_T$ for different benchmark circuits (Virtex 4 and Virtex 5)

Fig. 12 $T_{KT}$ (time taken, in ns) for different benchmark circuits (Virtex 4 and Virtex 5)

The average values of $K_T$ are 281 and 131 for the Virtex4 and Virtex5 FPGAs, respectively; the corresponding average values of
$K_T$ in [16] are 656 and 490. The total time for a WRITE followed by $K_T$ READ operations in each benchmark circuit (denoted by
$T_{KT}$) is found and plotted in Fig. 12; $T_{KT}$ is a few nanoseconds for each benchmark circuit.

Fig. 13 $K_E$ for different benchmark circuits (Virtex 4 and Virtex 5)

The results for $K_E$ for the different benchmark circuits when mapped to the Virtex4 and Virtex5 FPGAs are shown in Fig. 13. The
average values of $K_E$ for the Virtex4 and Virtex5 FPGAs are $1.52\times10^{7}$ and $0.679\times10^{7}$, respectively; by
comparison, the average values of $K_E$ for the Virtex4 and Virtex5 in [16] are $4.18\times10^{7}$ and $2.88\times10^{7}$. The total
time for a WRITE followed by $K_E$ READ operations in each benchmark circuit (denoted by $T_{KE}$) is plotted in Fig. 14; it is
less than a millisecond.
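
To put the averages quoted above side by side, the short sketch below computes the relative reduction of the average KT and KE with respect to [16]. The helper name `reduction` is introduced here only for illustration; the inputs are the averages stated in the text, and the resulting percentages are derived arithmetic, not figures reported by the authors.

```python
def reduction(reference, proposed):
    """Fractional reduction of `proposed` relative to `reference`."""
    return 1.0 - proposed / reference


# Average break-even counts quoted in the text: (value in [16], proposed value).
averages = {
    "K_T Virtex4": (656, 281),
    "K_T Virtex5": (490, 131),
    "K_E Virtex4": (4.18e7, 1.52e7),
    "K_E Virtex5": (2.88e7, 0.679e7),
}

reductions = {name: reduction(ref, prop) for name, (ref, prop) in averages.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.1%} fewer READs than [16]")
```

Under these averages, the break-even point is reached with roughly 57% to 76% fewer consecutive READ operations than in [16].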

Fig. 14 $T_{KE}$ (time taken, in ns) for different benchmark circuits (Virtex 4 and Virtex 5)

Table 6
Evaluation of proposed LUT design and LUT design of [16] by mapping ISCAS89 circuits on Virtex4 FPGA (XC4VLX160)

(a) Number of LUTs

Benchmark   2-input     3-input     4-input
circuit     LUTs        LUTs        LUTs
298         4           6           11
400         11          14          19
510         13          17          55
820         10          26          72
953         25          33          123
1238        27          41          155
1488        26          48          183
5378        65          96          206
15850       140         270         376
35932       1192        489         1307

(b) WRITE operation

Benchmark   Average delay (ns)                  Average EDP (pJ.ns)
circuit     Proposed    [16]        SRAM        Proposed    [16]        SRAM
298         5105.18     11443.12    3.917       7.30E+6     17862481    0.38903
400         10771.22    22960.73    7.938       2.98E+7     69873405    1.50597
510         19853.86    47936.33    16.186      1.24E+8     3.26E+08    7.18203
820         26071.76    62885.94    21.238      2.14E+8     5.6E+08     12.352
953         41975.38    103235.7    34.744      5.72E+8     1.52E+09    33.6919
1238        51920.42    128498.1    43.198      8.84E+8     2.37E+09    52.3959
1488        60149.66    149829.4    50.312      1.19E+9     3.23E+09    71.5095
5378        88277.76    202309.5    68.973      2.26E+9     5.65E+09    123.559
15850       197380      428570.5    147.639     1.03E+10    2.46E+10    532.564
35932       652808.3    1435265     493.230     1.15E+11    2.78E+11    6031.66

(c) READ operation

Benchmark   Average delay (ns)                  Average EDP (pJ.ns)
circuit     Proposed    [16]        SRAM        Proposed    [16]        SRAM
298         9.53E-4     0.1137      17.168      8.64E-5     0.452       0.452
400         20.4E-4     0.2313      33.293      3.76E-4     1.691       1.691
510         36.01E-4    0.4672      75.274      1.34E-3     8.761       8.761
820         47.3E-4     0.6131      98.692      2.32E-3     15.05       15.05
953         75.5E-4     1.0016      163.822     6.06E-3     41.56       41.56
1238        93.2E-4     1.2448      204.626     9.29E-3     64.88       64.88
1488        107.6E-4    1.4491      239.448     1.25E-2     88.89       88.89
5378        163.4E-4    1.9989      307.866     2.60E-2     145.82      145.82
15850       372.8E-4    4.296       629.474     1.27E-1     606.08      606.08
35932       1227.7E-4   14.34       2126.055    1.4019      6923.80     6923.80

Table 7
Evaluation of proposed LUT design and LUT design of [16] by mapping ISCAS89 circuits on Virtex5 FPGA (XC5VLX220)

(a) Number of LUTs

Benchmark   2-input     3-input     4-input     5-input     6-input
circuit     LUTs        LUTs        LUTs        LUTs        LUTs
298         1           3           3           3           4
400         2           5           6           9           4
510         0           2           5           11          17
820         4           3           7           25          47
953         13          9           10          31          71
1238        9           9           19          39          60
1488        11          10          24          24          76
5378        32          41          49          59          80
15850       89          94          96          197         163
35932       1152        326         380         129         542

(b) WRITE operation

Benchmark     Average delay (ns)                        Average EDP (pJ.ns)
circuit       Proposed      [16]          SRAM          Proposed      [16]          SRAM
298           4321.26       14937.09      4.22          1.03E+7       38974005      0.7553
400           8392.16       25812.24      7.83          3.27E+7       1.09E+08      2.25
510           11530.82      49140.09      12.99         9.78E+7       4.52E+08      8.35
820           27919.1       124419.93     32.167        6.09E+8       2.95E+09      53.31
953           41887.28      183517.56     47.32         1.34E+9       6.41E+09      115.22
1238          43276.46      178134.89     47.52         1.31E+9       5.89E+09      109.36
1488          43376.28      192626.61     49.38         1.46E+9       7.09E+09      126.85
5378          79173.02      283224.22     79.18         3.65E+9       1.42E+10      272.80
15850         201061.16     684171.86     196.81        2.18E+10      8.08E+10      1593.16
35932         608074.92     1897444.42    541.06        1.79E+11      6.13E+11      11733.93

(c) READ operation

Benchmark     Average delay (ns)                        Average EDP (pJ.ns)
circuit       Proposed      [16]          SRAM          Proposed      [16]          SRAM
298           0.0007062     0.1198        29.61         6.865E-5      0.0703        1.3686
400           0.0014124     0.223         47.64         25.4803E-5    0.2109        3.5403
510           0.00169488    0.3640        105.30        50.4874E-5    0.7648        17.382
820           0.00402534    0.899         271.46        296.785E-5    4.87          115.528
953           0.00614394    1.325         399.73        666.669E-5    10.54         250.254
1238          0.00649704    1.334         377.64        708.452E-5    10.04         223.397
1488          0.00632049    1.382         421.63        715.602E-5    11.60         278.452
5378          0.01274691    2.240         569.75        2320.16E-5    25.31         506.961
15850         0.0328383     5.580         1339.15       14864.35E-5   148.31        2800.221
35932         0.10536504    15.451        3650.18       131411.44E-5  1105.20       20665.076

The results of Tables 6 and 7 have been plotted for $K_T$ and $K_E$, thus comparing the proposed approach with [16]; Figs. 15, 16, 17
and 18 show the results for the Virtex4, while Figs. 19, 20, 21 and 22 show the results for the Virtex5 (a logarithmic scale is utilized
in the plots for both $T_{KT}$ and $T_{KE}$). These plots show that the proposed approach improves substantially over [16] (which has
already shown an improvement over SRAM-based FPGAs); the improvements are substantial when the timing figures of merit ($T_{KT}$
and $T_{KE}$) are evaluated, thus further reducing the number of consecutive READ operations that must follow a WRITE to reduce
the delay and power consumption.

Fig. 15 $K_T$ for different benchmark circuits (Virtex4)

Fig. 16 Log plot of $T_{KT}$ for different benchmark circuits (Virtex4)

Fig. 17 $K_E$ for different benchmark circuits (Virtex4)

Fig. 18 Log plot of $T_{KE}$ for different benchmark circuits (Virtex4)

Fig. 19 $K_T$ for different benchmark circuits (Virtex5)

Fig. 20 Log plot of $T_{KT}$ for different benchmark circuits (Virtex5)

Fig. 21 $K_E$ for different benchmark circuits (Virtex5)

Fig. 22 Log plot of $T_{KE}$ for different benchmark circuits (Virtex5)

V. CONCLUSION
This paper has proposed a new LUT scheme that utilizes memristors as non-volatile storage elements. This LUT is amenable to FPGA
implementation and, unlike previous works [11][12][16], it does not utilize a nano crossbar as its memory scheme. The proposed scheme
uses an independent selection circuit such that the columns are isolated; this arrangement does not generate sneak path currents,
thus avoiding changes to the state of unselected memristors during a memory operation. One of the advantages of the
proposed scheme is that it permits a simultaneous WRITE operation to all memristors connected to a BL; therefore, the WRITE time
decreases considerably. Its most significant advantage, however, is that its READ delay is significantly less than that of previous
schemes for both the average and worst-case scenarios. In terms of hardware, unlike [11][12], there is no write half-select problem;
hence, it requires a smaller number of power rails for its operation (similar to [16]). The proposed scheme is viable for LUT designs
in commercially available FPGAs (with at most six inputs). However, for larger memory designs, the proposed scheme requires more IO
lines; this may limit its applicability due to more pronounced parasitic effects and a larger area overhead.
REFERENCES
[1] S. Kilts, Advanced FPGA Design: Architecture, Implementation, and Optimization, Wiley-IEEE Press, 2007.
[2] Xilinx Spartan Data sheet, http://www.xilinx.com.
[3] Xilinx Spartan-3AN FPGAs, http://www.xilinx.com.
[4] ITRS, International Technology Roadmap for Semiconductors 2011 Edition, Executive Summary, 2011.
[5] B. S. Deepaksubramanyan and A. Nuñez, "Analysis of Subthreshold Leakage Reduction in CMOS Digital Circuits," in Proc. 50th Midwest Symposium on Circuits and Systems, 2007, no. 1, pp. 1-8.
[6] L. O. Chua, "Memristor - the missing circuit element," IEEE Transactions on Circuit Theory, vol. CT-18, no. 5, pp. 507-519, Sep. 1971.
[7] J. J. Yang, M. D. Pickett, X. Li, D. A. A. Ohlberg, D. R. Stewart and R. S. Williams, "Memristive switching mechanism for metal/oxide/metal nanodevices," Nature Nanotechnology, vol. 3, pp. 429-433, 2008.
[8] J. Cong and B. Xiao, "mrFPGA: A Novel FPGA Scheme with Memristor-Based Reconfiguration," in Proc. IEEE/ACM International Symposium on Nanoscale Architectures, 2011, pp. 1-8.
[9] S. Tanachutiwat, M. Liu, and W. Wang, "FPGA Based on Integration of CMOS and RRAM," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 11, pp. 2023-2032, Nov. 2011.
[10] O. Turkyilmaz, S. Onkaraiah, M. Reyboz, F. Clermidy, Hraziia, C. Anghel, J. M. Portal, M. Bocquet, "RRAM-based FPGA for Normally Off, Instantly On Applications," in Proc. IEEE/ACM International Symposium on Nanoscale Architectures, 2012, pp. 101-108.
[11] Y. Ho, Garng M. Huang, and P. Li, "Dynamical Properties and Design Analysis for Nonvolatile Memristor Memories," IEEE Transactions on Circuits and Systems-I, vol. 58, no. 4, April 2011.
[12] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design Implications of Memristor-Based RRAM Cross-Point Structures," in Proc. of Design, Automation and Test in Europe, 2011, pp. 1-6.
[13] A. Chen, "Accessibility of Nano-Crossbar arrays of resistive switching devices," in Proc. IEEE International Conference on Nanotechnology, 2011, pp. 1767-1771.
[14] I. E. Ebong and P. Mazumder, "Self-Controlled Writing and Erasing in a Memristor Crossbar Memory," IEEE Transactions on Nanotechnology, vol. 10, no. 6, Nov. 2011.
[15] Jiale Liang and H.-S. P. Wong, "Cross-Point Memory Array Without Cell Selectors - Device Characteristics and Data Storage Pattern Dependencies," IEEE Transactions on Electron Devices, vol. 57, no. 10, Oct. 2010.
[16] H. A. F. Almurib, T. N. Kumar and F. Lombardi, "A Memristor-Based LUT For FPGAs," in Proc. of 9th IEEE International Conference on Nano/Micro Engineered and Molecular Systems (IEEE-NEMS), 2014, pp. 448-453 (extended manuscript to appear as "Design and Evaluation of a Memristor-Based LUT for Non-Volatile FPGAs," in Proc. IET Circuits, Devices & Systems, accepted October 2015).
[17] T. N. Kumar, H. A. F. Almurib and F. Lombardi, "On the Operational Features and Performance of a Memristor-Based Cell for a LUT of an FPGA," in Proc. of 13th IEEE International Conference on Nanotechnology, 2013, pp. 71-76.
[18] Z. Biolek, D. Biolek and V. Biolková, "SPICE Model of Memristor with Nonlinear Dopant Drift," Radioengineering, vol. 18, no. 2, pp. 210-214, 2009.
[19] T. N. Kumar, H. A. F. Almurib and F. Lombardi, "A Novel Design of a Memristor-Based Look-Up Table (LUT) For FPGA," in IEEE Asia Pacific Conference on Circuits & Systems, 2014, pp. 703-706.
[20] Predictive Technology Model (PTM) website, http://ptm.asu.edu/.

Highlights

A new LUT scheme that utilizes memristors as non-volatile storage elements and does not utilize a nano crossbar as its memory scheme.
This LUT scheme does not generate sneak path currents, thus avoiding changes to the state of unselected memristors during a memory operation.
It permits a simultaneous WRITE operation to all memristors connected to a BL; therefore, the WRITE time decreases considerably.
The READ delay is significantly less than that of previous schemes for both the average and worst-case scenarios; hence, the scheme is suitable for FPGAs.
There is no write half-select problem; hence, it requires a smaller number of power rails for its operation.
