
1910 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 63, NO. 11, NOVEMBER 2016

A 9-T 833-MHz 1.72-fJ/Bit/Search Quasi-Static Ternary Fully Associative Cache Tag With Selective Matchline Evaluation for Wire Speed Applications
Sandeep Mishra, Member, IEEE, Telajala Venkata Mahendra, and Anup Dandapat, Senior Member, IEEE

Abstract—Hardware search engine (HSE) plays a major role to speed up the search operation in wireless applications. Ternary content addressable memory (TCAM) is such an engine which performs the search in a single clock cycle but the use of separate content and mask storage, various wordlines for read/mask/write, and decoupled data/search lines require substantial design area and consume relatively high power. This article proposes implementation of a state-of-the-art energy-efficient quasi-static ternary fully associative cache tag for wire speed memory access. A 4-T static content and dynamic mask storage have been used with coupled data and search line for reducing the energy dissipation during search. The proposed 128 × 32-bit TCAM tag with selective matchline evaluation scheme has been implemented with predictive 45-nm CMOS process and simulated in SPECTRE at the supply voltage of 1.0 V. The design dissipates an energy of 1.72 fJ/bit/search with a reduction of 32% in the cell area compared to the traditional TCAM.

Index Terms—Associative cache tag, content addressable memory (CAM), high density CAM, low-power design, search engines, ternary CAM.

Manuscript received January 19, 2016; revised April 26, 2016; accepted July 11, 2016. Date of current version October 25, 2016. This work was supported in part by the Ministry of Human Resource Development, Government of India. This paper was recommended by Associate Editor V. Erraguntla.
The authors are with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya, Shillong 793003, India (e-mail: ssandeep.mmishra@nitm.ac.in; telajalamahendra@nitm.ac.in; anup.dandapat@nitm.ac.in).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2016.2592182
1549-8328 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

A data communication network routes packets of data by maintaining a lookup table (LUT), where information regarding each packet's destination is maintained. A cache memory is used in this regard for storage and faster access (wire-speed access) [1]–[4]. In a conventional direct-mapped cache, the cache miss rate is likely to be high due to the continuous refresh in the cache manager. Registers and the level 1 (L1) cache are fastest, and the performance degrades for the level 2 (L2) cache and the main memory. The moderate-sized L2 cache is often used for accessing frequently searched information. In conventional searching, the cache controller provides the address of frequently searched data to the cache rather than the main memory for faster data access.

A fully associative cache must be used so that any location in main memory can be associated with the cache and the issue of contention for cache locations can be solved, but this requires a (serial) search of the entire cache tag. The software match routine is slow even when fast matching algorithms are used. So, a content addressable memory (CAM) is often used in place of the software cache tag, as presented in Fig. 1. It is often called a hardware cache tag and performs the search in a single clock cycle, but at the cost of additional storage area.

Fig. 1. Memory organization of a CPU considering CAM as cache tag.

Unlike a random access memory (RAM), a CAM renders an accelerated data search medium by comparing the search data with prestored contents in a single clock cycle. In addition to the states of the basic CAM, a ternary CAM (TCAM), also called threefold memory, uses a supplementary don't care (or "X") state. During the search operation, the input is prefetched to the match index and a simultaneous comparison is carried out with previously loaded data. TCAM is an efficient search engine, which makes it suitable for asynchronous transfer mode switching and fast lookup in network routing [5]–[9]. Although the search is fast, the large number of storage cells and interconnections occupies substantial design area and makes the TCAM more power hungry. Thus, efficient low-power techniques and a high-density storage approach must be employed in designing a TCAM.

Algorithmic approaches have been implemented to reduce the TCAM lookup [10], [11]. These techniques help in reducing the power consumption but at the cost of performance degradation. In [12], a unique choking current method has been proposed to reduce the power consumption with speed boosting. Dynamic designs have been presented for high-density storage with low leakage requirements, but proper synchronization between data retention and the refresh cycle is too complex and increases the energy dissipation [13]–[16]. Tsai et al. have used a reflex charge equating scheme to minimize the power consumed by the matchline (ML) [17].

In [18], a NAND-type circuit has been partitioned into two segments with different capacities that operate consecutively, resulting in lower power consumption. However, the matching probability and the design of the pre-computation circuitry decide the power consumption in this technique. Low-power designs have been proposed based on power reduction in the highly capacitive matchlines [19]–[23]. A pie-sigma ML scheme has been used in [19], where NAND and NOR cells have been realized by the pie and sigma segments, respectively. An interfacing logic has been used between these segments to avoid the short-circuit current. The voltage detector current has been recycled to charge up the ML for reducing energy [21].

An efficient high-density cache tag must be designed that performs the matchline charge sense (match) in near zero time. A dynamic CAM (DCAM) is the best quick fix but has a low data retention time that requires proper synchronization between the refresh and search cycles to function [13]–[16]. High-density static CAMs (SCAMs) do not suffer from this issue, but the power consumption is significant [6], [24]. In [24], two latches have been used to store three logic values, similar to the design presented in [6]. These designs use the fewest transistors among available TCAMs. Two metal rails, VDDML and VDD, have been introduced to power up the data and mask storage cell [25].

Unique arrangements in the conventional TCAMs have been carried out to reduce the energy dissipation [26], [27]. The mask bits with only logic 1 value have been separated from those having only logic 0 value [26]. All other cells except the boundary cells in different segments have been self gated. The non-segmented architectures face the challenge of high leakage power consumption [28], [29]. The segmented architecture resolves this issue, but the cell count remains the same. CAM cells are arranged in an alternate fashion in a dual bit content addressable memory (DBCAM) to store logic 0 and 1 separately [30], [31]. The storage cell requirement is reduced by half in comparison with the conventional TCAM presented in [32], yet both suffer from complex matchline control and lead to a smaller hit rate.

The proposed architecture provides a high-density fully associative cache tag that reduces lookup load by using a hardware cache tag prior to the L2 cache storage. The use of cross-coupled inverters for data storage in the conventional TCAM leads to additional leakage. Therefore, a faster quasi-static TCAM approach has been employed in designing the cache tag, which also helps in reducing the design area. The rest of this paper is structured as follows: Section II describes the background on high-density TCAMs. Next, we introduce the quasi-static TCAM, followed by the selective matchline evaluation scheme in Section IV. Analysis of the measured results is presented in Section V, and we conclude the paper in Section VI.

II. BACKGROUND: HIGH-DENSITY TERNARY CONTENT ADDRESSABLE MEMORIES

Set associative storage cells have been popular among designers for high-density memory architectures. Asymmetric static storage requires less design area with performance similar to that of conventional static TCAMs (SCAMs). This approach consumes considerable power, while the complementary data search provides a faster matchline charge-up or charge-down. Separate storage cells for content and mask have been used in the conventional swapped-XOR TCAM, as shown in Fig. 2(a). The mask value (M) and evaluation result (E) go through a NAND-based ML sensing. A conventional TCAM writes and reads through three wordlines [data wordline (DWL), mask wordline (MWL), and read wordline (RWL)]. This approach (decoupled data and search line with separate wordlines) increases the interconnection length and the size of the TCAM macro.

Fig. 2. TCAM architectures. (a) 18-T conventional swapped-XOR TCAM. (b) 12-T compact TCAM comprising two 4-T static storage. (c) Proposed 9-T quasi-static TCAM comprising 4-T static storage and dynamic mask storage.

TABLE I
STATE TABLE COMPARISON OF TCAM ARCHITECTURES (Q: STORAGE NODE; SL: SEARCHLINE; E: EVALUATION STATE; ML: MATCHLINE; XM: MASK INPUT; BL: BITLINE/SEARCHLINE OF PROPOSED DESIGN)

A more compact design is presented in Fig. 2(b) that uses two asymmetric static storage cells. The presented four-transistor (4-T) static storage has been used by many designers, but for clarity the design depicted in Fig. 2(b) has been used in the comparison. Local masking is implemented by storing "0" in nodes H1 and H2, while global masking is performed by providing "0" to SL0 and SL1. The 4-T static storage pair stores logic 1 strongly, but storing "0" at nodes S1 and S2 requires a synchronized bias voltage generation. Furthermore, the use of two storage cells increases the overall energy dissipation, and the decoupled data and searchline enlarge the cell layout.
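The logical effect of local and global masking can be summarised with a small behavioural model. The sketch below is not a circuit model of any cell in Fig. 2; it only captures the outcome of a word search. It assumes, purely for illustration, that stored words are written as strings over '0', '1' and 'X' (a locally masked bit) and that global masking is given as a set of bit positions excluded from the comparison for every word. The names `word_matches`, `search` and `global_mask` are hypothetical.

```python
from typing import FrozenSet, List


def word_matches(stored: str, key: str,
                 global_mask: FrozenSet[int] = frozenset()) -> bool:
    """Return True if a ternary stored word matches a binary search key.

    stored      -- string over {'0', '1', 'X'}; 'X' marks a locally masked bit
    key         -- string over {'0', '1'} with the same width as `stored`
    global_mask -- bit positions (0 = leftmost) ignored for every word
    """
    if len(stored) != len(key):
        raise ValueError("stored word and search key must have equal width")
    for pos, (s, k) in enumerate(zip(stored, key)):
        if pos in global_mask or s == "X":
            continue  # masked bit: it can never cause a mismatch
        if s != k:
            return False  # one unmasked mismatch makes the whole word mismatch
    return True


def search(array: List[str], key: str,
           global_mask: FrozenSet[int] = frozenset()) -> List[int]:
    """Compare the key against every stored word and return matching indices."""
    return [i for i, word in enumerate(array)
            if word_matches(word, key, global_mask)]
```

For example, with the stored words "1010", "10X0" and "0110" and the key "1000", only the second word matches; masking bit position 2 globally makes the first word match as well.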

III. QUASI-STATIC TERNARY CONTENT ADDRESSABLE MEMORY

A state-of-the-art quasi-static TCAM (QSTCAM) is presented in Fig. 2(c). The storage cell provides two complementary soft nodes ($Q$ and $\overline{Q}$) similar to the conventional TCAM (hard) storage. The additional "$\overline{Q}$" does not contribute to the evaluation of search data but helps in maintaining logic 0 and 1 values at node "$Q$." The static storage exhibits dc characteristics with static noise margins of 752 mV and 126 mV, with a trivial variation of 21.9% in the threshold voltage over process corner variation (presented in the Appendix). A coupled data and searchline (BL/$\overline{BL}$) has been provided to the sources of transistors T5 and T6. The data wordline (DWL) is used to write the data values (BL and $\overline{BL}$), and the mask wordline (MWL) is provided to write the mask (XM) through transistor T8. The dynamic masking approach has been employed, where the mask storage net (N1) is destructive but the separated MWL can be activated at any period without changing the storage node ($Q$ and $\overline{Q}$) values. MWL is activated during each precharge phase to ensure a valid mask value at N1 during search, which is controlled through the mask-line driver. The conventional swapped-XOR TCAM [23] uses 18 transistors with 7 I/O ports (18-T–7-I/O), while the compact TCAM [24] takes a minimum of 12-T–6-I/O. The proposed quasi-static approach (static content and dynamic mask storage) requires only 9 transistors with 6 ports (9-T–6-I/O), making it suitable for high-density design requirements.

The state table comparison of these designs is summarized in Table I. The masking in the proposed QSTCAM is very similar to the conventional design, with the exception of complementary mask values. A low value at net N2 ["0" or small charge (L)] keeps the ML transistor (T9) in the cut-off state, while a high value ["1" or near $V_{DD}$ (H)] discharges the ML to ground. Dissimilar logic values at SL and H result in a match in the compact TCAM. From the state table it is clear that the compact and proposed TCAMs have weak logic values at various nets, which results in higher power consumption.

Searchline capacitance ($C_{SL}$), matchline capacitance ($C_{ML}$) at each mismatch state, and the matchline swing primarily affect the overall power consumption of a TCAM. The searchline power for $S$ searchlines is

$$P_{\text{search}} = \frac{1}{2} S\, C_{SL} V_{DD}^{2} f. \tag{1}$$

Considering a TCAM array of $m$ words × $n$ bits, (1) can be written as

$$P_{\text{search}} = \frac{1}{2} m n\, C_{SL} V_{DD}^{2} f. \tag{2}$$

Similarly

$$P_{ML} = m n\, C_{ML} V_{DD}^{2} f. \tag{3}$$

With unequal matchline swing

$$P_{ML} = m n\, C_{ML} V_{DD} V_{MLswing} f. \tag{4}$$

The presented design does not provide control over the matchline swing for maintaining lower interconnects. Therefore, the parameters $C_{SL}$, $C_{ML}$, and $S$ become the deciding factors in the power reduction. The use of a single static storage cell reduces the number of searchline pairs to half; the coupled data and searchline increases the power, but at one end only. The use of a single matchline discharge transistor also helps in reducing the ML power consumption.

The speed of the TCAM depends on the time it takes to change the matchline charge. The approximated small-signal model of the proposed TCAM is presented in Fig. 3 for the calculation of the discharging time constant for a mismatch case. The model is valid for the search phase, where the nets, ports, and nodes with constant voltage level during the phase are grounded. For a simplified analysis the following assumptions have been considered:
1) Neglecting the effects of conductances ($G_{BD}$ and $G_{BS}$) and resistances ($R_{DB}$ and $R_{SB}$).
2) Neglecting the effects of dependent current sources ($G_{MBS} V_{BS}$, $I_{NRD}$, $I_{D}$, and $I_{NRS}$).
3) Equating the charge effects of DWL, PRE, XM, BL, $\overline{BL}$, $Q$, $\overline{Q}$, MWL, and $V_{DD}$ to zero.

Fig. 3. Small-signal model of the evaluation circuit of proposed QSTCAM.

A two-port model has been designed from the evaluation circuit, as the "Q" charge is constant throughout the precharge and search phases. The Thevenin equivalent resistance ($R_{Th}$) or the voltage gain can be used for the calculation of the discharging time constant. The voltage gain in the model can be calculated by using the ML port as output and node E as input. The gate of T8 (MWL) and the source (XM) are at the same voltage level, thereby isolating its effect from the decision path. The ML delay depends upon the discharging time of the matchline, where the evaluation result decides the discharge of the ML voltage.

The drain-gate capacitances of T9 and T7 ($C_{DG9}$ and $C_{DG7}$) and the effective drain resistances ($R_{DN9}$ and $R_{DP7}$) mainly contribute to the discharging time. The output voltage at port ML ($V_{OUT}$) can be written as a function of the voltage $V_{N2}$ as

$$V_{OUT} = \frac{V_{N2}\, R_{DN9}}{\dfrac{1}{C_{DG9}\, s} + R_{DN9}}. \tag{5}$$

The voltage at net N2 can be expressed as

$$V_{N2} = \frac{V_{IN}\, \dfrac{1}{C_{DG7}\, s}}{\dfrac{1}{C_{DG7}\, s} + R_{DP7}} = \frac{V_{OUT} \left( \dfrac{1}{C_{DG9}\, s} + R_{DN9} \right)}{R_{DN9}}. \tag{6}$$

Therefore, the transfer function for the evaluation circuit can be written as

$$\frac{V_{OUT}}{V_{IN}} = \frac{C_{DG9} R_{DN9}\, s}{\left(1 + C_{DG9} R_{DN9}\, s\right)\left(1 + C_{DG7} R_{DP7}\, s\right)}. \tag{7}$$

The discharging time can be calculated from (7) or from $R_{Th}$ of the approximate model shown in Fig. 3.

IV. SELECTIVE MATCHLINE EVALUATION SCHEME

The matchline charge-up or charge-down depends upon the state of the matchline control transistors (T17 and T18 in the conventional TCAM; T5 and T6 or T11 and T12 in the compact TCAM). In the proposed TCAM, the state of transistor T9 decides the matchline value. In the proposed design depicted in Fig. 2(c), a voltage level at N2 sufficiently below the threshold of T9 ensures the match state. Circuit behavior under the various masking modes for pattern matching is discussed first, followed by an explanation of the normal match and mismatch states.

A. Pattern Matching Implementation

The additional don't care state (X) of the TCAM, as summarized in sequences 5 and 6 of Table I, is significant in longest prefix matching, pattern matching, and fast lookup in network routing [5]–[9]. Local masking provides a strong "0" and global masking supplies a weak but low logic level at N2 to the gate of the ML control transistor T9 shown in Fig. 2(c). At this N2 signal level, the matchline remains charged at VDD. The TCAM array mask distribution shown in Fig. 4(a) considers a 128-word × 32-bit macro with global masking at bit 4 (BL4 = $\overline{BL4}$ = 0) and local masking (X) at the positions shown.

Fig. 4. (a) TCAM array mask distribution (global and local masking). (b) Net and port charge variation in the proposed QSTCAM.

A 3-search timing diagram of a 1-bit TCAM cell is presented in Fig. 4(b). DWL is kept high to write the data and MWL is kept high for writing the mask value (XM) to the net N1. The masking bits are stored in the mask-bit register and driven through the mask-bit driver. The writing strategy of the mask bits into the TCAM cell is very similar to the conventional mechanisms except for the timing. The two storage nodes ($Q$ and $\overline{Q}$) retain the same voltage levels irrespective of the mask value. This property of separating data storage (static) and mask storage (dynamic) reduces the dynamic power consumption due to alternate storage of data and mask values at the same storage cell. The reason for separating the wordlines for data and mask storage is to achieve parallel data and mask write. This increases the frequency of operation by simultaneous search after precharge. During global masking, the sources of both evaluation transistors (T5 and T6) are provided with logic 0, which matches all "Q" values.
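Longest prefix matching, named above as one use of the don't care state, can be illustrated with a short behavioural sketch. It assumes a software table of ternary words in which the trailing bits of each prefix are stored as 'X'. Every word is compared against the key and, as an assumption of this example rather than something specified in the text, the matching word with the most unmasked bits is selected. The function names are hypothetical and the code does not model the circuit of Fig. 2(c).

```python
from typing import List, Optional


def ternary_match(stored: str, key: str) -> bool:
    """A stored bit matches if it is 'X' (don't care) or equals the key bit."""
    return all(s == "X" or s == k for s, k in zip(stored, key))


def longest_prefix_match(table: List[str], key: str) -> Optional[int]:
    """Return the index of the matching word with the most unmasked bits.

    Returns None when no stored word matches the key.
    """
    best_index: Optional[int] = None
    best_length = -1
    for index, word in enumerate(table):
        if len(word) != len(key):
            raise ValueError("all words must have the same width as the key")
        if ternary_match(word, key):
            prefix_length = len(word) - word.count("X")  # unmasked bits
            if prefix_length > best_length:
                best_index, best_length = index, prefix_length
    return best_index
```

With the table ["10XXXXXX", "1011XXXX", "0XXXXXXX"], the key "10110001" matches the first two words and the second is chosen because its prefix is longer, while the key "11000000" matches no word.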

B. Regular Matching Implementation

Asymmetric static storage cells store "0" weakly, which is the reason for using two storage cells to hold alternate logic values. In the proposed TCAM, on the other hand, two storage nodes ($Q$ and $\overline{Q}$) have been used with cross-coupled regenerative feedback, similar to a one-sided asymmetric design. The design function can be visualized from Fig. 4(b) for writing "0" and searching both "0" and "1." The approximate interconnection model of the evaluation circuit is depicted in Fig. 5. The extra transistor TD is used for demonstration purposes only and is not physically present in the final model. The focus is put on the charge distribution among various capacitances during a match case. The input Q is stable due to the cross-coupled feedback paths in the storage cell. The storage node charge Q drives T5 and T6, which pass either a strong BL or a weak $\overline{BL}$ to net E, which charges or discharges $C_{DS5}$ or $C_{DS6}$. During the data write, evaluation net E does not charge, as Q virtually drives itself. Therefore, nets N1 and N2 are coupling free from E during this phase.

Fig. 5. Interconnection model of the evaluation circuit of the proposed QSTCAM cell.

During a match case, the available charge on $C_{DS5}$ or $C_{DS6}$ discharges, thereby blocking the charge path of $C_{GD9}$ and making the net N2 independent of the gate charge of T7 (N1). A strong discharge path to N2 may be provided through transistor TD to avoid the ML coupling during masking, but it increases the cell area and makes net N2 vulnerable during a mismatch case. When a mismatch occurs, even a weak charge on $C_{GD9}$ drives the matchline transistor T9 to discharge the ML voltage. During the write and precharge phases, nets E and N2 carry a low signal to ensure the cut-off state of T9. The net N2 is charged only during sequences 3 and 4 of Table I and is otherwise set to low.

C. Matchline Sense Amplifier

A monoport charge-up sense amplifier shown in Fig. 6 has been implemented for sensing the proposed and compared TCAMs. In conventional sensing, ML is charged to VDD initially and a full row match discharges the matchline to ground. In the proposed scheme, the sense output (MLSO) is charged to VDD during the precharge phase through transistors T3 and T5. ML is precharged simultaneously through T13 to provide a low input at T7 and T10. Transistors T11 and T12 are used to reduce the sudden discharge of MLSO during the phase change from precharge to search. During search, when a mismatch occurs, MLSO discharges to ground through the path T2−T7−T9. When a match occurs, T10 provides an extra charge path to the output, thereby blocking its discharge due to weak N1. The rate of discharge of MLSO during the phase change from precharge to search strongly affects the dynamic power consumption due to the sudden discharging and charging current.

Fig. 6. Proposed charge-up matchline sense amplifier.

Transistors T2 and T3 build a strong "0" at the output (MLSO) during the mismatch state, while the strong T10 and the weak T1, through net N1, keep the MLSO charged during a match state. T8 remains at cut-off all the time but is important for providing isolation between transistors T9 and T12. The isolation is necessary for holding the precharge value at MLSO during the precharge state as well as during the phase change from precharge to search. The proposed sensing scheme is very useful in high-density TCAMs where the matchline control transistors are driven by weak signals.

V. RESULTS AND PERFORMANCE ANALYSIS

The proposed 128 × 32-bit high-density QSTCAM depicted in Fig. 7 has been implemented using the generic process design kit (GPDK) 45-nm CMOS process and simulated in SPECTRE. A comprehensive comparison with the conventional swapped-XOR TCAM [23] and the compact TCAM [24], shown in Fig. 2(a) and (b) respectively, has been carried out for testing the efficiency of our proposed design. For a fair comparison, the compared designs have been scaled to the 45-nm CMOS process and analyzed in the same environment (PVT). These designs have been tested under all environment variations such as temperature, supply voltage, process corner, frequency of operation, and macro size.

Fig. 7. Measurement results of the extracted layout netlist of a 128 × 32-bit QSTCAM (α: Considering full matchline charge; β: Considering acceptable matchline charge.)

The impact of device scaling has also been analyzed to test the sustainability of the proposed design. The nMOS and pMOS cells with thresholds at the typical corner (0.36 and −0.4 V, respectively) have been considered for the analysis, except for the process corner variation in Section V-B. The predictive models of the GPDK at various channel lengths and corners are given in the Appendix. A 32 × 64-bit macro size has been considered for the analysis presented in Section V-A–F, a variable macro size in Section V-B, and a 128 × 32-bit size in Section V-F. Write/search drivers and decoders have not been accounted for in the comparisons presented in Figs. 8–12, 14 and Tables II–IV, VI, so that the impact of the TCAMs themselves on the various performance metrics is clear.

A. Power Performance Comparison

High-density TCAM implementations are achieved by sharing various internal net charges. This comes at the cost of higher power consumption compared to conventional TCAMs, particularly at low supply voltages. We have put effort into including only one static storage for the content storage. The dynamic mask storage consumes less power, and the overall design functions with a little extra power consumption compared to the conventional TCAM. The switching power variation of the compared 32 × 64-bit macros during the transition from precharge to search is shown in Fig. 8. The presence of dual static storage cells in the compact TCAM drives up the power consumption at search transitions. The conventional TCAM settles down faster, but the larger peak and leaky static mask storage affect the average power consumption.

Fig. 8. Switching power variation of compared TCAMs during the phase change from precharge to search at 6 ns.

The compared designs [23], [24] are analyzed by varying the temperature from −20 °C to 100 °C, as shown in Fig. 9(a) and (b). Peak power variation over temperature affects the design functionality under adverse operating conditions. A trivial variation of 1.7% in the case of the proposed QSTCAM, as shown in Fig. 9(a), suits the design limit decision making in this regard. The proposed design consumes the least average power over the given temperature range.

Fig. 9. Power analysis of compared TCAMs. (a) and (b) Peak and average power comparison respectively at varying temperature from −20 °C to 100 °C and VDD of 1 V. (c) Average power comparison for supply voltage scaling from 1.2 to 0.7 V.

Average power variation for varying temperature and supply voltage is presented in Fig. 9(b) and (c), respectively. Small charge variation at low supply voltage makes the compact TCAM [24] the better design at 0.7 V among the compared designs, but 1.8 to 0.8 V is considered to be the nominal supply voltage range by many designers [11], [12], [18], [19], [24], [31]. The NAND-based conventional TCAM [23] does not function below 0.7 V, while the proposed and compact designs work down to a supply voltage of 0.6 V. The compared design performances are presented for a range of 1.2 to 0.7 V due to the unacceptable matchline delay metrics below this range, as discussed in Section V-D.

B. Energy Dissipation and Matchline Delay Analysis

Holding the evaluation and mask value at the same node in both the proposed and compact TCAMs (proposed QSTCAM: N2; compact TCAM: H1 and H2) renders a faster search result than the conventional TCAM. The discharging time constant resulting between the matchline and the evaluation result node decides the matchline delay, as discussed in Section III. Transistors functioning in this decision path (conventional TCAM: TG7, TG8, T17, and T18; compact TCAM: T5 and T6 or T11 and T12; proposed QSTCAM: T5, T7, and T9 or T6, T7, and T9) contribute to the search time. As an approximation, more transistors in the decision path lead to a higher matchline delay.

The ML delay variation over the given temperature range follows the peak power variation. The higher the peak a design reaches, the longer it takes to settle down completely (discharge). A lower average rate of change of 10.65% is found in the ML delay of the conventional TCAM, but the proposed design searches faster in the temperature range from −20 °C to 100 °C. The performance of the proposed design degrades at a lower rate compared to the compact TCAM as the supply voltage is scaled down.

An ML delay increment of 89% and 90.7% for the proposed and compact TCAM, respectively, has been found at a VDD of 0.6 V, and hence this voltage is not considered in the comparison presented. The matchline charge variation of the compared designs for a mismatch case at 27 °C and VDD of 1 V is shown in Fig. 10. The proposed design clearly settles down faster than the compared designs. The comparable energy dissipation and small ML delay provide the best energy delay product (EDP) among the referred TCAMs.

Fig. 10. Matchline charge variation of compared TCAMs at mismatch state. An equal rise and fall time of 10 ps has been considered.

Energy dissipation at various TCAM macros of 32 bits is compared in Fig. 11(a). A similar energy dissipation metric in all macro sizes secures the cascadability of the proposed QSTCAM for forming larger TCAM arrays. The power consumption distribution at various operational phases is presented in Fig. 11(b). The low power consumption of the compact TCAM during the write and precharge phases provides a design advantage at low operating frequency, but the overwhelming value during the search limits its use. The proposed QSTCAM provides a trade-off between static and matchline energy dissipation. The two storage nodes ($Q$ and $\overline{Q}$) increase the static energy dissipation, but it remains well below that of the conventional design.

C. Device Scaling and Process Corner Variation

The stability of the compared designs has been tested at various process corners (threshold voltages at various corners are presented in the Appendix). The proposed design dissipates less at slow corners (FS and SS), while the evaluation result pass transistor (T7) contributes to the smaller ML delay at the fast corner (FF). The compact TCAM advances at the FS corner, but the higher energy dissipation gives the worst EDP at the fast corner. The matchline delay of the conventional TCAM is more sensitive to process corner variation, whereas an average variation of 15% in the EDP makes the proposed design the best, as shown in Fig. 11(c).

The effect of device scaling at the typical corner (TT) is summarized in Table II. CMOS circuits are more prone to leakage as the technology scales down [considering relative threshold voltage (VTH) reduction]. The threshold voltages, however, are not scaled down in that proportion. The leaky static storage cells in the conventional and compact designs are the reason behind the high average power consumption at lower device sizes. The proposed design therefore performs better in all respects as the technology is scaled down and can work further (below 45 nm) by specifying a proper VDD/VTH ratio.

Fig. 11. (a) Energy dissipation comparison at various TCAM macros. (b) Average power distribution among various operation phases. (c) Design sensitivity to
energy-delay at various process corners.

TABLE II
IMPACT OF DEVICE SCALING ON VARIOUS TCAM PERFORMANCE AT TYPICAL CORNER (CYCLE TIME OF 90 ns HAS BEEN CONSIDERED IN THIS COMPARISON)

TABLE III
IMPACT OF PROCESS CORNER VARIATION ON PROPOSED QSTCAM PERFORMANCE AT VARIOUS SUPPLY VOLTAGES (FF: FAST CORNER; FS: FAST nMOS, SLOW pMOS; SF: SLOW nMOS, FAST pMOS; SS: SLOW CORNER; TT: TYPICAL CORNER)

TABLE IV
TCAM SUSTAINABILITY, ENERGY DISSIPATION AND MATCHLINE LOW LOGIC VOLTAGE COMPARISON AT VARIOUS FREQUENCY OF OPERATION

Fig. 12. (a) Matchline delay performance on voltage scaling. (b) EDP comparison for supply voltage scaling from 1.2 to 0.7 V.

The impact of process voltage (PV) variation on the performance of the proposed QSTCAM is summarized in Table III. The design functions well at all corners for VDD of 1.0 and 0.8 V, but the near-threshold effect at slow corners for 0.6 V sets the low voltage limit to 0.7 V. An EDP of 2555.47 fJ × ps can be noted at 0.6 V, which is 93% more than the value at 0.8 V.

D. Frequency Variation

The TCAM sustainability depends upon the frequency of operation for handling a high lookup table access rate. The maximum frequency of operation has been calculated from the full matchline charge-up or charge-down (charge-down in the presented comparison) rather than the ML delay. The compared designs have been tested at an increasing operating frequency from 55 to 833 MHz. The longer settling duration of the conventional TCAM shown in Fig. 10 restricts the match operation at higher search frequency.

The average ML delay variation is minimal at higher frequencies of operation (above 10 MHz) [22], but the matchline voltage swing restricts frequent search operations in a TCAM. A matchline which has not discharged fully (below mV range) is not considered in the frequency assessment. The proposed design can perform at 833 MHz with a higher average reduction rate of 11.4% in the energy dissipation, as summarized in Table IV.
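As a rough cross-check of the headline numbers, the energy per search and the average search power of the full array can be estimated from the reported 1.72 fJ/bit/search, the 128 × 32-bit array size, and the 833-MHz operating frequency. The sketch below assumes that the per-bit energy applies at 833 MHz, that one search is issued every clock cycle, and that drivers and decoders are excluded. The resulting array-level figures are illustrative estimates under those assumptions, not values reported in the paper, and the function names are hypothetical.

```python
ENERGY_FJ_PER_BIT_PER_SEARCH = 1.72   # reported in the title and abstract
WORDS = 128                           # reported array depth
BITS_PER_WORD = 32                    # reported word width
SEARCH_FREQUENCY_HZ = 833e6           # reported maximum frequency


def cycle_time_ns(frequency_hz: float) -> float:
    """Clock period in nanoseconds for a given frequency in hertz."""
    return 1e9 / frequency_hz


def energy_per_search_pj(energy_fj_per_bit: float, words: int, bits: int) -> float:
    """Energy of one search over the whole array, in picojoules."""
    return energy_fj_per_bit * words * bits / 1000.0  # fJ -> pJ


def average_search_power_mw(energy_pj: float, frequency_hz: float) -> float:
    """Average power in milliwatts, assuming one search per clock cycle."""
    return energy_pj * 1e-12 * frequency_hz * 1e3  # J/s -> mW


if __name__ == "__main__":
    e_pj = energy_per_search_pj(ENERGY_FJ_PER_BIT_PER_SEARCH, WORDS, BITS_PER_WORD)
    print(f"cycle time: {cycle_time_ns(SEARCH_FREQUENCY_HZ):.3f} ns")
    print(f"energy per search: {e_pj:.3f} pJ")
    print(f"average search power: {average_search_power_mw(e_pj, SEARCH_FREQUENCY_HZ):.3f} mW")
```

Under these assumptions the array dissipates about 7.05 pJ per search, which corresponds to roughly 5.9 mW at one search per 1.2-ns cycle.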

Fig. 13. Measurement waveforms and performance results of the extracted layout netlist of the proposed QSTCAM.

Fig. 14. Effect of device scaling on the energy dissipation and matchline delay of compared TCAMs. (The device sizes of MOS have been presented in the Appendix.)

E. Low Voltage Analysis

The small average change of 12.7% in the energy dissipation demonstrates the stability of the proposed design. The energy dissipation reduces at a higher rate for the compact TCAM as the supply voltage is scaled down, but the corresponding larger ML delay presented in Fig. 12(a) nullifies this effect, resulting in a higher EDP over the whole voltage range. The normalized EDPs of the compared designs are presented in Fig. 12(b) over a VDD variation of 1.2 to 0.7 V.

The power improvement in the compact TCAM at low supply voltages, as discussed in Section V-A, provides a visible reduction, but the faster hit (match) duration makes the proposed QSTCAM a better design.

The near-threshold effect at the evaluation transistors (T5 and T6) depicted in Fig. 2(c) increases the power consumption at low supply voltages (0.7 V or below). The nonlinearity between the 1.2 to 0.8 V range and 0.7 V in the power consumption graphs [Figs. 9(c) and 12(b)] is visible due to this effect. Searchline voltage scaling techniques can be applied to minimize this effect, but they increase the macro area.
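The trade-off described in this subsection, where a lower energy dissipation can be outweighed by a larger ML delay, can be sketched numerically. The example below assumes that EDP is the plain product of energy per bit per search (fJ) and matchline delay (ps), and that the normalization is taken relative to a chosen baseline design. The design names and the numbers are hypothetical and are not taken from Fig. 12 or from any table in this article.

```python
def edp(energy_fj: float, delay_ps: float) -> float:
    """Energy-delay product in fJ x ps (assumed: simple product)."""
    return energy_fj * delay_ps


def normalized_edp(designs: dict, baseline: str) -> dict:
    """Normalize each design's EDP to that of the baseline design."""
    base = edp(*designs[baseline])
    return {name: edp(e, d) / base for name, (e, d) in designs.items()}


# Hypothetical (energy in fJ, delay in ps) pairs, for illustration only.
designs = {
    "design_a": (2.0, 400.0),  # higher energy, shorter delay
    "design_b": (1.5, 600.0),  # lower energy, longer delay
}

norm = normalized_edp(designs, baseline="design_a")
for name, value in norm.items():
    print(name, round(value, 3))
```

In this hypothetical case, design_b saves 25% in energy but its 50% longer delay leaves it with the higher EDP, which is the kind of effect the voltage scaling comparison points to.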
F. Performance Comparison Summary

The measured waveforms and performance results of the QSTCAM are presented in Fig. 13. The proposed design has been implemented in a core area of 0.046 mm² and performs the search in 430 ps with a moderate energy dissipation of 1.72 fJ/bit/search. The energy-delay comparison with referred designs at various channel lengths is shown in Fig. 14. The conventional TCAM with NAND-type scheme exhibits a good energy dissipation metric at higher device sizes, but the proposed design provides better average EDP over all device sizes among the compared TCAMs.

The performance comparison of energy dissipation, delay and cell density with recently proposed relevant designs has been summarized in Table V. The matchline delay and EDP are better in the proposed QSTCAM compared to most referred designs. In [4], reduction in the energy dissipation has been achieved by using a complex hierarchy searchline scheme to limit the switching activity. Binary CAMs that provide a better EDP metric have been implemented in [23] and [27], but the masking feature of TCAM is essential in many network routing applications [4], [6], [7], [10], [11].

High-density TCAM implementation remains an emerging challenge as CMOS is facing scalability issues. TCAM with static storage requires considerable layout area and is leaky, while fully dynamic CAMs are possible alternatives, though the refreshing and synchronization limit their use [13], [15], [16]. Performance analysis with compared designs [23], [24] has been summarized in Table VI. A comparable design area and power consumption, together with a better ML delay, give the proposed QSTCAM the best EDP among the compared designs.

TABLE V
PERFORMANCE COMPARISON SUMMARY OF REFERRED DESIGNS

TABLE VI
PERFORMANCE COMPARISON SUMMARY OF COMPARED 32 × 64-BIT MACROS AT 27 °C AND VDD OF 1.0 V

VI. CONCLUSION

A state of the art quasi-static ternary content addressable memory has been presented in this article. The common evaluation and mask storage nodes with the selective matchline evaluation scheme perform a faster search, with a low average energy dissipation change of 12.7% over supply voltage scaling. The design exhibits a larger peak during the phase change from precharge to search compared to the conventional TCAM due to the use of two soft storage nodes, but the higher matchline discharge rate yields a low settling time. The proposed 9-T–6-I/O QSTCAM can be used in applications with low-power and high-density storage requirements. The design dissipates 1.72 fJ/bit/search at 1 V and can perform at a cycle time of 1.2 ns. The results show that a 33% reduction in the matchline delay and an average improvement of 62% in the energy delay product have been achieved over the compared architectures.

APPENDIX

Predictive models of the generic process design kit (GPDK) representing the threshold voltages and device sizes have been shown in Tables VII and VIII, respectively.

TABLE VII
PREDICTIVE MODELS OF THE GENERIC PROCESS DESIGN KIT (GPDK) REPRESENTING THE THRESHOLD VOLTAGES [V] AT VARIOUS CORNERS

TABLE VIII
PREDICTIVE MODELS OF THE GENERIC PROCESS DESIGN KIT (GPDK) REPRESENTING THE DEVICE SIZES [nm]

REFERENCES

[1] K. Zheng, C. Hu, H. Lu, and B. Liu, “A TCAM-based distributed parallel IP lookup scheme and performance analysis,” IEEE/ACM Trans. Netw., vol. 14, no. 4, pp. 863–875, Aug. 2006.
[2] Y. D. Kim, H. S. Ahn, S. Kim, and D. K. Jeong, “A high-speed range-matching TCAM for storage-efficient packet classification,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 6, pp. 1221–1230, Jun. 2009.
[3] I. Arsovski, T. Hebig, D. Dobson, and R. Wistort, “A 32 nm 0.58-fJ/bit/search 1-GHz ternary content addressable memory compiler using silicon-aware early-predict late-correct sensing with embedded deep-trench capacitor noise mitigation,” IEEE J. Solid-State Circuits, vol. 48, no. 4, pp. 932–939, Apr. 2013.
[4] P.-T. Huang and W. Hwang, “A 65 nm 0.165 fJ/bit/search 256 × 144 TCAM macro design for IPv6 lookup tables,” IEEE J. Solid-State Circuits, vol. 46, no. 2, pp. 507–519, Feb. 2011.
[5] S. K. Maurya and L. T. Clark, “A dynamic longest prefix matching content addressable memory for IP routing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 6, pp. 963–972, Jun. 2011.
[6] I. Arsovski, T. Chandler, and A. Sheikholeslami, “A ternary content-addressable memory (TCAM) based on 4T static storage and including a current-race sensing scheme,” IEEE J. Solid-State Circuits, vol. 38, no. 1, pp. 155–158, Jan. 2003.
[7] H. Che, Z. Wang, and K. Zheng, “DRES: Dynamic range encoding scheme for TCAM coprocessors,” IEEE Trans. Comput., vol. 57, no. 7, pp. 902–915, Jul. 2008.
[8] P. Maffezzoni, B. Bahr, Z. Zhang, and L. Daniel, “Oscillator array models for associative memory and pattern recognition,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 6, pp. 1591–1598, Jun. 2015.
[9] Y. Sun and M. S. Kim, “A hybrid approach to CAM-based longest prefix matching for IP route lookup,” in Proc. IEEE GLOBECOM, 2010, pp. 1–5.
[10] L. Kosmidis, J. Abella, E. Quiñones, and F. J. Cazorla, “Efficient cache designs for probabilistically analysable real-time systems,” IEEE Trans. Comput., vol. 63, no. 12, pp. 2998–3011, Dec. 2014.
[11] I. Hayashi, T. Amano, N. Watanabe, Y. Yano, Y. Kuroda, M. Shirata, K. Dosaka, K. Nii, H. Noda, and H. Kawai, “A 250-MHz 18-Mb full ternary CAM with low-voltage matchline sensing scheme in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 48, no. 11, pp. 2671–2680, Nov. 2013.
[12] C. Wang, C. Hsu, C. Huang, and J. Wu, “A self-disabled sensing technique for content-addressable memories,” IEEE Trans. Circuits Syst. II, Express Briefs, vol. 57, no. 1, pp. 31–35, Jan. 2010.
[13] V. Lines, A. Ahmed, P. Ma, and S. Ma, “66 MHz 2.3 M ternary dynamic content addressable memory,” in Proc. Record IEEE Int. Workshop Memory Technol., Design Testing, 2000, pp. 101–105.
[14] Y. H. Gong and S. Chung, “Exploiting refresh effect of DRAM read operations: A practical approach to low-power refresh,” IEEE Trans. Comput., vol. 65, no. 5, pp. 1507–1517, May 2016.
[15] M. Chae, J. W. Lee, and S. H. Hong, “Decoupled 4T dynamic CAM suitable for high density storage,” Electron. Lett., vol. 47, no. 7, pp. 434–436, Mar. 2011.
[16] V. Vinogradov, J. Ha, C. Lee, A. Molnar, and S. H. Hong, “Dynamic ternary CAM for hardware search engine,” Electron. Lett., vol. 50, no. 4, pp. 256–258, Feb. 2014.
[17] K. L. Tsai, Y. J. Chang, and Y. C. Cheng, “Automatic charge balancing content addressable memory with self-control mechanism,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 10, pp. 2834–2841, Oct. 2014.
[18] N. Onizawa, S. Matsunaga, V. C. Gaudet, W. J. Gross, and T. Hanyu, “High-throughput low-energy self-timed CAM based on reordered overlapped search mechanism,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 3, pp. 865–876, Mar. 2014.

[19] S. H. Yang, Y. J. Huang, and J. F. Li, “A low-power ternary content addressable memory with pai-sigma matchlines,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 10, pp. 1909–1913, Oct. 2012.
[20] N. Mohan, W. Fung, D. Wright, and M. Sachdev, “A low-power ternary CAM with positive-feedback match-line sense amplifiers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 3, pp. 566–573, Mar. 2009.
[21] J. W. Zhang, Y. Z. Ye, and B. D. Liu, “A current-recycling technique for shadow-match-line sensing in content-addressable memories,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 6, pp. 677–682, Jun. 2008.
[22] B. D. Yang, Y. K. Lee, S. W. Sung, J. J. Min, J. M. Oh, and H. J. Kang, “A low-power content addressable memory using low swing search lines,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 12, pp. 2849–2858, Dec. 2011.
[23] A. Agarwal, S. Hsu, S. Mathew, M. Anders, H. Kaul, F. Sheikh, and R. Krishnamurthy, “A 128 × 128b high-speed wide-AND match-line content addressable memory in 32 nm CMOS,” in Proc. 2011 ESSCIRC, 2011, pp. 83–86.
[24] C. C. Wang, J. S. Wang, and C. Yeh, “High-speed and low-power design techniques for TCAM macros,” IEEE J. Solid-State Circuits, vol. 43, no. 2, pp. 530–540, Feb. 2008.
[25] A. T. Do, S. Chen, Z. H. Kong, and K. S. Yeo, “A high speed low-power CAM with a parity bit and power-gated ML sensing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 151–156, Jul. 2013.
[26] Y. J. Chang, K. L. Tsai, and H. J. Tsai, “Low leakage TCAM for IP lookup using two-side self-gating,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 6, pp. 1478–1486, Jun. 2013.
[27] N. Onizawa, S. Matsunaga, V. C. Gaudet, and T. Hanyu, “High-throughput low-energy content-addressable memory based on self-timed overlapped search mechanism,” in Proc. 18th IEEE Int. Symp. ASYNC, 2012, pp. 41–48.
[28] A. Wiltgen, K. Escobar, A. Reis, and R. Ribas, “Power consumption analysis in static CMOS gates,” in Proc. 26th SBCCI, 2013, pp. 1–6.
[29] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan, “Leakage current: Moore’s law meets static power,” IEEE Comput., vol. 36, no. 12, pp. 68–75, Dec. 2003.
[30] D. Kayal, A. Dandapat, and C. Sarkar, “Design of a high performance memory using a novel architecture of double bit CAM and SRAM,” Int. J. Electron., vol. 99, no. 12, pp. 1691–1702, Jun. 2012.
[31] S. Mishra and A. Dandapat, “EMDBAM: A low-power dual bit associative memory with match error and mask control,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 6, pp. 2142–2151, Jun. 2016.
[32] K. Pagiamtzis and A. Sheikholeslami, “Content-addressable memory (CAM) circuits and architectures: A tutorial and survey,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.

Sandeep Mishra (M’14) received the B.Tech and M.Tech degrees in electronics and communication engineering from the Biju Patnaik University of Technology, Rourkela, India, in 2011 and 2013, respectively. He is presently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya at Shillong, India.
His research area of interest covers low-power memory design, high-speed sense amplifier, and intelligent transportation system.

Telajala Venkata Mahendra received the B.Tech degree in electronics and communication engineering from the Jawaharlal Nehru Technological University, Kakinada, India, in 2013, and the M.Tech degree in VLSI Design in 2016 from the National Institute of Technology Meghalaya, Shillong, India.
He is currently a Junior Research Fellow at the National Institute of Technology Meghalaya. His research interests include design of low-power VLSI circuits, content-addressable memories, volatile memories, and FPGA-based implementations.

Anup Dandapat (M’10–SM’15) received the Ph.D. degree in digital VLSI design from Jadavpur University, Kolkata, India, in 2008.
He is presently an Associate Professor with the Department of Electronics and Communication Engineering, National Institute of Technology Meghalaya at Shillong, India. He has authored over 50 national and international journal papers. His current research interests include low-power VLSI design, low-power memory design, and low-power digital design.
