I. INTRODUCTION
WANG et al.: HIGH-SPEED AND LOW-POWER DESIGN TECHNIQUES FOR TCAM MACROS
Fig. 2. (a) The original cascaded AND-type match-line circuit. (b) Logic transformation.
Fig. 3. (a) Parallel, (b) 3-level tree, and (c) 2-level tree AND-type match lines.
The design in Fig. 3(a) uses two short parallel match lines in
each half plane and merges the outputs from both planes into
a 4-input AND gate to generate the final matching result. On
the other hand, the designs in Fig. 3(b) and (c) adopt a 3-level
and a 2-level tree match-line circuit, respectively, in each half
plane, and use an 8-input and a 4-input AND gate, respectively,
to generate the final matching results. The electrical behaviors,
including delay time and power consumption, are used to determine the final choice. The evaluation results are described as
follows.
Let us take the design of a 0.18 µm 128b TCAM match line
as an example. Post-layout evaluation results of different implementations are listed in Table I. All the designs use the same
TCAM cell for a fair evaluation, and the cell layout is shown in
Fig. 4. The impacts of the TCAM cell design and the cell layout
design will be described in Section IV.
The following are the observations from the extracted features and parameters.
1) Both designs 1 and 2 have the deepest logic depth, but
design 1 performs a more complex function in the critical
path than design 2. So, design 1 has the longest search
delay.
TABLE I
PERFORMANCE COMPARISONS BETWEEN DIFFERENT MATCH LINES
In Internet Protocol version 6 (IPv6), the length of an IP address extends to 128 bits. In a routing table, the prefix region
stores either 0 or 1, and the rest stores X (don't care). The statistical prefix
length distribution observed at a specific router [17] is shown in
Fig. 6(a). We find that more than 90% of the prefixes are shorter
than 64 bits. Therefore, when the routing table is constructed
with a TCAM array, a large portion of the array contains the
mask bits (i.e., the X bits), as shown in Fig. 6(b).
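
To make the storage pattern concrete, the following minimal sketch pads a binary prefix with X symbols up to the word width and estimates how much of an array holds X cells. The function names and the prefix lengths used in the example are hypothetical and chosen only for illustration; they are not taken from the measured distribution in Fig. 6(a).

```python
def prefix_to_ternary(prefix_bits: str, width: int = 128) -> str:
    """Pad a binary prefix with X (don't care) up to the word width."""
    if len(prefix_bits) > width or set(prefix_bits) - {"0", "1"}:
        raise ValueError("prefix must be a 0/1 string no longer than width")
    return prefix_bits + "X" * (width - len(prefix_bits))


def x_cell_fraction(prefix_lengths: list[int], width: int = 128) -> float:
    """Fraction of cells in the array that hold X for the given prefix lengths."""
    x_cells = sum(width - n for n in prefix_lengths)
    return x_cells / (len(prefix_lengths) * width)


# Hypothetical inputs, for illustration only.
print(prefix_to_ternary("101", width=8))   # 101XXXXX
print(x_cell_fraction([32, 48, 64]))       # 0.625
```

With prefixes mostly shorter than 64 bits, more than half of each 128b word is X, which is the property the segmented search-line scheme exploits.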
B. The Proposed Segmented Search-Line Circuitry
Since the X cells in Fig. 6(b) do nothing but pass matching
signals, they do not have to be involved in the search operation. This property, when combined with the progressive layout
pattern, indicates that search lines behind the X cells can be
turned off to save energy. The idea then leads to the segmented
search-line design as shown in Fig. 7. Several segmentation entries (SEs) are inserted into the cell array. A segmentation entry
contains a row of segmentation cells (SCs), and SCs are used to
control signal propagation in the search lines.
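
The effect of segmentation can be illustrated with a purely behavioral sketch. It assumes that entries are sorted by descending prefix length (one possible form of the progressive pattern), that each search line is driven from row 0, and that an SC placed before row r cuts the line when the cell in row r-1 of that column stores X. The function names, the row indexing, and the example values are hypothetical; the model ignores the actual circuit and floorplan.

```python
def driven_rows(prefix_lengths, se_rows, width):
    """For each bit column, count the rows the search line still drives.

    prefix_lengths: prefix length per entry, sorted in descending order, so
                    the X cells of a column are contiguous at the far end.
    se_rows:        row indices (>= 1) with a segmentation entry placed
                    immediately before that row.
    """
    rows = len(prefix_lengths)
    reach = []
    for col in range(width):
        limit = rows
        for r in sorted(se_rows):
            # The cell just upstream of the SC stores X in this column when
            # its prefix ends at or before the column index.
            if prefix_lengths[r - 1] <= col:
                limit = r  # cut the line here; rows r.. are not driven
                break
        reach.append(limit)
    return reach


def switched_fraction(prefix_lengths, se_rows, width):
    """Driven cells divided by all cells: a rough proxy for search-line load."""
    reach = driven_rows(prefix_lengths, se_rows, width)
    return sum(reach) / (len(prefix_lengths) * width)


# Hypothetical 4-entry, 8-bit array with one SE before row 2.
print(driven_rows([8, 6, 4, 2], [2], 8))        # [4, 4, 4, 4, 4, 4, 2, 2]
print(switched_fraction([8, 6, 4, 2], [2], 8))  # 0.875
```

In this toy case the two rightmost columns stop at the SE, so 12.5% of the cell loads are never driven. The saving in a real array depends on where the SEs sit relative to the prefix-length distribution.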
The circuit containing an SC and two TCAM cells is shown in
Fig. 8. The SC is composed of a dummy cell and a path-control
switch. The word line (WL) for the upper TCAM cell is also
applied to the dummy cell. When writing an X into the upper
TCAM cell, both WBLP and WBLN lines are raised high. In
that case, the output of the dummy cell receives a low to cut
off the signal propagation, and the upper segment of the search
Fig. 6. (a) The prefix length distribution of IP addresses, and (b) the corresponding TCAM array.
Fig. 8. The circuit showing the relationship between the SC and neighboring
TCAM cells.
Fig. 7. Concept of the segmented search-line scheme.
The number and locations of segmentation entries can be decided by the statistical features of the routing table. Once the
TCAM array has been designed, the segmentation entries cannot be
changed for a specific embedded application. If an entry needs
to be added to the look-up table, the table should be re-sorted at
the system level first, and then write operations are performed
TABLE II
PERFORMANCE COMPARISON BETWEEN DIFFERENT INTERCONNECTION MANNERS
voltage at node out and the channel length (L) of the feedback
pMOS at typical (TT) and worst (SF) process corners are
shown in Fig. 10(b). The results indicate that if cell_out is connected to […], the adjustable range of L for the maximal
ripple voltage not exceeding 0.4 V is very limited. Moreover, for the same ripple
voltage, say 0.15 V, the design with […] can
use a longer L (0.5 µm) and obtain a shorter gate delay (188 ps),
while the design with […] should use a shorter L
(0.21 µm) and get a larger gate delay (379 ps).
B. Interconnections Among TCAM Cells
Fig. 9. The proposed TCAM cell.
All the gates in the proposed 2-level tree match line [Fig. 3(c)]
should be arranged in one row in the memory array. Therefore,
it is necessary to make a long interconnection to link two
branches of the tree. The way an interconnection is made influences the amount of parasitic capacitance and in turn influences
both search speed and power consumption. We have studied
two interconnection methods for performance evaluation.
Fig. 11(a) and (b) show the conceptual and layout diagrams
of straightforward and leap-frog interconnection methods,
respectively. Post-layout simulation results are summarized in
Table II.
The simulation data in Table I are based on the leap-frog interconnection. The data in Table II reveal that a straightforward interconnection both lengthens the search delay and increases the power consumption. This effect
arises mainly because the long interconnection in the straightforward method lies in the critical path and results in a larger RC
product.
C. TCAM Cell Layout
The evaluation results in both Tables I and II are based on the
TCAM cell shown in Fig. 4. In the following, we show performance evaluations based on different cell layouts. Fig. 4 is
a TCAM cell with an aspect ratio of 1.17. We designed two
other cell layouts with smaller aspect ratios, as shown in Fig. 12.
Table III summarizes the post-layout evaluation results for a
128b 2-level tree match line (ML).
The data in Table III show that the smaller the aspect ratio
of the TCAM cell, the longer the search delay and the larger the
power consumption of the match line, but the smaller the capacitance on the search lines (SL). A good tradeoff is to use the design of Fig. 12(a) because it sacrifices only 1.97% in search delay
but obtains a 33% SL capacitance reduction. The overall power
reduction from the 33% SL capacitance reduction will more than
compensate for the 4% ML power increase.
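
The tradeoff can be checked with a short back-of-the-envelope calculation. The sketch below assumes that search-line power scales linearly with SL capacitance and that search power consists only of an SL part and an ML part. Both are simplifying assumptions rather than statements from the measurements, and the SL power share used in the example is hypothetical.

```python
def net_power_change(sl_share, sl_cap_reduction=0.33, ml_power_increase=0.04):
    """Relative change in total search power (negative means a saving).

    sl_share is the fraction of search power spent on the search lines;
    the remainder is attributed to the match lines.
    """
    ml_share = 1.0 - sl_share
    return -sl_cap_reduction * sl_share + ml_power_increase * ml_share


def break_even_sl_share(sl_cap_reduction=0.33, ml_power_increase=0.04):
    """SL power share above which the layout change reduces total power."""
    return ml_power_increase / (ml_power_increase + sl_cap_reduction)


print(round(break_even_sl_share(), 4))   # 0.1081
print(round(net_power_change(0.5), 4))   # -0.145
```

Under these assumptions the layout of Fig. 12(a) saves power whenever the search lines account for more than about 11% of the search power.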
Fig. 10. (a) Simulation model for observing the CSE and (b) simulation results.
Fig. 12. (a) The second style, and (b) the third style TCAM cell layouts.
line for a larger power saving, but use the above sizing while
avoiding further speed loss with the aid of floorplan design. Fig. 14
shows the final floorplan of the 256 × 128b TCAM macro with
two SEs. One SE is located at the quarter and the other at the
half of the search line. The write and search buffers are located
at the center of the array so that they can drive search lines in
the upper and the lower half arrays simultaneously. With this
design, each search signal will pass only one SC although one
256b search line has two SCs.
In Fig. 14 we also show the diode used to generate VDDC.
The large diode is realized by many distributed small diodes
located at the top and bottom of the cell array.
V. EXPERIMENTAL RESULTS
We implemented a 1.8 V 0.18 µm TCAM test chip for verifying the proposed design techniques. The critical-path circuit
of the TCAM macro is shown in Fig. 15(a). Before we can use
the TCAM for searching purposes, the TCAM array should be
filled with data using the write operation. The timing waveforms
for the write mode are shown in Fig. 15(b). When writing a
don't care, the corresponding mask bit will be set
to 0, and the corresponding bitline enable signal
and write bitlines (WBLP, WBLN) will be pulled low
and high, respectively. So, both storage nodes (QP and QN) of a
TCAM cell will be written with a 0 and the inner node cell_out
will be pulled up to the voltage level of VDDC as described earlier.
When writing 1 or 0, the signal […] goes high and
one of the write bitlines will be pulled low according to the input
datum.
The timing waveforms for the search operation are shown
in Fig. 15(c). The internal clock signal for the
match circuit is the complement of the external
clock signal clk. When the internal clock goes low, the match-line circuit enters
the pre-charge phase and the external datum and its complement
are fetched by the rising edge of clk into the search lines through the
search line buffers. In this phase, the datum on the search lines
begins to be compared with all the data previously written and stored
Fig. 15. (a) Schematic of the critical-path circuit, (b) waveforms of the write operation, and (c) waveforms of the search operation.
into the TCAM array, and the voltage at node cell_out of each
memory cell goes toward its final value. When the internal clock goes high, the
match-line circuit enters the evaluation phase. Please refer to
[12] for the detailed operation of the PF-CDPD match-line circuit. All match lines are evaluated at the same time, and each
match line generates an output that goes high if
the search data match the stored data.
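
The matching condition above can be captured by a purely logical model: a cell matches if it stores X or equals the search bit, and the match-line output is the AND of all cell results. The sketch below evaluates that AND in two levels, with group outputs combined by a final AND. The group size is a hypothetical parameter, and the code does not model the PF-CDPD circuit, its timing, or the exact tree partitioning of Fig. 3(c).

```python
def cell_match(stored: str, key: str) -> bool:
    """A ternary cell matches if it stores X or equals the search bit."""
    return stored == "X" or stored == key


def tree_match(stored_word: str, key_word: str, group: int = 4) -> bool:
    """Two-level AND: AND the cells within each group, then AND the groups."""
    if len(stored_word) != len(key_word):
        raise ValueError("stored word and search key must have equal length")
    cells = [cell_match(s, k) for s, k in zip(stored_word, key_word)]
    level1 = [all(cells[i:i + group]) for i in range(0, len(cells), group)]
    return all(level1)


print(tree_match("10XX", "1011", group=2))  # True
print(tree_match("10XX", "1111", group=2))  # False
```

The two-level form returns the same result as a flat AND over all cells. The choice of tree shape only matters for delay and power in the physical circuit.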
The block diagram of the test chip is shown in Fig. 16(a). The
TCAM macro contains two segmentation entries whose prefix
lengths are 64 bits and 32 bits, respectively.
A voltage-controlled oscillator (VCO) and a divide-by-two circuit are used to generate the clock signals with a 50% duty
cycle. The clock frequency can be adjusted from 200 MHz
to 600 MHz. A dummy clock buffer synchronizes the rising
(falling) edge of the clock clkt for the peripheral circuits with
the falling (rising) edge of the clock for the TCAM core.
The pre-stored data and the search data are generated by four
32b linear feedback shift registers (LFSRs), and the seed for
the LFSRs can be controlled for varying the data sequence. The
mask-bit control circuit is used to help generate the progressive data pattern. The 8b counter is used for generating the address for the write operation. At the beginning of the measurement,
the four 32b LFSRs will generate a random pattern, which is
Fig. 16. (a) The block diagram and (b) the timing diagrams of the test chip.
Fig. 17. (a) Photograph, (b) measurement waveforms, and (c) shmoo chart of the test chip.
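
The LFSR-based data generation used in the test chip can be sketched in software as follows. The feedback taps of the on-chip LFSRs are not given in the text, so this example assumes a commonly tabulated 32-bit tap set (bit positions 32, 22, 2, and 1) and a Fibonacci-style left shift. The function names, the seeds, and the way the four 32b states are concatenated into one 128b word are likewise assumptions.

```python
MASK32 = 0xFFFFFFFF


def lfsr32_step(state: int) -> int:
    """Advance a 32-bit Fibonacci LFSR by one shift.

    Assumed taps: bit positions 32, 22, 2 and 1 (counted from 1 at the LSB).
    """
    feedback = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    return ((state << 1) | feedback) & MASK32


def next_word(states):
    """Step four 32b LFSRs and concatenate their states into a 128b word."""
    states = [lfsr32_step(s) for s in states]
    word = 0
    for s in states:
        word = (word << 32) | s
    return word, states


# Hypothetical seeds; all-zero seeds must be avoided because they lock up.
word, states = next_word([1, 1, 1, 1])
print(f"{word:032x}")  # 00000003000000030000000300000003
```

Changing the seeds changes the generated sequence, which mirrors the seed control described for the test chip.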
TABLE IV
FEATURES SUMMARY OF THE TEST CHIP
are 0.021 ns and 0.012 fJ/bit/search, respectively. This result implies that the proposed design techniques are robust to process
variations.
TABLE V
FEATURES SUMMARY AND PERFORMANCE COMPARISON
TABLE VI
OTHER PERFORMANCE COMPARISONS
the TCAM design [13], the proposed design still shows a 25%
improvement in the energy index.
In order to see how the speed and power are affected by the
bit width and CMOS technology, we have also implemented a
0.18 µm 1.8 V 256 × 144b TCAM macro and a 0.13 µm 1.2 V
256 × 128b TCAM macro. Table VI summarizes the design features. When realizing a 144b match line, a four-input PF-CDPD
AND gate is added at the end of each branch of the 2-level tree
AND-type match line [refer to Fig. 3(c)]. As compared to the
128b-wide TCAM macro, the search delay and the energy index
of the 144b-wide TCAM match line increase by 12% and 23%, respectively. On the other hand, comparing the 0.13 µm 1.2 V
256 × 128b TCAM design to the 0.18 µm 1.8 V 256 × 128b
TCAM design, the search delay and the energy index improve
by 29% and 75%, respectively. The results indicate the benefits
of technology scaling.
VI. CONCLUSION
In this work, the tree AND-type match-line scheme is proposed for its high search speed, and the segmented search-line
scheme for its high energy efficiency in the TCAM-based application of IP address lookup. The design of the TCAM cell, interconnections among TCAM cells, TCAM cell layout, and segmentation entries are also described. The realized 1.8 V 0.18 µm
256 × 128b TCAM macro achieves a search time of 1.56 ns with
1.42 fJ/bit/search energy.
ACKNOWLEDGMENT
The authors thank the Chip Implementation Center for supporting the chip fabrication.
REFERENCES
[1] K.-J. Lin and C.-W. Wu, "A low-power CAM design for LZ data compression," IEEE Trans. Comput., vol. 49, pp. 1139–1145, Oct. 2000.
[2] F. Yu, R. H. Katz, and T. V. Lakshman, "Gigabit rate packet pattern-matching using TCAM," in Proc. IEEE ICNP, 2004, pp. 174–183.
[3] T. Ikenaga and T. Ogura, "A fully parallel 1-Mb CAM LSI for real-time
pixel-parallel image processing," IEEE J. Solid-State Circuits, vol. 35,
no. 4, pp. 536–544, Apr. 2000.
[4] R. Sangireddy and A. K. Somani, "High-speed IP routing with binary
decision diagrams based hardware address lookup engine," IEEE J. Sel.
Areas Commun., vol. 21, no. 5, pp. 513–521, May 2003.