
Name: Marri Nitin Laxma Reddy

Date Assigned: Wednesday, January 29, 2014
Date Due: Wednesday, March 26, 2014
Title: Darshan D. Thaker et al., "Recursive TMR: Scaling Fault Tolerance in the Nanoscale Era"


This paper describes the challenges designers face as process technologies shrink to smaller feature sizes. Shrinking devices increases the number of transistors on a chip, but it also increases the likelihood of faults such as noise-related faults and faults caused by natural radiation. At feature sizes below about 0.25 um, noise-related faults arise from electrical disturbances in the logic values held in circuits and wires, and these disturbances can lead to logical faults in the circuit. Another cause of transient faults is natural radiation, such as neutrons and alpha particles, which introduces randomness that designs must be protected against. As feature sizes decrease, chips hold hundreds of millions of transistors, and these transistors are spent mainly on dynamic scheduling logic and caches, which are relatively simple structures.
Designers must address every reliability concern in order to exploit the cost and speed advantages of new technologies, and as design complexity increases, verification effort must increase with it, or many errors will go undetected. The raw error rate per latch or SRAM bit is expected to remain roughly constant as technology sizes decrease, so total error rates grow with transistor counts. Reliable circuits can be constructed from error-prone components, but the overhead of doing so has historically argued against the use of fault-tolerant circuit techniques.
The push toward feature sizes with unreliable components raises several questions: should a scalable approach employ fault-tolerant logic rather than insist that all devices remain error-free, and how do these options compare with simply using larger, more reliable devices? It is commonly assumed that smaller is always better and that any loss of reliability can be recovered with fault-tolerant circuit techniques, but this is not true for many CMOS devices when fault-tolerant circuits such as recursive triple modular redundancy (RTMR) are used. The article classifies the sources of noise that RTMR can correct, including single-event upsets caused by energetic particles and flicker noise in the devices, and it examines microarchitectural design options for mixing large and small devices to trade off reliability, speed, and area.

Modular redundant design was proposed by von Neumann at a time when reliable electronic components were unavailable, which motivated the use of fault-tolerant designs; however, the overhead of his techniques was beyond the capacity of the technology of the day, so they were not adopted. As logic-gate reliability improved, Winograd and Cowan extended von Neumann's work in 1963 and proved the following theorem: a circuit containing N error-free gates can be simulated with probability of failure E using O(N poly(log(N/E))) error-prone gates that fail with probability p, provided p < p_th.
The paper then describes error-correcting codes used in circuits, where a group of bits represents a single logical bit and gates are applied to the bits of a code word. Codes such as the triple-redundancy code are used to limit error propagation, so that an error in any single gate causes no more than one error in each encoded output block.
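As a rough illustration of this idea (a minimal sketch only, not the paper's encoded-gate construction), triple redundancy can be pictured in a few lines of Python: a bit is stored as three copies, and a majority vote masks any single erroneous copy.

# Minimal sketch of the triple-redundancy idea (illustrative only, not the
# paper's construction): each logical bit is stored as three copies, and a
# majority vote masks any single erroneous copy.

def encode(bit):
    """Replicate one logical bit into a 3-bit code word."""
    return [bit, bit, bit]

def majority(bits):
    """Return the majority value of a 3-bit code word."""
    return 1 if sum(bits) >= 2 else 0

word = encode(1)
word[0] ^= 1                 # inject a single-bit error
assert majority(word) == 1   # the single error is masked by the vote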



To use fault-tolerant procedures, computation should always be performed on encoded data, eliminating the possibility of errors corrupting data in an unencoded state. A naive triple-redundancy arrangement feeds the outputs of three gates into a single three-input majority-vote gate, but this violates the guideline because the majority gate decodes the data. The paper therefore presents a design that never decodes the data, built from d identical NAND gates and d identical d-input majority gates. A NAND gate never exists in isolation, so its reliability must be judged with noisy inputs taken into account. Each NAND gate takes two inputs produced by similar gates, and t = [d/2] failures lead to an incorrect output. Such an outcome can arise in one of the following ways: there were t failures in the d-majority gates, for which the number of possibilities is (d choose t); there were k failures in the NAND gates and t - k deviations in their inputs, for k = 1, ..., [d/2], and the paper counts the number of possible events for this case as well; or there were t deviations in the inputs, with each deviating input feeding a different NAND gate and the deviations affecting both input code words.
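A simplified calculation (not the paper's exact case analysis, and assuming that the inputs and the voter fail independently) shows how this kind of counting turns component failure probabilities into a stage failure probability:

# Simplified, hedged sketch of the counting involved: probability that a
# d-input majority vote produces the wrong value when each input is
# independently wrong with probability q and the voter itself flips its
# output with probability p_gate. The paper's case analysis over NAND and
# majority-gate failures is more detailed than this.
from math import comb, ceil

def majority_failure(d, q, p_gate):
    need = ceil((d + 1) / 2)   # wrong inputs needed to sway the vote
    wrong_inputs = sum(comb(d, k) * q**k * (1 - q)**(d - k)
                       for k in range(need, d + 1))
    # Output is wrong if the voter works but the inputs out-vote it,
    # or if the voter itself fails while the inputs were right.
    return (1 - p_gate) * wrong_inputs + p_gate * (1 - wrong_inputs)

print(majority_failure(d=3, q=1e-3, p_gate=1e-3))   # ~1e-3, dominated by the voter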
The paper then describes how recursive multiplexing helps decrease transient errors when building reliable architectures from unreliable circuits. Recursive voting leads to a doubly exponential decrease in a circuit's failure probability, but on its own the technique is fragile because a single error in the last d-majority gate can cause an incorrect result. The solution is to combine recursive majority voting with multiplexing. Another approach to constructing a circuit that fails with lower probability is to enlarge small devices until they consume more area, and the paper divides the failure probabilities of smaller devices into separate classes.
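A minimal numerical sketch, assuming an idealized perfect voter at each level (exactly the assumption the paper warns against and fixes with multiplexing), shows why recursive voting gives a doubly exponential decrease in failure probability:

# Idealized sketch of recursive majority voting: if a stage fails only when
# at least 2 of its 3 copies fail, and the voter is assumed perfect, the
# failure probability after each recursion level is
#   p_next = 3*p**2*(1 - p) + p**3,
# which falls roughly as p**(2**levels), a doubly exponential decrease.
# (RTMR avoids the perfect-voter assumption by adding multiplexing.)

def recurse(p, levels):
    for _ in range(levels):
        p = 3 * p**2 * (1 - p) + p**3
    return p

for lvl in range(4):
    print(lvl, recurse(1e-2, lvl))   # 1e-2, ~3e-4, ~2.7e-7, ~2.1e-13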
Single-event upsets occur when neutrons generated by cosmic rays, or alpha particles generated by packaging materials, pass through semiconductor devices. Logic errors caused by bit flips from single-event upsets are called soft errors. To estimate error rates, designers use the failures-in-time (FIT) rate, which provides a measure of the degree of reliability a system requires to maintain error-free operation. FIT values change with the effect of alpha particles on dynamic circuits as CMOS technology scales. To measure failures caused by soft errors, the experimenters set all cells to specified logic levels, exposed the chip for fixed time intervals under specific voltage conditions, and checked for corrupted logic states. The results showed that the failure probability scales with device area roughly as p_fail(A) ~ A^(-2.5), that is, the failure probability falls steeply as the device area grows.
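As a small illustration of how such numbers are used (with invented values, not figures from the paper), one FIT equals one failure per 10^9 device-hours, and the area scaling above can be applied to estimate how shrinking a cell changes its soft-error contribution:

# Illustrative only, with invented numbers: FIT counts failures per 1e9
# device-hours, and the measured p_fail(A) ~ A**-2.5 scaling suggests how a
# cell's soft-error rate grows as its area shrinks.

def chip_fit(fit_per_bit, num_bits):
    """Total FIT for a chip with num_bits identical cells."""
    return fit_per_bit * num_bits

def scaled_fit(fit_per_bit, area_ratio):
    """Per-bit FIT after scaling cell area by area_ratio (< 1 means smaller),
    assuming the A**-2.5 dependence quoted above."""
    return fit_per_bit * area_ratio ** -2.5

base = 0.001                            # hypothetical FIT per SRAM bit
print(chip_fit(base, 32 * 2**20))       # ~33,554 FIT for a 32-Mbit array
print(scaled_fit(base, 0.5))            # ~0.0057 FIT per bit at half the area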
To study the effects of noise on CMOS circuit reliability, the paper uses Sarpeshkar, Delbrück, and Mead's noise resource equation, which relates a transistor's noise power to its area and the power it dissipates:
V_n^2 = (K_w(P) / I^p) (F_h - F_l) + (K_t / A) ln(F_h / F_l)
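A short sketch with arbitrary, illustrative constants (not values from the paper or from Sarpeshkar et al.) shows the shape of the trade-off the equation captures: the white-noise term falls as the bias current, and hence the power, rises, while the flicker-noise term falls as the device area grows.

# Evaluating the noise resource equation with arbitrary, made-up constants:
# the first term (white noise over the band F_l..F_h) falls as bias current
# I rises, the second term (flicker noise) falls as device area A rises.
import math

def noise_power(I, A, K_w=1e-12, K_t=1e-10, p=1.0, F_h=1e6, F_l=1.0):
    white = (K_w / I**p) * (F_h - F_l)
    flicker = (K_t / A) * math.log(F_h / F_l)
    return white + flicker

print(noise_power(I=1e-6, A=1.0))    # small bias current, unit area
print(noise_power(I=1e-5, A=10.0))   # 10x the current and area: lower noise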

Designers cannot rely on error-correcting circuits alone to build processors immune to soft errors. The paper therefore also explains how redundant threads can improve reliability without giving up performance. Some processors run redundant threads that execute the same instructions and compare results to ensure reliability. This can even improve the performance of the redundant hardware, with one thread prefetching cache misses and computing branch outcomes for the other, and it runs on a superscalar pipeline in a way that reduces issue-queue bandwidth constraints. The main goal of redundant threading is to provide fault tolerance with minimal changes to existing designs and without sacrificing performance.
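A toy, software-level illustration of the redundant-execution idea (real redundant multithreading is implemented inside the processor pipeline, not in software like this):

# Toy illustration of redundant execution: the same work is performed twice
# and the results are compared before being "committed". In hardware, the
# leading thread also warms caches and resolves branches for the trailing one.

def redundant_execute(fn, *args):
    leading = fn(*args)
    trailing = fn(*args)
    if leading != trailing:
        raise RuntimeError("transient fault detected: results disagree")
    return leading            # commit only when both copies agree

print(redundant_execute(sum, [1, 2, 3]))   # 6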
A second approach to building a fault-tolerant microprocessor is the Dynamic Implementation Verification Architecture (DIVA), which combines a superscalar pipeline with a simpler checker pipeline. Before the core processor commits the instructions completed in the main out-of-order pipeline, the instructions proceed in program order to the checker pipeline, which verifies their results.
The DIVA checker is a simple design that simplifies verification and improves reliability for highly complex microprocessor designs, where errors in small devices are caused mainly by noise and energetic particles. On its own, however, DIVA does not provide adequate reliability, because the checker itself is not sufficiently resistant to failure. The DIVA idea is therefore used as a safeguard against all transient faults.
The proposed workflow is to construct the main pipeline in the normal manner and to build the DIVA checker with an RTMR design; because the original checker consumed only 6% of the core processor's area, using level-1 recursion for the checker still keeps its area well below that of the core. The DIVA checker must not be slower than the core pipeline, and Weaver and Austin showed that a checker with a four-wide pipeline is faster than the core pipeline.
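A highly simplified, functional-level sketch of the DIVA idea (the real checker is a hardware pipeline; the operations and values here are made up): the checker recomputes each instruction's result in program order, and the checker's value is what commits.

# Functional-level sketch of the DIVA checker idea: the possibly faulty core
# produces a result per instruction; a simple checker recomputes it in
# program order, and the checker's value is what actually commits.

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def commit(instructions, core_results):
    committed = []
    for (op, a, b), core_val in zip(instructions, core_results):
        checked = OPS[op](a, b)            # checker recomputes in program order
        if checked != core_val:
            print("checker caught a core error on", (op, a, b))
        committed.append(checked)          # the verified value commits
    return committed

prog = [("add", 1, 2), ("sub", 5, 3)]
print(commit(prog, core_results=[3, 7]))   # second core result is wrong; checker repairs it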
The paper also discusses which microarchitectural structures are most vulnerable to soft errors. One approach, the architectural vulnerability factor (AVF) defined by Mukherjee, is the probability that a fault in a structure leads to a visible error. Another approach presented here is RTL simulation of an Alpha processor, in which faults are injected to observe how they propagate and become errors at the microarchitectural level. The main goal is to keep the overhead very low while decreasing the number of errors.
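As a rough sketch of what an AVF number means (with hypothetical occupancy values, not figures from the paper): the AVF of a structure can be viewed as the fraction of its bit-cycles holding state required for correct execution, and faults in the remaining bit-cycles are masked.

# Rough sketch of the AVF idea with hypothetical numbers: AVF is the fraction
# of a structure's bit-cycles that hold state needed for correct execution;
# faults in the other bit-cycles are masked and never become errors.

def avf(required_bit_cycles, total_bit_cycles):
    return required_bit_cycles / total_bit_cycles

def error_rate(raw_fault_rate, avf_value):
    """Faults become visible errors only in proportion to the AVF."""
    return raw_fault_rate * avf_value

issue_queue_avf = avf(required_bit_cycles=2.8e6, total_bit_cycles=1e7)   # 28% (made up)
print(issue_queue_avf)
print(error_rate(raw_fault_rate=1e-9, avf_value=issue_queue_avf))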
In the future, as transient errors increase, many approaches will be needed to overcome them. For example, an architecture that uses lightweight protection for its vulnerable components might avoid performance hits but might suffer an increase in errors from single-event upsets. Another direction is the growing variety of nanoscale components, such as carbon nanotubes, silicon nanowires, and single-electron transistors, which calls for studying the trade-offs between area, reliability, and performance when different nanoscale devices are used to build circuits.
This paper presented how decreasing device sizes have delivered faster, cheaper, and denser computation. System reliability is one of the most important concerns in the nanoscale era, where RTMR addresses effects such as noise and single-event upsets caused by energetic particles. The paper also showed how the noise sources associated with devices scale too quickly with device size, and how RTMR circuits consequently become too large and slow. Finally, it explained how microarchitectural research is critical to the effective use of future nanoscale technologies and to decreasing transient errors.
