Peter Fuhrmann
Philips Research Europe
Aachen (Germany)
peter.fuhrmann@philips.com
Abstract
In this paper, an overview is given of the main architectures used in the automotive field to
implement fail-safe microcontrollers. The concept of a new HW-centric, distributed and
optimized architecture is also presented. In light of the IEC 61508 norm for safety-related
electronic systems, these architectures are compared on the basis of a
reference design. The paper concludes by discussing how the presented architectures can be
extended to become fail-functional.
1. Introduction
Driven by many different requirements, such as AUTOSAR [1], the complexity of
Microcontroller Units (MCUs) used in today's vehicles is continuously increasing: powerful
32/64-bit CPUs are used with a corresponding increase in embedded memory size, and this
leads to the progressive use of very deep submicron technologies. At the same time, MCUs
are heavily used in many safety-critical or safety-related applications (such as throttle control,
transmission, suspension, steering, braking, and airbags) and therefore faults and failure
modes have to be carefully considered. On one side, the uncertainty of models,
functional verification holes and specification misunderstandings are increasing the probability
of systematic faults causing intermittent faults that are very difficult to detect during design and
production tests [2]. On the other side, electromagnetic interference, crosstalk, unforeseen
interactions, misuse, soft errors and malicious accesses are increasing the random failure rate
[3]. Because of this, and because of the increasing popularity of norms governing safety-critical
applications, such as IEC 61508 [4], there is increasing pressure to move the safety integrity
level of automotive electronic components and systems to an unprecedented level,
challenging the currently available architectures.
This paper starts with an introduction to the IEC 61508 norm. Chapter 3 gives an overview of
the architectures currently used in automotive applications to implement fail-safe microcontrollers. The
concept of a new HW-centric, distributed and optimized architecture is presented in Chapter 4
and compared in Chapter 5 with the other presented architectures. The paper concludes in
Chapter 6 by discussing how the presented architectures can be extended to become fail-functional.
which do not have the potential to put the safety-related system in a hazardous or fail-to-function state) and detected dangerous failures over the sum of all the possible failures (safe
plus dangerous). The annexes of IEC 61508-2 give guidelines on the faults and failure
modes to be considered for each component. For instance, for a CPU, IEC 61508 requires coverage of
permanent and transient faults for register banks, internal RAMs, instruction coding and
execution, flags, address calculation, program pointer, stack pointer and interrupt logic. IEC
61508 also includes the recommended diagnostic techniques, graded according to their
effectiveness with respect to the target SIL.
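The ratio described above is the Safe Failure Fraction (SFF) used in the rest of the paper. As a minimal sketch, it can be computed from three failure-rate classes; the function name and the example failure rates below are illustrative assumptions, not figures from the paper.

```python
def safe_failure_fraction(lambda_safe, lambda_dd, lambda_du):
    """Return SFF = (safe + detected dangerous) / (safe + all dangerous).

    lambda_safe: rate of safe failures
    lambda_dd:   rate of dangerous failures that are detected
    lambda_du:   rate of dangerous failures that stay undetected
    All rates must use the same unit (for example FIT).
    """
    total = lambda_safe + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_safe + lambda_dd) / total


# Hypothetical failure rates in FIT, chosen only to illustrate the ratio.
example_sff = safe_failure_fraction(lambda_safe=600.0, lambda_dd=396.0, lambda_du=4.0)
```

With these assumed rates, only 4 of 1000 FIT are dangerous and undetected, so the resulting fraction is 0.996.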
An important role for System-on-Chip (SoC) designs is played by the Beta Factor. The
Beta Factor has been introduced in order to quantify the probability of
common cause failures (CCF) when multiple, functionally equal channels are implemented in
the same system to provide the required safety integrity. For an MCU, or more generally for
any ASIC, IEC 61508 specifies that the CCF must be quantified using a βASIC that has to be
no higher than 25%. Examples of CCF affecting redundant channels in MCUs are: sleeping
faults, clock faults, power faults, temperature faults, hot spots, timing faults and checker
faults. Another important concept is the hardware fault tolerance (HFT). A system with an
HFT of N means that N+1 faults could cause a loss of the safety function. In light of IEC
61508, a unit is said to operate in a fail-safe manner if a defined set of faults is detected and
leads to a safe condition in an adequately short time. If despite these faults the unit can
still guarantee the mission functionality, it is said to operate in a fail-operational or fail-functional manner. In general, units with HFT=0 cannot be fail-functional but can be fail-safe. In summary, the following milestones have to be met for an MCU to be certified
SIL3 in accordance with IEC 61508 with HFT=0: the SFF has to be equal to or greater than 99% and
βASIC has to be equal to or lower than 25%.
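The two SIL3 milestones for HFT=0 and the meaning of HFT can be written down as a small check. This is only a sketch: the constant and function names are hypothetical, and the thresholds are the ones quoted in the paragraph above.

```python
SFF_MIN_SIL3_HFT0 = 0.99   # SFF must be equal to or greater than 99%
BETA_ASIC_MAX = 0.25       # beta factor must be equal to or lower than 25%


def faults_to_lose_safety_function(hft):
    """With a hardware fault tolerance of N, N+1 faults could defeat the safety function."""
    if hft < 0:
        raise ValueError("HFT cannot be negative")
    return hft + 1


def meets_sil3_hft0_targets(sff, beta_asic):
    """Check the two milestones quoted above for an MCU with HFT=0."""
    return sff >= SFF_MIN_SIL3_HFT0 and beta_asic <= BETA_ASIC_MAX
```

For example, `meets_sil3_hft0_targets(0.995, 0.30)` is false: a high SFF alone is not enough when the common-cause factor exceeds the limit.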
[Figures: block diagrams of the reviewed fail-safe MCU architectures. Recoverable labels: CPU1 (master), CPU2 (checker), Compare Unit, watchdog, S1, S2, A, SW 1, SW 2, SW 3, 1oo1D.]
At the design level, the use of hardened processing units, such as [11], can increase the
reliability of some sub-systems of the MCU. However, they cannot solve the overall safety
problem since they address only specific fault models. Furthermore, the IEC 61508
standard discourages this approach because of the general difficulty of properly separating the mission logic
from the safety integrity circuitry, which increases the βASIC factor beyond the
acceptable limit. For the same reasons, the use of well-known triple-modular-redundant flip-flops to harden a given circuit is very critical in light of the IEC 61508 norm. This solution
is specific to a particular fault model and becomes more and more inefficient because of
the increasing rate of multi-cell single event upsets, which represent a CCF for such a structure.
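The limitation of triple-modular-redundant flip-flops can be illustrated with a bit-level majority voter. The sketch below is a generic model, not a circuit from the paper; the `upset_mask` parameter is a hypothetical way of marking which of the three copies has been flipped.

```python
def majority(a, b, c):
    """Two-out-of-three majority vote on single bits."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(stored_bit, upset_mask):
    """Read a TMR-protected bit after some of its three copies were flipped.

    upset_mask is a 3-bit mask: bit i set means copy i has been upset.
    """
    copies = [stored_bit ^ ((upset_mask >> i) & 1) for i in range(3)]
    return majority(*copies)
```

A single upset (`tmr_read(1, 0b010)`) is masked by the voter, while an upset that hits two neighbouring copies at once (`tmr_read(1, 0b011)`) flips the voted value without any indication, which is why a multi-cell upset acts as a common cause failure for this structure.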
4. faultRobust Technology
As described in the previous section, state-of-the-art architectures for fail-safe MCUs
mainly rely on HW replication (the dual-CPU lock-step), SW redundancy (the challenge-response architecture), and/or hardening of the processing units. As will be better quantified in
the following section, SW redundancy introduces overhead, performance degradation, strong
application dependency and a quite negative impact on time-to-certification and on its
associated cost. HW replication introduces a significant overhead and high complexity related
to the need for protection against common-cause failures. All of the issues highlighted for SW
redundancy and HW replication become even more severe for MCUs integrating multiple
CPUs, multiple memories and complex peripherals. On the other side, technology hardening
(i.e. process, layout or circuit techniques to harden the processing units against faults) has a
strong impact on cost because of the changes generally required in the production chain.
Moreover, as previously stated, this type of hardening can address only specific types of
faults. Therefore it will not reach the target in terms of SIL unless it is used in
conjunction with SW redundancy and/or HW replication.
The basic concepts of a platform-based, HW-centric, optimized approach targeting IEC
61508 SIL3 fail-safe MCUs, called faultRobust, have been recently introduced [12]. In
summary, faultRobust is composed of a set of tools and methodologies (identified as
fRMethodology [13]) to perform Failure Modes and Effects Analysis (FMEA) of an integrated
circuit in adherence to IEC 61508, and of a library of Intellectual Properties (called fault
supervisors or faultRobust IPs [14]) to address fault detection and fault tolerance of MCU
sub-systems. A very high-level architecture of an MCU integrating faultRobust IPs (fRIPs) for
implementing the safety integrity strategy is shown in figure 4. A summary description of
the fRIPs is given in the following.
The fRMEM supervisor is a configurable and re-usable IP for the protection of memory sub-systems [15]. It is basically an ECC-based memory protection supervisor, extended with
measures to fulfill the requirements of IEC 61508. Optionally it provides optimizations to
improve access time (fast-track) and protection over time (scrubbing of errors that have
developed). Thanks to the ability to store the ECC codes separately from the data, memory overhead
can be reduced by providing different protection levels for different memory pages. For multi-bus-master
architectures, a distributed MPU can also be integrated. The fRMEM supervisor
has been certified by TÜV SÜD in accordance with IEC 61508.
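The paper does not state which ECC the fRMEM supervisor uses. As a generic illustration of ECC-based protection with scrubbing, the sketch below uses a textbook Hamming(7,4) single-error-correcting code; the function names and the list-of-bits memory model are assumptions made only for this example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 unused, so index == position
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, corrected) and fix at most one flipped bit."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # a non-zero syndrome is the faulty position
    corrected = syndrome != 0
    if corrected:
        code[syndrome] ^= 1
    data = (code[3], code[5], code[6], code[7])
    nibble = sum(bit << i for i, bit in enumerate(data))
    return nibble, corrected


def scrub(memory):
    """Rewrite every word that needed correction; return the number of repaired words."""
    repaired = 0
    for addr, word in enumerate(memory):
        nibble, corrected = hamming74_decode(word)
        if corrected:
            memory[addr] = hamming74_encode(nibble)
            repaired += 1
    return repaired
```

Periodic scrubbing matters because this code corrects only one flipped bit per word: rewriting a word after its first error prevents a second, later error from accumulating in the same word and defeating the correction.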
Concerning the fRCPU, it is designed taking full benefit of the fRMethodology, i.e.
starting from a detailed FMEA which selects the sensitive zones of the target CPU to be
protected in order to get an SFF ≥ 99% with the minimum cost. So the fRCPU supervisor does
not come as a replication of the CPU, but is architecturally and functionally diverse and thus
strongly reduces the βASIC as required by IEC 61508. Also, as fRCPU implements a
hardware-centric approach, most diagnostics are implemented in hardware and not in software
as in other approaches, and thus CPU performance is not impacted. The main blocks included in the
fRCPU are: the Sniffer Unit, which collects, compares and codes signals from the
CPU boundary; the Shadow Processing Unit, which executes the same instruction flow as the
CPU and includes a register bank to store a shadow value of the CPU main registers; a
Management Unit to generate data/addresses; and a Coverage Monitor Unit providing run-time information on the current coverage (SFF) with respect to the assumptions used
during fRCPU design optimization. Finally, fRCPU compares its results with the
ones read from the protected CPU by means of a set of independent checkers
supervising the different CPU ports. The fRCPU has been assessed by TÜV SÜD: a
concept report has been delivered stating that the proposed measures are capable of fulfilling the
safety integrity level SIL3 in accordance with IEC 61508.
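The interplay between a shadow register bank and a checker can be sketched with a toy model. Everything below is hypothetical: the two-operation instruction set, the class name and the latched alarm are illustration choices, and the real fRCPU is described above as architecturally diverse rather than as a copy of the CPU datapath.

```python
MASK32 = 0xFFFFFFFF


def alu(op, a, b):
    """Toy 32-bit ALU used to recompute the expected result of an instruction."""
    if op == "add":
        return (a + b) & MASK32
    if op == "xor":
        return (a ^ b) & MASK32
    raise ValueError(f"unsupported operation: {op}")


class ShadowChecker:
    """Follows the instruction flow and compares each CPU write-back value."""

    def __init__(self, initial_regs):
        self.shadow_regs = list(initial_regs)  # shadow copy of the CPU main registers
        self.alarm = False                     # latched once any mismatch is seen

    def observe(self, op, rd, rs1, rs2, cpu_writeback):
        """Return True if the value written back by the CPU matches the shadow result."""
        expected = alu(op, self.shadow_regs[rs1], self.shadow_regs[rs2])
        match = expected == cpu_writeback
        if not match:
            self.alarm = True
        self.shadow_regs[rd] = expected        # the shadow bank keeps the expected value
        return match
```

A usage example: after `checker = ShadowChecker([1, 2, 0, 0])`, the call `checker.observe("add", 2, 0, 1, 3)` matches, while a later write-back with a flipped bit makes `observe` return False and leaves `checker.alarm` set.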
The other remote supervisors are fRBUS, fRPERI and fRNET. Bus supervisors (fRBUS)
consist of a set of blocks (decoders, arbiters, checkers) monitoring the sources and sinks of the
bus interconnect and providing the information needed to control data integrity and routing.
Peripheral supervisors (fRPERI) implement a hardware verification component, i.e. a block
where a subset of the protocol checks and assertions is used to verify that a given interface
and protocol are still implemented by the hardware. fRCPU, fRMEM, fRBUS and fRPERI
communicate with system-level supervisors through a dedicated interconnect (fRNET) based
on a robust on-chip protocol that guarantees that information is transferred without errors
between the different diagnostic units, as required by IEC 61508.
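The paper does not describe the fRNET protocol itself. One common way to protect diagnostic messages in transit is to add a sequence counter and a checksum to each frame; the sketch below assumes such a scheme, with a hypothetical 8-byte frame layout and CRC-32 from Python's standard `zlib` module standing in for whatever code the real interconnect uses.

```python
import struct
import zlib


def pack_frame(source_id, seq, status):
    """Build an 8-byte frame: source id, sequence counter, 16-bit status, CRC-32."""
    payload = struct.pack(">BBH", source_id, seq & 0xFF, status)
    return payload + struct.pack(">I", zlib.crc32(payload))


def unpack_frame(frame, expected_seq):
    """Return (source_id, status), or None if the frame is corrupted or out of sequence."""
    if len(frame) != 8:
        return None
    payload = frame[:4]
    (crc,) = struct.unpack(">I", frame[4:])
    if zlib.crc32(payload) != crc:
        return None                       # corrupted in transit
    source_id, seq, status = struct.unpack(">BBH", payload)
    if seq != (expected_seq & 0xFF):
        return None                       # lost, repeated or reordered frame
    return source_id, status
```

The checksum covers corruption of the frame contents, while the sequence counter lets the receiver notice frames that are missing or repeated, which a checksum alone cannot reveal.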
[Figure 4. High-level architecture of an MCU integrating faultRobust IPs: fRCPU, fRMEM, fRBUS and fRPERI supervise the CPUs, memories, buses and peripherals, and exchange diagnostic information over fRNET.]
5. Comparisons
The first comparison described in this paper is between the dual-CPU lock-step
architecture and the faultRobust approach, considering the following parameters: Safe
Failure Fraction and βASIC, area overhead, SW overhead (the overhead of
Flash/ROM and RAM memory needed to host the additional software, both code and
data), power overhead, detection latency (the relative detection latency, i.e.
the time between the appearance of an error at the boundaries of the CPU core and
its detection by the diagnostic) and finally performance overhead
As shown in these tables, the use of faultRobust provides very significant savings
compared to dual-CPU lock-step costs: from table 1, it incurs around one sixth of the
delta costs of a SIL3-compliant dual-core lock-step architecture. A possible objection is
how much the diluting effect of large memory amounts affects the faultRobust
advantages. For a typical medium-high end automotive MCU with 512Kb flash, 48Kb
SRAM, 32Kb shared TCM and 0.25Mgates of extra logic, the saving of fRCPU/fRMEM
with respect to dual-CPU lock-step is still around 25%. It is important to note that this
benchmark does not include the optimized supervision of the buses and peripherals for
failures. With a simple simulation, for the typical medium-high end automotive MCU
described before and considering 70% of the extra logic to be protected with HW safety
integrity measures, the saving of fRCPU/fRMEM plus fRBUS/fRPERI with respect to
dual-CPU lock-step plus bus/peripheral redundancy rapidly grows to 35%. Concerning
power overhead, the same considerations made for the area overhead apply. For all the
other benchmarked parameters (such as SW overhead, performance overhead and detection latency),
the fRCPU optimized and diverse architecture always ends up with a positive balance
with respect to the dual-core lock-step, resulting in an overall very significant benefit.
Table 1. Comparison without memories

                     Dual CPU Lock Step   SIL3 Dual CPU Lock Step   SIL3 faultRobust
                     (βASIC > 25%)        (βASIC ≤ 25%)             (βASIC ≤ 25%)
Area overhead        1                    1.97                      0.31
SW overhead          1                    2.22                      0.45
Power overhead       1                    1.94                      0.31
Detection latency    1                    3                         <1
Perf. overhead       1                    2.22                      0.08
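The "around one sixth" figure quoted in the text can be checked against the Table 1 entries. The short calculation below copies the SIL3 lock-step and faultRobust values from Table 1 (written with decimal points); the dictionary and function names are hypothetical, and the detection latency row is left out because its faultRobust entry is only given as "<1".

```python
# (SIL3 dual-CPU lock-step, faultRobust) overheads from Table 1, without memories.
TABLE_1 = {
    "area": (1.97, 0.31),
    "sw": (2.22, 0.45),
    "power": (1.94, 0.31),
    "performance": (2.22, 0.08),
}


def relative_cost(parameter):
    """faultRobust overhead as a fraction of the SIL3 lock-step overhead."""
    lock_step, fault_robust = TABLE_1[parameter]
    return fault_robust / lock_step


ratios = {name: round(relative_cost(name), 3) for name in TABLE_1}
```

For area and power the ratio comes out at roughly 0.16, which matches the one-sixth statement; the SW and performance ratios are different, so the one-sixth figure should be read as referring to the silicon cost.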
[Figure: block diagrams of the benchmarked designs. Recoverable labels: CPU_master (ARM968ES), CPU_checker (ARM968ES), CPU (ARM968ES) with fRCPU, I-TCM, D-TCM, ECC, Temp Sensor, Power.]
                     Dual CPU Lock Step   SIL3 Dual CPU Lock Step   SIL3 faultRobust
                     (βASIC > 25%)        (βASIC ≤ 25%)             (βASIC ≤ 25%)
Area overhead        1                    1.22                      0.19
SW overhead          1                    2.22                      0.45
Power overhead       1                    1.23                      0.24
Detection latency    1                    3                         <1
Perf. overhead       1                    2.22                      0.08
[Figure: block diagram. Recoverable labels: S, A, CPU1 (master), CPU 2, fRCPU, Shadow Proc Unit, Checkers.]
7. References
[1] AUTOSAR consortium, http://www.autosar.org
[2] Cristian Constantinescu, "Trends and Challenges in VLSI Circuit Reliability," IEEE Micro, vol. 23, no. 4, pp.
14-19, Jul/Aug, 2003
[3] R. Mariani, "Soft errors on digital components," chapter in "Fault Injection Techniques and Tools for VLSI
Reliability Evaluation", Kluwer Academic Publishers
[4] CEI International Standard IEC 61508, 1998-2000.
[5] T. Fruehling, Delphi Secured Microcontroller Architecture, SAE Technical Paper, 2000-01-1052
[6] US Patent n.5880568 by Robert Bosch GmbH
[7] DE Patent n.19933086 by Robert Bosch GmbH
[8] S.Brewerton, R.Schneider, D.Eberhard, Implementation of a Basic Single-Microcontroller Monitoring
Concept for Safety Critical Systems on a Dual Core Microcontroller, SAE 2007 World Congress &
Exhibition, 2007-01-1486
[9] H. Komaki et al., Development of Low-Cost Standard Air Bag ECU, FUJITSU TEN Tech. F. No.18, 2002
[10] PROGRAM-CONTROLLED UNIT WITH MONITORING DEVICE, Patent WO/2003/029979
[11] Todd Austin, David Blaauw, Trevor Mudge, Krisztián Flautner, Making Typical Silicon Matter with
Razor, Computer, v.37 n.3, p.57-65, March 2004
[12] R. Mariani, P. Fuhrmann, B. Vittorelli, Cost-effective Approach to Error Detection for an Embedded
Automotive Platform, 2006-01-0837, SAE 2006 World Congress & Exhibition, April 2006, Detroit, MI, USA
[13] R. Mariani, G. Boschi and F. Colucci, Using an Innovative SoC-Level FMEA Methodology to Design in
Compliance with IEC 61508, DATE 2007 Conference, April 2007, Nice, France
[14] R. Mariani, A Platform-based Technology For Fault-robust SoC Design, IP/SOC 2006 Conference,
December 2006, Grenoble, France
[15] R. Mariani, P. Fuhrmann, F. Colucci, Safety Integrity of Memory Sub-systems in Automotive
Microcontrollers, 2007-01-1494, SAE 2007 World Congress & Exhibition, April 2007, Detroit MI, USA
[16] M. Peri, S. Pezzini, A. Ferrari, A. Sangiovanni-Vincentelli, M. Baleani, Fault Tolerant Platforms for
Automotive Safety Critical Applications, CASES'03, Oct. 30-Nov. 2, 2003, San Jose, California