
E&CE 427: Digital System Engineering

Mark Aagaard
University of Waterloo
Dept of Electrical and Computer Engineering
2003t1Winter
March 24, 2003


Contents
I Lecture Notes
1 VHDL
LEC-02: Introduction to VHDL
1.1 Prelude
1.1.1 Topics in this Chapter
1.1.2 Background Material
1.1.3 Recommended Reading
1.2 Introduction to VHDL
1.2.1 VHDL Origins and History
1.2.2 Semantics
1.2.3 Synthesis of a Simulation-Based Language
1.2.4 Solution to Synthesis Sanity
1.2.5 VHDL Disadvantages
1.2.6 VHDL Advantages
1.2.7 VHDL and Other Languages
1.2.7.1 VHDL vs Verilog
1.2.7.2 VHDL vs SystemC
1.2.7.3 VHDL vs Other Hardware Description Languages
1.2.7.4 Summary of VHDL Evaluation
1.3 Overview of Syntax
1.3.1 Syntactic Categories
1.3.2 Library Units
1.3.3 Entities and Architecture
1.3.4 Concurrent Statements
1.3.5 Component Declaration and Instantiations
1.3.6 Processes
1.3.7 Sequential Statements
1.4 Concurrent vs Sequential Statements
1.4.1 Concurrent Assignment vs Process
1.4.2 Conditional Assignment vs If Statements
1.4.3 Selected Assignment vs Case Statement

1.4.4 Coding Style
1.5 Overview of Processes
1.5.1 Combinational Process vs Clocked Process
1.5.2 Latch Inference
1.5.3 Combinational vs Flopped Signals
LEC-03: Details of Process Execution
1.6 Details of Process Execution
1.6.1 Definitions and Algorithm
1.6.1.1 Temporal Granularities of Simulation
1.6.1.2 Process Modes
1.6.1.3 Simulation Algorithm
1.6.1.4 Delta-Cycle Definitions
1.6.2 Example: Process Execution
1.6.3 Example: Need for Provisional Assignments
LEC-04: Hardware Building Blocks
1.7 VHDL and Hardware Building Blocks
1.7.1 Basic Building Blocks
1.7.2 Deprecated Building Blocks for RTL
1.7.3 Hardware and Code for Flops
1.7.3.1 Flip-Flops vs Latches
1.7.3.2 Flops with Waits and Ifs
1.7.3.3 Flops with Synchronous Reset
1.7.3.4 Flops with Chip-Enable
1.7.3.5 Flops with Chip-Enable and Mux on Input
1.7.3.6 Flops with Chip-Enable, Muxes, and Reset
1.7.4 An Example Sequential Circuit
1.8 Synthesizable vs Non-Synthesizable Code
1.8.1 Unsynthesizable Code
1.8.1.1 Initial Values
1.8.1.2 Wait For
1.8.1.3 Different Wait Conditions
1.8.1.4 Multiple if rising_edges in Same Process
1.8.1.5 if rising_edge and wait in Same Process
1.8.1.6 if rising_edge with else Clause
1.8.1.7 if rising_edge Inside a for Loop
1.8.1.8 wait Inside of a for loop
1.8.2 Synthesizable, but Undesirable Hardware
1.8.2.1 Asynchronous Reset
1.8.2.2 Bad Form of Nested Ifs
1.8.2.3 Deeply Nested Ifs
1.9 Numbers, Arithmetic, Arrays, and Signals
1.9.1 Arithmetic Packages

1.9.2 Shift and Rotate Operations
1.9.3 Overloading of Arithmetic
1.9.4 Different Widths and Arithmetic
1.9.5 Overloading of Comparisons
1.9.6 Different Widths and Comparisons
1.9.7 Type Conversion

2 RTL Design with VHDL
2.1 Prelude to Chapter
2.1.1 Topics in this Chapter
LEC-05: Dataflow Diagrams
2.2 Design Flow
2.2.1 Generic Design Flow
Additional material in notes
2.2.2 Implementation Flows
2.2.3 Classes of Hardware
2.2.4 Design Flow: Datapath vs Control vs Storage
2.2.4.1 Datapath-Centric Design Flow
2.2.4.2 Control-Centric Design Flow
2.2.4.3 Storage-Centric Design Flow
2.3 Dataflow Diagrams and High-Level Models
2.3.1 Overview of Example
2.3.1.1 Software vs Hardware Algorithms
2.3.1.2 Serial vs Parallel
2.3.2 Dataflow Diagrams
2.3.2.1 Dataflow Diagrams Overview
2.3.2.2 Area Estimation
2.3.3 Dataflow Diagram Execution
2.3.3.1 Performance Estimation
2.3.3.2 Design Analysis
2.3.4 Area / Performance Tradeoffs
2.3.5 Optimize Inputs and Outputs
2.3.6 From Dataflow Diagram to High-Level Model
2.3.7 From Dataflow Diagram to DP+Ctrl Model
2.3.7.1 Datapath for DP+Ctrl Model
2.3.8 Dataflow Diagram Scheduling
2.3.9 Summary: From Dataflow to Hardware
LEC-06: State Machine Design
2.4 Finite State Machines in VHDL
2.4.1 Mealy vs Moore State Machines
2.4.2 State Machines and VHDL
2.4.2.1 Implicit and Explicit State Machines

2.4.3 Some Simple State Machines
2.4.3.1 Implementing a Simple Moore Machine
2.4.3.2 Implementing a Simple Mealy Machine
2.4.4 State Encoding
2.4.4.1 Constants vs Enumerated Type
2.4.4.2 Encoding Schemes
2.4.5 From Dataflow to State Machine
2.4.6 Implicit vs Explicit State Machines
2.4.7 Implicit State Machines
2.4.7.1 Multi-Wait Process
2.4.7.2 Counter
2.4.8 Explicit State Machines
2.4.8.1 State Machine
2.4.8.2 Conditional Assignment
2.4.8.3 Conditional Assignment with Don't Care
2.4.8.4 Selected Assignment with Don't Care
2.4.8.5 Case Statement
2.4.9 Reset
2.4.10 Input / Output Protocols
LEC-07: Memory Design
2.5 Memory Arrays and RTL Design
2.5.1 Memory Arrays and Dataflow Diagrams
2.5.1.1 Legend for Dataflow Diagrams
2.5.1.2 Basic Memory Operations
2.5.1.3 Data Dependencies
2.5.1.4 Definition of Three Types of Dependencies
2.5.1.5 Dataflow Diagrams and Data Dependencies
2.5.1.6 Example: Memory Array and Dataflow Diagram
2.5.2 Memory Arrays in VHDL
2.5.2.1 Two-Dimensional Array
2.5.2.2 Memory Array in Hardware
2.5.2.3 Example VHDL Code for Memory Array in Hardware
2.5.2.4 Library Component
2.5.2.5 Build Memory from Slices
2.5.2.6 Dual-Ported Memory
LEC-08: Design Example: Stack
2.6 Design Example: Stack
2.6.1 Stack Requirements
2.6.1.1 Stack Entity
2.6.1.2 Stack Instructions

2.6.1.3 Stack Instruction Encoding
2.6.1.4 Miscellaneous Requirements
2.6.2 Stack Algorithm
2.6.3 Stack Dataflow Diagrams
2.6.3.1 Initial Diagrams
2.6.3.2 Partition into Clock Cycles
2.6.3.3 High-Level Model
2.6.3.4 Individual Block Diagrams
2.6.3.5 Complete Block Diagram
2.6.4 Stack: Register Transfer Level
2.6.4.1 Stack: Separate Control, Datapath and Storage
2.6.4.2 Stack: Datapath Operations
2.6.4.3 Stack: Explicit State Machine
LEC-09: Guidelines and Optimization Techniques
2.7 RTL Coding Guidelines
2.7.1 Design Process
2.7.2 Signal Declarations
2.7.3 Processes
2.7.4 Flip-Flops and Latches
2.7.4.1 Multiplexors and Tri-State Signals
2.7.5 State Machines
2.7.5.1 Reset
2.7.6 Inputs and Outputs
2.8 Additional VHDL Features
2.8.1 Vectors
2.8.2 Still More VHDL Features
2.9 General Optimization Techniques
2.9.1 Strength Reduction
2.9.1.1 Arithmetic Strength Reduction
2.9.1.2 Boolean Strength Reduction
2.9.2 Replication and Sharing
2.9.2.1 Mux-Pushing
2.9.2.2 Common Subexpression Elimination
2.9.2.3 Computation Replication
2.9.3 Arithmetic
2.9.4 Pipelining
LEC-10: FPGA-Specific Guidelines and Optimization
2.10 FPGA-Specific Guidelines
2.10.1 Generic FPGAs
2.10.1.1 Overview of Generic FPGA Hardware
2.10.1.2 Generic Clocks

2.10.1.3 Special Circuitry in FPGAs
2.10.2 Altera APEX20K
2.11 Example Circuits
2.11.1 Ripple-Carry Adder
2.11.2 Barrel Shifter

3 Functional Validation
LEC-11: Functional Validation of Datapath Circuits
3.1 Overview
3.1.1 Validation / Verification / Testing
3.1.2 Why Your First Circuit Will Not Work
3.2 Test Cases
3.2.1 Coverage
3.2.2 Heating System Example
3.2.2.1 Number of Cases to Consider
3.2.2.2 Representation Simplification
3.2.3 Floating Point Divider Example
3.2.4 Functional Validation Challenges
3.3 Testbenches
3.3.1 Overview of Test Benches
3.3.2 Reference Model Style Testbench
3.3.3 Relational Style Testbench
3.3.4 Coding Structure of a Testbench
3.3.5 Datapath vs Control
3.4 Functional Validation for Datapath Circuits
3.4.1 A Spec-Less Testbench
3.4.2 Use an Array for Test Vectors
3.4.3 Build Spec into Stimulus
3.4.4 Have Separate Specification Entity
3.4.5 Generate Test Vectors
3.4.6 Relational Specification
LEC-12: Functional Validation of State Machines
3.5 Functional Validation of Control Circuits
3.5.1 Overview of Queues in Hardware
3.5.2 VHDL Coding
3.5.2.1 Package
3.5.2.2 Other VHDL Coding
3.5.3 Code Structure for Validation
3.5.4 Instrumentation Code
3.5.5 Coverage Monitors
3.5.6 Assertions
3.5.7 VHDL Coding Tips

3.5.8 Queue Specification
3.5.9 Queue Testbench

4 Performance Analysis and Optimization
4.1 Intro
4.1.1 Concepts
4.1.2 Background Material
4.1.3 Reading Material
LEC-13: Introduction to Performance Analysis
4.2 Defining Performance
4.3 Comparing Performance
4.3.1 Performance for Different Tasks
4.3.2 Optimizing Performance
4.4 Clock Speed, CPI, Program Length, and Performance
4.4.1 Mathematics
4.4.2 Example: CISC vs RISC and CPI
4.4.3 Summary of Equations
LEC-14: Performance and Dataflow Diagrams
4.5 Performance Analysis and Dataflow Diagrams
4.5.1 Dataflow Diagrams, CPI, and Clock Speed
4.5.1.1 Tradeoffs
4.5.2 Dataflow Diagram with Two Instructions
4.5.2.1 Scheduling of Operations for Different Clock Periods
4.5.2.2 Performance Computation for Different Clock Periods
4.5.2.3 Example: Two Instructions Taking Similar Time
4.5.2.4 Example: Same Total Time, Different Order for A
4.5.3 Example: From Algorithm to Optimized Dataflow
4.5.4 Optimality: Performance vs Area Tradeoffs
4.5.5 Effect of Instruction Set on Performance
4.5.6 Effect of Time to Market on Relative Performance

5 Timing Analysis
5.1 Preliminaries
5.1.1 Overview
5.1.2 Background Material
5.1.3 Reading Material
LEC-15: Introduction to Timing Analysis
5.2 Delays and Definitions
5.2.1 Related Background Definitions
5.2.2 Timing Constraints
5.2.2.1 Minimum Clock Period
5.2.2.2 Hold Constraint
5.2.3 Clock-Related Timing Definitions
5.2.3.1 Clock Skew (Smith 6.5.1)
5.2.3.2 Clock Latency (Smith 6.5.1)
5.2.3.3 Clock Jitter (Smith p. 873)
5.2.4 Storage Related Timing Definitions (Smith 2.5.2)
5.2.4.1 Setup Time
5.2.4.2 Hold Time
5.2.4.3 Clock-to-Q Time
5.2.4.4 Example Timing Violations
5.2.5 Propagation Delays
5.2.5.1 Load Delays (Smith 3.1)
5.2.5.2 Interconnect Delays (Smith 7.1)
5.3 Critical Paths: False and True
5.3.1 Critical Path Example
5.3.2 Algorithm to Find Critical Path
5.3.2.1 Critical Path Between Two Signals
5.3.2.2 Critical Path Between Sets of Signals
5.3.3 False Paths
5.3.3.1 Static False Path Example
5.3.3.2 Dynamic False Path Example
CHANGE ver2 (2002/12/02): corrected edge polarity on a
5.3.3.3 Another Dynamic False Path Example
5.3.3.4 And Another Dynamic False Path Example
5.3.3.5 Algorithm for False Path Detection
5.3.4 Increasing the Accuracy of Critical Path Analysis
LEC-16: Math, Physics, and Applications of Timing Analysis
5.4 Analog Effects in Timing Analysis
5.4.1 Timing Model (Smith 3.1, 13.6)
5.4.1.1 Equation for Output Voltage
5.4.1.2 Extrinsic / Intrinsic Delays (Smith 13.6)

5.4.2 Data-Dependent Delay
5.4.3 Interconnect Delay (Smith 7.1)
5.4.3.1 Elmore Time Constant (Smith 7.1.2)
5.4.3.2 Interconnect with Single Fanout
5.4.3.3 Interconnect with Multiple Gates in Fanout
5.4.3.4 FPGAs, Interconnect, and Synthesis
5.5 Practical Usage of Timing Analysis
5.5.1 Speed Binning (Smith 5.1.6)
5.5.2 Worst Case Timing (Smith 5.1.7)
5.5.2.1 Fanout delay
5.5.2.2 Derating Factors
LEC-17: Timing Analysis (Latches and Flip Flops)
5.6 Timing Analysis of Latches and Flip Flops
5.6.1 Simple Latch
5.6.2 Clock-to-Q Time of a Simple Latch
5.6.3 Setup Timing of a Simple Latch
5.6.3.1 Hold Time of a Simple Latch
5.6.3.2 Example of a Bad Latch
5.6.4 Timing Analysis of a Transmission Gate Latch
5.6.4.1 Transmission Gate (Smith 2.4.3)
5.6.4.2 Transmission Gate Latch (Smith 2.5.1)
5.6.4.3 Clock-to-Q Delay for Latch
5.6.4.4 Setup and Hold Times for Latch
5.6.5 Falling Edge Flip Flop (Smith 2.5.2)
5.6.5.1 Behaviour of Flip-Flop
5.6.5.2 Clock-to-Q of Flip-Flop
5.6.5.3 Setup of Flip-Flop
5.6.5.4 Hold of Flip-Flop
5.6.6 Timing Analysis of FPGA Cells (Smith 5.1.5)
5.6.6.1 Standard Timing Equations
5.6.6.2 Hierarchical Timing Equations
5.6.6.3 Actel Act 2 Logic Cell
5.6.6.4 Timing Analysis of Actel Sequential Module
5.6.7 Exotic Flop

6 Power Analysis and Design
LEC-18: Introduction to Power
6.1 Overview
6.1.1 Importance of Power and Energy
6.1.2 Industrial Names and Products
6.1.3 Power vs Energy
6.1.4 Batteries, Power and Energy
6.1.4.1 Do Batteries Store Energy or Power?
6.1.4.2 Battery Life and Efficiency
6.1.5 Example Problem: Battery Life and Power
6.2 Power Equations
6.2.1 Dynamic Power and Activity Factor
6.2.2 Switching Power
6.2.3 Short-Circuited Power
6.2.4 Leakage Power
6.2.5 Glossary
6.2.6 Note on Power Equations
LEC-19: Data Encoding for Power Reduction
6.3 Overview of Power Reduction Techniques
6.4 Voltage Reduction for Power Reduction
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power
6.5.2 Example Problem
6.5.2.1 Problem Statement
6.5.2.2 Additional Information
6.5.2.3 Answer
LEC-20: Clock Gating for Power Reduction
6.6 Clock Gating
6.6.1 Introduction to and Overview of Clock Gating
6.6.1.1 Examples of Clock Gating
6.6.1.2 Design Tradeoffs
6.6.1.3 Functional Validation and Clock Gating
6.6.2 Implementing Clock Gating
6.6.2.1 Simple Power Analysis
6.6.2.2 Valid-Bit Protocol
6.6.2.3 Clock Gating and Big Circuit
6.6.2.4 Designing Clock Gating Circuitry
6.6.3 Design Problem
6.6.3.1 Solution Sketch

7 Fault Testing and Testability
7.1 Introduction
7.1.1 Purpose and List of Concepts
7.1.2 Background Material
7.1.3 Reading Material
LEC-21: Introduction to Faults, Testing, and Testability
7.2 Faults and Testing
7.2.1 Overview of Faults and Testing
7.2.1.1 Faults (Smith 14.3)
7.2.1.2 Causes of Faults (Smith 14.3)
7.2.1.3 Testing (Smith 14)
7.2.1.4 Burn In (Smith 14.3.1)
7.2.1.5 Bin Sorting (Smith 5.1.6)
7.2.1.6 Testing Techniques (Smith 14)
7.2.1.7 Design for Testability (DFT) (Smith 14.6)
7.2.2 Example Problem: Economics of Testing (Smith 14.1)
7.2.3 Physical Faults (Smith 14.3.3)
7.2.3.1 Types of Physical Faults
7.2.3.2 Locations of Faults
7.2.3.3 Layout Affects Locations
7.2.3.4 Naming Fault Locations
7.2.4 Detecting a Fault
7.2.4.1 Which Test Vectors will Detect a Fault?
7.2.4.2 A Single Test-Vector Can Detect Several Faults
7.2.5 Mathematical Models of Faults (Smith 14.3.4)
7.2.5.1 Single Stuck-At Fault Model
7.2.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
7.2.6.1 Algorithm
7.2.6.2 Example of Finding a Test Vector
7.2.7 Undetectable Faults
7.2.7.1 Redundant Circuitry
7.2.7.2 Curious Redundant Circuitry and Fault Detection
LEC-22: Fault Detection and Test-Vector Generation
7.3 Faults
7.3.1 Locations of Faults
7.3.2 Choosing Test Vectors (Smith 14.3.7)
7.3.2.1 Fault Domination
7.3.2.2 Fault Equivalence
7.3.2.3 Gate Collapsing

7.3.2.4 Node Collapsing
7.3.2.5 Fault Collapsing Summary
7.3.3 Fault Coverage
7.3.4 Generate Test Vectors for 100% Coverage
7.3.4.1 Collapse the Faults
7.3.4.2 Check for Fault Domination
7.3.4.3 Required Test Vectors
7.3.4.4 Faults Not Covered by Required Test Vectors
7.3.4.5 Order to Run Test Vectors
7.3.4.6 Summary of Technique to Find and Order Test Vectors
7.3.4.7 Complete Analysis
7.3.5 One Fault Hiding Another
LEC-23: Built In Self Test
7.4 Built In Self Test (Smith 14.7)
7.4.1 Block Diagram
7.4.1.1 Components
7.4.1.2 Linear Feedback Shift Register (LFSR)
7.4.1.3 Maximal-Length LFSR
7.4.2 Test Generator
7.4.3 Signature Analyzer
7.4.4 Result Checker
7.4.5 Arithmetic over Binary Fields
7.4.6 Shift Registers and Characteristic Polynomials (Smith 14.7.5)
7.4.6.1 Circuit Multiplication
7.4.7 Bit Streams and Characteristic Polynomials
7.4.8 Division
7.4.9 Signature Analysis: Math and Circuits
7.4.10 Summary
LEC-24: Scan Testing (JTAG)
7.5 Scan Testing in General (Smith 14.6)
7.5.1 Structure and Behaviour of Scan Testing
7.5.2 Scan Chains
7.5.2.1 Circuitry in Normal Mode
7.5.2.2 Scan in Operation
7.5.2.3 Scan in Operation with Example Circuit
7.5.3 Summary of Scan Testing
7.5.4 Example: Time to Test a Chip
7.6 Boundary Scan
7.6.1 Boundary Scan History
7.6.2 Scan Pins

7.6.3 Scan Registers and Cells
7.6.4 Scan Instructions
7.6.5 TAP Controller
7.6.6 Other descriptions of JTAG/IEEE 1149.1
7.7 Summary and Conclusions on Testing
7.7.1 Faults
7.7.2 Testing
7.7.2.1 Scan Testing
7.7.2.2 Built-In Self Test (BIST)
7.7.3 Scan vs Self Test

8 Review
LEC-25: Review
8.1 Overview of the Term
8.2 VHDL
8.3 Design and Optimization Techniques
8.4 Validation
8.5 Performance Prediction and Analysis
8.6 Timing Analysis
8.7 Power
8.8 Testing
8.9 Formulas to be Given on Final Exam


II Solutions to Tutorial Notes


1 VHDL Problems
SOL-01: VHDL Syntax
1.1 IEEE 1164
1.2 Flops, Latches, and Combinational Circuitry
1.3 Counting Clock Cycles
1.4 Arithmetic Overflow
1.5 8-Bit Register
1.5.1 Asynchronous Reset
1.5.2 Discussion
1.5.3 Testbench for Register
1.6 VHDL Syntax
SOL-02: VHDL Semantics
1.7 Clock-Cycle Simulation
1.8 Delta-Cycle Simulation: Pong
1.9 Delta-Cycle Simulation: Femur
1.10 VHDL VHDL Behavioural Comparison: Teradactyl
1.11 VHDL VHDL Behavioural Comparison: Ichtyostega
1.12 Waveform VHDL Behavioural Comparison
1.13 Hardware VHDL Comparison
1.14 Synthesizable VHDL and Hardware
1.15 Datapath Design
1.15.1 Correct Implementation?
1.15.2 Smallest Area
1.15.3 Shortest Clock Period

1
3 1 2 3 6 9 10 12 13 14 16 1 2 5 7 10 12 15 18 20 22 23 31 33

CONTENTS
2 Design Problems
SOL-03: Datapath and Control Design
2.1 Synthesis
2.1.1 Data Structures
2.1.2 Own Code vs Libraries
2.2 Design Guidelines
2.3 Dataflow Diagram Optimization
2.3.1 Resource Usage
2.3.2 Optimization
2.4 Dataflow Diagram Design
2.4.1 Maximum performance
2.4.2 Minimum area
2.5 Design and Optimization
SOL-04: Memory Design
2.6 Dataflow Diagrams with Memory Arrays
2.6.1 Algorithm 1
2.6.2 Algorithm 2
SOL-05: Optimization and FPGA Implementation
2.7 2-bit adder
2.7.1 Generic Gates
2.7.2 Xilinx FPGA
2.8 Sketches of Problems
3 Functional Validation Problems
SOL-06: Functional Validation
3.1 Functional Validation Problems
3.1.1 Carry Save Adder
3.1.2 Traffic Light Controller
3.1.3 State Machines and Validation
3.1.4 Additional Problem
3.1.5 Test Plan Creation
3.1.5.1 Early Tests
3.1.5.2 Corner Cases
4 Performance Analysis and Optimization Problems
SOL-07: Performance Analysis and Optimization
4.1 Farmer
4.2 Network and Router
4.2.1 Maximum Throughput
4.2.2 Packet Size and Performance
4.3 Performance Short Answer
4.4 Microprocessors
4.4.1 Average CPI
4.4.2 Why not you too?
4.4.3 Analysis
4.5 Dataflow Diagram Optimization
4.6 Optimization with Memory Arrays
4.7 Multiply Instruction
4.7.1 Highest Performance
4.7.2 Optimality
4.7.3 Performance Metrics

5 Timing Analysis Problems
SOL-08: Timing Analysis
5.1 Terminology
5.2 Critical Path and False Path
5.3 Critical Path
5.3.1 Ignoring potential false paths, list the signals in the critical path through this circuit.
5.3.2 What is the combinational delay through the critical path?
5.3.3 Missing Factors
5.3.4 False Path?
5.4 Timing Models
5.5 Worst Case Conditions and Derating Factor
5.5.1 Worst-Case Commercial
5.5.2 Worst-Case Industrial
5.5.3 Worst-Case Industrial, Non-Ambient Junction Temperature
SOL-09: Timing Analysis (II)
5.6 Short Answer
5.6.1 Wires in FPGAs
5.6.2 Age and Time
5.6.3 Temperature and Delay
5.7 Hold Time Violations
5.7.1 Cause
5.7.2 Behaviour
5.7.3 Rectification
5.8 Latch Analysis
5.9 Combinational Timing (Smith 13.23)

6 Power Problems
SOL-10: Power Analysis and Reduction
6.1 Power Analysis and Reduction Problems
6.1.1 Short Answers
6.1.1.1 Power and Temperature
6.1.1.2 Leakage Power
6.1.1.3 Clock Gating
6.1.1.4 Gray Coding
6.1.2 VLSI Gurus
6.1.2.1 Effect on Power
6.1.2.2 Critique
6.1.3 Advertising Ratios
SOL-11: Power Analysis and Reduction
6.1.4 Vary Supply Voltage
6.1.5 Power Reality and Math (Smith prob 15.16)
6.1.6 Clock Speed Increase Without Power Increase
6.1.6.1 Supply Voltage
6.1.6.2 Supply Voltage
6.1.7 Power Reduction Strategies
6.1.7.1 Supply Voltage
6.1.7.2 Transistor Sizing
6.1.7.3 Adding Registers to Inputs
6.1.7.4 Gray Coding
6.1.8 Power Consumption on New Chip
6.1.8.1 Hypothesis
6.1.8.2 Experiment
6.1.8.3 Reality
7 Problems on Faults, Testing, and Testability
SOL-12: Faults, Testing, and Testability
7.1 Based on Smith q14.9: Testing Cost
7.2 Testing Cost and Total Cost
7.3 Minimum Number of Faults
7.4 Smith q14.10: Fault Collapsing
7.5 Mathematical Models and Reality
7.6 Undetectable Faults
7.7 Test Vector Generation
7.7.1 Choice of Test Vectors
7.7.2 Number of Test Vectors
7.8 Time to do a Scan Test
7.9 BIST
7.9.1 Characteristic Polynomials
7.9.2 Test Generation
7.9.3 Signature Analyzer
7.9.4 Probability of Catching a Fault
7.9.5 Probability of Catching a Fault
7.9.6 Detecting a Specific Fault
7.9.7 Time to Run Test
7.10 Power and BIST
7.11 Timing Hazards and Testability
7.12 Testing Short Answer
7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
7.13 Fault Testing
7.13.1 Design test generator
7.13.2 Design signature analyzer
7.13.3 Determine if a fault is detectable
7.13.4 Testing time

Part I

Lecture Notes

Chapter 1

VHDL: The Language

LEC-02 Preliminaries

LEC-02: Introduction to VHDL


Lecture Notes Sections: 1.1 to 1.5.3

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 05   VHDL Design and Optimization
    wk-01     Overview and VHDL
    wk-02     VHDL Semantics
    wk-03     Dataflow Diagrams; State Machines
    wk-04     Memory Design; Example Design
    wk-05     Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Concepts
Lecture Notes: Sections 1.1 to 1.5.3

synthesis, simulation, entity, architecture, process, concurrent statement, sequential statement, port, type, direction, signal, combinational process, clocked process, latch inference

1.1 Prelude

1.1.1 Topics in this Chapter

VHDL syntax
VHDL semantics
Synthesizing VHDL

1.1.2 Background Material

Smith Chapters 1 and 2

1.1.3 Recommended Reading

Links to many VHDL resources are on the E&CE 427 web pages under Documentation. In addition to Smith, two other books on VHDL are on reserve in the Davis Centre Library:

Designer's Guide to VHDL, Peter J. Ashenden
VHDL for Logic Synthesis, Andrew Rushton

Relevant chapters in Smith: 8 (Software), 10 (VHDL), 12 (Synthesis); Appendix A.

Suggested reading order in Smith:

First pass:
Ch 8
10.5 entities and architectures
10.10 sequential statements
10.13 concurrent statements
10.14 execution
12.2 synthesis
12.6 VHDL logic synthesis

Second pass:
10.11 operators
10.12 arithmetic
12.7 FSM synthesis
12.8 Memory synthesis

Third pass:
10.1 to 10.4 intro to VHDL
10.6 packages and libraries
10.8 type declarations
10.9 other declarations
10.15 configurations and specifications
10.16 example: engine controller
remainder of Ch 12

Reference material:
Table 10.27: VHDL summary
Table 10.28: VHDL definitions
Appendix A: VHDL syntax

1.2 Introduction to VHDL

1.2.1 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

VHDL is a lot more than synthesis of digital hardware

VHDL History

Developed by the United States Department of Defense as part of the very high speed integrated circuit (VHSIC) program in the early 1980s.
The Department of Defense intended VHDL to be used for the documentation, simulation and verification of electronic systems.
Goals:
improve the design process over schematic entry
standardize design descriptions amongst multiple vendors
portable and extensible

VHDL History (Cont'd)

Inspired by the Ada programming language
large: 97 keywords, 94 syntactic rules
verbose (designed by committee)
static type checking, overloading
complicated syntax: parentheses are used for both expression grouping and array indexing
Example:
a <= b * (3 + c);   -- integer
a <= (3 + c);       -- 1-element array of integers

VHDL History (Cont'd)

Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993 and 2000.
In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164 (IEEE Standard 1164-1993), was developed.
std_logic_1164 defines 9 different values for signals (see Smith Section 10.6.2).
In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3-1997).
numeric_std defines arithmetic over std_logic vectors and integers. NB: This is the package that you should use for arithmetic. Don't use std_logic_arith: it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools.
numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.
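As an illustration of the recommendation to use numeric_std, the sketch below adds two 8-bit values and adds an integer to a vector. The entity name add_example and its port names are hypothetical, chosen only for this example; the unsigned type and the "+" operators come from ieee.numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity add_example is
  port (
    a, b : in  unsigned(7 downto 0);
    sum  : out unsigned(7 downto 0);
    inc  : out unsigned(7 downto 0)
  );
end add_example;

architecture main of add_example is
begin
  -- unsigned + unsigned: result is 8 bits wide, so it wraps modulo 2**8
  sum <= a + b;
  -- unsigned + integer: mixed signal/integer arithmetic from numeric_std
  inc <= a + 1;
end main;
```

Declaring the ports as unsigned (rather than plain std_logic_vector) is one possible design choice; it makes the intended interpretation of the bits explicit and avoids type conversions around the arithmetic.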

1.2.2 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

(Figure: c <= a AND b; passes through simulation to give the behaviour of signals a, b, and c.)

But now, VHDL is used in both simulation and synthesis. Synthesis is concerned with the structure of the circuit. Synthesis converts one type of description (behavioural) into another, lower-level description (usually a netlist).

(Figure: c <= a AND b; passes through synthesis to give a gate with inputs a and b and output c.)

Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.


CAD Tools
CAD Tools allow designers to automate lower-level design processes in implementing the desired functionality of a system. NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.

Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.

(Figure: c <= a AND b; passes through synthesis to give a gate with inputs a and b and output c.)

But the VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.

(Figure: the statement c <= a AND b; shown alongside circuit diagrams with inputs a and b and output c.)

1.2.3 Synthesis of a Simulation-Based Language

Not all of VHDL is synthesizable
c <= a AND b; (synthesizable)
c <= a AND b AFTER 2 ns; (NOT synthesizable)
how do you build a circuit with exactly 2 ns of delay through an AND gate?
more examples of non-synthesizable code are in section 1.8
Different synthesis tools support different subsets of VHDL
Some tools generate erroneous hardware for some code
behaviour of hardware differs from VHDL semantics
Some tools generate unpredictable hardware
There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don't yet conform to it. (Most vendors still don't have full support for the 1993 extensions to VHDL!) For more info, see http://www.vhdl.org/siwg/.
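To make the distinction concrete, here is a minimal simulation-only sketch in which the AFTER clause is perfectly legal and useful. The testbench name and_tb, the 10 ns wait times, and the assertion messages are assumptions made for this example; nothing here is intended for synthesis.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: no ports, simulation only.
entity and_tb is
end and_tb;

architecture sim of and_tb is
  signal a, b, c : std_logic := '0';
begin
  -- Simulation model of an AND gate with a 2 ns delay (not synthesizable).
  c <= a and b after 2 ns;

  stimulus : process
  begin
    a <= '1';
    b <= '1';
    wait for 10 ns;  -- longer than the 2 ns gate delay
    assert c = '1' report "c should be 1 once the delay has elapsed" severity error;
    b <= '0';
    wait for 10 ns;
    assert c = '0' report "c should be 0 once the delay has elapsed" severity error;
    wait;  -- suspend forever: end of stimulus
  end process;
end sim;
```

A simulator can honour the 2 ns delay exactly because it only has to schedule an event; a synthesis tool has no gate it can pick that guarantees exactly that delay.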

1.2.4 Solution to Synthesis Sanity

Pick a high-quality synthesis tool and study its documentation thoroughly
Learn the idioms of the tool
Different VHDL code with the same behaviour can result in very different circuits
Be careful if you have to port VHDL code from one tool to another
KISS: Keep It Simple Stupid
VHDL examples in lectures will illustrate reliable coding techniques for the Synopsys tools (and most other tools as well).
Follow the coding guidelines and examples from lecture
As you write VHDL, think about the hardware you expect to get.
NB: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc.)

1.2.5 VHDL Disadvantages

Some VHDL programs cannot be synthesized
Different tools support different subsets of VHDL.
Different tools generate different circuits for the same code
VHDL is verbose
Many characters to say something simple
VHDL is complicated and confusing
Many different ways of saying the same thing
Constructs that have similar purpose have very different syntax (case vs. select)
Constructs that have similar syntax have very different semantics (variables vs signals)
Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs combinational)
The infamous latch inference problem (see section 1.5.2 for more information)

1.2.6 VHDL Advantages

VHDL supports unsynthesizable constructs that are useful in writing testbenches and other non-hardware artifacts that we need in hardware design.
VHDL can be used throughout a large portion of the design process in different capacities, from specification to implementation to verification.
VHDL has static typechecking: many errors can be caught before synthesis and/or simulation. (In this respect, it is more similar to Java than to C.)
VHDL has a rich collection of datatypes
VHDL is a full-featured language with a good module system (libraries and packages).
VHDL has a well-defined standard.

1.2.7 VHDL and Other Languages

1.2.7.1 VHDL vs Verilog

Verilog is a simpler language: smaller language, simple circuits are easier to write
VHDL has more features than Verilog
richer set of data types and strong type checking
VHDL offers more flexibility and expressivity for constructing large systems.
The VHDL Standard is more standard than the Verilog Standard
VHDL and Verilog have simulation-based semantics
Simulation vendors generally conform to the VHDL standard
Some Verilog constructs don't simulate the same in different tools
VHDL is used more than Verilog in Europe and Japan
Verilog is used more than VHDL in North America
South-East Asia, India, South America: ?????

1.2.7.2 VHDL vs SystemC

SystemC looks like C: familiar syntax
C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable code as well?
If you think VHDL is hard to synthesize, try C....
SystemC simulation is slower than advertised

1.2.7.3 VHDL vs Other Hardware Description Languages

Superlog: A new language (still under active development) that is based on Verilog and C. The basic core comes from Verilog. C-like extensions are included to make the language more expressive and powerful. Developed by the Co-Design company.
Esterel: A language evolving from academia to commercial viability. Very clean semantics. Aimed at state machines, with limited support for datapath operations.

1.2.7.4 Summary of VHDL Evaluation

VHDL is far from perfect and has lots of annoying characteristics
VHDL is a better language for education than Verilog because the static typechecking enforces good software engineering practices
The richness of VHDL will be useful in creating concise high-level models and powerful testbenches

1.3 Overview of Syntax

This section is just a brief overview of the syntax of VHDL, focussing on the constructs that are most commonly used. Read a book on VHDL and use online resources. (Look for VHDL under the Documentation tab in the E&CE 427 web pages for more information.)

1.3.1 Syntactic Categories

There are five major categories of syntactic constructs. (There are many, many minor categories and subcategories of constructs.)

Library units (section 1.3.2): top-level constructs (packages, entities, architectures)
Concurrent statements (section 1.3.4): statements executed at the same time (in parallel)
Sequential statements (section 1.3.7): statements executed in series (one after the other)
Expressions: arithmetic (section 1.9), Boolean, vectors, etc.
Declarations: components, signals, variables, types, functions, ....

1.3.2 Library Units

Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations and otherwise bind together VHDL code.

Package body: defines the contents of a library
Package: determines which parts of the library are externally visible
Use clause: uses a library in an entity/architecture or another package
technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
Entity (section 1.3.3): defines the interface to a circuit
Architecture (section 1.3.3): defines the internal signals and gates of a circuit

See Smith Section 10.6 for information on packages and use clauses.

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair.

Figure 1.1: Entity and Architecture

The syntax of VHDL is defined using a variation on Backus-Naur form (BNF). See Smith Appendix A.1 for a description of the rules for understanding VHDL grammar.

Entity: interface
names, modes (in / out), and types of the externally visible signals of the circuit

Architecture: internals
structure and behaviour of the module

library ieee;
use ieee.std_logic_1164.all;
entity and_or is
  port (
    a, b, c : in std_logic;
    z : out std_logic
  );
end and_or;
Figure 1.2: Example of an entity

[ use_clause ]
entity ENTITYID is
  [ port (
      SIGNALID : (in | out) TYPEID [ := expr ] ;
    ); ]
  [ declaration ]
[ begin
  concurrent_statement ]
end [ entity ] ENTITYID ;
Figure 1.3: Simplified grammar of entity

Architecture
architecture main of and_or is
  signal x : std_logic;
begin
  x <= a AND b;
  z <= x OR (a AND c);
end main;
Figure 1.4: Example of architecture

[ use_clause ]
architecture ARCHID of ENTITYID is
  [ declaration ]
begin
  [ concurrent_statement ]
end [ architecture ] ARCHID ;
Figure 1.5: Simplified grammar of architecture

1.3.4 Concurrent Statements

Concurrent statements are used inside architectures
Concurrent statements execute in parallel (Figure 1.6)
Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally execute in parallel
VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output

architecture main of bowser is
begin
  x1 <= a AND b;
  x2 <= NOT x1;
  z <= NOT x2;
end main;

architecture main of bowser is
begin
  z <= NOT x2;
  x2 <= NOT x1;
  x1 <= a AND b;
end main;

(Circuit diagram: inputs a and b, internal signals x1 and x2, output z.)

Figure 1.6: The order of concurrent statements doesn't matter

conditional assignment
  ... <= ... when ... else ...;
  normal assignment (... <= ...); if-then-else style (uses when). Smith Section 10.13.4

selected assignment
  with ... select ... <= ... when ... | ..., ... when others;
  case/switch style assignment. Smith Section 10.13.4

component instantiation
  ...: ... port map ( ... => ..., ... );
  use an existing circuit. Section 1.3.5, Smith Section 10.13.6

for-generate
  ...: for ... in ... generate ... end generate;
  replicate some hardware. Smith Section 10.13.7

if-generate
  ...: if ... generate ... end generate;
  conditionally create some hardware. Smith Section 10.13.7

process
  process ... begin ... end process;
  the body of a process is executed sequentially. Sections 1.3.6, 1.6; Smith Section 10.10

Figure 1.7: The most commonly used concurrent statements
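As a concrete instance of the first two rows of Figure 1.7, the sketch below writes the same 4-to-1 multiplexer once with a conditional assignment and once with a selected assignment. The entity name mux_demo and all port names are hypothetical, introduced only for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity mux_demo is
  port (
    sel            : in  std_logic_vector(1 downto 0);
    d0, d1, d2, d3 : in  std_logic;
    y_cond, y_sel  : out std_logic
  );
end mux_demo;

architecture main of mux_demo is
begin
  -- Conditional assignment: if-then-else style, conditions tested in order.
  y_cond <= d0 when sel = "00" else
            d1 when sel = "01" else
            d2 when sel = "10" else
            d3;

  -- Selected assignment: case/switch style, choices must cover all values.
  with sel select
    y_sel <= d0 when "00",
             d1 when "01",
             d2 when "10",
             d3 when others;
end main;
```

The "when others" choice is needed in the selected assignment because sel is a std_logic_vector, which has many more values than the four binary patterns.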

1.3.5 Component Declaration and Instantiations

There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax. Not all tools support the VHDL-93 syntax. In particular for E&CE 427, the Synopsys tools do not fully support the VHDL-93 syntax. See Smith Section 10.13.6 for more discussion on the syntax of component declaration and instantiation.
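The sketch below contrasts the two styles, assuming that the "VHDL-93 syntax" referred to here is direct entity instantiation. It reuses the and_or entity from Figures 1.2 and 1.4; the wrapper entity top, its port names, and the instance labels u87 and u93 are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and_or is
  port (a, b, c : in std_logic; z : out std_logic);
end and_or;

architecture main of and_or is
  signal x : std_logic;
begin
  x <= a and b;
  z <= x or (a and c);
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper that instantiates and_or twice.
entity top is
  port (p, q, r : in std_logic; y87, y93 : out std_logic);
end top;

architecture main of top is
  -- VHDL-87 style: the component must be declared before it is instantiated.
  component and_or
    port (a, b, c : in std_logic; z : out std_logic);
  end component;
begin
  u87 : and_or
    port map (a => p, b => q, c => r, z => y87);

  -- VHDL-93 style: direct entity instantiation, no component declaration.
  u93 : entity work.and_or(main)
    port map (a => p, b => q, c => r, z => y93);
end main;
```

Given the note above about tool support, the VHDL-87 form is the safer choice when a tool does not fully support VHDL-93.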

1.3.6 Processes

Processes are used to describe the behaviour of hardware
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

Example Process with Sensitivity List

process (a, b, c)
begin
  y <= a AND b;
  if (a = '1') then
    z1 <= b AND c;
    z2 <= NOT c;
  else
    z1 <= b OR c;
    z2 <= c;
  end if;
end process;

Example Process with Wait Statements

process
begin
  y <= a AND b;
  z <= '0';
  wait until rising_edge(clk);
  if (a = '1') then
    z <= '1';
    y <= '0';
    wait until rising_edge(clk);
  else
    y <= a OR b;
  end if;
end process;

Processes must have either a sensitivity list or at least one wait statement on each execution path through the process. Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List
The sensitivity list contains the signals that are read in the process. An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. If you forget some signals, you will either end up with unpredictable hardware and simulation results (different results from different programs) or undesirable hardware (latches where you expected purely combinational hardware). For more on this topic, see sections 1.5.2 and 1.6. There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.
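The guideline and its one exception can be sketched side by side. The entity name sens_demo, the process labels, and the port names are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity sens_demo is
  port (
    clk, a, b, c : in  std_logic;
    y, q         : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Combinational process: every signal that is read (a, b, c)
  -- appears in the sensitivity list.
  comb : process (a, b, c)
  begin
    y <= (a and b) or c;
  end process;

  -- Flip-flop process: the exception. Only clk is listed,
  -- even though a is read inside the process.
  reg : process (clk)
  begin
    if rising_edge(clk) then
      q <= a;
    end if;
  end process;
end main;
```

If c were left out of the first sensitivity list, a simulator would not re-evaluate y when only c changes, which is the kind of simulation/synthesis mismatch the guideline is meant to prevent.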

Process Grammar
[ PROCLAB : ] process ( sensitivity_list )
  [ declaration ]
begin
  sequential_statement
end process [ PROCLAB ] ;
Figure 1.8: Simplified grammar of process

1.3.7 Sequential Statements

Used inside processes and functions.

wait                wait until ...;
signal assignment   ... <= ...;
if-then-else        if ... then ... elsif ... end if;
case                case ... is when ... | ... => ...; when ... => ...; end case;
loop                loop ... end loop;
while loop          while ... loop ... end loop;
for loop            for ... in ... loop ... end loop;
next                next ...;

Figure 1.9: The most commonly used sequential statements
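A small sketch of several of these statements working together: a process that computes the parity of an 8-bit vector with a for loop. The entity name parity8 and its ports are hypothetical, and the example also uses a process variable, which is not one of the statements listed in Figure 1.9.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity parity8 is
  port (
    d   : in  std_logic_vector(7 downto 0);
    odd : out std_logic
  );
end parity8;

architecture main of parity8 is
begin
  process (d)
    variable p : std_logic;  -- running XOR of the bits seen so far
  begin
    p := '0';
    for i in d'range loop    -- for loop: one iteration per bit of d
      p := p xor d(i);
    end loop;
    odd <= p;                -- signal assignment
  end process;
end main;
```

The statements in the process body run in order, so the loop has finished before odd is assigned.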

1.4 Concurrent vs Sequential Statements

Concurrent assignments can be translated into sequential statements. But not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

architecture main of tiny is
begin
  b <= a;
end main;

architecture main of tiny is
begin
  process (a)
  begin
    b <= a;
  end process;
end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements
t <= <val1> when <cond>
     else <val2>;

Sequential Statements
if <cond> then
  t <= <val1>;
else
  t <= <val2>;
end if;

1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

Concurrent Statements
with <expr> select
  t <= <val1> when <choices1>,
       <val2> when <choices2>,
       <val3> when <choices3>;

Sequential Statements
case <expr> is
  when <choices1> => t <= <val1>;
  when <choices2> => t <= <val2>;
  when <choices3> => t <= <val3>;
end case;
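Filling in the placeholders gives a compilable instance of the equivalence. The entity name equiv_demo, the port names, and the particular choices are assumptions for this example; the two outputs t_conc and t_seq should always carry the same value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity equiv_demo is
  port (
    sel           : in  std_logic_vector(1 downto 0);
    a, b, c       : in  std_logic;
    t_conc, t_seq : out std_logic
  );
end equiv_demo;

architecture main of equiv_demo is
begin
  -- Concurrent form: selected assignment.
  with sel select
    t_conc <= a when "00",
              b when "01" | "10",
              c when others;

  -- Sequential form: case statement inside a process.
  process (sel, a, b, c)
  begin
    case sel is
      when "00"        => t_seq <= a;
      when "01" | "10" => t_seq <= b;
      when others      => t_seq <= c;
    end case;
  end process;
end main;
```

Note that the process needs every signal it reads in its sensitivity list, whereas the concurrent form gets this for free.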

1.4.4 Coding Style

Code that's easy to write with sequential statements, but difficult with concurrent:

Sequential Statements
case <expr> is
  when <choice1> =>
    if <cond> then
      o <= <expr1>;
    else
      o <= <expr2>;
    end if;
  when <choice2> =>
    ...
end case;

Concurrent Statements
Overall structure:
with <expr> select
  t <= ... when <choice1>,
       ... when <choice2>;

Failed attempt:
with <expr> select
  t <= -- want to write:
       --   <val1> when <cond>
       --   else <val2>
       -- but conditional assignment
       -- is illegal here
       when c1,
       ... when c2;

Concurrent Statements (Cont'd)

Concurrent statement with correct behaviour, but messy:
t <= <expr1> when (<expr> = <choice1> AND <cond>)
     else <expr2> when (<expr> = <choice1> AND NOT <cond>)
     else ... ;

Lesson: complicated, nested control constructs are easier with sequential statements than with concurrent statements.

1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.

Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!

entity ENTITYID is
  interface declarations
end ENTITYID;

architecture ARCHID of ENTITYID is
begin
  concurrent statements
  process
  begin
    sequential statements
  end process;
  concurrent statements
end ARCHID;
Figure 1.10: Sequential statements in a process

Key concepts in VHDL semantics for processes:

VHDL mimics hardware
Hardware (gates) execute in parallel
Processes execute in parallel with each other
All possible orders of executing processes must produce the same simulation results (waveforms)
If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms.

It doesn't matter whether you are running on a single-threaded operating system, on a multi-threaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same. These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6) and lead to the phenomenon of latch inference (Section 1.5.2).

architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;

Execution sequences:
single threaded, procA before procB: A1 A2 A3 B1 B2
single threaded, procB before procA: B1 B2 A1 A2 A3
multithreaded, procA and procB in parallel: A1 A2 A3 alongside B1 B2

Figure 1.11: Different process execution sequences

Figure 1.12: All execution orders must have same behaviour

Sections 1.5.1 to 1.5.3 discuss the hardware generated by processes. Sections 1.6 to 1.6.3 discuss the behaviour and execution of processes.

1.5.1 Combinational Process vs Clocked Process


Each synthesizable process is either combinational or clocked.

Combinational process:

Executing the process takes part of one clock cycle.
Target signals are outputs of combinational circuitry.
A combinational process must have a sensitivity list.
A combinational process does not have any wait statements and does not have any 'event, rising_edge, or falling_edge tests in the conditions of if statements or in case statements.
Hardware is just combinational circuitry.

Clocked process:

Executing the process takes one (or more) clock cycles.
Target signals are outputs of flops.
Process contains one or more wait or if rising_edge statements.
Hardware contains combinational circuitry and flip-flops.

NOTE: Clocked processes are sometimes called "sequential processes", but this can be easily confused with "sequential statements", so in E&CE 427 we'll refer to synthesizable processes as either "combinational" or "clocked".

Example of Combinational Process

process (a, b, c)
begin
  p1 <= a;
  if (b = c) then
    p2 <= b;
  else
    p2 <= a;
  end if;
end process;

Example Clocked Processes

process
begin
  wait until rising_edge(clk);
  b <= a;
end process;

process (clk)
begin
  if rising_edge(clk) then
    b <= a;
  end if;
end process;
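The two clocked styles above can be placed side by side in one compilable design unit. The following is a minimal sketch, not part of the original notes: the entity name clocked_styles_sketch and the port names b_wait and b_if are hypothetical, chosen only for illustration. Under these assumptions, both outputs should show the same waveform in simulation, because each process copies a on every rising edge of clk.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: names are assumptions made for this sketch.
entity clocked_styles_sketch is
  port (
    clk    : in  std_logic;
    a      : in  std_logic;
    b_wait : out std_logic;  -- driven by the wait-statement style
    b_if   : out std_logic   -- driven by the sensitivity-list style
  );
end clocked_styles_sketch;

architecture main of clocked_styles_sketch is
begin
  -- Style 1: no sensitivity list; the process suspends at the wait statement.
  style_wait : process
  begin
    wait until rising_edge(clk);
    b_wait <= a;
  end process;

  -- Style 2: sensitivity list on clk; the if statement tests for the rising edge.
  style_if : process (clk)
  begin
    if rising_edge(clk) then
      b_if <= a;
    end if;
  end process;
end main;
```

Neither process contains both a sensitivity list and a wait statement; VHDL does not allow the two to be combined in one process.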

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

process (a, b, c)
begin
  if (a = '1') then
    z1 <= b;
    z2 <= b;
  else
    z1 <= c;
  end if;
end process;

[Figure: circuit with inputs a, b, c and outputs z1, z2]

Figure 1.13: Example of latch inference

When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.

If you want a latch or a flip-flop for the signal, then latch inference is good. If you want combinational circuitry, then latch inference is bad.

Causes of Latch Inference

Generally, latch inference refers to the unintentional creation of latches.
The usual cause of unintended latch inference is missing assignments to signals in if-then-else and case statements.
Latch inference happens during elaboration.
When using the Synopsys tools, look for "Inferred memory devices" in the output or log files.
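One common way to avoid an unintended latch is to give every target signal a default assignment at the top of the process, so that it is assigned on every pass. The sketch below applies this to the process of Figure 1.13. It is an illustration only: the entity name no_latch_sketch is hypothetical, and the default value '0' for z2 is an assumption, since the notes do not say what z2 should be when a is not '1'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity wrapping the process from Figure 1.13.
entity no_latch_sketch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch_sketch;

architecture main of no_latch_sketch is
begin
  process (a, b, c)
  begin
    -- Default assignment: z2 now receives a value on every pass.
    -- The choice of '0' is an assumption made for this sketch.
    z2 <= '0';
    if (a = '1') then
      z1 <= b;
      z2 <= b;   -- the last assignment in a pass overrides the default
    else
      z1 <= c;
    end if;
  end process;
end main;
```

Because z1 and z2 are now assigned on every pass through the process, neither signal needs to hold a previous value, so both can be built from combinational circuitry.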

1.5.3 Combinational vs Flopped Signals

Signals assigned to in combinational processes are combinational. Signals assigned to in clocked processes are outputs of flip-flops. The one exception to this can occur in a clocked process that contains a signal that is assigned to in every branch of every if-then-else and case statement. Such a signal might be generated as combinational logic. Mixing combinational and clocked signals in the same process is bad design discipline, because it can lead to different results from different synthesis tools. So, if you follow good coding practices, you won't need to worry about this exception.
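A simple way to follow this discipline is to keep combinational signals and flopped signals in separate processes. The following is a minimal sketch under that assumption; the entity name split_sketch and the signal names y and q are hypothetical and do not come from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: y is intended to be combinational, q to be flopped.
entity split_sketch is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic;
    y    : out std_logic;
    q    : out std_logic
  );
end split_sketch;

architecture main of split_sketch is
begin
  -- Combinational process: sensitivity list, no clock edge, y assigned on every pass.
  comb : process (a, b)
  begin
    y <= a and b;
  end process;

  -- Clocked process: q is assigned only on a rising edge of clk.
  reg : process (clk)
  begin
    if rising_edge(clk) then
      q <= a xor b;
    end if;
  end process;
end main;
```

With this split, each process is clearly either combinational or clocked, so the exception described above does not arise.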

LEC-03 Preliminaries

LEC-03: Details of Process Execution


Lecture Notes Sections: 1.6 to 1.6.3

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 05   VHDL Design and Optimization
  wk-01       Overview and VHDL
  wk-02       VHDL Semantics
  wk-03       Dataflow Diagrams; State Machines
  wk-04       Memory Design; Example Design
  wk-05       Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Overview
This lecture relates fragments of VHDL code to the basic building blocks of hardware: flip-flops, Boolean gates, arithmetic circuits, etc. The semantics of VHDL are behavioural, not structural, but by understanding the behavioural semantics of VHDL we can derive the relationship between VHDL code and netlists.

Concepts
Lecture Notes: Sections 1.6 to 1.6.3

temporal granularities
process modes
simulation cycle
simulation step
delta cycle
simulation round
provisional assignment

1.6 Details of Process Execution

1.6.1 Definitions and Algorithm

1.6.1.1 Temporal Granularities of Simulation


This begins our discussion of the behaviour and execution of processes. There are several different granularities of time at which to analyze VHDL behaviour. In this course, we will discuss three major granularities: clock cycles, timing simulation, and delta cycles.

Clock Cycle

smallest unit of time is a clock cycle
combinational logic has zero delay
flip-flops have a delay of one clock cycle
used for simulation early in the design cycle
fastest simulation run times

Timing Simulation

smallest unit of time is a nanosecond, picosecond, or femtosecond
combinational logic and wires have delay as computed by timing analysis tools
flip-flops have setup, hold, and clock-to-Q timing parameters
used for simulation when fine-tuning the design and confirming that timing constraints are satisfied
slow simulation times for large circuits

Delta Cycles

units of time are artifacts of VHDL semantics and simulation software
simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
VHDL semantics are defined in terms of these concepts

In assignments and exams, you will need to be able to simulate VHDL code at each of the three different levels of temporal granularity. In the laboratories and project, you will use simulation programs for both clock-cycle simulation and timing simulation. We don't have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....

Definitely Delta
For the remainder of Section 1.6, we'll look at only the delta-cycle view of the world.


1.6.1.2 Process Modes


Each process is in one of the following modes: active, suspended, or postponed.

NOTE ("postponed"): This use of the word "postponed" differs from that in the VHDL Standard. We won't be using postponed processes as defined in the Standard.

Process Modes

[Figure: state diagram of the three process modes. Transitions: suspended to postponed (resume), postponed to active (activate), active to suspended (suspend).]

Suspended

Nothing to currently execute.
A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement.
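The two ways of waiting for an event can be compared directly. In VHDL, a process with a sensitivity list behaves like a process without one that ends with a wait on statement naming the same signals. The sketch below illustrates this; the entity name wait_forms_sketch and the port names c1 and c2 are hypothetical, chosen for this example only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: c1 and c2 are expected to show identical waveforms.
entity wait_forms_sketch is
  port (
    a, b   : in  std_logic;
    c1, c2 : out std_logic
  );
end wait_forms_sketch;

architecture main of wait_forms_sketch is
begin
  -- Suspends after the last statement; resumes on a change of a or b.
  with_list : process (a, b)
  begin
    c1 <= a and b;
  end process;

  -- Suspends at the wait statement; resumes on a change of a or b.
  with_wait : process
  begin
    c2 <= a and b;
    wait on a, b;
  end process;
end main;
```

In both forms the process executes once at the start of simulation and then stays suspended until a or b changes.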

Postponed

Wants to execute, but not currently active.
A process becomes active when the simulator chooses it from the pool of postponed processes.

Active

Currently executing.
A process stays active until it hits a wait statement or completes the execution of the last statement in the process, at which point it suspends.

1.6.1.3 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the VHDL Standard. The most significant simplification is that this algorithm does not support delayed assignments. To support delayed assignments, each signal's provisional value would be generalized to an event wheel, which holds provisional assignments for multiple times in the future. A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

Initialization
Simulations start at Step 6 with all processes postponed and all signals with a default value ('U' for std_logic).

The Algorithm

1. All processes are suspended.
2. Each process looks at the signals that changed value and checks its sensitivity list or wait condition to see if it should resume.
3. Update signals with their provisional values.
4. Resume all suspended processes whose sensitivity list changed or wait condition became true.
5. If there are no postponed processes, then simulation time increments to the next scheduled event and the simulation continues at Step 1.
6. While there are postponed processes:
   (a) Pick one or more postponed processes to become active.
   (b) As a process executes, assignments to signals are provisional: new values do not become visible until Step 3 in the next simulation cycle.
   (c) A process runs until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
7. Calculate the new simulation time:
   If zero-delay assignments were made in the current simulation cycle, then simulation time does not advance;
   else simulation time is set to the time of the next scheduled event.
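The effect of provisional assignments in Step 6(b) can be checked with a small self-checking simulation. The sketch below is an illustration only; the entity name swap_sketch and the signals p and q are hypothetical. Because each assignment reads the visible value and writes only a provisional value, the two assignments swap p and q rather than copying one value into both.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity swap_sketch is
end swap_sketch;

architecture sim of swap_sketch is
  signal p : std_logic := '0';
  signal q : std_logic := '1';
begin
  process
  begin
    p <= q;  -- reads visible q = '1'; p gets provisional value '1'
    q <= p;  -- reads visible p = '0', not the provisional '1'
    wait for 1 ns;  -- signals are updated while the process is suspended
    assert p = '1' and q = '0'
      report "p and q were not swapped" severity error;
    wait;  -- suspend forever
  end process;
end sim;
```

If assignments took effect immediately, the second assignment would read the new value of p and both signals would end at '1'.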

NOTE (parallel execution): In n-threaded execution, at most n processes are active at a time.

1.6.1.4 Delta-Cycle Definitions

Definition (simulation step): Executing one sequential assignment.
Definition (simulation cycle): The operations that occur between the time when all processes are suspended and the next time when all processes are suspended again.
Definition (delta cycle): A simulation cycle that does not advance simulation time. Equivalently: a simulation cycle with zero-delay assignments.
Definition (simulation round): A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of delta cycles.
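Delta cycles can be made visible inside a simulation by using wait for 0 ns, which suspends a process until the next simulation cycle without advancing simulation time. The sketch below is a hedged illustration; the entity name delta_sketch and the signals x and y are hypothetical. It checks that y follows x one delta cycle later and that the whole sequence happens at the same simulation time, that is, within one simulation round.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity delta_sketch is
end delta_sketch;

architecture sim of delta_sketch is
  signal x : std_logic := '0';
  signal y : std_logic := '0';
begin
  -- y is updated one delta cycle after x changes.
  follow : process (x)
  begin
    y <= x;
  end process;

  stim : process
  begin
    wait for 10 ns;
    x <= '1';                      -- provisional: not yet visible
    assert x = '0'
      report "x changed within the same simulation cycle" severity error;
    wait for 0 ns;                 -- resume in the next delta cycle
    assert x = '1' and y = '0'
      report "unexpected values after first delta cycle" severity error;
    wait for 0 ns;                 -- one more delta cycle
    assert y = '1'
      report "y did not follow x after second delta cycle" severity error;
    assert now = 10 ns
      report "simulation time advanced during delta cycles" severity error;
    wait;                          -- suspend forever
  end process;
end sim;
```

The three checks run in three consecutive simulation cycles, all at 10 ns.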

NOTE (official and unofficial terminology): "Simulation cycle" and "delta cycle" are official definitions in the VHDL Standard. "Simulation step" and "simulation round" are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.

1.6.2 Example: Process Execution

library ieee;
use ieee.std_logic_1164.all;

entity bamboozle is
  port (
    a, b : in  std_logic;
    e    : out std_logic
  );
end bamboozle;

architecture main of bamboozle is
  signal c, d : std_logic;
begin
  procA : process (a, b) begin
    c <= a AND b;
  end process;
  procB : process (b, c, d) begin
    d <= NOT c;
    e <= b AND d;
  end process;
end main;

Figure 1.14: Example circuit for process execution

In the simulation run, a and b are external inputs with the following scheduled events:

a: ('0' at 0 ns), ('1' at 10 ns), ('0' at 15 ns)
b: ('1' at 0 ns), ('0' at 12 ns)

In this example, we will treat the external inputs as if they were driven by an external process.
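The scheduled events above can be turned into a small self-checking testbench. The sketch below is an illustration under stated assumptions: the entity name bamboozle_tb and the check process are hypothetical, procA and procB are copied into the testbench so that the internal signals c and d can be observed, and a final wait statement is added to procC so that the stimulus is applied only once. The expected values in the assertions are the settled values of c, d, and e derived from the AND and NOT equations.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: processes are inlined so c and d are visible.
entity bamboozle_tb is
end bamboozle_tb;

architecture sim of bamboozle_tb is
  signal a, b, c, d, e : std_logic;
begin
  procA : process (a, b) begin
    c <= a and b;
  end process;

  procB : process (b, c, d) begin
    d <= not c;
    e <= b and d;
  end process;

  -- External process driving the inputs with the scheduled events.
  procC : process begin
    a <= '0'; b <= '1';
    wait for 10 ns;
    a <= '1';
    wait for 2 ns;
    b <= '0';
    wait for 3 ns;
    a <= '0';
    wait;  -- added for this sketch: stop after one pass
  end process;

  -- Checks settled values between the input events.
  check : process begin
    wait for 5 ns;   -- time 5 ns: a = '0', b = '1'
    assert c = '0' and d = '1' and e = '1'
      report "unexpected values at 5 ns" severity error;
    wait for 6 ns;   -- time 11 ns: a = '1', b = '1'
    assert c = '1' and d = '0' and e = '0'
      report "unexpected values at 11 ns" severity error;
    wait for 2 ns;   -- time 13 ns: a = '1', b = '0'
    assert c = '0' and d = '1' and e = '0'
      report "unexpected values at 13 ns" severity error;
    wait for 3 ns;   -- time 16 ns: a = '0', b = '0'
    assert c = '0' and d = '1' and e = '0'
      report "unexpected values at 16 ns" severity error;
    wait;
  end process;
end sim;
```

This testbench observes only the values after each simulation round has finished; the delta-cycle behaviour within a round is not visible at this granularity.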

[Figure: waveforms of a, b, c, d, and e, with the time axis marked at 0 ns, 10 ns, 12 ns, and 15 ns]

Run of external inputs

Legend for the simulation trace:
process mode (S = suspended, P = postponed, A = active)
simulation-step pointer (one per process)
visible value and provisional (assigned) value of each signal
initial values
simulation step

procA: process (a, b) begin
  c <= a AND b;
end process;
procB: process (b, c, d) begin
  d <= NOT c;
  e <= b AND d;
end process;
procC: process begin
  a <= '0';
  b <= '1';
  wait for 10 ns;
  a <= '1';
  wait for 2 ns;
  b <= '0';
  wait for 3 ns;
  a <= '0';
end process;

Initial values at 0 ns: a, b, c, d, and e are all 'U'.

Simulation trace for procA, procB, and procC, first simulation cycle at 0 ns:

Step 6: Initial conditions (all processes postponed; a, b, c, d, and e are all 'U')
Step 6(a): Activate procA
Step 6(b): Provisional assignment to c (provisional value 'U')
Step 6(c): Suspend procA
Step 6(a): Activate procC
Step 6(b): Provisional assignment to a (provisional value '0')
Step 6(b): Provisional assignment to b (provisional value '1')
Step 6(c): Suspend procC
Step 6(a): Activate procB
Step 6(b): Provisional assignment to d (provisional value 'U')
Step 6(b): Provisional assignment to e (provisional value 'U')
Step 6(c): Suspend procB
All processes suspended: End of simulation cycle
Step 7: Simulation time remains at 0 ns

Simulation trace, next simulation cycle at 0 ns:

Step 1: Beginning of next simulation cycle
Note: In the waveform figure, the first simulation cycle is compacted into two columns. This is done only in this example to save space and is not standard practice.
Step 2: Check sensitivity lists for changes
Step 3: Update signal values (a = '0', b = '1'; c, d, and e remain 'U')
Step 4: Resume procA and procB
Step 6(a): Activate procA
Step 6(b): Provisional assignment to c (provisional value '0')
Step 6(c): Suspend procA
Step 6(a): Activate procB
Step 6(b): Provisional assignment to d (provisional value 'U')
Step 6(b): Provisional assignment to e (provisional value 'U')
Step 6(c): Suspend procB
Step 7: All processes suspended; end of simulation cycle

Simulation trace, next simulation cycle at 0 ns:

Step 1: Begin next simulation cycle
Step 2: Check sensitivity lists for changes
Step 3: Update signal values (c = '0'; d and e remain 'U')
Step 4: Resume procB
Steps 6(a,b): Activate procB; provisional assignment to d (provisional value '1')
Step 6(b): Provisional assignment to e (provisional value 'U')
Step 6(c): Suspend procB
Step 7: All processes suspended; end of simulation cycle

Simulation trace, next simulation cycle at 0 ns:

Step 1: Begin next simulation cycle
Step 2: Check sensitivity lists for changes
Step 3: Update signals (d = '1'; e remains 'U')
Step 4: Resume procB
Steps 6(a,b): Activate procB; provisional assignment to d (provisional value '1')
Step 6(b): Provisional assignment to e (provisional value '1')
Step 6(c): Suspend procB
Step 7: No changes to "sensitized" signals; time advances (e is updated to '1', no process is sensitive to e, and simulation time advances from 0 ns to 10 ns; the simulation cycles at 0 ns form one simulation round)

Simulation trace, simulation cycle at 10 ns:

Step 1: Begin next simulation cycle (not shown)
Step 2: Resume procC
Step 2: Check sensitivity lists for changes (not shown)
Step 3: Update signal values (not shown)
Step 6(a,b): Activate procC; provisional assignment to a (provisional value '1')
Step 6(c): Suspend procC; end of simulation cycle

Simulation trace, next simulation cycle at 10 ns:

Step 1: Begin next simulation cycle (not shown)
Step 2: Check sensitivity lists for changes
Step 3: Update signal values (a = '1'; b = '1', c = '0', d = '1', e = '1' are unchanged)
Step 4: Resume procA


Note and Questions


NB: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.
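This behaviour can be checked with a small self-checking simulation that counts how often a process resumes. The sketch below is an illustration only; the entity name no_event_sketch and the signals s and count are hypothetical. The watcher process runs once at initialization, is not resumed when s is assigned the value it already has, and is resumed when s actually changes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity no_event_sketch is
end no_event_sketch;

architecture sim of no_event_sketch is
  signal s     : std_logic := '0';
  signal count : natural   := 0;  -- number of times watcher has executed
begin
  -- Sensitive only to s; each execution increments count.
  watcher : process (s)
  begin
    count <= count + 1;
  end process;

  stim : process
  begin
    wait for 10 ns;
    s <= '0';          -- same value as before: s does not change
    wait for 5 ns;
    assert count = 1   -- only the initial execution has happened
      report "watcher resumed on an unchanged signal" severity error;
    s <= '1';          -- different value: s changes
    wait for 5 ns;
    assert count = 2
      report "watcher did not resume on a changed signal" severity error;
    wait;
  end process;
end sim;
```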

Question: What are the different granularities of time that occur when doing delta-cycle simulation?

Answer: simulation step, simulation cycle, delta cycle, simulation round

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

Answer: same order as listed just above

1.6.3 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different results for different process execution orderings.

architecture main of flotsam is
begin
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
end main;

[Figure: circuit with inputs a and b]

Figure 1.15: Circuit to illustrate need for provisional assignments

1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.

If assignments are not visible within the same simulation cycle (correct, i.e. provisional assignments are used):

[Figure: two timing diagrams, one per scheduling order, showing the activity of p_c and p_d and the values of a, b, c, and d, which all start at 0.]

If p_c is scheduled before p_d, then d will have a 1 pulse.

If p_d is scheduled before p_c, then d will have a 1 pulse.

If assignments are visible within the same simulation cycle (incorrect):

[Figure: two timing diagrams, one per scheduling order, showing the activity of p_c and p_d and the values of a, b, c, and d, which all start at 0.]

If p_c is scheduled before p_d, then d will stay constant 0.

If p_d is scheduled before p_c, then d will have a 1 pulse.

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
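The pulse on d can be observed in a simulator. The following is a minimal sketch, not part of the original notes: the testbench name flotsam_tb, the stim and monitor processes, and the 10 ns delay are assumptions made for illustration, and the initial values are acceptable only because this code is for simulation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench; the two processes p_c and p_d are copied
-- from the flotsam architecture.
entity flotsam_tb is
end flotsam_tb;

architecture sim of flotsam_tb is
  -- Step 1: start with all signals at 0 (simulation-only initial values)
  signal a, b, c, d : std_logic := '0';
begin
  p_c : process (a, b) begin
    c <= a AND b;
  end process;

  p_d : process (a, c) begin
    d <= a XOR c;
  end process;

  -- Step 2: simultaneously change to a = 1 and b = 1
  stim : process begin
    wait for 10 ns;
    a <= '1';
    b <= '1';
    wait;
  end process;

  -- Reports once at start-up and then on every event on d
  monitor : process (d) begin
    report "d = " & std_logic'image(d) & " at " & time'image(now);
  end process;
end sim;
```

Under the standard delta-cycle semantics, d should be reported as '1' and then as '0' again, both at 10 ns, because the two changes happen in consecutive delta cycles at the same simulation time.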

LEC-04 Preliminaries

LEC-04: Hardware Building Blocks


Lecture Notes Sections: 1.7 to 1.9.7

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

wk-01 to 05   VHDL Design and Optimization
  wk-01       Overview and VHDL
  wk-02       VHDL Semantics
  wk-03       Dataflow Diagrams; State Machines
  wk-04       Memory Design; Example Design
  wk-05       Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Overview
This lecture uses the VHDL semantics from Lecture 03 to describe how we determine what hardware will be synthesized from VHDL.

Concepts
Lecture Notes: Sections 1.7 to 1.9.7

basic building blocks
flip-flops and latches
coding flip-flops
coding sequential circuits
good and bad coding practices
arithmetic operations on signals

1.7 VHDL and Hardware Building Blocks

This section outlines the building blocks for register transfer level design and how to write VHDL code for the building blocks.

1.7.1 Basic Building Blocks

[Figure: schematic symbols for the basic building blocks, including a 2:1 mux (also: n-to-1 muxes), a flip-flop with D and CE pins, and memory arrays with WE, A, DI, and DO pins.]

Hardware                              VHDL
AND, OR, NAND, NOR, XOR, XNOR         and, or, nand, nor, xor, xnor
multiplexer                           if-then-else, case statement, selected assignment, conditional assignment
adder, subtracter, negater            +, -
shifter, rotater                      sll, srl, sla, sra, rol, ror
flip-flop                             wait until, if-then-else, rising_edge
memory array, register file, queue    2-d array or library component

Figure 1.16: RTL Building Blocks

1.7.2 Deprecated Building Blocks for RTL

Latches

Use flops, not latches.
Latch-based designs are susceptible to timing problems.
The transparent phase of a latch can let a signal leak through a latch, causing the signal to affect the output one clock cycle too early.
It's possible for a latch-based circuit to simulate correctly, but not work in real hardware, because the timing delays on the real hardware don't match those predicted in synthesis.
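As an illustration of how a latch can appear by accident, the sketch below contrasts a level-sensitive process with the edge-sensitive form. The entity name latch_vs_flop and its port names are assumptions made for this example, and the exact hardware produced depends on the synthesis tool and target library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity latch_vs_flop is
  port (
    clk, en, d : in  std_logic;
    q_latch    : out std_logic;  -- level sensitive (avoid)
    q_flop     : out std_logic   -- edge sensitive (preferred)
  );
end latch_vs_flop;

architecture main of latch_vs_flop is
begin
  -- No clock edge and no else branch: q_latch must hold its value
  -- while en = '0', so tools typically infer a transparent latch.
  latch_proc : process (en, d) begin
    if (en = '1') then
      q_latch <= d;
    end if;
  end process;

  -- The same enable sampled on the clock edge: a flop with chip enable.
  flop_proc : process (clk) begin
    if rising_edge(clk) then
      if (en = '1') then
        q_flop <= d;
      end if;
    end if;
  end process;
end main;
```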

T, JK, SR, etc flip-flops

Limit yourself to D-type flip-flops.
Most FPGA and ASIC cell libraries include only D-type flip-flops.
(However, the flip-flops in Altera's APEX FPGAs can be configured as D, T, JK, or SR flip-flops.)

Tri-state buffers

Use multiplexers, not tri-state buffers.
Tri-state designs are susceptible to stability and signal integrity problems.
Getting tri-state designs to simulate correctly is difficult; some library components don't support tri-state signals.
Tri-state designs rely on the code never letting two signals drive the bus at the same time.
It can be difficult to check that bus arbitration will always work correctly.
Manufacturing and environmental variability can make real hardware not work correctly even if it simulates correctly.
Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state signals at the board level.
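The following sketch shows one way a shared bus could be written as a multiplexer. The entity bus_mux, the three 8-bit sources, the 2-bit select encoding, and the all-zero default are assumptions for illustration and are not taken from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity bus_mux is
  port (
    sel        : in  std_logic_vector(1 downto 0);
    d0, d1, d2 : in  std_logic_vector(7 downto 0);
    bus_out    : out std_logic_vector(7 downto 0)
  );
end bus_mux;

architecture main of bus_mux is
begin
  -- A tri-state version would give each source its own driver that
  -- outputs 'Z' when disabled, and would rely on at most one enable
  -- being asserted at a time.

  -- Mux version: exactly one source is chosen for every value of sel,
  -- so two sources can never drive the bus simultaneously.
  with sel select
    bus_out <= d0         when "00",
               d1         when "01",
               d2         when "10",
               "00000000" when others;
end main;
```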

1.7.3 Hardware and Code for Flops

1.7.3.1 Flip-Flops vs Latches

flip-flop: Edge sensitive: output only changes on rising (or falling) edge of clock.
latch: Level sensitive: output changes whenever clock is high (or low).

A common implementation of a flip-flop is a pair of latches (Master/Slave flop).

Latches are sometimes called transparent latches, because they are transparent (input directly connected to output) when the clock is high.

The clock to a latch is sometimes called the enable line.

There is more information in the course notes on timing analysis for storage devices (Section 5.6).

1.7.3.2 Flops with Waits and Ifs

The two code fragments below synthesize to identical hardware (flops).

If

process (clk) begin
  if rising_edge(clk) then
    q <= d;
  end if;
end process;

Wait

process begin
  wait until rising_edge(clk);
  q <= d;
end process;

1.7.3.3 Flops with Synchronous Reset

The two code fragments below synthesize to identical hardware (flops with synchronous reset). Notice that the synchronous reset is really nothing more than an AND gate on the input.

If

process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      q <= '0';
    else
      q <= d;
    end if;
  end if;
end process;

Wait

process begin
  wait until rising_edge(clk);
  if (reset = '1') then
    q <= '0';
  else
    q <= d;
  end if;
end process;

1.7.3.4 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).

If

process (clk) begin
  if rising_edge(clk) then
    if (ce = '1') then
      q <= d;
    end if;
  end if;
end process;

Wait

process begin
  wait until rising_edge(clk);
  if (ce = '1') then
    q <= d;
  end if;
end process;

1.7.3.5 Flops with Chip-Enable and Mux on Input

The two code fragments below synthesize to identical hardware (flops with chip-enable lines and muxes on inputs).

If

process (clk) begin
  if rising_edge(clk) then
    if (ce = '1') then
      if (sel = '1') then
        q <= d1;
      else
        q <= d0;
      end if;
    end if;
  end if;
end process;

Wait

process begin
  wait until rising_edge(clk);
  if (ce = '1') then
    if (sel = '1') then
      q <= d1;
    else
      q <= d0;
    end if;
  end if;
end process;

1.7.3.6 Flops with Chip-Enable, Muxes, and Reset

The two code fragments below synthesize to identical hardware (flops with chip-enable lines, muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing more than a mux, or an AND gate on the input.

NB: The specific combination and order of tests is important to guarantee that the circuit synthesizes to a flop with a chip enable, as opposed to a level-sensitive latch testing the chip enable and/or reset followed by a flop.

NB: The chip-enable pin on the flop is connected to both ce and reset. If the chip-enable pin was not connected to reset, then the flop would ignore reset unless chip-enable was asserted.

Chip-Enable, Mux, Reset with If

process (clk) begin
  if rising_edge(clk) then
    if (ce = '1' or reset = '1') then
      if (reset = '1') then
        q <= '0';
      elsif (sel = '1') then
        q <= d1;
      else
        q <= d0;
      end if;
    end if;
  end if;
end process;

Chip-Enable, Mux, Reset with Wait

process begin
  wait until rising_edge(clk);
  if (ce = '1' or reset = '1') then
    if (reset = '1') then
      q <= '0';
    elsif (sel = '1') then
      q <= d1;
    else
      q <= d0;
    end if;
  end if;
end process;

1.7.4 An Example Sequential Circuit

There are many ways to write VHDL code that synthesizes to the schematic in Figure 1.17. The two major choices in the styles are:

Put all of the code in a single process, or have a collection of clocked processes, combinational processes, and concurrent statements.
Use wait or if rising_edge for flip-flops.

Some examples of these different options are shown in Figures 1.18 to 1.21.

[Figure: schematic with inputs sel, reset, and clk, internal signal a, and output c.]

entity and_not_reg is
  port (
    reset, clk, sel : in  std_logic;
    c               : out std_logic
  );
end;

Schematic and entity for examples of different code organizations in Figures 1.18 to 1.21

Figure 1.17: Schematic and entity for and_not_reg

One Process

architecture one_proc of and_not_reg is
  signal a : std_logic;
begin
  process begin
    wait until rising_edge(clk);
    if (reset = '1') then
      a <= '0';
    elsif (sel = '1') then
      a <= NOT a;
    else
      a <= a;
    end if;
    c <= NOT a;
  end process;
end one_proc;

Figure 1.18: One process implementation of Figure 1.17

Two Processes with Wait

architecture two_proc_wait of and_not_reg is
  signal a : std_logic;
begin
  process begin
    wait until rising_edge(clk);
    if (reset = '1') then
      a <= '0';
    elsif (sel = '1') then
      a <= NOT a;
    else
      a <= a;
    end if;
  end process;

  process begin
    wait until rising_edge(clk);
    c <= NOT a;
  end process;
end two_proc_wait;

Figure 1.19: Two processes with wait implementation of Figure 1.17

Two Processes with If-Then-Else

architecture two_proc_if of and_not_reg is
  signal a : std_logic;
begin
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        a <= '0';
      elsif (sel = '1') then
        a <= NOT a;
      else
        a <= a;
      end if;
    end if;
  end process;

  process (clk) begin
    if rising_edge(clk) then
      c <= NOT a;
    end if;
  end process;
end two_proc_if;

Figure 1.20: Two processes with if-then-else implementation of Figure 1.17

Concurrent Statements

architecture comb of and_not_reg is
  signal a, b, d : std_logic;
begin
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        a <= '0';
      else
        a <= d;
      end if;
    end if;
  end process;

  process (clk) begin
    if rising_edge(clk) then
      c <= NOT a;
    end if;
  end process;

  d <= b when (sel = '1') else a;
  b <= NOT a;
end comb;

Figure 1.21: Concurrent statement implementation of Figure 1.17

1.8 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns. It's important to use idioms that your synthesis tool recognizes. If you aren't careful, you could write code that has the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware. Section 1.7 described common idioms and the resulting hardware.

1.8.1 Unsynthesizable Code

1.8.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

signal bad_signal : std_logic := '0';

Reason: In most implementation technologies, when a circuit powers up, the values on signals are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is powered up, all flip-flops will be '0'. For other FPGAs, the initial values can be programmed.

1.8.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2 ns delay in all environments.
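Although wait for cannot be synthesized, it remains useful in simulation-only code such as testbenches. The sketch below is an illustration under assumed names (clk_tb, half_period, stim) and an assumed 10 ns clock period; none of these come from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench, for illustration only
entity clk_tb is
end clk_tb;

architecture sim of clk_tb is
  constant half_period : time := 5 ns;  -- assumed value
  signal clk  : std_logic := '0';
  signal d, q : std_logic := '0';
begin
  -- Simulation only: wait for is acceptable because no hardware is
  -- built from this process.
  clk_gen : process begin
    clk <= '0';
    wait for half_period;
    clk <= '1';
    wait for half_period;
  end process;

  -- Simulation only: change the input partway through a clock cycle.
  stim : process begin
    wait for 12 ns;
    d <= '1';
    wait;
  end process;

  -- Synthesizable style: all timing comes from the clock edge.
  flop : process (clk) begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;
end sim;
```

Because clk_gen never stops, the simulator has to be run for a bounded amount of time.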

1.8.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

-- different clock signals
process begin
  wait until rising_edge(clk1);
  x <= a;
  wait until rising_edge(clk2);
  x <= a;
end process;

-- different clock edges
process begin
  wait until rising_edge(clk);
  x <= a;
  wait until falling_edge(clk);
  x <= a;
end process;

Reason: processes with multiple wait statements are turned into finite state machines. The wait statements denote transitions between states. The target signals in the process are outputs of flip-flops. Using different wait conditions would require the flip-flops to use different clock signals at different times. Multiple clock signals for a single flip-flop would be difficult to synthesize, inefficient to build, and fragile to operate.

1.8.1.4 Multiple if rising_edge in Same Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

process (clk) begin
  if rising_edge(clk) then
    q0 <= d0;
  end if;
  if rising_edge(clk) then
    q1 <= d1;
  end if;
end process;

Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process.

1.8.1.5 if rising_edge and wait in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

process (clk) begin
  if rising_edge(clk) then
    q0 <= d0;
  end if;
  wait until rising_edge(clk);
  q0 <= d1;
end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.

1.8.1.6 if rising_edge with else Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

process (clk) begin
  if rising_edge(clk) then
    q0 <= d0;
  else
    q0 <= d1;
  end if;
end process;

Reason: q0 is supposed to be the output of a flip-flop in one case and the output of combinational circuitry in another.

1.8.1.7 if rising_edge Inside a for Loop

An if rising_edge statement in a for-loop (UNSYNTHESIZABLE: Synopsys)

process (clk) begin
  for i in 0 to 7 loop
    if rising_edge(clk) then
      q <= d;
    end if;
  end loop;
end process;

Synthesizable Alternative

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for loop.

process (clk) begin
  if rising_edge(clk) then
    for i in 0 to 7 loop
      q <= d;
    end loop;
  end if;
end process;

Reason: just an idiom of the synthesis tool.

Synthesizable for loops are described in Rushton Section 8.7. For loops in general are described in Ashenden. Examples of for loops in E&CE will appear when describing testbenches for functional validation.

1.8.1.8 wait Inside of a for Loop

wait statements in a for loop (UNSYNTHESIZABLE)

process begin
  for i in 0 to 7 loop
    wait until rising_edge(clk);
    x <= to_unsigned(i,4);
  end loop;
end process;

Reason: Unknown. For-loops are generally unsynthesizable, but while-loops with the same behaviour are synthesizable.

NOTE: For loops. For loops are very useful in simulation, particularly for test benches.

Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.

process begin
  -- output values from 0 to 4 on i
  -- sending one value out each clock cycle
  i <= to_unsigned(0,4);
  wait until rising_edge(clk);
  while (4 > i) loop
    i <= i + 1;
    wait until rising_edge(clk);
  end loop;
end process;

1.8.2 Synthesizable, but Undesirable Hardware

NB: The results for the examples in this section are highly dependent upon the tool that you use and the target technology library.

1.8.2.1 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.

process (reset, clk) begin
  if (reset = '1') then
    q <= '0';
  elsif rising_edge(clk) then
    q <= d1;
  end if;
end process;

1.8.2.2 Bad Form of Nested Ifs

if rising_edge statement inside another if (BAD HARDWARE)

In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is a flop.

process (ce, clk) begin
  if (ce = '1') then
    if rising_edge(clk) then
      q <= d1;
    end if;
  end if;
end process;

1.8.2.3 Deeply Nested Ifs

Deeply chained if-then-else statements can lead to long chains of dependent gates, rather than checking different cases in parallel.

Slow (maybe)

if cond1 then
  stmts1
elsif cond2 then
  stmts2
elsif cond3 then
  stmts3
elsif cond4 then
  stmts4
end if;

Fast (hopefully)

If only one of the conditions can be true at a time, then try using a case statement or some other technique that allows the conditions to be evaluated in parallel.
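A minimal sketch of the case-statement form follows. It assumes that the mutually exclusive conditions can be encoded in a 2-bit signal sel; the entity par_select and its ports are hypothetical. Whether it actually produces faster hardware than the if-then-else chain depends on the tool and target library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity par_select is
  port (
    sel        : in  std_logic_vector(1 downto 0);
    a, b, c, d : in  std_logic;
    z          : out std_logic
  );
end par_select;

architecture main of par_select is
begin
  process (sel, a, b, c, d) begin
    -- The choices are mutually exclusive by construction, so no
    -- priority between them needs to be encoded in the hardware.
    case sel is
      when "00"   => z <= a;
      when "01"   => z <= b;
      when "10"   => z <= c;
      when others => z <= d;
    end case;
  end process;
end main;
```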

1.9 Numbers, Arithmetic, Arrays, and Signals

VHDL includes all of the common arithmetic and logical operators. Use the VHDL arithmetic operators and let the synthesis tool choose the better implementation for you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic libraries.

To use the operators, you must choose which arithmetic package you wish to use (Section 1.9.1). The arithmetic operators are overloaded, and you can usually use any mixture of constants and signals of different types that you need (Section 1.9.3). However, you might need to convert a signal from one type (e.g. std_logic_vector) to another type (e.g. integer) (Section 1.9.7).

1.9.1 Arithmetic Packages

Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.

To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.

numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
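As a small illustration, the sketch below uses numeric_std as the only arithmetic package. The entity add8 and its 8-bit ports are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- the only arithmetic package in use

-- Hypothetical entity, for illustration only
entity add8 is
  port (
    a, b : in  unsigned(7 downto 0);
    sum  : out unsigned(7 downto 0)
  );
end add8;

architecture main of add8 is
begin
  -- "+" on two 8-bit unsigned vectors returns an 8-bit result,
  -- so the sum wraps around modulo 2**8.
  sum <= a + b;
end main;
```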

1.9.2 Shift and Rotate Operations

Shift and rotate operations are described with three character acronyms:

shift/rotate   left/right   arithmetic/logical

The shift right arithmetic (sra) operation preserves the sign of the operand, by copying the most significant bit into lower bit positions. The shift left arithmetic does the analogous operation, except that the least significant bit is copied.
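The behaviour of the six operators can be checked with a small self-checking simulation. This sketch is an illustration only: the entity shift_demo and the test value "1001" are assumptions, and bit_vector is used because all six operators are predefined for that type (support for other types depends on the package and language version in use).

```vhdl
-- Hypothetical simulation-only entity, for illustration
entity shift_demo is
end shift_demo;

architecture sim of shift_demo is
begin
  process
    constant v : bit_vector(3 downto 0) := "1001";
  begin
    -- logical shifts fill with '0'
    assert (v sll 1) = "0010" report "sll mismatch" severity error;
    assert (v srl 1) = "0100" report "srl mismatch" severity error;
    -- arithmetic shifts replicate the bit at the end being vacated:
    -- sla copies the least significant bit, sra the most significant
    assert (v sla 1) = "0011" report "sla mismatch" severity error;
    assert (v sra 1) = "1100" report "sra mismatch" severity error;
    -- rotates wrap the bit that falls off back in at the other end
    assert (v rol 1) = "0011" report "rol mismatch" severity error;
    assert (v ror 1) = "1100" report "ror mismatch" severity error;
    wait;
  end process;
end sim;
```

If every assertion holds, the simulation finishes without printing any of the mismatch messages.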

1.9.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers. Tables 1.1 to 1.4 show the different combinations of target and source types and widths that can be used.

Table 1.1: Overloading of Arithmetic Operations (+, -)

target     src1       src2       result
unsigned   unsigned   integer    OK
unsigned   integer    unsigned   OK
-          unsigned   signed     fails in analysis

In these tables "-" means don't care.

1.9.4 Different Widths and Arithmetic

Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

target   src1/2   src2/1   result
narrow   wide     -        fails in elaboration
wide     narrow   int      fails in elaboration
wide     wide     -        OK
narrow   narrow   narrow   OK
narrow   narrow   int      OK

Example vectors
wide     unsigned(7 downto 0)
narrow   unsigned(4 downto 0)
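When the widths do not line up, the numeric_std function resize can be used to adjust an operand or a result explicitly. The sketch below is an illustration under assumed names (width_demo and its ports); it reuses the 8-bit and 5-bit example vectors.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only
entity width_demo is
  port (
    wide_in    : in  unsigned(7 downto 0);
    narrow_in  : in  unsigned(4 downto 0);
    wide_sum   : out unsigned(7 downto 0);
    wide_inc   : out unsigned(7 downto 0);
    narrow_inc : out unsigned(4 downto 0)
  );
end width_demo;

architecture main of width_demo is
begin
  -- wide + narrow: the result has the width of the wider operand
  -- (8 bits), so it matches the wide target.
  wide_sum <= wide_in + narrow_in;

  -- narrow + integer would be only 5 bits wide, so the vector is
  -- widened first to match the 8-bit target.
  wide_inc <= resize(narrow_in, 8) + 1;

  -- An 8-bit result is cut down to a 5-bit target; for unsigned,
  -- resize keeps the least significant bits.
  narrow_inc <= resize(wide_in + 1, 5);
end main;
```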

1.9.5 Overloading of Comparisons

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)

src1       src2       result
unsigned   integer    OK
integer    unsigned   OK
signed     integer    OK
integer    signed     OK
unsigned   signed     fails in analysis
signed     unsigned   fails in analysis

1.9.6 Different Widths and Comparisons

src1 wide narrow src2

Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)

OK OK

1.9.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors. The listing below summarizes the types of these functions.

unsigned( val : std_logic_vector )               return unsigned;
signed( val : std_logic_vector )                 return signed;
to_integer( val : signed )                       return integer;
to_integer( val : unsigned )                     return integer;
to_unsigned( val : natural; width : natural )    return unsigned;
to_signed( val : integer; width : natural )      return signed;

The most common example of converting between two types arises when using a signal as an index into an array. To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer (Figure 1.22).

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
...
signal bit_sig : std_logic;
signal uns_sig : unsigned(7 downto 0);
signal vec_sig : std_logic_vector(255 downto 0);
...
bit_sig <= vec_sig( to_integer(uns_sig) );
...

Figure 1.22: Using a signal as an index to array

To convert a std_logic_vector into an integer, you must first say whether the signal should be interpreted as signed or unsigned. As illustrated in Figure 1.23, this is done by:

1. Convert the std_logic_vector signal to signed or unsigned, using the function signed or unsigned
2. Convert the signed or unsigned signal into an integer, using to_integer

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
...
signal bit_sig : std_logic;
signal std_sig : std_logic_vector(7 downto 0);
signal vec_sig : std_logic_vector(255 downto 0);
...
bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
...

Figure 1.23: Using a std_logic_vector as an index to array
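The conversion in the opposite direction follows the same two-step pattern. The sketch below is an illustration only; the entity conv_demo, its ports, and the intermediate signal count are assumed names.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only
entity conv_demo is
  port (
    std_in  : in  std_logic_vector(7 downto 0);
    std_out : out std_logic_vector(7 downto 0)
  );
end conv_demo;

architecture main of conv_demo is
  signal count : integer range 0 to 255;
begin
  -- std_logic_vector -> unsigned -> integer
  count <= to_integer( unsigned( std_in ) );

  -- integer -> unsigned (with an explicit width) -> std_logic_vector
  std_out <= std_logic_vector( to_unsigned( count, 8 ) );
end main;
```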


Chapter 2

RTL Design with VHDL: From Requirements to Optimized Code

2.1 Prelude to Chapter

2.1.1 Topics in this Chapter

design flows
dataflow diagrams
state machines
memory arrays
design example
optimization

LEC-05 Preliminaries

LEC-05: Dataflow Diagrams


Lecture Notes Sections: 2.3 to 2.3.9

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter


Concepts
Lecture Notes: Sections 2.3 to 2.3.9

serial vs parallel algorithms and hardware
dataflow diagrams
area estimation
performance estimation
register allocation
datapath, register, input, output allocation
area / performance tradeoffs
scheduling

Reading
Rushton, VHDL for Logic Synthesis (on reserve in DC-Library).

Chapter 1: Introduction
Chapter 2: The Register Transfer Level Design Cycle

2.2 Design Flow

2.2.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 427. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.

[Figure: flow from Requirements through Algorithm, High-Level Model, DP+Ctrl Code (dp/ctrl specific), Opt. RTL Code, and Implementation to Hardware, with a Modify/Analyze loop at each stage.]

Figure 2.1: Generic Design Flow

Design Flow Artifacts

Additional material in notes

Table 2.1: Artifacts in the Design Flow

Requirements             Description of what the customer wants
Algorithm                Functional description of computation
High-Level Model         HDL code with signals and clock cycles
Dataflow Diagram         Picture of datapath behaviour
Hardware Block Diagram   Picture of datapath structure
State Machine            Picture of control behaviour
DP+Ctrl RTL Code         Synthesizable HDL code
Optimized RTL Code       HDL code written to meet design goals
Implementation Code      All of the info to build a specific chip

2.2.2 Implementation Flows

Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe technology-specific parameters of the primitive building blocks (e.g. the delay and area of individual gates, PLAs, CLBs, flops, memory arrays). Mentor Graphics' product Leonardo Spectrum, Cadence's product BuildGates, and Synplicity's product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell separate tools that do place-and-route and other low-level (physical design) tasks.

These general-purpose synthesis tools do not (generally) do the final stages of the design, such as place-and-route and timing analysis, which are very specific to a given implementation technology. The implementation-technology-specific tools generally also produce a VHDL file that accurately models the chip. We will refer to this file as the implementation VHDL code.

Synopsys with Xilinx and Altera

With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file (xnf: Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with routed.vhd.

With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the implementation VHDL file is often .vho, for VHDL output.

Terminology: Behavioural and Structural

NOTE: behavioural and structural models. The phrases "behavioural model" and "structural model" are commonly used for what we'll call "high-level models" and "synthesizable models". In most cases, what people call structural code contains both structural and behavioural code. The technically correct definition of a structural model is an HDL program that contains only component instantiations and generate statements. Thus, even a program with c <= a AND b; is, strictly speaking, behavioural.
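To make the strict definition concrete, the sketch below pairs a behavioural leaf cell with a purely structural architecture that contains only component instantiations. The entities and2 and and3, the architecture names, and the internal signal ab are hypothetical names chosen for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical leaf cell
entity and2 is
  port (x, y : in std_logic; z : out std_logic);
end and2;

architecture behav of and2 is
begin
  z <= x AND y;  -- strictly speaking, behavioural
end behav;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 3-input AND built from two and2 cells
entity and3 is
  port (a, b, c : in std_logic; z : out std_logic);
end and3;

architecture struct of and3 is
  component and2
    port (x, y : in std_logic; z : out std_logic);
  end component;
  signal ab : std_logic;
begin
  -- Only component instantiations: structural in the strict sense
  u0 : and2 port map (x => a,  y => b, z => ab);
  u1 : and2 port map (x => ab, y => c, z => z);
end struct;
```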

2.2.3 Classes of Hardware

Each circuit tends to be dominated by either its datapath, control (state machine) or storage (memory).

Datapath
  Purpose: compute output data based on input data
  Each parcel of input produces one parcel of output
  Examples: arithmetic, decoders
Storage
  Purpose: hold data for future use
  Data is not modified while stored
  Examples: register files, FIFO queues
Control
  Purpose: modify internal state based on inputs, compute outputs from state and inputs
  Mostly individual signals, few data (vectors)
  Examples: bus arbiters, memory-controllers

2.2.4 Design Flow: Datapath vs Control vs Storage

All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1), but the details in the flow differ. This is particularly true for the transition from the high-level model to the model that separates the datapath and control circuitry.

The different classes of circuits all use dataflow diagrams, hardware block diagrams, and state machines. What differs is how much effort is put into each type of description and the order in which the different descriptions are used.

2.2.4.1 Datapath-Centric Design Flow

[Figure: flow from High-Level Model through Dataflow, Block Diagram, and State Machine to DP+Ctrl RTL Code, with Modify/Analyze loops on Dataflow and Block Diagram.]

Figure 2.2: Datapath-Centric Design Flow

2.2.4.2 Control-Centric Design Flow

[Figure: flow from High-Level Model through State Machine, Dataflow Diagram, and Block Diagram to DP+Ctrl RTL Code, with a Modify/Analyze loop at each stage.]

Figure 2.3: Control-Centric Design Flow

2.2.4.3 Storage-Centric Design Flow

In E&CE 427, we won't be discussing storage-centric design. Storage-centric design differs from datapath- and control-centric design in that storage-centric design focusses on building many replicated copies of small cells.

Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.

2.3 Dataflow Diagrams and High-Level Models

2.3.1 Overview of Example

Requirement: compute the sum of 6 numbers: output = a + b + c + d + e + f

We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. hardware block diagram
5. state machine
6. high-level model

2.3.1.1 Software vs Hardware Algorithms

In software, the expression: (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as: (a + b) + (c + d) + (e + f).

But:
hardware runs in parallel
in an algorithmic description, parentheses can guide parallel vs serial execution
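The two groupings can be written directly as VHDL expressions. The sketch below is an illustration only: the entity sum6, the 8-bit unsigned ports, and the output names are assumptions, and a synthesis tool may restructure either expression, so the parentheses are a hint about the intended structure, not a guarantee.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only
entity sum6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z_serial         : out unsigned(7 downto 0);
    z_parallel       : out unsigned(7 downto 0)
  );
end sum6;

architecture main of sum6 is
begin
  -- Serial grouping: each adder waits for the previous one,
  -- giving 5 adders on the longest path.
  z_serial   <= ((((a + b) + c) + d) + e) + f;

  -- Parallel grouping: the three pairs can be added at the same time,
  -- giving 3 adders on the longest path. Both forms use 5 adders.
  z_parallel <= ((a + b) + (c + d)) + (e + f);
end main;
```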

2.3.1.2 Serial vs Parallel

Serial: (((((a+b)+c)+d)+e)+f)
Parallel: (a+b)+(c+d)+(e+f)

[Figure: dataflow diagrams for inputs a to f. Serial: a chain of five adders. Parallel: the pairs are added first and their results are then combined.]

Performance Estimation

Serial: (((((a+b)+c)+d)+e)+f): 5 adders on longest path (slower)
Parallel: (a+b)+(c+d)+(e+f): 3 adders on longest path (faster)

[Figure: the serial and parallel dataflow diagrams with the adders numbered along the longest path.]

There is more information on performance in Section 2.3.3.1 and all of Chapter 4 is devoted to performance.

Area Estimation

Serial: (((((a+b)+c)+d)+e)+f): 5 adders used
Parallel: (a+b)+(c+d)+(e+f): 5 adders used

[Figure: the serial and parallel dataflow diagrams with every adder numbered.]

Design Comparison

Serial: (((((a+b)+c)+d)+e)+f)
  5 adders on longest path (slower)
  5 adders used
Parallel: (a+b)+(c+d)+(e+f)
  3 adders on longest path (faster)
  5 adders used

2.3.2 Dataflow Diagrams

A disciplined approach for going beyond combinational logic for datapath-centric circuits

2.3.2.1 Dataflow Diagrams Overview

Purpose:
  Provide a disciplined approach for designing datapath-centric circuits
  Guide the design from algorithm to high-level model
  Guide the design from high-level model to model with separated datapath and control
  Estimate area and performance
  Make tradeoffs between different design options
Background:
  Based on techniques from high-level synthesis tools

Dataflow Diagrams Overview

[Figure: dataflow diagram of the serial sum. Inputs a, b, c, d, e, f feed a chain of five adders with intermediate signals x1, x2, x3, x4 and output z. The following slides annotate this same diagram.]

Clock Cycle Boundaries

Horizontal lines mark clock cycle boundaries.

Latency

[Figure: two versions of the diagram with different clock cycle boundaries, one with Latency = 6 clock cycles and one with Latency = 4 clock cycles]

Note the imbalanced clock cycle utilization in the version with a latency of 4 clock cycles.

Flip Flops

Signals crossing clock boundaries are flip-flops.

Registered Inputs and Outputs

Flops on both inputs and outputs.

Registered Inputs

Flops on inputs, but not outputs (Latency = 5).

Datapath Components

Blocks in clock cycles are datapath components.

Inputs

Unconnected signal tails are inputs.

Outputs

Unconnected signal heads are outputs.

Summary

  Unconnected signal tails are inputs
  Horizontal lines mark clock cycle boundaries
  Signals crossing clock boundaries are flip-flops
  Blocks in clock cycles are datapath components
  Unconnected signal heads are outputs

2.3.2.2 Area Estimation

  Maximum number of blocks in a clock cycle is the total number of that component that are needed
  Maximum number of signals that cross a cycle boundary is the total number of registers that are needed
  Maximum number of unconnected signal tails in a clock cycle is the total number of inputs that are needed
  Maximum number of unconnected signal heads in a clock cycle is the total number of outputs that are needed

2.3.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs

[Figure: step-by-step execution of the serial dataflow diagram beside a waveform for clk, a, x1, x2, x3, x4, x5, and z over clock cycles 0 to 6]

Execution Without Output Registers

[Figure: the same execution without output registers, with the corresponding waveform]

2.3.3.1 Performance Estimation

Performance Equations

$\text{Performance} \propto \dfrac{1}{\text{TimeExec}}$

$\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$

Latency = Number of clock cycles from inputs to outputs

There is much more information on performance in chapter 4, which is devoted to performance.

Performance of Dataflow Diagrams

  Latency: count horizontal lines in diagram
  Min clock period (Max clock speed): limited by longest path in a clock cycle

[Figure: serial dataflow diagram with one add per clock cycle beside a waveform for clk, a, x1, x2, x3, x4, x5, and z]

2.3.3.2 Design Analysis

[Figure: serial dataflow diagram and waveform for clk, a, x1, x2, x3, x4, x5, and z]

Design Analysis (continued)

num inputs         6
num outputs        1
num registers      6
num adders         1
min clock period   delay through flop and one adder
latency            5 clock cycles

2.3.4 Area / Performance Tradeoffs

[Figure: side-by-side dataflow diagrams of the serial sum, one with one add per clock cycle and one with two adds per clock cycle]

NB: In the two-add design, half of the last clock cycle is wasted.

Two Adds per Clock Cycle

[Figure: dataflow diagram with two adds per clock cycle beside a waveform for clk, a, x1, x2, x3, x4, and x5]

Design Comparison

[Figure: dataflow diagrams for one add per clock cycle and two adds per clock cycle]

               One add per clock cycle   Two adds per clock cycle
inputs         6                         6
outputs        1                         1
registers      6                         6
adders         1                         2
clock period   flop + 1 add              flop + 2 add
latency        6                         4

Question: Under what circumstances would each of the design options (one add and two add) be the fastest?

Answer: time = latency * clock period. Compare the execution times for both options.
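The comparison can be worked through symbolically. The symbols $T_f$ (delay through a flop) and $T_a$ (delay through one adder) are notation introduced here for illustration, and the derivation assumes each design runs at exactly its minimum clock period from the table above.

```latex
\begin{align*}
\text{TimeExec}_{\text{one-add}} &= 6\,(T_f + T_a) = 6T_f + 6T_a \\
\text{TimeExec}_{\text{two-add}} &= 4\,(T_f + 2T_a) = 4T_f + 8T_a \\
\text{TimeExec}_{\text{two-add}} < \text{TimeExec}_{\text{one-add}}
  &\iff 4T_f + 8T_a < 6T_f + 6T_a \\
  &\iff 2T_a < 2T_f \\
  &\iff T_a < T_f
\end{align*}
```

Under these assumptions the two-add design is faster only when an adder is quicker than the flop overhead; when the adder delay dominates, the one-add design wins. For example, with $T_f = 1$ and $T_a = 2$ the one-add design takes 18 time units and the two-add design takes 20.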

2.3.5 Optimize Inputs and Outputs

If the circuit currently stores all of its inputs and the environment's behaviour can be changed to delay sending some inputs, then the number of inputs and registers can be reduced.

[Figure: one-add dataflow diagram before I/O optimization (inputs a, b, c, d, e, f arrive together) and after I/O optimization (only a and b arrive in the first clock cycle)]

                         inputs   regs
One-add before I/O opt   6        6
One-add after I/O opt    2        2

Design Comparison

[Figure: dataflow diagrams for one-add after I/O optimization and two-add after I/O optimization; in the two-add diagram a, b, c arrive first and d, e, f arrive in later clock cycles]

               One-add after I/O opt   Two-add after I/O opt
inputs         2                       3
outputs        1                       1
registers      2                       3
adders         1                       2
clock period   flop + 1 add            flop + 2 add
latency        6                       4

2.3.6 From Dataflow Diagram to High-Level Model

Here we illustrate the process of going from a dataflow diagram to a high-level model. In the high-level model the entire circuit will be implemented in a single process. For larger circuits it may be beneficial to have separate processes for different groups of signals. High-level models are distinguished from lower-level models in that the code for the datapath and control are intermingled. In a high-level model of a datapath-centric circuit, there will probably not be any code devoted to the state machine.

Hardware Recipe for Two-Add

Table 2.2: Hardware Recipe for Two-Add

inputs                                 3
adders                                 2
registers                              3
output                                 1
registered inputs                      YES
registered outputs                     YES
clock cycles from inputs to outputs    4

High-Level Models of Datapaths

The following two fragments of VHDL code (the hlm and hlm2 architectures) are derived directly from the dataflow diagram labeled "Two-add after I/O opt" in section 2.3.5, after input/output, datapath, and register allocation have been done. The code between wait statements describes the work that is done in a clock cycle.

The hlm architecture combines the datapath and control in a single process with multiple wait statements in the process. Because the process is clocked, all of the signals that are assigned to in the process are registers. Combinational signals would need to be done using concurrent assignments or combinational processes.

The hlm2 architecture is derived from the hlm architecture by separating combinational and registered signals.

High-Level Model with Single Process

architecture hlm of big_add is
begin
  process begin
    -------------------------------
    wait until rising_edge(clk);
    -------------------------------
    r1 <= i1;
    r2 <= i2;
    r3 <= i3;
    -------------------------------
    wait until rising_edge(clk);
    -------------------------------
    r1 <= (r1 + r2) + r3;
    r2 <= i2;
    r3 <= i3;
    -------------------------------
    wait until rising_edge(clk);
    -------------------------------
    r1 <= (r1 + r2) + r3;
    r2 <= i2;
    -------------------------------
    wait until rising_edge(clk);
    -------------------------------
    r3 <= (r1 + r2);
  end process;
  o1 <= r3;
end hlm;

High-Level Model with Combinational Signals

architecture hlm2 of big_add is
begin
  ---------------------------------
  -- registered signals
  process begin
    ---------------------------
    wait until rising_edge(clk);
    ---------------------------
    r1 <= i1;
    r2 <= i2;
    r3 <= i3;
    ---------------------------
    wait until rising_edge(clk);
    ---------------------------
    r1 <= a2;
    r2 <= i2;
    r3 <= i3;
    ---------------------------
    wait until rising_edge(clk);
    ---------------------------
    r1 <= a2;
    r2 <= i2;
    ---------------------------
    wait until rising_edge(clk);
    ---------------------------
    r3 <= a1;
  end process;
  ---------------------------------
  -- combinational signals (the two adders)
  a1 <= r1 + r2;
  a2 <= a1 + r3;
  ---------------------------------
  o1 <= r3;
  ---------------------------------
end hlm2;

2.3.7 From Dataflow Diagram to DP+Ctrl Model

Dataflow Diagram and Datapath Blocks

[Figure: two-add dataflow diagram after I/O optimization, with signals a, b, c, x1, x2, x3, x4, f, and z, beside the unconnected datapath building blocks]

Figure 2.4: Dataflow diagram and building blocks for block diagram

I/O Allocation

[Figure: dataflow diagram with each input and output signal labeled with its hardware port (i1, i2, i3, o1)]

Datapath Allocation

[Figure: the same diagram with each adder labeled with its datapath component (a1 or a2)]

Register Allocation

[Figure: the same diagram with each signal that crosses a clock cycle boundary labeled with its register (r1, r2, or r3)]

Allocation Completed

[Figure: dataflow diagram with all I/O, datapath, and register labels, beside the building blocks i1, i2, i3, r1, r2, r3, a1, a2, and o1]

I/O Allocation
  i1   a
  i2   b, d, f
  i3   c, e
  o1   z

Datapath Allocation
  a1   x1, x3, z
  a2   x2, x4

Register Allocation
  r1   a, x2, x4
  r2   b, d, f
  r3   c, e

Figure 2.5: Block diagram after I/O, datapath, and register allocation

Connect the Blocks and Add Muxes

To connect the blocks:
  Simulate the dataflow diagram, drawing connections between blocks when they communicate
  Add muxes where a block input has multiple drivers
  Clean up the drawing and add the state machine (control)

[Figure: the dataflow diagram is simulated one clock cycle at a time while connections and muxes are added between i1, i2, i3, r1, r2, r3, a1, a2, and o1 in the block diagram]

Done with Simulation

[Figure: block diagram with all connections and muxes in place]

Add State Machine

The state machine keeps track of which clock cycle of the dataflow diagram is currently being executed.

The state machine drives the datapath signals whose values are dependent upon which clock cycle of the dataflow diagram is being executed. Typical examples are:
  Select signals on multiplexers
  Instruction signals on arithmetic modules
  Chip-enable lines on registers and flip-flops

[Figure: block diagram with the ctrl block added]

Classes of Hardware

[Figure: block diagram of the example circuit (i1, i2, i3, r1, r2, r3, a1, a2, o1, ctrl) with its hardware grouped into datapath, storage, and control]

Figure 2.6: Classes of hardware in example circuit

2.3.7.1 Datapath for DP+Ctrl Model

The following VHDL code is derived directly from the block diagram in figure 2.6.

architecture main of big_add is
begin
  -- state machine (control); its body is elided here
  fsm : process ... end;

  -- r1 loads either input i1 or the output of adder a2
  process (clk) begin
    if rising_edge(clk) then
      if r1_gets_in = '1' then
        r1 <= i1;
      else
        r1 <= a2;
      end if;
    end if;
  end process;

  -- r2 always loads input i2
  process (clk) begin
    if rising_edge(clk) then
      r2 <= i2;
    end if;
  end process;

  -- r3 loads either input i3 or the output of adder a1
  process (clk) begin
    if rising_edge(clk) then
      if r3_gets_in = '1' then
        r3 <= i3;
      else
        r3 <= a1;
      end if;
    end if;
  end process;

  -- combinational datapath
  a1 <= r1 + r2;
  a2 <= a1 + r3;
  o1 <= r3;
end main;

In section 2.4, we'll discuss how to build the control circuitry (finite state machine, represented by the fsm process).

From Dataflow to Hardware (Almost)

1. Create dataflow diagram
2. Optimize inputs and outputs
3. I/O allocation: assign dataflow signals to hardware inputs and outputs
4. Datapath allocation: assign dataflow blocks to components
5. Register allocation: assign dataflow signals to registers
6. Derive high-level model
7. Connect the hardware, add muxes where needed
8. Derive datapath for DP+Ctrl model
9. Build the state machine, connect to datapath + storage

2.3.8 Dataflow Diagram Scheduling

Schedule: move functional blocks between clock cycles
  Allows tradeoffs between performance and area

NOTE: Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms
NOTE: Serial algorithms tend to have less area than parallel algorithms

[Figure: adder graphs for the serial design (((((a+b)+c)+d)+e)+f) and the parallel design (a+b)+(c+d)+(e+f)]

Design Analysis

[Figure: dataflow diagram of the parallel design with inputs a, b, c, d, e, f]

clock period   1 add
num adders     3

Scheduling to Optimize Area

[Figure: parallel dataflow diagram as originally drawn (inputs a, b, c, d, e, f) and after scheduling (inputs a, b, c, d in the first clock cycle)]

               original        after scheduling
inputs         6               4
outputs        1               1
registers      6               4
adders         3               2
clock period   flop + 1 add    flop + 1 add
latency        3               3

2.3.9 Summary: From Dataflow to Hardware

1. Create dataflow diagram
2. Schedule data operations
3. Optimize inputs and outputs
4. I/O allocation: assign dataflow signals to hardware inputs and outputs
5. Datapath allocation: assign dataflow blocks to components
6. Register allocation: assign dataflow signals to registers
7. Derive high-level model
8. Connect the hardware, add muxes where needed
9. Derive datapath for DP+Ctrl model
10. Build the state machine, connect to datapath + storage

LEC-06 Preliminaries

LEC-06: State Machine Design


Lecture Notes Sections: 2.4 to 2.4.10

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

wk-01 to 02   VHDL
  wk-01         Overview and VHDL
  wk-02         VHDL Semantics
wk-03 to 05   Design and Optimization
  wk-03         Dataflow Diagrams; State Machines
  wk-04         Memory Design; Example Design
  wk-05         Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Overview

This lecture builds on material from Lec-05, where dataflow diagrams were introduced. The bulk of the lecture discusses finite state machine design: first how to build a state machine from a dataflow diagram, and then various ways of coding up state machines in VHDL.

Concepts

Lecture Notes: Sections 2.4 to 2.4.10

  input/output protocols
  deriving finite state machines from dataflow diagrams
  coding state machines in VHDL
  state encoding
  explicit state machines
  implicit state machines

Background

Mano Digital Design
  Section 6-4: Analysis of Clocked Sequential Circuits
  Section 6-5: State Reduction and Assignment
  Section 6-7: Design Procedure

Reading

Smith ASIC
  By now, you should be done with Chapter 8 (Programmable ASIC Design Software) and Chapter 10 (VHDL)
  Section 12.2: Synthesis (From Lec-02)
  Section 12.6: VHDL Logic Synthesis (From Lec-02)
  Section 12.7: Finite State Machine Synthesis

Rushton VHDL for Logic Synthesis (On reserve in DC-Library)
  Chapter 8: Sequential VHDL
  Chapter 9: Registers
  Section 12.2: Finite State Machines

2.4 Finite State Machines in VHDL

2.4.1 Mealy vs Moore State Machines

Moore Machines

  Outputs are dependent upon only the state
  No combinational paths from inputs to outputs
  Outputs can be either flops or combinational

[Figure: Moore state diagram with states labeled state/output (s0/0, s1/1, s2/0, s3/0) and transitions out of s0 labeled a and !a]

Mealy Machines

  Outputs are dependent upon both the state and the inputs
  Combinational paths from inputs to outputs
  Outputs must be combinational

[Figure: Mealy state diagram with states s0, s1, s2, s3 and transitions labeled input/output (a/1, !a/0, /0, /0)]

2.4.2 State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.

Design Decisions

  Moore vs Mealy (Sections 2.4.3.1 and 2.4.3.2)
  Implicit vs Explicit (Section 2.4.6)
  State values in explicit state machines: Enumerated type vs constants (Section 2.4.4.1)
  State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.4.4.2)

How to Steer a State Machine

The following VHDL control constructs are useful to steer the transition from state to state:

  if ... then ... else
  case
  for ... loop
  while ... loop
  loop
  next
  exit

2.4.2.1 Implicit and Explicit State Machines

There are two general ways to code state machines: implicit and explicit.

Implicit State Machines

Some state machines do not have a specific state signal. Instead, the state machine uses multiple wait statements in a process to control the values driving the control signals needed by the datapath. These state machines are called implicit. The synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg.

Explicit State Machines

The alternative to an implicit state machine is an explicit style, where the engineer defines a signal to represent the state and provides code to store and update the state signal. In this case, each process has at most one wait statement.

2.4.3 Some Simple State Machines

2.4.3.1 Implementing a Simple Moore Machine

Entity and Diagram

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions out of s0 labeled a and !a]

entity moore is
  port (
    a, clk : in std_logic;
    z : out std_logic
  );
end moore;

Implicit State Machine

architecture main of moore is
begin
  process begin
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
    else
      z <= '0';
    end if;
    wait until rising_edge(clk);
    z <= '0';
    wait until rising_edge(clk);
  end process;
end main;

Explicit with Flopped Outputs

architecture main of moore_v2 is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  -- z is a flop, so it is assigned the output of the state being entered
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z <= '1';
          else
            state <= s2;
            z <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z <= '0';
        when s3 =>
          state <= s0;
          z <= '0';  -- s0 outputs 0 in the state diagram
      end case;
    end if;
  end process;
end main;

Explicit with Combinational Outputs

architecture main of moore_v3 is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;
  z <= '1' when (state = s1) else '0';
end main;

State Machine with Next Signals

architecture main of moore_v4 is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s1) else '0';
end main;

Explicit with Combinational Process

architecture main of moore_v4 is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  process (state, a) begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;
  z <= '1' when (state = s1) else '0';
end main;

2.4.3.2 Implementing a Simple Mealy Machine

Entity and Diagram

[Figure: Mealy state diagram with states s0, s1, s2, s3 and transitions labeled input/output (a/1, !a/0, /0, /0)]

entity mealy is
  port (
    a, clk : in std_logic;
    z : out std_logic
  );
end mealy;

Implicit State Machine

architecture main of mealy is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process begin
    state <= s0;
    wait until rising_edge(clk);
    if (a = '1') then
      state <= s1;
    else
      state <= s2;
    end if;
    wait until rising_edge(clk);
    state <= s3;
    wait until rising_edge(clk);
  end process;
  z <= '1' when (state = s0) and a = '1' else '0';
end main;

Explicit State Machine

architecture main of mealy_v2 is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when others =>
          state <= s0;
      end case;
    end if;
  end process;
  z <= '1' when (state = s0) and a = '1' else '0';
end main;

State Machine with Next Signal

architecture main of mealy_v3 is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <= s1 when (state = s0) and a = '1'
          else s2 when (state = s0) and a = '0'
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s0) and a = '1' else '0';
end main;

2.4.4 State Encoding

2.4.4.1 Constants vs Enumerated Type

Using an enumerated type:

  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;

Using constants:

  subtype state_ty is std_logic_vector(1 downto 0);
  constant s0 : state_ty := "11";
  constant s1 : state_ty := "10";
  constant s2 : state_ty := "00";
  constant s3 : state_ty := "01";
  signal state : state_ty;

Providing Encodings for Enumerated Types

Many synthesizers allow the user to provide hints on how to encode the states, or allow the user to provide the desired encoding explicitly. These hints are given either through VHDL attributes or through special comments in the code.
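As one possible illustration of the attribute style, the sketch below uses the ENUM_ENCODING attribute described in the IEEE 1076.6 synthesis standard to request the same encoding as the constants in section 2.4.4.1. The package name state_pkg is hypothetical, and whether a particular synthesizer honours the attribute is an assumption to check against that tool's documentation.

```vhdl
-- Hypothetical package; only the attribute mechanism is the point here.
package state_pkg is
  type state_ty is (s0, s1, s2, s3);

  -- One code per enumeration literal, in declaration order:
  -- s0 = "11", s1 = "10", s2 = "00", s3 = "01".
  attribute enum_encoding : string;
  attribute enum_encoding of state_ty : type is "11 10 00 01";
end state_pkg;
```

Simulation is unaffected by the attribute: the simulator still treats state_ty as an ordinary enumerated type.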

Simulation

When doing functional simulation with enumerated types, simulators often display waveforms with pretty-printed values rather than bits (e.g. s0 and s1 rather than 11 and 10). However, when simulating a design that has been mapped to gates, the enumerated type disappears and you are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very difficult to debug the design.

Covering All Cases

When writing case statements or selected assignments that test the value of std_logic signals, you will get an error unless you include a provision for non-'1'/'0' values. For example:

  signal t : std_logic;
  ...
  case t is
    when '1' => ...
    when '0' => ...
  end case;

will result in an error message about missing cases. You must provide for t being 'H', 'U', etc. The simplest thing to do is to make the last test "when others". However, this opens you up to potential bugs if the enumerated type you are testing grows to include more values, which then end up unintentionally executing your "when others" branch, rather than having a special branch of their own in the case statement.
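A minimal complete version of the case statement above might look like the following sketch. The entity cover_all and its output y are hypothetical names chosen for illustration; the design is simply an inverter that drives 'X' for any input that is neither '1' nor '0'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only to make the fragment complete.
entity cover_all is
  port (
    t : in  std_logic;
    y : out std_logic
  );
end cover_all;

architecture main of cover_all is
begin
  process (t) begin
    case t is
      when '1'    => y <= '0';
      when '0'    => y <= '1';
      -- covers the remaining std_logic values:
      -- 'U', 'X', 'Z', 'W', 'L', 'H', and '-'
      when others => y <= 'X';
    end case;
  end process;
end main;
```

Driving 'X' in the others branch keeps unexpected values visible in simulation instead of silently mapping them onto a legal output.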

Unused Values

If the number of values you have in your datatype is not a power of two, then you will have some unused values that are representable. For example:

  subtype state_ty is std_logic_vector(2 downto 0);
  constant s0 : state_ty := "011";
  constant s1 : state_ty := "000";
  constant s2 : state_ty := "001";
  constant s3 : state_ty := "010";
  constant s4 : state_ty := "101";
  signal state : state_ty;

This type only needs five unique values, but can represent eight different values. What should we do with the three representable values that we don't need? The safest thing to do is to code your design so that if an illegal value is encountered, the machine resets or enters an error state.
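The "reset on an illegal value" advice can be sketched as follows. The entity safe_fsm, the err output, the simple s0 to s4 cycle, and the binary encoding are all assumptions made for illustration; only the others branch is the technique being shown.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical five-state machine that cycles s0 -> s1 -> ... -> s4 -> s0.
entity safe_fsm is
  port (
    clk : in  std_logic;
    err : out std_logic   -- pulses when an illegal state was seen
  );
end safe_fsm;

architecture main of safe_fsm is
  subtype state_ty is std_logic_vector(2 downto 0);
  constant s0 : state_ty := "000";
  constant s1 : state_ty := "001";
  constant s2 : state_ty := "010";
  constant s3 : state_ty := "011";
  constant s4 : state_ty := "100";
  signal state : state_ty := s0;
begin
  process (clk) begin
    if rising_edge(clk) then
      err <= '0';
      case state is
        when s0 => state <= s1;
        when s1 => state <= s2;
        when s2 => state <= s3;
        when s3 => state <= s4;
        when s4 => state <= s0;
        -- "101", "110", "111" (and metavalues in simulation):
        -- return to the initial state and flag the error.
        when others =>
          state <= s0;
          err   <= '1';
      end case;
    end if;
  end process;
end main;
```

Some synthesis tools may optimize away logic for states they consider unreachable, so it is worth checking the tool's settings if the recovery path matters.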

2.4.4.2 Encoding Schemes

  Binary: Conventional binary counter.
  One-hot: Exactly one bit is asserted at any time.
  Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding where the initial state is all 0s.
  Gray: Transition between adjacent values requires exactly one bit flip.
  Custom: Choose encoding to simplify combinational logic for specific task.


Tradeoffs in Encoding Schemes

Gray is good for low-power applications where consecutive data values typically differ by 1 (i.e. no random jumps).
One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.
Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.
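To make the schemes concrete, here is a sketch of three of the encodings applied to a four-state machine that steps S0, S1, S2, S3 and back to S0. The package name enc_examples and all constant names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative encodings for a four-state machine; names are hypothetical.
package enc_examples is
  -- Binary: 2 flip-flops, states numbered in counting order
  constant bin_s0 : std_logic_vector(1 downto 0) := "00";
  constant bin_s1 : std_logic_vector(1 downto 0) := "01";
  constant bin_s2 : std_logic_vector(1 downto 0) := "10";
  constant bin_s3 : std_logic_vector(1 downto 0) := "11";

  -- Gray: 2 flip-flops, exactly one bit flips on each transition,
  -- including the wrap-around from S3 ("10") to S0 ("00")
  constant gray_s0 : std_logic_vector(1 downto 0) := "00";
  constant gray_s1 : std_logic_vector(1 downto 0) := "01";
  constant gray_s2 : std_logic_vector(1 downto 0) := "11";
  constant gray_s3 : std_logic_vector(1 downto 0) := "10";

  -- One-hot: 4 flip-flops, exactly one bit asserted per state
  constant hot_s0 : std_logic_vector(3 downto 0) := "0001";
  constant hot_s1 : std_logic_vector(3 downto 0) := "0010";
  constant hot_s2 : std_logic_vector(3 downto 0) := "0100";
  constant hot_s3 : std_logic_vector(3 downto 0) := "1000";
end enc_examples;
```

With one-hot, testing for a state needs only one bit (for example, bit 2 for S2), which is where the savings in combinational logic come from; the price is two extra flip-flops in this small case.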


2.4.5 From Dataflow to State Machine

This section designs the state machine for the big_add example used in dataflow diagrams (Section 2.3.7). We pick up from the VHDL code for the datapath in Section 2.3.7.1.

[Figure: dataflow diagram for big_add (inputs a to f, intermediate values x1 to x4, output z) and the datapath block diagram with inputs i1, i2, i3, registers r1, r2, r3, adders a1 and a2, the ctrl block, and output o1]

Two control signals from the state machine:
  r1_gets_in   r1 reads from input or a2
  r3_gets_in   r3 reads from input or a1

Simulate the dataflow diagram and record the required values of the signals.

  cycle        1      2      3      4
  r1_gets_in   true   false  false  -
  r3_gets_in   true   true   -      false


Don't Care Values

NOTE: Don't care values. In cycle 3, we don't care what the value of r3_gets_in is. In cycle 4, we don't care what the value of r1_gets_in is. So we assign these signals '-' in these clock cycles, which is "don't care" in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 427 have used '-' quite successfully to decrease the area of their design. However, a few groups found that using '-' increased the size of their design when they were expecting a decrease. So, if you are tweaking your design to squeeze out the last few unneeded bits of area, pay close attention to whether using '-' hurts or helps.

2.4.6 Implicit vs Explicit State Machines

There are two broad categories of state machines in VHDL: explicit and implicit. Explicit state machines are a direct translation of the hardware: concurrent assignments for the next-state equations and a clocked process for the flip-flops that hold the state. Implicit state machines are built with processes that contain multiple wait statements. Explicit state machines are more cumbersome to write, but they are simpler to synthesize and more commonly used. Implicit state machines are concise and readable. Very few books or synthesis manuals describe processes with multiple wait statements, but they are relatively well supported among synthesis tools.

2.4.7 Implicit State Machines

Several examples of implicit state machines that could be used to drive r1_gets_in and r3_gets_in.

2.4.7.1 Multi-Wait Process

This example directly controls the signals from a multi-wait process. (A process that contains wait statements must not have a sensitivity list.)

  process
  begin
    ------------------------------------------- cycle 1
    wait until rising_edge(clk);
    r1_gets_in <= '1';
    r3_gets_in <= '1';
    ------------------------------------------- cycle 2
    wait until rising_edge(clk);
    r1_gets_in <= '0';
    r3_gets_in <= '1';
    ------------------------------------------- cycle 3
    wait until rising_edge(clk);
    r1_gets_in <= '0';
    r3_gets_in <= '-';
    ------------------------------------------- cycle 4
    wait until rising_edge(clk);
    r1_gets_in <= '-';
    r3_gets_in <= '0';
  end process;

2.4.7.2 Counter

This example uses a counter in a process to keep track of the state, and then uses concurrent assignments for the control signals. The assignments to r1_gets_in and r3_gets_in could be done with conditional assignments or a combinational process. Some of these alternatives are illustrated in Section 2.4.8.

  ----------------------------------------------------
  process
  begin
    cycle_count <= to_unsigned(0, 2);
    --------------------------------
    wait until rising_edge(clk);
    --------------------------------
    while 3 > cycle_count loop
      cycle_count <= cycle_count + 1;
      wait until rising_edge(clk);
    end loop;
  end process;
  ----------------------------------------------------
  with cycle_count select
    r1_gets_in <= '1' when to_unsigned(0,2),
                  '0' when others;
  ----------------------------------------------------
  -- r3 reads from the input in cycles 1 and 2 (counts 0 and 1)
  with cycle_count select
    r3_gets_in <= '1' when to_unsigned(0,2) | to_unsigned(1,2),
                  '0' when others;
  ----------------------------------------------------

2.4.8 Explicit State Machines

This is an explicit state machine. A clocked process is used to store the state and a concurrent assignment is used to calculate the next state. The datapath is the same as in Section 2.3.6. The control signals for the datapath (r1_gets_in and r3_gets_in) drive the two multiplexers, one for each register (r1 and r3). The values of r1_gets_in and r3_gets_in are determined by the current state of the machine. In this section we first write the explicit state machine, and then look at several different coding styles for communicating between the state machine and datapath.

2.4.8.1 State Machine

This is the explicit state machine. It stays the same for all of the different examples here.

  architecture main of big_add is
    type state_ty is (S0, S1, S2, S3);
    signal state_cur, state_nxt : state_ty;
    ...
  begin
    -- clocked process: flip-flops that hold the current state
    process (clk)
    begin
      if rising_edge(clk) then
        state_cur <= state_nxt;
      end if;
    end process;
    -- next-state equations
    with state_cur select
      state_nxt <= S1 when S0,
                   S2 when S1,
                   S3 when S2,
                   S0 when S3;
    ...r1_gets_in asn...
    ...r3_gets_in asn...
    ...datapath...
  end main;

2.4.8.2 Conditional Assignment

The first coding example uses simple conditional assignments.

  r1_gets_in <= '1' when state_cur = S0
                else '0';
  r3_gets_in <= '1' when (state_cur = S0) or (state_cur = S1)
                else '0';

2.4.8.3 Conditional Assignment with Don't Care

The simple conditional assignment doesn't take advantage of the fact that the last state doesn't use the adder a1, so we don't care whether r1 reads from the input or from a2. We give the synthesis tool a chance to simplify the equations for r1_gets_in (and thereby hopefully reduce area) by putting a don't care value for r1_gets_in in the last state.

  r1_gets_in <= '1' when state_cur = S0
                else '0' when (state_cur = S1) or (state_cur = S2)
                else '-';
  r3_gets_in <= '1' when (state_cur = S0) or (state_cur = S1)
                else '0' when (state_cur = S3)
                else '-';

2.4.8.4 Selected Assignment with Don't Care

The conditional assignment code has many occurrences of state_cur in the conditions, which is ugly. So, use a case-like statement (the selected assignment).

  with state_cur select
    r1_gets_in <= '1' when S0,
                  '0' when S1 | S2,
                  '-' when others;
  with state_cur select
    r3_gets_in <= '0' when S3,
                  '1' when S0 | S1,
                  '-' when others;

2.4.8.5 Case Statement

The selected assignment code tests state_cur for both assignments, so try a case statement in a process, which allows multiple assignments within the case statement.

  process (state_cur)
  begin
    case state_cur is
      when S0 => r1_gets_in <= '1';
                 r3_gets_in <= '1';
      when S1 => r1_gets_in <= '0';
                 r3_gets_in <= '1';
      when S2 => r1_gets_in <= '0';
                 r3_gets_in <= '-';
      when S3 => r1_gets_in <= '-';
                 r3_gets_in <= '0';
    end case;
  end process;


Summary and Conclusion


After writing out the different options, the selected assignment style looks to be the best option for this example. The code is short, clean and easy to understand.

2.4.9 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.


Reset with Implicit State Machine


With an implicit state machine, we need to insert a loop in the process and test for reset after each wait statement.

  process
  begin
    init : loop                          -- outermost loop
      cycle_count <= to_unsigned(0, 2);
      wait until rising_edge(clk);
      next init when (reset = '1');      -- test for reset
      while 3 > cycle_count loop
        cycle_count <= cycle_count + 1;
        wait until rising_edge(clk);
        next init when (reset = '1');    -- test for reset
      end loop;
    end loop;
  end process;


Reset with Explicit State Machine


Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state:

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state_cur <= S0;
      else
        state_cur <= state_nxt;
      end if;
    end if;
  end process;

2.4.10 Input / Output Protocols

An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

Four phase handshaking protocol

[Figure 2.7: Four phase handshaking protocol (waveforms for rdy, data, ack)]

Used when the timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.

Valid-bit protocol

[Figure 2.8: Valid-bit protocol (waveforms for clk, valid, data)]

A low overhead (both in area and performance) protocol. The consumer must always be able to accept incoming data. Often used in pipelined circuits. More complicated versions of the protocol can handle pipeline stalls.
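A minimal sketch of one pipeline stage using the valid-bit protocol is shown below. The entity valid_stage, its port names, the 8-bit data width, and the "+ 1" placeholder computation are all assumptions for illustration. The valid bit simply travels alongside the data, one register stage at a time, and there is no way for the consumer to push back.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical pipeline stage; names and widths are illustrative only.
entity valid_stage is
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;
    i_valid : in  std_logic;            -- '1' when i_data holds real data
    i_data  : in  unsigned(7 downto 0);
    o_valid : out std_logic;            -- '1' when o_data holds real data
    o_data  : out unsigned(7 downto 0)
  );
end valid_stage;

architecture main of valid_stage is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- the valid bit is delayed by exactly one cycle, like the data
      if reset = '1' then
        o_valid <= '0';
      else
        o_valid <= i_valid;
      end if;
      -- the data register needs no reset: it is ignored while o_valid = '0'
      if i_valid = '1' then
        o_data <= i_data + 1;           -- placeholder computation
      end if;
    end if;
  end process;
end main;
```

Only the valid bit is reset, which keeps the reset fan-out small; this is a design choice in the sketch rather than a requirement of the protocol.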

Start/Done Protocol

[Figure 2.9: Start/Done protocol (waveforms for clk, start, data_in, done, data_out)]

A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece of data at a time and the time to compute the result is unpredictable.

LEC-07 Preliminaries

LEC-07: Memory Design


Lecture Notes Sections: 2.5 to 2.5.2.6

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

wk-01 to 02   VHDL
  wk-01       Overview and VHDL
  wk-02       VHDL Semantics
wk-03 to 05   Design and Optimization
  wk-03       Dataflow Diagrams; State Machines
  wk-04       Memory Design; Example Design
  wk-05       Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Overview
This lecture builds on material from Lec-05, where dataflow diagrams were introduced. In this lecture, we show how to deal with memory reads and writes in dataflow diagrams. This ties in with data hazards in computer architecture.

Concepts
Lecture Notes: Sections 2.5 to 2.5.2.6
  memory arrays in dataflow diagrams
  data dependencies and hazards
  memory arrays in VHDL

Background

Reading
Smith ASIC
  Section 12.8: Memory Synthesis
  The remainder of Chapter 12

2.5 Memory Arrays and RTL Design

2.5.1 Memory Arrays and Dataflow Diagrams

2.5.1.1 Legend for Dataflow Diagrams

[Figure: legend of dataflow diagram symbols: input port, output port, state signal, array read (name(rd)), and array write (name(wr))]

2.5.1.2 Basic Memory Operations

Memory Read:   data := mem[addr];
Memory Write:  mem[addr] := data;

[Figure: the read operation mem(rd) takes mem and addr and produces data, plus mem again through an anti-dependency arrow; the write operation mem(wr) takes mem, data, and addr and produces mem]

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

Basic Memory Operations (Cont'd)

There are a few aspects of the basic memory operations that are potentially surprising:

  The anti-dependency arrow producing mem on a read.
  Reads and writes are dependent upon the entire previous value of the memory array.
  The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.5.1.4. The apparent dependency on, and production of, an entire memory array arises because we don't know which address in the array will be read from or written to. There are optimizations that can be performed when we know the address (Section 2.5.1.5).
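One way to see why a write "produces an entire memory array" is to model the two operations as pure functions, as in the sketch below. The package mem_model and the function names mem_rd and mem_wr are hypothetical and are meant for simulation-level reasoning, not as a synthesizable memory.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical functional model of the two basic memory operations.
package mem_model is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;

  -- data := mem[addr]
  function mem_rd (mem : data_vector; addr : natural) return data;

  -- mem[addr] := d, expressed as "old array in, new array out"
  function mem_wr (mem : data_vector; addr : natural; d : data)
    return data_vector;
end mem_model;

package body mem_model is
  function mem_rd (mem : data_vector; addr : natural) return data is
  begin
    return mem(addr);
  end mem_rd;

  function mem_wr (mem : data_vector; addr : natural; d : data)
    return data_vector is
    variable result : data_vector(mem'range) := mem;  -- copy of old array
  begin
    result(addr) := d;                                -- change one element
    return result;                                    -- whole new array
  end mem_wr;
end mem_model;
```

Because addr is only known at run time, the result of mem_wr depends on every element of the old array, which is the same reason the dataflow arrows carry the whole of mem.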

2.5.1.3 Data Dependencies

Instructions in a program can be reordered, so long as the data dependencies are preserved.

Data Dependencies (Cont'd)

Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

Initial Program:
  M[2] := 21
  M[3] := 31
  A    := M[2]
  B    := M[0]
  M[3] := 32
  M[0] := 01
  C    := M[3]

Data Dependencies (Cont'd)

[Figure: Initial Program with Dependencies (the initial program and memory contents, annotated with dependency arrows)]

Data Dependencies (Cont'd)

Valid Modification:
  M[2] := 21
  B    := M[0]
  A    := M[2]
  M[3] := 31
  M[3] := 32
  M[0] := 01
  C    := M[3]

Data Dependencies (Cont'd)

Valid (or Bad?) Modification:
  M[2] := 21
  B    := M[0]
  A    := M[2]
  M[3] := 31
  C    := M[3]
  M[3] := 32
  M[0] := 01
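The three orderings on the preceding slides can be checked with a small simulation-only sketch such as the one below. The testbench name reorder_check and the use of integers for the memory contents are assumptions for illustration. It runs each ordering on a fresh copy of the initial memory and compares the values of A, B, and C.

```vhdl
-- Hypothetical simulation-only check of the three program orderings.
entity reorder_check is
end reorder_check;

architecture sim of reorder_check is
  type int_vector is array (0 to 3) of integer;
  constant init_mem : int_vector := (0 => 0, 1 => 10, 2 => 20, 3 => 30);
begin
  process
    variable m       : int_vector;
    variable a, b, c : integer;
  begin
    -- initial program order
    m := init_mem;
    m(2) := 21;  m(3) := 31;
    a := m(2);   b := m(0);
    m(3) := 32;  m(0) := 1;
    c := m(3);
    assert a = 21 and b = 0 and c = 32
      report "initial program: unexpected result" severity failure;

    -- first modification: A and B swapped, B moved ahead of M[3] := 31
    m := init_mem;
    m(2) := 21;
    b := m(0);   a := m(2);
    m(3) := 31;  m(3) := 32;  m(0) := 1;
    c := m(3);
    assert a = 21 and b = 0 and c = 32
      report "first modification changed a result" severity failure;

    -- second modification: C := M[3] moved ahead of M[3] := 32
    m := init_mem;
    m(2) := 21;
    b := m(0);   a := m(2);
    m(3) := 31;
    c := m(3);                -- reads 31: the later write has not happened
    m(3) := 32;  m(0) := 1;
    assert c = 31
      report "second modification: expected C = 31" severity failure;

    report "second modification yields C = 31 instead of 32";
    wait;                     -- stop the process
  end process;
end sim;
```

The second modification breaks the read-after-write dependency between M[3] := 32 and C := M[3], so C changes from 32 to 31 even though the final memory contents are the same.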

2.5.1.4 Definition of Three Types of Dependencies

There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

  Read after Write:    M[i] := ...   followed by   ... := M[i]
  Write after Write:   M[i] := ...   followed by   M[i] := ...
  Write after Read:    ... := M[i]   followed by   M[i] := ...

2.5.1.5 Dataflow Diagrams and Data Dependencies

Read after Write Dependencies

Algo:
  mem[wr_addr] := data_in;
  data_out := mem[rd_addr];

[Figure: Read after Write. mem(wr), with inputs mem, data_in, and wr_addr, feeds mem(rd), with input rd_addr, which produces mem and data_out]

Read after Write Optimization

Algo:
  mem[wr_addr] := data_in;
  data_out := mem[rd_addr];

[Figure: dataflow diagram for the optimized version, with the same operations mem(wr) and mem(rd)]

Optimization when rd_addr /= wr_addr

Write after Write Dependencies

Algo:
  mem[wr1_addr] := data1;
  mem[wr2_addr] := data2;

[Figure: Write after Write. The first mem(wr), with inputs mem, data1, and wr1_addr, feeds the second mem(wr), with inputs data2 and wr2_addr, which produces mem]

Write after Write Scheduling Option

Algo:
  mem[wr1_addr] := data1;
  mem[wr2_addr] := data2;

[Figure: dataflow diagram with the two mem(wr) operations in the opposite order: the write of data2 to wr2_addr first, then the write of data1 to wr1_addr]

Scheduling option when wr1_addr /= wr2_addr

Write after Read Dependencies

Algo:
  rd_data := mem[rd_addr];
  mem[wr_addr] := wr_data;

[Figure: Write after Read. mem(rd), with inputs mem and rd_addr, produces rd_data and feeds mem(wr), with inputs wr_data and wr_addr, which produces mem]

Write after Read Optimization

Algo:
  rd_data := mem[rd_addr];
  mem[wr_addr] := wr_data;

[Figure: dataflow diagram for the optimized version, with the same operations mem(rd) and mem(wr)]

Optimization when rd_addr /= wr_addr

2.5.1.6 Example: Memory Array and Dataflow Diagram

Initial Dataflow Diagram

  1   M[2] := 21
  2   M[3] := 31
  3   A    := M[2]
  4   B    := M[0]
  5   M[3] := 32
  6   M[0] := 01
  7   C    := M[3]

[Figure 2.10: Memory array example code and initial dataflow diagram (a chain of M(wr) and M(rd) operations in program order)]

Dependency Arrow and Addresses

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.10 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to. In Figure 2.11, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. These are the real dependencies and match those shown in the code fragment for Figure 2.10. In Figure 2.12 we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.

Optimize Dependencies for Known Addresses

[Figure: the dataflow diagram of Figure 2.10, revised using the known addresses]

Optimize Anti-Dependencies for Known Addresses

[Figure: the dataflow diagram of Figure 2.10, revised using the known addresses]

Minimal Dependencies

[Figure 2.11: Memory array with minimal dependencies]

Question: What is the critical path?

Critical Path

[Figure: the diagram of Figure 2.11 with the critical path marked]

Reads and Writes

[Figure: the diagram of Figure 2.11 with read operations and write operations distinguished]

Question: In what order should operations occur?

Question: Which operations must be first or last?

Obvious First and Last Operations

[Figure: the diagram of Figure 2.11 with the first and last operations marked]

First and last read are obvious from the critical path. Last write is obvious.

Question: Are any operations forced into a specific order at this point?

Middle Read

[Figure: the diagram of Figure 2.11 with the middle read marked]

There are only three reads, so once the first and last have been picked, the middle one is determined.

Question: Which write should happen first?

First Write

[Figure: the diagram of Figure 2.11 with the first write marked]

The first write is the one closest to the start of the critical path, although because we know the addresses, we could reschedule the first two writes.

Question: Can we complete the ordering?

Complete Ordering

[Figure 2.12: Memory array with orderings]

The ordering of writes 2 and 3 is determined because both have 3 as their address.

Place Operations in Clock Cycles

[Figure: the ordered diagram of Figure 2.12 with each operation assigned to a clock cycle]

Final Dataflow Diagram

[Figure 2.13: Final version of Figure 2.10]

Put as many parallel operations into the same clock cycle as allowed by resources (one write + one read, two reads, or one write for dual-port RAM). Preserve dependencies by putting dependent operations in separate clock cycles.

2.5.2 Memory Arrays in VHDL

2.5.2.1 Two-Dimensional Array

A memory array can be written in VHDL as a two-dimensional array:

  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);

However, a two-dimensional array does not accurately capture the limitations on a memory array in hardware.

Two-Dimensional Array

The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.

  architecture main of mem_not_hw is
    subtype data is std_logic_vector(7 downto 0);
    type data_vector is array( natural range <> ) of data;
    signal mem : data_vector(31 downto 0);
  begin
    y <= mem( a );              -- comb read
    mem( a ) <= b;              -- comb write
    process (clk)
    begin
      if rising_edge(clk) then
        mem( c ) <= w;          -- write port #1
      end if;
    end process;
    process (clk)
    begin
      if rising_edge(clk) then
        mem( d ) <= v;          -- write port #2
      end if;
    end process;
    u <= mem( e );              -- read port #2
  end main;

2.5.2.2 Memory Array in Hardware

Most simple memory arrays are single- or dual-ported, support just one write operation at a time, and have an interface protocol using a clock and write-enable.

[Figure: block symbols for a single-ported memory (WE, A, DI, DO) and a dual-ported memory (WE, A0, DI0, DO0, A1, DO1)]

2.5.2.3 Example VHDL Code for Memory Array in Hardware

  library ieee;
  use ieee.std_logic_1164.all;

  package mem_pkg is
    subtype data is std_logic_vector(7 downto 0);
    type data_vector is array( natural range <> ) of data;
  end;

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  use work.mem_pkg.all;

  entity mem is
    port (
      clk : in  std_logic;
      we  : in  std_logic;              -- write enable
      a   : in  unsigned(4 downto 0);   -- address
      di  : in  data;                   -- data_in
      do  : out data                    -- data_out
    );
  end mem;

  architecture main of mem is
    signal mem : data_vector(31 downto 0);
  begin
    -- combinational read
    do <= mem( to_integer( a ) );
    -- clocked write, gated by write enable
    process (clk)
    begin
      if rising_edge(clk) then
        if we = '1' then
          mem( to_integer( a ) ) <= di;
        end if;
      end if;
    end process;
  end main;

2.5.2.4 Library Component

Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop. Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than a two-dimensional array of flip-flops. These libraries exploit specialized hardware on the chips to implement the memory. NB: To synthesize a reasonable implementation of a memory array with Synopsys, you must instantiate a vendor-supplied memory component. Some other synthesis tools can infer memory arrays from two-dimensional arrays and synthesize efficient implementations.

Recommended Design Process with Memory

1. high-level model with two-dimensional array
2. two-dimensional array packaged inside memory entity/architecture
3. vendor-supplied component

Altera

Altera uses MegaFunctions to implement RAM in VHDL. A MegaFunction is a black-box description of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM components of different sizes. In E&CE 427 we will provide you with the VHDL code for the RAM components that you will need in Lab-3 and the Project. The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:

  Number of Elements   Word Size (bits)
  2048                 1
  1024                 2
  512                  4
  256                  8
  128                  16

Xilinx

Use component instantiation to get these components:

  Component   Number of Elements   Word Size (bits)   Description
  ram16x1s    16                   1                  single-ported memory
  ram16x1d    16                   1                  dual-ported memory

Other sizes are also available; consult the datasheet for your chip.

2.5.2.5 Build Memory from Slices

If the vendor's libraries of memory components do not include one that is the correct size for your needs, you can construct your own component from smaller ones.

Widen the Words in a Memory

[Figure 2.14: An N x 2W memory from N x W components. WriteEn, Addr, and Clk go to both N x W components; DataIn[W-1..0] and DataIn[2W-1..W] go to one component each, which produce DataOut[W-1..0] and DataOut[2W-1..W]]

Increase Number of Words in a Memory

[Figure 2.15: A 2N x W memory from N x W components. Addr[logN-1..0], DataIn, and Clk go to both N x W components; WriteEn and the top address bit Addr[logN] select which component is written and which output drives DataOut]
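A sketch of this construction for N = 16 and W = 8 is shown below. Both entities are hypothetical: ram16x8 stands in for a vendor-supplied N x W component and is written here as an inferred array only so that the example is self-contained, and ram32x8 is the 2N x W memory built from two of them.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Stand-in for a vendor N x W component (N = 16, W = 8); hypothetical.
entity ram16x8 is
  port (
    clk, we : in  std_logic;
    a       : in  unsigned(3 downto 0);
    di      : in  std_logic_vector(7 downto 0);
    do      : out std_logic_vector(7 downto 0)
  );
end ram16x8;

architecture main of ram16x8 is
  type mem_ty is array (0 to 15) of std_logic_vector(7 downto 0);
  signal mem : mem_ty;
begin
  do <= mem(to_integer(a));
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(a)) <= di;
      end if;
    end if;
  end process;
end main;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- 2N x W memory: the top address bit picks one of two N x W components.
entity ram32x8 is
  port (
    clk, we  : in  std_logic;
    addr     : in  unsigned(4 downto 0);
    data_in  : in  std_logic_vector(7 downto 0);
    data_out : out std_logic_vector(7 downto 0)
  );
end ram32x8;

architecture main of ram32x8 is
  signal we_lo, we_hi : std_logic;
  signal do_lo, do_hi : std_logic_vector(7 downto 0);
begin
  -- only the selected half sees the write enable
  we_lo <= we and not addr(4);
  we_hi <= we and addr(4);

  lo : entity work.ram16x8
    port map (clk => clk, we => we_lo, a => addr(3 downto 0),
              di => data_in, do => do_lo);
  hi : entity work.ram16x8
    port map (clk => clk, we => we_hi, a => addr(3 downto 0),
              di => data_in, do => do_hi);

  -- output multiplexer controlled by the top address bit
  data_out <= do_hi when addr(4) = '1' else do_lo;
end main;
```

The extra cost over the raw components is the write-enable decode and the output multiplexer, both of which grow as more components are combined.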

A 16 x 4 Memory from 16 x 1 Components

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity ram16x4s is
    port (
      clk, we  : in  std_logic;
      data_in  : in  std_logic_vector(3 downto 0);
      addr     : in  unsigned(3 downto 0);
      data_out : out std_logic_vector(3 downto 0)
    );
  end ram16x4s;

A 16 x 4 Memory from 16 x 1 Components (Cont'd)

  architecture main of ram16x4s is
    component ram16x1s
      port (
        d              : in  std_logic;   -- data in
        a3, a2, a1, a0 : in  std_logic;   -- address
        we             : in  std_logic;   -- write enable
        wclk           : in  std_logic;   -- write clock
        o              : out std_logic    -- data out
      );
    end component;
  begin
    mem_gen : for i in 0 to 3 generate
      ram : ram16x1s
        port map (
          we   => we,
          wclk => clk,
          a3   => addr(3),
          a2   => addr(2),
          a1   => addr(1),
          a0   => addr(0),
          ---------------------------------------------
          -- d and o are dependent on i
          d    => data_in(i),
          o    => data_out(i)
          ---------------------------------------------
        );
    end generate;
  end main;

2.5.2.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write. When doing a simultaneous read and write to the same address, the read will not see the data currently being written.

Question: Why do dual-ported memories usually not support writes on both ports?
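The behaviour described above can be sketched with an inferred array, following the port naming of the dual-ported block symbol in Section 2.5.2.2 (WE, A0, DI0, DO0, A1, DO1). The entity name ram16x8_dp and the sizes are assumptions, and a real design would normally instantiate a vendor component instead. Port 0 can read or write; port 1 can only read.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical dual-ported memory: one read/write port, one read-only port.
entity ram16x8_dp is
  port (
    clk : in  std_logic;
    we  : in  std_logic;                      -- write enable for port 0
    a0  : in  unsigned(3 downto 0);           -- port 0 address
    di0 : in  std_logic_vector(7 downto 0);   -- port 0 data in
    do0 : out std_logic_vector(7 downto 0);   -- port 0 data out
    a1  : in  unsigned(3 downto 0);           -- port 1 address (read only)
    do1 : out std_logic_vector(7 downto 0)    -- port 1 data out
  );
end ram16x8_dp;

architecture main of ram16x8_dp is
  type mem_ty is array (0 to 15) of std_logic_vector(7 downto 0);
  signal mem : mem_ty;
begin
  -- two independent combinational reads
  do0 <= mem(to_integer(a0));
  do1 <= mem(to_integer(a1));

  -- single clocked write through port 0
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(a0)) <= di0;
      end if;
    end if;
  end process;
end main;
```

In this sketch, if a1 equals a0 while we is asserted, do1 still shows the old contents during that cycle, because mem is only updated at the clock edge; the new value becomes visible in the following cycle.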

LEC-08 Preliminaries

LEC-08: Design Example: Stack


Lecture Notes Sections: 2.6 to 2.6.4.3

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter


Overview
This lecture builds on material from the previous three lectures, where dataflow diagrams, finite state machines, and memory array design were described. This lecture takes a stack (push, pop, swap, top) from an algorithmic description to an RTL implementation in VHDL. The major new idea is working with dataflow diagrams for circuits that perform multiple operations.

Concepts
Lecture Notes: Sections 2.6 to 2.6.4.3
  combining FSMs, datapath, and storage
  dataflow diagrams with multiple instructions

Background

Reading

2.6 Design Example: Stack

2.6.1 Stack Requirements

2.6.1.1 Stack Entity

VHDL entity for the stack:

  entity stack is
    port (
      reset, clk : in  std_logic;
      inp        : in  std_logic_vector(3 downto 0);
      outp       : out std_logic_vector(3 downto 0)
    );
  end stack;

The input signal inp is used for both instructions and data.

2.6.1.2 Stack Instructions

  push   put a new piece of data onto the top of the stack
  pop    remove the top piece of data from the stack
  swap   swap the top two pieces of data
  tos    output the current data on the top of the stack

2.6.1.3 Stack Instruction Encoding

VHDL package defining stack instructions:

  package stack_instr is
    constant pop  : std_logic_vector(3 downto 0) := "0001";
    constant push : std_logic_vector(3 downto 0) := "0010";
    constant tos  : std_logic_vector(3 downto 0) := "0100";
    constant swap : std_logic_vector(3 downto 0) := "1000";
  end stack_instr;

2.6.1.4 Miscellaneous Requirements

  The stack shall have 16 elements.
  The inputs shall be registered.
  When a push operation is done, in the clock cycle following the push instruction, inp shall have the data that is to be pushed onto the stack.
  Popping from an empty stack or pushing onto a full stack results in undefined behaviour.
  When doing a tos or pop operation, the output outp shall have the tos data in the clock cycle after the tos instruction is input. At all other times the output is unconstrained.
  In the clock cycle following reset being asserted (set to '1'), the stack shall be empty.

2.6.2 Stack Algorithm

A simple Perl program to implement an algorithmic description of the stack. NB: You don't need to know Perl in E&CE 427. Perl is just one example of the many different software programming languages that can be used to create algorithmic descriptions of circuits.

Stack Algorithm Preliminaries

  #! /usr/bin/perl -Wall
  local ($line, @stack, $stack, $tmp);
  $tos = 0;

Stack Algorithm Core

  while ($line = <STDIN>) {
    chop( $line );
    if ( $line eq "tos") {
      print( $stack[$tos] );
    } elsif ( $line eq "pop") {
      print( $stack[$tos] );
      $tos = $tos - 1;
    } elsif ( $line eq "push" ) {
      $tos = $tos + 1;
      $line = <STDIN>;      # the data to push is on the next input line
      chop( $line );
      $stack[$tos] = $line;
    } elsif ( $line eq "swap" ) {
      $tmp = $stack[$tos];
      $stack[$tos] = $stack[$tos-1];
      $stack[$tos-1] = $tmp;
    }
  }

Usage of Perl Stack

  input   output
  push
  3
  tos     3
  push
  4
  tos     4
  pop     4
  tos     3

2.6.3 Stack Dataflow Diagrams

2.6.3.1 Initial Diagrams

Do one diagram for each operation. Do the initial dataflow diagrams without any clock cycle information.

Pop

[Figure: Pop. stack(rd) reads stack at tos and produces data_out; tos is decremented (-1); outputs are stack, data_out, and tos]

Push

[Figure: Push. tos is incremented (+1); stack(wr) writes data_in into stack at the new tos; outputs are stack and tos]

Tos

[Figure: Tos. stack(rd) reads stack at tos and produces data_out; stack and tos are unchanged]

Swap

[Figure: Swap. Two stack(rd) operations read stack at tos and tos-1, followed by two stack(wr) operations that write the values back in exchanged positions; outputs are stack and tos]

Note: scheduling decision and anti-dependency arrows

2.6.3.2 Partition into Clock Cycles

Pop, Push

[Figure: the Pop and Push dataflow diagrams partitioned into clock cycles]

  Pop:   2 registers (stack, tos), 1 ALU
  Push:  3 registers (stack, tos, data_in), 1 ALU

Tos

[Figure: the Tos dataflow diagram partitioned into clock cycles]

  Tos:   registers (stack, tos)

Swap

[Figure: the Swap dataflow diagram partitioned into clock cycles]

  Swap version 1:  5 registers (stack, tos, stack[tos], stack[tos-1], tos-1), 1 ALU

Swap (Optimized)

[Figure: the Swap dataflow diagram with tos-1 recomputed (a second -1 operation) instead of stored]

  Swap version 2 (Optimized):  4 registers (stack, tos, stack[tos], stack[tos-1]), 1 ALU; eliminated one register

2.6.3.3 High-Level Model

This high-level model is taken directly from the dataflow diagrams and block diagrams. There is one process that combines control, datapath, and storage, except for the output (outp), which is done with a concurrent assignment statement. Notice that there is a "next init when (reset = '1');" after every wait statement. This is needed to get the circuit back to its initial state in the next clock cycle when reset is asserted. First, we'll see the overall structure of the hlm architecture, and then the gory details.

Stack HLM Structure

  architecture hlm of stack is
    ...declarations...
  begin
    -----------------------------------------------
    process
    begin
      init : loop
        ...reset assignments...
        loop
          --------------------------------
          wait until rising_edge(clk);
          next init when (reset = '1');
          --------------------------------
          case inp is
            when pop    => ...pop code...
            when push   => ...push code...
            when swap   => ...swap code...
            when tos    => ...tos code...
            when others => next init;
          end case;
        end loop;
      end loop;
    end process;
    -----------------------------------------------
    outp <= stack(to_integer(tos));
    -----------------------------------------------
  end hlm;

Stack HLM Declarations

architecture hlm of stack is
  ----------------------------------------------
  subtype data_ty is std_logic_vector(3 downto 0);
  type stack_ty is array (15 downto 0) of data_ty;
  ----------------------------------------------
  signal tos        : unsigned(3 downto 0);
  signal tmp1, tmp2 : data_ty;
  signal stack      : stack_ty;
  signal empty      : std_logic;
  ----------------------------------------------
begin


Stack HLM: Pop


when pop => tos <= tos - 1;

Stack HLM: Push

when push =>
  if (empty = '0') then
    tos <= tos + 1;
  end if;
  -------------------------------
  wait until rising_edge(clk);
  next init when (reset = '1');
  -------------------------------
  stack(to_integer(tos)) <= inp;
  empty <= '0';

Stack HLM: Swap

when swap =>
  tmp1 <= stack(to_integer(tos-1));
  -------------------------------
  wait until rising_edge(clk);
  next init when (reset = '1');
  -------------------------------
  tmp2 <= stack(to_integer(tos));
  -------------------------------
  wait until rising_edge(clk);
  next init when (reset = '1');
  -------------------------------
  stack(to_integer(tos-1)) <= tmp2;
  -------------------------------
  wait until rising_edge(clk);
  next init when (reset = '1');
  -------------------------------
  stack(to_integer(tos)) <= tmp1;


Stack HLM: Tos


when tos => null;

Stack HLM: Others

          when others => next init;
        end case;
      end loop;
    end loop;
  end process;

Stack HLM: Output

  ----------------------------------------------
  outp <= stack(to_integer(tos));
  ----------------------------------------------
end hlm;


2.6.3.4 Individual Block Diagrams


Build one block diagram for each operation.

Pop
[Figure: dataflow diagram and block diagram for Pop. Block diagram labels: tos, -1, stack memory (ports we, a, di, do), outp.]

Push
[Figure: dataflow diagram and block diagram for Push. Block diagram labels: control, tos register (d, ce, q), +1, stack memory (ports we, a, di, do), inp.]

Tos
[Figure: dataflow diagram and block diagram for Tos. Block diagram labels: tos, stack memory (ports we, a, di, do), outp.]

Swap Dataflow
[Figure: dataflow diagram for Swap. Node labels: -1, stack(rd), stack(rd), -1, stack(wr), stack(wr). Signals: stack, tos.]

Swap Block Diagram
[Figure: block diagram for Swap. Labels: control, tos, -1, stack memory (ports we, a, di, do), registers tmp1 and tmp2 (d, ce).]

2.6.3.5 Complete Block Diagram


Merge all of the block diagrams together, reusing components wherever possible.

Block Diagram for All Operations
[Figure: block diagram for all operations. Control outputs: tos_inc_dec_sel, tos_ce, tmp2_ce, stack_addr_sel, stack_data_sel, stack_we, tmp1_ce. Components: tos register with reset (r, d, ce, q), -1/+1 adjuster, stack memory (ports we, a, di, do), registers tmp1 and tmp2 (d, ce). Ports: reset, inp, outp.]

2.6.4 Stack: Register Transfer Level

The high-level model is synthesizable, but might be large and slow.

- It uses a 2-d array for the stack, rather than specialized memory components from the library.
- We are relying on the synthesis tool to build a state machine to drive the datapath.

Sometimes, by writing code that is closer to gate-level hardware, we can improve performance and/or area.

Structuring RTL Code

There are four different ways to structure your RTL code:

- Single process
- Separate datapath
- Separate control, storage, and datapath
- Fully disassembled

Single Process Structure
[Figure: control, storage, and datapath grouped in a single block.]

Separate Datapath
[Figure: control and storage grouped in one block; datapath in a separate block.]

Separate Control, Storage and Datapath
[Figure: control, storage, and datapath each in a separate block.]

Fully Disassembled
[Figure: next-state functions, control storage, storage, and datapath each in a separate block.]

Stack RTL

To write the RTL code for the stack, consider the following options:

- Replacing the stack as an array with a component instantiation of a memory array from the FPGA libraries
- Defining a state machine and signals to control the datapath (e.g. define a state type and a signal of type state, and do assignments to current and next-state signals)

Question to ponder: does an explicit state machine result in better hardware?

2.6.4.1 Stack: Separate Control, Datapath and Storage

This design is derived directly from the hardware block diagram. We separate the state machine and datapath using the control signals that drive the datapath (mux select lines, chip enables, etc). The state machine drives signals that control the datapath. The state machine is very similar to that in the high-level model. In every state we assign values to the signals that control the datapath. The datapath is done with concurrent statements. By using concurrent statements, rather than processes, for the datapath, we eliminate the need for the datapath assignments to have sensitivity lists, which simplifies the code. This style works best when there are a large number of states and a small number of datapath components.

Block Diagram
[Figure: block diagram for the sepfsm architecture. Control outputs: tos_inc_dec_sel, tos_ce, tmp2_ce, stack_addr_sel, stack_data_sel, stack_we, tmp1_ce. Components: tos register with reset (r, d, ce, q), -1/+1 adjuster producing tos_adj, stack memory (ports we, a, di, do) addressed by stack_addr with data input stack_data_in, registers tmp1 and tmp2 (d, ce). Ports: reset, inp, outp.]

Inventory
Registers:      tos, tmp1, tmp2
Memory:         stack
Combinational:  tos_adj, stack_addr, stack_data_in
FSM outputs:    tos_ce, tos_inc_dec_sel, stack_addr_sel, stack_data_sel, stack_we, tmp2_ce, tmp1_ce

SepFsm Overview (1)

architecture sepfsm of stack is
  ...declarations...
begin
  ...component instantiation for memory...
  ...clocked process for state machine...
  ...clocked process for tmp1...
  ...clocked process for tmp2...
  ...clocked process for tos...
  ...concurrent assignment for tos_adj...
  ...concurrent assignment for stack_addr...
  ...concurrent assignment for stack_data_in...
end sepfsm;

SepFsm Overview (2)

architecture sepfsm of stack is
  ...declarations...
begin
  ...component instantiation for memory...
  process
  begin
    init : loop
      ...initialization...
      loop
        -------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        -------------------------------
        case inp is
          when pop    => ...pop code...
          when push   => ...push code...
          when swap   => ...swap code...
          when tos    => ...tos code...
          when others => next init;
        end case;
      end loop;
    end loop;
  end process;
  ...clocked process for tmp1...
  ...clocked process for tmp2...
  ...clocked process for tos...
  ...concurrent assignment for tos_adj...
  ...concurrent assignment for stack_addr...
  ...concurrent assignment for stack_data_in...
end sepfsm;

SepFsm Declarations (1)

architecture sepfsm of stack is
  signal tos, tos_adj, stack_addr
    : unsigned(3 downto 0);
  signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
    : std_logic_vector(3 downto 0);

Question: Why are some signals unsigned and others std_logic_vector?

Answer: Signals that are used as numbers (e.g. addresses for the memory array) are unsigned. Non-numeric signals are std_logic_vector.

SepFsm Declarations (2)

  signal synch_reset, empty, tos_inc_dec_sel, stack_addr_sel,
         tos_ce, stack_we, tmp1_ce, tmp2_ce
    : std_logic;
  signal stack_data_sel
    : std_logic_vector(1 downto 0);

SepFsm Declarations (3)

  -----------------------------------------------------
  component ram16x4s
    port (data_in  : in  std_logic_vector(3 downto 0);
          addr     : in  unsigned(3 downto 0);
          we       : in  std_logic;
          clk      : in  std_logic;
          data_out : out std_logic_vector(3 downto 0)
    );
  end component;
  -----------------------------------------------------

SepFsm Ram Instantiation

begin
  stack : ram16x4s
    port map (
      ---------------------------------------------
      we       => stack_we,
      clk      => clk,
      ---------------------------------------------
      addr     => stack_addr,
      data_in  => stack_data_in,
      data_out => stack_data_out
      ---------------------------------------------
    );

SepFsm Initialization

  process
  begin
    init : loop
      -------------------------------
      empty           <= '1';
      tos_inc_dec_sel <= '-';
      stack_addr_sel  <= '-';
      tos_ce          <= '1';
      stack_we        <= '0';
      stack_data_sel  <= "--";
      tmp1_ce         <= '-';
      tmp2_ce         <= '-';

SepFsm Pop

      -------------------------------
      loop
        -------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        -------------------------------
        case inp is
          when pop =>
            tos_inc_dec_sel <= '0';
            stack_addr_sel  <= '1';
            tos_ce          <= '1';
            stack_we        <= '0';
            stack_data_sel  <= "--";
            tmp1_ce         <= '-';
            tmp2_ce         <= '-';

SepFsm Push

          when push =>
            if (empty = '1') then
              tos_inc_dec_sel <= '-';
              stack_addr_sel  <= '0';
              tos_ce          <= '0';
            else
              tos_inc_dec_sel <= '1';
              stack_addr_sel  <= '1';
              tos_ce          <= '1';
            end if;
            stack_data_sel <= "--";
            stack_we       <= '0';
            tmp1_ce        <= '-';
            tmp2_ce        <= '-';
            -------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            -------------------------------
            empty           <= '0';
            tos_inc_dec_sel <= '-';
            stack_addr_sel  <= '0';
            tos_ce          <= '0';
            stack_data_sel  <= "00";
            stack_we        <= '1';
            tmp1_ce         <= '-';
            tmp2_ce         <= '-';

SepFsm Swap

          when swap =>
            ...
        end case;
      end loop;
    end loop;
  end process;

SepFsm tmp1

  -----------------------------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (tmp1_ce = '1') then
        tmp1 <= stack_data_out;
      end if;
    end if;
  end process;

SepFsm tmp2

  -----------------------------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (tmp2_ce = '1') then
        tmp2 <= stack_data_out;
      end if;
    end if;
  end process;

SepFsm Tos

  -----------------------------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        tos <= to_unsigned(0, 4);
      elsif (tos_ce = '1') then
        tos <= tos_adj;
      end if;
    end if;
  end process;

SepFsm Tos Adjustment

  -----------------------------------------------------
  tos_adj <= tos + 1 when (tos_inc_dec_sel = '1')
        else tos - 1;
  ...

SepFsm Stack Address

  -----------------------------------------------------
  stack_addr <= tos when (stack_addr_sel = '0')
           else tos_adj;

SepFsm Stack Data

  -----------------------------------------------------
  stack_data_in <= inp_intern when (stack_data_sel = "00")
              else tmp1       when (stack_data_sel = "01")
              else tmp2;
  -----------------------------------------------------
end sepfsm;


2.6.4.2 Stack: Datapath Operations


The state machine in Section 2.6.4.1 controlled each datapath component individually. An alternative style is for the state machine to tell the datapath what state it is in, or what global collection of operations to perform, then each part of the datapath decodes this and takes the appropriate action. This style works best when there are a small number of states and a large number of datapath components.

Dp-Op Declarations

architecture dp_op of stack is
  ---------------------------------------------------
  -- define the states
  type dp_op_ty is
    (init_op, pop_op, push1_op, push2_op,
     swap_wr_tmp1_op, swap_wr_tmp2_op,
     swap_rd_tmp1_op, swap_rd_tmp2_op,
     nop_op
    );
  signal dp_op : dp_op_ty;
  signal tos, tos_adj, stack_addr
    : unsigned(3 downto 0);
  signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
    : std_logic_vector(3 downto 0);
  signal empty, stack_we : std_logic;
begin

Dp-Op State Machine

  --------------------------------------------------------
  process
  begin
    init : loop
      -------------------------------
      empty <= '1';
      dp_op <= init_op;
      loop
        -------------------------------
        wait until rising_edge(clk);
        next init when (reset = '1');
        -------------------------------
        case inp is
          when pop =>
            dp_op <= pop_op;
          when push =>
            dp_op <= push1_op;
            -------------------------------
            wait until rising_edge(clk);
            next init when (reset = '1');
            -------------------------------
            -- stack(to_integer(tos)) <= inp;
            dp_op <= push2_op;
            empty <= '0';
          when swap =>
            ...
          ...
        end case;
      end loop;
    end loop;
  end process;

Dp-Op Input Storage

  ----------------------------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      inp_intern <= inp;
    end if;
  end process;

Dp-Op Tos

  -----------------------------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (dp_op = init_op) then
        tos <= to_unsigned(0,4);
      elsif ( (dp_op = pop_op)
              OR (dp_op = push1_op and (empty = '0')) ) then
        tos <= tos_adj;
      end if;
    end if;
  end process;
  -----------------------------------------------------
  tos_adj <= tos + to_unsigned(1,3) when (dp_op = push1_op)
        else tos - to_unsigned(1,3);
  -----------------------------------------------------

Dp-Op Stack Address

  stack_addr <= tos_adj when (    (dp_op = pop_op)
                               OR ((dp_op = push1_op) AND (empty = '0'))
                               OR (dp_op = swap_wr_tmp1_op)
                               OR (dp_op = swap_rd_tmp2_op) )
           else tos;

Dp-Op Data Input Register

  stack_data_in <= inp_intern when (dp_op = push2_op)
              else tmp1       when (dp_op = swap_rd_tmp1_op)
              else tmp2;

Dp-Op Write Enable

  stack_we <= '1' when (    (dp_op = push2_op)
                         OR (dp_op = swap_rd_tmp1_op)
                         OR (dp_op = swap_rd_tmp2_op) )
         else '0';

Dp-Op Output

  ----------------------------------------------------
  outp <= stack_data_out;
  ----------------------------------------------------

Dp-Op RAM Instantiation

  stack : ram16x4s
    port map (
      ---------------------------------------------
      we       => stack_we,
      clk      => clk,
      ---------------------------------------------
      addr     => stack_addr,
      data_in  => stack_data_in,
      data_out => stack_data_out
      ---------------------------------------------
    );
end dp_op;


2.6.4.3 Stack: Explicit State Machine


Here we drop the loop ... wait ... style of implicit state machines and build an explicit state machine with current and next state signals. Notice that the stack is such a simple design that each datapath operation in the Dp-Op architecture is used in only one state. This is a sign that the Dp-Op style is not well-suited to the stack. This example also illustrates the use of a function to capture common code. The function is used here to determine which state to go to next when a new input instruction arrives.

Explicit Declarations

architecture state of stack is
  type state_ty is
    (init_st, pop_st, push1_st, push2_st,
     swap_wr_tmp1_st, swap_wr_tmp2_st,
     swap_rd_tmp1_st, swap_rd_tmp2_st,
     nop_st
    );
  signal state, state_n : state_ty;
  ...
  ...

Explicit Function

  -------------------------------------------------------
  function restart (inp : std_logic_vector(3 downto 0))
    return state_ty
  is
  begin
    case inp is
      when pop    => return(pop_st);
      when push   => return(push1_st);
      when swap   => return(swap_wr_tmp1_st);
      when others => return(nop_st);
    end case;
  end restart;
begin

Explicit State Storage

  -----------------------------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        state   <= init_st;
        empty_n <= '1';
      else
        state   <= state_n;
        empty_n <= empty;
      end if;
    end if;
  end process;

Explicit Next State

  -----------------------------------------------------
  process (state, inp)
  begin
    case state is
      when init_st | pop_st | push2_st | swap_wr_tmp2_st | nop_st =>
        state_n <= restart(inp);
      when push1_st =>
        state_n <= push2_st;
      when swap_rd_tmp1_st =>
        state_n <= swap_rd_tmp2_st;
      when swap_rd_tmp2_st =>
        state_n <= swap_wr_tmp1_st;
      when swap_wr_tmp1_st =>
        state_n <= swap_wr_tmp2_st;
    end case;
  end process;
  ...

Explicit Tos

  process (clk)
  begin
    if rising_edge(clk) then
      if (state = init_st) then
        tos <= to_unsigned(0,4);
      elsif ( (state = pop_st)
              OR (state = push1_st and (empty = '0')) ) then
        tos <= tos_adj;
      end if;
    end if;
  end process;

Explicit Tos Adjustment

  -----------------------------------------------------
  tos_adj <= tos + to_unsigned(1,3) when (state = push1_st)
        else tos - to_unsigned(1,3);

Explicit Stack Address

  -----------------------------------------------------
  stack_addr <= tos_adj when (    (state = pop_st)
                               OR ((state = push1_st) AND (empty = '0'))
                               OR (state = swap_wr_tmp1_st)
                               OR (state = swap_rd_tmp2_st) )
           else tos;
  ...
end state;

LEC-09 Preliminaries

LEC-09: Guidelines and Optimization Techniques


Lecture Notes Sections: 2.7 to 2.9.4

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

wk-01 to 05   VHDL Design and Optimization
  wk-01         Overview and VHDL
  wk-02         VHDL Semantics
  wk-03 to 05   Design and Optimization
    wk-03         Dataflow Diagrams; State Machines
    wk-04         Memory Design; Example Design
    wk-05         Guidelines and Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Concepts

Lecture Notes: Sections 2.7 to 2.9.4

- coding guidelines
- more vhdl features
- strength reduction
- mux-pushing
- common subexpression elimination
- replication
- arithmetic optimizations

2.7 RTL Coding Guidelines

2.7.1 Design Process

Recommendation: Spend the time up front to plan a good design on paper. Use dataflow diagrams and state machines to predict performance and area.

This section gives guidelines for building robust, portable, and synthesizable VHDL code. Portability is both for different simulation and synthesis tools and for different implementation technologies.

Remember, there is a world of difference between getting a design to work in simulation and getting it to work on a real FPGA. And there is also a huge difference between getting a design to work in an FPGA for a few minutes of testing and getting thousands of products to work for months at a time in thousands of different environments around the world. The coding guidelines here are designed to help you get your E&CE 427 project to work, as well as all of your subsequent industrial designs.

Finally, note that there are exceptions to every rule. You might find yourself in a situation where your particular circumstances (e.g. choice of tool, target technology, etc) would benefit from bending or breaking a guideline here. Within E&CE 427, of course, there won't be any such circumstances.

2.7.2 Signal Declarations

Signals vs Variables

Use signals, do not use variables.
reason: The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware.

Std Logic

Use std_logic signals, do not use bit or Boolean.
reason: std_logic is the most commonly used signal type across synthesis tools, simulation tools, and cell libraries.

Port Modes

Use in or out, do not use inout.
reason: inout signals are tri-state.
note: If you have an output signal that you also want to read from, you might be tempted to declare the direction of the signal to be inout. A better solution is to create a new, internal, signal that you both read from and write to. Then, your output signal can just read from the internal signal.
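The note above can be sketched as follows. This is a minimal illustration, not code from the course: the entity name counter4 and the signal count_intern are hypothetical. The internal signal is both read and written, and the out port is driven from it with a concurrent assignment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 4-bit counter whose output value is also read internally.
entity counter4 is
  port (
    clk   : in  std_logic;
    reset : in  std_logic;
    count : out std_logic_vector(3 downto 0)  -- mode out, not inout
  );
end counter4;

architecture main of counter4 is
  -- Internal signal that is both read from and written to.
  signal count_intern : unsigned(3 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count_intern <= to_unsigned(0, 4);
      else
        count_intern <= count_intern + 1;  -- reads the internal signal
      end if;
    end if;
  end process;

  -- The output port only reads from the internal signal.
  count <= std_logic_vector(count_intern);
end main;
```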

Primary Inputs and Outputs of Chip

Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector. Do not use signed or unsigned for primary inputs or outputs.
reason: Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned vectors in entities into std_logic_vectors. If you want the same testbench to work for both functional simulation and timing simulation, you must not use signed or unsigned signals in the top-level entity of your chip.
note: Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and inside architectures. It is only the top-level entity that should not use signed or unsigned signals.

2.7.3 Processes

For a combinational process, the sensitivity list should contain all of the signals that are read in the process.
reason: Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A tool that adheres to the standard will introduce latches if not all signals that are read from are included in the sensitivity list.
exception: In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.

Combinational Processes

For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.
reason: If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.
note: For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin.
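A small sketch of a combinational process that follows both process guidelines: the sensitivity list names every signal that is read, and z is assigned in every branch of the case statement. The entity comb_example and its ports are made up for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical combinational block used only to illustrate the guideline.
entity comb_example is
  port (
    sel  : in  std_logic_vector(1 downto 0);
    a, b : in  std_logic;
    z    : out std_logic
  );
end comb_example;

architecture main of comb_example is
begin
  -- Every signal read in the process (sel, a, b) is in the sensitivity list.
  process (sel, a, b)
  begin
    case sel is
      when "00"   => z <= a;
      when "01"   => z <= b;
      when "10"   => z <= a and b;
      -- Without this branch, z would keep its old value for the
      -- remaining values of sel, which implies a latch.
      when others => z <= '0';
    end case;
  end process;
end main;
```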

Single Assignment Rule

Each signal should be assigned to in only one process.
reason: Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, short circuits, and other bad things.
exception: Multiple drivers are acceptable if your implementation technology has wired-ANDs or wired-ORs. FPGAs don't have wired-ANDs or wired-ORs.

Separate Unrelated Signals

Separate unrelated signals into different processes.
reason: Grouping assignments to unrelated signals into a single process can complicate the control circuitry for that process. Each branch in a case statement or if-then-else adds multiplexor or chip-enable circuitry.

2.7.4 Flip-Flops and Latches

- Use flops, not latches (see section 1.7.2).
- Use D-flops, not T, JK, etc (see section 1.7.2).

Know Your Hardware

For every signal in your design, know whether it should be a flip-flop or combinational. Before simulating your design, examine the log file LOG/dc_shell.log to see if the flip-flops in your circuit match your expectations, and to check that you don't have any latches in your design.

2.7.4.1 Multiplexors and Tri-State Signals

Use multiplexors, not tri-state buffers (see section 1.7.2).

2.7.5 State Machines

In a state machine, illegal and unreachable states should transition to the reset state.
reason: Creates more robust implementations. In the field, your circuit will be subjected to illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At some point in time, something weird will happen that will cause it to jump into an illegal state. Having a system crash and reboot is much better than having it generate incorrect outputs that aren't detected.
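One way to express this guideline is sketched below, assuming a hand-chosen binary encoding so that an unused code ("11") actually exists. The entity safe_fsm, its ports, and its states are hypothetical. Note that a synthesis tool may optimize away logic for states it considers unreachable, so check the tool's documentation and reports.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state machine with one unused (illegal) state code.
entity safe_fsm is
  port (
    clk, reset, go : in  std_logic;
    busy           : out std_logic
  );
end safe_fsm;

architecture main of safe_fsm is
  subtype state_ty is std_logic_vector(1 downto 0);
  constant reset_st : state_ty := "00";
  constant work1_st : state_ty := "01";
  constant work2_st : state_ty := "10";
  -- "11" is not used by any legal state.
  signal state, state_n : state_ty;
begin
  -- State register with synchronous reset.
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        state <= reset_st;
      else
        state <= state_n;
      end if;
    end if;
  end process;

  -- Next-state logic: state_n is assigned in every branch.
  process (state, go)
  begin
    case state is
      when reset_st =>
        if (go = '1') then
          state_n <= work1_st;
        else
          state_n <= reset_st;
        end if;
      when work1_st => state_n <= work2_st;
      when work2_st => state_n <= reset_st;
      -- Illegal or unreachable state codes go back to the reset state.
      when others   => state_n <= reset_st;
    end case;
  end process;

  busy <= '0' when (state = reset_st) else '1';
end main;
```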

State Encoding

If your state machine has fewer than 16 states, use a one-hot encoding.
reason: For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2(n) flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to determine if the circuit is in a particular state. For small values of n, a one-hot signal results in a smaller and faster circuit. For large values of n, the number of signals required for a one-hot design is too great a penalty to be offset by the simplicity of the decoding circuitry.
note: Using an enumerated type for states allows the synthesis tool to choose state encodings that it thinks will work well to balance area and clock speed. Quartus uses a modified one-hot encoding, where the bit that denotes the reset state is inverted. That is, when the reset bit is 0, the system is in the reset state and when the reset bit is 1, the system is not in the reset state. The other bits have the normal polarity. The result is that when the system is in the reset state, all bits are 0 and when the system is in a non-reset state, two bits are 1.
note: Using your own encoding allows you to leverage knowledge about your design that the synthesis tool might not be able to deduce.

2.7.5.1 Reset

Include a reset signal in all clocked circuits.
reason: For most implementation technologies, when you power up the circuit, you do not know what state it will start in. Also, if something goes wrong while the circuit is running, you need a way to get it into a guaranteed state.

Reset with Implicit State Machines

For implicit state machines, check for reset after every wait statement.
reason: Omitting the reset check after a wait statement means that your circuit might not notice a reset signal, or different signals could reset in different clock cycles, causing your circuit to get out of synch.

Reset Only Important Flops

Connect reset to the important control signals in the design, such as the state signal. Do not reset every flip-flop.
reason: Using reset adds area and delay to a circuit. The fewer signals that need reset, the faster and smaller your design will be.
note: Datapath signals rarely need to be reset.

Synchronous Reset

Use synchronous, not asynchronous, reset.
reason: Creates more robust implementations. Signal propagation delays mean that asynchronous resets cause different parts of the circuit to be reset at different times. This can lead to glitches, which then might cause the circuit to move to an illegal state.

2.7.6 Inputs and Outputs

Put flip-flops on primary inputs and outputs of a chip.
reason: Creates more robust implementations. Signal delays between chips are unpredictable. Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting flip-flops on inputs and outputs of a chip provides clean boundaries between circuits.
note: This only applies to primary inputs and outputs of a chip (the signals in the top-level entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or outputs. Within a chip, you do not need to put flip-flops on both inputs and outputs.
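As a rough sketch of this guideline, the top-level entity below registers its inputs and its output in one clocked process. The entity name io_regs, the i_/o_ port prefixes, and the adder in the middle are assumptions made for the example. The ports use std_logic_vector, in keeping with the guideline on primary inputs and outputs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical top-level entity with flip-flops on inputs and outputs.
entity io_regs is
  port (
    clk      : in  std_logic;
    i_a, i_b : in  std_logic_vector(7 downto 0);
    o_z      : out std_logic_vector(7 downto 0)
  );
end io_regs;

architecture main of io_regs is
  signal a, b : unsigned(7 downto 0);  -- registered copies of the inputs
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Input flip-flops: sample the chip inputs.
      a <= unsigned(i_a);
      b <= unsigned(i_b);
      -- Output flip-flop: the chip output comes straight from a register.
      o_z <= std_logic_vector(a + b);
    end if;
  end process;
end main;
```

With this structure the result appears on o_z two clock edges after the inputs are presented.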

2.8 Additional VHDL Features

2.8.1 Vectors

VHDL supports reading from and assigning to slices (aka discrete subranges) of vectors.

- The ranges on both sides of the assignment must be the same size.
- The direction (downto or to) of each slice must match the direction of the signal declaration.
- The direction of the target and expression may be different.

Declarations

  ---------------------------------------------------
  a, b       : in  std_logic_vector(15 downto 0);
  c, d, e    : out std_logic_vector(15 downto 0);
  ---------------------------------------------------
  ax, bx     : in  std_logic_vector(0 to 15);
  cx, dx, ex : out std_logic_vector(0 to 15);
  ---------------------------------------------------
  m, n       : in  unsigned(15 downto 0);
  p, q, r    : out unsigned(15 downto 0);
  ---------------------------------------------------
  w, x       : in  signed(15 downto 0);
  y, z       : out signed(15 downto 0)
  ---------------------------------------------------

Legal code

  c(3 downto 0) <= a(15 downto 12);
  cx(0 to 3)    <= a(15 downto 12);
  (e(3), e(4))  <= bx(12 to 13);
  (e(5), e(6))  <= b(13 downto 12);

Illegal code

  d(0 to 3)     <= a(15 to 12);            -- slice dirs must be same as decl
  e(3) & e(2)   <= b(12 to 13);            -- syntax error on &
  p(3 downto 0) <= (m + n)(3 downto 0);    -- syntax error on )(
  z(3 downto 0) <= m(15 downto 12);        -- types on lhs and rhs must match
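Two of the illegal assignments have simple legal rewrites, sketched here under the same port types as the declarations slide. The entity slice_fix and the signal sum are hypothetical. The sum is captured in an intermediate signal before it is sliced, and the unsigned slice is converted with the type conversion signed(...) before being assigned to a signed target. The unused upper bits are tied to '0' only so that every output bit has a driver.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity showing legal versions of two illegal assignments.
entity slice_fix is
  port (
    m, n : in  unsigned(15 downto 0);
    p    : out unsigned(15 downto 0);
    z    : out signed(15 downto 0)
  );
end slice_fix;

architecture main of slice_fix is
  signal sum : unsigned(15 downto 0);
begin
  -- Name the result of the addition, then slice the named signal.
  sum            <= m + n;
  p(3 downto 0)  <= sum(3 downto 0);
  p(15 downto 4) <= (others => '0');

  -- Convert the unsigned slice so the types on both sides match.
  z(3 downto 0)  <= signed(m(15 downto 12));
  z(15 downto 4) <= (others => '0');
end main;
```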

2.8.2 Still More VHDL Features

Some constructs that are useful and will be described in later chapters and sections:

- for-generate : replicates hardware
- if-generate : conditionally generates hardware
- report : print a message on stderr while simulating
- assert : assertions about behaviour of signals, very useful with report statements
- generics : parameters to an entity that are defined at elaboration time
- attributes : predefined functions for different datatypes. For example: high and low indices of a vector.

2.9 General Optimization Techniques

2.9.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.

2.9.1.1 Arithmetic Strength Reduction

Multiply by a constant power of two   ->  wired shift logical left
Multiply by a power of two            ->  shift logical left
Divide by a constant power of two     ->  wired shift logical right
Divide by a power of two              ->  shift logical right
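A minimal sketch of the "wired shift" rows for unsigned values, using slices and concatenation so that no multiplier or divider is described. The entity shift_wires and its port names are made up for this example, and the multiply result is truncated to the 8-bit output width.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: multiply and divide by 4 using only wiring.
entity shift_wires is
  port (
    a      : in  unsigned(7 downto 0);
    times4 : out unsigned(7 downto 0);  -- (a * 4) mod 256
    div4   : out unsigned(7 downto 0)   -- a / 4, rounded down
  );
end shift_wires;

architecture main of shift_wires is
begin
  -- Shift left by two: drop the top two bits, append two zeros.
  times4 <= a(5 downto 0) & "00";

  -- Shift right by two: prepend two zeros, drop the bottom two bits.
  div4   <= "00" & a(7 downto 2);
end main;
```

For example, a = "00000101" (5) gives times4 = "00010100" (20) and div4 = "00000001" (1).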

2.9.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:

- is_odd, is_even : least significant bit
- is_neg, is_pos : most significant bit

NOTE: use is_odd(a) rather than a(0)
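The notes do not show how is_odd and is_neg are defined, so the functions below are one possible sketch rather than the course's definitions. They rely on the numeric_std convention that the leftmost element of a vector is the most significant bit. The entity bit_tests and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity that uses named tests instead of raw bit indices.
entity bit_tests is
  port (
    a   : in  unsigned(7 downto 0);
    s   : in  signed(7 downto 0);
    odd : out std_logic;
    neg : out std_logic
  );
end bit_tests;

architecture main of bit_tests is
  -- Assumed definition: the rightmost element is the least significant bit.
  function is_odd (v : unsigned) return boolean is
  begin
    return v(v'right) = '1';
  end is_odd;

  -- Assumed definition: the leftmost element of a signed value is the sign bit.
  function is_neg (v : signed) return boolean is
  begin
    return v(v'left) = '1';
  end is_neg;
begin
  -- Each test is a single wire, but the name documents the intent.
  odd <= '1' when is_odd(a) else '0';
  neg <= '1' when is_neg(s) else '0';
end main;
```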

2.9.2 Replication and Sharing

2.9.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

Before
  z <= a + b when (w = '1')
  else a + c;

After
  tmp <= b when (w = '1')
    else c;
  z <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

2.9.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur multiple places in the code.

Before
  y <= a + b + c when (w = '1') else d;
  z <= a + c + d when (w = '1') else e;

After
  tmp <= a + c;
  y <= b + tmp when (w = '1') else d;
  z <= d + tmp when (w = '1') else e;

NOTE: Clocked subexpressions
Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
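As an illustration of this note, the sketch below keeps tmp as a concurrent (combinational) assignment outside the clocked process, so y and z are still registered exactly once. The entity name and the 8-bit widths are assumptions made for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- hypothetical entity; widths chosen only for illustration
entity cse_clocked is
  port (
    clk           : in  std_logic;
    w             : in  std_logic;
    a, b, c, d, e : in  unsigned(7 downto 0);
    y, z          : out unsigned(7 downto 0)
  );
end cse_clocked;

architecture main of cse_clocked is
  signal tmp : unsigned(7 downto 0);
begin
  -- shared subexpression stays combinational: no extra flip-flop
  tmp <= a + c;

  process (clk)
  begin
    if rising_edge(clk) then
      if w = '1' then
        y <= b + tmp;
        z <= d + tmp;
      else
        y <= d;
        z <= e;
      end if;
    end if;
  end process;
end main;
```

Had tmp been assigned inside the clocked process, y and z would see the value of a + c from the previous clock cycle.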

2.9.2.3 Computation Replication

To improve performance: if the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
To reduce area: if the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

NOTE: Muxes are not free
Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.

2.9.3

Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism. Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
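Both guidelines can be sketched together as follows, assuming 16-bit unsigned inputs of which only the low 12 bits of the sum are needed. The entity and signal names are hypothetical. The trimming is safe because the low 12 bits of a sum depend only on the low 12 bits of the operands.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- hypothetical entity: low 12 bits of a four-input sum
entity sum4_low12 is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    s          : out unsigned(11 downto 0)
  );
end sum4_low12;

architecture main of sum4_low12 is
  signal a12, b12, c12, d12 : unsigned(11 downto 0);
begin
  -- trim the inputs first, so every adder is only 12 bits wide
  a12 <= a(11 downto 0);
  b12 <= b(11 downto 0);
  c12 <= c(11 downto 0);
  d12 <= d(11 downto 0);

  -- parentheses suggest two adders in parallel feeding a third
  s <= (a12 + b12) + (c12 + d12);
end main;
```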

2.9.4 Pipelining

Pipelines will not be covered in E&CE 427. This subsection is provided for those who already understand the basics of pipelining. You can turn a dataflow diagram into a pipeline by making each clock cycle of the dataflow diagram a separate pipe stage. However, this can be complicated and error-prone. You need to worry about data hazards if you have state-holding registers in your algorithm. You need to worry about structural hazards if different instructions have different latencies.

LEC-10: FPGA-Specific Guidelines and Optimization

Lecture Notes Sections: 2.10 to 2.11.2

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 02   VHDL
                wk-01 Overview and VHDL
                wk-02 VHDL Semantics
wk-03 to 05   Design and Optimization
                wk-03 Dataflow Diagrams; State Machines
                wk-04 Memory Design; Example Design
                wk-05 Guidelines and Optimizations
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Overview
In this lecture we will go over some design guidelines and optimization techniques that are specific to FPGAs.

Concepts
Lecture Notes: Sections 2.10 to 2.11.2
  Hardware for generic FPGAs
  Coding guidelines for FPGAs
  Altera hardware
  Coding guidelines for Altera FPGAs

2.10 FPGA-Specific Guidelines

2.10.1 Generic FPGAs

2.10.1.1 Overview of Generic FPGA Hardware

Generic FPGA Cell

Cell = Logic Element (LE) in Altera
     = Configurable Logic Block (CLB) in Xilinx

[Figure: generic FPGA cell with a combinational block (comb) and a flip-flop (D, CE); signals: data_in, ctrl_in, carry_in, carry_out, data_out]

Configurable Comb/Flop Connection

[Figure: generic cell with a combinational block (comb) and a flip-flop (D, CE, R); signals: comb_data_in, comb_data_out, flop_data_in, flop_data_out, ctrl_in, carry_in, carry_out]

Separate Comb and Flop

[Figure: same cell diagram]

Connect Comb and Flop

[Figure: same cell diagram]

Flopped and Unflopped Outputs

[Figure: same cell diagram]

Generic FPGA Cell

[Figure: same cell diagram]

Flip Flops Are Free

Flip-flops are almost free in FPGAs.
reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops. Usually each 4:1 combinational circuit has a flip-flop.

Use It or Lose

Aim for using 80 to 90% of the cells on a chip.
reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
reason: If you use less than 80% of the cells, then probably:
  there are optimizations that will increase performance and still allow the design to fit on the chip; or
  you spent too much human effort on optimizing for low area; or
  you could use a smaller (cheaper!) chip.
exception: In E&CE 427 (unlike in real life), the mark is based on the actual number of cells used.

Area Estimation

You can estimate the area of a design by counting the number of flip-flops in the fanin of each flip-flop.
reason: Each set of four source signals requires one cell.

Source flops   Cells
 1             1
 2             1
 3             1
 4             1
 5             2
 6             2
 7             2
 8             3
 9             3
10             3
11             4

note: This technique is generally an overestimate, because a single cell can drive several other cells (common subexpression elimination).
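The table can be captured as a small helper function. The formula below is not given in the notes: it is fitted to the table, under the assumption that the first cell takes four source signals and each additional cell takes three more (one of its inputs being the output of the previous cell). The package, function, and testbench names are hypothetical.

```vhdl
-- hypothetical package; the formula is an assumption fitted to the table
package area_est_pkg is
  function cells_for_fanin (num_src : positive) return positive;
end area_est_pkg;

package body area_est_pkg is
  function cells_for_fanin (num_src : positive) return positive is
  begin
    if num_src <= 4 then
      return 1;                    -- one 4-input cell is enough
    else
      return (num_src + 1) / 3;    -- ceiling of (num_src - 1) / 3
    end if;
  end cells_for_fanin;
end area_est_pkg;

use work.area_est_pkg.all;

entity area_est_tb is
end area_est_tb;

architecture main of area_est_tb is
begin
  process
  begin
    -- spot checks against rows of the table
    assert cells_for_fanin(4)  = 1 report "4 sources"  severity error;
    assert cells_for_fanin(5)  = 2 report "5 sources"  severity error;
    assert cells_for_fanin(7)  = 2 report "7 sources"  severity error;
    assert cells_for_fanin(8)  = 3 report "8 sources"  severity error;
    assert cells_for_fanin(11) = 4 report "11 sources" severity error;
    wait;
  end process;
end main;
```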

Local Connections for Generic Cell

NB: In these slides, the space between tightly grouped wires sometimes disappears, making a group of wires appear to be a single large wire.

General purpose interconnect (configurable, slow)
Carry chains and cascade chains (vertically adjacent cells, fast)

Generic Blocks of Cells

Cells not used for computation can be used as wires to shorten the length of the path between cells.

2.10.1.2 Generic Clocks

Characteristics of clock signals:
  High fanout (drive many gates)
  Long wires (destination gates scattered all over chip)

Characteristics of FPGAs:
  Very few gates that are large (strong) enough to support a high fanout.
  Very few wires that traverse the entire chip and can be connected to every flip-flop.

Clocks
Guideline for clock signals on FPGAs:

Use just one clock signal.
reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.

Use only one edge of the clock signal.
reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.
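One common way to stay within both guidelines, assuming the cell's flip-flop has a clock-enable input (the CE pin shown in the generic cell), is to gate updates with an enable signal rather than introduce a second clock or the falling edge. The entity below is a hypothetical sketch, not code from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- hypothetical register: one clock, rising edge only, clock enable
entity ce_reg is
  port (
    clk : in  std_logic;
    ce  : in  std_logic;
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end ce_reg;

architecture main of ce_reg is
begin
  process (clk)
  begin
    if rising_edge(clk) then   -- the only clock edge in the design
      if ce = '1' then         -- slower update rates use the enable
        q <= d;
      end if;
    end if;
  end process;
end main;
```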

2.10.1.3

Special Circuitry in FPGAs

Memory
For five or more years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated using the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.

Microprocessors
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

        Altera                    Xilinx: Virtex-II Pro
Hard    Arm 922T with 200 MIPS    Power PC 405 with 420 D-MIPS
Soft    Nios with ?? MIPS         Microblaze with 100 D-MIPS

The Xilinx Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement a complete 32-bit microprocessor.

Arithmetic Circuitry
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

Using these resources can significantly improve both the area and performance of a design.

Altera: Mercury          16 × 16 at 130MHz
Xilinx: Virtex-II Pro    18 × 18 at ???MHz

Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)

2.10.2 Altera APEX20K

APEX20K Block Hierarchy

Chip
  52 Mega Logic Array Blocks (MegaLABs)
    1 Embedded System Block (ESB)
      Memory and wide combinational functions
    16 Logic Array Blocks (LABs)
      10 Logic Elements (LEs)
        4-input lookup table
        Carry and cascade
        Flip-flop

Each level of hierarchy has its own interconnect (wires).

LE Computation and Storage

4-input lookup table (LUT)
Carry-chain computation circuitry
Cascade-chain computation circuitry
Flip-flop with load, clear, clock-enable

LE Interconnect

4 data inputs
2 data outputs
Carry in, carry out
Cascade in, cascade out
Clock, clock-enable
Async clear, synch set (load), synch clear (reset)
Global reset

2.11

Example Circuits

2.11.1 Ripple-Carry Adder

[Figure: Ripple-Carry Adder. Delay (ns) versus Data Width]

2.11.2 Barrel Shifter

This example illustrates:
  packages
  for-generate
  if-generate
  faking 2-dimensional arrays

Barrel Shifter Package

library ieee;
use ieee.std_logic_1164.all;

package shift_const_pkg is
  constant width : integer := 28;
  constant depth : integer := 3;
end shift_const_pkg;

Barrel Shifter Entity

library ieee;
use ieee.std_logic_1164.all;
use work.shift_const_pkg.all;

entity barrel_shift is
  port (
    clk : in  std_logic;
    di  : in  std_logic_vector(width - 1 downto 0);
    do  : out std_logic_vector(width - 1 downto 0);
    sel : in  std_logic_vector(depth - 1 downto 0)
  );
end barrel_shift;

Barrel Shifter Architecture

architecture main of barrel_shift is
  subtype word is std_logic_vector(width - 1 downto 0);
  type x_ty is array(depth downto 0) of word;
  signal x : x_ty;
begin
  -- register the input word and the output word
  process (clk)
  begin
    if rising_edge(clk) then
      for w in width - 1 downto 0 loop
        x(0)(w) <= di(w);
        do(w)   <= x(depth)(w);
      end loop;
    end if;
  end process;
  -- stage d shifts right by 2**d positions when sel(d) = '1'
  for_d : for d in depth - 1 downto 0 generate
    for_w : for w in width - 1 downto 0 generate
      -- source bit would be past the msb: shift in a zero
      if_msb : if w + 2**d >= width generate
        x(d+1)(w) <= '0' when sel(d) = '1' else x(d)(w);
      end generate;
      if_norm : if not (w + 2**d >= width) generate
        x(d+1)(w) <= x(d)(w + 2**d) when sel(d) = '1' else x(d)(w);
      end generate;
    end generate;
  end generate;
end main;

Chapter 3

Functional Validation

LEC-11: Functional Validation of Datapath Circuits

Lecture Notes Sections: 3.1 to 3.4.6

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 02   VHDL
wk-03 to 05   Design and Optimization
wk-06         Functional Validation
                Lec-11 Datapath Validation and Testbenches
                Lec-12 Control Validation and Assertions
wk-07         Performance Analysis, Prediction, and Optimization
wk-08         Timing Analysis
wk-08 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review

Overview
The purpose of this lecture is to illustrate techniques to quickly and reliably detect bugs in datapath circuits. We will discuss validation of datapath circuits and introduce the notions of testbench, specification, and implementation. We'll illustrate a progression of techniques that can be used to go from very simple tests to more complete and complicated tests.

Concepts

Specification
Implementation
Design Under Test (DUT)
Unit Under Test (UUT)
Test Bench
Stimulus
Manual tests
Array of test vectors
Generated tests
Functional specification
Relational specification

Background

Basic hardware and software debugging techniques

Reading (Smith)

Smith's ASIC:
  10.2.7 : sample testbench
  13.1 : levels of temporal abstraction for simulation
  13.2 : simulation example
  13.5 : different simulation models for hardware

Reading (Rushton and Ashenden)

Rushton's VHDL for Logic Synthesis:
  Ch 13 : Testbenches
Ashenden's Designer's Guide to VHDL:
  Sect 1.4 : Testbenches
  Sect 6.2.1 : Testing the Behavioural Model of a Pipelined Multiplier Accumulator
  Sect 6.3.3 : Testing the Register-Transfer-Level Model of a Pipelined Multiplier Accumulator
  Sect 15.3 : Testing the Behavioural Model of a DLX Computer System
  Sect 15.5 : Testing the Register-Transfer-Level Model of a DLX Computer System
Janick Bergeron's verification guild website: http://www.janick.bergeron.com/guild/default.htm

3.1

Overview

3.1.1 Validation / Verification / Testing

functional validation : checking that a design (e.g. RTL code) has the correct behaviour
  usually treats combinational circuitry as having zero-delay
  usually done by simulating circuit with test vectors
  big challenges are simulation speed and test generation

Terminology
formal verification : checking that a design has the correct behaviour for every possible input and internal state
  uses mathematics to reason about circuit, rather than checking individual vectors of 1s and 0s
  capacity problems: only usable on detailed models of small circuits or abstract models of large circuits
  mostly a research topic, but some practical applications have been demonstrated
  tools include model checking and theorem proving
  formal verification is not a guarantee that the circuit will work correctly

performance validation : checking that implementation has (at least) desired performance
power validation : checking that implementation has (at most) desired power
equivalence verification (checking) : checking that the design generated by a synthesis tool has same behaviour as RTL code
timing verification : checking that all of the paths in a circuit meet the timing constraints

Terminology Dogma (Formal Verification)

To the formal verification community, verification implies that all possible cases have been checked. In comparison, validation means that some, but not all, cases were checked. Obviously not everyone follows this convention...

Terminology Dogma (Hardware vs Software)

Note: in software, "testing" refers to running programs with specific inputs and checking if the program does the right thing. In hardware, testing usually means manufacturing testing, which is checking the circuits that come off of the manufacturing line.

3.1.2 Why Your First Circuit Will Not Work

Notes from Kenn Heinrich (UW E&CE grad)

Everyone should get a lecture on why their first industrial design won't work in the field. Here are a few reasons:

Unreachable States
1. You forgot to make your unreachable states transition to the initial (reset) state. Clock glitches, power surges, etc will occasionally cause your system to jump to a state that isn't defined or produce an illegal data value. When this happens, your design should reset itself, rather than crash or generate illegal outputs.

Untestable Registers
2. You have internal registers that you can't access or test. If you can set a register you must have some way of reading the register from outside the chip.

Cannot Isolate Your Chip


3. Another chip controls your chip, and the other chip is buggy. All of your external control lines should be able to be disabled, so that you can isolate the source of problems.

Insufficient Decoupling Capacitors

4. Not enough decoupling capacitors on your board. The analog world is cruel and unusual. Voltage spikes, current surges, crosstalk, etc can all corrupt the integrity of digital signals. Trying to save a few cents on decoupling capacitors can cause headaches and significant financial costs in the future.

The Laboratory is Not Reality

5. You only tested your system in the lab, not in the real world. As a product, systems will need to run for months in the field. Simulation and simple lab testing won't catch all of the weirdness of the real world.

Unexplored Corner Cases

6. You didn't adequately test the corner cases and boundary conditions. Every corner case is as important as the main case. Even if some weird event happens only once every six months, if you do not handle it correctly, the bug can still make your system unusable and unsellable.

3.2 Test Cases

Test case / test vector : A combination of inputs and internal state values. Represents one possible test of the system.
Boundary conditions / corner cases : A test case that represents an unusual situation on input and/or internal state signals. Corner cases are likely to contain bugs.
Test scenario : A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit. For example, a scenario for an elevator controller might include a sequence of button pushes and movements between floors.
Test suite : A collection of test vectors that are run on a circuit.

3.2.1 Coverage

To be sure that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional validation.

Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?

Answer: The value of each combinational signal is determined by the flip-flops and inputs in its fanin. Once the values of the inputs and flip-flops are known, the value of each combinational signal can be calculated. Thus, the combinational signals do not add additional cases that we need to consider.

Definition (Coverage): The coverage that a suite of tests achieves on a circuit is the percentage of cases that are simulated by the tests. 100% coverage means that the circuit has been simulated for all combinations of values for input signals and internal signals.

NOTE: Coverage Terminology
There are many different types of coverage, which measure everything from percentage of cases that are exercised to number of output values that are exercised.

NOTE: Coverage Tools
There are many different commercial software programs that measure code and other types of coverage.

Company          Tool                          Coverage
Cadence          Affirma Coverage Analyzer     code, expressions, fsm
Cadence          DAI Coverscan                 code
Fintronic        FinCov                        bought by Avant!
interHDL         Coverit                       ?
Summit Design    HDLScore                      code, events, variables
Synopsys         CoverMeter                    code coverage (dead?)
TransEDA         Verification Navigator        code and fsm
Verisity         SureCov                       code, block, values, fsm
Veritools        Express VCT, VeriCover        code, branch
Aldec            Riviera                       code, block

3.2.2 Heating System Example

This example is a simple heating system that might appear in a home.

Three states: off, low, and high.
The user can set the desired temperature to any value between 15°C and 25°C.
There is a thermometer to measure the current temperature for values between 0°C and 40°C.
The state machine in figure 3.1 describes the transitions between states.

Transitions Between States

[Figure 3.1: Transitions between states. State diagram over OFF, LOW, and HIGH, with diff = des_temp - cur_temp and edge labels: 3 =< diff < 5, 5 =< diff, 7 =< diff, diff < -2, diff < -3]

Sample Scenario

[Figure 3.2: A sample scenario for the heating system. Waveform of des_tmp against the off, low, and high thresholds (levels 23, 22, 20, 15, 13), with the current state stepping through off, high, low, off, low, high]

State and Signal Ranges

Item        Range             Num Values
state       off, low, high    3
cur_temp    0..40             41
des_temp    15..25            11

Figure 3.3: State and Signal Ranges

3.2.2.1 Number of Cases to Consider

Number of input values = 41 × 11 = 451
Number of states       = 3
Number of cases        = (number of inputs) × (number of states) = 451 × 3 = 1353

A total of 1353 cases to test.

Figure 3.4: Number of cases in heating systems

But, how many bits to represent 41, 11, or 3 values?

Signals are vectors of Boolean values. They must have $2^n$ possible values.

Item        Range             Num Values   Bits   Representable Values
state       off, low, high    3            2      4
des_temp    15..25            11           4      16
cur_temp    0..40             41           6      64

Figure 3.5: Number of bits for signals in heating systems
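The Bits column is the smallest n with $2^n$ at least the number of values. A minimal sketch of that calculation, using a hypothetical package and function name, with the three rows of the table as checks:

```vhdl
-- hypothetical helper: smallest n such that 2**n >= num_values
package bits_pkg is
  function num_bits (num_values : positive) return natural;
end bits_pkg;

package body bits_pkg is
  function num_bits (num_values : positive) return natural is
    variable n   : natural  := 0;
    variable cap : positive := 1;   -- cap = 2**n
  begin
    while cap < num_values loop
      cap := cap * 2;
      n   := n + 1;
    end loop;
    return n;
  end num_bits;
end bits_pkg;

use work.bits_pkg.all;

entity bits_tb is
end bits_tb;

architecture main of bits_tb is
begin
  process
  begin
    assert num_bits(3)  = 2 report "state"    severity error;
    assert num_bits(11) = 4 report "des_temp" severity error;
    assert num_bits(41) = 6 report "cur_temp" severity error;
    wait;
  end process;
end main;
```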

Actual Number of Cases to Consider

Number of input values = 64 × 16 = 1024
Number of states       = 4
Number of cases        = 1024 × 4 = 4096

Three times more values to consider than originally thought.

Figure 3.6: Actual number of cases to consider

3.2.2.2 Representation Simplification

Two-thirds of representable values are illegal / unused. Unused values lead to wasted area in the circuit and increase validation effort.

Adjust Range to be Powers of Two

Item        Range             Num Values   Bits   Actual Num of Values
state       off, low, high    3            2      4
des_temp    15..25            11           4      16
            12..27            16           4      16
            17..24            8            3      8
cur_temp    0..40             41           6      64
            -20..43           64           6      64
            17-7..24+4        19           5      32
            17-5..24+3        16           4      16

Scenario with Adjusted Ranges

[Figure: the sample scenario redrawn with adjusted ranges. Waveform of des_tmp against the off, low, and high thresholds (levels 23, 22, 20, 15, 13), with the current state stepping through off, high, low, off, low, high]

Notice that with adjusted ranges, there is very little change in behaviour.

State Machine with Adjusted Ranges

[Figure: state diagram over OFF, LOW, and HIGH with adjusted edge labels: 2 =< diff < 4, 4 =< diff, 5 =< diff, diff < -2, diff < -3]

Reduced Number of Cases to Consider

                                   Old     New
Number of legal input values       451     128
Number of illegal input values     573     0
Total number of input values       1024    128
Number of legal state values       3       3
Number of illegal state values     1       1
Total number of state values       4       4
Number of legal cases              1353    384
Number of illegal cases            2743    128
Total number of cases              4096    512

Choosing data ranges to be powers of two reduces the number of illegal inputs and internal state values.

3.2.3 Floating Point Divider Example

Consider doing the functional simulation for a double precision (64-bit) floating-point divider.

Given Information
Data width                                                                          64 bits
Number of gates in circuit                                                          10 000
Number of assembly-language instructions to simulate one gate for one test case     100
Number of clock cycles required to execute one assembly language instruction
  on the computer that is running the simulation                                    0.5
Clock speed of computer that is running the simulation                              1 Gigahertz

Number of Cases
Question: How many cases must be considered?

Answer:

item    bits    num values
src1    64      $2^{64}$ = 1.8E19
src2    64      $2^{64}$ = 1.8E19

$$\text{NumTestsTot} = \text{NumInputCases} \times \text{NumStateCases} = (2^{64} \times 2^{64}) \times 2^{0} = 3.4\text{E}38\ \text{cases}$$

Simulation Run Time

Question: How long will it take to simulate all of the different possible cases?

Answer:

$$\text{TestTimeTot} = 10000\ \text{gates} \times 100\ \tfrac{\text{instrs}}{\text{gate}} \times 0.5\ \tfrac{\text{cycles}}{\text{instr}} \times 1\text{E-}9\ \tfrac{\text{secs}}{\text{cycle}} \times 3.4\text{E}38\ \text{cases} = 1.7\text{E}35\ \text{secs} \approx 5.4\text{E}27\ \text{years}$$

Learn the general technique, not the specific formula!

Coverage
Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?

Answer:

1. Calculate number of seconds to simulate one test case on one computer

$$\text{TestTime}_{1:1} = 10000\ \text{gates} \times 100\ \tfrac{\text{instrs}}{\text{gate}} \times 0.5\ \tfrac{\text{cycles}}{\text{instr}} \times 1\text{E-}9\ \tfrac{\text{secs}}{\text{cycle}} = 5\text{E-}4\ \text{secs}$$

2. Number of seconds to simulate one test using 10 computers

$$\text{TestTime}_{1:10} = \frac{\text{TestTime}_{1:1}}{10\ \text{comps}} = \frac{5\text{E-}4\ \text{secs}}{10} = 5\text{E-}5\ \text{secs}$$

3. Number of tests per year using ten computers

$$\text{SecsPerYear} = 60\ \tfrac{\text{secs}}{\text{min}} \times 60\ \tfrac{\text{mins}}{\text{hour}} \times 24\ \tfrac{\text{hours}}{\text{day}} \times 365.25\ \tfrac{\text{days}}{\text{year}} \approx 3.16\text{E}7\ \text{secs}$$

$$\text{NumTests}_{:10} = \frac{\text{SecsPerYear}}{\text{TestTime}_{1:10}} = \frac{3.16\text{E}7\ \text{secs}}{5\text{E-}5\ \text{secs}} \approx 6.3\text{E}11\ \text{cases}$$

4. Calculate coverage achieved by running tests on ten computers for one year

$$\text{Covg} = \frac{\text{NumTestsRun}}{\text{NumTestsTot}} = \frac{\text{NumTests}_{:10}}{\text{NumTestsTot}} = \frac{6.3\text{E}11}{3.4\text{E}38} \approx 1.9\text{E-}27 = 1.9\text{E-}25\ \%$$

3.2.4 Functional Validation Challenges

From "Validating the Intel(R) Pentium(R) Microprocessor" by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 427 web page.)

Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
By tapeout, over 200 billion simulation cycles had been run on a network of computers.
All of these simulations represent less than two minutes of running a real processor.

Research
Research challenges:
1. How to make simulations run faster?
2. How to choose test cases so that cases that are run are likely to detect bugs?

Research activities in functional validation:
1. Simulation acceleration
2. Coverage analysis
3. Test generation
4. Formal verification

Practice
Challenges in practice:
1. Writing specification
2. Identifying corner cases
3. Choosing test cases
4. Finding root cause of unexpected behaviour

3.3

Testbenches

A test bench (also known as a test rig, test harness, or test jig) is a collection of code used to simulate a circuit and check if it works correctly. Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of VHDL. Use the full power of VHDL to make your testbenches concise and powerful.

3.3.1 Overview of Test Benches

[Figure: testbench containing stimulus, specification, implementation, and check blocks]

Implementation : Circuit that you're checking for bugs; also known as design under test or unit under test
Stimulus : Generates test vectors
Specification : Describes desired behaviour of implementation
Check : Checks whether implementation obeys specification

Notes and observations

Testbenches usually do not have any inputs or outputs.
  Inputs are generated by stimulus.
  Outputs are analyzed by check and relevant information is printed using report statements.
Different circuits will use different stimuli, specifications, and checks.
The roles of the specification and check are somewhat flexible. Most circuits will have complex specifications and simple checks. However, some circuits will have simple specifications and complex checks.
If two circuits are supposed to have the same behaviour, then they can use the same stimuli, specification, and check.
If two circuits are supposed to have the same behaviour, then one can be used as the specification for the other.
Testbenches are restricted to stimulating only primary inputs and observing only primary outputs. To check the behaviour of internal signals, use assertions (Lec-12).

3.3.2 Reference Model Style Testbench

[Figure: reference model testbench containing stimulus, specification, and implementation blocks]

Specification has same inputs and outputs as implementation.
Specification is a clock-cycle accurate description of desired behaviour of implementation.
Check is an equality test between outputs of specification and implementation.
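A reference-model style testbench along these lines might look like the sketch below. The 4-bit adder add4, the testbench name, and the timing values are all assumptions chosen for illustration. The specification is computed with integer arithmetic, independently of the implementation, and the check is a plain equality test.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- hypothetical implementation under test: 4-bit adder, carry dropped
entity add4 is
  port (
    a, b : in  unsigned(3 downto 0);
    s    : out unsigned(3 downto 0)
  );
end add4;

architecture main of add4 is
begin
  s <= a + b;
end main;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add4_tb is
end add4_tb;

architecture main_tb of add4_tb is
  signal ta, tb           : unsigned(3 downto 0) := (others => '0');
  signal ts_impl, ts_spec : unsigned(3 downto 0);
  signal done             : boolean := false;
begin
  -- implementation
  impl : entity work.add4 port map (a => ta, b => tb, s => ts_impl);

  -- specification: reference model with the same inputs and output
  ts_spec <= to_unsigned((to_integer(ta) + to_integer(tb)) mod 16, 4);

  -- stimulus: all 256 input combinations, one every 10 ns
  stimulus : process
  begin
    for i in 0 to 15 loop
      for j in 0 to 15 loop
        ta <= to_unsigned(i, 4);
        tb <= to_unsigned(j, 4);
        wait for 10 ns;
      end loop;
    end loop;
    done <= true;
    wait;
  end process;

  -- check: equality test, sampled 5 ns after each new vector
  check : process
  begin
    wait for 5 ns;
    while not done loop
      assert ts_impl = ts_spec
        report "implementation differs from specification"
        severity error;
      wait for 10 ns;
    end loop;
    wait;
  end process;
end main_tb;
```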

Examples

Execution modules: output is sum, difference, product, quotient, etc. of inputs
DSP filters
Instruction decoders

NOTE: Functional specification vs Reference model
Functional specification and reference model are often used interchangeably.

3.3.3 Relational Style Testbench

[Figure: relational testbench containing stimulus, implementation, and check blocks]

Relational testbenches, or relational specifications, are used when we do not want to specify the specific output values that the implementation must produce.
Instead, we want to check that some relationship holds between the output and the input, or that some relationship holds amongst the output values (independent of the values of the input signals).
Specification is usually just wires to feed the input signals to the check.
Check is the brains and encodes the desired behaviour of the circuit.

Examples

Carry-save adders: the two outputs are the sum of the three inputs, but do not specify exact values of each individual output.
Arbiters: every request is eventually granted, but do not specify in which order requests are granted.
One-hot encoding: exactly one bit of vector is a 1, but do not specify which bit is a 1.

NOTE: Relational specification vs relational testbench
Relational specification and relational testbench are often used interchangeably.
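The one-hot example lends itself to a short sketch of a relational check: the function below states only the relationship (exactly one bit is '1') and never says which bit. The package, function, and testbench names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- hypothetical package holding a relational check for one-hot vectors
package one_hot_pkg is
  function is_one_hot (v : std_logic_vector) return boolean;
end one_hot_pkg;

package body one_hot_pkg is
  function is_one_hot (v : std_logic_vector) return boolean is
    variable ones : natural := 0;
  begin
    -- count the bits that are '1'
    for i in v'range loop
      if v(i) = '1' then
        ones := ones + 1;
      end if;
    end loop;
    return ones = 1;
  end is_one_hot;
end one_hot_pkg;

library ieee;
use ieee.std_logic_1164.all;
use work.one_hot_pkg.all;

entity one_hot_check_tb is
end one_hot_check_tb;

architecture main of one_hot_check_tb is
begin
  process
  begin
    assert is_one_hot("0100")     report "0100 is one-hot"     severity error;
    assert not is_one_hot("0110") report "0110 is not one-hot" severity error;
    assert not is_one_hot("0000") report "0000 is not one-hot" severity error;
    wait;
  end process;
end main;
```

In a real testbench, the check process would apply is_one_hot to the implementation's output after each stimulus vector.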

3.3.4 Coding Structure of a Testbench

[Figure: testbench containing stimulus, specification, implementation, and check blocks]

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;

3.3.5

Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

Datapath Validation
Datapath circuits tend to be well-suited to reference-model style testbenches:

Each set of inputs generates one set of outputs
Each set of outputs is a function of just one set of inputs

Control Validation

Control circuits often pose problems for testbenches:

- Many more internal signals than outputs.
- The behaviour of the outputs provides a view into only a fragment of the current state of the circuit.
- It may take many clock cycles from when a bug is exercised inside the circuit until it generates a deviation from the correct behaviour on the outputs.
- When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause of the deviation (the root cause of the bug).

Assertions (Lec-12) can be used to check the behaviour of internal signals. Control circuits tend to use assertions to check correctness and rely on testbenches only to stimulate inputs.

3.4 Functional Validation for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate. The process scales well to very large circuits. It also allows validation to begin as soon as a circuit is simulatable, even before a complete specification has been written.

Implementation

    library ieee;
    use ieee.std_logic_1164.all;

    entity and2 is
      port (
        a, b : in  std_logic;
        c    : out std_logic
      );
    end and2;

    architecture main of and2 is
    begin
      -- c is '1' only when both inputs are '1'
      c <= '1' when (a = '1' and b = '1') else '0';
    end main;

3.4.1 A Spec-Less Testbench

(NOTE: this code has not been checked for correctness)

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

    entity and2_tb is
    end and2_tb;

    architecture main_tb of and2_tb is
      component and2
        port (
          a, b : in  std_logic;
          c    : out std_logic
        );
      end component;
      signal ta, tb, tc_impl : std_logic;
      signal ok : boolean;
    begin
      ---------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ---------------------------------------------
      stimulus : process
      begin
        ta <= '0'; tb <= '0';
        wait for 10 ns;
        ta <= '1'; tb <= '1';
        wait for 10 ns;
      end process;
      ---------------------------------------------
    end main_tb;

Use this testbench until the implementation generates solid Boolean values (no 'X' or 'U' data) and you have checked that a few simple test cases generate correct outputs.

3.4.2 Use an Array for Test Vectors

Writing code to drive inputs and repetitively typing "wait for 10 ns;" can get tedious, so code up the test vectors in an array.

(NOTE: this code has not been checked for correctness)

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --   a    b
          ( ( '0', '0' ),
            ( '1', '1' )
          );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
    end main_tb;

Use this testbench until checking the correctness of the outputs by hand using the waveform viewer becomes difficult.

3.4.3 Build Spec into Stimulus

(NOTE: this code has not been checked for correctness)

After a few test vectors appear to be working correctly (via a manual check of waveforms on simulation), begin automatically checking that outputs are correct:

- Add expected result to stimulus
- Add check process

    architecture main_tb of and2_tb is
      ...
    begin
      ------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ------------------------------------------
      stimulus : process
        type test_datum_ty is record
          ra, rb, rc : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          -- a, b : inputs
          -- c    : expected output
          --   a    b    c
          ( ( '0', '0', '0' ),
            ( '0', '1', '0' ),
            ( '1', '1', '1' )
          );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta      <= test_vectors(i).ra;
          tb      <= test_vectors(i).rb;
          tc_spec <= test_vectors(i).rc;
          wait for 10 ns;
        end loop;
      end process;
      ------------------------------------------
      check : process (tc_impl, tc_spec)
      begin
        ok <= (tc_impl = tc_spec);
      end process;
      ------------------------------------------
    end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test case.

3.4.4 Have Separate Specification Entity

Rather than write the specification as part of the stimulus, create a separate specification entity/architecture. The specification component then calculates the expected output values. (NOTE: if your simulation tool supports configurations, the spec and impl can share the same entity; we'll see this in section 3.5.)

    entity and2_spec is
      ...(same as and2 entity)...
    end and2_spec;

    architecture spec of and2_spec is
    begin
      c <= a and b;
    end spec;

    architecture main_tb of and2_tb is
      component and2 ...;
      component and2_spec ...;
      signal ta, tb, tc_impl, tc_spec : std_logic;
      signal ok : boolean;
    begin
      ------------------------------------------
      impl : and2      port map (a => ta, b => tb, c => tc_impl);
      spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
      ------------------------------------------
      stimulus : process
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --   a    b
          ( ( '0', '0' ),
            ( '1', '1' )
          );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
      ------------------------------------------
      check : process (tc_impl, tc_spec)
      begin
        ok <= (tc_impl = tc_spec);
      end process;
      ------------------------------------------
    end main_tb;

3.4.5 Generate Test Vectors

When it becomes tedious to write out each test vector by hand, we can automatically compute them. This example uses a pair of nested for loops to generate all four combinations of input values for two signals.

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        -- restrict the test values to '0' and '1'
        subtype std_test_ty is std_logic range '0' to '1';
      begin
        for va in std_test_ty'low to std_test_ty'high loop
          for vb in std_test_ty'low to std_test_ty'high loop
            ta <= va;
            tb <= vb;
            wait for 10 ns;
          end loop;
        end loop;
      end process;
      ...
    end main_tb;

3.4.6 Relational Specification

Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value. To do this, we drop the spec process, and put the brains into the check process.

    architecture main_tb of and2_tb is
      ...
    begin
      ------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ------------------------------------------
      stimulus : process
        ...
      end process;
      ------------------------------------------
      -- the output must never be '1' while either input is '0'
      check : process (ta, tb, tc_impl)
      begin
        ok <= not (tc_impl = '1' and (ta = '0' or tb = '0'));
      end process;
      ------------------------------------------
    end main_tb;

LEC-12 Preliminaries

LEC-12: Functional Validation of State Machines


Lecture Notes Sections: 3.5 to 3.5.9

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

    wk-01 to 02   VHDL
    wk-03 to 05   Design and Optimization
    wk-06         Functional Validation
                    Lec-11 Datapath Validation and Testbenches
                    Lec-12 Control Validation and Assertions
    wk-07         Performance Analysis, Prediction, and Optimization
    wk-08         Timing Analysis
    wk-08 to 10   Power Analysis and Reduction
    wk-11 to 12   Faults and Testing
    wk-13         Review

Overview

This lecture illustrates techniques for validating state machines by using a FIFO queue. The lecture goes over an implementation, specification, and testbench. The verification uses assertions and coverage monitors inside the implementation to improve the chances of catching bugs.

Concepts

- Don't care conditions: conditions or situations where we don't care what the implementation does.
- Use of uninitialized data: implementation should start with 'U' on all signals.
- assert and report statements: printing error messages to the screen.
- Instrumentation code: code that is added to the design but will not appear in hardware. Used to measure (instrument) behaviour of internal signals in the circuit. Often used to aid in validation, performance analysis, etc.
- Coverage monitors: processes that help check if test vectors are fully exercising behaviour of the implementation.
- Assertions: properties that behaviour of internal signals should obey.
- Running multiple scenarios from one test bench
- General VHDL coding guidelines

VHDL Constructs and Ideas

- separate package and package body
- assert, report
- textio package: read, write, readline
- don't care ('-')
- std_match

Background

State machine design

Reading

None

3.5 Functional Validation of Control Circuits

Control circuits are often more challenging to validate than datapath circuits:

- Control circuits have many internal signals. Testbenches are unable to access key information about the behaviour of a control circuit.
- Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect value and when an output signal shows the effect of the bug.

In this section, we will explore the functional validation of state machines via a First-In First-Out queue. The VHDL code for the queue is on the web at: http://www.ece.uwaterloo.ca/ece427/exs/queue

3.5.1 Overview of Queues in Hardware

[Figure 3.7: Structure of queue (write and read ends)]

[Figure 3.8: Write Sequence (starting from an empty queue)]

[Figure 3.9: A Second Example Write]

[Figure 3.10: Example Read Sequence]

[Figure 3.11: Write Illustrating Index Wrap]

[Figure 3.12: Write Illustrating Full Queue]

[Figure 3.13: Queue Signals (do_rd, do_wr, data_rd, data_wr, rd_idx, wr_idx, mem, empty)]

[Figure 3.14: Incomplete Queue Blocks. Control circuitry not shown.]

3.5.2 VHDL Coding

3.5.2.1 Package

Things to notice in queue package:
1. separation of package and body

    library ieee;
    use ieee.std_logic_1164.all;   -- std_logic_vector
    use ieee.numeric_std.all;      -- to_unsigned

    package queue_pkg is
      subtype data is std_logic_vector(3 downto 0);
      function to_data(i : integer) return data;
    end queue_pkg;

    package body queue_pkg is
      function to_data(i : integer) return data is
      begin
        return std_logic_vector(to_unsigned(i, 4));
      end to_data;
    end queue_pkg;

3.5.2.2 Other VHDL Coding

VHDL coding techniques to notice in queue implementation:
1. type declaration for vectors
2. attributes
   (a) Smith pp420,421 (Tables 10.14, 10.15)
   (b) 'low, 'high, 'length
3. functions (reduce overall implementation and maintenance effort)
   (a) reduce redundant code
   (b) hide implementation details
   (c) (just like software engineering....)

3.5.3 Code Structure for Validation

Validation things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions

Code Structure for Validation

    architecture ... is
      ...
    begin
      ... normal implementation ...
      process (clk)
      begin
        if rising_edge(clk) then
          ... instrumentation code ...
          prev_signame <= signame;
        end if;
      end process;
      ... assertions ...
      ... coverage monitors ...
    end;

3.5.4 Instrumentation Code

    process (clk)
    begin
      if rising_edge(clk) then
        prev_rd_idx <= rd_idx;
        prev_wr_idx <= wr_idx;
        prev_do_rd  <= do_rd;
        prev_do_wr  <= do_wr;
      end if;
    end process;

- Added to implementation to support validation
- Usually keeps track of previous values of signals
- Does not create hardware (optimized away during synthesis)
- Does not feed any output signals
- Must use synthesizable subset of VHDL

Naming Convention

NOTE: Naming convention for instrumentation. For assertions, signals are named prev_signame and signame, rather than next_signame and signame as is done for state machines. This is because for assertions we use the prev signals as history signals, to keep track of past events. In contrast, for state machines, we name the signals next, because the state machine computes the next values of signals.

3.5.5 Coverage Monitors

The goal of a coverage monitor is to check whether a certain event is exercised in a simulation run. If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that will trigger the monitor. For example, for a circuit used in a microwave oven controller, we might want to make sure that we simulate the situation when the door is opened while the power is on.

Steps to Creating Coverage Monitors

1. Identify important events, conditions, transitions
2. Write instrumentation code to detect event
3. Use report to write when event happens
4. When the simulation is run, report statements will print when a coverage condition is detected
5. Pipe simulation results to a log file
6. Examine the log file and coverage monitors to find cases and transitions not tested by existing test vectors
7. Add test vectors to exercise missing cases
8. Idea: automate detection of missing cases using a Perl script to find coverage messages in the VHDL code that aren't in the log file
9. Real world: most commercial synthesis tools come with add-on packages that provide different types of coverage analysis
10. Research/entrepreneurial idea: based on missing coverage cases, find new test vectors to exercise the case
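Step 8 can be sketched in a few lines. The example below uses Python rather than the Perl script the notes suggest, and it assumes that every coverage monitor reports a string literal beginning with "coverage:" (the convention used in these notes). The function names and the regular expression are illustrative assumptions, not part of the course code.

```python
import re

# Captures the message of report statements such as:
#   report "coverage: rd mv to low";
REPORT_RE = re.compile(r'report\s+"(coverage:[^"]*)"', re.IGNORECASE)


def coverage_messages(vhdl_source):
    """Return the set of coverage messages declared in the VHDL source text."""
    return set(REPORT_RE.findall(vhdl_source))


def missing_coverage(vhdl_source, log_text):
    """Return, sorted, the declared coverage messages that never appear in the log."""
    return sorted(m for m in coverage_messages(vhdl_source) if m not in log_text)


if __name__ == "__main__":
    source = '''
        report "coverage: rd mv to low";
        report "coverage: rd mv to high";
        report "error: case fall through" severity warning;
    '''
    log = "# ** Note: coverage: rd mv to low\n"
    for message in missing_coverage(source, log):
        print("not exercised:", message)
```

The check is a plain substring search, so a message that is a prefix of another message could be reported as covered when it was not; distinct message texts avoid this.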

Coverage Events for Queue

[Figure: three queue diagrams showing the positions of wr and rd in the previous (Prev) and current (Now) clock cycles]

- wr_idx and rd_idx are far apart
- wr_idx and rd_idx are equal
- wr_idx catches rd_idx
- rd_idx catches wr_idx
- rd_idx wraps
- wr_idx wraps

Coverage Monitor Template

    process (signals read)
    begin
      if (condition) then
        report "coverage: message";
      elsif (condition) then
        report "coverage: message";
      else
        report "error: case fall through on message"
          severity warning;
      end if;
    end process;

Coverage Monitor Code

Events related to rd_idx equals wr_idx.

    process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
    begin
      if (rd_idx = wr_idx) then
        if ( prev_rd_idx = prev_wr_idx ) then
          report "coverage: read = write both moved";
        elsif ( rd_idx /= prev_rd_idx ) then
          report "coverage: Read caught write";
        elsif ( wr_idx /= prev_wr_idx ) then
          report "coverage: Write caught read";
        else
          report "error: case fall through on rd/wr catching"
            severity warning;
        end if;
      end if;
    end process;

Events related to rd_idx wrapping.

    process (rd_idx)
    begin
      if (rd_idx = low_idx) then
        report "coverage: rd mv to low";
      elsif (rd_idx = high_idx) then
        report "coverage: rd mv to high";
      else
        report "coverage: rd mv normal";
      end if;
    end process;

3.5.6 Assertions

Assertions for Queue

1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others....

Assertion Template

    process (signals read)
    begin
      assert (required condition)
        report "error: message" severity warning;
    end process;

Assertions: Read Index

    process (rd_idx)
    begin
      assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
        report "error: rd inc" severity warning;
      assert ((prev_do_rd = '1') or (reset = '1'))
        report "error: rd imp do_rd" severity warning;
    end process;

Assertions: Write Index

    process (wr_idx)
    begin
      assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
        report "error: wr inc" severity warning;
      assert ((prev_do_wr = '1') or (reset = '1'))
        report "error: wr imp do_wr" severity warning;
    end process;

3.5.7 VHDL Coding Tips

Vector Type Declaration

    type data_array_ty is array(natural range <>) of data;
    signal data_array : data_array_ty(7 downto 0);

Functions

    function to_idx (i : natural range data_array'low to data_array'high)
      return idx_ty is
    begin
      return to_unsigned(i, idx_ty'length);
    end to_idx;

Conversion to Index

    Without Function:  rd_idx <= to_unsigned(5, 3);
    With Function:     rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.

Attributes

    function inc_idx (idx : idx_ty) return idx_ty is
    begin
      if idx < data_array'high then
        return (idx + 1);
      else
        return (to_idx(data_array'low));
      end if;
    end inc_idx;

Feedback Loops, and Functions

Coding guideline: use functions. Don't use procedures.

    inc as function:   wr_idx <= inc_idx(wr_idx);
    inc as procedure:  inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine the modes of signals. Modifying a signal within a procedure results in a tri-state signal. This is bad.

File I/O (textio package)

TEXTIO defines read, write, readline, writeline functions. These functions can be used to read test vectors from a file and write results to a file. Described in:

- Smith 10.6.3
- http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

3.5.8 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices. The specification should be obviously correct. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.

Write Index Update in Specification

We increment the write index on every write; we never wrap.

    process (clk)
    begin
      if rising_edge(clk) then
        if (reset = '1') then
          wr_idx <= 0;
        elsif (do_wr = '1') then
          wr_idx <= wr_idx + 1;
        end if;
      end if;
    end process;

Things to Notice

Things to notice in queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don't Care

    rd_data <= data_array(rd_idx) when (do_rd = '1')
               else (others => '-');

3.5.9 Queue Testbench

Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data ('U')
3. std_match to compare spec and impl data

With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.

[Table: results of = compared with std_match for std_logic values including '0', 'L', '1', 'H', and everything else; layout not recoverable from the source]
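The difference between = and std_match can be modelled in a few lines. The Python sketch below is an illustration under a stated assumption: it follows the matching rules of std_match as defined in ieee.numeric_std ('-' matches anything; '0' and 'L' match each other; '1' and 'H' match each other; 'U', 'X', 'Z', and 'W' match only '-'). The single-character string encoding of std_logic values and the helper names are hypothetical.

```python
# Weak values behave like their strong counterparts for matching purposes.
_STRONG = {"L": "0", "H": "1"}


def std_match(a, b):
    """Model of std_match on two single std_logic values given as characters."""
    if a == "-" or b == "-":
        return True                      # don't care matches everything
    a, b = _STRONG.get(a, a), _STRONG.get(b, b)
    return a == b and a in "01"          # U, X, Z, W never match each other


def std_match_vec(spec, impl):
    """Element-wise std_match on two equal-length vectors given as strings."""
    return len(spec) == len(impl) and all(std_match(s, i) for s, i in zip(spec, impl))


if __name__ == "__main__":
    print("-" == "1", std_match("-", "1"))   # equality fails, std_match succeeds
    print(std_match_vec("1-0", "110"))
    print(std_match("U", "U"))
```

Note that under this model an uninitialized implementation value 'U' fails to match any specification value other than '-', which is what makes uninitialized data visible in the check.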

Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.

    stimulus : process
      type test_datum_ty is record
        r_reset, ... normal fields ...
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        ( --  reset ...
          ( '1', normal fields),
          ( '0', normal fields),
          ...
          -- wr_idx passes rd_idx (overwrite entries)
          --  reset ...
          ( '1', normal fields),
          ( '0', normal fields),
          ...
        );
    begin
      for i in test_vectors'range loop
        if (test_vectors(i).r_reset = '1') then
          ... reset code ...
        end if;
        reset <= '0';
        ... normal sequence ...
        wait until rising_edge(clk);
      end loop;
    end process;

After reset is asserted, set signals to 'U'.

Chapter 4

Performance Analysis and Optimization

4.1 Intro

4.1.1 Concepts

- definition of performance
- different ways of measuring performance
- comparing performance (speedup, n% faster)
- improving performance
- Amdahl's law (limits on performance improvements)

4.1.2 Background Material

Algebra, basic familiarity with assembly language

4.1.3 Reading Material

Performance is not described in Smith's book. Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance.

LEC-13 Preliminaries

LEC-13: Introduction to Performance Analysis

Lecture Notes Sections: 4.2 to 4.4.3

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

    wk-01 to 02   VHDL
    wk-03 to 05   Design and Optimization
    wk-06         Functional Validation
    wk-07         Performance Analysis, Prediction, and Optimization
                    Lec-13 Computer Performance
                    Lec-14 Digital Circuit Performance
    wk-08         Timing Analysis
    wk-08 to 10   Power Analysis and Reduction
    wk-11 to 12   Faults and Testing
    wk-13         Review

Overview

No more VHDL in lectures!

This lecture introduces the concepts behind performance measurement and illustrates the importance of mathematical analysis when making performance tradeoffs. This lecture overlaps with some material in the computer architecture course E&CE 429. The second lecture on performance will apply performance analysis to dataflow diagrams and so will not overlap with E&CE 429.

Concepts

- definition of performance
- different ways of measuring performance
- comparing performance (speedup, n% faster)
- improving performance
- Amdahl's law (limits on performance improvements)
- clock speed, program length, CPI, and performance

Background

Algebra, basic familiarity with assembly language

Reading

Performance is not described in Smith's book. Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance.

4.2 Defining Performance

\[ \text{Performance} = \frac{\text{Work}}{\text{Time}} \]

You can double your performance by:

- doing twice the work in the same amount of time, OR
- doing the same amount of work in half the time

Benchmarking

\[ \text{Performance} = \frac{\text{Work}}{\text{Time}} \]

Measuring time is easy, but how do we accurately measure work? The game of benchmarking is finding a definition of work that makes your system appear to get the most work done in the least amount of time.

    Measure of Work       Measure of Performance
    clock cycle           MHz
    instruction           MIPs
    synthetic program     Dhrystone, Whetstone, D-MIPs (Dhrystone MIPs)
    real program          SPEC
    travel 1/4 mile       drag race

SPEC Benchmarks

Definition SPEC: Standard Performance Evaluation Corporation.

MISSION: "To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems" (http://www.spec.org).

4.3 Comparing Performance

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).

                       printer1   printer2
    Black and White    9ppm       12ppm
    Colour             6ppm       4ppm

Question: Which printer is faster at B&W and how much faster is it?

\[ n\%\ \text{faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} \]

BW Performance

Answer:

\[ BW_1 = \frac{1}{9\,\text{ppm}} = 0.1111\ \text{min/page} \]

\[ BW_2 = \frac{1}{12\,\text{ppm}} = 0.0833\ \text{min/page} \]

\[ BW_{Faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{BW_1 - BW_2}{BW_2} = \frac{0.1111 - 0.08333}{0.08333} = 33\%\ \text{faster} \]
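The B&W comparison can be checked numerically. This is a minimal Python sketch of the "n% faster" formula; the function name percent_faster is an illustrative choice, and times are in minutes per page.

```python
def percent_faster(t_slow, t_fast):
    """n% faster = (TSlow - TFast) / TFast, expressed as a percentage."""
    return (t_slow - t_fast) / t_fast * 100.0


bw1 = 1 / 9    # printer1: 9 ppm  -> minutes per B&W page
bw2 = 1 / 12   # printer2: 12 ppm -> minutes per B&W page

print(round(percent_faster(bw1, bw2), 1))  # 33.3
```

The denominator is the time of the faster system, so swapping the arguments does not simply negate the result.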

4.3.1 Performance for Different Tasks

Question: If the average workload is 90% BW and 10% Colour, which printer is faster and how much faster is it?

A potentially helpful formula is the average time to do one of k different tasks:

\[ T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i) \]

Answer:

\[ T_{Avg1} = (\%BW)(BW_1) + (\%C)(C_1) = (0.90)(0.1111) + (0.10)(0.1667) = 0.1167\ \text{min/page} \]

\[ T_{Avg2} = (\%BW)(BW_2) + (\%C)(C_2) = (0.90)(0.0833) + (0.10)(0.2500) = 0.1000\ \text{min/page} \]

\[ Avg_{Faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{Avg_1 - Avg_2}{Avg_2} = \frac{0.1167 - 0.1000}{0.1000} = 16.7\%\ \text{faster} \]
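The weighted-average calculation can be reproduced with a short Python sketch. The function name average_time and the list-of-pairs input format are illustrative assumptions; the printer speeds and the 90%/10% workload come from the example above.

```python
def average_time(tasks):
    """TAvg = sum over tasks of (fraction of workload) * (time per task)."""
    return sum(fraction * time for fraction, time in tasks)


# Times in minutes per page: BW at 9 and 12 ppm, Colour at 6 and 4 ppm.
t_avg1 = average_time([(0.90, 1 / 9), (0.10, 1 / 6)])    # printer1
t_avg2 = average_time([(0.90, 1 / 12), (0.10, 1 / 4)])   # printer2

percent = (t_avg1 - t_avg2) / t_avg2 * 100.0
print(round(t_avg1, 4), round(t_avg2, 4), round(percent, 1))
```

With this workload printer2 has the smaller average time per page, which agrees with the 16.7% figure in the answer.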

4.3.2 Optimizing Performance

Question: If we want to optimize printer1 to match the performance of printer2, should we optimize BW or Colour printing?

Answer:

Colour printing is slower, so it appears that we can save more time by optimizing colour printing. However, look at the extreme case of optimizing colour printing to be instantaneous for P1:

[Figure: bar chart of average time per page (0.000 to 0.150 min/page) for P1 and P2]

Even if we make colour printing instantaneous for printer 1 and keep it the same for printer 2, printer 1 would not be measurably faster.

Amdahl's law: make the common case fast.

Optimizations need to take into account both run time and frequency of occurrence.
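The extreme case can be checked directly. The Python sketch below assumes the 90% BW / 10% Colour workload from section 4.3.1 and sets printer 1's colour time to zero; the variable names are illustrative.

```python
import math

FRACTION_BW, FRACTION_COLOUR = 0.90, 0.10

# Printer 1 with instantaneous colour printing (0 min/page) and 9 ppm B&W.
t_avg1_instant_colour = FRACTION_BW * (1 / 9) + FRACTION_COLOUR * 0.0

# Printer 2 unchanged: 12 ppm B&W, 4 ppm colour.
t_avg2 = FRACTION_BW * (1 / 12) + FRACTION_COLOUR * (1 / 4)

# Both averages come out at 0.1 min/page, so the optimization only ties.
print(math.isclose(t_avg1_instant_colour, t_avg2))
```

Colour accounts for only 10% of the pages, so even removing its cost entirely cannot recover more than that share of printer 1's average time.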

Optimization without Engineering

Question: If you have to fire all of the engineers because your stock price plummeted, how can you get printer1 to be faster than printer2?

NOTE: Hmmmm. This question was actually humorous during the high-tech bubble...

Answer:

Hire more marketing people! Notice that colour printing on printer 1 is faster than on printer 2. So, marketing suggests that people are increasing the percentage of printing that is done in colour.

Question: Revised question: what percentage of printing must be done in colour for printer1 to beat printer2?

Answer:

\[ T_{Avg1} < T_{Avg2} \]

\[ (\%BW)(BW_1) + (\%C)(C_1) < (\%BW)(BW_2) + (\%C)(C_2) \]

\[ \%BW = 1 - \%C \]

\[ (1 - \%C)(BW_1) + (\%C)(C_1) < (1 - \%C)(BW_2) + (\%C)(C_2) \]

\[ BW_1 - (\%C)(BW_1) + (\%C)(C_1) < BW_2 - (\%C)(BW_2) + (\%C)(C_2) \]

\[ BW_1 - BW_2 < (\%C)(BW_1 - BW_2 + C_2 - C_1) \]

\[ \%C > \frac{BW_1 - BW_2}{BW_1 - BW_2 + C_2 - C_1} = \frac{0.1111 - 0.0833}{0.1111 - 0.0833 + 0.2500 - 0.1667} = 0.25 \]

More than 25% of printing must be done in colour for printer1 to beat printer2.
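The break-even colour fraction can be computed exactly. This Python sketch uses the standard-library Fraction type to avoid rounding; the function name break_even_colour_fraction is an illustrative assumption, and the formula is the last step of the derivation above.

```python
from fractions import Fraction


def break_even_colour_fraction(bw1, bw2, c1, c2):
    """Colour fraction at which both printers have the same average time per page.

    Printer 1 wins for any colour fraction above this value, provided
    printer 1 is slower at B&W (bw1 > bw2) and faster at colour (c1 < c2).
    """
    return (bw1 - bw2) / ((bw1 - bw2) + (c2 - c1))


# Times in minutes per page, from the ppm figures in section 4.3.
bw1, bw2 = Fraction(1, 9), Fraction(1, 12)
c1, c2 = Fraction(1, 6), Fraction(1, 4)

threshold = break_even_colour_fraction(bw1, bw2, c1, c2)
print(threshold)  # 1/4
```

At exactly 25% colour the two averages are equal, so "beat" requires a strictly larger colour share.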

4.4 Clock Speed, CPI, Program Length, and Performance

4.4.1 Mathematics

    CPI          Cycles per instruction
    NumInsts     Number of instructions
    ClockSpeed   Clock speed

\[ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} \]

4.4.2 Example: CISC vs RISC and CPI

                   AMD Athlon   Fujitsu SPARC64
    Clock Speed    1.2GHz       675MHz
    SPECint        409          443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's SPARC instruction set). Assume that it requires 20% more instructions to write a program in the SPARC instruction set than the same program requires in IA-32.

SPECint and Performance

Question: Which of the two processors has higher performance?

Answer: SPECint, SPECfp, and SPEC are measures of performance. Therefore, the higher the SPEC number, the higher the performance. The Fujitsu SPARC64 has higher performance.

Relative CPI
Question: What is the ratio between the CPIs of the two microprocessors?
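The notes leave this question open. One way to approach it, sketched here in Python, is to combine Time = NumInsts × CPI / ClockSpeed with Performance = Work / Time. The sketch assumes that SPECint is directly proportional to performance on the same work and that the SPARC program has 1.2 times as many instructions as the IA-32 program; the function name cpi_ratio is illustrative.

```python
def cpi_ratio(clock_a, clock_b, insts_a, insts_b, perf_a, perf_b):
    """Return CPI_b / CPI_a.

    Rearranging Time = NumInsts * CPI / ClockSpeed with Performance
    proportional to 1 / Time gives CPI proportional to
    ClockSpeed / (NumInsts * Performance).
    """
    return (clock_b / clock_a) * (insts_a / insts_b) * (perf_a / perf_b)


# a = AMD Athlon (1.2 GHz, SPECint 409), b = Fujitsu SPARC64 (675 MHz, SPECint 443).
# Instruction counts are relative: the SPARC program is 20% longer.
ratio = cpi_ratio(1200e6, 675e6, 1.0, 1.2, 409, 443)
print(round(ratio, 2))  # 0.43
```

Under these assumptions the SPARC64 needs well under half as many cycles per instruction as the Athlon, which is how it reaches a higher SPECint score at a lower clock speed.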

Absolute CPI
Question: Can you determine the absolute (actual) CPI of either microprocessor?

4.4.3 Summary of Equations

Time to perform a task:

\[ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} \]

Average time to do one of k different tasks:

\[ T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i) \]

Performance:

\[ \text{Performance} = \frac{\text{Work}}{\text{Time}} \]

Speedup:

\[ \text{Speedup} = \frac{T_{Slow}}{T_{Fast}} \]

TFast is n% faster than TSlow:

\[ n\%\ \text{faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} \]

LEC-14 Preliminaries

LEC-14: Performance and Dataflow Diagrams

Lecture Notes Sections: 4.5 to 4.5.6

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

    wk-01 to 02   VHDL
    wk-03 to 05   Design and Optimization
    wk-06         Functional Validation
    wk-07         Performance Analysis, Prediction, and Optimization
                    Lec-13 Computer Performance
                    Lec-14 Digital Circuit Performance
    wk-08         Timing Analysis
    wk-08 to 10   Power Analysis and Reduction
    wk-11 to 12   Faults and Testing
    wk-13         Review

Overview

In this lecture we relate the general performance equations from Lec-13 to dataflow diagrams.

Concepts

- predicting performance for dataflow diagrams
- choosing clock speed in dataflow diagrams
- instruction scheduling
- dataflow diagrams with multiple instructions and performance
- design effort vs performance

4.5 Performance Analysis and Dataflow Diagrams

4.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock speed of a circuit doesn't necessarily improve its performance. In this section we will work through several example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock cycles.

4.5.1.1 Tradeoffs

When partitioning dataflow diagrams into clock cycles, we need to take both area and performance into account.

    Goal:   Minimize area
    Action: decrease clock period
    Effect: fewer operations per clock cycle, so fewer datapath components and more opportunities to reuse hardware

    Goal:   Increase scheduling flexibility
    Action: increase clock period
    Effect: more flexibility in grouping operations in clock cycles

    Goal:   Decrease percentage of clock cycle spent in flops (overhead: time in flops is not doing useful work)
    Action: increase clock period
    Effect: decreases number of flops that data traverses through

    Goal:   Decrease time to execute an instruction
    Action: ????
    Effect: depends on dataflow diagram

General Plan

Our general plan to find the clock period for maximum performance is:

1. Pick the clock period to be the delay through the slowest component + the delay through a flop.
2. For each instruction, for each operation, schedule the operation in the earliest clock cycle possible without violating clock-period timing constraints.
3. Calculate the average time to execute an instruction. Combine:

   \[ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} \]

   and:

   \[ CPI_{avg} = \sum_{i=1}^{k} (\%_i)(CPI_i) \]

   to derive:

   \[ \text{Time} = \frac{\text{NumInsts} \times \sum_{i=1}^{k} (\%_i)(CPI_i)}{\text{ClockSpeed}} \]

4. If the maximum latency through the dataflow diagram is greater than 1, then increase the clock period by the minimum amount needed to decrease the latency by one clock period and return to Step 2.
5. If the maximum latency through the dataflow diagram is 1, then the clock period for highest performance is the clock period resulting in the fastest Time.
6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences of a component per instruction per clock cycle without increasing latency for any instruction.

4.5.2 Dataflow Diagram with Two Instructions

The circuit supports two instructions, A and B (e.g. multiply and divide). At any point in time, the circuit is doing either A or B; it does not need to support doing A and B simultaneously. The diagrams below show the flow for each instruction and the delay through the components (f, g, h, i) that the instructions use. The delay through a register is 5ns. Each operation (A and B) occurs 50% of the time. Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest overall performance.

[Figure: dataflow diagrams. Instruction A uses f (30ns), g (50ns), h (20ns), g (50ns). Instruction B uses i (40ns), g (50ns).]

4.5.2.1 Scheduling of Operations for Different Clock Periods

Scheduling for clock periods of 55 ns, 75 ns, 85 ns, 95 ns, and 155 ns ("/" marks a clock-cycle boundary):

Clock Period | Instruction A | Instruction B
55 ns | f / g / h / g | i / g
75 ns | f / g, h / g | i / g
85 ns | f, g / h, g | i / g
95 ns | f, g / h, g | i, g
155 ns | f, g, h, g | i, g

4.5.2.2 Performance Computation for Different Clock Periods


Question: Which clock speed will result in the highest overall performance?

Answer:

Clock Period | CPI_A | CPI_B | T_avg
55 ns | 4 | 2 | 55 × (0.5 × 4 + 0.5 × 2) = 165 ns
75 ns | 3 | 2 | 75 × (0.5 × 3 + 0.5 × 2) = 187.5 ns
85 ns | 2 | 2 | 85 × (0.5 × 2 + 0.5 × 2) = 170 ns
95 ns | 2 | 1 | 95 × (0.5 × 2 + 0.5 × 1) ≈ 143 ns
155 ns | 1 | 1 | 155 × (0.5 × 1 + 0.5 × 1) = 155 ns

A clock period of 95 ns results in the lowest average execution time, and hence the highest performance.

4.5.2.3 Example: Two Instructions Taking Similar Time

A and B take similar amounts of time


Question: For the flow below, which clock speed will result in the highest overall performance?

Instruction A: f (30 ns), g (50 ns), h (20 ns), g (50 ns)
Instruction B: i (40 ns), g (50 ns), i (40 ns)

Answer:

[Figure: schedules of A and B for clock periods of 55, 75, 85, 95, 105, 135, and 155 ns.]

Clock Period | CPI_A | CPI_B | T_avg
55 ns | 4 | 3 | 193 ns
75 ns | 3 | 3 | 225 ns
85 ns | 2 | 3 | 213 ns
95 ns | 2 | 2 | 190 ns
105 ns | 2 | 2 | NO GAIN
135 ns | 2 | 1 | 203 ns
155 ns | 1 | 1 | 155 ns

A clock period of 155 ns results in the highest performance. For a clock period of 105 ns, we did not calculate the performance, because we could see that it would be worse than the performance with a clock period of 95 ns. The dataflow diagram with a 105 ns clock period has the same latency as the diagram with a clock period of 95 ns. If the dataflow diagram with the longer clock period has the same latency as the diagram with the shorter clock period, then the diagram with the longer clock period will have lower performance.

4.5.2.4 Example: Same Total Time, Different Order for A

Example: Different Order of Operations for A


Question: For the flow below, which clock speed will result in the highest overall performance?

Instruction A: 30 ns, 20 ns, 50 ns, 50 ns
Instruction B: 40 ns, 50 ns, 40 ns

Answer:

Clock Period | CPI_A | CPI_B | T_avg
55 ns | 3 | 3 | 165 ns
95 ns | 3 | 2 | 238 ns
105 ns | 2 | 2 | 210 ns
135 ns | 2 | 1 | 203 ns
155 ns | 1 | 1 | 155 ns

A clock period of 155 ns results in the lowest average execution time, and hence the highest performance. This is the same answer as the previous problem, but the total times for higher clock frequencies differ significantly between the two problems.

4.5.3 Example: From Algorithm to Optimized Dataflow


This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

Component Delays

Component | Delay
2-input Mult | 40 ns
2-input Add | 25 ns
Register | 5 ns

Instruction | Algorithm | Frequency of Occurrence
InstP | (a*b)*((a*b) + (b*d) + e) | 75%
InstQ | computed from i, j, k, l, m with three additions and one multiplication | 25%

NOTES

- There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
- You must put registers on your inputs; you do not need to register your outputs.
- The environment will directly connect your outputs (its inputs) to registers.
- Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once; if you need to use a value in multiple clock cycles, you must store it in a register.

Questions
Question: What clock period will result in the best overall performance?

Question: Find a minimal set of resources that will achieve the performance you calculated.

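Answer:

[Figure: dataflow diagrams for InstP and InstQ, scheduled with a 70 ns clock period.]

InstP (CPI = 2): inputs a, b, d, e; intermediate values a*b, b*d, (a*b) + (b*d), and (a*b) + (b*d) + e; result (a*b)*((a*b) + (b*d) + e).
InstQ (CPI = 2): inputs i, j, k, l, m; three additions and one multiplication.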

Resource Usage
Fastest execution time | 140 ns
Clock period | 70 ns
Inputs | 3
Outputs | 1
Registers | 3
Adders | 2
Multipliers | 2

4.5.4 Optimality: Performance vs Area Tradeoffs

You are designing a 16-bit barrel shifter. You have the option of supporting an entire 15-bit shift in a single clock cycle (which gives a latency of 1 clock cycle), shifting 1 bit per clock cycle (which gives a latency of 15 clock cycles), or anything in between. You do the design and measure the following information:

Max Shift | Min Period | Area (CLBs)
1 | 21 ns | 13
3 | 27 ns | 36
7 | 34 ns | 57
15 | 40 ns | 53

Question: Which circuit gives you the best optimality, in terms of MIPs/CLB?

Answer: Assume that all shift amounts have the same probability of occurrence. Shift amounts can be anywhere from 0 (no shift) to 16 (shift all data out, leaving only zeroes). The data for the shift amounts and latencies were generated using Synopsys Design Compiler for a Xilinx FPGA.

Max shift of 1

Task i is to shift by i bits. A shift amount of i requires i clock cycles.

\[
\begin{aligned}
T_{avg} &= \sum_{i=0}^{16} \%_i \times T_i \\
&= \sum_{i=0}^{16} \frac{1}{17} \times i \times \mathrm{ClkPeriod} \\
&= \frac{1}{17} \times \mathrm{ClkPeriod} \times \sum_{i=0}^{16} i \\
&= \frac{1}{17} \times 21 \times 136 \\
&= 168\,\mathrm{ns}
\end{aligned}
\]

Max Shift | Min Period | Latency | Time (Latency × Min Period) | MIPs | Area | MIPs/CLB
1 | 21 ns | 15 | 315 ns | 3.2 | 13 | 0.25
3 | 27 ns | 5 | 135 ns | 7.4 | 36 | 0.21
7 | 34 ns | 3 | 102 ns | 9.8 | 57 | 0.17
15 | 40 ns | 1 | 40 ns | 25 | 53 | 0.47

New assumptions:

1. All shift amounts have the same probability of occurrence.
2. The latency of a shift operation is dependent upon the shift amount.
3. Shift amounts can be anywhere from 0 (no shift) to 15 (shift least-significant bit to most-significant position).
4. Shifting by 0 requires 1 clock cycle.

Question: With the revised assumptions, which circuit gives you the best optimality, in terms of MIPs/CLB?

Answer:

Max shift of 1

Task i is to shift by i bits. A shift amount of i requires i clock cycles. The exception is i = 0, which requires 1 clock cycle.

\[
\begin{aligned}
T_{avg} &= \sum_{i=0}^{15} \%_i \times T_i \\
&= \frac{1}{16} \times \mathrm{ClkPeriod} + \sum_{i=1}^{15} \frac{1}{16} \times i \times \mathrm{ClkPeriod} \\
&= \frac{1}{16} \times \mathrm{ClkPeriod} + \frac{1}{16} \times \mathrm{ClkPeriod} \times \sum_{i=0}^{15} i \\
&= \frac{1}{16} \times 21 + \frac{1}{16} \times 21 \times 120 \\
&= 158\,\mathrm{ns}
\end{aligned}
\]

Max shift of 3

Shift amount | Latency
0 to 3 | 1
4 to 6 | 2
7 to 9 | 3
10 to 12 | 4
13 to 15 | 5

5 different tasks. T_i = i × ClkPeriod and %_i = 0.20.

\[ T_{avg} = \sum_{i=1}^{5} \%_i \times T_i = \sum_{i=1}^{5} 0.20 \times i \times 27 = 81\,\mathrm{ns} \]

Max shift of 7

Shift amount | Latency
0 to 7 | 1
8 to 14 | 2
15 | 3

3 different tasks. T_i = i × ClkPeriod and %_i = 0.33.

\[ T_{avg} = \sum_{i=1}^{3} \%_i \times T_i = \sum_{i=1}^{3} 0.33 \times i \times \mathrm{ClkPeriod} = 67\,\mathrm{ns} \]

Max shift of 15

Shift amount | Latency
0 to 15 | 1

1 task. T_i = ClkPeriod and %_i = 1.00.

\[ T_{avg} = \sum_{i=1}^{1} \%_i \times T_i = 40\,\mathrm{ns} \]

Max Shift | Min Period | Latency | Time | MIPs | Area | MIPs/CLB
1 | 21 ns | 15 | 158 ns | 6.3 | 13 | 0.48
3 | 27 ns | 5 | 81 ns | 12 | 36 | 0.33
7 | 34 ns | 3 | 67 ns | 15 | 57 | 0.26
15 | 40 ns | 1 | 40 ns | 25 | 53 | 0.47
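The optimality figures follow mechanically from the average time and the area. A minimal sketch, assuming one shift operation completes every T_avg nanoseconds (so MIPs = 1000 / T_avg) and using the times and areas tabulated for the revised assumptions; the function name `mips_per_clb` is made up for this illustration, and small differences from the tabulated values come from rounding the intermediate MIPs figure.

```python
def mips_per_clb(time_ns, area_clbs):
    """Optimality metric: millions of operations per second per CLB."""
    mips = 1000.0 / time_ns   # one operation every time_ns nanoseconds
    return mips / area_clbs


# max shift -> (average time in ns, area in CLBs), revised assumptions
designs = {1: (158, 13), 3: (81, 36), 7: (67, 57), 15: (40, 53)}

for max_shift, (time_ns, area) in designs.items():
    print(max_shift, round(mips_per_clb(time_ns, area), 2))

# Design with the highest MIPs/CLB
best = max(designs, key=lambda k: mips_per_clb(*designs[k]))
print("best max shift:", best)
```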

4.5.5 Effect of Instruction Set on Performance

Example: Changing Instruction Set and Performance


Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.) Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add. Additionally, you know:

 | ADD | MUL | Other
CPI | 0.8 CPIavg | 1.2 CPIavg | 1.0 CPIavg
% | 15% | 5% | 80%

Options

You have three options:

option 1: no change
option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?
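One way to set up the comparison is sketched below. It rests on assumptions that the question does not spell out: CPIs are taken in units of CPIavg, the clock period of option 1 is normalised to 1.0, and "half of the multiplies" means that per 100 original instructions, 2.5 MUL and 2.5 ADD instructions are replaced by 2.5 MAC instructions. The names `relative_time` and `with_mac` are made up for this sketch.

```python
def relative_time(mix, clock_period):
    """Total time for a workload: sum(count * CPI) * clock period.

    mix is a list of (instruction count, CPI) pairs.
    """
    return sum(count * cpi for count, cpi in mix) * clock_period


def with_mac(mac_cpi):
    """Instruction mix per 100 original instructions after fusing (assumed)."""
    return [(12.5, 0.8),     # ADD: 15 minus the 2.5 absorbed into MACs
            (2.5, 1.2),      # MUL: the half that is not fused
            (2.5, mac_cpi),  # MAC
            (80.0, 1.0)]     # Other


option1 = relative_time([(15, 0.8), (5, 1.2), (80, 1.0)], clock_period=1.0)
option2 = relative_time(with_mac(1.2), clock_period=1.2)        # same CPI as MUL, slower clock
option3 = relative_time(with_mac(1.2 * 1.5), clock_period=1.0)  # CPI 50% greater than MUL

print(option1, option2, option3)
```

Under these assumptions the option with the smallest relative time has the highest performance; a different reading of the instruction mix would change the numbers.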

4.5.6 Effect of Time to Market on Relative Performance

Example: Time to Market and Optimizations


Assume that performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.

Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
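A short sketch of the break-even calculation, assuming that market performance grows smoothly as 2^(t/18) with t in months (the question only states the 18-month doubling, so the continuous model is an assumption). The slip is tolerable as long as the market has grown by less than the 7% gained; the function name `max_slip_months` is made up for this sketch.

```python
import math


def max_slip_months(improvement, doubling_months=18.0):
    """Schedule slip at which market growth equals the optimization's gain.

    Solves 2 ** (t / doubling_months) == 1 + improvement for t.
    """
    return doubling_months * math.log2(1.0 + improvement)


print(round(max_slip_months(0.07), 2))  # break-even slip in months
```

With a 7% improvement this comes to about 1.76 months; a longer slip leaves the product relatively worse off than launching on schedule without the optimization.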

Chapter 5

Timing Analysis

5.1 Preliminaries

5.1.1 Overview

- Clock Skew
- Clock Jitter
- Clock-to-Q delay (latch and flop)
- Setup Time (latch and flop)
- Hold time (latch and flop)
- Capacitive Load delay
- Critical path
- False path
- Setup, hold, and clock-to-Q times in hierarchical circuits
- Propagation delay
- Interconnect (Wire) delay
- Load delay
- Elmore time constant
- Worst case timing
- Derating factors
- Speed binning

5.1.2 Background Material

- resistance, capacitance, voltage equations over time
- flip-flop timing, setup and hold times (Mano, Digital Design 6-3)
- digital view of CMOS transistor behaviors
- a tiny bit of calculus (integration in Lec-12)

5.1.3 Reading Material

There is a tremendous amount of material on delay and timing scattered throughout Smith's book.

Chapter 2: transistor and logic review
  2.1: transistor review
  2.4: combinational logic cells
  2.5: sequential logic cells
Chapter 3: fundamentals of timing and delay
  3.1 to 3.2: transistors and delay
Chapter 5: timing and delay within cells
  5.1.5 to 5.1.7: Actel cells
  5.2.4: Xilinx LCA timing
  5.4.2: Altera MAX timing
Chapter 7: timing and delay between cells
  7.1: Actel interconnect
  7.2 7.4: Xilinx LCA timing
  7.4: Altera MAX timing (constant delay for all interconnect)
Chapter 13: simulation
  13.1: levels of temporal abstraction for simulation
  13.2: simulation example
  13.5: different simulation models for hardware
  13.6: delay models
  13.7: static timing analysis
Chapter 16
  16.1.2: clock trees and timing in floorplanning
Chapter 17
  17.1.2: timing in routing

Suggestion:

- skim/read Chs 2 and 3 to refresh/learn fundamentals of delay
- skim relevant sections of Chs 4, 7, 13, 16, 17
- read in depth: 5.1.5 to 5.1.7, 7.1, 13.2, 13.6, 13.7, 16.1.2
- read remaining sections as time and interest dictate

LEC-15 Preliminaries

LEC-15: Introduction to Timing Analysis


Lecture Notes Sections: 5.2 to 5.3.4

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
wk-08 to 09: Timing Analysis
  Lec-15 Introduction
  Lec-16 Math, Physics and Applications
  Lec-17 Timing Analysis of Storage Elements
wk-09 to 10: Power Analysis and Reduction
wk-11 to 12: Faults and Testing
wk-13: Review

Overview
This lecture introduces the fundamentals of timing analysis. In particular, how do we determine the fastest clock speed that a circuit will support?

Concepts

- Minimum clock period
- Hold constraint
- Clock skew
- Clock latency
- Clock jitter
- Setup time
- Hold time
- Clock-to-Q time
- Cause and effect of timing violations
- Propagation delay
- Load delay
- Interconnect delay
- Critical path
- False path

Background
For those who took E&CE-324, there is some overlap between the material in this chapter and the material in E&CE-324. In E&CE-427, we will focus on calculating the critical path of a circuit and on techniques to calculate the timing parameters of a storage device (e.g. latch or flop). One terminology difference: what was called "margin" in E&CE-324 will be called "slack" in E&CE-427.

Reading Material

There is a tremendous amount of material on delay and timing scattered throughout Smith's book.

NOTE: Reading and exam. All of the exam material will come from the course notes, but it could be helpful to read the relevant sections in Smith's book to better understand the material.

Chapter 2: Transistor and logic review
  2.1: transistor review
  2.4: combinational logic cells
  2.5: sequential logic cells
Chapter 3: Fundamentals of timing and delay
  3.1 to 3.2: transistors and delay
Chapter 5: Timing and delay within cells
  5.1.5 to 5.1.7: Actel cells
  5.2.4: Xilinx LCA timing
  5.4.2: Altera MAX timing
Chapter 7: Timing and delay between cells
  7.1: Actel interconnect
  7.2 7.4: Xilinx LCA timing
  7.4: Altera MAX timing (constant delay for all interconnect)
Chapter 13: Simulation
  13.1: levels of temporal abstraction for simulation
  13.2: simulation example
  13.5: different simulation models for hardware
  13.6: delay models
  13.7: static timing analysis
Chapter 16: Floorplanning and placement
  16.1.2: clock trees and timing in floorplanning
Chapter 17: Routing
  17.1.2: timing in routing

Suggested Strategy for Reading

- skim/read Chs 2 and 3 to refresh/learn fundamentals of delay
- skim relevant sections of Chs 4, 7, 13, 16, 17
- read in depth:
    2.5.2 (setup and hold)
    6.5.1 (clocks)
    3.1 (timing model)
    13.2, 13.6, 13.7 (timing models and timing analysis)
    7.1 (interconnect delay)
    16.1.2 (interconnect delay)
    5.1.5 to 5.1.7 (timing analysis of storage devices)
- read remaining sections as time and interest dictate

5.2 Delays and Definitions

5.2.1 Related Background Definitions

Fanin

[Figure: gates y0 to y4 driving the inputs of x.]

Definition (fanin): The fanin of a gate or signal x is the set of all gates or signals y where an input of x is connected to an output of y.

Fanout

[Figure: x driving the inputs of gates y0 to y4.]

Definition (fanout): The fanout of a gate or signal x is the set of all gates or signals y where an output of x is connected to an input of y.

Immediate Fanin and Fanout

Figure 5.1: Immediate Fanin of x

Figure 5.2: Immediate Fanout of x

Definition (immediate fanin/fanout): The phrases "immediate fanout" and "immediate fanin" mean that there is a direct connection between the gates.

Transitive Fanin and Fanout

Figure 5.3: Transitive Fanin

Figure 5.4: Transitive Fanout

Definition (transitive fanin/fanout): The phrases "transitive fanout" and "transitive fanin" mean that there is either a direct or indirect connection between the gates.

Immediate vs. Transitive

NOTE: Immediate vs transitive fanin and fanout. Be careful to distinguish between immediate fan(in/out) and transitive fan(in/out). If "fanin" or "fanout" is not qualified with "immediate" or "transitive", check which of the two is meant. In E&CE 427, fan(in/out) will mean immediate fan(in/out).

5.2.2 Timing Constraints

For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown in table 5.1. Each of these timing parameters is described in more detail in section 5.2.3.

Minimum Clock Period

Name | Symbol | Definition
Skew | | Difference in arrival times for different clock signals
Jitter | | Difference in clock period over time
Clock-to-Q | T_CO | Delay from clock signal to Q output of flop
Interconnect | | Delay along wire
Load | | Delay due to load (fanout/consumers/readers)
Setup | T_SUD | Length of time prior to clock/enable that data must be stable

Table 5.1: Summary of delay factors for minimum clock period

Propagation Delay

Definition (Propagation Delay): Sum of interconnect and load delay.

Slack

Definition (Slack): Difference between the required value of a timing parameter and its actual value. A negative slack means that there is a timing violation. A positive slack means that the constraint for the timing parameter is satisfied.

NB: Slack was called "margin" in E&CE 324. Both terms are used commonly.

5.2.2.1 Minimum Clock Period


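[Figure: circuit with signals a, clk1, clk2, b and a timing diagram in which the clock period is divided into skew, jitter, clock-to-Q, propagation (wire + load), setup, and slack. Legend: signal is stable; signal may change; signal may rise; signal may fall.]

\[ \mathrm{ClockPeriod} > \mathrm{Skew} + \mathrm{Jitter} + T_{CO} + \mathrm{Interconnect} + \mathrm{Load} + T_{SUD} \]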

5.2.2.2 Hold Constraint


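[Figure: circuit with signals a, clk1, clk2, b and a timing diagram showing skew, jitter, and hold against clock-to-Q and propagation, with the remaining slack. Legend: signal is stable; signal may change; signal may rise; signal may fall.]

\[ \mathrm{Skew} + \mathrm{Jitter} + T_{HO} \le T_{CO} + \mathrm{Interconnect} + \mathrm{Load} \]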
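Both constraints can be checked numerically as slack values: positive slack means the constraint is met, negative slack means a violation. The sketch below is a minimal illustration; the class `PathTiming`, the function names, and all numeric values are hypothetical and are not taken from any tool or datasheet.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    """Timing parameters for one flop-to-flop path (all in ns, hypothetical)."""
    skew: float
    jitter: float
    t_co: float          # clock-to-Q
    interconnect: float
    load: float
    t_sud: float         # setup
    t_ho: float          # hold


def setup_slack(p, clock_period):
    """Clock period minus the sum of delays that must fit inside it."""
    return clock_period - (p.skew + p.jitter + p.t_co
                           + p.interconnect + p.load + p.t_sud)


def hold_slack(p):
    """Earliest data change minus the time the data must be held."""
    return (p.t_co + p.interconnect + p.load) - (p.skew + p.jitter + p.t_ho)


path = PathTiming(skew=0.3, jitter=0.2, t_co=1.0,
                  interconnect=1.5, load=2.0, t_sud=0.8, t_ho=0.4)
print(setup_slack(path, clock_period=6.0))  # positive: setup constraint met
print(hold_slack(path))                     # positive: hold constraint met
```

Note that the clock period appears only in the setup check: slowing the clock can fix a setup violation but never a hold violation.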

5.2.3 Clock-Related Timing Definitions

5.2.3.1 Clock Skew (Smith 6.5.1)


[Figure: clock tree delivering clk1, clk2, clk3, clk4, with waveforms showing the skew between them.]

Definition (Clock Skew): The difference in arrival times for the same clock edge at different flip-flops. Clock skew is caused by the difference in interconnect delays to different points on the chip.

Clock Tree Design


Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.

5.2.3.2 Clock Latency (Smith 6.5.1)


[Figure: master clock, intermediate clock, and final clock along the clock tree, with waveforms showing the latency between them.]

Definition (Clock Latency): The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively, different points in the clock generation circuitry.)

NOTE: Clock latency does not affect the limit on the minimum clock period.

5.2.3.3 Clock Jitter (Smith p. 873)

[Figure: ideal clock compared with a clock with jitter.]

Definition (Clock Jitter): Difference between actual clock period and ideal clock period.

Causes of Clock Jitter

Clock jitter is caused by:

- temperature and voltage variations over time
- temperature and voltage variations across different locations on a chip
- manufacturing variations between different parts
- etc.

5.2.4 Storage Related Timing Definitions (Smith 2.5.2)

Storage devices (latches, flip-flops, memory arrays, etc.) define setup, hold and clock-to-Q times.

Figure 5.5: Setup, hold, and clock-to-Q times for a flip flop

Forward Reference

In this section, we will use the definitions of setup, hold and clock-to-Q. Section 5.6 will show how to calculate setup, hold, and clock-to-Q times for flip flops, latches, and other storage devices.

5.2.4.1 Setup Time

Definition (Setup Time, T_SUD): Latest time before arrival of the clock edge (flip flop), or deasserting of the enable line (latch), that input data is required to be stable in order for the storage device to work correctly. If setup time is violated, current input data will not be stored; input data from the previous clock cycle might remain stored.

5.2.4.2 Hold Time

Definition (Hold Time, T_HO): Latest time after arrival of the clock edge (flip flop), or deasserting of the enable line (latch), that input data is required to remain stable in order for the storage device to work correctly. If hold time is violated, current input data will not be stored; input data from the next clock cycle might slip through and be stored.

5.2.4.3 Clock-to-Q Time

Definition (Clock-to-Q Time, T_CO): Earliest time after arrival of the clock edge (flip flop), or asserting of the enable line (latch), when output data is guaranteed to be stable.

NOTE: Require / Guarantee. Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides its environment.

5.2.4.4 Example Timing Violations

Good Timing

Figure 5.6: Good Timing (signals a, clk, b, c, d, with clock-to-Q, propagation, setup, and hold intervals marked)

Setup Violation

Figure 5.7: Setup Violation (the data arrives inside the setup window, so the values of c and d are unknown)

Hold Violation

Figure 5.8: Hold Violation (the data changes inside the hold window, so the stored value is unknown)

5.2.5 Propagation Delays


5.2.5.1 Load Delays (Smith 3.1)


Delay is proportional to load capacitance.

[Figure: schematic and timing of a simple inverter with a load (input Vi, output Vo). Input 1->0: charge output capacitance. Input 0->1: discharge output capacitance.]

Load capacitance is dependent on the fanout (how many other gates a gate drives) and how big the other gates are. Section 5.4.1 goes into more detail on timing models and equations for load delay.

5.2.5.2 Interconnect Delays (Smith 7.1)


Wires, also known as interconnect, have resistance, and there is a capacitance between parallel wires. Both of these factors increase delay.

- Wire resistance is dependent upon the material and geometry of the wire.
- Wire capacitance is dependent on wire geometry, geometry of neighboring wires, and materials.
- Shorter wires are faster.
- Fatter wires are faster.
- FPGAs have special routing resources for long wires.
- CMOS processes use higher metal layers for long wires; these layers have wires with much larger cross sections than lower levels of metal.

More on this in section 5.4.3.

5.3 Critical Paths: False and True

Definition (critical path): The slowest path on the chip between flops or between flops and pins. The critical path limits the maximum clock speed.

Three classes of paths:

- entry path: from an input to a flop
- stage path: from one flop to another flop
- exit path: from a flop to an output

Entry Path

entry path: from an input to a flop. Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax". In Xilinx timing reports, this is reported as "Maximum Delay".

Stage Path

stage path: from one flop to another flop. In Quartus timing reports, this is reported as the period associated with "Internal fmax". In Xilinx timing reports, this is reported as "Clock to Setup" and "Maximum Frequency".

Exit Path

exit path: from a flop to an output. Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax". In Xilinx timing reports, this is reported as "Maximum Delay".

5.3.1 Critical Path Example

[Figure: circuit with inputs a, b, c and signals d through m.]

Gate | Delay
NOT | 2
AND | 4
OR | 4
XOR | 6

Question: Assuming all delay and timing factors other than combinational logic delay are negligible:

- what is the critical path through this circuit?
- what is the delay along the path?

The answer to this question appears as Problem 5.3. In this circuit, it is extremely difficult to determine which path is the real critical path and which paths are false paths. There are many paths with reconvergent fanout, which greatly complicates the analysis. Most circuits are not nearly this difficult to analyze.

5.3.2 Algorithm to Find Critical Path

5.3.2.1 Critical Path Between Two Signals


The following is an algorithm to find the critical path from a source signal to a destination signal.

Basic Idea to Find Critical Path

- Start at the source node and traverse through fanout to the destination node, annotating intermediate nodes with the maximum delay to each intermediate node.
- The delay to the destination node is the delay of the critical path.
- The critical path is found by starting at the destination node and working backwards, choosing the node with maximum delay at each step.

Algorithm to Find Critical Path

1. Start at source signal
2. Set current time to 0
3. Annotate node with current time
4. For each node in immediate fanout of current node,
   (a) set current time to time of current node plus interconnect delay (if any) to the input of fanout node
   (b) annotate input of fanout node with current time
5. For each node that has times on all of its inputs but not a time for itself,
   (a) annotate the output of the node with the maximum time on the inputs to the node plus the delay through the node
   (b) go to step 4
6. To find the critical path, work backwards through fanin from destination node, choosing fanin node with maximum delay at each step.
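The annotation and back-tracking steps amount to a longest-path search over an acyclic netlist. The sketch below is one possible rendering, not the only way to organise it: the netlist encoding (a list of `(driver, reader, interconnect_delay)` edges plus a per-gate delay table), the function name `critical_path`, and the small example circuit are all hypothetical. It ignores gate behaviour, so it cannot tell true paths from false paths.

```python
from collections import defaultdict, deque


def critical_path(edges, node_delay, source, dest):
    """Longest-delay path from source to dest in an acyclic netlist.

    edges: list of (driver, reader, interconnect_delay) tuples.
    node_delay: dict mapping a node to the delay through it (default 0).
    Returns (delay, path); path is empty if dest is unreachable.
    """
    fanout = defaultdict(list)
    pending_inputs = defaultdict(int)
    nodes = {source, dest}
    for u, v, wire in edges:
        fanout[u].append((v, wire))
        pending_inputs[v] += 1
        nodes.update((u, v))

    # Arrival time at each node output; -inf marks "not reachable from source".
    arrival = {n: float("-inf") for n in nodes}
    arrival[source] = 0.0
    latest_fanin = {}

    # Visit nodes only after all of their inputs have been annotated.
    ready = deque(n for n in nodes if pending_inputs[n] == 0)
    while ready:
        u = ready.popleft()
        for v, wire in fanout[u]:
            t = arrival[u] + wire + node_delay.get(v, 0.0)
            if t > arrival[v]:
                arrival[v] = t
                latest_fanin[v] = u
            pending_inputs[v] -= 1
            if pending_inputs[v] == 0:
                ready.append(v)

    if arrival[dest] == float("-inf"):
        return None, []

    # Work backwards through the fanin with the maximum delay.
    path = [dest]
    while path[-1] != source:
        path.append(latest_fanin[path[-1]])
    return arrival[dest], path[::-1]


# Hypothetical circuit: n1 = NOT a (2), g1 = a AND b (4), g2 = n1 OR g1 (4)
edges = [("a", "n1", 0), ("a", "g1", 0), ("b", "g1", 0),
         ("n1", "g2", 0), ("g1", "g2", 0)]
delays = {"n1": 2, "g1": 4, "g2": 4}
print(critical_path(edges, delays, "a", "g2"))
```

For a set of sources, the same traversal can be started with every source annotated with time 0, as described in the next subsection.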

5.3.2.2 Critical Path Between Sets of Signals

To find the critical path from a set of source signals to a set of destination signals: run the above algorithm, but start from all source nodes. The destination of the critical path is the destination node that is annotated with the greatest delay. Run the back-tracking procedure for this destination signal of the critical path.

5.3.3 False Paths

Definition (false path): A path from a source signal to a destination signal such that changes on the source signal will not propagate along the path to cause a change on the destination signal.

There are two classes of false paths, static and dynamic. Static false paths are easier to detect, while dynamic false paths can be tedious and difficult to detect.

5.3.3.1 Static False Path Example

Question: Ignoring the behaviour of the gates, find the critical path through the circuit.

[Figure: circuit with inputs a, b, c, internal signals f, g, h, and outputs z, y.]

Gate | Delay
NOT | 2
AND | 4
OR | 4
XOR | 6

Answer: The answer follows on the next few slides.

Annotate Paths with Delays

[Figure: the circuit annotated with arrival times; the latest arrival time, at y, is 16.]

The path from a to y has a delay of 16. Check if it is a false critical path.

False Path from a to y

[Figure: the circuit annotated with arrival times and signal equations: !a, ab, !c, !b, ab + !c, !a + !b, and !b!c.]

The equation for y is !b!c, which does not contain a, so y is independent of a. In other words, changes on a do not lead to changes on y, so the path from a to y is a false path. We were able to use static analysis to determine that the path from a to y is a false path.
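The same independence test can be done by brute force instead of by simplifying the equation: an output is independent of an input if flipping that input never changes the output. The sketch below is only an illustration; the helper `depends_on` and the Boolean function `y` are hypothetical (the function is chosen so that a appears in its structure but cancels out, reducing to !b!c), and exhaustive enumeration is practical only for a handful of inputs.

```python
from itertools import product


def depends_on(func, n_inputs, index):
    """True if flipping input `index` changes func for some assignment."""
    for bits in product([False, True], repeat=n_inputs):
        flipped = list(bits)
        flipped[index] = not flipped[index]
        if func(*bits) != func(*flipped):
            return True
    return False


# Hypothetical logic: a is present structurally, but y reduces to !b!c.
def y(a, b, c):
    return (not (a and b)) and (not b) and (not c)


for name, index in (("a", 0), ("b", 1), ("c", 2)):
    print(name, depends_on(y, 3, index))
```

An input for which `depends_on` returns False cannot start a true path to that output, which is the static false-path condition used above.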

Find Next Candidate Path

[Figure: the circuit with recomputed arrival times; the latest arrival time at y is now 12.]

To find the next candidate critical path, recompute delay values along the false path. Leave all other delays the same as before. For each node along the false path, maintain two delay values. One delay is the value already calculated. The other delay value is the maximum delay to that node, ignoring the prefix of the false path. The prefix of a false path is the set of nodes whose fanin comes only from false paths.

Candidate Path

[Figure: the circuit with two delay values, such as (0,2), (0,4), and (4,8), on the nodes along the false path.]

The next candidate is from b to y. Static analysis shows that b is in the equation for y, so static analysis cannot detect whether this is a false path. We must use dynamic analysis.

5.3.3.2 Dynamic False Path Example


Question: Determine if the critical path you found in the previous question is a real critical path or a false path. If it is a false path, nd the real critical path and its delay.

Answer:

Test Candidate Path

Try to push a rising edge from source to destination; assign values to nodes not on the critical path that allow the rising edge to propagate.

[Figure: the circuit with a rising edge applied at b and constant values assigned to the side inputs.]

The rising edge fails to generate a change on y.

Test Candidate Path

Try to push a falling edge from source to destination.

[Figure: the circuit with a falling edge applied at b and constant values assigned to the side inputs.]

Both rising and falling edges failed to generate a change on the output, therefore we have found another false path.

NB: Pushing edges forward is not a smart way to explore candidate critical paths, because this technique does not help isolate the cause of the false path. Pushing edges backwards will identify the cause of the false path.

Test Candidate Path Intelligently

[Figure: the circuit with a rising edge pushed backwards from y towards b.]

Try to push a rising edge backwards along the path between b and y. This gives a contradictory assignment for b, therefore it is a false path.

Reconvergent Fanout

[Figure: the circuit highlighting the two paths from b to y.]

There are two paths from the point of contradictory assignment to y. This is reconvergent fanout. Reconvergent fanout is the most common cause of false paths. It also causes problems with fault detection (Chapter 7).

Pushing Edge with Reconvergent Fanout

[Figure: the circuit with an edge placed on the node in the reconvergent fanout.]

Try to push a rising edge backwards along the path, but put an edge (not a constant) on the node in the reconvergent fanout. This gives contradictory assignments to b.

Next Candidate Path

[Figure: circuit with recomputed delay values along the false path]

To find the next candidate critical path, recompute the delay values for nodes along the false path. Leave all other delays the same as before. To recompute delay along a false path, ignore the prefix of the false path. The prefix is the set of nodes whose fanin comes only from false paths.

Shortcut for Candidate Paths

As a shortcut, you do not need to maintain two delay values for nodes in the suffix of the false path. The suffix is the set of nodes that fan out only to the false path. The nodes in the suffix do not need to keep their old delay value. They only need their new delay value.

Next Candidates

[Figure: circuit annotated with the recomputed delay values used to pick the next candidate paths]

Test First Candidate

[Figure: value assignments on the circuit for the first candidate path]

(*CHANGE ver2 (2002/12/02): corrected edge polarity on a *) Propagate a rising edge backwards. It works!

Test Second Candidate

[Figure: value assignments on the circuit for the second candidate path]

Propagate a rising edge backwards. It works!


Summary
There are two paths with a delay of 10: one from a to z and one from c to y. We can push edges along both of these paths, so they are real critical paths. Note that different values on b result in different critical paths.

5.3.3.3 Another Dynamic False Path Example

Question: Find the false critical path in the circuit below.

[Figure: circuit with primary inputs a, b, c, d, e and signals f, g, h, i, j, k]

Answer:

[Figure: the same circuit with value assignments; the conflicting requirement is marked "0 /= 1"]

5.3.3.4 And Another Dynamic False Path Example

Question: Find the real critical path in the circuit below.

[Figure: circuit with inputs a, b, c and outputs x, y, z; one gate is labelled delay=8 and another delay=2]

First Candidate

Answer:

[Figure: the circuit annotated with delay values, including the gates labelled delay=8 and delay=2]

This is a false path; we saw it in an earlier problem.

Second Candidate

[Figure: the circuit annotated with old and recomputed delay values, including the gates labelled delay=8 and delay=2]

The real critical path is the path from a to z, which has a delay of 14.

5.3.3.5 Algorithm for False Path Detection

To determine if a path through a circuit is a false path:
1. Start at the destination node of the path and try to push a 1→0 or 0→1 edge backwards along the candidate critical path.
2. Follow the critical path backwards. At each gate, assign values (0 or 1) to the non-critical input signals according to the rules in figure 5.9. If there is reconvergent fanout, you can assign 1→0 or 0→1 to non-critical inputs; otherwise you must use just 0 or 1.
3. If you assign different values to the same signal, then the candidate critical path is a false path.
4. If you do not assign different values to the same signal, then the assignments calculated along the path give values that will exercise the critical path.
5. Push the values on non-critical nodes back to the primary inputs to give an assignment that will exercise the critical path.
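The contradiction check in steps 3 and 4 can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecture's procedure: instead of pushing edges backwards, it brute-forces all primary-input assignments and asks whether every side input along the candidate path can hold its required (non-controlling) value at the same time. It does not model the extra reconvergent-fanout rules that allow edges on side inputs. The function name `find_sensitizing_assignment`, the signal `s`, and both example paths are hypothetical.

```python
from itertools import product

def find_sensitizing_assignment(side_inputs, primary_inputs):
    """Return an input assignment that exercises the path, or None if the path is false.

    side_inputs: list of (function, required_value) pairs, one per non-critical
    gate input along the candidate path. Each function maps a dict of
    primary-input values to 0 or 1.
    """
    for values in product((0, 1), repeat=len(primary_inputs)):
        env = dict(zip(primary_inputs, values))
        if all(f(env) == required for f, required in side_inputs):
            return env
    return None

# Hypothetical path 1: a -> AND(a, s) -> AND(n1, NOT s)
# Both AND gates need their side input at 1 (the non-controlling value).
false_path = [(lambda e: e["s"], 1), (lambda e: 1 - e["s"], 1)]

# Hypothetical path 2: a -> AND(a, s) -> OR(n1, NOT s)
# The OR gate needs its side input at 0 (the non-controlling value).
real_path = [(lambda e: e["s"], 1), (lambda e: 1 - e["s"], 0)]

print(find_sensitizing_assignment(false_path, ["s"]))  # None
print(find_sensitizing_assignment(real_path, ["s"]))   # {'s': 1}
```

Path 1 needs s = 1 and s = 0 at once, which is the "different values to same signal" case of step 3. Path 2 has a consistent assignment, which is the case of step 4.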

Rules for Pushing Edges

[Figure: gate diagrams in two panels, "General rules" and "Additional rules for reconvergent fanout"]

Figure 5.9: Rules for pushing rising and falling edges through gates

Reconvergent Fanout Rules

Question: Why do the rules for reconvergent fanout have only rising edges for AND gates and falling edges for OR gates?

Answer:

[Figure: example with signals a, b, and c]

A falling edge on the non-critical path will cause the output to change before the edge on the critical path affects the output.

Analyzing Rules for Reconvergent Fanout

The pictures below show all combinations of output edge (rising or falling) and input values (constant 1, constant 0, rising edge, falling edge) for AND, OR, NAND, and NOR gates. The pictures that are crossed out illustrate combinations of inputs and outputs that are contradictory to the behaviour of the gate.

Reconvergent for AND

[Figure: AND-gate input/output edge combinations; crossed-out cases are annotated "0 is controlling", "1 glitch on output", and "constant 0 output"]

Reconvergent for OR

[Figure: OR-gate input/output edge combinations; crossed-out cases are annotated "1 is controlling", "constant 1 output", and "0 glitch on output"]

Reconvergent for NAND

[Figure: NAND-gate input/output edge combinations; crossed-out cases are annotated "0 is controlling", "0 glitch on output", and "constant 0 output"]

Reconvergent for NOR

[Figure: NOR-gate input/output edge combinations; crossed-out cases are annotated "1 is controlling", "constant 1 output", and "0 glitch on output"]

5.3.4 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we have assumed that all signals have the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.

LEC-16 Preliminaries

LEC-16: Math, Physics, and Applications of Timing Analysis


Lecture Notes Sections: 5.4 to 5.5.2.2

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
wk-08 to 09: Timing Analysis
    Lec-15 Introduction
    Lec-16 Math, Physics and Applications
    Lec-17 Timing Analysis of Storage Elements
wk-09 to 10: Power Analysis and Reduction
wk-11 to 12: Faults and Testing
wk-13: Review

Overview
This lecture looks at the analog equations that affect delay and relates them to the digital world.

Concepts

- Timing model
- Data-dependent delay
- Propagation delay
- Load delay
- Interconnect delay
- Elmore time constant
- Extrinsic delay
- Intrinsic delay
- Worst case timing
- Derating factors
- Speed binning

5.4 Analog Effects in Timing Analysis

5.4.1 Timing Model (Smith 3.1, 13.6)

[Figure: timing model of an inverter with input Vi, output Vo, resistors Rpu and Rpd, and capacitors Cp and Cout]

Rpu: pull up resistor in p-tran
Rpd: pull down resistor in n-tran
Cp: parasitic capacitance
Cout: load capacitance

5.4.1.1 Equation for Output Voltage

Output voltage when Vo discharges through Rpd (Equation 3.1 from Smith):

$$V_o(t) = V_{DD} \, e^{-t / (R_{pd}(C_p + C_{out}))}$$

Measuring Delay Through an Inverter

[Figure: Vin and Vout waveforms of the inverter, with the voltage levels 0, 0.35 Vdd, 0.65 Vdd, and Vdd marked]

To measure delay through an inverter, what voltage levels do we use?

Defining Trip Points

Definition Trip Points: A high or 1 trip point is the voltage level where an upwards transition means the signal represents a 1. A low or 0 trip point is the voltage level where a downwards transition means the signal represents a 0.

[Figure: voltage axis with levels 0, 0.35 Vdd, 0.65 Vdd, and Vdd marked]

Picking Trip Points

We need to pick our trip points; these then determine the start and stop times for measuring delay. Pick the trip points to simplify the delay equation. Pick trip points of 0.35/0.65:

- low-voltage (0) trip point of 0.35 Vdd
- high-voltage (1) trip point of 0.65 Vdd

Trip Points and Delay Equation

Delay equation for falling output with 0.35 trip point:

$$0.35\,V_{DD} = V_{DD}\, e^{-T_{PD} / (R_{pd}(C_p + C_{out}))}$$

Solving for $T_{PD}$, using $\ln(1/0.35) \approx 1$, doing some more approximations:

$$T_{PD} \approx R_{pd}(C_p + C_{out})$$
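As a quick numeric check of the approximation $\ln(1/0.35) \approx 1$, the sketch below compares the exact 0.35-trip-point delay with the simplified product of resistance and capacitance. The component values (1 kΩ, 10 fF, 40 fF) and the function name `falling_delay` are hypothetical and chosen only for illustration.

```python
import math

def falling_delay(r_pd, c_p, c_out, trip=0.35):
    """Time for Vo to fall from VDD to trip * VDD while discharging through r_pd."""
    return r_pd * (c_p + c_out) * math.log(1.0 / trip)

# Hypothetical values: 1 kOhm pull-down, 10 fF parasitic, 40 fF load.
r_pd, c_p, c_out = 1e3, 10e-15, 40e-15

exact = falling_delay(r_pd, c_p, c_out)
approx = r_pd * (c_p + c_out)  # uses ln(1/0.35) ~= 1

print(f"exact  = {exact:.3e} s")
print(f"approx = {approx:.3e} s")
print(f"ratio  = {exact / approx:.4f}")
```

The ratio is ln(1/0.35), roughly 1.05, so the simplified equation underestimates the delay by about 5%.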

Some Rough Intuition

$$T_{PD} \approx R_{pd}(C_p + C_{out})$$

- A larger transistor has a lower resistance, but a higher capacitance.
- Resistance affects timing of source (driving) signals.
- Capacitance affects (mostly) timing of destination (load) signals.
- Decreasing resistance increases the current through drivers.
- Increasing capacitance slows down (dis)charging of load capacitors.

5.4.1.2 Extrinsic / Intrinsic Delays (Smith 13.6)

Definition intrinsic delay: Delay resulting from pull(up/down) resistor and parasitic capacitance.

Definition extrinsic delay: Delay resulting from load capacitance.

5.4.2 Data-Dependent Delay

Sometimes the delay through a component is dependent upon the values on signals.

- In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries are generated at all.
- In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a 1.
- Some implementation technologies (e.g. NMOS and exotic latches) have faster transitions from 1→0 than 0→1.

Analysis and Accuracy

Because of these effects, the most accurate delay analysis requires looking at the actual data values that will occur in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder.
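The ripple-carry adder case can be illustrated with a small sketch that measures how far a carry ripples for a given pair of operands. This is an illustrative model only: it counts bit positions rather than real gate delays, and the function name `longest_carry_chain` is hypothetical.

```python
def longest_carry_chain(a, b, width):
    """Length (in bit positions) of the longest carry ripple when adding a and b."""
    best = chain = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                 # generate: a new carry starts at this bit
            chain = 1
        elif (ai ^ bi) and chain:   # propagate: an incoming carry ripples onward
            chain += 1
        else:                       # kill, or no incoming carry to propagate
            chain = 0
        best = max(best, chain)
    return best

# A carry generated at the LSB ripples all the way to the MSB.
print(longest_carry_chain(0b1111, 0b0001, 4))  # 4
# No bit position generates a carry, so nothing ripples.
print(longest_carry_chain(0b0101, 0b1010, 4))  # 0
```

A timing simulation that only ever applies operands like the second pair would never exercise the long carry path.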

NOTE: Asynchronous circuits Data dependent delays are one motivation for asynchronous circuits. Asynchronous circuits are still an active area of research, but are beginning to be used in commercial circuits.

5.4.3 Interconnect Delay (Smith 7.1)

5.4.3.1 Elmore Time Constant (Smith 7.1.2)

Elmore time constants are used to analyze interconnect delay with intermediate connections and/or fanout.

$D_i$ = Elmore time constant for node $i$ ($n$ is the number of nodes in the circuit):

$$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$$

$ER_{k,i}$ = resistance along the path from node $i$ to the source that is also on the path from node $k$ to the source

$V_i(t)$ = the voltage on node $i$ (capacitor $i$) at time $t$:

$$V_i(t) \approx e^{-t/D_i}$$

Elmore Time Constant

$$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$$

If we:

- approximate $V_i(t)$ as an exponential waveform, and
- use 0.35/0.65 trip points

then the delay from the source to node $i$ is $D_i$ seconds.

5.4.3.2 Interconnect with Single Fanout

This is similar to the example in Smith 7.1.3, except that Smith has one more wire segment (L4) between the gates.

[Figure: gates G1 and G2 connected through three wire segments and four antifuses, shown as a layout and as an RC chain: G1 (Rpu, Rpd, Cp, input Vi), then Ra1, Rw1, C1, Ra2, Rw2, C2, Ra3, Rw3, C3, Ra4, and the gate capacitance CG2]

G*: gate
C*: capacitance on wire
Rw*: resistance through wire
Ra*: resistance through antifuse

Question:

Calculate delay from gate 1 to gate 2

Answer:

Gate 2 represents node 4 on the RC tree.

$$D_4 = \sum_{k=1}^{4} ER_{k,4}\, C_k = ER_{1,4} C_1 + ER_{2,4} C_2 + ER_{3,4} C_3 + ER_{4,4} C_4$$

$$D_4 = (R_{a1} + R_{w1})\, C_1 + (R_{a1} + R_{w1} + R_{a2} + R_{w2})\, C_2 + (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3})\, C_3 + (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3} + R_{a4})\, C_{G2}$$

Approximate $R_a \gg R_w$:

$$D_4 = R_{a1} C_1 + (R_{a1} + R_{a2})\, C_2 + (R_{a1} + R_{a2} + R_{a3})\, C_3 + (R_{a1} + R_{a2} + R_{a3} + R_{a4})\, C_{G2}$$

Approximate $R_{ai} = R_{aj}$:

$$D_4 = 4 R_a C_{G2} + 3 R_a C_3 + 2 R_a C_2 + R_a C_1$$


Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?

Answer:

Doubling Antifuses Answer

$$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$$

Assume all resistances and capacitances are the same values (R and C), and assume that all intermediate nodes are along the path between the two gates of interest.

$$ER_{k,i} = k R$$

$$D_i = \sum_{k=1}^{n} k\, R C$$

Antifuse Doubling (Cont'd)

Using the mathematical theorem:

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} \approx \frac{n^2}{2}$$

We simplify the delay equation:

$$D_i = \sum_{k=1}^{n} k\, R C \approx \frac{n^2}{2} R C$$

We see that the delay is proportional to the square of the number of antifuses along the path.
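The quadratic growth can be checked numerically. The sketch below sums k·RC for a chain of n identical segments and prints the ratio between the delay with 2n segments and the delay with n segments. The function name `chain_delay` and the unit values R = C = 1 are assumptions for illustration.

```python
def chain_delay(n, r=1.0, c=1.0):
    """Elmore delay to the last node of a chain of n identical RC segments."""
    return sum(k * r * c for k in range(1, n + 1))

for n in (4, 8, 16, 32):
    ratio = chain_delay(2 * n) / chain_delay(n)
    print(n, chain_delay(n), round(ratio, 3))
```

Doubling the number of antifuses multiplies the delay by a factor that approaches 4 as n grows (3.6 when going from 4 to 8 segments), which matches the n squared approximation.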

5.4.3.3 Interconnect with Multiple Gates in Fanout

[Figure: source gate G1 driving gates G2 and G3 through antifuse interconnect]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.

Answer:

1. There are a total of 7 nodes in the circuit ($n = 7$).

2. Label interconnect with resistance and capacitance identifiers.

[Figure: interconnect between G1, G2, and G3 labelled with resistances R1 to R6 and capacitances C1 to C7]

3. Draw RC tree

[Figure: RC tree driven by G1 (Rpu, Rpd, Cp, input Vi), with nodes n1 to n7, resistances R1 to R6, and capacitances C1 to C7; G2 is at n5 and G3 is at n7]

4. G2 is node 5 in the circuit ($i = 5$).

5. Elmore delay equations

$$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$$

$$D_5 = \sum_{k=1}^{7} ER_{k,5}\, C_k = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7$$

6. Elmore resistances

ER_{1,5} = R1                = R
ER_{2,5} = R1 + R2           = 2R
ER_{3,5} = R1 + R2           = 2R
ER_{4,5} = R1 + R2 + R3      = 3R
ER_{5,5} = R1 + R2 + R3 + R4 = 4R
ER_{6,5} = R1 + R2           = 2R
ER_{7,5} = R1 + R2           = 2R

7. Plug resistances into delay equations

$$D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$$

Delay from G1 to G3

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G3.

Answer:

1. G3 is node 7 in the circuit ($i = 7$).

2. Elmore delay equations

$$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$$

$$D_7 = \sum_{k=1}^{7} ER_{k,7}\, C_k = ER_{1,7} C_1 + ER_{2,7} C_2 + ER_{3,7} C_3 + ER_{4,7} C_4 + ER_{5,7} C_5 + ER_{6,7} C_6 + ER_{7,7} C_7$$

3. Elmore resistances

ER_{1,7} = R1                = R
ER_{2,7} = R1 + R2           = 2R
ER_{3,7} = R1 + R2           = 2R
ER_{4,7} = R1 + R2           = 2R
ER_{5,7} = R1 + R2           = 2R
ER_{6,7} = R1 + R2 + R5      = 3R
ER_{7,7} = R1 + R2 + R5 + R6 = 4R

4. Plug resistances into delay equations

$$D_7 = R C_1 + 2R C_2 + 2R C_3 + 2R C_4 + 2R C_5 + 3R C_6 + 4R C_7$$

Delay to G2 vs G3

Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?

Answer:

1. Equations for delay to G2 ($D_5$) and G3 ($D_7$)

$$D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$$

$$D_7 = R C_1 + 2R C_2 + 2R C_3 + 2R C_4 + 2R C_5 + 3R C_6 + 4R C_7$$

2. Difference in delays

$$D_5 - D_7 = R C_4 + 2R C_5 - R C_6 - 2R C_7$$

3. Compare capacitances

$$C_4 \approx C_6, \qquad C_5 \approx C_7$$

4. Conclusion: delays are approximately equal.
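The Elmore sums for D5 and D7 can be reproduced with a short sketch. The tree topology below is an assumption inferred from the Elmore resistances listed in the two answers: n1 and n2 in series from the source, n3 joined to n2 by a wire of negligible resistance, one branch n3, n4, n5 (G2) and a second branch n3, n6, n7 (G3). The function name `elmore_delay` and the unit values for R and C are hypothetical.

```python
def elmore_delay(parent, r_to_parent, cap, target):
    """Elmore delay D_i = sum over k of ER(k, i) * C_k, for node i = target.

    parent[k] is the node one step closer to the source (None at the source);
    r_to_parent[k] is the resistance of that step.
    """
    def path_edges(node):
        edges = set()
        while parent[node] is not None:
            edges.add(node)          # an edge is named by its child node
            node = parent[node]
        return edges

    target_edges = path_edges(target)
    total = 0.0
    for k, c_k in cap.items():
        shared = path_edges(k) & target_edges   # resistors common to both paths
        total += sum(r_to_parent[e] for e in shared) * c_k
    return total

R = 1.0
# Node 0 is the source. Assumed topology, see the lead-in.
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 3, 7: 6}
r_to_parent = {1: R, 2: R, 3: 0.0, 4: R, 5: R, 6: R, 7: R}
cap = {k: 1.0 for k in range(1, 8)}   # equal capacitances, C = 1

d5 = elmore_delay(parent, r_to_parent, cap, target=5)
d7 = elmore_delay(parent, r_to_parent, cap, target=7)
print(d5, d7)  # 16.0 16.0
```

With equal capacitances both delays come out to 16 RC, in line with the conclusion that the delays to G2 and G3 are approximately equal.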

5.4.3.4 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect. When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.

5.5 Practical Usage of Timing Analysis

5.5.1 Speed Binning (Smith 5.1.6)

Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably. Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run. A speed bin is the clock speed that chips will be labeled with when sold. Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your overstressed hardware will).

5.5.2 Worst Case Timing (Smith 5.1.7)

5.5.2.1 Fanout delay

Table 5.2 (Fanout delay) combines two separate parameters:

- capacitive load delay
- interconnect delay

into a single parameter (fanout). This is common, and fine. But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.

5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.

- As temp goes up, delay goes up.
- As supply voltage goes up, delay goes down.

Temperature

As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current. As temp goes up, the resistivity of wires goes up. So as temp goes up, delay goes up.

Supply Voltage

As supply voltage goes up, current goes up (V = IR). As current goes up, the time to charge load capacitors to the threshold voltage goes down. So as supply voltage goes up, delay goes down.

Derating Factor Definition

A derating factor is a number to adjust timing numbers to account for different temperature and voltage conditions. Excerpt from table 5.3 in book (Actel Act 3 derating factors):

Derating factor   Temp    Vdd
1.17              125C    4.5V
1.00              70C     5.0V
0.63              -55C    5.5V
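Applying a derating factor is a single multiplication, sketched below. The pairing of each factor with a temperature and supply voltage follows one reading of the flattened excerpt (first factor with first temperature and first voltage, and so on), which is an assumption. The 10 ns nominal delay and the function name `derated_delay` are hypothetical.

```python
# (temperature in C, Vdd in V) -> derating factor.
# The pairing of rows is an assumed reading of the excerpt.
DERATING = {
    (125, 4.5): 1.17,
    (70, 5.0): 1.00,
    (-55, 5.5): 0.63,
}

def derated_delay(nominal_ns, temp_c, vdd):
    """Scale a nominal delay by the derating factor for the given conditions."""
    return nominal_ns * DERATING[(temp_c, vdd)]

nominal = 10.0  # hypothetical nominal delay in ns
for conditions in DERATING:
    print(conditions, round(derated_delay(nominal, *conditions), 2), "ns")
```

Under this reading, a path that takes 10 ns at the nominal conditions takes about 11.7 ns hot at low voltage and about 6.3 ns cold at high voltage.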

LEC-17 Preliminaries

LEC-17: Timing Analysis (Latches and Flip Flops)


Lecture Notes Sections: 5.6 to 5.6.7

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
wk-08 to 09: Timing Analysis
    Lec-15 Introduction
    Lec-16 Math, Physics and Applications
    Lec-17 Timing Analysis of Storage Elements
wk-09 to 10: Power Analysis and Reduction
wk-11 to 12: Faults and Testing
wk-13: Review

Concepts
Setup, hold, and clock-to-Q time calculations for the following circuits:

- Latch
- Master/Slave flip flop
- Exotic flops
- Hierarchical FPGA cell

We won't have time to cover all of these in lecture. Hierarchical FPGA is in Smith. Exotic flop is for your interest and buzz-word completeness in interviews; it will not be on the final exam.

5.6 Timing Analysis of Latches and Flip Flops

5.6.1 Simple Latch

Two modes for latch:

- loading data
    - loads input data into storage circuitry
    - input data passes through to output
- using stored data
    - input signal is disconnected from output
    - storage circuitry drives output

[Figure: latch schematic with clk input and output o]

Two Modes for Latch

[Figure: the latch drawn in its two modes, "Loading / pass-through mode" and "Storage mode", with input i and output o]

Implementing a Latch

[Figure: "Multiplexor: symbol and implementation", with inputs a, b, select s (sel), and output o]

[Figure: "Latch implementation", with inputs d, clk and output o]

Latch Glitching

[Figure: latch implementation with inputs d and clk]

NOTE: inverters on sel. Both of the inverters on the sel signal are needed. Together, they prevent a glitch on the OR gate when sel is deasserted. If there were only one inverter, a glitch would occur. For more on this, see section 5.6.3.2.

Loading 0

[Figure: latch circuit with signal values while loading a 0 (d=0, clk=1)]

Loading 1

[Figure: latch circuit with signal values while loading a 1 (d=1, clk=1)]

Storing 0

[Figure: latch circuit with signal values while storing a 0 (clk=0, o=0)]

Storing 1

[Figure: latch circuit with signal values while storing a 1 (clk=0, o=1)]

Timing Analysis Strategy

The key to calculating setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexor)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexor)

Storage Path with Gating Gate

[Figure: latch circuit with clk=0, highlighting the storage path and its gating gate]

Load Path with Gating Gate

[Figure: latch circuit with clk=1, highlighting the load path and its gating gate]


Clock-to-Q
NOTE: Clock-to-Q for latches For latches, clock-to-Q times are measured with respect to the clock edge that connects the data input to the output. For active-high latches, this is a rising edge.


Setup and Hold


NOTE: Setup and hold time for latches For latches, hold time and setup time are measured with respect to the clock edge that disconnects the data input from the output. For active-high latches, this is a falling edge. Hold time is concerned with the next data value sneaking in before the latch goes into storage mode. Setup time is concerned with the previous data value still being in the storage circuitry when the input is disconnected.

Requirements and Guarantees

NOTE: Requirements vs. Guarantees. For a storage device, the setup and hold times are requirements that the device imposes upon its environment. The clock-to-Q time is a guarantee. If the environment satisfies the setup and hold times, then the storage device guarantees that it will satisfy the clock-to-Q time.


Storage Devices and Signals


NOTE: Storage devices vs. Signals We can talk about the setup and hold time of a signal or of a storage device. For a storage device, the setup and hold times are requirements that it imposes upon all environments in which it operates. For an individual signal in a circuit, there is a setup and hold time, which is the amount of time that the signal is stable before and after a clock edge.

5.6.2 Clock-to-Q Time of a Simple Latch

[Figure: latch circuit with signals d, clk, l1, l2, c2, cn, qn, s1, s2, q]

Figure 5.10: Latch for Clock-to-Q Analysis

[Figure: timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2, with the clock-to-Q interval marked]

Calculate clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit.

5.6.3 Setup Timing of a Simple Latch

[Figure: latch circuit with signals d, clk, l1, l2, c2, cn, qn, s1, s2, q]

Figure 5.11: Latch for Setup Analysis

[Figure: timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2, showing setup plus slack]

Figure 5.12: Setup OK: goal is to store

[Figure: latch circuit and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2, showing setup with negative slack]

Figure 5.13: Setup Violation

[Figure: latch circuit and timing diagram for the same signals, with the setup interval marked]

Figure 5.14: Minimum Setup Time

[Figure: latch circuit and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2, with the setup interval marked]

Minimum Setup Time

The new data value must arrive at s1 before cn is asserted. Otherwise, the previous data value will affect the storage circuitry when the data input is disconnected. Setup time is the difference between the delay of the path from d to s1 and the delay of the path from clk to cn.

5.6.3.1 Hold Time of a Simple Latch

[Figure: latch circuit with signals d, clk, cn, s2, s1, l1, c2, l2, qn, q]

Figure 5.15: Latch for Hold Analysis

[Figure: timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2, showing hold plus slack]

Figure 5.16: Hold OK: goal is to store

[Figure: latch circuit and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2, showing hold with negative slack]

Figure 5.17: Hold violation: slips through to q

[Figure: latch circuit and timing diagram for the same signals, with the hold interval marked]

Figure 5.18: Minimum Hold Time

The next data value must not affect l1 before c2 deasserts. Hold time is the difference between the delay of the path from clk to c2 and the delay of the path from d to l1.
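The two "difference between paths" rules for the simple latch amount to two subtractions, sketched below. The path names follow the signal names in the figures (d, s1, cn, c2, l1), but the delay values are hypothetical, since the notes do not give gate delays for this latch. The function name `latch_setup_hold` is also an assumption.

```python
def latch_setup_hold(d_to_s1, clk_to_cn, clk_to_c2, d_to_l1):
    """Setup and hold times of the simple latch from its path delays."""
    setup = d_to_s1 - clk_to_cn   # data must reach s1 before cn is asserted
    hold = clk_to_c2 - d_to_l1    # data must not reach l1 before c2 deasserts
    return setup, hold

# Hypothetical path delays, in units of one gate delay.
setup, hold = latch_setup_hold(d_to_s1=4.0, clk_to_cn=1.0, clk_to_c2=2.0, d_to_l1=1.0)
print(setup, hold)  # 3.0 1.0
```

A longer data path increases the setup time and decreases the hold time, while a longer clock path does the opposite.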

5.6.3.2 Example of a Bad Latch

[Figure: latch circuit and timing diagram for d, l1, l2, qn, q, s1, s2, clk, c2, cn]

5.6.4 Timing Analysis of a Transmission Gate Latch

5.6.4.1 Transmission Gate (Smith 2.4.3)

[Figure: transmission gate with select s, input i, and output o, in panels "Symbol", "Implementation", "Open", "Closed", "Transmit 1", and "Transmit 0"]

Transmission gate as switch

5.6.4.2 Transmission Gate Latch (Smith 2.5.1)

[Figure: transmission gate latch with inputs d, clk and output q]

Loading Data into Latch

[Figure: transmission gate latch with signal values while loading data]

Using Stored Data from Latch

[Figure: transmission gate latch with signal values while using stored data]

5.6.4.3 Clock-to-Q Delay for Latch

[Figure: transmission gate latch with the clock-to-Q path from clk to q highlighted]

5.6.4.4 Setup and Hold Times for Latch

Setup Time for Latch

[Figure: transmission gate latch with path1 and path2 marked for setup analysis]

Setup time = path1 - path2

Hold Time for Latch

[Figure: transmission gate latch with path1 and path2 marked for hold analysis]

Hold time = path1 - path2

5.6.5 Falling Edge Flip Flop (Smith 2.5.2)

[Figure: flip flop built from two latches with enable (EN) inputs and internal signal m between them, plus a timing diagram for d, clk, m, clk_b, q with two intervals marked "??"]

5.6.5.1 Behaviour of Flip-Flop

[Figure: the flip flop and a timing diagram for d, clk, m, clk_b, q, with the intervals TInv, Tmd, Latch Setup, and Latch Clock-Q marked]

TInv: delay through an inverter
Tmd: propagation delay from m to d

5.6.5.2 Clock-to-Q of Flip-Flop

[Figure: the flip flop and a timing diagram for d, clk, m, clk_b, q, with the intervals Tinv, Latch Clock-to-Q, and Flop Clock-to-Q marked]

$$T_{CO}^{Flop} = T_{Inv} + T_{CO}^{Latch}$$

5.6.5.3 Setup of Flip-Flop

[Figure: the flip flop and a timing diagram for d, clk, m, clk_b, q, with the clock path (Tinv), the data path (Tmd, Latch Setup), and Flop Setup marked]

$$T_{SUD}^{Flop} = T_{md} + T_{SUD}^{Latch} - T_{Inv}$$

5.6.5.4 Hold of Flip-Flop

[Figure: the flip flop and a timing diagram for d, clk, m, clk_b, q, with the hold time for the latch and the hold time for the flop marked]

The hold time of the flip flop is the same as the hold time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.

$$T_{HO}^{Flop} = T_{HO}^{Latch}$$

5.6.6 Timing Analysis of FPGA Cells (Smith 5.1.5)

5.6.6.1 Standard Timing Equations

$T_{PD}$: delay from D-inputs to storage element
$T_{CLKD}$: delay from clk-input to storage element
$T_{OUT}$: delay from storage element to output

Setup time = slowest D path - fastest clk path:

$$T_{SUD} = T_{PD,Max} - T_{CLKD,Min}$$

Hold time = slowest clk path - fastest D path:

$$T_{HO} = T_{CLKD,Max} - T_{PD,Min}$$

Delay clk to Q = clk path + output path:

$$T_{CO} = T_{CLKD} + T_{OUT}$$

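5.6.6.2 Hierarchical Timing Equations

Add combinational logic to inputs, clock, and outputs of storage element.

[Figure: storage element with parameters t_SUD, t_HO, t_CO, surrounded by combinational logic with delay t_PD from the data inputs to d, t_CLKD from clk to the storage element clock, and t_OUT from q to the output]

$$T_{SUD} = t_{SUD} + t_{PD,Max} - t_{CLKD,Min}$$

$$T_{HO} = t_{HO} + t_{CLKD,Max} - t_{PD,Min}$$

$$T_{CO} = t_{CO} + t_{CLKD,Max} + t_{OUT,Max}$$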

5.6.6.3 Actel Act 2 Logic Cell

Timing analysis of Actel Act 2 logic cell (Smith 5.1.5).

Actel ACT

- Basic logic cells are called Logic Modules
- ACT 1 family: one type of Logic Module (see Figure 5.1, Smith p. 192)
- ACT 2 and ACT 3 families: use two different types of Logic Module (see Figure 5.4, Smith p. 198)
    - C-Module (Combinatorial Module): combinational logic similar to the ACT 1 Logic Module but capable of implementing a five-input logic function
    - S-Module (Sequential Module): C-Module + Sequential Element (SE) that can be configured as a flip-flop

Actel Timing

- ACT family: (see Figure 5.5, Smith p. 200)
- Simple. Why? Only logic inside the chip
- Not exact delay (as no place and route, physical layout, hence not accounting for interconnection delay)
- Non-Deterministic Actel Architecture
- All primed parameters inside S-Module are assumed
- Calculate tSUD, tH, and tCO
- The combinational logic delay of 3 ns: 0.4 ns went into increasing the setup time, tSUD, and 2.6 ns went into increasing the clock-output delay, tCO.
- From outside we can say that the combinational logic delay is buried in the flip-flop setup time

Actel Latch

[Figure: "Simple Actel-style latch" (inputs d, clk; output q) and "Actel latch with active-low clear" (inputs d, clk, clr; output q)]

[Figure: "Actel flop with active-low clear" (inputs d, clk, clr)]

[Figure: "Actel sequential module": a C-Module (inputs d00, d01, d10, d11, a1, b1, a0, b0) feeding an SE-Module (signals m, se_clk, se_clk_n; inputs clk, clr)]

5.6.6.4 Timing Analysis of Actel Sequential Module

Timing parameters for Actel latch with active-low clear

T_SUD   0.4ns
T_HO    0.9ns
T_CO    0.4ns

Other given timing parameters

C-Module delay (tPD)                         3ns
tCLKD (from clk to se_clk and se_clk_n)      2.6ns

Timing of Actel Module

Question: What are the setup, hold, and T_CO times for the entire Actel sequential module?

Answer:

See Smith p. 199. Use Smith's eqn 5.15, 5.16, and assume t_CLKD = 2.6ns.

T_SUD   0.8ns
T_HO    0.5ns
T_CO    3.0ns
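The answer can be reproduced by plugging the given parameters into the hierarchical timing equations of section 5.6.6.2. In the sketch below, the function name `module_timing` is hypothetical, the minimum and maximum values of each delay are assumed to be equal, and no output logic is assumed (t_OUT = 0).

```python
def module_timing(t_sud, t_ho, t_co, t_pd_max, t_pd_min,
                  t_clkd_max, t_clkd_min, t_out_max=0.0):
    """Setup, hold, and clock-to-Q of a storage element wrapped in logic."""
    setup = t_sud + t_pd_max - t_clkd_min
    hold = t_ho + t_clkd_max - t_pd_min
    clk_to_q = t_co + t_clkd_max + t_out_max
    return setup, hold, clk_to_q

# Latch parameters and given delays from the notes, all in ns.
# Assumption: min and max delays are equal, and t_OUT is 0.
setup, hold, clk_to_q = module_timing(
    t_sud=0.4, t_ho=0.9, t_co=0.4,
    t_pd_max=3.0, t_pd_min=3.0,
    t_clkd_max=2.6, t_clkd_min=2.6,
)
print(round(setup, 2), round(hold, 2), round(clk_to_q, 2))  # 0.8 0.5 3.0
```

The C-Module delay lengthens the setup time and shortens the hold time, while the clock delay does the reverse and also adds directly to the clock-to-Q time.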

5.6.7 Exotic Flop

[Figure: exotic flop circuit with inputs d, clk and output q]

- Inverter chain creates evaluation window in time when clock has just risen and p transistors are turned on.
- When clock is 0, internal nodes precharge to 1.
- Inverter loops are keepers, which store data value.

Chapter 6

Power Analysis and Design


LEC-18 Preliminaries

LEC-18: Introduction to Power


Lecture Notes Sections: 6.1 to 6.2.6

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
wk-08 to 09: Timing Analysis
wk-09 to 10: Power Analysis and Reduction
    Lec-18 Introduction
    Lec-19 Data Encoding
    Lec-20 Clock Gating
wk-11 to 12: Faults and Testing
wk-13: Review

Purpose and List of Concepts

The purpose of this lecture is to convey the importance that power consumption plays in a wide spectrum of digital systems and to introduce the physical equations used to model power consumption.

- Power
- Energy
- Battery Energy
- Heat Removal
- Static Power Consumption
- Dynamic Power Consumption
- Activity Factor
- Switching Power Consumption
- Short-Circuiting Power Consumption
- Leakage Power Consumption


Background Material

Basic electricity and magnetism equations for voltage, power, current, etc

Reading Material
All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.

Smith 15.5

Mudge. Power: A first class design constraint. Trevor Mudge. Computer, vol. 34, no. 4, April 2001, pp. 52-57
http://www.eecs.umich.edu/tnm/papers/computer01.pdf

For more info (optional) :

Infrared Expose: Thermal imaging of 29 200-MHz and 233-MHz notebooks. PC Online. 1997
http://www.zdnet.com/pcmag/features/notebook3/heat.htm

Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. Brooks, D.M.; Bose, P.; Schuster, S.E.; Jacobson, H.; Kudva, P.N.; Buyuktosunoglu, A.; Wellman, J.; Zyuban, V.; Gupta, M.; Cook, P.W. IEEE Micro Dec 2000.
http://ieeexplore.ieee.org/iel5/40/19226/00888701.pdf?isNumber=19226

Managing the Impact of Increasing Microprocessor Power Consumption. Stephen H. Gunther, Frank Binns, Douglas M. Carmean, and Jonathan C. Hall. Intel Technology Journal. 2001 Quarter 1.
http://developer.intel.com/technology/itj/q12001/articles/art 4.htm

The following are three papers from the 1998 Design Automation Conference (DAC) in a session on Power Dissipation and Distribution in High Performance Processors:

Power Considerations in the Design of the Alpha 21264 Microprocessor. Michael K. Gowan, Larry L. Biro, Daniel B. Jackson.
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p

Reducing Power in High-Performance Microprocessors. Vivek Tiwari, Deo Singh, Suresh Rajgopal, Gaurav Mehta, Rakesh Patel, Franklin Baez.
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p

Design and Analysis of Power Distribution Networks in PowerPC(TM) Microprocessors. Abhijit Dharchoudhury, Rajendran Panda, David Blaauw, Ravi Vaidyanathan, Bogdan Tutuianu, David Bearden.
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p

6.1 Overview

6.1.1 Importance of Power and Energy

- Laptops, PDAs, cell-phones, etc: obvious!
- Every watt above 40W adds $1 to manufacturing cost
- Approx 25% of operating expense of server farm goes to energy bills
- (Dis)Comfort of Unix labs in E2
- Sandia Labs had to build a special sub-station when they took delivery of Teraops massively parallel supercomputer (over 9000 Pentium Pros)
- High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Willamette thermal throttling
- In 2000, information technology consumed 8% of total power in US.

6.1.2 Industrial Names and Products

All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.

- AMD's Athlon PowerNow!: Reduce power consumption in laptops when running on battery by reducing clock speed and supply voltage.
- Intel Speedstep: Reduce power consumption in laptops when running on battery by reducing clock speed to 70-80% of normal.
- Intel X-Scale: An ARM5-compatible microprocessor for low-power systems: http://developer.intel.com/design/intelxscale/
- Synopsys PowerMill: A simulator that estimates power consumption of the circuit as it is simulated: http://www.synopsys.com/products/etg/powermill ds.html
- Compaq Itsy
- Satellites

6.1.3 Power vs Energy

Most people talk about power reduction, but sometimes they mean power and sometimes energy.

Type     Units    Equivalent Types   Equations
Power    Watts    Energy / Time      Volts x I;  Joules / sec
Energy   Joules   Work               Volts x Coulombs;  1/2 x C x Volts^2

- Power minimization is usually about heat removal
- Energy minimization is usually about battery life or energy costs

6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?

Batteries are rated in Amp-hours at a voltage.

$\text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts} = \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} = \text{Coulombs} \times \text{Volts} = \text{Energy}$

$\text{Power} = \frac{\text{Energy}}{\text{Time}}$

Batteries store energy.

6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease the energy consumed. Work and energy have the same units, so to extend battery life we really want to improve efficiency. Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

$\frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}}$

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency. (This assumes that all instructions perform the same amount of work!)


6.1.5 Example Problem: Battery Life and Power


Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery?

Question: If I use the SpeedStep feature of my computer, and run at 600MHz with 60W of power, how much longer can I keep the computer running on one battery? How many more simulation steps can I run on one battery?
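One way to work through the two questions is sketched below. The problem does not say how many instructions complete per clock cycle, so the sketch assumes one instruction per cycle (`instr_per_cycle=1.0`); the function names are illustrative only.

```python
# Worked sketch of the battery-life questions.
# ASSUMPTION (not stated in the problem): one instruction completes per clock cycle.

def battery_energy_joules(volts, amp_hours):
    """Energy stored in a battery rated in amp-hours at a given voltage."""
    coulombs = amp_hours * 3600.0   # 1 Ah = 3600 Coulombs
    return volts * coulombs         # Joules = Volts * Coulombs

def runtime_seconds(energy_joules, power_watts):
    """Time = Energy / Power."""
    return energy_joules / power_watts

def simulation_steps(clock_hz, seconds, instr_per_step, instr_per_cycle=1.0):
    """Number of simulation steps completed in the given time."""
    return clock_hz * seconds * instr_per_cycle / instr_per_step

energy = battery_energy_joules(10.0, 2.5)          # 10V, 2.5AH
t_full = runtime_seconds(energy, 70.0)             # 700MHz, 70W
t_slow = runtime_seconds(energy, 60.0)             # 600MHz, 60W
steps_full = simulation_steps(700e6, t_full, 1e6)
steps_slow = simulation_steps(600e6, t_slow, 1e6)
extra_seconds = t_slow - t_full

print(f"battery energy: {energy:.0f} J")
print(f"700MHz/70W: {t_full:.0f} s, {steps_full:.0f} steps")
print(f"600MHz/60W: {t_slow:.0f} s, {steps_slow:.0f} steps")
print(f"extra running time: {extra_seconds:.0f} s")
```

Under these assumptions the slower setting keeps the computer running longer but completes the same number of simulation steps, because clock speed and power are scaled by the same factor and the energy per instruction is unchanged.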

6.2 Power Equations

$\text{Power} = \text{DynamicPower} + \text{StaticPower}$

$\text{Power} = \text{SwitchPower} + \text{ShortPower} + \text{LeakagePower}$

Dynamic Power         dependent upon clock speed
Static Power          independent of clock speed
Switching Power       useful: charges up transistors
Short Circuit Power   not useful: both N and P transistors are on
Leakage Power         not useful: leaks around transistor

6.2.1 Dynamic Power and Activity Factor

- Dynamic power is proportional to how often signals change their value (switch).
- Roughly 20% of signals switch during a clock cycle.
- Need to take glitches into account when calculating activity factor.
- Equations for dynamic power contain clock speed and activity factor.

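6.2.2 Switching Power

[Figure: inverter driving CapLoad. Charging a capacitor: input 1->0, output 0->1. Discharging a capacitor: input 0->1, output 1->0.]

$\text{energy to (dis)charge capacitor} = \frac{1}{2} \times \text{CapLoad} \times \text{VoltSup}^2$

$\text{average switching power} = f \times \text{CapLoad} \times \text{VoltSup}^2$

f: frequency at which inverter goes through complete charge-discharge cycle. (eqn 15.4 in Smith)

ClockSpeed: clock speed
ActFact: average number of times that signal switches from 0->1 or from 1->0 during a clock cycle

A complete charge-discharge cycle contains two transitions, so $f = \frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed}$, which gives:

$\text{average switching power} = \frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$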

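6.2.3 Short-Circuited Power

[Figure: inverter (Vi, Vo, IShort) and gate-voltage waveform marked with GND, VoltThresh, VoltSup - VoltThresh and VoltSup, showing the regions where the P transistor and the N transistor are on and the overlap TimeShort.]

$\text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$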

6.2.4 Leakage Power

[Figure: cross section of inverter showing parasitic diode (Vi, Vo, N and P regions, N-substrate).]

[Figure: I-V curve of leakage current through parasitic diode, marked with ILeak.]

$\text{ILeak} \propto e^{-\frac{q \times \text{VoltThresh}}{k \times T}}$

$\text{PwrLk} = \text{ILeak} \times \text{VoltSup}$

6.2.5 Glossary

ClockSpeed (aka f): clock speed.

VoltSup (aka V): supply voltage.

VoltThresh (aka Vth): threshold voltage; voltage at which P transistors turn on.

ILeak (aka IS, reverse bias saturation current): leakage current. $\text{ILeak} \propto e^{-\frac{q \times \text{VoltThresh}}{k \times T}}$

TimeShort: short circuit time. Time that both N and P transistors are turned on when signal changes value.

IShort (aka Ishort): short circuit current. Current that goes through transistor network while both N and P transistors are turned on.

ActFact (aka A): activity factor. $\text{ActFact} = \frac{\text{NumTransitions}}{\text{NumSignals} \times \text{NumClockCycles}}$
  Per signal: percentage of clock cycles when signal changes value.
  Per clock cycle: percentage of signals that change value per clock cycle.
  Note: When measuring per circuit, sometimes approximate by looking only at flops, rather than every single signal.

CapLoad (aka CL): load capacitance.

PwrSw: switching power (dynamic). $\text{PwrSw} = \frac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$

PwrShort: short-circuit power (dynamic). $\text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$

PwrLk: leakage power (static). $\text{PwrLk} = \text{ILeak} \times \text{VoltSup}$

Power: total power. $\text{Power} = \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$

MaxClockSpeed (aka fmax): maximum clock speed that an implementation technology can support. $\text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$

q: electron charge, $1.60218 \times 10^{-19}$ C

k: Boltzmann's constant, $1.38066 \times 10^{-23}$ J/K

T: temperature in Kelvin

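6.2.6 Note on Power Equations

The power equation:

$\text{Power} = \text{DynamicPower} + \text{StaticPower} = \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$

$\text{Power} = \left(\text{ActFact} \times \text{ClockSpeed} \times \frac{1}{2} \text{CapLoad} \times \text{VoltSup}^2\right) + \left(\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}\right) + \left(\text{ILeak} \times \text{VoltSup}\right)$

is for an individual signal.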

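Multiple Signals
To calculate dynamic power for n signals with different CapLoad, TimeShort, and IShort:

$\text{DynamicPower} = \sum_{i=1}^{n} \left(\text{ActFact}_i \times \text{ClockSpeed} \times \frac{1}{2} \text{CapLoad}_i \times \text{VoltSup}^2\right) + \sum_{i=1}^{n} \left(\text{ActFact}_i \times \text{ClockSpeed} \times \text{TimeShort}_i \times \text{IShort}_i \times \text{VoltSup}\right)$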

Average Power
If we know the average CapLoad, TimeShort, and IShort, then the formula for multiple signals simplifies to:

$\text{DynamicPower} = n \times \left[\left(\text{ActFact}_{AVG} \times \text{ClockSpeed} \times \frac{1}{2} \text{CapLoad}_{AVG} \times \text{VoltSup}^2\right) + \left(\text{ActFact}_{AVG} \times \text{ClockSpeed} \times \text{TimeShort}_{AVG} \times \text{IShort}_{AVG} \times \text{VoltSup}\right)\right]$

If capacitances and short-circuit parameters don't have an even distribution, then don't average them. If high-capacitance signals have high activity factors, then averaging the equations will result in erroneously low predictions for power.
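The difference between the per-signal sum and the averaged formula can be illustrated with a small sketch. The `Signal` class, the function names and all numeric values below are hypothetical, chosen only so that the high-capacitance signal also has the high activity factor; short-circuit parameters are set to zero to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    act_fact: float    # average transitions per clock cycle
    cap_load: float    # load capacitance in farads
    time_short: float  # short-circuit time in seconds
    i_short: float     # short-circuit current in amps

def dynamic_power(signals, clock_speed, volt_sup):
    """Per-signal sum of switching power and short-circuit power."""
    total = 0.0
    for s in signals:
        pwr_sw = s.act_fact * clock_speed * 0.5 * s.cap_load * volt_sup ** 2
        pwr_short = s.act_fact * clock_speed * s.time_short * s.i_short * volt_sup
        total += pwr_sw + pwr_short
    return total

def dynamic_power_averaged(signals, clock_speed, volt_sup):
    """Same calculation using only the average of each parameter."""
    n = len(signals)
    a = sum(s.act_fact for s in signals) / n
    c = sum(s.cap_load for s in signals) / n
    t = sum(s.time_short for s in signals) / n
    i = sum(s.i_short for s in signals) / n
    return n * (a * clock_speed * 0.5 * c * volt_sup ** 2
                + a * clock_speed * t * i * volt_sup)

# Hypothetical circuit: one busy high-capacitance signal, one quiet small one.
signals = [Signal(0.8, 100e-15, 0.0, 0.0), Signal(0.1, 10e-15, 0.0, 0.0)]
exact = dynamic_power(signals, clock_speed=100e6, volt_sup=2.0)
averaged = dynamic_power_averaged(signals, clock_speed=100e6, volt_sup=2.0)
print(f"per-signal sum: {exact:.3e} W, averaged: {averaged:.3e} W")
```

With these made-up numbers the averaged formula predicts noticeably less power than the per-signal sum, which is the underestimate the note warns about.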

LEC-19 Preliminaries

LEC-19: Data Encoding for Power Reduction


Lecture Notes Sections: 6.2.6 to 6.5.2.3


Purpose and List of Concepts


The purpose of this lecture is to give an overview of power reduction techniques and then examine the design process for a common power reduction technique, data encoding.


6.3 Overview of Power Reduction Techniques


We can divide power reduction techniques into two classes: analog and digital.

Analog Parameters
Power reduction parameters at the analog level:
- capacitance: for example, Silicon on Insulator (SOI)
- resistance: for example, copper wires
- voltage: low-voltage circuits

Analog Techniques
Power reduction techniques at the analog level:
- dual-Vt: Two different supply voltages: high voltage for performance-critical portions of design, low voltage for remainder of circuit.
- exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
- adiabatic circuits: Special circuitry that consumes power on 0->1 transitions, but not 1->0 transitions. These sacrifice performance for reduced power.
- clock trees: Up to 30% of total power can be consumed in clock generation and clock tree.

Digital Parameters
Power-reduction parameters at the digital level:
- capacitance (number of gates)
- activity factor
- clock frequency

Digital Techniques
Power-reduction techniques at the digital level:
- multiple clocks: Put a high speed clock in performance-critical parts of design and a low speed clock for remainder of circuit.
- clock gating: Turn off clock to portions of a chip when they are not being used.
- data encoding: Gray coding vs one-hot vs fully encoded vs ...
- asynchronous circuits: Get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html

6.4 Voltage Reduction for Power Reduction

If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

$\text{Power} = \left(\text{ActFact} \times \text{ClockSpeed} \times \frac{1}{2} \text{CapLoad} \times \text{VoltSup}^2\right) + \left(\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}\right) + \left(\text{ILeak} \times \text{VoltSup}\right)$

we observe:

$\text{Power} \propto \text{VoltSup}^2$

Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit. In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage:

$\text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$


Effect of Decreasing Supply Voltage on Delay


Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.

Answer:

d  = 20ns    current delay along critical path
d' = ??      new delay along critical path
V  = 2.8V    current supply voltage
V' = 2.2V    new supply voltage
Vt = 0.7V    threshold voltage

$\text{MaxClockSpeed} \propto \frac{1}{d}$

$\text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}} = \frac{(V - V_t)^2}{V}$

$d \propto \frac{V}{(V - V_t)^2}$

$d' = d \times \frac{(V - V_t)^2 / V}{(V' - V_t)^2 / V'} = 20\text{ns} \times \frac{(2.8\text{V} - 0.7\text{V})^2 / 2.8\text{V}}{(2.2\text{V} - 0.7\text{V})^2 / 2.2\text{V}} = 31\text{ns}$
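The delay calculation can be checked numerically. The sketch below uses the proportionality MaxClockSpeed ∝ (VoltSup - VoltThresh)^2 / VoltSup and the fact that delay is inversely proportional to maximum clock speed; the function names are illustrative.

```python
def relative_max_clock_speed(volt_sup, volt_thresh):
    """Quantity proportional to MaxClockSpeed: (VoltSup - VoltThresh)^2 / VoltSup."""
    return (volt_sup - volt_thresh) ** 2 / volt_sup

def scaled_delay(delay, v_old, v_new, v_thresh):
    """Delay is proportional to 1 / MaxClockSpeed, so scale by the speed ratio."""
    return (delay * relative_max_clock_speed(v_old, v_thresh)
            / relative_max_clock_speed(v_new, v_thresh))

new_delay_ns = scaled_delay(20.0, v_old=2.8, v_new=2.2, v_thresh=0.7)
print(f"new critical path delay: {new_delay_ns:.1f} ns")
```

The unrounded result is just under 31 ns, matching the answer worked out by hand.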

Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage. However, as threshold voltage drops, leakage current increases:

$\text{ILeak} \propto e^{-\frac{q \times \text{VoltThresh}}{k \times T}}$

And increasing the leakage current increases the power:

$\text{Power} \propto \text{ILeak}$

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.
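To get a feel for how sensitive leakage is to the threshold voltage, the sketch below evaluates the proportionality ILeak ∝ exp(-q × VoltThresh / (k × T)) at two threshold voltages. The thresholds (0.7 V and 0.6 V) and the temperature (300 K) are illustrative assumptions, and this simple exponential model only gives a ratio, not an absolute current.

```python
import math

Q = 1.60218e-19   # electron charge in Coulombs
K = 1.38066e-23   # Boltzmann's constant in J/K

def relative_leakage(volt_thresh, temp_kelvin):
    """Quantity proportional to ILeak under the simple exponential model."""
    return math.exp(-Q * volt_thresh / (K * temp_kelvin))

# Assumed operating point: threshold lowered from 0.7 V to 0.6 V at 300 K.
leak_ratio = relative_leakage(0.6, 300.0) / relative_leakage(0.7, 300.0)

# For comparison: switching power scales with VoltSup^2 (2.8 V -> 2.2 V).
switching_ratio = (2.2 / 2.8) ** 2

print(f"leakage current grows by a factor of about {leak_ratio:.0f}")
print(f"switching power shrinks to about {switching_ratio:.0%}")
```

Under this model a small drop in threshold voltage multiplies leakage current many times over, which is why leakage cannot be ignored once thresholds are lowered to follow the supply voltage.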

6.5 Data Encoding for Power Reduction

6.5.1 How Data Encoding Can Reduce Power


Data encoding is a technique that chooses data values so that normal execution will have a low activity factor. The most common example is Gray coding where exactly one bit changes value each clock cycle when counting.

6.5.2 Example Problem

6.5.2.1 Problem Statement


Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)
[Figure: required behaviour. Waveforms of clk and done over clock cycles 1 to 33; done is high for one clock cycle in every 16.]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table in the Additional Information section shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?

6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

[Figure: FPGA cell containing a PLA (combinational circuit) followed by a flip-flop.]

1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

Decimal   Gray   One-Hot            Binary
0         0000   0000000000000001   0000
1         0001   0000000000000010   0001
2         0011   0000000000000100   0010
3         0010   0000000000001000   0011
4         0110   0000000000010000   0100
5         0111   0000000000100000   0101
6         0101   0000000001000000   0110
7         0100   0000000010000000   0111
8         1100   0000000100000000   1000
9         1101   0000001000000000   1001
10        1111   0000010000000000   1010
11        1110   0000100000000000   1011
12        1010   0001000000000000   1100
13        1011   0010000000000000   1101
14        1001   0100000000000000   1110
15        1000   1000000000000000   1111

6.5.2.3 Answer

Outline of Thinking
Factors to consider that distinguish the options: capacitance and activity factor. Capacitance is dependent upon the number of signals, and whether a signal is combinational or a flop.

Sketch Out the Circuitry

Name the output done and the count digits d().

[Figure: block diagram for Gray and Binary counters. Four PLA cells produce d(0), d(1), d(2) and d(3); a fifth PLA produces done.]

[Figure: block diagram for One-Hot counter. Sixteen cells produce d(0), d(1), ..., d(15); d(15) is done.]

Observation:

The Gray and Binary counters have the same design, and the Gray counter will have the lower activity factor. Therefore, the Gray counter will have lower power than the Binary counter.

However, we don't know how much lower the power of the Gray counter will be, and we don't know how much power the One-Hot counter will consume.

Capacitance

                        cap   number   subtotal cap
Gray     d()    PLAs    2     4        8
                Flops   1     4        4
         done   PLAs    2     1        2
                Flops   1     0        0
1-Hot    d()    PLAs    2     0        0
                Flops   1     16       16
         done   PLAs    2     0        0
                Flops   1     0        0
Binary   d()    PLAs    2     4        8
                Flops   1     4        4
         done   PLAs    2     1        2
                Flops   1     0        0

One-Hot Activity Factor

NOTE: Activity factor for One-Hot counter. Because all clock cycles have the same number of transitions for the One-Hot counter, we could have calculated the activity factor as two transitions per sixteen signals.

Activity Factor

Gray Coding Activity Factor
[Figure: Gray coding waveforms for clk, d(0) to d(3) and done.] Transitions per 16 clock cycles:
d(0) 8/16, d(1) 4/16, d(2) 2/16, d(3) 2/16, done 2/16

One-Hot Activity Factor
[Figure: one-hot coding waveforms for clk, d(0), d(1), d(2), ... and done.] Transitions per 16 clock cycles:
d(0) 2/16, d(1) 2/16, d(2) 2/16, ... 2/16, done 2/16

Binary Coding Activity Factor
[Figure: binary coding waveforms for clk, d(0) to d(3) and done.] Transitions per 16 clock cycles:
d(0) 16/16, d(1) 8/16, d(2) 4/16, d(3) 2/16, done 2/16

Summary of Activity Factors

                        act fact
Gray     d()    PLAs    1/4 signals in each clock cycle
                Flops   1/4 signals in each clock cycle
         done   PLAs    2 transitions / 16 clock cycles
                Flops
1-Hot    d()    PLAs
                Flops   2 transitions / 16 clock cycles
         done   PLAs
                Flops
Binary   d()    PLAs    (16 + 8 + 4 + 2 transitions) / (4 signals x 16 clock cycles) = 0.47
                Flops   (16 + 8 + 4 + 2 transitions) / (4 signals x 16 clock cycles) = 0.47
         done   PLAs    2 transitions / 16 clock cycles
                Flops

Putting it all Together

                        subtotal cap   act fact   power
Gray     d()    PLAs    8              1/4        2
                Flops   4              1/4        1
         done   PLAs    2              2/16       4/16
                Flops   0                         0
         Total                                    3.25
1-Hot    d()    PLAs    0                         0
                Flops   16             2/16       2
         done   PLAs    0                         0
                Flops   0                         0
         Total                                    2
Binary   d()    PLAs    8              0.47       3.75
                Flops   4              0.47       1.88
         done   PLAs    2              2/16       0.25
                Flops   0                         0
         Total                                    5.88

The Binary power values use the unrounded activity factor 30/64.

Final Answer
If we choose Binary counting as the baseline, then the relative amounts of power, computed from the totals in the Putting it all Together table, are:

Gray   One-Hot   Binary
55%    34%       100%

If we choose One-Hot counting as the baseline, then the relative amounts of power are:

Gray   One-Hot   Binary
163%   100%      294%
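The activity factors and totals can be cross-checked by enumerating the three count sequences and counting bit transitions, as in the sketch below. The capacitance subtotals are taken from the Capacitance table; the helper names are illustrative, and the done activity factor of 2 transitions per 16 clock cycles is entered directly rather than simulated.

```python
def gray(n):
    """Standard reflected Gray code of n."""
    return n ^ (n >> 1)

def one_hot(n):
    """One-hot code of n: only bit n is set."""
    return 1 << n

def activity_factor(states, num_signals):
    """NumTransitions / (NumSignals * NumClockCycles) over one wrap-around period."""
    transitions = 0
    for i, cur in enumerate(states):
        nxt = states[(i + 1) % len(states)]
        transitions += bin(cur ^ nxt).count("1")   # bits that change value
    return transitions / (num_signals * len(states))

COUNT = range(16)
act = {
    "Gray": activity_factor([gray(n) for n in COUNT], 4),
    "1-Hot": activity_factor([one_hot(n) for n in COUNT], 16),
    "Binary": activity_factor(list(COUNT), 4),
}
DONE_ACT = 2 / 16   # done rises once and falls once every 16 clock cycles

# Capacitance subtotals from the table: (d() PLAs, d() flops, done PLA).
cap = {"Gray": (8, 4, 2), "1-Hot": (0, 16, 0), "Binary": (8, 4, 2)}

power = {
    name: (pla + flop) * act[name] + done_pla * DONE_ACT
    for name, (pla, flop, done_pla) in cap.items()
}

for name in power:
    print(f"{name:7s} act fact {act[name]:.3f}  power {power[name]:.3f}  "
          f"vs Binary {power[name] / power['Binary']:.1%}  "
          f"vs 1-Hot {power[name] / power['1-Hot']:.1%}")
```

The power figures are in the same relative units as the tables (flop capacitance times activity factor), so only the ratios between the three counters are meaningful.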

LEC-20 Preliminaries

LEC-20: Clock Gating for Power Reduction


Lecture Notes Sections: 6.6 to 6.6.3.1


Purpose and List of Concepts


The purpose of this lecture is to outline the design process for a common power reduction technique, clock-gating, and to analyze the success of the design.

- clock gating: idea
- circuitry for clock gating
- power analysis of clock gating

6.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.

6.6.1 Introduction to and Overview of Clock Gating

6.6.1.1 Examples of Clock Gating

Condition                                           Circuitry turned off
O/S in standby mode                                 Everything except core state (PC, registers, caches, etc)
No floating point instructions for k clock cycles   floating point circuitry
Instruction cache miss                              Instruction decode circuitry
No instruction in pipe stage i                      Pipe stage i+1

6.6.1.2 Design Tradeoffs

- Can significantly reduce activity factor (Synopsys PowerCompiler claims that it can cut power to 50-80% of the ungated level)
- Increases design complexity: design effort, bugs!
- Increases area
- Increases clock skew

6.6.1.3 Functional Validation and Clock Gating

- It's a functional bug to turn a clock off when it's needed for valid data.
- It's functionally ok, but wasteful, to turn a clock on when it's not needed.

(About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock gating.) Nicolas Mokhoff. EE Times. June 27, 2001. http://www.edtn.com/story/OEG20010621S0080

6.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.

[Figure: without clock gating. Circuit with inputs i_data, i_valid and clk, and outputs o_data and o_valid.]

[Figure: with clock gating. Circuit with inputs i_* and outputs o_*, clocked by cool_clk; a Clock Enable State Machine driven by clk and i_wakeup produces clk_en, which gates clk to form cool_clk.]

6.6.2.1 Simple Power Analysis

Sample problem:

Question: How much power will be saved in the following clock-gating scheme?

- 70% of the time the main circuit has valid data
- clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
- clock gating circuit has 10% of the area of the main circuit
- clock gating circuit has same activity factor as main circuit
- neglect short-circuiting and leakage power

Answer to Simple Power Analysis

Answer:

1. Set up main equations

PwrMain    power for main circuit without clock gating
PwrMain'   power for main circuit with clock gating
PwrClkFsm  power for clock enable state machine

$\text{PwrTot} = \text{PwrMain}$

$\text{PwrTot}' = \text{PwrMain}' + \text{PwrClkFsm}$

$\text{Pwr} = \text{PwrSw} + \text{PwrLk} + \text{PwrShort}$, where PwrLk is negligible and PwrShort is negligible, so:

$\text{Pwr} = \text{PwrSw} = \frac{1}{2} A C V^2$

$\frac{\text{PwrTot}'}{\text{PwrTot}} = \frac{\frac{1}{2} A_{Main}' C_{Main} V^2 + \frac{1}{2} A_{ClkFsm} C_{ClkFsm} V^2}{\frac{1}{2} A_{Main} C_{Main} V^2} = \frac{\frac{1}{2} A' C V^2 + \frac{1}{2} A (0.1 C) V^2}{\frac{1}{2} A C V^2} = \frac{A' C + A (0.1 C)}{A C} = \frac{A' + 0.1 A}{A}$

2. Find new activity factor for main circuit (A'):

Eff       effectiveness of clock gating
PctValid  percentage of clock cycles with valid data
PctClk    percentage of clock cycles that clock toggles

$\text{PctClk} = 1 - \text{Eff} \times (1 - \text{PctValid})$

Intuition: when Eff = 0%, PctClk = 100%; when Eff = 100%, PctClk = PctValid.

$A' = \text{PctClk} \times A = (1 - \text{Eff} \times (1 - \text{PctValid})) \times A = (1 - 0.9 \times (1 - 0.7)) \times A = 0.73 A$

3. Find ratio of new total power to previous total power:

$\frac{\text{PwrTot}'}{\text{PwrTot}} = \frac{A' + 0.1 A}{A} = \frac{0.73 A + 0.1 A}{A} = 0.83$

4. Final answer: new power is 83% of original power
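The same calculation can be written as a short sketch, which also makes it easy to try other values of effectiveness or state-machine size. The function names are illustrative; the model is the one in the answer: the gated main circuit runs at activity factor PctClk × A, and the clock enable state machine adds 10% of the main circuit's capacitance at the ungated activity factor A.

```python
def pct_clk(eff, pct_valid):
    """Fraction of clock cycles in which the gated clock toggles."""
    return 1.0 - eff * (1.0 - pct_valid)

def power_ratio(eff, pct_valid, fsm_area_fraction):
    """New total power / original total power, switching power only.

    Main circuit contributes pct_clk * A * C; the clock enable state machine
    contributes A * (fsm_area_fraction * C); dividing by A * C leaves this sum.
    """
    return pct_clk(eff, pct_valid) + fsm_area_fraction

ratio = power_ratio(eff=0.9, pct_valid=0.7, fsm_area_fraction=0.1)
print(f"new power is {ratio:.0%} of original power")
```

The two limiting cases from the intuition note can be checked the same way: `pct_clk(0.0, 0.7)` gives 1.0 and `pct_clk(1.0, 0.7)` gives approximately 0.7.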

6.6.2.2 Valid-Bit Protocol

Need a mechanism to tell circuit when to pay attention to data inputs, e.g. when is it supposed to decode and execute an instruction, or write data to a memory array?

[Figure: circuit with inputs clk, i_valid and i_data and outputs o_valid and o_data, with waveforms of clk, i_valid, i_data, o_valid and o_data.]

i_valid: high when i_data has valid data; signifies whether circuit should pay attention to or ignore data.
o_valid: high when o_data has valid data; signifies whether environment should pay attention to output of circuit.

For more on circuit protocols, see section 2.4.10.

Microscopic Analysis of Valid-Bit Propagation
[Figure: valid-bit flop chain from i_valid to o_valid, with waveforms of clk, i_valid and o_valid.]

Which Clock Edges Are Needed?
[Figure: the same circuit and waveforms of clk, i_valid and o_valid.]

Minimal Sequence of Clock Edges?
[Figure: the same circuit and waveforms of clk, i_valid and o_valid.]

Too Few Clock Edges
[Figure: the same circuit and waveforms of clk, i_valid and o_valid.]

Minimal Sequence of Clock Edges!
[Figure: the same circuit and waveforms of clk, i_valid and o_valid.]

6.6.2.3 Clock Gating and Big Circuit

Before Clock Gating
[Figure: circuit with inputs data_in, valid_in and clk and outputs data_out and valid_out, with waveforms of clk, valid_in, data_in, valid_out and data_out; legend: don't care, uninitialized.]

After Clock Gating: Circuitry
[Figure: circuit with inputs data_in and valid_in and outputs data_out and valid_out, clocked by cool_clk; a Clock Enable State Machine takes hot_clk and wakeup_in and produces clk_en and wakeup_out.]

hot_clk: clock that always toggles
cool_clk: gated clock; sometimes toggles, sometimes stays low
wakeup: alerts circuit that valid data will be arriving soon
clk_en: turns on cool_clk

After Clock Gating: New Signals
[Figure: waveforms of hot_clk, wakeup_in, valid_in, data_in, clk_en, cool_clk, valid_out, data_out and wakeup_out.]

New Signal: Wakeup (no, not you)
[Figure: waveforms of hot_clk, wakeup_in and valid_in.]

New Signal: Clock Enable, Cool Clock
[Figure: waveforms of hot_clk, wakeup_in, valid_in, data_in, clk_en, cool_clk, valid_out and data_out.]

New Signal: Wakeup Out
[Figure: waveforms of hot_clk, wakeup_in, valid_in, data_in, clk_en, cool_clk, valid_out, data_out and wakeup_out.]

6.6.2.4 Designing Clock Gating Circuitry

Design Decisions

- What level of granularity for gated clocks? Entire module? Individual pipe stages? Something in between?
- When should the clocks turn off?
- When should the clocks turn on?
- Protocol for incoming wakeup signal?
- Protocol for outgoing wakeup signal?

Wakeup Protocol
Designers negotiate incoming and outgoing wakeup protocol with environment. Example wakeup protocol:

- wakeup_in will arrive 1 clock cycle before valid data
- wakeup_in will stay high until there have been at least 3 cycles of invalid data
- same protocol for wakeup_out

6.6.3 Design Problem

Design a clock enable state machine for a pipelined module whose latency varies from 5 to 10 clock cycles and that can hold a maximum of 6 instructions (parcels of data).

Design Strategy
When designing clock gating circuitry, consider the two extreme cases:

- a constant stream of valid data
- circuit is turned off and receives a single parcel of valid data

For a constant stream of valid data, the key is to not incur a large overhead in design complexity, area, or clock period when clocks will always be toggling. For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data can percolate through circuit. Also, we want to turn off the clock as soon as possible after data leaves.

6.6.3.1 Solution Sketch

Scenario 1
1. Scenario: turned off and get one parcel.
   (a) Need to turn on and stay on until parcel departs.
   (b) Idea #1 (parcel count): count number of parcels inside module; keep clocks toggling if have non-zero parcels.
   (c) Idea #2 (cycle count): count number of clock cycles since last valid parcel entered module; once hit 10 clock cycles without any valid parcels entering, know that all parcels have exited; keep clocks toggling if counter is less than 10.

Scenario 2
1. Scenario: constant stream of parcels.
   (a) Parcel count would require looking at input and output stream and conditionally incrementing or decrementing counter.
   (b) Cycle count would keep resetting counter.

Waveforms for Parcel Count
[Figure: waveforms of i_valid, o_valid, parcel_count and parcel_clk_en over clock cycles 1 to 24.]

Waveforms for Cycle Count
[Figure: waveforms of i_valid, o_valid, cycle_count and cycle_clk_en over clock cycles 1 to 24; cycle_count takes the values 0 1 2 0 0 0 0 1 2 3 4 0 1 2 3 4 5 6 7 8 9 10.]

Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for counter. Counter must be able to increment and decrement. Equations for counter action (increment/decrement/no-change):

i_valid   o_valid   action
0         0         no change
0         1         decrement
1         0         increment
1         1         no change

Combined increment and decrement can be done with half-adder (AND, NOR, OR) and one XOR gate. Count each normal gate as one unit of capacitance, XOR as 1.5 units of capacitance, and flop as 2 units of capacitance. (This information would be given on an exam.) To perform both increment and decrement, will need 4.5 units of capacitance per bit for the combinational circuitry and 2 units of capacitance per bit for the flop. This gives a total of 6.5 units of capacitance per bit.

Cycle Count Design

Need to count (0..10) cycles, therefore need 4 bits for counter. Counter must be able to increment and reset. Increment on each clock cycle, unless get i_valid, in which case reset. To perform increment, will need just half adders, which is 3 gates of capacitance per bit for the combinational circuitry. After adding a flop, there is a total of 5 units of capacitance per bit.
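The two clock-enable ideas can be compared with a small behavioural model. This is only a sketch: the class and method names are hypothetical, the wakeup protocol and the extra cycle needed to flop in an invalid parcel are not modelled, and the i_valid pattern in the usage part is an assumption chosen to reproduce the cycle_count values listed for the cycle count waveform.

```python
class ParcelCountEnable:
    """Idea #1: keep the clock on while at least one parcel is inside the module."""
    MAX_PARCELS = 6

    def __init__(self):
        self.count = 0

    def step(self, i_valid, o_valid):
        # increment on an entering parcel, decrement on a leaving one,
        # no change when both or neither happen in the same cycle
        self.count += int(i_valid) - int(o_valid)
        assert 0 <= self.count <= self.MAX_PARCELS
        return self.count > 0            # clock enable


class CycleCountEnable:
    """Idea #2: keep the clock on until 10 cycles pass with no valid input."""
    LIMIT = 10                           # maximum latency of the module

    def __init__(self):
        self.count = self.LIMIT          # start idle: clock off

    def step(self, i_valid):
        if i_valid:
            self.count = 0               # reset on every valid parcel
        elif self.count < self.LIMIT:
            self.count += 1              # saturate at LIMIT
        return self.count < self.LIMIT   # clock enable


# Assumed input pattern: valid parcels enter in cycles 0, 3 to 6 and 11.
valid_cycles = {0, 3, 4, 5, 6, 11}
cyc = CycleCountEnable()
trace = []
for cycle in range(22):
    cyc.step(cycle in valid_cycles)
    trace.append(cyc.count)
print(trace)

# A single parcel that enters in cycle 0 and leaves in cycle 5.
pcl = ParcelCountEnable()
enables = [pcl.step(c == 0, c == 5) for c in range(6)]
print(enables)
```

The parcel counter needs both i_valid and o_valid and an up/down counter, while the cycle counter needs only i_valid and an incrementer with reset, which is the capacitance trade-off described in the two design sections.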

Design Analysis
Assuming that:

- both designs will be implemented on same technology
- leakage current is negligible
- short-circuit power is negligible

The two factors affecting power are activity factor and capacitance.

Design Analysis (Cont'd)

Capacitance

               num bits               circuit       total cap
parcel count   3 bit counter (0..6)   inc/dec       3 x 6.5 = 19.5
cycle count    4 bit counter (0..10)  half adders   4 x 5 = 20

Parcel count wins on capacitance.

Power
If parcel leaves after 5 clock cycles, cycle count will continue to power circuit for another 5 cycles (wasting power!). So, it looks like parcel count wins. However, we should carry out a detailed analysis to see how much difference there is between the two options.

Behavioural Analysis

Question: Which design option has lower power?

Assuming:

- 60% of incoming data are valid
- even distribution of latencies
- average length of continuously valid data is 80 instructions

Answer:

Goal: determine what percentage of time cool_clk is toggling for each of the two design options.

Construct Average Waveform

1. Assume that all three of the circuits in question (main circuit without clock gating, and the clock enable state machines) have the same activity factor.
2. Construct average waveform for cool clock.
   (a) 60% of incoming data are valid
   (b) average length of valid data is 80 instructions
   (c) length of window for average data is:

   $\text{WindowLength} = \frac{\text{ValidLength}}{\text{PctValid}} = \frac{80}{0.6} = 133 \text{ cycles}$

[Figure: average waveform with 80 cycles of valid data in a window of 133 clock cycles.]

Parcel Count Clocking

3. Calculate percentage of clock cycles that parcel count circuit is powered.
   (a) Clock will run for: 80 clock cycles + average latency - 1 + 1 cycle to clear out last parcel. The first clock cycle of latency of the last parcel is counted in the 80 clock cycles. The last clock cycle clears out the last valid parcel by flopping in an invalid parcel. See section 6.6.2.2.
   (b) Minimum latency is 5, max is 10, distribution is even. Therefore average latency is 7.5.
   (c) Clock will run for: 80 + 7.5 - 1 + 1 = 87.5 cycles.
   (d) Percentage clocking is 87.5 / 133 = 65.8%


Cycle Count Clocking


4. Calculate percentage of clock cycles that cycle count circuit is powered.
   (a) Clock will run for: 80 clock cycles + 10 - 1 for powering last parcel + 1 cycle to clear out last parcel = 90.0 clock cycles
   (b) Percentage clocking is $90.0 / 133 = 67.7\%$.


Wrapup
5. Summary

                 Capacitance  Percentage clocking
   Parcel Count  19.5         65.8%
   Cycle Count   20           67.7%

6. Parcel count wins on both capacitance and activity factor, therefore it has the lower power consumption.
7. How much more power does the cycle count design consume?

   $n\%\ \text{more power} = \dfrac{\mathit{CycPwr} - \mathit{PclPwr}}{\mathit{PclPwr}} = \dfrac{(20.0 \times 0.677) - (19.5 \times 0.658)}{19.5 \times 0.658} = 5.5\%$
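The arithmetic of this design problem can be replayed in a few lines. This is only a sketch: it assumes, as the analysis does, that power is proportional to capacitance times the percentage of cycles the clock toggles, and it rounds the window to 133 cycles as the notes do. All variable names are made up for illustration.

```python
# Figures are taken from the design problem; names are illustrative only.
VALID_LENGTH = 80                 # average run of valid data, in cycles
PCT_VALID = 0.6                   # fraction of incoming data that are valid
MIN_LATENCY, MAX_LATENCY = 5, 10  # evenly distributed latencies

# The notes round the window to a whole number of cycles (133).
window = round(VALID_LENGTH / PCT_VALID)

avg_latency = (MIN_LATENCY + MAX_LATENCY) / 2
parcel_cycles = VALID_LENGTH + avg_latency - 1 + 1   # 87.5 cycles
cycle_cycles = VALID_LENGTH + MAX_LATENCY - 1 + 1    # 90 cycles

parcel_pct = parcel_cycles / window
cycle_pct = cycle_cycles / window

# Assumed power model: capacitance times percentage clocking.
parcel_pwr = 19.5 * parcel_pct
cycle_pwr = 20.0 * cycle_pct
extra = (cycle_pwr - parcel_pwr) / parcel_pwr

print(f"parcel count clocking: {parcel_pct:.1%}")
print(f"cycle count clocking:  {cycle_pct:.1%}")
print(f"cycle count uses {extra:.1%} more power")
```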

Chapter 7

Fault Testing and Testability

49

LEC-20:

7.1

INTRODUCTION

50

7.1

Introduction

LEC-20:

7.1.1

Purpose and List of Concepts

51

7.1.1

Purpose and List of Concepts

The purpose of this lecture is to explain the sources of manufacturing faults, how the faults are caught, and the tradeoffs in trying to catch these faults. We will then introduce the mathematical models for the physical faults.

- physical faults
- wired-AND
- wired-OR
- stronger wins
- mathematical model of fault
- causes of faults
- testing
- burn in
- bin sorting
- scan testing
- built-in self test
- IDDQ testing
- economics of testing
- locations of faults
- test vector to detect a fault
- single stuck-at faults
- undetectable faults
- redundant circuitry
- fault domination
- fault collapsing
- fault equivalence
- gate collapsing
- node collapsing
- fault collapsing (intelligent collapsing)
- fault coverage
- test vector generation
- required test vectors
- order to run test vectors
- fault hiding
- scan testing
- scan chain
- testing procedure
- time to run a test
- boundary scan testing
- JTAG
- IEEE 1149
- length of time to do a scan test
- hardware to do scan testing

LEC-20:

7.1.2

Background Material

52

7.1.2

Background Material

Karnaugh maps

LEC-20:

7.1.3

Reading Material

53

7.1.3

Reading Material

Smith ch14

LEC-21 Preliminaries

LEC-21: Introduction to Faults, Testing, and Testability


Lecture Notes Sections: 7.2 to 7.2.7.2

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

Schedule

wk-01 to 02   VHDL
wk-03 to 05   Design and Optimization
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08 to 09   Timing Analysis
wk-09 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
                Lec-21 Introduction
                Lec-22 Detection and Test Generation
                Lec-23 Built-In Self Test (BIST)
                Lec-24 JTAG
wk-13         Review


LEC-21:

7.2

FAULTS AND TESTING

7.2

Faults and Testing

LEC-21:

7.2.1

Overview of Faults and Testing

7.2.1

Overview of Faults and Testing


7.2.1.1 Faults (Smith 14.3)


During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.

[Figure: good wires, shorted wires, open wire]


7.2.1.2 Causes of Faults (Smith 14.3)

Fabrication process (initial construction is bad)
- chemical mix
- impurities
- dust

Manufacturing process (damage during construction)
- handling
- probing
- cutting
- mounting
- materials
- corrosion
- adhesion failure
- cracking
- peeling


7.2.1.3 Testing (Smith 14)


Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.


7.2.1.4 Burn In (Smith 14.3.1)


Some chips that come off the manufacturing line will work for a short period of time and then fail.

Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temps, high and low voltages, high and low clock speeds) before and during testing. The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early use by customers.

[Figure: soon-to-break wire]

The hope is that the extreme conditions will cause chips to break that would otherwise have broken in the customer's system soon after arrival. The trick is to create conditions that are extreme enough that bad chips will break, but not so extreme as to cause good chips to break.


7.2.1.5 Bin Sorting (Smith 5.1.6)


Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labelled (binned) at the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.

Overclocking is taking a chip rated at $n$ MHz and running it at $(1 + x) \times n$ MHz. (Sure your computer often crashes and loses your assignment, but just think how much more productive you are when it is working...)


7.2.1.6 Testing Techniques (Smith 14)


Scan Testing or Boundary Scan Testing (BST, JTAG) (Smith 14.2, 14.6):
- Load test vector from tester into chip
- Run chip on test data
- Unload result data from chip to tester
- Compare results from chip against those produced by simulation
- If results are different, then chip was not manufactured correctly

Built In Self Test (BIST) (Smith 14.7): Build circuitry on chip that generates tests and compares actual and expected results.

IDDQ Testing (Smith 14.3.6): Measure the quiescent current between VDD and GND. Variations from expected values indicate faults.


Challenges
"The crux of testing is to use yesterday's technology to find faults in tomorrow's chips." (Agilent engineer at ARVLSI 2001)

The challenges in testing:

- test circuitry consumes chip area
- test circuitry reduces performance
- decrease fault escapee rate of product that ships while having minimal impact on production cost and chip performance
- external tester can only look at I/O pins
- ratio of internal signals to I/O pins is increasing
- some faults will only manifest themselves at high clock frequencies


7.2.1.7 Design for Testability (DFT) (Smith 14.6)


Scan testing and self-testing require adding extra circuitry to chips. Design for test is the process of adding this circuitry in a disciplined and correct manner. A hot area of research, that is becoming mainstream practice, is developing synthesis tools to automatically add the testing circuitry.

LEC-21:

7.2.2

Example Problem: Economics of Testing (Smith 14.1)

16

7.2.2 Example Problem: Economics of Testing (Smith 14.1)

Given information:

- The ACHIP costs $10 without any testing
- Each board uses one ACHIP (plus lots of other chips that we don't care about)
- 68% of the manufactured ACHIPs do not have any faults
- For the ACHIP, it costs $1 per chip to catch half of the faults
- Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of tests that are run)
- If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP
- Board-level testing will detect 100% of the faults in an ACHIP


Economics of Testing
Question: What escapee fault rate will minimize cost of the ACHIP?

$\mathit{TotCost} = \mathit{NoTestCost} + \mathit{TestCost} + \mathit{ReplaceCost}$, where $\mathit{ReplaceCost} = \$200 \times \mathit{EscapeeProb}$

NoTestCost  TestCost  EscapeeProb  ReplaceCost          TotCost
$10         $0        32%          200 x 0.32  = $64    $74
$10         $1        16%          200 x 0.16  = $32    $43
$10         $2        8%           200 x 0.08  = $16    $28
$10         $4        4%           200 x 0.04  = $8     $22
$10         $8        2%           200 x 0.02  = $4     $22
$10         $16       1%           200 x 0.01  = $2     $28
$10         $32       0.5%         200 x 0.005 = $1     $43

The total cost is lowest ($22) at escapee rates of 4% and 2%.

For high-volume, small-area chips, testing can consume more than 50% of the total cost.
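The cost table can be regenerated from the given information. The sketch below assumes the cost model implied by the problem statement (each halving of the escapee rate doubles the test cost, starting at $1); the function and constant names are made up for illustration.

```python
CHIP_COST = 10.0        # ACHIP cost without any testing
REPLACE_COST = 200.0    # cost to replace a bad ACHIP found at board level
FAULTY_FRACTION = 0.32  # 68% of manufactured ACHIPs have no faults

def total_cost(halvings):
    """Expected cost per chip after `halvings` 50% reductions in escapees."""
    test_cost = 0.0 if halvings == 0 else 2.0 ** (halvings - 1)
    escapee_prob = FAULTY_FRACTION / 2 ** halvings
    return CHIP_COST + test_cost + escapee_prob * REPLACE_COST

costs = [total_cost(k) for k in range(7)]
for k, cost in enumerate(costs):
    print(f"escapee rate {FAULTY_FRACTION / 2 ** k:.1%}: total ${cost:.0f}")
```

The minimum appears twice because, at that point, doubling the test cost from $4 to $8 saves exactly $4 of expected replacement cost.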

LEC-21:

7.2.3

Physical Faults (Smith 14.3.3)

18

7.2.3

Physical Faults (Smith 14.3.3)


7.2.3.1 Types of Physical Faults


Good Circuit

[Figure: circuit with signals a, b, c, d]

Bad Circuits

- open
- wired-AND bridging short
- wired-OR bridging short
- stronger wins bridging short (b is stronger)
- short to VDD
- short to GND

[Figure: one faulty version of the circuit for each case]


7.2.3.2 Locations of Faults


Each segment of wire, poly, diffusion, via, etc is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.

[Figure: wire segments of signal b, each marked BAD or OK]

Three different locations for potential faults.


7.2.3.3 Layout Affects Locations


[Figure: one schematic (signals a to i) with two different layouts; fault locations L1 to L4 in one layout and L1 to L5 in the other]

For the same schematic, we can have either four or five different locations for potential faults, depending upon how the circuit is laid out.


7.2.3.4 Naming Fault Locations


Two ways to name a fault location:

- pin-fault model: Faults are modelled as occurring on input and output pins of gates.
- net-fault model: Faults are modelled as occurring on segments of wires.

In E&CE 427, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.

LEC-21:

7.2.4

Detecting a Fault

23

7.2.4

Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value.


7.2.4.1 Which Test Vectors will Detect a Fault?


[Figure: good circuit and faulty circuit, each with inputs a, b, c, internal signal d, and output e]

a b c | good | faulty
0 0 0 |  0   |  0
0 0 1 |  1   |  1
0 1 0 |  0   |  0
0 1 1 |  1   |  1
1 0 0 |  0   |  0
1 0 1 |  1   |  1
1 1 0 |  1   |  0
1 1 1 |  1   |  1

The only test vector that will detect the fault in the circuit is 110.

Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.


7.2.4.2 A Single Test-Vector Can Detect Several Faults


[Figure: the circuit with another fault]

Another fault: the test vector 110 can catch both this fault and the previous one.

a b c | good | faulty
1 1 0 |  1   |  0

LEC-21:

7.2.5

Mathematical Models of Faults (Smith 14.3.4)

26

7.2.5 Mathematical Models of Faults (Smith 14.3.4)


Goal: develop a reliable and predictable technique for detecting faults in circuits.

Problems:

- The possible faults in a circuit are dependent upon the physical layout of the circuit.
- A very wide variety of possible faults
- A single test vector can catch many different faults

Need: a mathematical model for faults that is abstracted from complexities of circuit layout and plethora of possible faults, yet still detects most or all possible faults.


7.2.5.1 Single Stuck-At Fault Model


Two simplifying assumptions:

1. A maximum of one fault per tested circuit
2. All faults are either:
   (a) stuck-at 1: short to VDD
   (b) stuck-at 0: short to GND


Example of Stuck-At Faults


[Figure: circuit with inputs a, b, c, d and fault locations L1 to L12]

If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider:

12 fault locations $\times$ 2 types of faults $=$ 24 possible faults.


Problems with Multiple Faults


[Figure: the same circuit with every location L1 to L12 marked @0,1]

If allowed multiple faults, then could have up to 12 different faults in the same circuit. How many faulty circuits would need to be considered?

Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore, $3^{12} \approx 5.3 \times 10^5$ different circuits would need to be considered!

If allowed multiple faults of 4 different types at 12 different locations, then would have $5^{12} \approx 2.4 \times 10^8$ different faulty circuits to consider!


Faults and Possible Circuits


There are $2^{2^4} \approx 6.6 \times 10^4$ different Boolean functions of four inputs. Thus, there are $6.6 \times 10^4$ possible equations for circuits with four inputs and one output. This is much less than the number of faulty circuit models that would be generated by the simultaneous-faults-at-every-location models. So both of the simultaneous-faults-at-every-location models are too extreme.

LEC-21:

7.2.6

Generate Test Vector to Find a Mathematical Fault (Smith 14.4) 31

7.2.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)


7.2.6.1 Algorithm
1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in region of disagreement is a test vector that will detect fault
5. any assignment outside of region of disagreement will result in same output on both correct and faulty circuit
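For small circuits the same idea can be sketched by brute force: enumerate every input assignment and keep those where the correct and faulty circuits disagree. The example circuit below, z = ab + c with the AND output stuck at 0, is an assumption chosen to match the truth table in 7.2.4.1; `find_test_vectors` is a hypothetical helper name.

```python
from itertools import product

def find_test_vectors(good, faulty, n_inputs):
    """Return every input assignment on which the two circuits disagree.

    The returned list plays the role of the region of disagreement
    between the two Karnaugh maps.
    """
    return [bits for bits in product((0, 1), repeat=n_inputs)
            if good(*bits) != faulty(*bits)]

# Assumed example: z = ab + c, with the AND gate output stuck at 0.
def good(a, b, c):
    return (a & b) | c

def faulty(a, b, c):
    return 0 | c   # the AND output is replaced by the constant 0

print(find_test_vectors(good, faulty, 3))
```

Enumeration is exponential in the number of inputs, so this only illustrates the definition; it is not a substitute for the ATPG algorithms mentioned later in the notes.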


7.2.6.2 Example of Finding a Test Vector


[Figure: good circuit and faulty circuit (inputs a, b, c, signals d, e), each with its Karnaugh map, and the Karnaugh map of the difference between the good and faulty circuits]

LEC-21:

7.2.7

Undetectable Faults

34

7.2.7

Undetectable Faults

Not all faults are detectable.

1. If a circuit is irredundant then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If not trying to find all of the faults in a circuit, then a fault that you aren't looking for can mask a fault that you are looking for.


7.2.7.1 Redundant Circuitry


Timing Hazards
- Static hazard
- Dynamic hazard

Timing hazards are often removed by adding redundant circuitry.


Redundant Circuitry
[Figure: irredundant circuit with inputs a, b, c, internal signals d, e, f, and output g, with an illustration of the timing hazard]

Glitch on g is caused because the AND gate for e turns off before f turns on.


Redundant Circuitry
In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map.

[Figure: Karnaugh map of the irredundant circuit]

We can prevent this transition from causing a glitch by adding a cube that covers the two squares of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.

[Figure: Karnaugh map with the added cube]


Redundant Circuitry
[Figure: redundant circuit with inputs a, b, c, internal signals d, e, f, h, output g, and fault location L1]

No more timing hazards


Redundant Circuitry
L1@0 is undetectable.

Correct circuit: $ab + \bar{b}c$

Faulty circuit: $ab + \bar{b}c + ac$. With L1@0, $ac = 0$, so $ab + \bar{b}c + 0 = ab + \bar{b}c$.

Same equation as correct circuit.
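A quick exhaustive check makes the point concrete. The sketch assumes that the redundant circuit computes ab + (not b)c + ac and that L1 is the output of the added AND gate h = ac; both are inferred from the hazard discussion above (the transition from 111 to 101), and the function name is made up.

```python
from itertools import product

def redundant_circuit(a, b, c, l1_stuck_at_0=False):
    """Assumed redundant circuit: g = ab + (not b)c + h, with h = ac on L1."""
    h = 0 if l1_stuck_at_0 else (a & c)   # redundant cube 1-1
    return (a & b) | ((1 - b) & c) | h

# The fault is undetectable if no input assignment changes the output.
undetectable = all(
    redundant_circuit(a, b, c) == redundant_circuit(a, b, c, l1_stuck_at_0=True)
    for a, b, c in product((0, 1), repeat=3)
)
print(undetectable)
```

The check passes because whenever a = c = 1, one of the other two cubes is already 1 regardless of b.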


7.2.7.2 Curious Redundant Circuitry and Fault Detection


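The two circuits below have the same steady-state behaviour.

[Figure: a circuit with inputs a, c and output z, and a second circuit with inputs a, b, c, output z, and fault locations L1, L2, L3]

So, the signal b and the two extra XOR gates are redundant.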


Detectable Faults in Redundant Circuitry


In the redundant circuit, a stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

[Figure: the redundant circuit with fault locations L1, L2, L3, and a table giving the equation, K-map, and diff w/ ckt for the faults L2@0 and L2@1]

The lesson is that not all faults in redundant circuitry are undetectable.

LEC-22 Preliminaries

LEC-22: Fault Detection and Test-Vector Generation


Lecture Notes Sections: 7.3 to 7.3.5

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter


Purpose and List of Concepts


The purpose of this lecture is to demonstrate the fundamental techniques for detecting faults in circuits. Subsequent lectures will build on these fundamentals to show applications of these techniques.

- redundant circuitry addendum
- fault domination
- fault collapsing
- fault equivalence
- gate collapsing
- node collapsing
- fault collapsing (intelligent collapsing)
- fault coverage
- test vector generation
- required test vectors
- order to run test vectors
- fault hiding

LEC-22:

7.3

FAULTS

7.3

Faults

LEC-22:

7.3.1

Locations of Faults

7.3.1

Locations of Faults

Throughout this lecture we'll be using the circuit below:

[Figure: circuit with inputs a, b, c and fault locations L2, L4, L5, with its Karnaugh map (cubes ab and bc)]

At first, we will consider only the following faults: L2@1, L4@1, L5@1.


Simple Analysis of L2@1, L4@1, L5@1


[Figure: circuit with inputs a, b, c and fault locations L2, L4, L5]

    fault   eqn     test vectors
1)  L2@1    a+c     101, 001, 100
2)  L4@1    a+bc    101, 100
3)  L5@1    ab+c    101, 001

[Figure: K-map and diff w/ ckt for each fault]


Choose Test Vector


    fault   eqn     test vectors
1)  L2@1    a+c     101, 001, 100
2)  L4@1    a+bc    101, 100
3)  L5@1    ab+c    101, 001

[Figure: K-map and diff w/ ckt for each fault]

If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of the three faults.

LEC-22:

7.3.2

Choosing Test Vectors (Smith 14.3.7)

7.3.2

Choosing Test Vectors (Smith 14.3.7)

The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.


7.3.2.1 Fault Domination


    fault   eqn     test vectors
1)  L5@1    ab+c    101, 001
2)  L6@1    1       101, 001, 100, 010, 000

[Figure: K-map and diff w/ ckt for each fault]

Any test vector that detects L5@1 will also detect L6@1.

Definition: f1 dominates f2 if any test vector that detects f1 will also detect f2.

L5@1 dominates L6@1. When choosing test vectors we can ignore L6@1 and just include L5@1.

Question: What would happen if we ignored L5@1 and just included L6@1?

Answer: If we chose 100, 010, or 000 as our test vector to detect L6@1, then we would not detect L5@1.
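Domination and equivalence can be checked mechanically by comparing sets of detecting test vectors. The sketch below assumes the circuit of the day is z = ab + bc with the net-fault locations suggested by the equations in these notes (L1 = a, L2 = b stem, L3 = c, L4 and L5 = the two branches of b, L6 and L7 = AND outputs, L8 = OR output); all function names are made up.

```python
from itertools import product

def z(a, b, c, fault=None):
    """z = ab + bc with an optional single stuck-at fault, e.g. ("L5", 1)."""
    def net(name, value):
        return fault[1] if fault and fault[0] == name else value
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)            # fanout branches of b
    l6, l7 = net("L6", l1 & l4), net("L7", l5 & l3)  # AND gate outputs
    return net("L8", l6 | l7)                        # OR gate output

def detects(fault):
    """Set of test vectors 'abc' whose output differs from the good circuit."""
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if z(a, b, c) != z(a, b, c, fault)}

def dominates(f1, f2):
    """Definition used in these notes: every vector detecting f1 detects f2."""
    return detects(f1) <= detects(f2)

def equivalent(f1, f2):
    """Both faults are detected by exactly the same set of test vectors."""
    return detects(f1) == detects(f2)

print(sorted(detects(("L5", 1))), sorted(detects(("L6", 1))))
print(dominates(("L5", 1), ("L6", 1)), dominates(("L6", 1), ("L5", 1)))
```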


7.3.2.2 Fault Equivalence


    fault   eqn
1)  L1@1    b
2)  L3@1    b

[Figure: K-map and diff w/ ckt for each fault]

The two faults above are equivalent.

Definition: f1 is equivalent to f2 if f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.

When choosing test vectors we can ignore one of the faults and just include the other.


7.3.2.3 Gate Collapsing


A 1 on the input to an OR gate will force the output to be 1. A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.

By looking at the functionality of a gate, we can find equivalent faults.

Definition: Gate collapsing is the technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.

Sets of collapsible faults for common gates:

- AND: @0 on either input, @0 on the output
- OR: @1 on either input, @1 on the output

Question: What is the set of collapsible faults for a NAND gate?


7.3.2.4 Node Collapsing


When two segments affect the same set of gates (ignoring any gates between the two segments), then faults on the two segments can be collapsed.

With an inverter or buffer, the segment on the input affects the same gates as the output. Therefore, faults on the input and output segments are equivalent.

Sets of collapsible faults for nodes:

[Figure: collapsible faults (@0 and @1) on the input and output segments, labelled NOT-1 and NOT-0]

With the net-fault model, which is the one we are using in E&CE 427, inverters and buffers are the only gates where node collapsing is relevant. With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.


7.3.2.5 Fault Collapsing Summary


When calculating the test vectors to detect a set of faults, apply the fault collapsing techniques of:

- gate collapsing
- node collapsing
- general fault equivalence (intelligent collapsing)
- fault domination

to reduce the number of faults that you must examine.

LEC-22:

7.3.3

Fault Coverage

14

7.3.3

Fault Coverage

Definition: Fault coverage is the percentage of detectable faults that are detected by a set of test vectors.

$\mathit{FaultCoverage} = \dfrac{\mathit{DetectedFaults}}{\mathit{DetectableFaults}}$

NOTE: In Smith's book, undetectable faults don't hurt your coverage. This is not universally true. Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.

LEC-22:

7.3.4

Generate Test Vectors for 100% Coverage

15

7.3.4 Generate Test Vectors for 100% Coverage


In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day.

We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient. The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research. A trendy idea is to use Genetic Algorithms (inspired by how DNA works) to generate test vectors that catch the maximum number of faults. The classic algorithm is the D algorithm invented by Roth in 1966 (Smith 14.5.1, 14.5.2). An enhanced version is the Path-Oriented D Algorithm (PODEM), which supports reconvergent fanout and was developed by Goel in 1981 (Smith 14.5.3).

Example Circuit with Fault Locations and Karnaugh Map

[Figure: circuit with inputs a, b, c and fault locations L1 to L8, with its Karnaugh map (cubes ab and bc)]


7.3.4.1 Collapse the Faults


Potential Fault Locations


Initial circuit with potential faults: each of the locations L1 to L8 can be stuck-at 0 or stuck-at 1 (L1@0,1 through L8@0,1).

[Figure: circuit with inputs a, b, c and all potential faults marked]

Gate collapsing gives three sets of equivalent faults:

- L1@0, L4@0, L6@0
- L3@0, L5@0, L7@0
- L6@1, L7@1, L8@1


Node Collapsing
Node collapsing: none applicable (no inverters or buffers).

Remaining faults: L1@1, L2@0, L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1

[Figure: circuit with the remaining faults marked]


Intelligent Collapsing
L2@0, L8@0: Both L2@0 and L8@0 result in the equation 0.

L1@1, L3@1: Both L1@1 and L3@1 result in the equation b.

Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1

[Figure: circuit with the remaining faults marked]


7.3.4.2 Check for Fault Domination


    fault   eqn     domination
1)  L2@1    a+c     dominated by L4@1, L5@1
2)  L3@1    b
3)  L4@1    a+bc
4)  L5@1    ab+c
5)  L6@0    bc
6)  L7@0    ab
7)  L8@0    0       dominated by L6@0, L7@0
8)  L8@1    1       dominated by L2@1, L3@1, L4@1, L5@1

[Figure: K-map and diff w/ ckt for each fault]


Remove dominated faults


Dominated faults: (L2@1, L8@0, L8@1).


Remaining Faults
    fault   eqn
1)  L3@1    b
2)  L4@1    a+bc
3)  L5@1    ab+c
4)  L6@0    bc
5)  L7@0    ab

[Figure: K-map and diff w/ ckt for each fault]


Remaining Faults
[Figure: circuit with inputs a, b, c, output z, and the remaining faults L3@1, L4@1, L5@1, L6@0, L7@0 marked]


7.3.4.3 Required Test Vectors


If we have any faults that are detected by just one test vector, then we must include that test vector in our suite.

Definition: A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.

Required vectors:

fault   test vector
L3@1    010
L6@0    110
L7@0    011


7.3.4.4 Faults Not Covered by Required Test Vectors


    fault   eqn
1)  L4@1    a+bc
2)  L5@1    ab+c

[Figure: K-map and diff w/ ckt for each fault]

The intersection of the two difference regions is 101. Choosing 101 detects both L4@1 and L5@1. Add 101 to suite of test vectors.

Final set of test vectors is: 010, 110, 011, 101.
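The required vectors and the leftover faults can be recomputed with a small fault simulator. As a sketch under assumptions: the circuit is taken to be z = ab + bc with locations L1 = a, L2 = b stem, L3 = c, L4 and L5 = branches of b, L6 and L7 = AND outputs, L8 = OR output, which is inferred from the equations in these notes; the names are made up.

```python
from itertools import product

def z(a, b, c, fault=None):
    """Assumed circuit z = ab + bc with an optional single stuck-at fault."""
    def net(name, value):
        return fault[1] if fault and fault[0] == name else value
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)
    l6, l7 = net("L6", l1 & l4), net("L7", l5 & l3)
    return net("L8", l6 | l7)

def detects(fault):
    """Set of test vectors 'abc' that expose the given fault."""
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if z(a, b, c) != z(a, b, c, fault)}

# Faults left after collapsing and removing dominated faults.
remaining = [("L3", 1), ("L4", 1), ("L5", 1), ("L6", 0), ("L7", 0)]

# A vector is required if it is the only one that detects some fault.
required = {f: next(iter(detects(f))) for f in remaining if len(detects(f)) == 1}

# Faults not yet covered, and the vectors that would catch all of them at once.
uncovered = [f for f in remaining if not detects(f) & set(required.values())]
common = set.intersection(*(detects(f) for f in uncovered))

print(required)
print(uncovered, common)
```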


7.3.4.5 Order to Run Test Vectors


The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults. Build a table for which faults each test vector will detect.

Test Vector

     fault   110  010  011  101
 1)  L1@0     1    .    .    .
 2)  L1@1     .    1    .    .
 3)  L2@0     1    .    1    .
 4)  L2@1     .    .    .    1
 5)  L3@0     .    .    1    .
 6)  L3@1     .    1    .    .
 7)  L4@0     1    .    .    .
 8)  L4@1     .    .    .    1
 9)  L5@0     .    .    1    .
10)  L5@1     .    .    .    1
11)  L6@0     1    .    .    .
12)  L6@1     .    1    .    1
13)  L7@0     .    .    1    .
14)  L7@1     .    1    .    1
15)  L8@0     1    .    1    .
16)  L8@1     .    1    .    1

Faults detected: 110 detects 5, 010 detects 5, 011 detects 5, 101 detects 6.


101 detects the most faults, so we should run it first. This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101). This leaves 110 and 011 with 5 faults each. We can run them in either order, then run 010. We settle on a final order for our test suite of: 101, 011, 110, 010.


7.3.4.6 Summary of Technique to Find and Order Test Vectors


1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
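The ordering step (step 8) can be sketched as a greedy loop that repeatedly picks the vector catching the most not-yet-detected faults. The circuit model is an assumption (z = ab + bc with the eight net-fault locations inferred from the equations in these notes), and ties are broken arbitrarily, so 110 and 011 may come out in either order.

```python
from itertools import product

def z(a, b, c, fault=None):
    """Assumed circuit z = ab + bc with an optional single stuck-at fault."""
    def net(name, value):
        return fault[1] if fault and fault[0] == name else value
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)
    l6, l7 = net("L6", l1 & l4), net("L7", l5 & l3)
    return net("L8", l6 | l7)

def detected_by(tv, fault):
    """True if test vector 'abc' gives a wrong output when `fault` is present."""
    a, b, c = (int(ch) for ch in tv)
    return z(a, b, c) != z(a, b, c, fault)

all_faults = [(f"L{i}", v) for i in range(1, 9) for v in (0, 1)]
suite = ["010", "110", "011", "101"]
covers = {tv: {f for f in all_faults if detected_by(tv, f)} for tv in suite}

order, found = [], set()
while len(order) < len(suite):
    # Pick the vector that adds the most faults not already found.
    best = max((tv for tv in suite if tv not in order),
               key=lambda tv: len(covers[tv] - found))
    order.append(best)
    found |= covers[best]

print(order, len(found), "of", len(all_faults), "faults detected")
```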


7.3.4.7 Complete Analysis


In case you don't trust the fault collapsing analysis, here's the complete analysis.

     fault   eqn     notes
 1)  L1@0    bc
 2)  L1@1    b
 3)  L2@0    0       dominated by 1, 5
 4)  L2@1    a+c     dominated by 8, 10
 5)  L3@0    ab
 6)  L3@1    b       same as 2
 7)  L4@0    bc      same as 1
 8)  L4@1    a+bc
 9)  L5@0    ab      same as 5
10)  L5@1    ab+c
11)  L6@0    bc      same as 1
12)  L6@1    1       dominated by 8, 10
13)  L7@0    ab      same as 5
14)  L7@1    1       same as 12
15)  L8@0    0       same as 3
16)  L8@1    1       same as 12

[Figure: K-map and diff w/ ckt for each distinct equation]

LEC-22:

7.3.5

One Fault Hiding Another

31

7.3.5

One Fault Hiding Another

[Figure: circuit with inputs a, b, c and fault locations L1 to L8]

Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.

[Figure: the circuit (inputs a, b, c, output z) with only fault locations L1 and L3 marked]


Fault Hiding
[Figure: the circuit (inputs a, b, c, output z) with fault locations L1 and L3 marked]

Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0. In the presence of other faults, the set of test vectors to detect a fault will change.

fault(s)        eqn
L3@0            ab
L1@1, L3@0      b

[Figure: K-map and diff w/ ckt for each row]
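Fault hiding can be reproduced by allowing more than one stuck-at fault in the simulation. This sketch again assumes the circuit z = ab + bc with L1 = input a and L3 = input c (inferred from the equations in these notes); the dictionary-based fault format is made up for illustration.

```python
from itertools import product

def z(a, b, c, faults=None):
    """Assumed circuit z = ab + bc; `faults` maps a location to its stuck value."""
    faults = faults or {}
    def net(name, value):
        return faults.get(name, value)
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)
    l6, l7 = net("L6", l1 & l4), net("L7", l5 & l3)
    return net("L8", l6 | l7)

def detects(faults):
    """Test vectors 'abc' whose output differs from the fault-free circuit."""
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if z(a, b, c) != z(a, b, c, faults)}

alone = detects({"L3": 0})             # L3@0 on its own
hidden = detects({"L1": 1, "L3": 0})   # L3@0 together with L1@1

print(alone, hidden)
```

Under these assumptions the vector that catches L3@0 alone no longer produces a wrong output once L1@1 is also present.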

LEC-23 Preliminaries

LEC-23: Built In Self Test


Lecture Notes Sections: 7.4 to 7.4.10

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter


Purpose and List of Concepts


The purpose of this lecture is to connect the theory of testing and testability to the technique of built in self test (BIST). We'll also relate gate-level circuit behaviour to Galois theory, which is a field of mathematics used in information theory (encryption, compression, etc). A meta-level lesson here is that advanced mathematical concepts can sometimes be used to invent new types of circuits, or to better understand existing circuits. Finally, we see that theories created long before the advent of computers are often applied in computing theory.

- linear feedback shift register (LFSR)
- built-in self test (BIST)
- characteristic polynomials
- Galois fields

LEC-23:

7.4

BUILT IN SELF TEST (SMITH 14.7)

7.4

Built In Self Test (Smith 14.7)

LEC-23:

7.4.1

Block Diagram

7.4.1

Block Diagram

Generic Testing Circuit

[Figure: block diagram with mode, test generator, inputs d(0..3), i_data(0..3), circuit under test, outputs o_data(0..2), result checker, and all_ok]

Circuit in Normal Mode

[Figure: the same block diagram in normal mode]

Circuit in Test Mode

[Figure: the same block diagram in test mode]

Circuit with BIST

[Figure: block diagram with mode, test gen LFSR (test generator), inputs d(0..3), i_data(0..3), circuit under test, outputs o_data(0..2), signature analyzers 0 to 2 producing ok(0..2), result checker, and all_ok]

BIST in Normal Mode

[Figure: the same block diagram in normal mode]

BIST in Test Mode

[Figure: the same block diagram in test mode]


7.4.1.1 Components
There is one test generator per group of inputs (or internal flops) that drive the same circuit to be tested. There is one signature analyzer per output (or internal flop).

NOTE (MISR): An exception to the above rule is a multiple input signature register (MISR), which can be used to analyze several outputs of the circuit under test. (Smith 14.7.7)

The test generator and signature analyzer are both built with linear-feedback shift registers.


Test generator

- generates a pseudo-random set of test vectors
- for n output bits, generates all vectors from 1 to $2^n - 1$ in a pseudo-random order
- built with a linear-feedback shift register (the shift-register portion is the input flops)

Signature analyzer

- checks that the output it is examining has the correct results for the complete set of tests that are run
- only has a meaningful result at the end of the entire test sequence
- built with a linear-feedback shift register
- similar to a hash function or a lossy compression function
- if there are no faults, the signature analyzer will definitely say ok (no false negatives)
- if there is a fault, the signature analyzer might say ok or might say bad (false positives are possible)
- design tradeoff: more accurate signature analyzers require more hardware

Result checker

- signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
- the result checker looks at test vector inputs to detect the end of the test suite and outputs all_ok if all signature analyzers report ok at that moment
- implemented as an AND gate

7.4.1.2 Linear Feedback Shift Register (LFSR)

Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.

Design parameters:
- number of flip-flops
- external or internal XOR
- feedback taps (coefficients)
- external-input or self-contained
- reset or set

LFSR Examples

[Figure: 3-flop LFSR (d0/q0, d1/q1, d2/q2). External-XOR, input, reset.]

[Figure: 3-flop LFSR. External-XOR, no input, set.]

[Figure: 3-flop LFSR. Internal-XOR, input, set.]

[Figure: 3-flop LFSR. Internal-XOR, input, reset.]

LFSRs in E&CE 427


In E&CE 427, we'll use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields. External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated as Galois fields.

7.4.1.3 Maximal-Length LFSR


Definition (maximal-length linear feedback shift register): An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.

Definition (pseudo-random): The same elements in the same order every time, but the relationship between consecutive elements is apparently random.

Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.

Maximal-Length LFSR Circuits


The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.

[Figure: maximal-length internal-XOR LFSR, first configuration (d0/q0, d1/q1, d2/q2, set).]

[Figure: maximal-length internal-XOR LFSR, second configuration (d0/q0, d1/q1, d2/q2, set).]
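As a rough illustration of the maximal-length definition, the following Python sketch simulates a self-contained internal-XOR LFSR and collects the states it visits before the all-1s state recurs. The function name and the tap encoding (a set of flop indices whose data input is XORed with the feedback from the last flop) are assumptions made for this sketch, not notation from the notes. With 3 flops, each of the two single-tap configurations visits all 7 non-zero states, while tapping both positions does not.

```python
def lfsr_states(taps, n_flops):
    """Return the states visited by a self-contained internal-XOR LFSR.

    A state is the tuple (q0, ..., q[n-1]); all flops are set to 1 initially.
    Flop 0 takes the feedback from the last flop directly. For each index k
    in `taps`, flop k takes q[k-1] XOR feedback; other flops take q[k-1].
    (This tap encoding is an assumption of the sketch.)
    """
    start = (1,) * n_flops
    state, seen = start, []
    while True:
        seen.append(state)
        fb = state[-1]  # output of the last flop
        state = tuple(
            fb if k == 0 else state[k - 1] ^ (fb if k in taps else 0)
            for k in range(n_flops)
        )
        if state == start:  # the update is invertible, so the start recurs
            return seen


for taps in ({1}, {2}, {1, 2}):
    print(sorted(taps), len(lfsr_states(taps, 3)))
```

The period is 7 for taps {1} and for taps {2}, and only 4 when both taps are present, which is one way to see why only two 3-flop configurations are maximal-length.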

Maximal Length LFSR Characteristics


Maximal-length LFSRs:
- are set to all 1s initially
- are self-contained (no external i input)

[Figure: timing diagram for a maximal-length LFSR, showing reset, clk, d0, q0, d1, q1, q2, and val.]

Maximal-Length LFSR Timing Diagram


[Figure: timing diagram for a 3-flop maximal-length LFSR over clock cycles 1 to 8, showing reset, clk, d0, q0, d1, q1, q2, and val. The values listed for val are 7, 6, 4, 1, 2, 5, 3, 7, 6.]

7.4.2 Test Generator

The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input of each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.

[Figure: a maximal-length internal-XOR LFSR (d0/q0, d1/q1, d2/q2, set).]

[Figure: a test generator: maximal-length internal-XOR LFSR with muxes on the data inputs (mode, i_d(0)..i_d(2), set, q0..q2).]

[Figure: a test generator, reset not shown (mode, i_d(0)..i_d(2), d0..d2, q0..q2).]

7.4.3 Signature Analyzer

There are four things that change between different signature analyzers:
- number of flops (more flops means more area and more accuracy)
- choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
- bubbles on input to AND gate for ok: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer

Signature Analyzer

[Figure: two-flop signature analyzer (reset, input i, d0/q0, d1/q1) with an AND gate producing ok.]

This circuit:
- Two flops; most analyzers use more (the HP boards in the 1970s used 37 flops!)
- Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
- Also contains the ok tester (AND gate).
- Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of the bubble on the AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)

Signature Analyzer Timing

Timing for the two-flop analyzer after reset, with input bits i6 (first) through i0 (last). Shorthand: 356 = i3 ⊕ i5 ⊕ i6, 2356 = i2 ⊕ i3 ⊕ i5 ⊕ i6, etc.

cycle   i     d0       q0       d1       q1
1       i6    i6       0        0        0
2       i5    i5       i6       i6       0
3       i4    i4⊕i6    i5       i5⊕i6    i6
4       i3    356      i4⊕i6    i4⊕i5    i5⊕i6
5       i2    245      356      346      i4⊕i5
6       i1    1346     245      2356     346
7       i0    02356    1346     1245     2356
8       -     -        02356    -        1245
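The entries in the timing diagram can be checked mechanically. The sketch below assumes the wiring that the diagram implies, namely d0 = i XOR q1 and d1 = q0 XOR q1 with both flops reset to 0; the figure itself is not reproduced here, so treat that wiring and the function name as assumptions. Each input bit i_k is represented symbolically by the set {k}, and XOR becomes symmetric difference.

```python
def run_analyzer(inputs, zero=0):
    """Clock a two-flop analyzer once per input and return the final (q0, q1).

    Assumed wiring: d0 = i XOR q1 and d1 = q0 XOR q1, both flops reset to zero.
    `^` is XOR on ints and symmetric difference on frozensets, so the same
    code runs on concrete bits or on symbolic sets of input indices.
    """
    q0 = q1 = zero
    for i in inputs:
        q0, q1 = i ^ q1, q0 ^ q1
    return q0, q1


# Symbolic run: input i_k is the set {k}; i6 arrives first, i0 last.
symbolic = [frozenset({k}) for k in range(6, -1, -1)]
q0, q1 = run_analyzer(symbolic, zero=frozenset())
print(sorted(q0), sorted(q1))  # prints [0, 2, 3, 5, 6] [1, 2, 4, 5]

# Concrete run on the bit stream 1,0,1,1,1,0,0 (i6 first).
print(run_analyzer([1, 0, 1, 1, 1, 0, 0]))  # prints (1, 0)
```

The symbolic result matches the last column of the diagram: q0 ends as 02356 and q1 ends as 1245.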

7.4.4 Result Checker

The purpose of the result checker is to check the ok signals at the end of the test sequence. To do this, we need to recognize the end of the test sequence. The simplest way to do this is to notice that the first test vector is all 1s and that the test vector sequence will repeat as long as the circuit is in test mode.

We want to sample the ok signal one clock cycle after the sequence is over. This is the same as the first clock cycle of the second test sequence. In this clock cycle, the output of the test generator will be all 1s and reset will be 0. We need to look at reset, because otherwise we could not distinguish the first sequence (when reset is 1) from the subsequent sequences.

[Figure: result checker. Inputs reset, q0, q1, q2, and ok; output all_ok.]
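The condition described for the result checker can be written as a single Boolean expression. This is a minimal sketch, assuming active-high signals and that all_ok should be 1 exactly when reset is 0, the test generator outputs are all 1s, and every signature analyzer reports ok; the function name and argument layout are hypothetical.

```python
def all_ok(reset, generator_q, ok_signals):
    """Result checker sketch: sample ok only when the test sequence restarts.

    reset       -- 1 during the first all-1s vector, 0 afterwards
    generator_q -- outputs of the test generator flops (q0, q1, q2, ...)
    ok_signals  -- ok outputs of the signature analyzers
    """
    sequence_restarted = (reset == 0) and all(q == 1 for q in generator_q)
    return int(sequence_restarted and all(ok == 1 for ok in ok_signals))


print(all_ok(1, (1, 1, 1), (1, 1, 1)))  # first sequence, reset still 1: 0
print(all_ok(0, (1, 1, 1), (1, 1, 1)))  # end of sequence, all analyzers ok: 1
print(all_ok(0, (1, 1, 1), (1, 0, 1)))  # one analyzer reports bad: 0
```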

7.4.5 Arithmetic over Binary Fields

Galois Fields!
- Two operations: $\oplus$ and $\otimes$
- Two values: 0 and 1

Addition

$\oplus$ represents XOR.

expression        result
$0 \oplus 0$      0
$0 \oplus 1$      1
$1 \oplus 0$      1
$1 \oplus 1$      0
$x \oplus x$      0

Multiplication

$\otimes$ represents concatenating shift registers.

expression           result
$x^4 \otimes 1$      $x^4$
$x^2 \otimes x^3$    $x^5$

Example

Calculate $(x^3 \oplus x^2 \oplus 1) \otimes (x^2 \oplus x)$:

$$
\begin{aligned}
(x^3 \oplus x^2 \oplus 1) \otimes (x^2 \oplus x)
  &= (x^5 \oplus x^4 \oplus x^2) \oplus (x^4 \oplus x^3 \oplus x) \\
  &= x^5 \oplus x^3 \oplus x^2 \oplus x
\end{aligned}
$$
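The addition and multiplication rules above can be tried out with a short Python sketch. It assumes a polynomial is stored as an integer whose bit k is the coefficient of x^k; that encoding and the function names are choices made for this illustration only.

```python
def gf2_add(a, b):
    """Add two polynomials over GF(2): coefficient-wise XOR."""
    return a ^ b


def gf2_mul(a, b):
    """Multiply two polynomials over GF(2): shift-and-XOR partial products."""
    result, shift = 0, 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift  # partial product a * x^shift
        shift += 1
    return result


def poly_str(p):
    """Render an integer-encoded polynomial, highest exponent first."""
    terms = []
    for k in range(p.bit_length() - 1, -1, -1):
        if (p >> k) & 1:
            terms.append("1" if k == 0 else "x" if k == 1 else f"x^{k}")
    return " + ".join(terms) if terms else "0"


a = 0b1101  # x^3 + x^2 + 1
b = 0b110   # x^2 + x
print(poly_str(gf2_mul(a, b)))  # prints x^5 + x^3 + x^2 + x
```

The two x^4 terms from the partial products cancel under XOR, which is why x^4 is missing from the product.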

7.4.6 Shift Registers and Characteristic Polynomials (Smith 14.7.5)


Given a linear feedback shift register with $l$ flops, the feedback register can be represented as a polynomial $p(x)$ with maximum exponent $x^l$. The polynomial represents the behaviour of the output of the last flip-flop. The exponent on the variable $x$ represents the number of clock cycles of delay.

From polynomials to hardware:
- The maximum exponent denotes the number of flops
- The other exponents denote the flops that tap off of the feedback line from the last flop

[Figure: six example shift-register circuits with three or four flops, some with an external input i, each paired with its characteristic polynomial $p(x)$.]

See Smith's Fig 14.27 (p. 771), Fig 14.28 (p. 773), and Table 14.11 (p. 774).

7.4.6.1 Circuit Multiplication


Redoing the multiplication example as circuits:

$$
\begin{aligned}
(x^2 \oplus x) \otimes (x^3 \oplus x^2 \oplus 1)
  &= x \otimes (x^3 \oplus x^2 \oplus 1) \;\oplus\; x^2 \otimes (x^3 \oplus x^2 \oplus 1) \\
  &= x^5 \oplus x^3 \oplus x^2 \oplus x
\end{aligned}
$$

The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent in the polynomial. Hence, the MSB of the first partial product cancels the $x^4$ of the second partial product, resulting in a coefficient of 0 for $x^4$ in the answer.

7.4.7 Bit Streams and Characteristic Polynomials

A bit stream, or bit sequence, can be represented as a polynomial. The oldest (first) bit in a sequence of $n$ bits is represented by $x^{n-1}$ and the youngest (last) bit is $x^0$.

The bit sequence 1010011 can be represented as $x^6 \oplus x^4 \oplus x \oplus 1$:

bit     1       0       1       0       0       1       1
term    1x^6    0x^5    1x^4    0x^3    0x^2    1x^1    1x^0
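As a small sketch of the mapping from bit streams to polynomials, the helper below takes a bit string with the oldest bit first and lists the terms whose coefficient is 1. The function name is hypothetical and the output format is just one convenient rendering.

```python
def bits_to_poly(bits):
    """Convert a bit string (oldest bit first) into a polynomial string.

    In a sequence of n bits, the oldest bit is the coefficient of x^(n-1)
    and the youngest bit is the coefficient of x^0.
    """
    n = len(bits)
    terms = []
    for position, bit in enumerate(bits):
        if bit == "1":
            k = n - 1 - position
            terms.append("1" if k == 0 else "x" if k == 1 else f"x^{k}")
    return " + ".join(terms) if terms else "0"


print(bits_to_poly("1010011"))  # prints x^6 + x^4 + x + 1
```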

7.4.8 Division

With rules for multiplication and addition, we can define division. A fundamental theorem of division defines $q$ and $r$ to be the quotient and remainder, respectively, of $m \div p$ iff:

$$m(x) = q(x) \otimes p(x) \oplus r(x)$$

Long Division

In Galois fields, we do division just as with long division in elementary school.

Given:

$m(x) = x^6 \oplus x^4 \oplus x^3$
$p(x) = x^4 \oplus x$

Calculate the quotient $q(x)$ and remainder $r(x)$ for $m(x) \div p(x)$:

                   x^2           + 1
x^4 + x )  1x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0
           1x^6               + 1x^3
           -------------------------
                         1x^4
                         1x^4               + 1x^1
                         -------------------------
                                              1x^1

Quotient:   $q(x) = x^2 \oplus 1$
Remainder:  $r(x) = x$

Long Division (Check)

Check result:

$$
\begin{aligned}
q(x) \otimes p(x) \oplus r(x)
  &= (x^2 \oplus 1) \otimes (x^4 \oplus x) \oplus x \\
  &= (x^6 \oplus x^4 \oplus x^3 \oplus x) \oplus x \\
  &= x^6 \oplus x^4 \oplus x^3 = m(x)
\end{aligned}
$$

The mathematics for an LFSR without an input i:
- same polynomial as if the circuit had an input
- input sequence is all 0s
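The long division and its check can be reproduced with a short Python sketch. As before, polynomials are stored as integers whose bit k is the coefficient of x^k; that encoding and the function names are assumptions of this illustration.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials (shift-and-XOR)."""
    result, shift = 0, 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift
        shift += 1
    return result


def gf2_divmod(m, p):
    """Long division over GF(2): return (quotient, remainder) of m / p."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    q = 0
    while m.bit_length() >= p.bit_length():
        shift = m.bit_length() - p.bit_length()
        q ^= 1 << shift   # next quotient term x^shift
        m ^= p << shift   # subtract (XOR) p * x^shift from the dividend
    return q, m


m = 0b1011000  # x^6 + x^4 + x^3
p = 0b10010    # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))           # prints 0b101 0b10, i.e. x^2 + 1 and x
print(gf2_mul(q, p) ^ r == m)   # prints True
```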

7.4.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a message, $m(x)$, which is a sequence of $n$ bits represented as a polynomial. After $n$ shifts through an LFSR with $l$ flops:

$$m(x) = q(x) \otimes p(x) \oplus r(x)$$

- The sequence of output bits forms a quotient, $q(x)$, of length $n$
- The flops in the analyzer form a remainder, $r(x)$, of length $l$

The remainder is the signature.

Input Streams and Error Polynomials

An input stream with an error can be represented as $m(x) \oplus e(x)$:

$$m(x) \oplus e(x) = q'(x) \otimes p(x) \oplus r'(x)$$

- $e(x)$ is the error polynomial
- bits in the message that are flipped have a coefficient of 1 in $e(x)$

Input Streams and Error Polynomials


The error $e(x)$ will be detected if it results in a different signature (remainder).

$m(x)$ and $m(x) \oplus e(x)$ will have the same remainder iff $e(x) \bmod p(x) = 0$. That is, for an error to go undetected, $e(x)$ must be a multiple of $p(x)$. The larger $p(x)$ is, the smaller the chances that $e(x)$ will be a multiple of $p(x)$.
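The aliasing condition can be checked exhaustively for a small case. The sketch below assumes the integer encoding of polynomials (bit k is the coefficient of x^k) and uses the hypothetical choices p(x) = x^2 + x + 1 and a 7-bit message; it compares remainders only and does not model how the remainder is laid out in the analyzer flops.

```python
def gf2_mod(m, p):
    """Remainder of m divided by p over GF(2)."""
    while m.bit_length() >= p.bit_length():
        m ^= p << (m.bit_length() - p.bit_length())
    return m


p = 0b111        # p(x) = x^2 + x + 1 (example choice)
m = 0b1011100    # a 7-bit message (example choice)

undetected = [
    e for e in range(1, 2 ** 7)
    if gf2_mod(m ^ e, p) == gf2_mod(m, p)
]

# Every undetected error is a multiple of p(x), and vice versa.
print(all(gf2_mod(e, p) == 0 for e in undetected))
print(len(undetected), "of", 2 ** 7 - 1, "non-zero error patterns go undetected")
```

With a degree-2 polynomial, roughly one error pattern in four aliases to the correct signature, which illustrates why practical analyzers use more flops.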

7.4.10 Summary

Adding Test Circuitry

1. Pick number of flops for generator
2. Build generator (maximal-length linear feedback shift register)
3. Pick number of flops for signature analysis
4. Pick coefficients (feedback taps) for analyzer
5. Based on generator, circuit under test, and signature analyzer, determine expected output of analyzer
6. Based on expected output of analyzer, build result checker

Running Test Vectors

1. Put circuit in test mode
2. Set reset = 1
3. Run one clock cycle, set reset = 0
4. Run one clock cycle for each test vector
5. At end of test sequence, all ok signals should be 1
6. To run $n$ test vectors requires $n + 1$ clock cycles.

LEC-24 Preliminaries

LEC-24: Scan Testing (JTAG)


Lecture Notes Sections: 7.5 to 7.7.3

University of Waterloo, Dept of Electrical and Computer Engineering
E&CE 427 Digital Systems Engineering, 2003t1Winter

Schedule

wk-01 to 02    VHDL
wk-03 to 05    Design and Optimization
wk-06          Functional Validation
wk-07          Performance Analysis, Prediction, and Optimization
wk-08 to 09    Timing Analysis
wk-09 to 10    Power Analysis and Reduction
wk-11 to 12    Faults and Testing
                 Lec-21 Introduction
                 Lec-22 Detection and Test Generation
                 Lec-23 Built-In Self Test (BIST)
                 Lec-24 Scan Testing (JTAG)
wk-13          Review

Purpose and List of Concepts

The purpose of this lecture is to connect the theory of testing and testability to the current techniques of scan testing and the IEEE Standard 1149.1 (aka JTAG).

- scan testing
- scan chain
- testing procedure
- time to run a test
- boundary scan testing
- JTAG
- IEEE 1149
- length of time to do a scan test
- hardware to do scan testing

7.5 Scan Testing in General (Smith 14.6)

7.5.1 Structure and Behaviour of Scan Testing


[Figure: normal circuit. Labels: another circuit #0, another circuit #1, yet another circuit, circuit under test, data_in(0)..data_in(3), zeta_in(0)..zeta_in(3).]

[Figure: circuit with scan chains added. Labels: another circuit, scan chain 0 (mode0, scan_in0, scan_out0), circuit under test, scan chain 1 (mode1, scan_in1, scan_out1).]

7.5.2 Scan Chains

[Figure: circuit under test with two scan chains. Labels: mode0, scan_in0, data_in(0)..data_in(3), scan_out0; mode1, scan_in1, zeta_in(0)..zeta_in(3), scan_out1.]

7.5.2.1 Circuitry in Normal Mode


[Figure: scan chains around the circuit under test in normal mode (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1).]

Scan Mode

[Figure: scan chains around the circuit under test in scan mode (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1).]

7.5.2.2 Scan in Operation


[Figure: circuit under test with scan chains. Labels: another circuit, scan chain 0 (mode0, scan_in0, scan_out0), circuit under test, scan chain 1 (mode1, scan_in1, scan_out1), yet another circuit.]

From Test Vector to Results

[Figure: timing diagram of clk, mode0, scan_out0, scan_in0, scan_out1, and scan_in1, with current vector0 on scan_in0 and current results1 on scan_out1.]

The figures for this sequence show, in order:
1. Load Test Vector: current vector0 is shifted in on scan_in0
2. Run Test Vector Through Circuit
3. Unload Test Vector: current results1 are shifted out on scan_out1

Optimization: Unload and Load at Same Time

1. Unload Prev and Load Current: current vector0 and current vector1 are shifted in on scan_in0 and scan_in1 while previous results0 and previous results1 are shifted out on scan_out0 and scan_out1
2. Run Tests
3. Unload Current and Load Next: next test vector0 and next test vector1 are shifted in while current results0 and current results1 are shifted out

Behaviour of Scan Testing

[Figure: behaviour of scan testing. Timing diagram of clk, mode0, scan_out0, scan_in0, scan_out1, and scan_in1: previous results are unloaded while the current vectors are loaded, then current results are unloaded while the next test vectors are loaded.]

7.5.2.3 Scan in Operation with Example Circuit

[Figure: circuit under test, with signals a, b, c, d, y, and z.]

[Figure: circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1).]

The remaining figures step through one test on this circuit, showing clk and mode0 at each step:
1. Start Loading Test Vector (Load)
2. Load
3. Load
4. Load
5. Run Test Vector
6. Test Values Propagate
7. Flop-In Result, Start (Un)loading Test Vector
8. Continue (Un)loading Test Vector
9. Continue (Un)loading Test Vector
10. Finish (Un)loading Test Vector
11. Run Next Test Vector

7.5.3 Summary of Scan Testing

Adding scan circuitry
1. Registers around circuit to be tested are grouped into scan chains
2. Replace each flop with mux + flop
3. Flops and muxes wired together into scan chains
4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors

Running test vectors
1. Put scan chain in scan mode
2. Load in test vector (one element of vector per clock cycle)
3. Put scan chain in normal mode
4. Run circuit for one clock cycle to load the result of the test into the flops
5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)
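The procedure above can be mimicked in software. The following is a simplified sketch, not a model of any particular chip: it assumes one scan chain driving the inputs of the circuit under test and a second chain of the same length capturing its outputs, and the class and function names are hypothetical. It also counts clock cycles so the total can be compared with the load/run/unload steps.

```python
class ScanChain:
    """A chain of flops with a mux on each data input (behavioural sketch)."""

    def __init__(self, length):
        self.flops = [0] * length

    def shift(self, scan_in):
        """Scan mode, one clock cycle: returns the bit leaving on scan_out."""
        scan_out = self.flops[-1]
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out

    def capture(self, values):
        """Normal mode, one clock cycle: flops load their normal data inputs."""
        self.flops = list(values)


def run_scan_tests(circuit, vectors, n):
    """Run test vectors through `circuit` (n bits in, n bits out).

    Returns (results, cycles). Results of one vector are unloaded while the
    next vector is loaded.
    """
    if not vectors:
        return [], 0
    in_chain, out_chain = ScanChain(n), ScanChain(n)
    results, cycles = [], 0
    for index, vector in enumerate(vectors):
        unloaded = []
        for bit in reversed(vector):              # n cycles in scan mode
            in_chain.shift(bit)
            unloaded.append(out_chain.shift(0))
            cycles += 1
        if index > 0:                             # nothing to unload at first
            results.append(unloaded[::-1])
        out_chain.capture(circuit(in_chain.flops))  # 1 cycle in normal mode
        cycles += 1
    results.append([out_chain.shift(0) for _ in range(n)][::-1])
    cycles += n                                   # unload the final result
    return results, cycles


# Hypothetical 3-input, 3-output circuit under test.
example = lambda q: [q[0] & q[1], q[1] ^ q[2], q[0] | q[2]]
print(run_scan_tests(example, [[1, 1, 0], [0, 1, 1]], 3))
```

For 2 vectors and a chain of length 3 the sketch uses 2 × (3 + 1) + 3 = 11 clock cycles.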

7.5.4 Example: Time to Test a Chip

An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.

Question: Calculate the total test time.

Answer: We can load and unload all of the scan chains at the same time, so time will be limited by the longest (22,000 bits). For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

$$
\begin{aligned}
\text{TimeTot} &= \text{ClockPeriod} \times \big(\text{MaxLengthVec} + \text{NumVecs} \times (\text{MaxLengthVec} + 1)\big) \\
  &= \frac{1}{0.80 \times 800 \times 10^{6}} \times \big(22{,}000 + 500{,}000 \times (22{,}000 + 1)\big) \\
  &\approx 17\ \text{secs}
\end{aligned}
$$
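The arithmetic in this example is easy to reproduce. The sketch below uses the cycle count NumVecs × (MaxLengthVec + 1) + MaxLengthVec; the function names are hypothetical.

```python
def scan_test_cycles(num_vectors, scan_length):
    """Clock cycles to run a test suite through a scan chain.

    Each vector costs scan_length cycles to load (overlapped with unloading
    the previous result) plus 1 cycle to run; the final result costs another
    scan_length cycles to unload.
    """
    return num_vectors * (scan_length + 1) + scan_length


def scan_test_seconds(chain_lengths, num_vectors, clock_hz, speed_fraction):
    """Total test time when all chains are loaded and unloaded in parallel."""
    cycles = scan_test_cycles(num_vectors, max(chain_lengths))
    return cycles / (clock_hz * speed_fraction)


chains = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
seconds = scan_test_seconds(chains, 500_000, 800e6, 0.80)
print(f"{seconds:.1f} s")  # prints 17.2 s
```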

7.6 Boundary Scan

- Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).
- The goal was to replace bed-of-nails style testing with a technique that would work for high-density PCBs (lots of small wires close together).
- Now used to test both boards and chip internals.
- Used both on boundaries (I/O pins) and internal flops.

Boundary Scan with JTAG

Standardized by IEEE (1149) and previously by JTAG:
- 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
- 1 optional signal (Scan Pin: TRST)
- protocol to connect circuit under test to tester and other circuits
- state machine to drive test circuitry on chip
- Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones.

Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.

7.6.1 Boundary Scan History

1985    JETAG: Joint European Test Action Group
1986    JTAG (North American companies joined)
1990    JTAG 2.0 formed basis for IEEE 1149.1, "Test access port and boundary scan architecture"

7.6.2 Scan Pins

TDI     test data input: input test vector to chip
TDO     test data output: output result of test
TCK     test clock: clock signal that test runs on
TMS     test mode select: controls scan state machine
TRST    test reset (optional): resets the scan state machine

Overview

[Figure: chip overview. Labels: normal input pins, scan registers, circuit under test, normal output pins, control block, TDI, TCK, TMS, TDO.]

Expanded View

[Figure: expanded view of the chip. Labels: BSR (a chain of BSCs around the circuit under test), BR, IR (built from IRCs), instruction decoder, IDCODE, TAP controller, control, TDI, TDO, TCK, TMS.]

7.6.3 Scan Registers and Cells

Basic Building Blocks

TDR                 Test data register: the boundary scan registers on a chip
DR    (Fig 14.2)    Data register cell: often used as a boundary scan cell (BSC)
JTAG Components

           Fig 14.8      Top level diagram
BSR        Fig 14.5      Boundary scan register: a chain of boundary scan cells (BSCs)
BSC        Fig 14.2      Boundary scan cell: connects external input and scan signal to internal circuit. Acts as a wire between external input and internal circuit in normal mode.
BR         Fig 14.3      Bypass-register cell: allows direct connection from TDI to TDO. Acts as a wire when executing the BYPASS instruction.
IDCODE                   Device identification register: data register to hold manufacturer's name and chip identifier. Used in the IDCODE instruction.
IR cell    Fig 14.4      Instruction register cell: cells are combined together as a shift register to form an instruction register (IR)
IR         Fig 14.6      Instruction register: two or more IR cells in a row. Holds data that is shifted in on TDI, sends this data in parallel to the instruction decoder.
IDecode    Table 14.4    Instruction decoder: reads the instruction stored in the instruction register (IR) and sends control signals to the bypass register (BR) and boundary scan register (BSR)
TAP        Fig 14.7      TAP controller: state machine that, together with the instruction decoder, controls the scan circuitry.

7.6.4 Scan Instructions

EXTEST     Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE     Sample result data
PRELOAD    Load test vector
BYPASS     Directly connect TDI to TDO. This is used when several chips are daisy chained together, to skip loading data into some chips.
IDCODE     Output manufacturer and part number

This is the set of required instructions; other instructions are optional.

7.6.5 TAP Controller

The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7.
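Fig 14.7 is not reproduced here. As a hedged sketch, the table below encodes the 16-state TAP controller as it is commonly drawn for IEEE 1149.1, with the next state chosen by the value of TMS on each rising edge of TCK. The state names follow the usual convention, and the helper name tap_walk is hypothetical.

```python
# state: (next state when TMS = 0, next state when TMS = 1)
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def tap_walk(state, tms_bits):
    """Follow a sequence of TMS values (one per TCK edge) from `state`."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


# From reset, TMS = 0,1,0,0 reaches Shift-DR; TMS = 0,1,1,0,0 reaches Shift-IR.
print(tap_walk("Test-Logic-Reset", [0, 1, 0, 0]))
print(tap_walk("Test-Logic-Reset", [0, 1, 1, 0, 0]))
```

One property of this table is that holding TMS at 1 for five clock cycles returns the controller to Test-Logic-Reset from any state, which is why the TRST pin can be optional.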

7.6.6 Other Descriptions of JTAG/IEEE 1149.1

- Texas Instruments introductory seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar1.pdf
- Texas Instruments intermediate seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar2.pdf
- Sun microSPARC-IIep scan-testing documentation: http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/
- Intellitech JTAG overview: http://www.intellitech.com/resources/technology.html
- Actel's JTAG description: http://www.actel.com/appnotes/97s05d15.pdf
- Description of JTAG support on Motorola ColdFire microprocessor: http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf

7.7 Summary and Conclusions on Testing

7.7.1 Faults

Faults are manufacturing defects. Common occurrences are opens (a wire is broken) and shorts (two wires are connected together).

When working with faults, we work with wire segments, not signals. In the circuit below, there are 8 different wire segments (L1 to L8). Each wire segment corresponds to a logically distinct fault location. All physical faults on a segment affect the same set of signals, so they are grouped together into a logical fault.

If a signal has a fanout of 1, then there is one wire segment. A signal with a fanout of $n$, where $n > 1$, has $n + 1$ wire segments: one for the source signal and one for each gate of fanout.

[Figure: example circuit with signals a, b, c, and z, and wire segments L1 to L8.]

For signal b in the circuit here, the fanout is 2, so there are three wire segments (L2, L4, and L5).

Although there are many different bad behaviours that faults can lead to, the simple model of single stuck-at faults has proven very capable of finding real faults in real circuits.

single              assume that at most one wire segment in the circuit has a fault.
stuck-at-0 (s@0)    assume that the faulty behaviour is that the segment is hardwired to 0.
stuck-at-1 (s@1)    assume that the faulty behaviour is that the segment is hardwired to 1.

7.7.2 Testing

Faults are detected by stimulating circuits (the real, manufactured circuit, not a simulation!) with test vectors and checking that the real circuit gives the correct output.

Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical evidence demonstrate that testing a circuit for single stuck-at faults will also detect many other types of faults and will often detect multiple faults.

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit. These redundant parts are added to prevent timing hazards. As such, a stuck-at fault in redundant circuitry will not affect the steady-state behaviour of the circuit, but could allow timing glitches to occur.

If a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no redundant circuitry.

It is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0, or if they have multiple faults. I think, but haven't seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a stuck-at fault that you aren't testing for can mask (hide) a fault that you are testing for.

There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
- generate test vectors
- override normal datapath to send test vectors, rather than normal inputs, as inputs to flops
- compare outputs of flops to expected result

7.7.2.1 Scan Testing


In scan testing, the generation and checking are done off-chip. This has the advantage of flexibility and reduced on-chip hardware, but increases the length of time required to run a test.

We want to individually drive and read every flop in the circuit. Even without using any I/O pins for testing purposes, chips are already I/O bound, so scan testing must be very frugal in its use of pins. Flops are connected together in a scan chain with one input pin and one output pin.

If the length (number of flops) of a scan chain is $n$, then it takes $2n + 1$ clock cycles to run a single test: $n$ clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and $n$ cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

ScanLength    = number of flip-flops in a scan chain
NumVectors    = number of test vectors in test suite
TimeScan      = number of clock cycles to run test suite
              = NumVectors × (ScanLength + 1) + ScanLength

To find a test vector that will detect a fault:
1. build Boolean equation (or Karnaugh map) of correct circuit
2. build Boolean equation (or Karnaugh map) of faulty circuit
3. compare equations (or Karnaugh maps); regions of difference represent test vectors that will detect the fault

Because it takes so much time to perform a scan test, reducing the number of test vectors that are needed is very important.

"fault1 dominates fault2" is defined as: any test vector that will detect fault1 will also detect fault2.
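The three-step recipe for finding a test vector amounts to comparing two truth tables. The sketch below does this by brute force for a small hypothetical circuit, z = (a AND b) OR c, which is not a circuit from these notes; the function name is likewise an assumption.

```python
from itertools import product


def detecting_vectors(good, faulty, n_inputs):
    """Return the input vectors on which the faulty circuit differs from the good one."""
    return [
        vector
        for vector in product((0, 1), repeat=n_inputs)
        if good(*vector) != faulty(*vector)
    ]


# Hypothetical correct circuit: z = (a AND b) OR c
good = lambda a, b, c: (a & b) | c

# Same circuit with the AND-gate output stuck at 0.
and_out_s0 = lambda a, b, c: 0 | c

# Same circuit with input a stuck at 1.
a_s1 = lambda a, b, c: (1 & b) | c

print(detecting_vectors(good, and_out_s0, 3))  # [(1, 1, 0)]
print(detecting_vectors(good, a_s1, 3))        # [(0, 1, 0)]
```

An empty result would mean the fault is undetectable for that circuit.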

Summary of Technique to Find and Order Test Vectors:
1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)

7.7.2.2 Built-In Self Test (BIST)


With built-in self test, the circuit tests itself. Both test vector generation and checking are done using linear feedback shift registers (LFSRs).

The figure below shows an LFSR that generates all possible 3-bit vectors except 000. (An $n$-bit LFSR that generates $2^n - 1$ different vectors is called a maximal-length LFSR.) Assume that reset initializes the circuit to 111. The sequence that is generated is: 111, 011, 001, 100, 010, 101, 110. This sequence is repeated, so the number after 110 is 111.

[Figure: 3-flop LFSR with flops q2, q1, and q0.]

Each linear feedback shift register has a characteristic polynomial that corresponds to the behaviour of the signal that is the input to the first flip-flop in the shift register. The exponents in the polynomial correspond to the delay: $x^0$ is the input to the shift register, $x^1$ is the output of the first flip-flop, $x^2$ is the output of the second, etc. The coefficient is 1 if there's a feedback tap from the output of the flop.

Checking is done by building one signature analyzer circuit for each signal tested. The circuit returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this with complete accuracy would require storing $2^n$ bits of information for each output for a circuit with $n$ inputs. This would be as expensive as the original circuit. So, BIST uses mathematics similar to error correction/detection to approximate whether the outputs are correct. This technique is called signature analysis and originated with Hewlett-Packard in the 1970s.

The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit is designed to output a 1 at the end of the sequence of $2^n - 1$ test results if the sequence of results matches the correct circuit. We could do this with an LFSR of $2^n - 1$ flops, but as said before, this would be at least as expensive as duplicating the original circuit.
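The generator figure is not available here, so the sketch below uses one feedback choice that reproduces the listed sequence: a right-shifting register q2, q1, q0 in which the new q2 is q1 XOR q0. That wiring is an assumption inferred from the sequence, not taken from the figure, and the function name is hypothetical.

```python
def bist_sequence(steps=7, start=(1, 1, 1)):
    """Generate test vectors as strings 'q2q1q0' from a 3-flop LFSR.

    Assumed wiring: q2 <- q1 XOR q0, q1 <- q2, q0 <- q1 on every clock cycle.
    """
    q2, q1, q0 = start
    vectors = []
    for _ in range(steps):
        vectors.append(f"{q2}{q1}{q0}")
        q2, q1, q0 = q1 ^ q0, q2, q1
    return vectors


print(bist_sequence())
# ['111', '011', '001', '100', '010', '101', '110']
```

Running one more step returns to 111, so the sequence has period 7 and covers every 3-bit vector except 000.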

The checking LFSR is designed similarly to a hashing function or parity checking circuit. If it returns 0, then we know that there is a fault in the circuit. If it returns a 1, then there is probably not a fault in the circuit, but we cant say for sure. There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the more ip ops are required. The LFSR here recognizes the sequence 1, 0, 1, 1, 1, 0, 0:

output from circuit under test

q2

It could be used, in conjunction with the maximal-length LFSR above, to detect faults in a circuit that, when stimulated with the sequence with the sequence 111, 011, 001, 100, 010, 101, 110; outputs the sequence 1, 0, 1, 1, 1, 0, 0.

7.7.3 Scan vs Self Test

Scan                               Self Test
less hardware                      more hardware
slower                             faster
well defined coverage              ill defined coverage
test vectors are easy to modify    test vectors are hard to modify


Chapter 8

Review
This chapter is a collection of information covering the major topics of the term. The Topics List section for each major area is meant to be relatively complete. The notes sections are less focused and are not indicative of the relative importance of the different topics we covered.

LEC-25 Preliminaries

LEC-25: Review
Lecture Notes Sections: 8.1-8.9

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

wk-01 to wk-02   VHDL
wk-03 to wk-05   Design and Optimization
wk-06            Functional Validation
wk-07            Performance Analysis, Prediction, and Optimization
wk-08 to wk-09   Timing Analysis
wk-09 to wk-10   Power Analysis and Reduction
wk-11 to wk-12   Faults and Testing
wk-13            Review

8.1 Overview of the Term

The purely digital world
- VHDL
- design process
- optimization techniques
- functional validation
- performance analysis

Analog effects in the digital world
- timing analysis
- power
- faults and testing

Topics and Lectures

Design techniques
  Lec-01  Introduction and overview
  Lec-02  VHDL syntax and synthesis
  Lec-03  VHDL simulation semantics
  Lec-04  Hardware building blocks

Design and optimization techniques
  Lec-05  Dataflow diagrams and high-level models
  Lec-06  State machines
  Lec-07  Memory arrays
  Lec-08  Design example (stack)
  Lec-09  Optimization and coding guidelines
  Lec-10  FPGA-specific optimizations

Functional validation
  Lec-11  Datapath validation
  Lec-12  Control validation

Performance analysis and prediction
  Lec-13  Measuring performance, comparing optimizations
  Lec-14  Digital-circuit performance

Topics and Lectures (2)

Timing analysis
  Lec-15  Definitions, equations, sources of delay
  Lec-16  Math, physics and applications
  Lec-17  Storage

Power
  Lec-18  Power and energy analysis
  Lec-19  Data encoding for power reduction
  Lec-20  Clock gating

Testing and testability
  Lec-21  Faults; fault models; testability
  Lec-22  Fault detection and test vector generation
  Lec-23  Built-in self test (I)
  Lec-24  Built-in self test (II)

8.2 VHDL

Simple syntax and semantics
- things that you should know simply by having done the miniproject and project

Synthesizing VHDL
- match up VHDL code with hardware
- choose VHDL fragment to generate more optimal hardware
- identify whether a particular signal will be the output of combinational circuitry or a flop
- identify whether a particular process is combinational or clocked

VHDL semantics
- match up VHDL code with waveforms
- identify whether two VHDL fragments have same behaviour
- perform delta-cycle simulation of VHDL
- perform clock-cycle simulation of VHDL

8.3 Design and Optimization Techniques

- from algorithm to dataflow diagram
- from dataflow diagram to hardware
- optimizing dataflow diagrams
- finite state machines and hardware
- calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)
- calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))
- given a dataflow diagram, calculate the clock period that will result in the optimum performance
- given an algorithm, design a dataflow diagram
- given a dataflow diagram, design the datapath and finite state machine
- optimize a dataflow diagram to improve performance or reduce resource usage

8.4 Validation

- test benches
- assertions
- coverage monitors
- relational specification
- functional specification
- boundary conditions / corner cases

8.5 Performance Prediction and Analysis

- time to execute a program
- definition of performance
- speedup
- n% faster
- calculating performance of different tasks and average task
- choosing which task to optimize to best improve overall performance
- CPI calculations
- performance increase over time
- design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)
- MIPS calculations
- clock speed vs. performance
- optimality
- performance / area tradeoffs
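The performance topics above rest on a few standard relations. As a reminder sketch (the symbol names are assumptions chosen to match the terms in the list; the course's own formula sheet may word them differently):

```latex
\begin{align*}
T &= \mathit{NumInsts} \times \mathit{CPI} \times \mathit{ClockPeriod}
   = \frac{\mathit{NumInsts} \times \mathit{CPI}}{\mathit{ClockSpeed}} \\
\mathit{Perf} &= \frac{1}{T}, \qquad
\mathit{Speedup} = \frac{T_{\mathrm{old}}}{T_{\mathrm{new}}} \\
\text{A is } n\% \text{ faster than B:}\quad
n &= 100 \times \left(\frac{T_B}{T_A} - 1\right) \\
\mathit{MIPS} &= \frac{\mathit{ClockSpeed}}{\mathit{CPI} \times 10^{6}}
\end{align*}
```

For example, if an optimization reduces CPI from 2.0 to 1.6 with NumInsts and ClockSpeed unchanged, the speedup is 2.0 / 1.6 = 1.25, so the optimized design is 25% faster.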

8.6 Timing Analysis

What affects delay: setup, hold, clock-to-Q times, skew, jitter, etc.
- clock period
- clock skew
- clock jitter
- propagation delay
- load delay
- setup time
- hold time
- clock-to-Q time
- critical path

Skills
- find the critical path through a circuit
- find the minimum clock period for a circuit
- find a pair of assignments to signals that exercises the critical path
- false path
- determine whether a critical path is real or false
- derating factors
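The minimum clock period follows from the quantities listed above. A common textbook form of the setup and hold constraints is sketched below. The subscripts are assumptions made for this example, and the exact treatment of skew and jitter used in the course lectures is not shown in this review.

```latex
\begin{align*}
T_{\mathrm{clk}} &\ge t_{\mathrm{clk\text{-}to\text{-}Q}} + t_{\mathrm{prop,max}}
                   + t_{\mathrm{setup}} + t_{\mathrm{skew}} + t_{\mathrm{jitter}} \\
t_{\mathrm{clk\text{-}to\text{-}Q}} + t_{\mathrm{prop,min}} &\ge t_{\mathrm{hold}} + t_{\mathrm{skew}}
\end{align*}
```

For example, with a clock-to-Q time of 1 ns, a critical-path propagation delay of 6 ns, a setup time of 1 ns, 0.5 ns of skew, and negligible jitter, the first inequality gives a minimum clock period of 8.5 ns.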

8.7 Power

- power vs energy
- equations for power

Terms
- dynamic power
- static power
- switching power
- short circuit power
- leakage power
- activity factor
- leakage current
- threshold voltage

Power reduction techniques
- clock gating
- data encoding

8.8 Testing

- causes of faults
- locations of faults
- physical faults

Mathematical models of faults
- single stuck-at fault
- will a test for a mathematical fault detect a physical fault?

Testability
- testable / untestable fault
- fault masking
- redundant circuitry
- timing hazards

Economics of testing
- fault coverage

Physical fault types
- open
- short: wired AND, wired OR, stronger wins

Testing II

Built-in self-testing
- linear feedback shift register
- characteristic polynomials: addition, multiplication, division (quotient and remainder), relationship to hardware
- maximal length linear feedback shift register
- signature analyzer
- fault aliasing
- process and time to run a BIST test

Test vector generation
- generate test vector to find a particular fault
- generate test vectors to find a set of faults
- fault collapsing: gate collapsing, node collapsing
- fault domination
- order test vectors to reduce test time

8.9 Formulas to be Given on Final Exam

[Formula sheet: the equations are not recoverable here; the physical constants are listed below.]

k = 1.38066 × 10^-23 J/K
q_e = 1.60218 × 10^-19 C

Part II

Solutions to Tutorial Notes

Chapter 1

VHDL Problems

SOL-01 Preliminaries

SOL-01: VHDL Syntax


Lecture Notes Sections: 1.1-1.6

1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2-3 word description of the value. Values: '-', '#', '0', '1', 'A', 'h', 'H', 'L', 'Q', 'X', 'Z'.

Answer:

Value   In std_logic_1164?   Description
'-'     Yes                  don't care
'#'     No
'0'     Yes                  strong 0
'1'     Yes                  strong 1
'A'     No
'h'     No
'H'     Yes                  weak 1
'L'     Yes                  weak 0
'Q'     No
'X'     Yes                  strong unknown
'Z'     Yes                  high impedance

NOTE: 'h' is not in the package, because characters are case sensitive. For example 'a' /= 'A'.
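For reference, the nine values of the std_ulogic type in ieee.std_logic_1164 are enumerated below. To keep the fragment compilable on its own, it is wrapped in a hypothetical package logic_values_demo with a type named demo_ulogic; neither name exists in the IEEE library. The values 'U' and 'W' are part of the type but were not in the question's list.

```vhdl
-- Hypothetical package that mirrors the value set of ieee.std_logic_1164.std_ulogic.
package logic_values_demo is
  type demo_ulogic is (
    'U',  -- uninitialized
    'X',  -- strong (forcing) unknown
    '0',  -- strong (forcing) 0
    '1',  -- strong (forcing) 1
    'Z',  -- high impedance
    'W',  -- weak unknown
    'L',  -- weak 0
    'H',  -- weak 1
    '-'   -- don't care
  );
end logic_values_demo;
```

Because the values are character literals, 'H' and 'h' are different literals, which is why 'h' is not a member of the type.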

1.2 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

  entity montevido is
    port (
      a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
      l : in std_logic_vector (1 downto 0);
      p, q, r, s, t, u, v, w, x, y, z : out std_logic
    );
  end montevido;

  architecture main of montevido is
    signal i, j : std_logic;
  begin
    i <= c0 XOR c1;
    j <= c0 XOR c1;
    process (a, i, j) begin
      if (a = '1') then
        p <= i AND j;
      else
        p <= NOT i;
      end if;
    end process;
    process (a, b0, b1) begin
      if rising_edge(a) then
        q <= b0 AND b1;
      end if;
    end process;
    process (a, c0, c1, d0, d1, e0, e1)
    begin
      if (a = '1') then
        r <= c0 OR c1;
        s <= d0 AND d1;
      else
        r <= e0 XOR e1;
      end if;
    end process;
    process begin
      wait until rising_edge(a);
      t <= b0 XOR b1;
      u <= NOT t;
      v <= NOT x;
    end process;
    process begin
      case l is
        when "00" =>
          wait until rising_edge(a);
          w <= b0 AND b1;
          x <= '0';
        when "01" =>
          wait until rising_edge(a);
          w <= '-';
          x <= '1';
        when "1-" =>
          wait until rising_edge(a);
          w <= c0 XOR c1;
          x <= '-';
      end case;
    end process;
    y <= c0 XOR c1;
    z <= x XOR w;
  end main;

Answer:

Signal   Latch   Combinational   Flip-flop
p                X
q                                X
r                X
s        X
t                                X
u                                X
v                                X
w                                X
x                                X
y                X
z                X

1.3 Counting Clock Cycles

This question refers to the VHDL code shown below.

NOTES:
1. ... represents a legal fragment of VHDL code
2. assume all signals are properly declared
3. the VHDL code is intended to be legal, synthesizable code
4. all signals are initially 'U'

  entity bigckt is
    port (
      a, b : in std_logic;
      c : out std_logic
    );
  end bigckt;

  architecture main of bigckt is
  begin
    process (a, b)
    begin
      if (a = '0') then
        c <= '0';
      else
        if (b = '1') then
          c <= '1';
        else
          c <= '0';
        end if;
      end if;
    end process;
  end main;

  entity tinyckt is
    port (
      clk : in std_logic;
      i : in std_logic;
      o : out std_logic
    );
  end tinyckt;

  architecture main of tinyckt is
    component bigckt ( ... );
    signal ... : std_logic;
  begin
    p0 : process begin
      wait until rising_edge(clk);
      p0_a <= i;
      wait until rising_edge(clk);
    end process;
    p1 : process begin
      wait until rising_edge(clk);
      p1_b <= p1_d;
      p1_c <= p1_b;
      p1_d <= s2_k;
    end process;
    p2 : process (p1_c, p3_h, p4_i, clk) begin
      if rising_edge(clk) then
        p2_e <= p3_h;
        p2_f <= p1_c = p4_i;
      end if;
    end process;
    p3 : process (i, s4_m) begin
      p3_g <= i;
      p3_h <= s4_m;
    end process;
    p4 : process (clk, i) begin
      if (clk = '1') then
        p4_i <= i;
      else
        p4_i <= '0';
      end if;
    end process;
    huge : bigckt
      port map (a => p2_e, b => p1_d, c => h_y);
    s1_j <= s3_l;
    s2_k <= p1_b XOR i;
    s3_l <= p2_f;
    s4_m <= p2_f;
  end main;

For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change affects the destination signal?

Answer:

NOTE: i doesn't affect the value of p2_f just before a rising edge of clock, so i doesn't affect p2_e at all along the path that goes through p2_f.

Answer choices for each pair: no connection; same clock cycle; 1, 2, 3, 4, 5, 6, 7, 8, or 9 clock cycles; 10 or more clock cycles.

source signal   destination signal
i               p0_a
i               p1_b
i               p1_c
i               p2_e
i               p3_g
i               p4_i
s4_m            h_y
p1_b            p1_d
p2_f            s1_j

[Answer grid: the position of the X mark for each pair is not recoverable.]

1.4 Arithmetic Overflow

Implement a circuit to detect overflow in 8-bit signed arithmetic.

Answer:

An overflow in 8-bit arithmetic happens when the carry into the most significant bit is different from the carry out of the most significant bit. The code below tests the equivalent condition on the sign bits: the two operands have the same sign and the result has a different sign.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity overflow is
    port (
      num1, num2 : in signed(7 downto 0);
      cin : in std_logic;
      overflow : out std_logic
    );
  end overflow;

  architecture main of overflow is
    signal num1_ext, num2_ext, result : signed(8 downto 0);
  begin
    num1_ext <= '0' & num1;
    num2_ext <= '0' & num2;
    result <= num1_ext + num2_ext + ("00000000" & cin);
    -- overflow: operand signs agree, but the result sign differs
    overflow <= not (num1_ext(7) xor num2_ext(7))
                and ( num1_ext(7) xor result(7) );
  end main;
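The carry-based definition in the answer can also be coded directly. The sketch below is an alternative formulation, not the solution from the notes. The entity name overflow_carry, the port ovf, and the internal signal names are hypothetical. It adds the low seven bits to obtain the carry into bit 7, adds all eight bits to obtain the carry out of bit 7, and XORs the two carries.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical alternative: overflow = carry into MSB xor carry out of MSB.
entity overflow_carry is
  port (
    num1, num2 : in  signed(7 downto 0);
    cin        : in  std_logic;
    ovf        : out std_logic
  );
end overflow_carry;

architecture main of overflow_carry is
  signal cin_vec  : unsigned(0 downto 0);
  signal low_sum  : unsigned(7 downto 0);  -- bit 7 holds the carry into the MSB
  signal full_sum : unsigned(8 downto 0);  -- bit 8 holds the carry out of the MSB
begin
  cin_vec(0) <= cin;

  -- sum of bits 6..0, zero-extended by one bit to capture the carry
  low_sum  <= ('0' & unsigned(num1(6 downto 0)))
            + ('0' & unsigned(num2(6 downto 0)))
            + cin_vec;

  -- sum of bits 7..0, zero-extended by one bit to capture the carry
  full_sum <= ('0' & unsigned(num1)) + ('0' & unsigned(num2)) + cin_vec;

  ovf <= low_sum(7) xor full_sum(8);
end main;
```

This version uses two adders, so it is meant to make the definition visible rather than to minimize area; a synthesis tool may or may not share logic between them.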

1.5 8-Bit Register

Implement an 8-bit register that has:
- clock signal clk
- input data vector d
- output data vector q
- synchronous active-high input reset
- synchronous active-high input enable

Answer:

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity reg_8 is
    port (
      clk, reset, enable : in std_logic;
      d : in std_logic_vector (7 downto 0);
      q : out std_logic_vector (7 downto 0)
    );
  end reg_8;

  architecture main of reg_8 is
  begin
    reg : process begin
      wait until (rising_edge(clk));
      if reset = '1' then
        q <= (others => '0');
      elsif enable = '1' then
        q <= d;
      end if;
    end process reg;
  end main;

1.5.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.

Answer:

  reg : process (clk, reset) begin
    if reset = '1' then
      q <= (others => '0');
    elsif rising_edge(clk) then
      if enable = '1' then
        q <= d;
      end if;
    end if;
  end process reg;

1.5.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

1.5.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.

Answer:

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity reg_8_tb is
  end reg_8_tb;

  architecture main of reg_8_tb is
    component reg_8 is
      port (
        clk : in std_logic;
        reset : in std_logic;
        enable : in std_logic;
        d : in std_logic_vector (7 downto 0);
        q : out std_logic_vector (7 downto 0));
    end component;
    signal clk, reset, enable : std_logic;
    signal d, q : std_logic_vector(7 downto 0);
  begin
    uut : reg_8 port map (
      clk => clk, reset => reset, enable => enable, d => d, q => q );
    process begin
      clk <= '1'; reset <= '0';
      wait for 20 ns;  -- time=20 ns
      clk <= '0'; reset <= '1'; enable <= '1'; d <= "10101011";
      wait for 20 ns;  -- time=40 ns
      clk <= '1';
      wait for 20 ns;  -- time=60 ns
      clk <= '0'; enable <= '0'; d <= "00001011";
      wait for 20 ns;  -- time=80 ns
      clk <= '1';
      wait for 20 ns;  -- time=100 ns
      clk <= '0'; enable <= '1';
      wait for 20 ns;  -- time=120 ns
      clk <= '1';
      wait;            -- end of stimulus
    end process;
  end main;
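The test bench above only applies stimulus; the results still have to be checked by inspecting waveforms. A self-checking variant adds assert statements after each clock edge. The following is a minimal sketch, assuming reg_8 has been compiled into library work. The entity name reg_8_check_tb, the 40 ns clock period, and the 1 ns sampling offset are choices made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking test bench for reg_8 (synchronous reset version).
entity reg_8_check_tb is
end reg_8_check_tb;

architecture main of reg_8_check_tb is
  signal clk    : std_logic := '0';
  signal reset  : std_logic := '0';
  signal enable : std_logic := '0';
  signal d, q   : std_logic_vector(7 downto 0);
begin
  -- direct entity instantiation; assumes reg_8 is in library work
  uut : entity work.reg_8
    port map (clk => clk, reset => reset, enable => enable, d => d, q => q);

  -- free-running clock with a 40 ns period; stop the simulator manually
  clk <= not clk after 20 ns;

  stimulus : process begin
    -- synchronous reset should clear q on the next rising edge
    reset <= '1'; enable <= '1'; d <= "10101011";
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "00000000" report "reset did not clear q" severity error;

    -- with reset low and enable high, q should load d
    reset <= '0';
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "10101011" report "enabled load failed" severity error;

    -- with enable low, q should hold its value
    enable <= '0'; d <= "00001011";
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "10101011" report "q changed while enable = 0" severity error;

    report "reg_8 checks finished";
    wait;
  end process;
end main;
```

The three checks cover reset, load, and hold. A fuller test bench would also check that reset takes priority when enable is low.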

1.6 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES:
1) ... represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.

q2a
  architecture main of anchiceratops is
    signal a, b, c : std_logic;
  begin
    process begin
      wait until rising_edge(c);
      a <= if (b = '1') then
             ...
           else
             ...
           end if;
    end process;
  end main;
ILLEGAL: if-then-else is a statement, not an expression, so can't have if-then-else on right-hand-side of assignment.

q2b
  architecture main of tulerpeton is
  begin
    lab: for i in 15 downto 0 loop
      ...
    end loop;
  end main;
ILLEGAL: loop statements are sequential, while architecture bodies contain concurrent statements.

q2c
  architecture main of temnospondyl is
    component compa
      port (
        a : in std_logic;
        b : out std_logic
      );
    end component;
    signal p, q : std_logic;
  begin
    coma_1 : compa
      port map (a => p, b => q);
    ...
  end main;
LEGAL

q2d
  architecture main of metaxygnathus is
    signal a : std_logic;
  begin
    lab: if (a = '1') generate
      ...
    end generate;
  end main;
ILLEGAL: condition for if-generate statements must be statically determined; testing the value of a signal is dynamic.

q2e
  architecture main of pachyderm is
    function inv(a : std_logic)
      return std_logic is
    begin
      return(NOT a);
    end inv;
    signal p, b : std_logic;
  begin
    p <= inv(b => a);
    ...
  end main;
ILLEGAL: the argument to inv should be (a => b).

q2f
  architecture main of apatosaurus is
    type state_ty is (S0, S1, S2);
    signal st : state_ty;
    signal p : std_logic;
  begin
    case st is
      when S0 | S1 => p <= '0';
      when others => p <= '1';
    end case;
  end main;
ILLEGAL: case statements are sequential, but the body of an architecture contains concurrent statements.

SOL-02 Preliminaries

SOL-02: VHDL Semantics


Lecture Notes Sections: 1.7-1.15.3

1.7 Clock-Cycle Simulation

Given the VHDL code for deinonychus and waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  package deinonychus_pkg is
    type state_ty is
      (mango, guava, durian, papaya);
  end deinonychus_pkg;

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  use work.deinonychus_pkg.all;

  entity deinonychus is
    port (
      clk, reset, sel : in std_logic;
      a, b : in unsigned(15 downto 0);
      p : out unsigned(15 downto 0)
    );
  end deinonychus;

  architecture main of deinonychus is
    signal y, z : unsigned(15 downto 0);
    signal state : state_ty;
  begin
    proc_herzog: process begin
      top_loop: loop
        wait until (rising_edge(clk));
        next top_loop when (reset = '1');
        state <= durian;
        wait until (rising_edge(clk));
        state <= papaya;
        while y < z loop
          wait until (rising_edge(clk));
          if sel = '1' then
            wait until (rising_edge(clk));
            next top_loop when (reset = '1');
            state <= mango;
          end if;
          state <= papaya;
        end loop;
      end loop;
    end process;
    proc_hillary: process (clk)
    begin
      if rising_edge(clk) then
        if (state = durian) then
          z <= a;
        else
          z <= z + 2;
        end if;
      end if;
    end process;
    y <= b;
    p <= y + z;
  end main;

[Waveform diagram: signals including reset, clk, sel, b, state, z, and p over 0 to 200 ns, with sample times marked at 55 ns, 107 ns, 147 ns, and 195 ns.]

Answer:

     55 ns   107 ns   147 ns   195 ns
y    7       5        F        7
z    U       2        6        A
p    U       7        15       11

1.8 Delta-Cycle Simulation: Pong

Simulate the following VHDL code by drawing a timing diagram.

INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.

NOTES:
1. The initial value of all signals is 'U'.
2. The signal reset becomes '1' at 0 ns and then becomes '0' at 5 ns.

  architecture main of pong_machine is
    signal ping_i, ping_n, pong_i, pong_n : std_logic;
  begin
    process (clk) begin
      if rising_edge(clk) then
        ping_n <= ping_i;
        pong_n <= pong_i;
      end if;
    end process;
    process (pong_n, ping_n, reset) begin
      if (reset = '1') then
        ping_i <= '1';
        pong_i <= '0';
      else
        ping_i <= pong_n;
        pong_i <= ping_n;
      end if;
    end process;
    out_pong_proc : process (pong_i) begin
      pong <= pong_i;
    end process;
    ping <= ping_i;
  end main;

1.9 Delta-Cycle Simulation: Femur

Simulate the following VHDL code by completing the timing diagram on the next page.

INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.
5. Write t=5ns and t=10ns at the top of columns where time advances to 5 ns and 10 ns.

NOTES:
1. The initial values of all of the signals are shown in the timing diagram.
2. The only changes on clk, a, and b are:
   (a) At 5 ns, a changes from 0 to 1.
   (b) At 5 ns, b changes from 0 to 1.
   (c) At 10 ns, clk changes from 0 to 1.

  entity femur is
    port (
      clk, a, b : in std_logic;
      f : out std_logic
    );
  end femur;

  architecture main of femur is
    signal c, d, e : std_logic;
  begin
    proc_1 : process (a, b, c) begin
      c <= a and b;
      d <= a xor c;
    end process;
    proc_2 : process begin
      e <= d;
      wait until rising_edge(clk);
    end process;
    proc_3 : process (c, e) begin
      f <= c xor e;
    end process;
  end main;

[Timing diagram for femur: rows for simulation round, simulation cycle, delta cycle, proc_external, proc_1, proc_2, proc_3, and clk, with B/E markers and process modes P, A, S; columns marked t=5 ns and t=10 ns.]

1.10 VHDL-VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour as it does in the main architecture of teradactyl?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.

  entity teradactyl is
    port (
      a : in std_logic;
      v : out std_logic
    );
  end teradactyl;

  architecture main of teradactyl is
    signal m : std_logic;
  begin
    m <= a;
    v <= m;
  end main;

q3a
  architecture q3a of teradactyl is
    signal b, c, d : std_logic;
  begin
    b <= a;
    c <= b;
    d <= c;
    v <= d;
  end q3a;
SAME

q3b
  architecture q3b of teradactyl is
    signal m : std_logic;
  begin
    process (a, m) begin
      v <= m;
      m <= a;
    end process;
  end q3b;
SAME

q3c
  architecture q3c of teradactyl is
    signal m : std_logic;
  begin
    process (a) begin
      m <= a;
    end process;
    process (m) begin
      v <= m;
    end process;
  end q3c;
SAME

1.11 VHDL-VHDL Behavioural Comparison: Ichthyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.

  entity ichthyostega is
    port (
      clk : in std_logic;
      b, c : in signed(3 downto 0);
      v : out signed(3 downto 0)
    );
  end ichthyostega;

  architecture main of ichthyostega is
    signal bx, cx : signed(3 downto 0);
  begin
    process begin
      wait until (rising_edge(clk));
      bx <= b;
      cx <= c;
    end process;
    process begin
      wait until (rising_edge(clk));
      if (cx > 0) then
        v <= bx;
      else
        v <= to_signed(-1, 4);
      end if;
    end process;
  end main;

q4a
  architecture q4a of ichthyostega is
    signal bx, cx : signed(3 downto 0);
  begin
    process begin
      wait until (rising_edge(clk));
      bx <= b;
      cx <= c;
    end process;
    process begin
      if (cx > 0) then
        wait until (rising_edge(clk));
        v <= bx;
      else
        wait until (rising_edge(clk));
        v <= to_signed(-1, 4);
      end if;
    end process;
  end q4a;
DIFFERENT: evaluations of cx > 0 and v <= bx are separated by a clock cycle.

q4b
  architecture q4b of ichthyostega is
    signal bx, cx : signed(3 downto 0);
  begin
    process begin
      wait until (rising_edge(clk));
      bx <= b;
      cx <= c;
      wait until (rising_edge(clk));
      if (cx > 0) then
        v <= bx;
      else
        v <= to_signed(-1, 4);
      end if;
    end process;
  end q4b;
DIFFERENT: each assignment statement (e.g. bx <= b) will execute every other clock cycle, rather than every clock cycle.

q4c
  architecture q4c of ichthyostega is
    signal bx, cx, dx : signed(3 downto 0);
  begin
    process begin
      wait until (rising_edge(clk));
      bx <= b;
      cx <= c;
    end process;
    process begin
      wait until (rising_edge(clk));
      v <= dx;
    end process;
    dx <= bx when (cx > 0)
          else to_signed(-1, 4);
  end q4c;
SAME

1.12 Waveform-VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3f has the same behaviour as the timing diagram.

NOTES:
1) "Same behaviour" means that the signals a, b, and c have the same values at the end of each clock cycle in steady-state simulation (ignore any irregularities in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc are properly defined and declared.
4) All of the code fragments are legal, synthesizable VHDL code.

[Timing diagram: waveforms for clk, a, b, and c]

q3a
  architecture q3a of q3 is
  begin
    process begin
      a <= '1';
      loop
        wait until rising_edge(clk);
        a <= NOT a;
      end loop;
    end process;
    b <= NOT a;
    c <= NOT b;
  end q3a;
SAME

q3b
  architecture q3b of q3 is
  begin
    process begin
      b <= '0';
      a <= '1';
      wait until rising_edge(clk);
      a <= b;
      b <= a;
      wait until rising_edge(clk);
    end process;
    c <= a;
  end q3b;
SAME

q3c
  architecture q3c of q3 is
  begin
    process begin
      a <= '0';
      b <= '1';
      wait until rising_edge(clk);
      b <= a;
      a <= b;
      wait until rising_edge(clk);
    end process;
    c <= NOT b;
  end q3c;
SAME

q3d
  architecture q3d of q3 is
  begin
    process (b, clk) begin
      a <= NOT b;
    end process;
    process (a, clk) begin
      b <= NOT a;
    end process;
    c <= NOT b;
  end q3d;
DIFFERENT: this code has combinational loops

q3e
  architecture q3e of q3 is
  begin
    process
    begin
      a <= '1';
      b <= '0';
      c <= '1';
      wait until rising_edge(clk);
      a <= c;
      b <= a;
      c <= NOT b;
      wait until rising_edge(clk);
    end process;
  end q3e;
DIFFERENT: c is a constant 1

q3f
  architecture q3f of q3 is
  begin
    process
    begin
      b <= '0';
      a <= '1';
      wait until rising_edge(clk);
      a <= c;
      b <= a;
      wait until rising_edge(clk);
    end process;
    c <= not b;
  end q3f;
DIFFERENT: a is a constant 1

1.13 Hardware-VHDL Comparison

  entity q2 is
    port (
      a, clk, reset : in std_logic;
      d : out std_logic
    );
  end q2;

  architecture main of q2 is
    signal b, c : std_logic;
  begin
    b <= '0' when (reset = '1') else a;
    process (clk) begin
      if rising_edge(clk) then
        c <= b;
        d <= c;
      end if;
    end process;
  end main;

For each of the circuits q2a-q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

[Circuit diagrams q2a, q2b, q2c, and q2d: each has inputs reset, a, and clk, a constant 0, and output d]

1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4d, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a
  process begin
    wait until rising_edge(a);
    e <= d;
    wait until rising_edge(b);
    e <= NOT d;
  end process;

Answer: Unsynthesizable: different conditions in wait statements in same process. This would lead to a single flip-flop requiring multiple clock signals.

q4b
  process begin
    while (c /= '1') loop
      if (b = '1') then
        wait until rising_edge(a);
        e <= d;
      else
        e <= NOT d;
      end if;
    end loop;
    e <= b;
  end process;

Answer: Unsynthesizable: while loop around code where some paths have wait statements and some do not. Even having a while loop with a dynamic condition around code without a wait statement would be unsynthesizable, because it would lead to combinational loops in the hardware.

q4c
  process (a, d) begin
    e <= d;
  end process;
  process (a, e) begin
    if rising_edge(a) then
      f <= NOT e;
    end if;
  end process;

Answer: Flop with inverter on input.

q4d
  process (a) begin
    if rising_edge(a) then
      if b = '1' then
        e <= '0';
      else
        e <= d;
      end if;
    end if;
  end process;

Answer: Synchronous reset (AND with bubble). The reset pin on a flip-flop is generally asynchronous, so a flop with a reset pin would be incorrect.
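The "AND with bubble" answer for q4d can be written out structurally. The following is a sketch of the equivalent hardware, with a hypothetical entity name q4d_gates and an added internal signal e_next. For b equal to '0' or '1' it behaves the same as the q4d process.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical gate-level view of q4d: a plain flop fed by an AND gate.
entity q4d_gates is
  port (
    a, b, d : in  std_logic;   -- a is the clock, b acts as a synchronous reset
    e       : out std_logic
  );
end q4d_gates;

architecture main of q4d_gates is
  signal e_next : std_logic;
begin
  -- AND gate with an inverted ("bubbled") b input
  e_next <= d and (not b);

  -- flip-flop without a reset pin
  process begin
    wait until rising_edge(a);
    e <= e_next;
  end process;
end main;
```

Because the reset is folded into the data input, it takes effect only at a rising edge of a, which is what makes it synchronous.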

1.15 Datapath Design

Each of the three VHDL fragments q4a-q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):
- read in source and destination addresses from i_src1, i_src2, i_dst
- read operands op1 and op2 from memory
- compute sum of operands sum
- write sum to memory at destination address dst
- write sum to output o_result

[Figure: block diagram with inputs clk, i_src1, i_src2, i_dst and output o_result]

1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a-q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load='1'.

NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #1.
3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a-q4c.
5. All of the VHDL is legal, synthesizable code.

  -- This code is to be used for all three code fragments q4a--q4c.
  signal state : std_logic_vector(3 downto 0);
  signal src1, src2, dst, op1, op2, sum,
         mem_in_a, mem_out_a, mem_out_b,
         mem_addr_a, mem_addr_b : unsigned(7 downto 0);
  ...
  process (clk) begin
    if rising_edge(clk) then
      src1 <= i_src1;
      src2 <= i_src2;
      dst <= i_dst;
      o_result <= sum;
    end if;
  end process;
  mem : ram256x16d
    port map (clk => clk,
              i_addr_a => mem_addr_a,
              i_addr_b => mem_addr_b,
              i_we_a => mem_we,
              i_data_a => mem_in_a,
              o_data_a => mem_out_a,
              o_data_b => mem_out_b);

q4a
  op1 <= mem_out_a when state = "0010" else (others => '0');
  op2 <= mem_out_b when state = "0010" else (others => '0');
  sum <= op1 + op2 when state = "0100" else (others => '0');
  mem_in_a <= sum when state = "1000" else (others => '0');
  mem_addr_a <= dst when state = "1000" else src1;
  mem_we <= '1' when state = "1000" else '0';
  mem_addr_b <= src2;
  process (clk) begin
    if rising_edge(clk) then
      if (load = '1') then
        state <= "1000";
      else
        -- rotate state vector one bit to left
        state <= state(2 downto 0) & state(3);
      end if;
    end if;
  end process;

Answer: The circuit is not correct: all of the signals are combinational. Also, there could be initialization problems with state.

q4b
  process (clk) begin
    if rising_edge(clk) then
      op1 <= mem_out_a;
      op2 <= mem_out_b;
    end if;
  end process;
  sum <= op1 + op2;
  mem_in_a <= sum;
  mem_we <= load;
  mem_addr_a <= dst when load = '1' else src1;
  mem_addr_b <= src2;

Answer: The circuit is correct. load = '1' in clock cycle 2.

q4c
  process begin
    wait until rising_edge(clk);
    op1 <= mem_out_a;
    op2 <= mem_out_b;
    sum <= op1 + op2;
    mem_in_a <= sum;
  end process;
  process (load, dst, src1) begin
    if load = '1' then
      mem_addr_a <= dst;
    else
      mem_addr_a <= src1;
    end if;
  end process;
  mem_addr_b <= src2;

Answer:
If we take the code exactly as is: the circuit is incorrect, because mem_we is missing.
If we assume that mem_we is added: the circuit is correct. Need load = '1' in cycle 4.

1.15.2 Smallest Area

Of all of the circuits (q4a-q4c), including both correct and incorrect circuits, predict which will have the smallest area. If you don't have sufficient information to predict the relative areas, explain what additional information you would need to predict the area prior to synthesizing the designs.

Answer: Assuming that q4c includes mem_we: All of the circuits have an adder, memory, input flops, output flops, and a mux for mem_addr_a. The differences are in the flops and misc circuitry:

         q4a    q4b    q4c
flops    1*4    2*8    4*8
ands     5*4    0      0

From this analysis, q4a has the smallest area.

SOL-02:

1.15.3

Shortest Clock Period

33

1.15.3 Shortest Clock Period


Of all of the circuits (q4aq4c), including both correct and incorrect circuits, predict which will have the shortest clock period. If you dont have sufcient information to predict the relative periods, explain what additional information you would need to predict the period prior to performing any synthesis or timing analysis of the designs.


Answer: q4c has the shortest clock period, because it does the least amount of computation between flip-flops: all of the signals are flopped.

Chapter 2

Design Problems


SOL-03 Preliminaries

SOL-03: Datapath and Control Design


Lecture Notes Sections: 2.5 2.5

University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter

2.1 Synthesis

This question is about using VHDL to implement memory structures on FPGAs.

2.1.1 Data Structures

If you have to write your own code (i.e., you do not have a library of memory components or a special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?
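One possible data structure, offered only as a minimal sketch and not as the course's official answer, is an array type whose elements are vectors, held in a signal and indexed by an address. The entity name `reg_file`, the port names, and the 8-entry by 8-bit size are all assumptions chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical register file: 8 registers of 8 bits each (sizes are assumptions).
entity reg_file is
  port (
    clk     : in  std_logic;
    we      : in  std_logic;
    wr_addr : in  unsigned(2 downto 0);
    rd_addr : in  unsigned(2 downto 0);
    wr_data : in  std_logic_vector(7 downto 0);
    rd_data : out std_logic_vector(7 downto 0)
  );
end reg_file;

architecture sketch of reg_file is
  -- The data structure: an array of vectors.
  type reg_array is array (0 to 7) of std_logic_vector(7 downto 0);
  signal regs : reg_array;
begin
  -- Synchronous write, enabled by we.
  write_port : process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        regs(to_integer(wr_addr)) <= wr_data;
      end if;
    end if;
  end process;

  -- Asynchronous (combinational) read.
  rd_data <= regs(to_integer(rd_addr));
end sketch;
```

Whether a synthesis tool maps such an array onto flip-flops or onto dedicated FPGA memory depends on the tool and the coding style, so the result should be checked in the synthesis report.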

2.1.2 Own Code vs Libraries

When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?

2.2 Design Guidelines

While you are grocery shopping you encounter your co-op supervisor from last year. She's now forming a startup company in Waterloo that will build digital circuits. She's writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines. What is your response to each question? What is your justification for your answer? What are the tradeoffs between the two options?

0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?

Answer: All projects should use silicon based chips, because biological chips don't exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.

1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?

Answer: Synchronous reset. Synchronous reset leads to more robust designs. With asynchronous reset, a flop is reset whenever the reset signal arrives. Due to wire delays, signals will arrive at different flops at different times. If an asynchronous reset occurs at about the same time as a clock edge, some flops might be reset in one clock cycle and some in the next. This can lead to glitches and/or illegal values on internal state signals. The tradeoff is that asynchronous reset is often easier to code in VHDL and requires less hardware to implement.

2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?

Answer: Flops. Flip-flops lead to more robust designs than latches. Latches are level sensitive and act as wires when enabled. For a latch based design to work correctly, there cannot be any overlap in the time when a consecutive pair of latches are enabled. If this happens, the value on a signal will leak through the latch and arrive at the next set of latches one clock phase too early. Thus, latch based designs are more sensitive to the timing of clock signals. Another disadvantage of latches is that some FPGAs and cell libraries do not support them. In comparison, D-type flip-flops are (almost?) always supported. The tradeoff is that latches are smaller and faster than flip-flops. A common implementation of a flip-flop is a pair of latches in a master/slave combination.

3. Should all chips have registers on the inputs and outputs, or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

Answer: Flops on outputs and inputs. Putting flops on inputs and outputs will make the clock speed of the chip less dependent on the propagation delay between chips. Flops can also be used to isolate the internals of the chip from glitches and other anomalous behaviour that can occur on the boards. The tradeoff is that flops consume area and will increase the latency through the chip.

4. Should all circuit modules on all chips have flip-flops on the inputs and outputs, or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

Answer: Each project should adopt a convention of either using flops on inputs of modules or outputs of modules. It is rarely necessary to put flops on both inputs and outputs of modules on the same chip. This is because the wire delay between modules is usually less than a clock period. Putting flops on either the inputs or outputs is advantageous because it provides a standard design convention that makes it easier to glue modules together without violating timing constraints. If modules were allowed to have combinational circuitry on both inputs and outputs, the maximum clock speed of the design could not be determined until all of the modules were glued together. The tradeoff is that flops add area and latency. Sometimes there will be two modules where the combinational circuitry on the outputs of one can be combined with the combinational circuitry on the inputs of the second without violating timing constraints. This discipline prevents that optimization.

Aside: Sometimes, to meet performance targets, in situations such as this, a project will remove or move the flops between modules and do clock borrowing to fit the maximum amount of circuitry into a clock period. This is a rather low-level optimization that happens late in the design cycle. It can cause big headaches for functional validation and equivalence verification, because the specifications for modules are no longer clean and the boundaries between modules on the low-level design might be different from the boundaries in the high-level design.

5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?

Answer: Multiplexors. Multiplexors lead to more robust designs. Tri-state buffers rely on analog characteristics of devices to work correctly. Tri-state buffers can work incorrectly in the presence of voltage fluctuations or fabrication process variations. Multiplexors work on a purely Boolean level and as such are less sensitive to changes in voltages or fabrication processes. The tradeoff is that tri-state buffers are smaller and faster than multiplexors.
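To make the reset guideline (question 1) concrete, the two coding styles can be sketched side by side. This is an illustrative sketch only: the entity name `reset_styles` and the ports `d`, `q_sync`, and `q_async` are assumptions, not part of the original problem.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only to contrast the two reset styles.
entity reset_styles is
  port (
    clk, reset : in  std_logic;
    d          : in  std_logic;
    q_sync     : out std_logic;
    q_async    : out std_logic
  );
end reset_styles;

architecture sketch of reset_styles is
begin
  -- Synchronous reset: reset is only sampled on the rising clock edge,
  -- so every flop sees the reset in the same clock cycle.
  sync_reset : process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        q_sync <= '0';
      else
        q_sync <= d;
      end if;
    end if;
  end process;

  -- Asynchronous reset: reset is in the sensitivity list and takes effect
  -- as soon as it is asserted, independent of the clock.
  async_reset : process (clk, reset)
  begin
    if reset = '1' then
      q_async <= '0';
    elsif rising_edge(clk) then
      q_async <= d;
    end if;
  end process;
end sketch;
```

In the synchronous version the reset behaves like any other data input to the flop, which is why its arrival time only matters relative to the clock edge.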

2.3 Dataflow Diagram Optimization


Use the dataflow diagram below to answer questions 2.3.1 and 2.3.2.

[Figure: dataflow diagram with inputs a, b, c, d, e and f and g operations]

2.3.1 Resource Usage

List the number of items for each resource used in the dataflow diagram.

Answer:

input ports    3
output ports   1
registers      4
f components   2
g components   1

2.3.2 Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:
1. you may change the times when signals are read from the environment
2. you may not increase the resource usage (input ports, registers, output ports, f components, g components)
3. you may not increase the clock period

Answer:

[Figure: optimized dataflow diagram with inputs a, b, c, d, e and f and g operations]

2.4 Dataflow Diagram Design

Your manager has given you the task of implementing the following pseudocode in an FPGA:

  if is_odd(a + d)
    p = (a + d)*2 + ((b + c) - 1)/4;
  else
    p = (b + c)*2 + d;

NOTES:
1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.

2.4.1 Maximum performance

What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports?

Answer:

Optimizations:

Multiplication by a constant power of 2 can be done without hardware: just connect the wires between the signals. For example, if we have a <= b*2;, we can do this with a(0) <= '0'; a(1) <= b(0); a(2) <= b(1); etc.

Testing if a signal is odd or even can be done simply by extracting the least significant bit of the signal.

[Figure: dataflow for odd case]
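The two optimizations above can be sketched in VHDL as pure wiring. This is a hedged illustration: the entity name `shift_wiring` and its ports are assumptions, and overflow handling is ignored (the top bit of `b` is simply dropped).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity illustrating "free" operations that need no ALU.
entity shift_wiring is
  port (
    b      : in  signed(7 downto 0);
    a      : out signed(7 downto 0);  -- a = b*2, ignoring overflow
    is_odd : out std_logic            -- '1' when b is odd
  );
end shift_wiring;

architecture sketch of shift_wiring is
begin
  -- Multiply by 2: shift every bit up one position and fill bit 0 with '0'.
  a <= b(6 downto 0) & '0';

  -- Odd/even test: the least significant bit is '1' exactly when b is odd.
  is_odd <= b(0);
end sketch;
```

Neither assignment infers any logic gates; a synthesis tool would normally implement both as wires.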

[Figure: dataflow for even case]

The even flow requires 4 clock cycles (3 cycles in the datapath plus one more because we have to have flops on both inputs and outputs). Therefore the total design will require 4 clock cycles.

What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?

Answer:

clock cycles   4
ALUs           2
dividers       0
multipliers    0

[Figure: dataflow for entire circuit]

2.4.2 Minimum area

What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?

Answer:

8b regs        3
6b regs        0
4b regs        0
1b regs        0
clock cycles   5

[Figure: dataflow diagram for the minimum-area design]

2.5 Design and Optimization

Design a circuit that performs the following operation:

  P = (a+d) + ((b - c) - 1)

Optimize your design for area.

Answer:

VHDL code for implementing: P = (a+d) + ((b-c)-1)

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fsm1 is
  port (
    in1 : in  signed(3 downto 0);
    in2 : in  signed(3 downto 0);
    clk : in  std_logic;
    p   : out signed(4 downto 0)
  );
end fsm1;

architecture fsm1_arch of fsm1 is
  signal add_sel, sub_sel : std_logic;
  signal add1, add2, sub1, sub2, r1, r2 : signed(4 downto 0);
begin
  -- control: a three-state sequence that steers the adder and subtracter inputs
  fsm : process begin
    wait until rising_edge(clk);
    add_sel <= '-'; sub_sel <= '1';
    wait until rising_edge(clk);
    add_sel <= '1'; sub_sel <= '0';
    wait until rising_edge(clk);
    add_sel <= '0'; sub_sel <= '-';
  end process;

  -- datapath registers
  reg : process begin
    wait until rising_edge(clk);
    r1 <= sub1 - sub2;
    r2 <= add1 + add2;
  end process;

  -- concurrent statements
  add1 <= ('0' & in1) when (add_sel = '1') else r1;
  add2 <= ('0' & in2) when (add_sel = '1') else r2;
  sub1 <= ('0' & in1) when (sub_sel = '1') else r1;
  sub2 <= ('0' & in2) when (sub_sel = '1') else to_signed(1,5);
  p    <= r2;
end fsm1_arch;

SOL-04 Preliminaries

SOL-04: Memory Design


Lecture Notes Sections: 2.6 2.6.2


2.6 Dataflow Diagrams with Memory Arrays


Component                 Delay
Register                  5 ns
Adder                     25 ns
Subtracter                30 ns
ALU with ..., AND, XOR    40 ns
Memory read               60 ns
Memory write              60 ns
Multiplication            65 ns
2:1 Multiplexor           5 ns

NOTES:
1. The inputs of the algorithms are a and b.
2. The outputs of the algorithms are p and q.
3. You must register both your inputs and outputs.
4. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
5. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
6. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.
7. M supports synchronous write and asynchronous read.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

2.6.1 Algorithm 1

  q = M[b];
  M[a] = b;
  p = (M[b-1] * b) + M[b];

Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution time.

Answer:

1. a > b means that a b 1, therefore can do M[b+1] read in parallel with M[a] write or with M[b] read.

2. But, could have a b, so can't do M[a] write in parallel with M[b] read.

3. Initial dataflow diagram:

[Figure: initial dataflow diagram]

4. Find the critical path:

[Figure: dataflow diagram annotated with component delays]

Critical path is from b to p: 150 ns.

5. Explore performance with different clock periods:

[Figure: dataflow diagrams scheduled for each clock period]

period   latency    time
70 ns    4 cycles   280 ns
90 ns    3 cycles   270 ns

6. Minimum latency is 3 clock cycles, because we can't do all memory operations in parallel and need registers on both inputs and outputs.

7. Best performance is with a clock period of 90 ns.

8. Resource usage:

Component        Quantity
Input            2
Output           2
Register         5 (including mem array)
Adder            1
Memory read      2
Memory write     1
Multiplication   1
Clock Period     90 ns
Latency          3 cycles
Execution Time   270 ns

2.6.2 Algorithm 2

  q = M[b];
  M[a] = b;
  p = (M[b-1] * b) + M[b];

Assuming a b, draw a dataflow diagram that is optimized for the fastest overall execution time.

Answer:

1. The assumption means that a ≠ b and a ≠ b-1, so there are no memory address conflicts to create dependencies and complications.

2. Explore performance with different clock periods:

[Figure: dataflow diagrams scheduled for each clock period]

period   latency    time
70 ns    5 cycles   350 ns
90 ns    3 cycles   270 ns

3. Without going to a triple-ported memory, we can't reduce latency below 3.

4. Area optimization: change b - 1 to b + (-1).

[Figure: dataflow diagram with b + (-1) computed by the adder (25 ns)]

5. Resource usage:

Component        Quantity
Input            2
Output           1
Register         5 (including mem array)
Adder            1
Memory read      2
Memory write     1
Multiplication   1
Clock Period     90 ns
Latency          3 cycles
Execution Time   270 ns

SOL-05 Preliminaries

SOL-05: Optimization and FPGA Implementation


Lecture Notes Sections: 2.7 2.8


2.7 2-bit adder

This question compares an FPGA implementation and a generic-gates implementation of a 2-bit full adder.

2.7.1 Generic Gates

Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.

2.7.2 Xilinx FPGA

Show the CLB implementation of a 2 bit adder in a Xilinx Spartan XCS10 FPGA by drawing the schematic of a CLB and showing the equations for the lookup tables.

2.8 Sketches of Problems

1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)
2. calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))
3. given a dataflow diagram, calculate the clock period that will result in the optimum performance
4. given an algorithm, design a dataflow diagram
5. given a dataflow diagram, design the datapath and finite state machine
6. optimize a dataflow diagram to improve performance or reduce resource usage
7. given an FSM diagram, pick the VHDL code that best implements the diagram (correct behaviour, simple, fast hardware), or critique hardware


Chapter 3

Functional Validation Problems

SOL-06 Preliminaries

SOL-06: Functional Validation


Lecture Notes Sections: 3.1 3.1.5.2


3.1 Functional Validation Problems

3.1.1 Carry Save Adder

1. Functionality: Briefly describe the functionality of a carry-save adder.
2. Testbench: Write a testbench for a 16-bit combinational carry save adder.
3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.

NOTES:
(a) You do not need to support pipelined adders.
(b) VHDL generics might be useful.

3.1.2 Traffic Light Controller

1. Functionality: Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence of cars.

Answer: Given a normal traffic light, which spends a constant amount of time as green in each direction, add the following two transitions to the system:
(a) If the less-busy road does not have any cars present for t1 minutes, transition the traffic light to make the busier of the two roads green.
(b) If the busy road has a car waiting for t2 minutes, transition the traffic light to make the busier of the two roads green.

2. Boundary Conditions: Make a list of boundary conditions to check for your traffic light controller.

Answer:
(a) A car arrives at the intersection and triggers the sensor, but makes a right turn before the light turns green in its direction. Should the light turn to green in the direction of the now vacant road, or stay green in the current direction?
(b) Same as (a), but the car makes a right turn after the other road already has a yellow light. Should the light turn to green in the direction of the now vacant road, or transition from yellow back to green, or very briefly stay green in the vacant direction?
(c) The less-busy road is yellow, there's no car at the busy road, and a car arrives at the less-busy road. Same questions as in the first two situations.

3. Assertions: Make a list of assertions to check for your traffic light controller.

Answer:
(a) if a light is green, the next colour will be yellow
(b) if a light is yellow, the next colour will be red
(c) if a light is red, the next colour will be green
(d) if no car has been at the less-busy road for at least t1 minutes then the less-busy road is red
(e) if the car sensor has been continuously on for the busy road for at least t2 minutes then the busy road is green
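Assertions (a) to (c) above could be checked in simulation with a small monitor. The following is a minimal sketch under stated assumptions: the package `light_pkg`, the type `colour_t`, the entity `light_checker`, and its ports are hypothetical names invented for illustration, and the checker watches a single light sampled on the rising clock edge.

```vhdl
-- Hypothetical colour type shared between the design and the checker.
package light_pkg is
  type colour_t is (red, yellow, green);
end package;

library ieee;
use ieee.std_logic_1164.all;
use work.light_pkg.all;

-- Hypothetical monitor for the colour sequence green -> yellow -> red -> green.
entity light_checker is
  port (
    clk   : in std_logic;
    light : in colour_t
  );
end light_checker;

architecture sketch of light_checker is
  signal prev_light : colour_t := red;
  signal prev_valid : boolean  := false;  -- false until one sample has been stored
begin
  check : process (clk)
  begin
    if rising_edge(clk) then
      -- Only check when the colour actually changes.
      if prev_valid and light /= prev_light then
        case prev_light is
          when green =>
            assert light = yellow
              report "green must be followed by yellow" severity error;
          when yellow =>
            assert light = red
              report "yellow must be followed by red" severity error;
          when red =>
            assert light = green
              report "red must be followed by green" severity error;
        end case;
      end if;
      prev_light <= light;
      prev_valid <= true;
    end if;
  end process;
end sketch;
```

Assertions (d) and (e) would additionally need counters for t1 and t2, which are omitted here because the problem does not specify the clock period or the sensor interface.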

3.1.3 State Machines and Validation

1. Three Different State Machines

[Figure 3.1: A very simple machine (states s0 to s3)]

[Figure 3.2: A very big machine (states s0 to s9)]

[Figure 3.3: A concurrent machine (states s0 to s2 and q0 to q4)]

[Figure 3.4: Legend. Transitions are labelled input/output; * = don't care]

Answer each of the following questions for the three state machines in Figures 3.1 to 3.3.

(a) How many test scenarios (sequences of test vectors) would you need to fully validate the behaviour of the state machine?

(b) What is the maximum length (number of test vectors) in a test scenario for the state machine?

(c) Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?

Answer:

          scenarios   max len   min flops
Fig 3.1   12          4         2
Fig 3.2   1024        10        4
Fig 3.3   2^15        15        5 or 4

Scenarios for Fig 3.1:

        sequence   expected behaviour
1)      000        s0, s2, s3, s0
2)      001        s0, s2, s3, s0
3)      010        s0, s2, s3, s0
4)      011        s0, s2, s3, s0
5)      1000       s0, s1, s2, s3, s0
6)      1001       s0, s1, s2, s3, s0
...
12)     1111       s0, s1, s2, s3, s0

Scenarios for Fig 3.2:

        sequence     expected behaviour
1)      0000000000   s0, s1, s2, ..., s9, s0
2)      0000000001   s0, s1, s2, ..., s9, s0
...
1024)   1111111111   s0, s1, s2, ..., s9, s0

Scenarios for Fig 3.3:

        sequence   expected behaviour
1)      0...00     (s0,q0), (s1,q1), (s2,q2), (s0,q3), (s1,q4), (s2,q0), (s0,q1), (s1,q2), (s2,q3), (s0,q4), (s1,q0), (s2,q1), (s0,q2), (s1,q3), (s2,q4), (s0,q0)
2)      0...01     same behaviour
...
2^15)   1...11     same behaviour

For Fig 3.3, if we implement each machine separately we need 5 flops, 2 for the S machine and 3 for the Q machine. If we merge the state machines, we need ceil(log2(3 * 5)) = 4 flops.

One of the purposes of this exercise is to illustrate how many test vectors it requires to exhaustively test the behaviour of even simple circuits. Also, this demonstrates how the structure of a circuit affects the number of test vectors needed. Size alone is not the determining factor.

2. State Machines in General

If a circuit has n signals of 1-bit each that are the outputs of flip-flops and m 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of states that the circuit can have?

Answer: The maximum number of states for a circuit with n flops is 2^n. The values of combinational signals are determined by the flops and the inputs, and so they don't contribute to the total number of states.

3.1.4 Additional Problem

3.1.5 Test Plan Creation

You're on the functional validation team for a chip that will control a simple portable CD-player. Your task is to create a plan for the functional validation for the signals in the entity cd_digital. You've been told that the player behaves just like all of the other CD players out there. If your test plan requires knowledge about any potential nonstandard features or behaviour, you'll need to document your assumptions.

[Figure: CD player front panel with track, min, and sec displays and prev, stop, play, next, and pwr buttons]

entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev, stop, play, next, pwr : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    open : in std_logic;
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;

3.1.5.1 Early Tests


Describe five tests that you would run as soon as the VHDL code is simulatable. For each test, describe your specification, stimulus, and check. Summarize why your collection of tests should be the first tests that are run.

Answer:

test1
  specification: when power is turned on, the display will show the number of tracks on the CD, and the minutes and seconds will show the total length of the CD.
  stimulus: power=0; wait; power=1, all other signals are 0.
  check: display outputs of circuit match specification

test2
  specification: when power is on, play starts CD playing, display for track=1, min and sec show remaining time for song and start decrementing.
  stimulus: power=1; play=0; wait; play=1, all other signals are 0.
  check: display outputs of circuit match specification

test3
  specification: when power is on and CD is playing, next starts next song. Display for track increments, min and sec show remaining time for next song and start decrementing.
  stimulus: power=1; play=0; next=0; wait; play=1; wait; next=1, all other signals are 0.
  check: display outputs of circuit match specification

test4
  specification: when power is on and CD is playing, prev starts previous song. Display for track decrements, min and sec show remaining time for previous song and start decrementing.
  stimulus: power=1; play=0; prev=0; wait; play=1; wait; prev=1, all other signals are 0.
  check: display outputs of circuit match specification

test5
  specification: when power is on and CD is playing, stop causes CD to stop.
  stimulus: power=1; play=0; stop=0; wait; play=1; wait; stop=1, all other signals are 0.
  check: display outputs of circuit match specification

justification for choices: These cases test the basic operations of the CD player. Each test focusses on a different aspect of the player's behaviour.

3.1.5.2 Corner Cases


Describe five corner-cases or boundary conditions, and explain the role of corner cases and boundary conditions in functional validation.

NOTES:
1. You may reference your answer for question 3.1.5.1 in this question.
2. If you do not know what a corner case or boundary condition is, you may earn partial credit by checking this box [ ] and explaining five things that you would do in functional validation.

Answer:

case 1: press both prev and next while a CD is playing
case 2: open the case while a CD is playing
case 3: press play and stop at the same time
case 4: press any button other than power when the player is off
case 5: press next repeatedly until track counter wraps around

role of corner cases: The purpose of corner cases is to test unusual situations that designers might not have thought of, and so are more likely to contain bugs than normal behaviour.


Chapter 4

Performance Analysis and Optimization Problems


SOL-07 Preliminaries

SOL-07: Performance Analysis and Optimization


Lecture Notes Sections: 4 4.7.3


4.1 Farmer

A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard to the market.

Facts:

                                  big truck   small truck
capacity of truck                 12 tonnes   6 tonnes
speed when loaded with apples     15 kph      30 kph
speed when unloaded (no apples)   38 kph      70 kph

distance to market   120 km
amount of apples     85 tonnes

NOTES:
1. All of the loads of apples must be carried using the same truck.
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after the last load.
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels at either its fully loaded or its empty speed.

Question: Which truck will result in the least elapsed time and what percentage faster will the elapsed time be?

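Answer:

$$TimeTot = NumTrips \times (TimeLoaded + TimeUnloaded)$$

$$NumTrips = \left\lceil \frac{Harvest}{Capacity} \right\rceil$$

All trips are for the same distance, so distance cancels out of the equations:

$$Time \propto \frac{1}{Speed}$$

$$TimeTotBig = \left\lceil \frac{85}{12} \right\rceil \times \left( \frac{1}{15} + \frac{1}{38} \right) = 8 \times 0.0930 = 0.7439$$

$$TimeTotSmall = \left\lceil \frac{85}{6} \right\rceil \times \left( \frac{1}{30} + \frac{1}{70} \right) = 15 \times 0.0476 = 0.7143$$

The small truck will take less time.

$$PctFaster = \frac{TimeSlow - TimeFast}{TimeFast} = \frac{TimeTotBig - TimeTotSmall}{TimeTotSmall} = \frac{0.7439 - 0.7143}{0.7143} = 4.14\%$$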

Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it, if not, explain.

Answer:
1. Use two drivers.
2. Use a combination of the small truck and large truck to improve his utilization.

4.2 Network and Router

The BigLan network protocol runs at a data rate of 160 Mbps (Mega bits per second). Each BigLan packet contains 100 Bytes of routing information and 1000 Bytes of data. You are working on the DataChopper router, which has the following performance numbers:

75 MHz   clock speed
500      number of clock cycles to process the routing information for a packet
4        CPI for a byte of data

4.2.1 Maximum Throughput

Which has a higher maximum throughput (as measured in bits per second), the network or your router, and how much faster is it?

Answer: The maximum data throughput of the two technologies in terms of bits can be calculated as follows:

1. BigLan Network Protocol

   Maximum data throughput = 160 Mbps * (8000 data bits / 8800 packet bits)
                           = 145.45 Mbps

2. DataChopper Router

   Time required for a packet = 500 clock cycles + 0.5 CPI per data bit * 8800 packet bits
                              = 500 clock cycles + 4400 clock cycles
                              = 4900 clock cycles
                              = 4900 clock cycles * 13.33 ns per cycle
                              = 65333 ns per packet

   Time required for a data bit = 65333 ns per packet / 8000 data bits
                                = 8.167 ns per data bit

   Maximum data throughput = 1 / 8.167 ns per data bit
                           = 122.45 Mbps

The network has a higher maximum throughput. What percentage higher?

   n% higher performance = (perf high - perf low) / perf low
                         = (145 - 122) / 122
                         = 19%

The network has 19% higher maximum performance.

4.2.2 Packet Size and Performance

Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process).

Answer:

As packet size increases, the overhead associated with the constant routing delay will become less significant. The data rate of the router will slowly approach that of the network but it will never surpass the network throughput. If there was not any overhead for routing, the peak data rate for the router would be 150 Mbps compared to 160 Mbps of the network.
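The trend can be written out as a formula. This is a sketch under one assumption that the question does not state: the routing information stays at 100 bytes (800 bits) while the data portion grows to $D$ bits. The per-packet cost model (500 cycles plus 0.5 cycles per packet bit at 75 MHz) is the one used in the answer to 4.2.1.

```latex
\[
R_{\mathrm{router}}(D) = \frac{D \times 75 \times 10^{6}}{500 + 0.5\,(D + 800)}\ \text{bps},
\qquad
R_{\mathrm{network}}(D) = 160 \times 10^{6} \times \frac{D}{D + 800}\ \text{bps}
\]
\[
R_{\mathrm{router}}(8000) = \frac{8000 \times 75 \times 10^{6}}{4900} \approx 122.45\ \text{Mbps},
\qquad
\lim_{D \to \infty} R_{\mathrm{router}}(D) = \frac{75 \times 10^{6}}{0.5} = 150\ \text{Mbps}
\]
```

Under this model the router stays below the network for every $D > 0$, because $75\,(D + 800) < 160\,(900 + 0.5\,D)$ simplifies to $0 < 84000 + 5D$.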

4.3 Performance Short Answer

If performance doubles every two years, by what percentage does performance go up every month?

Answer:

Performance after $t$ months is $2^{t/24}$ times the starting performance, where $t$ is measured in months. For one month:

$$2^{1/24} = 1.029$$

Therefore, performance goes up by 2.9% each month.

4.4 Microprocessors

The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction. Multiplies must be written in software using loops of shifts and adds. The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4. A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.

4.4.1 Average CPI

Question: What is the average CPI for the Y!v1? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

Answer:

Use the following subscripts:

Yme    1
Y!v1   2
Y!u2   3

Known values: $NumInst_2 = NumInst_1$, $ClockSpeed_1 = 200MHz$, $ClockSpeed_2 = 150MHz$, $CPI_1 = 4$.

The Yme is 10% faster than the Y!v1:

$$\frac{Time_2 - Time_1}{Time_1} = 0.10 \quad\Rightarrow\quad Time_2 = 1.10 \times Time_1$$

$$Time = \frac{NumInst \times CPI}{ClockSpeed}$$

$$\frac{NumInst_2 \times CPI_2}{ClockSpeed_2} = 1.10 \times \frac{NumInst_1 \times CPI_1}{ClockSpeed_1}$$

Solve for $CPI_2$:

$$CPI_2 = 1.10 \times \frac{ClockSpeed_2}{ClockSpeed_1} \times \frac{NumInst_1}{NumInst_2} \times CPI_1 = 1.10 \times \frac{150MHz}{200MHz} \times 4 = 3.3$$

Common mistakes:

Swapping performance of Yme and Y!v1.

A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of Y!u2 is 30% better than that of the Y!v1.

4.4.2 Why not you too?

Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

Answer:

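Solve for CPI3. The Y!u2 is 30% faster than the Y!v1 and executes 10% fewer instructions:

\[ \text{Time}_2 = 1.3 \times \text{Time}_3 , \qquad \text{NumInst}_3 = 0.9 \times \text{NumInst}_2 \]

\[ \frac{\text{NumInst}_2 \times \text{CPI}_2}{\text{ClockSpeed}_2} = 1.3 \times \frac{\text{NumInst}_3 \times \text{CPI}_3}{\text{ClockSpeed}_3} \]

\[ \text{CPI}_3 = \frac{1}{1.3} \times \frac{\text{NumInst}_2}{\text{NumInst}_3} \times \frac{\text{ClockSpeed}_3}{\text{ClockSpeed}_2} \times \text{CPI}_2 = \frac{1}{1.3} \times \frac{1}{0.9} \times \frac{180\,\text{MHz}}{150\,\text{MHz}} \times 3.3 = 3.38 \]

Common mistakes:

- Comparing performance of Y!u2 to Yme, rather than Y!v1.
- Saying that time for Y!u2 is 70% of Y!v1.
- Forgetting to take into account reduced number of instructions.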
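The two CPI calculations follow the same pattern, so they can be checked with a short script. This is only a sketch: the helper name `cpi_from_ratios` and its parameters are illustrative, and it assumes "X% faster" means the slower machine's time is (1 + X/100) times the faster machine's time, which is the reading implied by the common-mistakes list.

```python
def cpi_from_ratios(time_ratio, inst_ratio, clock_ratio, cpi_ref):
    """CPI of a new processor relative to a reference processor.

    time_ratio  = Time_new / Time_ref
    inst_ratio  = NumInst_new / NumInst_ref
    clock_ratio = ClockSpeed_new / ClockSpeed_ref

    Derived from Time = NumInst * CPI / ClockSpeed.
    """
    return time_ratio * clock_ratio / inst_ratio * cpi_ref


cpi_yme = 4.0

# Y!v1 versus Yme: Y!v1 takes 1.10x the time, same instruction count, 150 vs 200 MHz.
cpi_v1 = cpi_from_ratios(1.10, 1.0, 150 / 200, cpi_yme)

# Y!u2 versus Y!v1: Y!v1 takes 1.3x the time of Y!u2, 10% fewer instructions, 180 vs 150 MHz.
cpi_u2 = cpi_from_ratios(1 / 1.3, 0.9, 180 / 150, cpi_v1)

print(f"CPI of Y!v1: {cpi_v1:.2f}")
print(f"CPI of Y!u2: {cpi_u2:.2f}")
```

Passing the ratios explicitly makes it easy to see the effect of each common mistake, for example using 0.7 instead of 1/1.3 for the time ratio.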

4.4.3 Analysis

Question: Which of the following do you think is most likely, and why?

1. the Y!u2 is basically the same as the Y!v1 except for the multiply
2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction
3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction

Answer: The most likely analysis is that the Y!u2 is basically the same as the Y!v1 except for the multiply. The Y!u2 has a slightly larger CPI than the Y!v1 (3.38 versus 3.3), which is in keeping with the addition of a multiply instruction: a multiply instruction probably has a larger-than-average CPI. The increase in clock speed likely comes from a new fabrication process, and would not have required significant changes to the design of the chip.

4.5 Dataflow Diagram Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:

- you may change the times when signals are read from the environment
- you may not increase the resource usage (input ports, registers, output ports, f components, g components)
- you may not increase the clock period

[Figure: dataflow diagrams "Before Optimization" and "After Optimization", built from f and g components]

4.6 Optimization with Memory Arrays

This question deals with the implementation and optimization for the algorithm and library of circuit components shown below.

Algorithm:

    q = M[b];
    if (a > b) then
      M[a] = b;
      p = (M[b-1] * b) + M[b];
    else
      M[a] = b;
      p = M[b+1] * a;
    end;

Component                   Delay
Register                    5 ns
Adder                       25 ns
Subtracter                  30 ns
ALU with ..., AND, XOR      40 ns
Memory read                 60 ns
Memory write                60 ns
Multiplication              65 ns
2:1 Multiplexor             5 ns

NOTES:

1. 25% of the time, a > b
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and outputs.
5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p.

Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time. NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

Answer:

a > b (25%)                  a ≤ b (75%)
q = M[b];                    q = M[b];
M[a] = b;                    M[a] = b;
p = (M[b-1] * b) + M[b];     p = M[b+1] * a;

1. a ≤ b happens 75% of the time, so initially focus on it as the common case.
(a) a ≤ b means that a ≠ b+1, therefore can do M[b+1] read in parallel with M[a] write or with M[b] read.
(b) But, could have a = b, so can't do M[a] write in parallel with M[b] read.

[Figure: dataflow diagram for the a ≤ b case]

(c) Critical path is from b to p: 150ns + 5ns for mux on p = 155ns.
(d) Longest operation in diagram is multiplication: 65ns.
(e) Minimum clock period is 65ns + 5ns for register = 70ns.

[Figure: clocked dataflow diagrams for the a ≤ b case at the three candidate clock periods]

period    70 ns      75 ns      90 ns
latency   5 cycles   4 cycles   3 cycles
time      350 ns     300 ns     270 ns

(f) Minimum latency is 3 clock cycles, because can't do all memory operations in parallel and need registers on both inputs and outputs.
(g) Best overall performance for a ≤ b case is with clock period of 90 ns.

2. Now try a > b with 90 ns clock period.

(a) a > b means that a ≠ b and a ≠ b-1, so no memory address conflicts to create dependencies and complications.

[Figure: clocked dataflow diagrams for the a > b case at clock periods of 90 ns and 95 ns]

period    90 ns      95 ns
latency   4 cycles   3 cycles
time      360 ns     285 ns

(b) Without going to a triple-ported memory, can't reduce latency below 3.
(c) Best performance for a > b case is with clock period of 95 ns.

3. Choose 95 ns clock period, which gives a latency of 3 clock cycles for both options.
4. Optimize dataflow diagrams to reduce area without sacrificing performance.
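The choice of a 95 ns clock can be checked by weighting the two cases by how often they occur. The sketch below uses the period and latency values from the two tables; the dictionary names and the `latency_at` helper are illustrative, and it assumes that a design point that meets timing at some period keeps the same latency at any longer period.

```python
# (clock period in ns) -> (latency in cycles), taken from the two tables in the answer.
CASE_A_LE_B = {70: 5, 75: 4, 90: 3}   # a <= b, 75% of executions
CASE_A_GT_B = {90: 4, 95: 3}          # a > b, 25% of executions


def latency_at(points, period_ns):
    """Latency at period_ns, using the slowest tabulated design point that
    still fits (assumption: it keeps its latency at a longer period)."""
    feasible = [p for p in points if p <= period_ns]
    return points[max(feasible)]


def average_time(period_ns, prob_gt=0.25):
    """Average execution time in ns over both branches of the algorithm."""
    time_le = period_ns * latency_at(CASE_A_LE_B, period_ns)
    time_gt = period_ns * latency_at(CASE_A_GT_B, period_ns)
    return (1 - prob_gt) * time_le + prob_gt * time_gt


for period in (90, 95):
    print(period, "ns ->", average_time(period), "ns average")
```

Under these assumptions the 90 ns clock is faster for a ≤ b alone, but the 95 ns clock gives the lower weighted average because both branches finish in 3 cycles.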

[Figure: area-optimized dataflow diagrams for the two cases]

5. Merge dataflow diagrams.

[Figure: merged dataflow diagram. Optimal performance (Period = 95 ns)]

Component                 Quantity
Input                     2
Output                    2
Register                  5
Adder                     1
Subtracter                1
ALU                       0
Memory read               2
Memory write              1
Multiplication            1
2:1 Multiplexor           2
Clock Period              95 ns
Average Latency           3 cycles
Average Execution Time    285 ns

[Figure: alternative merged dataflow diagram. Suboptimal area (two multipliers)]

[Figure: alternative merged dataflow diagram. Suboptimal performance (Period = 100 ns)]

4.7 Multiply Instruction

You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip. If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed and will raise the cost.

                                   FPGA option    FPGA + MULT option
average CPI                        5              ???
% of instrs that are multiplies    10%            10%
CPI of multiply                    20             6
Clock speed                        200 MHz        160 MHz
Cost                               $20            $23

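4.7.1 Highest Performance

Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?

Answer:

MIPs for FPGA option:

\[ \text{MIPs}_{FPGA} = \frac{\text{MHz}_{FPGA}}{\text{CPI}_{FPGA}} = \frac{200}{5} = 40 \]

Find MIPs for FPGA+MULT option.

Find CPI for non-multiply (other) instructions:

\[ \text{CPI}_{FPGA} = \text{Pct}_{mult} \times \text{CPI}_{mult} + \text{Pct}_{other} \times \text{CPI}_{other} \]

\[ \text{CPI}_{other} = \frac{\text{CPI}_{FPGA} - \text{Pct}_{mult} \times \text{CPI}_{mult}}{\text{Pct}_{other}} = \frac{5 - 0.1 \times 20}{0.9} = 3.333 \]

Find CPI for FPGA+MULT option:

\[ \text{CPI}_{FM} = \text{Pct}_{mult} \times \text{CPI}_{mult} + \text{Pct}_{other} \times \text{CPI}_{other} = 0.1 \times 6 + 0.9 \times 3.333 = 3.6 \]

\[ \text{MIPs}_{FM} = \frac{\text{MHz}_{FM}}{\text{CPI}_{FM}} = \frac{160}{3.6} = 44.4 \]

MIPs_FM > MIPs_FPGA, therefore the FPGA+MULT is the higher performance option.

\[ \frac{\text{MIPs}_{FM} - \text{MIPs}_{FPGA}}{\text{MIPs}_{FPGA}} = \frac{44.4 - 40}{40} = 11.1\% \]

The FPGA+MULT option is 11% faster than the FPGA option.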

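4.7.2 Optimality

Which option, FPGA or FPGA+MULT, is more optimal (as measured in MIPs/$), and what percentage more optimal is the more optimal option?

Answer:

\[ \text{opt}_{FPGA} = \frac{\text{MIPs}_{FPGA}}{\text{Price}_{FPGA}} = \frac{40}{20} = 2.00 \]

\[ \text{opt}_{FM} = \frac{\text{MIPs}_{FM}}{\text{Price}_{FM}} = \frac{44.4}{23} = 1.93 \]

\[ \frac{\text{opt}_{FM} - \text{opt}_{FPGA}}{\text{opt}_{FPGA}} = \frac{1.93 - 2.00}{2.00} = -0.034 \]

The difference is negative, so the FPGA option is the more optimal one: the FPGA+MULT option delivers 3.4% fewer MIPs/$ than the FPGA option.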
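The MIPs and MIPs/$ figures for both options can be recomputed from the table in the problem statement. The sketch below is one way to do that; the variable names are illustrative and the only inputs are the values given in the table.

```python
def mips(clock_mhz, cpi):
    """Millions of instructions per second for a given clock and average CPI."""
    return clock_mhz / cpi


PCT_MULT = 0.10                      # fraction of instructions that are multiplies
CPI_FPGA = 5.0                       # average CPI, FPGA-only option
CPI_MULT_FPGA, CPI_MULT_FM = 20, 6   # CPI of a multiply in each option

# The CPI of the non-multiply instructions is the same in both options.
cpi_other = (CPI_FPGA - PCT_MULT * CPI_MULT_FPGA) / (1 - PCT_MULT)
cpi_fm = PCT_MULT * CPI_MULT_FM + (1 - PCT_MULT) * cpi_other

mips_fpga = mips(200, CPI_FPGA)
mips_fm = mips(160, cpi_fm)
speedup = (mips_fm - mips_fpga) / mips_fpga

opt_fpga = mips_fpga / 20            # MIPs per dollar
opt_fm = mips_fm / 23
opt_change = (opt_fm - opt_fpga) / opt_fpga

print(f"MIPs: FPGA {mips_fpga:.1f}, FPGA+MULT {mips_fm:.1f} ({speedup:+.1%})")
print(f"MIPs/$: FPGA {opt_fpga:.2f}, FPGA+MULT {opt_fm:.2f} ({opt_change:+.1%})")
```

The sign of `opt_change` shows which option comes out ahead on MIPs/$ once the extra $3 is taken into account.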

4.7.3 Performance Metrics

Explain whether MIPs is a good choice for the performance metric when making this decision.

Answer:

MIPs is a good metric for this example, because we are comparing two microprocessors that use the same instruction set and will be used in the same environment. In general, the disadvantage of MIPs is that it doesn't take into account that different instructions accomplish different amounts of work. This causes problems when comparing microprocessors that use different instruction sets (e.g. one with a cosine instruction and one without).

Chapter 5

Timing Analysis Problems

SOL-08: Timing Analysis

Lecture Notes Sections: 5.1–5.5.3

5.1 Terminology

Assume that the timing diagram shows the limits of the allowed times (either minimum or maximum). For each of the terms in the table below, answer which time periods (one or more of t1–t9 or NONE) are examples of the term.

[Figure: timing diagram for signals clk1, clk2, a, and b with labelled time periods; legend: "signal is stable", "signal may change"]

clock skew
clock period
setup time
hold time

5.2 Critical Path and False Path

Find the critical path through the following circuit:

[Figure: circuit with inputs a, b, c]

5.3 Critical Path

[Figure: circuit with inputs a, b, c and signals d through m]

gate    delay
NOT     2
AND     4
OR      4
XOR     6

Assume all delay and timing factors other than combinational logic delay are negligible.

5.3.1 Ignoring potential false paths, list the signals in the critical path through this circuit.

[Figure: circuit annotated with arrival times: d 6, e 8, f 8, g 12, h 4, i 16, j 18, k 10, l 16, m 16]

Critical path is: b, e, g, j

5.3.2 What is the combinational delay through the critical path?

Delay: 18

5.3.3 Missing Factors

What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?

Answer:

- wire delay
- clock skew
- clock jitter

5.3.4 False Path?

Is the critical path you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.

[Figure: circuit with the candidate critical path and the required side-input values marked]

Answer:

Contradictory assignment to e. Critical path requires 0 while input to j requires 1.

Therefore, the first candidate critical path is a false path. NOTE: rules for XOR require one of the inputs to remain stable, otherwise output of XOR will not change. Find next candidate.

[Figure: circuit annotated with arrival times for the next candidates; i, j, l, and m all arrive at 16]

1. There are four paths with a delay of 16. All go through g.

2. Quick check if g can change. Static equation for g: a b bc. Therefore, g can change.

3. Try 0→1 on i.
(a) For f, have choice of 1 or 0→1 on f, because of reconvergent fanout.
(b) Try 1 first, because it's simpler.
(c) For g, have choice of 0 or 0→1 on d, because of reconvergent fanout.
(d) Try 0 first, because it's simpler.
(e) d is ok, 0 from both sides.
(f) Conflict on output of inverter.

[Figure: circuit with the attempted assignments marked]

4. Try 0→1 on i with 0→1 on f. Conflict on d.

5. Try 0→1 on i, 0→1 on f, 0→1 on d. Conflict on d.

6. Try 0→1 on m. Conflict on e.

7. Try 0→1 on l.
(a) For e, have choice on b of whether to invert or not, because e is an xor.
(b) Because path from h is propagating 1→0 and e is 0→1, need to invert.
(c) For inversion, need to put a 1 on c.
(d) Conflict on b.

[Figure: circuit with the attempted assignments marked]

8. Need to get g to toggle.
(a) Static equation for g is a b bc. Only assignment that makes g=0 is abc. Only assignment that causes g to toggle because of change on b is a=0, b=1 0, c=0.

(b) Try to push rising edge on b through g to i, j, m, or l; with a=0 and c=0.

[Figure: circuit with a=0 and c=0 applied]

(c) Can't get rising edge on b to toggle both g and an output. Therefore, critical path does not go through both b and g.

9. Find next candidate path.

[Figure: circuit annotated with arrival times for the next candidates: i 14, j 16, l 14, m 14]

[Figure: circuit with the candidate path from c and the trial assignments marked]

10. Can't get rising edge on c to toggle both g and j. However, the rising edge can toggle i and l. Both the path from c to i and from c to l have a delay of 14.

11. The pair of assignments abc and abc will exercise the critical paths from c to i and c to l, both of which have a delay of 14.

5.4 Timing Models

In your next job, you have been told to use a fanout timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that. For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.

[Figure: schematic and layout of gate G1 driving gates G2, G3, G4, and G5]

Description             Capacitance    Resistance
Gate                    Cg             0
Interconnect level 2    Cx             0
Interconnect level 1    Cy
Antifuse

Assumptions:

- The capacitance of a node on a wire is independent of where the node is located on the wire.

5.5 Worst Case Conditions and Derating Factor

Assume that we have a Std speed grade Actel A1415 (an ACT 3 part) Logic Module that drives 4 other Logic Modules:

5.5.1 Worst-Case Commercial

Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature).

Answer: For worst-case commercial conditions, assuming that TA = TJ, the Logic Module delay, tPD, for ACT 3 Std with a fanout of 4 is 5.7 ns (see Smith Table 5.2). Assuming this is the slowest path, the estimated critical path delay between registers, tCRIT (worst-case commercial), is:

\[ t_{CRIT} = t_{PD} + t_{SUD} + t_{CO} = 5.7\,\text{ns} + 0.8\,\text{ns} + 3.0\,\text{ns} = 9.5\,\text{ns} \]

5.5.2 Worst-Case Industrial

Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).

Answer: For worst-case industrial conditions, assuming that TA = TJ, the derating factor is 1.07 (see Table 5.3). Hence the delay tCRIT (worst-case industrial) is 7% greater than the worst-case commercial delay:

\[ 1.07 \times 9.5\,\text{ns} = 10.2\,\text{ns} \]

5.5.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under the worst-case industrial conditions (assuming that the junction temperature is 105°C).

Answer: For worst-case industrial conditions, the derating factor at 105°C is found by linear interpolation between the values for 85°C (1.07) and 125°C (1.17). The interpolated derating factor is 1.12. Hence the delay tCRIT (worst-case industrial, TJ = 105°C) is:

\[ 1.12 \times 9.5\,\text{ns} = 10.6\,\text{ns} \]
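The three delay estimates in 5.5.1 to 5.5.3 can be reproduced with a few lines of arithmetic. This is a sketch that takes the delay components (5.7 ns, 0.8 ns, 3.0 ns) and the derating factors (1.07 at 85°C, 1.17 at 125°C) directly from the answers above; the function and variable names are illustrative.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


# Worst-case commercial: logic module delay + setup time + clock-to-output delay (ns).
t_pd, t_sud, t_co = 5.7, 0.8, 3.0
t_crit_commercial = t_pd + t_sud + t_co

# Worst-case industrial derating factors quoted in the answers.
derate_85c, derate_125c = 1.07, 1.17
derate_105c = interpolate(105, 85, derate_85c, 125, derate_125c)

t_crit_industrial = derate_85c * t_crit_commercial
t_crit_105c = derate_105c * t_crit_commercial

print(f"commercial:          {t_crit_commercial:.1f} ns")
print(f"industrial (85 C):   {t_crit_industrial:.1f} ns")
print(f"industrial (105 C):  {t_crit_105c:.1f} ns (derating {derate_105c:.2f})")
```

The same `interpolate` helper would apply to any junction temperature between the two tabulated points.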

SOL-09: Timing Analysis (II)

Lecture Notes Sections: 5.6–5.9

5.6 Short Answer

5.6.1 Wires in FPGAs

In an FPGA today, what percentage of the clock period is typically consumed by wire delay?

Answer: 40–60%

5.6.2 Age and Time

If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?

Answer: Decreased.

Justification: Transistors have gotten smaller, die size has remained roughly the same or even increased, and clock speeds are increasing. Signals are travelling roughly the same distance as before, but driving smaller capacitive loads. Thus, wire delay is not decreasing much, but capacitive load is decreasing. The clock period is decreasing, so the wire delay is taking up a larger percentage of the clock period and capacitive load delay is taking up a smaller percentage.

5.6.3 Temperature and Delay

As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?

Answer: Increase.

Justification: As temperature increases, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current. This increases resistivity, which increases delay.

5.7 Hold Time Violations

5.7.1 Cause

What is the cause of a hold time violation?

5.7.2 Behaviour

What is the bad behaviour that results if a hold time violation occurs?

5.7.3 Rectification

If a circuit has a hold time violation, how would you correct the problem with minimal effort?

5.8 Latch Analysis

Does the circuit below behave like a latch? If not, explain why not. If so, calculate the hold time and answer whether it is active-high or active-low.

[Figure: circuit with inputs d and en]

Gate Delays
AND    4
OR     2
NOT    1

Answer:

[Figure: mode diagrams showing the circuit in load mode and in store mode]

From the mode diagrams, if the circuit is a latch, it is active high, because latch is in load mode when en=1.

Now check if timing of circuit is correct. The critical transition is from load mode to store mode.

[Figure: circuit with node labels d, en, cn, l1, s1, q, and timing diagram for transition from load to store mode]

Hold time constraint must prevent new value arriving at d before en sets l1 to 1. Delay along data path is 0. Delay along clock path is 1. Hold time is 1.

circuit is latch?    Y
hold time            1
latch type           active high

5.9 Combinational Timing (Smith 13.23)

Chapter 6

Power Problems

SOL-10: Power Analysis and Reduction

Lecture Notes Sections: 6.1–6.1.3

6.1 Power Analysis and Reduction Problems

6.1.1 Short Answers

6.1.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase, stay the same, or decrease?

Answer: Power will increase.

Justification:

Leakage power will increase, because the equation for the leakage power is:

\[ P_{leak} \propto e^{-q V_{Th} / (k T)} \]

where T is temperature.

Short circuiting power will increase because: As temperature increases, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current. This increases resistivity, which increases delay. Signals will rise and fall more slowly, which will increase the short circuiting time, and hence increase short circuiting power.

6.1.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in the next generation of chips that the company fabricates. The prize for the person who submits the suggestion that makes the best tradeoff between leakage power and other design goals is to have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your idea require in order to achieve the reduction in leakage power?

Answer: Increase transistor size so as to increase threshold voltage. This will require an increase in supply voltage, which will likely increase total power.

Alternative: when increasing transistor size, keep the supply voltage the same, but decrease performance.

Alternative: change fabrication process and materials to reduce leakage current. This will likely be expensive.

Alternative: use a dual-Vt fabrication process.

6.1.1.3 Clock Gating

In what situations could adding clock-gating to a circuit increase power consumption?

Answer: If the circuitry has a high utilization rate, then the power consumed by the clock gating circuit could be more than that saved in the main circuit.

Alternative: Even if the utilization rate is low, the utilization pattern could prevent the clock gating circuitry from turning off the clock to the main circuit. For example, if the circuit receives new data every other clock cycle, it would have a utilization rate of 50%, but might need to be powered up 100% of the time.

6.1.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray coding?

Answer:

Gray coding is designed to reduce power, because only one bit changes when incrementing or decrementing. Program counters usually increment, rather than jump to completely different values. So, using Gray coding should reduce power consumption. The downside is that the memory system probably doesn't use Gray-coded addresses, so additional circuitry would be needed to convert between Gray and binary codes. This will increase area and likely decrease performance. Additionally, the extra circuitry to do the translation might require more power than is saved by using Gray coding.
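To make the single-bit-change property concrete, the following sketch counts bit flips while a counter steps through 16 consecutive values, once in plain binary and once in Gray code. It uses the standard reflected Gray code (`n ^ (n >> 1)`); the helper names are illustrative, and bit flips are only a rough proxy for the activity factor of the counter outputs.

```python
def binary_to_gray(n):
    """Reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Inverse conversion: XOR together all right shifts of the Gray code."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bit_flips(a, b):
    """Number of bit positions that differ between two codes."""
    return bin(a ^ b).count("1")


def total_flips(codes):
    """Total bit flips when stepping through the sequence in order."""
    return sum(bit_flips(x, y) for x, y in zip(codes, codes[1:]))


addresses = list(range(16))   # a program counter that simply increments
binary_flips = total_flips(addresses)
gray_flips = total_flips([binary_to_gray(a) for a in addresses])

print("bit flips, binary counter:", binary_flips)
print("bit flips, Gray counter:  ", gray_flips)
```

The `gray_to_binary` function stands in for the conversion circuitry mentioned in the answer: its cost in area and power is what has to be weighed against the saved bit flips.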

6.1.2 VLSI Gurus

The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1ns. With their fabrication tweaks, they can decrease this to 0.85ns.

6.1.2.1 Effect on Power

If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)

Answer: Reducing the rise/fall time from 1 ns to 0.85 ns reduces the short circuit time by the same ratio. Hence, the new short circuit power is 85% of the original.

6.1.2.2 Critique

A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.

Answer: The plan was probably to increase clock speed by 15%. However, reducing Tshort by 0.15 ns can at most decrease the clock period by 2 × 0.15 = 0.30 ns, while the clock period ≫ 1 ns. Therefore, it does not work.

6.1.3 Advertising Ratios

One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm2, MIPs/Watt, and Watts/cm3 of their products. This wide variety of different metrics has confused them. Explain whether each metric is a reasonable metric for customers to use when choosing a system. If the metric is reasonable, say whether bigger is better (e.g. 500 MIPs/mm2 is better than 20 MIPs/mm2) or smaller is better (e.g. 20 MIPs/mm2 is better than 500 MIPs/mm2), and which one type of product (cell phone, desktop computer, or compute server) is the metric most relevant to.

MIPs/mm2
MIPs/Watt
Watts/cm3

SOL-11: Power Analysis and Reduction

Lecture Notes Sections: 6.1.4–6.1.8.3

6.1.4 Vary Supply Voltage

As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases. The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:

\[ \text{MaxClockSpeed} \propto \frac{(V - V_t)^2}{V} \]

where V is supply voltage and V_t is threshold voltage.

With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?

Answer:

\[ \frac{\text{MaxClockSpeed}_2}{\text{MaxClockSpeed}_1} = \frac{(V_2 - V_t)^2 / V_2}{(V_1 - V_t)^2 / V_1} \]

\[ \text{MaxClockSpeed}_2 = 200\,\text{MHz} \times \frac{(1.5\,\text{V} - 0.8\,\text{V})^2 / 1.5\,\text{V}}{(3\,\text{V} - 0.8\,\text{V})^2 / 3\,\text{V}} = 40\,\text{MHz} \]
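The scaling calculation is easy to check numerically. The sketch below assumes the relation MaxClockSpeed ∝ (V − Vt)²/V, which is the form consistent with the numbers in the answer; the function names are illustrative.

```python
def relative_max_clock(v_supply, v_threshold):
    """Quantity that the maximum clock speed is proportional to: (V - Vt)^2 / V."""
    return (v_supply - v_threshold) ** 2 / v_supply


def scaled_max_clock(f_old_mhz, v_old, v_new, v_threshold):
    """Maximum clock speed after changing the supply voltage from v_old to v_new."""
    ratio = relative_max_clock(v_new, v_threshold) / relative_max_clock(v_old, v_threshold)
    return f_old_mhz * ratio


f_new = scaled_max_clock(200.0, v_old=3.0, v_new=1.5, v_threshold=0.8)
print(f"max clock speed at 1.5 V: {f_new:.1f} MHz")
```

Halving the supply voltage cuts the clock speed by roughly a factor of five here, because the supply ends up close to the threshold voltage.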

6.1.5 Power Reality and Math (Smith prob 15.16)

6.1.6 Clock Speed Increase Without Power Increase

The following are given:

- You need to increase the clock speed of a chip by 10%
- You must not increase its dynamic power consumption
- The only design parameter you can change is supply voltage
- Assume that short-circuiting current is negligible

6.1.6.1 Supply Voltage

How much do you need to decrease the supply voltage by to achieve this goal?

Answer:

Total power is the sum of switching, short-circuit, and leakage power. Only need to reduce dynamic power, therefore neglect static (leakage) power. Neglect short circuiting current. The remaining switching power is:

\[ \text{Power} = \tfrac{1}{2} \times \text{ActivityFactor} \times \text{Capacitance} \times V^2 \times \text{ClockSpeed} \;\propto\; V^2 \times \text{ClockSpeed} \]

Keep the power constant while the clock speed increases by 10%:

\[ V_2^2 \times 1.1 \times \text{ClockSpeed}_1 = V_1^2 \times \text{ClockSpeed}_1 \]

\[ V_2 = \sqrt{\frac{1}{1.1}} \times V_1 = 0.953 \times V_1 \]

We need to decrease the supply voltage to be 95.3% of its original value.
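The 95.3% figure can be checked with a short calculation. This sketch assumes that the dynamic power being held constant is proportional to V² times the clock speed, with leakage and short-circuit power ignored as the question allows; the function name is illustrative.

```python
import math


def voltage_scale_for_constant_power(clock_scale):
    """Factor by which to scale the supply voltage so that V^2 * f stays constant
    when the clock speed is scaled by clock_scale."""
    return 1.0 / math.sqrt(clock_scale)


scale = voltage_scale_for_constant_power(1.10)
print(f"new supply voltage = {scale:.1%} of original")

# Sanity check: the product V^2 * f is unchanged.
relative_power = scale ** 2 * 1.10
print(f"relative dynamic power = {relative_power:.3f}")
```

Because power depends on the square of the voltage, a 10% clock speed increase needs a voltage reduction of a little under 5%.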

6.1.6.2 Supply Voltage

What problems will you encounter if you continue to decrease the supply voltage?

Answer: Decreasing the supply voltage will bring it closer to the threshold voltage. As the difference between the supply and threshold voltage decreases, it will limit the maximum frequency that the circuit can run at. This then leads to decreasing the threshold voltage, which will then increase the leakage current, and raise the static power dissipation.

6.1.7 Power Reduction Strategies

In each low power approach described below identify which component(s) of the power equation is (are) being minimized and/or maximized:

6.1.7.1 Supply Voltage

Designers scaled down the supply voltage of their ASIC.

Answer: Scaling the supply voltage (V) reduces the dynamic power.

6.1.7.2 Transistor Sizing

The transistors were made larger.

Answer: Resizing a transistor to increase the width to length ratio decreases the resistance of the transistor, which makes it faster. This means that the supply voltage can be reduced to save power while maintaining performance. However, increasing the width to length ratio increases the capacitance. After a certain point, the capacitance increase becomes more significant than the reduction in supply voltage, causing power to increase. Therefore, resizing is adjusting supply voltage and load capacitance to minimize their product in the switching power component.

6.1.7.3 Adding Registers to Inputs

All inputs to functional units are registered.

Answer: When inputs are registered, the activity factor is decreased, which decreases the dynamic power.

6.1.7.4 Gray Coding

Gray coding of signals is used for address signals.

Answer: Gray coding reduces the activity factor on signals that typically change by 1 or a small amount. Address signals have this behaviour, in contrast to data signals, where consecutive values are often completely different. Reducing the activity factor will reduce the dynamic power.

6.1.8 Power Consumption on New Chip

While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip. The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted. The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip. The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.

6.1.8.1 Hypothesis

Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.

6.1.8.2 Experiment

Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.

6.1.8.3 Reality

The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her suggestion is infeasible.

Chapter 7

Problems on Faults, Testing, and Testability

SOL-12: Faults, Testing, and Testability

Lecture Notes Sections: 7.1–7.13.4

7.1 Based on Smith q14.9: Testing Cost

A modern (circa 1995) production tester costs US$5–10 million. This cost is depreciated over the life of the tester (usually five years in the States due to tax guidelines).

1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a day, 365 days per year, how much does one second of test time cost?

Answer:

\[ \text{CostPerSecond} = \frac{\text{PurchaseCost}}{\text{Lifespan}} = \frac{5 \times 10^6}{5 \times 365 \times 24 \times 60 \times 60} \]

$0.031 for a US$5 million tester
$0.062 for a US$10 million tester

2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind schedule. After the chips begin shipping, the tester is used 100% of the time. What is the cost of testing the chips relative to the cost if the chips had been completed on time?

Answer: 6 months is 10% of a 5 year lifespan. Therefore the tester will test 90% of the total number of chips that it would normally test. The cost per chip for testing will be:

\[ \text{NewTestCost} = \frac{1}{0.90} \times \text{OrigTestCost} = 111\% \times \text{OrigTestCost} \]

3. The dimensions of the die to be tested are 20mm × 10mm. The wafers are 200mm in diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the number of die per wafer is equal to wafer area divided by chip area. What percentage of the fabrication + test cost is for test if the chip is on schedule and requires 1 minute to test?

Answer:

\[ \text{DiePerWafer} = \frac{\text{WaferArea}}{\text{DieArea}} = \frac{\pi \times (200/2)^2}{10 \times 20} = 157 \]

\[ \text{DieFabCost} = \frac{\text{WaferFabCost}}{\text{DiePerWafer}} = \frac{\$3000}{157} = \$19.10 \]

\[ \text{DieTestCost} = \text{TestCostPerSec} \times \text{TestTime} = \$0.062 \times 60 = \$3.72 \]

\[ \text{TestCostPct} = \frac{\text{DieTestCost}}{\text{DieTestCost} + \text{DieFabCost}} = \frac{\$3.72}{\$3.72 + \$19.10} = 16.3\% \]

7.2 Testing Cost and Total Cost

Given information:

The ACHIP costs $10 without any testing.
Each board uses two ACHIPs (plus lots of other chips that we don't care about).
68% of the manufactured ACHIPs do not have any faults.
For the ACHIP, it costs $1 per chip to catch half of the faults.
Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run).
If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace the ACHIP(s). (This is an approximation, based on the fact that the cost of the chip is much less than the total cost of $200.)
Board-level testing will detect 100% of the faults in an ACHIP.

What fault escapee rate will result in the lowest total cost for ACHIPs?

Answer: From section 7.2.2:

\[
\text{TotCost} = \text{NoTestCost} + \text{TestCost} + \text{EscapeeProb} \times \text{ReplaceCost}
\]

However, here we have two ACHIPs per board, so we need to use the escapee probability to compute the probability of a board needing to be replaced. The revised equation for total cost is:

\[
\text{TotCost} = \text{NoTestCost} + \text{TestCost} + \text{ReplaceProb} \times \text{ReplaceCost}
\]

The testing cost doubles, because we have two ACHIPs per board to test. The probability of a board having at least one bad ACHIP (and therefore needing to be replaced) is 1 minus the probability that both ACHIPs are good:

\[
\text{ReplaceProb} = 1 - (1 - \text{EscapeeProb})^{2}
\]

NoTestCost   TestCost            EscapeeProb   ReplaceProb   AvgReplaceCost   TotCost
$10          $0                  32%           54%           $108             $118
$10          2 × $1 = $2         16%           29%           $58              $70
$10          2 × $2 = $4         8%            15%           $30              $44
$10          2 × $4 = $8         4%            8%            $16              $34
$10          2 × $8 = $16        2%            4%            $8               $34
$10          2 × $16 = $32       1%            2%            $4               $46
$10          2 × $32 = $64       0.5%          1%            $2               $76

The chips will have the lowest cost if either $8 or $16 is spent on testing and they have a fault escapee rate of 4% or 2%. We choose to spend $16 on testing, because that has a lower escapee rate for the same total cost. The lower escapee rate will improve our reputation for quality.
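The table can be regenerated with a short Python sketch. The function name and defaults are illustrative assumptions that mirror the given information. Without rounding, the $8 and $16 rows come out at about $33.68 and $33.92, both of which the table shows as $34.

```python
def board_cost_table(chip_cost=10.0, good_fraction=0.68, replace_cost=200.0,
                     chips_per_board=2, steps=7):
    """Return rows of (test cost, escapee prob, replace prob, total cost)."""
    rows = []
    escapee = 1.0 - good_fraction      # with no testing, every faulty chip escapes
    per_chip_test = 0.0
    for _ in range(steps):
        test_cost = chips_per_board * per_chip_test
        # A board is replaced if at least one of its ACHIPs is faulty.
        replace_prob = 1.0 - (1.0 - escapee) ** chips_per_board
        total = chip_cost + test_cost + replace_prob * replace_cost
        rows.append((test_cost, escapee, replace_prob, total))
        # Each halving of the escapee rate doubles the per-chip test cost.
        escapee /= 2
        per_chip_test = 1.0 if per_chip_test == 0.0 else 2 * per_chip_test
    return rows


if __name__ == "__main__":
    for test_cost, escapee, replace_prob, total in board_cost_table():
        print(f"test ${test_cost:5.2f}  escapee {escapee:6.2%}  "
              f"replace {replace_prob:6.2%}  total ${total:7.2f}")
```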

7.3 Minimum Number of Faults

In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo ≥ 1) and an average fanin of fi, what is the minimum number of faults that must be considered when using a single-stuck-at fault model?

Answer:

The minimum number of wire segments to connect a gate or input to fo other gates or outputs is fo + 1 (assuming fo > 1; if fo = 1, then the minimum number of wire segments is 1). With i inputs and g gates, this results in (i + g)(fo + 1) wire segments. Each wire segment has two possible faults (stuck-at-1 and stuck-at-0), therefore there are 2(i + g)(fo + 1) potential single-stuck-at faults that must be considered.

NOTE: the fanin degree does not directly factor into this equation. However, there is a relationship between the number of gates g, the number of inputs i, the depth of the circuitry, the fanout degree fo, and the fanin degree fi. For example, the maximum number of gates whose inputs are all primary inputs is (i × fo) / fi.
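As a quick numeric illustration, the count can be written as a small Python function. The name is hypothetical, and the fo = 1 case uses one segment per driver as described in the answer.

```python
def min_single_stuck_at_faults(inputs, gates, fanout):
    """Minimum number of single-stuck-at faults for the given circuit size."""
    # One stem plus `fanout` branches per driver, or a single segment if fanout is 1.
    segments_per_driver = 1 if fanout == 1 else fanout + 1
    wire_segments = (inputs + gates) * segments_per_driver
    return 2 * wire_segments  # stuck-at-0 and stuck-at-1 on every segment


if __name__ == "__main__":
    print(min_single_stuck_at_faults(inputs=3, gates=5, fanout=2))
```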

7.4 Smith q14.10: Fault Collapsing

Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.

Answer:
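AND: @0 on either input collapses with @0 on the output.
OR: @1 on either input collapses with @1 on the output.
NAND: @0 on either input collapses with @1 on the output.
NOR: @1 on either input collapses with @0 on the output.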

A two-input mux does not have any controlling inputs, so it does not have any collapsible faults.
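The collapsing sets can be confirmed by brute force. The Python sketch below, with hypothetical names, forces one input of a gate to a constant and reports the cases where the output becomes constant, which is exactly when the input fault is equivalent to an output stuck-at fault. The mux is modelled with three pins (select, d0, d1), which is an assumption about how the two-input mux is drawn.

```python
from itertools import product

GATES = {
    "AND": (2, lambda a, b: a & b),
    "OR": (2, lambda a, b: a | b),
    "NAND": (2, lambda a, b: 1 - (a & b)),
    "NOR": (2, lambda a, b: 1 - (a | b)),
    "MUX": (3, lambda sel, d0, d1: d1 if sel else d0),
}


def collapsing_faults(n_inputs, func):
    """Return (pin, input stuck value, equivalent output stuck value) triples."""
    found = []
    for pin, stuck in product(range(n_inputs), (0, 1)):
        outputs = set()
        for vec in product((0, 1), repeat=n_inputs):
            forced = list(vec)
            forced[pin] = stuck          # inject the stuck-at fault on this pin
            outputs.add(func(*forced))
        if len(outputs) == 1:            # output no longer depends on other pins
            found.append((pin, stuck, outputs.pop()))
    return found


if __name__ == "__main__":
    for name, (n_inputs, func) in GATES.items():
        print(name, collapsing_faults(n_inputs, func))
```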

7.5 Mathematical Models and Reality

Given a correct circuit and a non-stuck-at fault (e.g. bridging AND), will a single-stuck-at fault model detect the fault? If so, identify a single stuck-at fault that will detect it, or explain why it can't be detected.

7.6 Undetectable Faults

Identify one undetectable single stuck-at fault in the circuit below, or say NONE if all single stuck-at faults are detectable.

[Figure: circuit with inputs a, b, c, output z, and lines L1 to L8]

7.7 Test Vector Generation

Your task is to generate test vectors to detect faults in the circuit shown below. Your manager has said that manufacturing only has time to run three test vectors on the circuit.

[Figure: circuit with inputs a, b, c and lines L1 to L8]

7.7.1 Choice of Test Vectors

Which test vectors should you run and in what order should you run them?

7.7.2 Number of Test Vectors

Write a brief statement (backed up with data) to support either staying with three test vectors or increasing the test suite to four vectors.

7.8 Time to do a Scan Test

A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and two of 12,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 50% of full speed. Calculate the total test time.

Answer:

We can load and unload all of the scan chains at the same time, so time will be limited by the longest (30,000 bits). For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

Clock cycles   30,000   1     30,000   1     30,000   ...
Vector 1       Load     Run   Dump
Vector 2                      Load     Run   Dump
Vector 3                                     Load     ...

\[
\text{TimeTot} = \text{ClockPeriod} \times \bigl(\text{MaxLengthVec} + \text{NumVecs} \times (\text{MaxLengthVec} + 1)\bigr)
\]

\[
\text{TimeTot} = \frac{30{,}000 + 500{,}000 \times (30{,}000 + 1)}{0.50 \times 1.2 \times 10^{9}} \approx 25.0 \text{ secs}
\]
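The scan-time formula can be evaluated with a short Python sketch. The function and parameter names are illustrative. With the stated numbers (longest chain 30,000 bits, 500,000 vectors, 1.2 GHz at 50% speed) the formula evaluates to about 25.0 seconds.

```python
def scan_test_time(chain_lengths, num_vectors, clock_hz, speed_fraction):
    """Total scan test time in seconds when all chains shift in parallel."""
    longest = max(chain_lengths)
    # Initial load, then (run + overlapped dump/load) for every vector.
    cycles = longest + num_vectors * (longest + 1)
    return cycles / (clock_hz * speed_fraction)


if __name__ == "__main__":
    chains = [30_000, 20_000, 24_000, 25_000, 12_000, 12_000]
    seconds = scan_test_time(chains, 500_000, 1.2e9, 0.50)
    print(f"{seconds:.1f} s")
```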

7.9 BIST

In this problem, we will revisit the circuit from section 7.3.1, which is shown below. But this time we'll use BIST to test the circuit, rather than analyzing the faults and then choosing test vectors to catch the potential faults.

[Figure: circuit with inputs a, b, c and lines L1 to L8]

7.9.1 Characteristic Polynomials

Derive the characteristic polynomials for the linear feedback shift registers shown below:

[Figure: two three-flop linear feedback shift registers, each with flops d0/q0, d1/q1, d2/q2]

Answer: Both circuits have three flops, so their maximum exponent is x^3. A feedback tap on each signal d_i corresponds to a coefficient of 1 on x^i in the characteristic polynomial. The first circuit has feedback taps for d0, d1, and d2. This gives a characteristic polynomial of:

\[
x^{3} + x^{2} + x + 1
\]

The second circuit has taps on d0 and d1, but not one on d2:

\[
x^{3} + x + 1
\]

7.9.2 Test Generation

Do either of the circuits generate a maximal-length non-repeating sequence?

Answer:

For an LFSR with n flops, the length of a maximal-length non-repeating sequence is 2^n - 1. Both of the LFSRs under consideration have 3 flops, so we are looking for a sequence of 7 non-repeating values. We will first simulate the circuits to see their values, and then demonstrate how characteristic polynomials and division over Galois fields can be used to accomplish the same thing.

Simulation for x^3 + x^2 + x + 1:

      d0   q0   d1   q1   d2   q2
1)    1    1    0    1    0    1
2)    0    1    1    0    0    0
3)    0    0    0    1    1    0
4)    1    0    1    0    1    1
      1    1    0    1    0    1     same as 1)

For x^3 + x^2 + x + 1, we see that it repeats after 4 values.

Simulation for x^3 + x + 1:

      d0   q0   d1   q1   d2   q2
1)    1    1    0    1    1    1
2)    1    1    0    0    0    1
3)    0    1    1    0    0    0
4)    0    0    0    1    1    0
5)    1    0    1    0    0    1
6)    0    1    1    1    1    0
7)    1    0    1    1    1    1

For x^3 + x + 1, we see that it generates a sequence of 7 different values before repeating. The circuit has three flops, so the maximum length sequence of non-repeating values it can generate is 2^3 - 1, which is 7. Thus, x^3 + x + 1 is a maximal-length linear feedback shift register.

Format for division: the LFSR polynomial is the divisor and the message is the dividend; the division produces a quotient and a remainder.

For an LFSR with no external input and n flops, the first n coefficients of the message are the reset values of the LFSR, and all of the other remaining coefficients are 0. For a test vector generator LFSR, the reset values are all 1s. We hope to have a sequence of 7 unique remainders. With the three initial values in the LFSR flops, we require a message polynomial of 3 + 7 = 10 values.

The message polynomial is then:

\[
1x^{9} + 1x^{8} + 1x^{7} + 0x^{6} + 0x^{5} + 0x^{4} + 0x^{3} + 0x^{2} + 0x^{1} + 0x^{0}
\]

Carry out the division by 1x^3 + 0x^2 + 1x^1 + 1x^0:

Quotient term   Subtract                       Polynomial below the subtraction line
1x^6            1x^9 + 0x^8 + 1x^7 + 1x^6      1x^8 + 0x^7 + 1x^6 + 0x^5
1x^5            1x^8 + 0x^7 + 1x^6 + 1x^5      0x^7 + 0x^6 + 1x^5 + 0x^4
0x^4            0x^7 + 0x^6 + 0x^5 + 0x^4      0x^6 + 1x^5 + 0x^4 + 0x^3
0x^3            0x^6 + 0x^5 + 0x^4 + 0x^3      1x^5 + 0x^4 + 0x^3 + 0x^2
1x^2            1x^5 + 0x^4 + 1x^3 + 1x^2      0x^4 + 1x^3 + 1x^2 + 0x^1
0x^1            0x^4 + 0x^3 + 0x^2 + 0x^1      1x^3 + 1x^2 + 0x^1 + 0x^0
1x^0            1x^3 + 0x^2 + 1x^1 + 1x^0      1x^2 + 1x^1 + 1x^0

Quotient: 1x^6 + 1x^5 + 0x^4 + 0x^3 + 1x^2 + 0x^1 + 1x^0
Remainder: 1x^2 + 1x^1 + 1x^0

The values on the flip flops inside an LFSR with n flops show up as the n most significant coefficients on the polynomials immediately below the subtraction lines in the long division. For example, after the second subtraction, the polynomial is 0x^7 + 0x^6 + 1x^5 + 0x^4. The three most significant coefficients are 001, and the value on (q2,q1,q0) after two steps of execution is also 001.
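The two simulations can be reproduced with a small Python model. This is a sketch under the assumption, taken from the tables above, that d0 receives the feedback from q2 and that a tap on d_i XORs q2 into the value shifted in from the previous flop. The function name and the tap encoding are illustrative only.

```python
def lfsr_states(taps, n_flops=3):
    """Return the (q0, q1, ..., q[n-1]) states visited before the sequence repeats.

    taps is the set of indices i where the feedback q[n-1] is XORed into d_i.
    """
    state = (1,) * n_flops               # test generator resets to all 1s
    seen = []
    while state not in seen:
        seen.append(state)
        feedback = state[-1]
        nxt = []
        for i in range(n_flops):
            shifted_in = state[i - 1] if i > 0 else 0
            nxt.append(shifted_in ^ (feedback if i in taps else 0))
        state = tuple(nxt)
    return seen


if __name__ == "__main__":
    for name, taps in (("x^3 + x^2 + x + 1", {0, 1, 2}), ("x^3 + x + 1", {0, 1})):
        states = lfsr_states(taps)
        print(f"{name}: {len(states)} states before repeating: {states}")
```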

7.9.3 Signature Analyzer

Given a signature analyzer equation of x^2 + x + 1, find the expected value of the flops in the signature analyzer at the end of the test sequence. Also, design the hardware for the signature analyzer and result checker.

Answer:

[Figure: schematic with set and mode inputs, flops q0, q1, q2, and inputs i_d(0), i_d(1), i_d(2)]

Connect the test generator to the circuit. The remainder of the result sequence divided by the signature analyzer polynomial is the value in the flops of the signature analyzer at the end of the test sequence.

Expected sequence of values from circuit:

      q0   q1   q2   z
1)    1    1    1    1    x^6
2)    1    0    1    0    x^5
3)    1    0    0    0    x^4
4)    0    1    0    0    x^3
5)    0    0    1    0    x^2
6)    1    1    0    1    x^1
7)    0    1    1    1    x^0

Polynomial for output sequence of circuit under test: x^6 + x + 1

Format for division:

m(x) = x^6 + x + 1    message (output of circuit under test)
p(x) = x^2 + x + 1    polynomial of signature analyzer
q(x)                  quotient
r(x)                  remainder

Carry out the division of 1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0 by 1x^2 + 1x^1 + 1x^0:

Quotient term   Subtract                 Polynomial below the subtraction line
1x^4            1x^6 + 1x^5 + 1x^4       1x^5 + 1x^4 + 0x^3
1x^3            1x^5 + 1x^4 + 1x^3       0x^4 + 1x^3 + 0x^2
0x^2            0x^4 + 0x^3 + 0x^2       1x^3 + 0x^2 + 1x^1
1x^1            1x^3 + 1x^2 + 1x^1       1x^2 + 0x^1 + 1x^0
1x^0            1x^2 + 1x^1 + 1x^0       1x^1 + 0x^0

Quotient: 1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0
Remainder: 1x^1 + 0x^0

Check division:

\[
m(x) = q(x)\,p(x) + r(x) = (1x^{4} + 1x^{3} + 1x^{1} + 1x^{0})(1x^{2} + 1x^{1} + 1x^{0}) + 1x^{1} = 1x^{6} + 1x^{1} + 1x^{0}
\]

Division was done correctly. The final value on the two flops in the signature analyzer will be the remainder: 1x^1 + 0x^0 = 10.
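The long division over GF(2) can be checked mechanically. The Python sketch below stores each polynomial as an integer bit mask (bit k holds the coefficient of x^k), which is a representation chosen here for illustration rather than one used in the notes. It reproduces the quotient and remainder for this signature analyzer, and the same helper also reproduces the division from section 7.9.2.

```python
def gf2_divmod(message, divisor):
    """Divide two GF(2) polynomials given as bit masks; return (quotient, remainder)."""
    quotient = 0
    remainder = message
    while remainder.bit_length() >= divisor.bit_length():
        shift = remainder.bit_length() - divisor.bit_length()
        quotient |= 1 << shift
        remainder ^= divisor << shift    # subtraction over GF(2) is XOR
    return quotient, remainder


if __name__ == "__main__":
    # m(x) = x^6 + x + 1 divided by p(x) = x^2 + x + 1
    q, r = gf2_divmod(0b1000011, 0b111)
    print(f"quotient = {q:05b}, remainder = {r:02b}")
    # message x^9 + x^8 + x^7 divided by x^3 + x + 1 (section 7.9.2)
    q, r = gf2_divmod(0b1110000000, 0b1011)
    print(f"quotient = {q:07b}, remainder = {r:03b}")
```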

NOTE: When looking at the remainder (signature), we look at the outputs of the flops, representing the flop nearest the input as x^0.

Using hardware:

[Figure: signature analyzer schematic with input i, reset, and flops d0/q0 and d1/q1]

i     1 0 0 0 0 1 1
d0    1 0 1 1 0 0 0 0
q0    0 1 0 1 1 0 0 0
d1    0 1 1 0 1 1 1 0
q1    0 0 1 1 0 1 1 1

Signature analyzer and timing diagram (the diagram marks the quotient and the remainder)

The quotients and the remainder calculated using long division match the ones that were calculated using the circuit. The values on the flops in the signature analyzer match, cycle by cycle, the two most significant coefficients on the intermediate remainders calculated during long division. The intermediate remainders are the polynomials below the subtraction lines. (When looking at the circuit, remember that for an LFSR with n flops, it takes n clock cycles for the circuit to become primed with the input sequence and match the long-division arithmetic.)

The ok circuit for this signature analyzer is just a 2-input AND gate with the q0 input inverted, because the remainder is 10 (q1 = 1, q0 = 0).

[Figure: signature analyzer with ok circuit]

The result checker should check the ok signal one cycle after the last test vector. The last test vector in the sequence is 110. We can either look for 110 and delay by one clock cycle, or we can look for the first test vector (111) in the second iteration of the sequence. To make sure that we are looking at the second iteration of the sequence, and not the first, we look at reset.

[Figure: result checker circuit option 1: max-length LFSR (q0, q1, q2), circuit under test with output z, signature analyzer with output ok, and all_ok]

[Figure: result checker circuit option 2: max-length LFSR (q0, q1, q2), reset, circuit under test with output z, signature analyzer with output ok, and all_ok]

7.9.4 Probability of Catching a Fault

Find the approximate probability of a fault not being detected.

Answer:

We have a sequence of 7 bits coming from the circuit under test. This gives us 2^7 = 128 possible sequences. Of these, 1 is the good sequence and 127 are faulty sequences. The signature analyzer stores 2 bits of data, which gives us 4 possible values. Thus, on average 128 / 4 = 32 different result sequences will map to the same 2-bit signature. Of these 32 vectors, 1 is the good sequence and 31 are faulty sequences.

Assume that each result sequence is equally likely to occur. (NOTE: this is a poor assumption; a full analysis would make each stuck-at fault equally likely, then compute the result vector for each fault.) With this assumption, there is a 31 / 127 = 24% chance that a faulty sequence will result in the same signature as the good sequence. There is approximately a 24% chance that a faulty circuit will not be detected.

7.9.5 Probability of Catching a Fault

If we increase the size of the signature analyzer by one flip flop, by how much do we change the approximate probability of a fault not being detected?

Answer:

A signature analyzer with 3 bits of data gives us 8 possible values. Thus, on average 128 / 8 = 16 different result sequences will map to the same 3-bit signature. Assuming that each result sequence is equally likely to occur, there is a 15 / 127 = 11.8% chance that a faulty sequence will result in the same signature as the good sequence. There is approximately a 12% chance that a faulty circuit will not be detected. Thus, we have decreased the probability of a faulty circuit not being detected from 24% to 12%.
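The estimate used in the last two answers can be written as a one-line formula. The Python sketch below (hypothetical function name) relies on the same simplifying assumption that every result sequence is equally likely and that sequences spread evenly over the signatures.

```python
def aliasing_probability(sequence_bits, signature_bits):
    """Approximate chance that a faulty sequence shares the good signature."""
    total_sequences = 2 ** sequence_bits
    per_signature = total_sequences // (2 ** signature_bits)
    # One sequence per signature bucket is the good one; the rest are faulty.
    return (per_signature - 1) / (total_sequences - 1)


if __name__ == "__main__":
    for flops in (2, 3):
        print(f"{flops}-bit signature: {aliasing_probability(7, flops):.1%}")
```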

7.9.6 Detecting a Specific Fault

Determine whether L7@0 is detectable.

Answer:

Equation for faulty circuit: a AND b.

Faulty sequence of values from circuit:

      a    b    c    z
1)    1    1    1    1    x^6
2)    1    0    1    0    x^5
3)    1    0    0    0    x^4
4)    0    1    0    0    x^3
5)    0    0    1    0    x^2
6)    1    1    0    1    x^1
7)    0    1    1    0    x^0

Polynomial for result sequence: x^6 + x

Compute the remainder by dividing 1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 0x^0 by 1x^2 + 1x^1 + 1x^0:

Quotient term   Subtract                 Polynomial below the subtraction line
1x^4            1x^6 + 1x^5 + 1x^4       1x^5 + 1x^4 + 0x^3
1x^3            1x^5 + 1x^4 + 1x^3       0x^4 + 1x^3 + 0x^2
0x^2            0x^4 + 0x^3 + 0x^2       1x^3 + 0x^2 + 1x^1
1x^1            1x^3 + 1x^2 + 1x^1       1x^2 + 0x^1 + 0x^0
1x^0            1x^2 + 1x^1 + 1x^0       1x^1 + 1x^0

Quotient: 1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0
Remainder: 1x^1 + 1x^0

This remainder (11) differs from the remainder for the correct circuit (10), thus the fault will be detected.

In hardware:

i     1 0 0 0 0 1 0
d0    1 0 1 1 0 0 1
q0    0 1 0 1 1 0 0 1
d1    0 1 1 0 1 1 1
q1    0 0 1 1 0 1 1 1

The final values on the flops are q1 = 1 and q0 = 1, which matches the remainder from the long division.
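A bit-serial model of the signature analyzer makes it easy to compare signatures. The Python sketch below assumes the structure implied by the timing diagram for x^2 + x + 1 (d0 = i XOR q1, d1 = q0 XOR q1, flops reset to 0); the function name and the coefficient-list encoding are illustrative. In this model the good sequence 1000011 ends with (q0, q1) = (0, 1) and the L7@0 sequence 1000010 ends with (1, 1), so the two signatures differ.

```python
def signature(bits, low_coeffs):
    """Shift bits (most significant first) through a signature analyzer.

    low_coeffs[k] is the coefficient of x^k in the characteristic polynomial,
    for k below the polynomial's degree. Returns [q0, q1, ...] after the last bit.
    """
    state = [0] * len(low_coeffs)
    for bit in bits:
        feedback = state[-1]
        nxt = [bit ^ (low_coeffs[0] & feedback)]
        for k in range(1, len(state)):
            nxt.append(state[k - 1] ^ (low_coeffs[k] & feedback))
        state = nxt
    return state


if __name__ == "__main__":
    good = [1, 0, 0, 0, 0, 1, 1]     # z sequence of the correct circuit
    faulty = [1, 0, 0, 0, 0, 1, 0]   # z sequence with L7 stuck at 0
    print("good  :", signature(good, [1, 1]))
    print("faulty:", signature(faulty, [1, 1]))
```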

7.9.7 Time to Run Test

Find the number of clock cycles to run the test.

Answer: For a maximal-length LFSR of n bits, it takes 2^n - 1 clock cycles to generate the 2^n - 1 test vectors, plus one cycle at the end to flop the results. This gives a total of 2^n clock cycles, which in our case is 8.

7.10 Power and BIST

You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing has dictated is needed. What can you do to reduce the power consumption of the chip without negatively affecting performance or incurring significant design effort?

Answer:

When in test mode, run the clock at a lower frequency so that the chip will consume less power.
Add clock gating to the signature analyzer so that it is turned off when the chip is in normal mode.

7.11 Timing Hazards and Testability

This question deals with the following circuit:

[Figure: circuit with lines L1 to L15]

1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.

Answer:

[Figure: diagram labelled a and c]

None of the minterms are completely covered by other minterms, so the circuit is irredundant and does not have undetectable faults. The two minterms ac and ab overlap, but neither is completely covered by other minterms. So, if one of them was stuck at 0, there would be at least one set of input values that would cause the faulty circuit to differ from the correct circuit.

2. Does the circuit have any static timing hazards?

Answer: Moving from abc to abc moves between minterms. Thus, there is a potential timing hazard.

[Figure: timing diagram of a and c showing a potential glitch (static hazard)]

3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify any untestable single-stuck-at faults in the resulting circuit.

Answer:

[Figure: revised circuit with lines L1 to L19; the faults L9@0, L13@0, L16@0, L17@0, L18@0, and L19@0 are marked]

The minterms ab and bc are both completely covered by other minterms. Thus, these minterms are redundant and are sources of undetectable faults. This gives us L13@0 and L19@0 as undetectable single stuck-at faults. Using gate collapsing, we see that the following faults are equivalent to L13@0: L9@0, L16@0. And the following are equivalent to L19@0: L17@0, L18@0.

NOTE: although both L16@0 and L17@0 are undetectable, this does not mean that L2@0 is undetectable. L2@0 is equivalent to having both L16@0 and L17@0 at the same time. Check the Boolean equations if you are in doubt about this.

7.12 Testing Short Answer

7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?

If not, explain why. If so, describe such a fault.

Answer: Yes.

A fault that is only detectable with 000 will be detectable by scan testing but not by built-in self test, because the LFSR test generator never produces the all-zeros vector.
A fault that results in the same signature as the correct circuit will be detectable by scan testing but not by built-in self test.

7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?

If not, explain why. If so, describe such a fault.

Answer: No. Any fault that is detectable by built-in self testing can be detected by scan testing, where the test vector that we scan in is the BIST test vector that triggers the fault.

If scan testing is interpreted as boundary scan testing and built-in self test is allowed inside a chip, then there are faults that are detectable by built-in self test but not by boundary scan testing. These faults would be inside redundant sequential circuitry. But this scenario was not intended to be part of this question.

7.13 Fault Testing

In this question, you will design and analyze built-in self test circuitry for the circuit-under-test shown below.

7.13.1 Design test generator

Draw the schematic for a 2-bit maximal-length linear feedback shift register and demonstrate that it is maximal length.

Answer:

clk cycle   d0   q0   d1   q1   value
1           1    1    0    1    3
2           0    1    1    0    1
3           1    0    1    1    2
4                1         1    3

The register steps through 2^2 - 1 = 3 distinct values (3, 1, 2) before returning to 3, so it is maximal length.

7.13.2 Design signature analyzer

Design a signature analyzer circuit for a characteristic polynomial of x + 1.

Answer:

7.13.3 Determine if a fault is detectable

Is a stuck-at-1 fault on the output of the inverter detectable with the circuitry that you've designed?

Answer:

This solution has an error

1. Equation for correct circuit-under-test is a a b output 1 1 0 1 1 0 0 1 1 b.

2. Simulating correct output sequence 011 through signature analyzer:

i     0 1 1
d0    0 1 0
q0    0 0 1 0

3. Equation for faulty circuit-under-test is ab a b output 1 1 1 0 1 0 0 1 0

4. Simulating faulty output sequence 100 through signature analyzer:

i     1 0 0
d0    1 1 1
q0    0 1 1 1

5. Output of signature analyzer is different from correct circuit, so the fault will be detected.

ab.
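The two signature-analyzer simulations can be reproduced in a few lines of Python. This sketch assumes, based on the i, d0, and q0 values listed in the simulations, that the x + 1 analyzer is a single flop with d0 = i XOR q0 that resets to 0, so its final value is simply the parity of the input sequence. The function name is hypothetical.

```python
from functools import reduce
from operator import xor


def parity_signature(bits):
    """Final q0 of a one-flop analyzer with d0 = i XOR q0, starting from q0 = 0."""
    return reduce(xor, bits, 0)


if __name__ == "__main__":
    print("correct sequence 011 ->", parity_signature([0, 1, 1]))
    print("faulty sequence 100  ->", parity_signature([1, 0, 0]))
```

Because the analyzer only tracks parity under this assumption, any faulty sequence with an even number of flipped bits would alias to the good signature.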

7.13.4 Testing time

How many clock cycles does your BIST circuitry require to test the circuit under test? Explain how each clock cycle is used.

Answer:

1. reset circuit
2. run first of three test vectors
3. run second of three test vectors
4. run third of three test vectors
5. flop result from circuit under test into signature analyzer

5 clock cycles.
