Mark Aagaard
University of Waterloo, Dept of Electrical and Computer Engineering
2003t1 Winter
March 24, 2003
Contents
I Lecture Notes
1 VHDL
LEC-02: Introduction to VHDL
1.1 Prelude
1.1.1 Topics in this Chapter
1.1.2 Background Material
1.1.3 Recommended Reading
1.2 Introduction to VHDL
1.2.1 VHDL Origins and History
1.2.2 Semantics
1.2.3 Synthesis of a Simulation-Based Language
1.2.4 Solution to Synthesis Sanity
1.2.5 VHDL Disadvantages
1.2.6 VHDL Advantages
1.2.7 VHDL and Other Languages
1.2.7.1 VHDL vs Verilog
1.2.7.2 VHDL vs SystemC
1.2.7.3 VHDL vs Other Hardware Description Languages
1.2.7.4 Summary of VHDL Evaluation
1.3 Overview of Syntax
1.3.1 Syntactic Categories
1.3.2 Library Units
1.3.3 Entities and Architecture
1.3.4 Concurrent Statements
1.3.5 Component Declaration and Instantiations
1.3.6 Processes
1.3.7 Sequential Statements
1.4 Concurrent vs Sequential Statements
1.4.1 Concurrent Assignment vs Process
1.4.2 Conditional Assignment vs If Statements
1.4.3 Selected Assignment vs Case Statement
1.4.4 Coding Style
1.5 Overview of Processes
1.5.1 Combinational Process vs Clocked Process
1.5.2 Latch Inference
1.5.3 Combinational vs Flopped Signals
LEC-03: Details of Process Execution
1.6 Details of Process Execution
1.6.1 Definitions and Algorithm
1.6.1.1 Temporal Granularities of Simulation
1.6.1.2 Process Modes
1.6.1.3 Simulation Algorithm
1.6.1.4 Delta-Cycle Definitions
1.6.2 Example: Process Execution
1.6.3 Example: Need for Provisional Assignments
LEC-04: Hardware Building Blocks
1.7 VHDL and Hardware Building Blocks
1.7.1 Basic Building Blocks
1.7.2 Deprecated Building Blocks for RTL
1.7.3 Hardware and Code for Flops
1.7.3.1 Flip-Flops vs Latches
1.7.3.2 Flops with Waits and Ifs
1.7.3.3 Flops with Synchronous Reset
1.7.3.4 Flops with Chip-Enable
1.7.3.5 Flops with Chip-Enable and Mux on Input
1.7.3.6 Flops with Chip-Enable, Muxes, and Reset
1.7.4 An Example Sequential Circuit
1.8 Synthesizable vs Non-Synthesizable Code
1.8.1 Unsynthesizable Code
1.8.1.1 Initial Values
1.8.1.2 Wait For
1.8.1.3 Different Wait Conditions
1.8.1.4 Multiple if rising_edge in Same Process
1.8.1.5 if rising_edge and wait in Same Process
1.8.1.6 if rising_edge with else Clause
1.8.1.7 if rising_edge Inside a for Loop
1.8.1.8 wait Inside of a for loop
1.8.2 Synthesizable, but Undesirable Hardware
1.8.2.1 Asynchronous Reset
1.8.2.2 Bad Form of Nested Ifs
1.8.2.3 Deeply Nested Ifs
1.9 Numbers, Arithmetic, Arrays, and Signals
1.9.1 Arithmetic Packages
1.9.2 Shift and Rotate Operations
1.9.3 Overloading of Arithmetic
1.9.4 Different Widths and Arithmetic
1.9.5 Overloading of Comparisons
1.9.6 Different Widths and Comparisons
1.9.7 Type Conversion
2 RTL Design with VHDL
2.1 Prelude to Chapter
2.1.1 Topics in this Chapter
LEC-05: Dataflow Diagrams
2.2 Design Flow
2.2.1 Generic Design Flow
Additional material in notes
2.2.2 Implementation Flows
2.2.3 Classes of Hardware
2.2.4 Design Flow: Datapath vs Control vs Storage
2.2.4.1 Datapath-Centric Design Flow
2.2.4.2 Control-Centric Design Flow
2.2.4.3 Storage-Centric Design Flow
2.3 Dataflow Diagrams and High-Level Models
2.3.1 Overview of Example
2.3.1.1 Software vs Hardware Algorithms
2.3.1.2 Serial vs Parallel
2.3.2 Dataflow Diagrams
2.3.2.1 Dataflow Diagrams Overview
2.3.2.2 Area Estimation
2.3.3 Dataflow Diagram Execution
2.3.3.1 Performance Estimation
2.3.3.2 Design Analysis
2.3.4 Area / Performance Tradeoffs
2.3.5 Optimize Inputs and Outputs
2.3.6 From Dataflow Diagram to High-Level Model
2.3.7 From Dataflow Diagram to DP+Ctrl Model
2.3.7.1 Datapath for DP+Ctrl Model
2.3.8 Dataflow Diagram Scheduling
2.3.9 Summary: From Dataflow to Hardware
LEC-06: State Machine Design
2.4 Finite State Machines in VHDL
2.4.1 Mealy vs Moore State Machines
2.4.2 State Machines and VHDL
2.4.2.1 Implicit and Explicit State Machines
2.4.3 Some Simple State Machines
2.4.3.1 Implementing a Simple Moore Machine
2.4.3.2 Implementing a Simple Mealy Machine
2.4.4 State Encoding
2.4.4.1 Constants vs Enumerated Type
2.4.4.2 Encoding Schemes
2.4.5 From Dataflow to State Machine
2.4.6 Implicit vs Explicit State Machines
2.4.7 Implicit State Machines
2.4.7.1 Multi-Wait Process
2.4.7.2 Counter
2.4.8 Explicit State Machines
2.4.8.1 State Machine
2.4.8.2 Conditional Assignment
2.4.8.3 Conditional Assignment with Don't Care
2.4.8.4 Selected Assignment with Don't Care
2.4.8.5 Case Statement
2.4.9 Reset
2.4.10 Input / Output Protocols
LEC-07: Memory Design
2.5 Memory Arrays and RTL Design
2.5.1 Memory Arrays and Dataflow Diagrams
2.5.1.1 Legend for Dataflow Diagrams
2.5.1.2 Basic Memory Operations
2.5.1.3 Data Dependencies
2.5.1.4 Definition of Three Types of Dependencies
2.5.1.5 Dataflow Diagrams and Data Dependencies
2.5.1.6 Example: Memory Array and Dataflow Diagram
2.5.2 Memory Arrays in VHDL
2.5.2.1 Two-Dimensional Array
2.5.2.2 Memory Array in Hardware
2.5.2.3 Example VHDL Code for Memory Array in Hardware
2.5.2.4 Library Component
2.5.2.5 Build Memory from Slices
2.5.2.6 Dual-Ported Memory
LEC-08: Design Example: Stack
2.6 Design Example: Stack
2.6.1 Stack Requirements
2.6.1.1 Stack Entity
2.6.1.2 Stack Instructions
2.6.1.3 Stack Instruction Encoding
2.6.1.4 Miscellaneous Requirements
2.6.2 Stack Algorithm
2.6.3 Stack Dataflow Diagrams
2.6.3.1 Initial Diagrams
2.6.3.2 Partition into Clock Cycles
2.6.3.3 High-Level Model
2.6.3.4 Individual Block Diagrams
2.6.3.5 Complete Block Diagram
2.6.4 Stack: Register Transfer Level
2.6.4.1 Stack: Separate Control, Datapath and Storage
2.6.4.2 Stack: Datapath Operations
2.6.4.3 Stack: Explicit State Machine
LEC-09: Guidelines and Optimization Techniques
2.7 RTL Coding Guidelines
2.7.1 Design Process
2.7.2 Signal Declarations
2.7.3 Processes
2.7.4 Flip-Flops and Latches
2.7.4.1 Multiplexors and Tri-State Signals
2.7.5 State Machines
2.7.5.1 Reset
2.7.6 Inputs and Outputs
2.8 Additional VHDL Features
2.8.1 Vectors
2.8.2 Still More VHDL Features
2.9 General Optimization Techniques
2.9.1 Strength Reduction
2.9.1.1 Arithmetic Strength Reduction
2.9.1.2 Boolean Strength Reduction
2.9.2 Replication and Sharing
2.9.2.1 Mux-Pushing
2.9.2.2 Common Subexpression Elimination
2.9.2.3 Computation Replication
2.9.3 Arithmetic
2.9.4 Pipelining
LEC-10: FPGA-Specific Guidelines and Optimization
2.10 FPGA-Specific Guidelines
2.10.1 Generic FPGAs
2.10.1.1 Overview of Generic FPGA Hardware
2.10.1.2 Generic Clocks
2.10.1.3 Special Circuitry in FPGAs
2.10.2 Altera APEX20K
2.11 Example Circuits
2.11.1 Ripple-Carry Adder
2.11.2 Barrel Shifter
3 Functional Validation
LEC-11: Functional Validation of Datapath Circuits
3.1 Overview
3.1.1 Validation / Verification / Testing
3.1.2 Why Your First Circuit Will Not Work
3.2 Test Cases
3.2.1 Coverage
3.2.2 Heating System Example
3.2.2.1 Number of Cases to Consider
3.2.2.2 Representation Simplification
3.2.3 Floating Point Divider Example
3.2.4 Functional Validation Challenges
3.3 Testbenches
3.3.1 Overview of Test Benches
3.3.2 Reference Model Style Testbench
3.3.3 Relational Style Testbench
3.3.4 Coding Structure of a Testbench
3.3.5 Datapath vs Control
3.4 Functional Validation for Datapath Circuits
3.4.1 A Spec-Less Testbench
3.4.2 Use an Array for Test Vectors
3.4.3 Build Spec into Stimulus
3.4.4 Have Separate Specification Entity
3.4.5 Generate Test Vectors
3.4.6 Relational Specification
LEC-12: Functional Validation of State Machines
3.5 Functional Validation of Control Circuits
3.5.1 Overview of Queues in Hardware
3.5.2 VHDL Coding
3.5.2.1 Package
3.5.2.2 Other VHDL Coding
3.5.3 Code Structure for Validation
3.5.4 Instrumentation Code
3.5.5 Coverage Monitors
3.5.6 Assertions
3.5.7 VHDL Coding Tips
3.5.8 Queue Specification
3.5.9 Queue Testbench
4 Performance Analysis and Optimization
4.1 Intro
4.1.1 Concepts
4.1.2 Background Material
4.1.3 Reading Material
LEC-13: Introduction to Performance Analysis
4.2 Defining Performance
4.3 Comparing Performance
4.3.1 Performance for Different Tasks
4.3.2 Optimizing Performance
4.4 Clock Speed, CPI, Program Length, and Performance
4.4.1 Mathematics
4.4.2 Example: CISC vs RISC and CPI
4.4.3 Summary of Equations
LEC-14: Performance and Dataflow Diagrams
4.5 Performance Analysis and Dataflow Diagrams
4.5.1 Dataflow Diagrams, CPI, and Clock Speed
4.5.1.1 Tradeoffs
4.5.2 Dataflow Diagram with Two Instructions
4.5.2.1 Scheduling of Operations for Different Clock Periods
4.5.2.2 Performance Computation for Different Clock Periods
4.5.2.3 Example: Two Instructions Taking Similar Time
4.5.2.4 Example: Same Total Time, Different Order for A
4.5.3 Example: From Algorithm to Optimized Dataflow
4.5.4 Optimality: Performance vs Area Tradeoffs
4.5.5 Effect of Instruction Set on Performance
4.5.6 Effect of Time to Market on Relative Performance
5 Timing Analysis
5.1 Preliminaries
5.1.1 Overview
5.1.2 Background Material
5.1.3 Reading Material
LEC-15: Introduction to Timing Analysis
5.2 Delays and Definitions
5.2.1 Related Background Definitions
5.2.2 Timing Constraints
5.2.2.1 Minimum Clock Period
5.2.2.2 Hold Constraint
5.2.3 Clock-Related Timing Definitions
5.2.3.1 Clock Skew (Smith 6.5.1)
5.2.3.2 Clock Latency (Smith 6.5.1)
5.2.3.3 Clock Jitter (Smith pp873)
5.2.4 Storage Related Timing Definitions (Smith 2.5.2)
5.2.4.1 Setup Time
5.2.4.2 Hold Time
5.2.4.3 Clock-to-Q Time
5.2.4.4 Example Timing Violations
5.2.5 Propagation Delays
5.2.5.1 Load Delays (Smith 3.1)
5.2.5.2 Interconnect Delays (Smith 7.1)
5.3 Critical Paths: False and True
5.3.1 Critical Path Example
5.3.2 Algorithm to Find Critical Path
5.3.2.1 Critical Path Between Two Signals
5.3.2.2 Critical Path Between Sets of Signals
5.3.3 False Paths
5.3.3.1 Static False Path Example
5.3.3.2 Dynamic False Path Example
CHANGE ver2 (2002/12/02): corrected edge polarity on a
5.3.3.3 Another Dynamic False Path Example
5.3.3.4 And Another Dynamic False Path Example
5.3.3.5 Algorithm for False Path Detection
5.3.4 Increasing the Accuracy of Critical Path Analysis
LEC-16: Math, Physics, and Applications of Timing Analysis
5.4 Analog Effects in Timing Analysis
5.4.1 Timing Model (Smith 3.1, 13.6)
5.4.1.1 Equation for Output Voltage
5.4.1.2 Extrinsic / Intrinsic Delays (Smith 13.6)
5.4.2 Data-Dependent Delay
5.4.3 Interconnect Delay (Smith 7.1)
5.4.3.1 Elmore Time Constant (Smith 7.1.2)
5.4.3.2 Interconnect with Single Fanout
5.4.3.3 Interconnect with Multiple Gates in Fanout
5.4.3.4 FPGAs, Interconnect, and Synthesis
5.5 Practical Usage of Timing Analysis
5.5.1 Speed Binning (Smith 5.1.6)
5.5.2 Worst Case Timing (Smith 5.1.7)
5.5.2.1 Fanout delay
5.5.2.2 Derating Factors
LEC-17: Timing Analysis (Latches and Flip Flops)
5.6 Timing Analysis of Latches and Flip Flops
5.6.1 Simple Latch
5.6.2 Clock-to-Q Time of a Simple Latch
5.6.3 Setup Timing of a Simple Latch
5.6.3.1 Hold Time of a Simple Latch
5.6.3.2 Example of a Bad Latch
5.6.4 Timing Analysis of a Transmission Gate Latch
5.6.4.1 Transmission Gate (Smith 2.4.3)
5.6.4.2 Transmission Gate Latch (Smith 2.5.1)
5.6.4.3 Clock-to-Q Delay for Latch
5.6.4.4 Setup and Hold Times for Latch
5.6.5 Falling Edge Flip Flop (Smith 2.5.2)
5.6.5.1 Behaviour of Flip-Flop
5.6.5.2 Clock-to-Q of Flip-Flop
5.6.5.3 Setup of Flip-Flop
5.6.5.4 Hold of Flip-Flop
5.6.6 Timing Analysis of FPGA Cells (Smith 5.1.5)
5.6.6.1 Standard Timing Equations
5.6.6.2 Hierarchical Timing Equations
5.6.6.3 Actel Act 2 Logic Cell
5.6.6.4 Timing Analysis of Actel Sequential Module
5.6.7 Exotic Flop
6 Power Analysis and Design
LEC-18: Introduction to Power
6.1 Overview
6.1.1 Importance of Power and Energy
6.1.2 Industrial Names and Products
6.1.3 Power vs Energy
6.1.4 Batteries, Power and Energy
6.1.4.1 Do Batteries Store Energy or Power?
6.1.4.2 Battery Life and Efficiency
6.1.5 Example Problem: Battery Life and Power
6.2 Power Equations
6.2.1 Dynamic Power and Activity Factor
6.2.2 Switching Power
6.2.3 Short-Circuited Power
6.2.4 Leakage Power
6.2.5 Glossary
6.2.6 Note on Power Equations
LEC-19: Data Encoding for Power Reduction
6.3 Overview of Power Reduction Techniques
6.4 Voltage Reduction for Power Reduction
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power
6.5.2 Example Problem
6.5.2.1 Problem Statement
6.5.2.2 Additional Information
6.5.2.3 Answer
LEC-20: Clock Gating for Power Reduction
6.6 Clock Gating
6.6.1 Introduction to and Overview of Clock Gating
6.6.1.1 Examples of Clock Gating
6.6.1.2 Design Tradeoffs
6.6.1.3 Functional Validation and Clock Gating
6.6.2 Implementing Clock Gating
6.6.2.1 Simple Power Analysis
6.6.2.2 Valid-Bit Protocol
6.6.2.3 Clock Gating and Big Circuit
6.6.2.4 Designing Clock Gating Circuitry
6.6.3 Design Problem
6.6.3.1 Solution Sketch
7 Fault Testing and Testability
7.1 Introduction
7.1.1 Purpose and List of Concepts
7.1.2 Background Material
7.1.3 Reading Material
LEC-21: Introduction to Faults, Testing, and Testability
7.2 Faults and Testing
7.2.1 Overview of Faults and Testing
7.2.1.1 Faults (Smith 14.3)
7.2.1.2 Causes of Faults (Smith 14.3)
7.2.1.3 Testing (Smith 14)
7.2.1.4 Burn In (Smith 14.3.1)
7.2.1.5 Bin Sorting (Smith 5.1.6)
7.2.1.6 Testing Techniques (Smith 14)
7.2.1.7 Design for Testability (DFT) (Smith 14.6)
7.2.2 Example Problem: Economics of Testing (Smith 14.1)
7.2.3 Physical Faults (Smith 14.3.3)
7.2.3.1 Types of Physical Faults
7.2.3.2 Locations of Faults
7.2.3.3 Layout Affects Locations
7.2.3.4 Naming Fault Locations
7.2.4 Detecting a Fault
7.2.4.1 Which Test Vectors will Detect a Fault?
7.2.4.2 A Single Test-Vector Can Detect Several Faults
7.2.5 Mathematical Models of Faults (Smith 14.3.4)
7.2.5.1 Single Stuck-At Fault Model
7.2.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
7.2.6.1 Algorithm
7.2.6.2 Example of Finding a Test Vector
7.2.7 Undetectable Faults
7.2.7.1 Redundant Circuitry
7.2.7.2 Curious Redundant Circuitry and Fault Detection
LEC-22: Fault Detection and Test-Vector Generation
7.3 Faults
7.3.1 Locations of Faults
7.3.2 Choosing Test Vectors (Smith 14.3.7)
7.3.2.1 Fault Domination
7.3.2.2 Fault Equivalence
7.3.2.3 Gate Collapsing
7.3.2.4 Node Collapsing
7.3.2.5 Fault Collapsing Summary
7.3.3 Fault Coverage
7.3.4 Generate Test Vectors for 100% Coverage
7.3.4.1 Collapse the Faults
7.3.4.2 Check for Fault Domination
7.3.4.3 Required Test Vectors
7.3.4.4 Faults Not Covered by Required Test Vectors
7.3.4.5 Order to Run Test Vectors
7.3.4.6 Summary of Technique to Find and Order Test Vectors
7.3.4.7 Complete Analysis
7.3.5 One Fault Hiding Another
LEC-23: Built In Self Test
7.4 Built In Self Test (Smith 14.7)
7.4.1 Block Diagram
7.4.1.1 Components
7.4.1.2 Linear Feedback Shift Register (LFSR)
7.4.1.3 Maximal-Length LFSR
7.4.2 Test Generator
7.4.3 Signature Analyzer
7.4.4 Result Checker
7.4.5 Arithmetic over Binary Fields
7.4.6 Shift Registers and Characteristic Polynomials (Smith 14.7.5)
7.4.6.1 Circuit Multiplication
7.4.7 Bit Streams and Characteristic Polynomials
7.4.8 Division
7.4.9 Signature Analysis: Math and Circuits
7.4.10 Summary
LEC-24: Scan Testing (JTAG)
7.5 Scan Testing in General (Smith 14.6)
7.5.1 Structure and Behaviour of Scan Testing
7.5.2 Scan Chains
7.5.2.1 Circuitry in Normal Mode
7.5.2.2 Scan in Operation
7.5.2.3 Scan in Operation with Example Circuit
7.5.3 Summary of Scan Testing
7.5.4 Example: Time to Test a Chip
7.6 Boundary Scan
7.6.1 Boundary Scan History
7.6.2 Scan Pins
7.6.3 Scan Registers and Cells
7.6.4 Scan Instructions
7.6.5 TAP Controller
7.6.6 Other descriptions of JTAG/IEEE 1149.1
7.7 Summary and Conclusions on Testing
7.7.1 Faults
7.7.2 Testing
7.7.2.1 Scan Testing
7.7.2.2 Built-In Self Test (BIST)
7.7.3 Scan vs Self Test
8 Review
LEC-25: Review
8.1 Overview of the Term
8.2 VHDL
8.3 Design and Optimization Techniques
8.4 Validation
8.5 Performance Prediction and Analysis
8.6 Timing Analysis
8.7 Power
8.8 Testing
8.9 Formulas to be Given on Final Exam
2 Design Problems
SOL-03: Datapath and Control Design
2.1 Synthesis
2.1.1 Data Structures
2.1.2 Own Code vs Libraries
2.2 Design Guidelines
2.3 Dataflow Diagram Optimization
2.3.1 Resource Usage
2.3.2 Optimization
2.4 Dataflow Diagram Design
2.4.1 Maximum performance
2.4.2 Minimum area
2.5 Design and Optimization
SOL-04: Memory Design
2.6 Dataflow Diagrams with Memory Arrays
2.6.1 Algorithm 1
2.6.2 Algorithm 2
SOL-05: Optimization and FPGA Implementation
2.7 2-bit adder
2.7.1 Generic Gates
2.7.2 Xilinx FPGA
2.8 Sketches of Problems
3 Functional Validation Problems
SOL-06: Functional Validation
3.1 Functional Validation Problems
3.1.1 Carry Save Adder
3.1.2 Traffic Light Controller
3.1.3 State Machines and Validation
3.1.4 Additional Problem
3.1.5 Test Plan Creation
3.1.5.1 Early Tests
3.1.5.2 Corner Cases
4 Performance Analysis and Optimization Problems
SOL-07: Performance Analysis and Optimization
4.1 Farmer
4.2 Network and Router
4.2.1 Maximum Throughput
4.2.2 Packet Size and Performance
4.3 Performance Short Answer
4.4 Microprocessors
4.4.1 Average CPI
4.4.2 Why not you too?
4.4.3 Analysis
4.5 Dataflow Diagram Optimization
4.6 Optimization with Memory Arrays
4.7 Multiply Instruction
4.7.1 Highest Performance
4.7.2 Optimality
4.7.3 Performance Metrics
5 Timing Analysis Problems
SOL-08: Timing Analysis
5.1 Terminology
5.2 Critical Path and False Path
5.3 Critical Path
5.3.1 Ignoring potential false paths, list the signals in the critical path through this circuit.
5.3.2 What is the combinational delay through the critical path?
5.3.3 Missing Factors
5.3.4 False Path?
5.4 Timing Models
5.5 Worst Case Conditions and Derating Factor
5.5.1 Worst-Case Commercial
5.5.2 Worst-Case Industrial
5.5.3 Worst-Case Industrial, Non-Ambient Junction Temperature
SOL-09: Timing Analysis (II)
5.6 Short Answer
5.6.1 Wires in FPGAs
5.6.2 Age and Time
5.6.3 Temperature and Delay
5.7 Hold Time Violations
5.7.1 Cause
5.7.2 Behaviour
5.7.3 Rectification
5.8 Latch Analysis
5.9 Combinational Timing (Smith 13.23)
6 Power Problems
SOL-10: Power Analysis and Reduction
6.1 Power Analysis and Reduction Problems
6.1.1 Short Answers
6.1.1.1 Power and Temperature
6.1.1.2 Leakage Power
6.1.1.3 Clock Gating
6.1.1.4 Gray Coding
6.1.2 VLSI Gurus
6.1.2.1 Effect on Power
6.1.2.2 Critique
6.1.3 Advertising Ratios
SOL-11: Power Analysis and Reduction
6.1.4 Vary Supply Voltage
6.1.5 Power Reality and Math (Smith prob 15.16)
6.1.6 Clock Speed Increase Without Power Increase
6.1.6.1 Supply Voltage
6.1.6.2 Supply Voltage
6.1.7 Power Reduction Strategies
6.1.7.1 Supply Voltage
6.1.7.2 Transistor Sizing
6.1.7.3 Adding Registers to Inputs
6.1.7.4 Gray Coding
6.1.8 Power Consumption on New Chip
6.1.8.1 Hypothesis
6.1.8.2 Experiment
6.1.8.3 Reality
7 Problems on Faults, Testing, and Testability
SOL-12: Faults, Testing, and Testability
7.1 Based on Smith q14.9: Testing Cost
7.2 Testing Cost and Total Cost
7.3 Minimum Number of Faults
7.4 Smith q14.10: Fault Collapsing
7.5 Mathematical Models and Reality
7.6 Undetectable Faults
7.7 Test Vector Generation
7.7.1 Choice of Test Vectors
7.7.2 Number of Test Vectors
7.8 Time to do a Scan Test
7.9 BIST
7.9.1 Characteristic Polynomials
7.9.2 Test Generation
7.9.3 Signature Analyzer
7.9.4 Probability of Catching a Fault
7.9.5 Probability of Catching a Fault
7.9.6 Detecting a Specific Fault
7.9.7 Time to Run Test
7.10 Power and BIST
7.11 Timing Hazards and Testability
7.12 Testing Short Answer
7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
7.13 Fault Testing
7.13.1 Design test generator
7.13.2 Design signature analyzer
7.13.3 Determine if a fault is detectable
7.13.4 Testing time
Part I
Lecture Notes
Chapter 1
LEC-02 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01–05 VHDL Design and Optimization
  wk-01 Overview and VHDL
  wk-02 VHDL Semantics
  wk-03 Dataflow Diagrams; State Machines
  wk-04 Memory Design; Example Design
  wk-05 Optimizations
Functional Validation
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Concepts
Lecture Notes: Sections 1.1–1.5.3
port, type, direction, signal, combinational process, clocked process, latch inference
1.1 Prelude
1.1.1 Topics in this Chapter
1.1.2 Background Material
1.1.3 Recommended Reading
Links to many VHDL resources are on the E&CE 427 web pages under Documentation. In addition to Smith, two other books on VHDL are on reserve in the Davis Centre Library:
- Designer's Guide to VHDL, Peter J. Ashenden
- VHDL for Logic Synthesis, Andrew Rushton
Relevant chapters in Smith: 8 (Software), 10 (VHDL), 12 (Synthesis); Appendix A. Suggested reading order in Smith:
8
10.9 other declarations
10.15 configurations and specifications
10.16 example: engine controller
remainder of Ch 12
Third pass:
10.1–10.4 intro to VHDL
10.6 packages and libraries
10.8 type declarations
Second pass:
10.11 operators
10.12 arithmetic
12.7 FSM synthesis
12.8 Memory synthesis
Reference material:
Table 10.27: VHDL summary
Table 10.28: VHDL definitions
Appendix A: VHDL syntax
1.2 Introduction to VHDL
1.2.1 VHDL Origins and History
VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit
"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)
VHDL History
Developed by the United States Department of Defense as part of the very high speed integrated circuit (VHSIC) program in the early 1980s. The Department of Defense intended VHDL to be used for the documentation, simulation and verification of electronic systems.
Goals:
- improve design process over schematic entry
- standardize design descriptions amongst multiple vendors
- portable and extensible
Inspired by the Ada programming language
- large: 97 keywords, 94 syntactic rules
- verbose (designed by committee)
- static type checking, overloading
- complicated syntax: parentheses are used for both expression grouping and array indexing
Example:
a <= b * (3 + c);  -- integer
a <= (3 + c);      -- 1-element array of integers
Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993 and 2000.
In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164 (IEEE Standard 1164-1993), was developed.
- std_logic_1164 defines 9 different values for signals (see Smith Section 10.6.2)
In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3-1997).
- numeric_std defines arithmetic over std_logic vectors and integers. NB: This is the package that you should use for arithmetic. Don't use std_logic_arith: it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools.
- numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.
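As a minimal sketch of the numeric_std recommendation above, the fragment below increments an 8-bit value. The entity name add_one and its ports are assumptions made for illustration and do not come from the lecture notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity (name and ports are illustrative only).
entity add_one is
  port (
    a : in  std_logic_vector(7 downto 0);
    z : out std_logic_vector(7 downto 0)
  );
end add_one;

architecture main of add_one is
begin
  -- Interpret a as an unsigned number, add the integer 1,
  -- and convert the result back to std_logic_vector.
  -- The result has the same width as a, so 255 + 1 wraps around to 0.
  z <= std_logic_vector(unsigned(a) + 1);
end main;
```

The explicit conversions state how the bits are to be interpreted (unsigned here), which is the mixed integer/signal arithmetic that numeric_std is meant to support.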
1.2.2 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulation turns the code c <= a AND b; into waveforms for a, b, and c]
But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit. Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).
[Figure: synthesis turns the code c <= a AND b; into a circuit with inputs a and b and output c]
CAD Tools
CAD Tools allow designers to automate lower-level design processes in implementing the desired functionality of a system. NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.
Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesis turns the code c <= a AND b; into a circuit with inputs a and b and output c]
But the VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.
[Figure: the code c <= a AND b; shown with several circuits over a, b, and c]
1.2.3 Synthesis of a Simulation-Based Language
- Not all of VHDL is synthesizable
  - c <= a AND b; (synthesizable)
  - c <= a AND b AFTER 2ns; (NOT synthesizable)
    - how do you build a circuit with exactly 2ns of delay through an AND gate?
  - more examples of non-synthesizable code, and more details, are in section 1.8
- Different synthesis tools support different subsets of VHDL
- Some tools generate erroneous hardware for some code
  - behaviour of hardware differs from VHDL semantics
- Some tools generate unpredictable hardware
- There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don't yet conform to it. (Most vendors still don't have full support for the 1993 extensions to VHDL!) For more info, see http://www.vhdl.org/siwg/.
1.2.4 Solution to Synthesis Sanity
- Pick a high-quality synthesis tool and study its documentation thoroughly
- Learn the idioms of the tool
  - Different VHDL code with the same behaviour can result in very different circuits
- Be careful if you have to port VHDL code from one tool to another
- KISS: Keep It Simple Stupid
  - VHDL examples in lectures will illustrate reliable coding techniques for the Synopsys tools (and most other tools as well).
  - Follow the coding guidelines and examples from lecture
- As you write VHDL, think about the hardware you expect to get. NB: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc.)
1.2.5 VHDL Disadvantages
- Some VHDL programs cannot be synthesized
- Different tools support different subsets of VHDL.
- Different tools generate different circuits for the same code
- VHDL is verbose
  - Many characters to say something simple
- VHDL is complicated and confusing
  - Many different ways of saying the same thing
  - Constructs that have similar purpose have very different syntax (case vs. select)
  - Constructs that have similar syntax have very different semantics (variables vs signals)
- Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs combinational)
  - The infamous latch inference problem (see section 1.5.2 for more information)
1.2.6 VHDL Advantages
- VHDL supports unsynthesizable constructs that are useful in writing testbenches and other non-hardware artifacts that we need in hardware design. VHDL can be used throughout a large portion of the design process in different capacities, from specification to implementation to verification.
- VHDL has static typechecking: many errors can be caught before synthesis and/or simulation. (In this respect, it is more similar to Java than to C.)
- VHDL has a rich collection of datatypes
- VHDL is a full-featured language with a good module system (libraries and packages).
- VHDL has a well-defined standard.
1.2.7 VHDL and Other Languages
1.2.7.1 VHDL vs Verilog
- Verilog is a simpler language: smaller language, simple circuits are easier to write
- VHDL has more features than Verilog: richer set of data types and strong type checking
- VHDL offers more flexibility and expressivity for constructing large systems.
- The VHDL Standard is more standard than the Verilog Standard
  - VHDL and Verilog have simulation-based semantics
  - Simulation vendors generally conform to the VHDL standard
  - Some Verilog constructs don't simulate the same in different tools
- VHDL is used more than Verilog in Europe and Japan
- Verilog is used more than VHDL in North America
- South-East Asia, India, South America: ?????
1.2.7.2 VHDL vs SystemC
- SystemC looks like C: familiar syntax
- C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable code as well?
- If you think VHDL is hard to synthesize, try C....
- SystemC simulation is slower than advertised
1.2.7.3 VHDL vs Other Hardware Description Languages
- Superlog: A new language (still under active development) that is based on Verilog and C. The basic core comes from Verilog. C-like extensions are included to make the language more expressive and powerful. Developed by the Co-Design company.
- Esterel: A language evolving from academia to commercial viability. Very clean semantics. Aimed at state machines, with limited support for datapath operations.
1.2.7.4 Summary of VHDL Evaluation
- VHDL is far from perfect and has lots of annoying characteristics
- VHDL is a better language for education than Verilog because the static typechecking enforces good software engineering practices
- The richness of VHDL will be useful in creating concise high-level models and powerful testbenches
1.3 Overview of Syntax
This section is just a brief overview of the syntax of VHDL, focussing on the constructs that are most commonly used. Read a book on VHDL and use online resources. (Look for VHDL under the Documentation tab in the E&CE 427 web pages for more information.)
1.3.1 Syntactic Categories
There are five major categories of syntactic constructs. (There are many, many minor categories and subcategories of constructs.)
- Library units (section 1.3.2): top-level constructs (packages, entities, architectures)
- Concurrent statements (section 1.3.4): statements executed at the same time (in parallel)
- Sequential statements (section 1.3.7): statements executed in series (one after the other)
- Expressions: arithmetic (section 1.9), Boolean, vectors, etc.
- Declarations: components, signals, variables, types, functions, ....
1.3.2 Library Units
Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations and otherwise bind together VHDL code.
- Package body: defines the contents of a library
- Packages: determine which parts of the library are externally visible
- Use clause: uses a library in an entity/architecture or another package
  - technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
- Entity (section 1.3.3)
See Smith Section 10.6 for information on packages and use clauses.
1.3.3 Entities and Architecture
[Figure 1.1: Entity and Architecture. Figure labels: "Entity: interface" and "architecture"]
The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF). See Smith Appendix A.1 for a description of the rules for understanding VHDL grammar.
library ieee;
use ieee.std_logic_1164.all;
entity and_or is
  port (
    a, b, c : in std_logic;
    z : out std_logic
  );
end and_or;
Figure 1.2: Example of an entity
[ use_clause ]
entity ENTITYID is
  [ port (
      SIGNALID : (in | out) TYPEID [ := expr ] ;
    );
  ]
  [ declaration ]
[ begin
  concurrent_statement ]
end [ entity ] ENTITYID ;
Architecture
architecture main of and_or is
  signal x : std_logic;
begin
  x <= a AND b;
  z <= x OR (a AND c);
end main;
Figure 1.4: Example of architecture
[ use_clause ]
architecture ARCHID of ENTITYID is
  [ declaration ]
begin
  [ concurrent_statement ]
end [ architecture ] ARCHID ;
Figure 1.5: Simplified grammar of architecture
1.3.4 Concurrent Statements
- Concurrent statements are used inside architectures
- Concurrent statements execute in parallel (Figure 1.6)
- Concurrent statements make VHDL fundamentally different from most software languages.
- Hardware (gates) naturally execute in parallel
- VHDL mimics the behaviour of real hardware.
- At each infinitesimally small moment of time, each gate:
  1. samples its inputs
  2. computes the value of its output
  3. drives the output
architecture main of bowser is
begin
  x1 <= a AND b;
  x2 <= NOT x1;
  z <= NOT x2;
end main;

architecture main of bowser is
begin
  z <= NOT x2;
  x2 <= NOT x1;
  x1 <= a AND b;
end main;

[Figure: circuit with inputs a and b and internal signals x1 and x2]
- conditional assignment: ... <= ... when ... else ...;
  - normal assignment (... <= ...); if-then-else style (uses when); Smith Section 10.13.4
- selected assignment: with ... select ... <= ... when ... | ..., ... when ...;
- component instantiation
- for-generate
- if-generate
- process
  - the body of a process is executed sequentially; Sections 1.3.6, 1.6; Smith Section 10.10
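To make the first two statement forms concrete, here is a small sketch that describes the same three-way multiplexer twice, once with a conditional assignment and once with a selected assignment. The entity mux_styles and all of its port names are hypothetical and chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not from the lecture notes).
entity mux_styles is
  port (
    sel           : in  std_logic_vector(1 downto 0);
    a, b, c       : in  std_logic;
    y_cond, y_sel : out std_logic
  );
end mux_styles;

architecture main of mux_styles is
begin
  -- Conditional assignment: conditions are tested in order,
  -- in the style of if-then-else.
  y_cond <= a when sel = "00" else
            b when sel = "01" else
            c;

  -- Selected assignment: one expression is matched against
  -- a list of choices, in the style of a case statement.
  with sel select
    y_sel <= a when "00",
             b when "01",
             c when others;
end main;
```

Both outputs have the same behaviour here; the choice between the two forms is mostly a matter of whether the decision is a priority chain of conditions or a lookup on a single expression.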
1.3.5 Component Declaration and Instantiations
1.3.6 Processes
- Processes are used to describe the behaviour of hardware
- A process is a concurrent statement (Section 1.3.4).
- The body of a process contains sequential statements (Section 1.3.7)
- Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
Sensitivity List
The sensitivity list contains the signals that are read in the process. An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. If you forget some signals, you will either end up with unpredictable hardware and simulation results (different results from different programs) or undesirable hardware (latches where you expected purely combinational hardware). For more on this topic, see sections 1.5.2 and 1.6.
There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.
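The guideline and its exception can be sketched side by side. In the fragment below, the entity sens_demo and its port names are assumptions made for illustration; only the sensitivity-list pattern is the point.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not from the lecture notes).
entity sens_demo is
  port (
    clk, a, b, sel : in  std_logic;
    y_comb, q      : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Combinational process: every signal that is read
  -- (sel, a, b) appears in the sensitivity list.
  comb : process (sel, a, b)
  begin
    if sel = '1' then
      y_comb <= a;
    else
      y_comb <= b;
    end if;
  end process;

  -- Flip-flop: a is read, but only clk needs to be in the
  -- sensitivity list, because q can change only on a clock edge.
  reg : process (clk)
  begin
    if rising_edge(clk) then
      q <= a;
    end if;
  end process;
end main;
```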
Process Grammar
[ PROCLAB : ] process ( sensitivity_list )
  [ declaration ]
begin
  sequential_statement
end process [ PROCLAB ] ;
Figure 1.8: Simplified grammar of process
1.3.7 Sequential Statements
wait until ...;
... <= ...;
if ... then ... elsif ... end if;
case ... is when ... | ... => ...; when ... => ...; end case;
loop ... end loop;
while ... loop ... end loop;
for ... in ... loop ... end loop;
next ...;
1.4 Concurrent vs Sequential Statements
Concurrent assignments can be translated into sequential statements. But not all sequential statements can be translated into concurrent statements.
1.4.1 Concurrent Assignment vs Process
The two code fragments below have identical behaviour:
architecture main of tiny is
begin
  b <= a;
end main;
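Only the concurrent-assignment fragment appears above. A process-based fragment with the same behaviour would plausibly look like the sketch below. The entity declaration for tiny is an assumption added so that the example is self-contained, and the process form shown is one reasonable equivalent rather than necessarily the one used in the lecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Assumed interface for tiny (not given in the notes).
entity tiny is
  port (
    a : in  std_logic;
    b : out std_logic
  );
end tiny;

architecture main of tiny is
begin
  -- Equivalent to the concurrent assignment b <= a;
  -- the process resumes whenever a changes.
  process (a)
  begin
    b <= a;
  end process;
end main;
```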
1.4.2 Conditional Assignment vs If Statements
1.4.3 Selected Assignment vs Case Statement
1.4.4 Coding Style
Code that's easy to write with sequential statements, but difficult with concurrent:
Sequential Statements
case <expr> is
  when <choice1> =>
    if <cond> then
      o <= <expr1>;
    else
      o <= <expr2>;
    end if;
  when <choice2> =>
    ...
end case;
Concurrent Statements
Overall structure:
with <expr> select
  t <= ... when <choice1>,
       ... when <choice2>;
Failed attempt:
with <expr> select
  t <= -- want to write:
       --   <val1> when <cond>
       --   else <val2>
       -- but conditional assignment
       -- is illegal here
       when c1,
       ... when c2;
Lesson: complicated, nested control constructs are easier with sequential statements than with concurrent statements.
1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
- Within a process, statements are executed almost sequentially
- Among processes, execution is done in parallel
- Remember: a process is a concurrent statement!
entity ENTITYID is
  interface declarations
end ENTITYID;
architecture ARCHID of ENTITYID is
begin
  concurrent statements
  process begin
    sequential statements
  end process;
  concurrent statements
end ARCHID;
Figure 1.10: Sequential statements in a process
- VHDL mimics hardware
- Hardware (gates) execute in parallel
- Processes execute in parallel with each other
- All possible orders of executing processes must produce the same simulation results (waveforms)
- If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must produce the same waveforms.
It doesn't matter whether you are running on a single-threaded operating system, on a multi-threaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.
These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6) and lead to the phenomenon of latch inference (Section 1.5.2).
architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;

Execution sequences:
- single threaded, procA before procB: A1 A2 A3 B1 B2
- single threaded, procB before procA: B1 B2 A1 A2 A3
- multithreaded, procA and procB in parallel: A1 A2 A3 alongside B1 B2
Figure 1.11: Different process execution sequences
Sections 1.5.1–1.5.3 discuss the hardware generated by processes. Sections 1.6–1.6.3 discuss the behaviour and execution of processes.
1.5.1
Combinational process:
- Executing the process takes part of one clock cycle
- Target signals are outputs of combinational circuitry
- A combinational process must have a sensitivity list
- A combinational process does not have any wait statements and does not have any events, rising_edges, or falling_edges in conditions for if or in case statements
- Hardware is just combinational circuitry
Clocked process:
- Executing the process takes one (or more) clock cycles
- Target signals are outputs of flip-flops
- Process contains one or more wait or if rising_edge statements
- Hardware contains combinational circuitry and flip-flops
NOTE: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.
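The two ways of writing a clocked process mentioned above (an if rising_edge statement or a wait statement) can be sketched as follows. The entity clocked_styles and its ports are hypothetical; each process is expected to describe an AND gate feeding a flip-flop.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not from the lecture notes).
entity clocked_styles is
  port (
    clk, a, b    : in  std_logic;
    q_if, q_wait : out std_logic
  );
end clocked_styles;

architecture main of clocked_styles is
begin
  -- Style 1: sensitivity list plus an if rising_edge statement.
  process (clk)
  begin
    if rising_edge(clk) then
      q_if <= a AND b;   -- AND gate (combinational) feeding a flip-flop
    end if;
  end process;

  -- Style 2: a wait statement and no sensitivity list.
  process
  begin
    wait until rising_edge(clk);
    q_wait <= a AND b;
  end process;
end main;
```

A process may use a sensitivity list or wait statements, but not both, so the two styles cannot be mixed within one process.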
1.5.2 Latch Inference
The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.
process (a, b, c) begin
  if (a = '1') then
    z1 <= b;
    z2 <= b;
  else
    z1 <= c;
  end if;
end process;
[Figure: circuit with inputs a, b, c and outputs z1, z2]
When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.
If you want a latch or a flip-flop for the signal, then latch inference is good. If you want combinational circuitry, then latch inference is bad.
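One common way to avoid an unwanted latch is to make sure the signal is assigned on every pass through the process, for example with a default assignment at the top. The sketch below applies this to the z1/z2 process from this section. The entity name no_latch and the default value '0' are assumptions; the right default depends on the intended behaviour.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity (not from the lecture notes).
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment: z2 now receives a value on every pass,
    -- so it no longer has to remember its previous value.
    z2 <= '0';           -- the value '0' is an assumption
    if a = '1' then
      z1 <= b;
      z2 <= b;           -- overrides the default on this pass
    else
      z1 <= c;
    end if;
  end process;
end main;
```

Within one pass through a process, the last assignment to a signal wins, which is why the default can safely be placed before the if statement.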
1.5.3
Signals assigned to in combinational processes are combinational. Signals assigned to in clocked processes are outputs of flip-flops.
The one exception to this can occur in a clocked process that contains a signal that is assigned to in every branch of every if-then-else and case statement. Such a signal might be generated as combinational logic. Mixing combinational and clocked signals in the same process is bad design discipline, because it can lead to different results from different synthesis tools. So, if you follow good coding practices, you won't need to worry about this exception.
LEC-03 Preliminaries
Overview
This lecture relates fragments of VHDL code to the basic building blocks of hardware: flip-flops, Boolean gates, arithmetic circuits, etc. The semantics of VHDL are behavioural, not structural, but by understanding the behavioural semantics of VHDL we can derive the relationship between VHDL code and netlists.
Concepts
Lecture Notes: Sections 1.6–1.6.3
1.6
1.6.1
Clock Cycle
- smallest unit of time is a clock cycle
- combinational logic has zero delay
- flip-flops have a delay of one clock cycle
- used for simulation early in the design cycle
- fastest simulation run times
Timing Simulation
- smallest unit of time is a nano-, pico-, or femtosecond
- combinational logic and wires have delay as computed by timing analysis tools
- flip-flops have setup, hold, and clock-to-Q timing parameters
- used for simulation when fine-tuning the design and confirming that timing constraints are satisfied
- slow simulation times for large circuits
Delta Cycles
- units of time are artifacts of VHDL semantics and simulation software
- simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
- VHDL semantics are defined in terms of these concepts
In assignments and exams, you will need to be able to simulate VHDL code at each of the three different levels of temporal granularity. In the laboratories and project, you will use simulation programs for both clock-cycle simulation and timing simulation. We don't have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....
Definitely Delta
For the remainder of section 1.6, we'll look at only the delta cycle view of the world.
NOTE: postponed. This use of the word "postponed" differs from that in the VHDL Standard. We won't be using postponed processes as defined in the Standard.
Process Modes
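[Figure: process-mode state diagram. States: suspended, postponed, active. Transitions: resume (suspended to postponed), activate (postponed to active), suspend (active to suspended)]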
Suspended
- Nothing to currently execute
- A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement
Postponed
- Wants to execute, but not currently active
- A process becomes active when the simulator chooses it from the pool of postponed processes
Active
- Currently executing
- A process stays active until it hits a wait statement or completes the execution of the last statement in the process, at which point it suspends
Initialization
Simulations start at step 6 with all processes postponed and all signals with a default value ('U' for std_logic).
The Algorithm
1. All processes are suspended.
2. Each process looks at the signals that changed value and checks its sensitivity list or wait condition to see if it should resume.
3. Update signals with their provisional values.
4. Resume all suspended processes whose sensitivity list changed or wait condition became true.
5. If there are no postponed processes, then simulation time increments to the next scheduled event and the simulation continues at Step 1.
6. While there are postponed processes:
   (a) Pick one or more postponed processes to become active.
   (b) As a process executes, assignments to signals are provisional: new values do not become visible until step 3 in the next simulation cycle.
   (c) A process runs until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
7. Calculate the new simulation time: if zero-delay assignments were made in the current simulation cycle, then simulation time does not advance; otherwise simulation time is set to the time of the next scheduled event.
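Step 6(b) can be observed directly in a small simulation-only sketch. The testbench below is not from the lecture notes: the entity name delta_demo, the signal s, and its explicit initial value '0' are all assumptions. It relies on standard VHDL behaviour, namely that a signal assignment only takes effect after the process suspends, and that wait for 0 ns suspends the process for one delta cycle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (simulation only, not synthesizable).
entity delta_demo is
end delta_demo;

architecture sim of delta_demo is
  signal s : std_logic := '0';   -- explicit initial value, assumed for the demo
begin
  process
  begin
    s <= '1';                    -- provisional: not visible yet
    assert s = '0'
      report "s changed before the process suspended"
      severity failure;

    wait for 0 ns;               -- suspend; resume in the next delta cycle

    assert s = '1'
      report "s was not updated after one delta cycle"
      severity failure;
    report "s became visible one delta cycle after the assignment";

    wait;                        -- suspend forever
  end process;
end sim;
```

Neither assertion should fire in a standard-conforming simulator, and simulation time stays at 0 ns throughout, since only zero-delay assignments are made.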
NOTE: Parallel execution. In n-threaded execution, at most n processes are active at a time.
NOTE: Official and unofficial terminology. Simulation cycle and delta cycle are official definitions in the VHDL Standard. Simulation step and simulation round are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.
1.6.2
    library ieee;
    use ieee.std_logic_1164.all;

    entity bamboozle is
      port (
        a, b : in  std_logic;
        e    : out std_logic
      );
    end bamboozle;

    architecture main of bamboozle is
      signal c, d : std_logic;
    begin
      procA : process (a, b)
      begin
        c <= a AND b;
      end process;

      procB : process (b, c, d)
      begin
        d <= NOT c;
        e <= b AND d;
      end process;
    end main;

Figure 1.14: Example circuit for process execution
In the simulation run, a and b are external inputs with the following scheduled events: a = 0 and b = 1 at 0 ns, a = 1 at 10 ns, b = 0 at 12 ns, and a = 0 at 15 ns.
In this example, we will treat the external inputs as if they were driven by an external process:

    procC : process
    begin
      a <= '0';
      b <= '1';
      wait for 10 ns;
      a <= '1';
      wait for 2 ns;
      b <= '0';
      wait for 3 ns;
      a <= '0';
    end process;

[Figure: animated execution of procA, procB, and procC. Each frame shows the process mode (S = suspended, P = postponed, A = active), a simulation-step pointer (one per process), and the visible and provisional value of each signal.]

The frames of the animation are summarized below. Visible values are listed in the order a b c d e.

Initialization, 0 ns
  All processes postponed; visible values U U U U U.

Simulation cycle 1, 0 ns
  Step 6: procA, procB, and procC each become active, execute, and suspend.
          Provisional values: a = 0, b = 1, c = U, d = U, e = U.
  All processes suspended: end of simulation cycle.
  Note: In the figure, the first simulation cycle is compacted into two columns. This is done only in this example to save space and is not standard practice.

Simulation cycle 2, 0 ns
  Step 1: Beginning of next simulation cycle.
  Step 2: Check sensitivity lists for changes.
  Step 3: Update signal values. Visible values: 0 1 U U U.
  Step 4: Resume procA and procB.
  Step 6(b): Provisional assignment to c (c = 0).
  Step 6(b): Provisional assignment to d (d = U).
  Step 6(b): Provisional assignment to e (e = U).
  Step 7: All processes suspended; end of simulation cycle.

Simulation cycle 3, 0 ns
  Step 1: Begin next simulation cycle.
  Step 2: Check sensitivity lists for changes.
  Step 3: Update signal values. Visible values: 0 1 0 U U.
  Steps 6(a, b): Activate procB; provisional assignment to d (d = 1).
  Step 6(b): Provisional assignment to e (e = U).
  Step 7: All processes suspended; end of simulation cycle.

Simulation cycle 4, 0 ns
  Step 1: Begin next simulation cycle.
  Step 2: Check sensitivity lists for changes.
  Step 3: Update signal values. Visible values: 0 1 0 1 U.
  Steps 6(a, b): Activate procB; provisional assignment to d (d = 1).
  Step 6(b): Provisional assignment to e (e = 1).
  Step 7: All processes suspended; end of simulation cycle.

Simulation cycle 5, 0 ns
  Step 3: Update signal values. Visible values: 0 1 0 1 1.
  Step 7: No changes to "sensitized" signals; time advances to 10 ns.

Simulation cycle at 10 ns
  Step 2: Check sensitivity lists for changes (not shown).
  Step 3: Update signal values (not shown).
  Steps 6(a, b): Activate procC; provisional assignment to a (a = 1).
  Step 6(c): Suspend procC; end of simulation cycle.

Next simulation cycle, 10 ns
  Step 3: Update signal values. Visible values: 1 1 0 1 1.
  procA becomes postponed.
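The walkthrough can be reproduced in a simulator with a small testbench. The sketch below is an illustration rather than part of the original notes: the names bamboozle_tb, uut, and stim are hypothetical, and it assumes the bamboozle entity of Figure 1.14 has been compiled into library work. The expected value at 5 ns comes from the trace (e = 1 once the delta cycles at 0 ns settle); the value at 11 ns follows from the logic (a = 1 and b = 1 give c = 1, d = 0, e = 0).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench; assumes work.bamboozle(main) from Figure 1.14.
entity bamboozle_tb is
end bamboozle_tb;

architecture sim of bamboozle_tb is
  signal a, b, e : std_logic;
begin
  -- Unit under test
  uut : entity work.bamboozle(main)
    port map (a => a, b => b, e => e);

  -- Stimulus following the first two events of procC, with checks
  stim : process
  begin
    a <= '0';
    b <= '1';
    wait for 5 ns;     -- all delta cycles at 0 ns have completed
    assert e = '1'
      report "expected e = 1 at 5 ns" severity error;
    wait for 5 ns;     -- simulation time is now 10 ns
    a <= '1';
    wait for 1 ns;     -- delta cycles at 10 ns have completed
    assert e = '0'
      report "expected e = 0 at 11 ns" severity error;
    wait;              -- end of stimulus
  end process;
end sim;
```

Viewing c, d, and e in a waveform viewer with delta-cycle expansion (where the tool offers it) shows the intermediate U values listed in the summary.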
Question: What are the different granularities of time that occur when doing delta-cycle simulation?
Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?
1.6.3
If assignments are not visible within the same simulation cycle (correct: i.e. provisional assignments are used)
[Figure: execution of processes p_c and p_d on signals a, b, c, d (all initially 0) under two different scheduling orders, with the process modes S, P, and A shown at each step]
With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, differ-
LEC-04 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01 to wk-05  VHDL Design and Optimization
  wk-01  Overview and VHDL
  wk-02  VHDL Semantics
  wk-03  Dataflow Diagrams; State Machines
  wk-04  Memory Design; Example Design
  wk-05  Optimizations
Functional Validation
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
This lecture uses the VHDL semantics from Lecture 03 to describe how we determine what hardware will be synthesized from VHDL.
Concepts
Lecture Notes: Sections 1.7 to 1.9.7
  basic building blocks
  flip-flops and latches
  coding flip-flops
  coding sequential circuits
1.7
This section outlines the building blocks for register transfer level design and how to write VHDL code for the building blocks.
1.7.1
[Figure: schematic symbols of building blocks; surviving port labels: D, CE, WE, A, DI, DO]
  adder, subtracter, negater
  shifter, rotater
  flip-flop
  memory array, register file, queue
1.7.2
Latches
Use flops, not latches.
  Latch-based designs are susceptible to timing problems.
  The transparent phase of a latch can let a signal leak through the latch, causing the signal to affect the output one clock cycle too early.
  It is possible for a latch-based circuit to simulate correctly but not work in real hardware, because the timing delays on the real hardware don't match those predicted in synthesis.
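To make the distinction concrete, the following sketch contrasts a process that synthesis tools typically turn into a latch with the flip-flop form recommended here. It is an illustration under assumptions, not code from the notes; the entity latch_vs_flop and its port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example; names are illustrative only.
entity latch_vs_flop is
  port (
    clk, en, d : in  std_logic;
    q_latch    : out std_logic;
    q_flop     : out std_logic
  );
end latch_vs_flop;

architecture main of latch_vs_flop is
begin
  -- To avoid: no clock edge and no else branch, so q_latch must hold
  -- its value while en = '0'. Synthesis tools typically infer a latch
  -- that is transparent while en = '1'.
  process (en, d)
  begin
    if en = '1' then
      q_latch <= d;
    end if;
  end process;

  -- Preferred: a D-type flip-flop whose enable is sampled on the
  -- rising clock edge, so d can only affect q_flop at a clock edge.
  process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        q_flop <= d;
      end if;
    end if;
  end process;
end main;
```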
Limit yourself to D-type flip-flops.
  Most FPGA and ASIC cell libraries include only D-type flip-flops. (However, the flip-flops in Altera's APEX FPGAs can be configured as D, T, JK, or SR flip-flops.)
Tri-state buffers
Use multiplexers, not tri-state buffers.
  Tri-state designs are susceptible to stability and signal integrity problems.
  Getting tri-state designs to simulate correctly is difficult; some library components don't support tri-state signals.
  Tri-state designs rely on the code never letting two signals drive the bus at the same time.
  It can be difficult to check that bus arbitration will always work correctly.
  Manufacturing and environmental variability can make real hardware not work correctly even if it simulates correctly.
  Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state signals at the board level.
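As an illustration of the guideline, the sketch below drives one output with two tri-state drivers and another with a multiplexer. The entity bus_select and its port names are hypothetical and chosen only for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example; names are illustrative only.
entity bus_select is
  port (
    sel     : in  std_logic;
    d0, d1  : in  std_logic_vector(7 downto 0);
    bus_tri : out std_logic_vector(7 downto 0);
    bus_mux : out std_logic_vector(7 downto 0)
  );
end bus_select;

architecture main of bus_select is
begin
  -- To avoid on chip: two tri-state drivers share bus_tri. Correct
  -- operation relies on the enables never being active together.
  bus_tri <= d0 when sel = '0' else (others => 'Z');
  bus_tri <= d1 when sel = '1' else (others => 'Z');

  -- Preferred: a multiplexer has exactly one driver, so there is no
  -- bus contention to reason about.
  bus_mux <= d1 when sel = '1' else d0;
end main;
```

With a single select signal the two drivers here cannot conflict; in larger designs the enables usually come from separate arbitration logic, which is where the checking difficulty mentioned above arises.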
1.7.3

If

    process (clk)
    begin
      if rising_edge(clk) then
        if (ce = '1') then
          if (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait

    process
    begin
      wait until rising_edge(clk);
      if (ce = '1') then
        if (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;
1.7.4
There are many ways to write VHDL code that synthesizes to the schematic in Figure 1.17. The two major choices in the styles are:
  Put all of the code in a single process, or have a collection of clocked processes, combinational processes, and concurrent statements.
  Use wait or if rising_edge for flip-flops.
[Figure: schematic with inputs sel, reset, and clk, internal signal a, a flip-flop with R and S inputs, and output c]

    entity and_not_reg is
      port (
        reset, clk, sel : in  std_logic;
        c               : out std_logic
      );
    end;

Schematic and entity for examples of different code organizations in Figures 1.18 to 1.21
Figure 1.17: Schematic and entity for and_not_reg
One Process

    architecture one_proc of and_not_reg is
      signal a : std_logic;
    begin
      process
      begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
        c <= NOT a;
      end process;
    end one_proc;

Figure 1.18: One process implementation of Figure 1.17
Concurrent Statements
    architecture comb of and_not_reg is
      signal a, b, d : std_logic;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          else
            a <= d;
          end if;
        end if;
      end process;

      process (clk)
      begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;

      d <= b when (sel = '1') else a;
      b <= NOT a;
    end comb;

Figure 1.21: Concurrent statement implementation of Figure 1.17
1.8
1.8.1
Unsynthesizable Code
    -- different clock edges
    process
    begin
      wait until rising_edge(clk);
      x <= a;
      wait until falling_edge(clk);
      x <= a;
    end process;

Reason: processes with multiple wait statements are turned into finite state machines. The wait statements denote transitions between states. The target signals in the process are outputs of flip-flops. Using different wait conditions would require the flip-flops to use different clock signals at different times. Multiple clock signals for a single flip-flop would be difficult to synthesize, inefficient to build, and fragile to operate.
Synthesizable Alternative
A synthesizable alternative to an if rising_edge statement in a for loop is to put the if rising_edge outside of the for loop.

    process (clk)
    begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q <= d;
        end loop;
      end if;
    end process;

Reason: just an idiom of the synthesis tool.
Synthesizable for loops are described in Rushton Section 8.7. For loops in general are described in Ashenden. Examples of for loops in E&CE 427 will appear when describing testbenches for functional validation.
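A loop body that uses the loop index shows more clearly what such a loop synthesizes to. The sketch below is an illustration, not code from the notes; the entity bit_reverse_reg is hypothetical. It registers the input with its bit order reversed, keeping the clock-edge test outside the loop as recommended above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example; names are illustrative only.
entity bit_reverse_reg is
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(7 downto 0);
    q   : out std_logic_vector(7 downto 0)
  );
end bit_reverse_reg;

architecture main of bit_reverse_reg is
begin
  process (clk)
  begin
    if rising_edge(clk) then      -- edge test outside the loop
      for i in 0 to 7 loop        -- constant bounds: unrolled in synthesis
        q(i) <= d(7 - i);         -- one flip-flop per bit, wired in reverse
      end loop;
    end if;
  end process;
end main;
```

Because the loop bounds are constants, a synthesis tool can unroll the loop into eight independent assignments, which is equivalent to eight D-type flip-flops.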
NOTE: For loops. For loops are very useful in simulation, particularly for test benches.
1.8.2
1.9
1.9.1
Arithmetic Packages
Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.
To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic. numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.
Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
1.9.2
Shift and rotate operations are described with three-character acronyms:

    shift/rotate   left/right   arithmetic/logical

The shift right arithmetic (sra) operation preserves the sign of the operand by copying the most significant bit into lower bit positions. The shift left arithmetic (sla) operation does the analogous operation, except that the least significant bit is copied.
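The effect of the shift and rotate operations can be checked in simulation. The sketch below uses the numeric_std functions shift_left, shift_right, rotate_left, and rotate_right instead of the three-character operators; that substitution is a choice made for this illustration, and the entity shift_demo is hypothetical. In numeric_std, shift_right on an unsigned value fills with zeros (logical), while shift_right on a signed value copies the most significant bit (arithmetic).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking demo; names are illustrative only.
entity shift_demo is
end shift_demo;

architecture sim of shift_demo is
begin
  process
    constant u : unsigned(7 downto 0) := "10010110";
    constant s : signed(7 downto 0)   := "10010110";
  begin
    -- logical shifts on unsigned: vacated positions are filled with '0'
    assert shift_left(u, 1) = unsigned'("00101100")
      report "shift left logical" severity error;
    assert shift_right(u, 2) = unsigned'("00100101")
      report "shift right logical" severity error;
    -- arithmetic shift right on signed: the sign bit is copied
    assert shift_right(s, 2) = signed'("11100101")
      report "shift right arithmetic" severity error;
    -- rotates: bits shifted out re-enter at the other end
    assert rotate_left(u, 1) = unsigned'("00101101")
      report "rotate left" severity error;
    assert rotate_right(u, 1) = unsigned'("01001011")
      report "rotate right" severity error;
    wait;
  end process;
end sim;
```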
1.9.3
Overloading of Arithmetic
The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers. Tables 1.1 to 1.4 show the different combinations of target and source types and widths that can be used.

Table 1.1: Overloading of Arithmetic Operations (+, -)

    target      src1        src2        result
    unsigned    unsigned    integer     OK
    unsigned    integer     unsigned    OK
                unsigned    signed      fails in analysis
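The first two rows of Table 1.1 can be sketched as follows. This is an illustration under assumptions; the entity add_overload and its ports are hypothetical. The third row is shown only as a comment, because numeric_std declares no "+" that mixes unsigned and signed operands.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example; names are illustrative only.
entity add_overload is
  port (
    x    : in  unsigned(7 downto 0);
    y, z : out unsigned(7 downto 0)
  );
end add_overload;

architecture main of add_overload is
begin
  y <= x + 1;   -- unsigned + integer: OK
  z <= 1 + x;   -- integer + unsigned: OK
  -- y <= x + signed'("00000001");  -- unsigned + signed: fails in analysis
end main;
```

In both legal cases the result has the width of the unsigned operand (8 bits here), so the sum wraps around on overflow.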
1.9.4
wide narrow
1.9.5
Overloading of Comparisons
    src1        src2
    unsigned    integer
    integer     unsigned
    signed      integer
    integer     signed
    unsigned    signed
    signed      unsigned

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)
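As a small illustration of comparing vectors with integers, the sketch below uses two comparison overloads that numeric_std provides: unsigned against an integer and signed against an integer. The entity cmp_overload and its port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example; names are illustrative only.
entity cmp_overload is
  port (
    u       : in  unsigned(7 downto 0);
    s       : in  signed(7 downto 0);
    u_ge_10 : out std_logic;
    s_neg   : out std_logic
  );
end cmp_overload;

architecture main of cmp_overload is
begin
  -- unsigned compared with an integer
  u_ge_10 <= '1' when u >= 10 else '0';
  -- signed compared with an integer
  s_neg   <= '1' when s < 0 else '0';
end main;
```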
1.9.6
Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)
OK OK
1.9.7
Type Conversion
The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors. The listing below summarizes the types of these functions.

    unsigned( val : std_logic_vector )            return unsigned;
    signed( val : std_logic_vector )              return signed;
    to_integer( val : signed )                    return integer;
    to_integer( val : unsigned )                  return integer;
    to_unsigned( val : natural; width : natural ) return unsigned;
    to_signed( val : integer; width : natural )   return signed;
The most common example of converting between two types arises when using a signal as an index into an array. To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer (Figure 1.22).

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal uns_sig : unsigned(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer(uns_sig) );
    ...

Figure 1.22: Using a signal as an index to array

To convert a std_logic_vector into an integer, you must first say whether the signal should be interpreted as signed or unsigned. As illustrated in Figure 1.23, this is done by:
1. Convert the std_logic_vector signal to signed or unsigned, using the function signed or unsigned
2. Convert the signed or unsigned signal into an integer, using to_integer

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal std_sig : std_logic_vector(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
    ...

Figure 1.23: Using a std_logic_vector as an index to array
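The conversions can be exercised in a short self-checking simulation. The sketch below is an illustration, not code from the notes; the entity conv_demo and the variable names are hypothetical. It shows that the same bit pattern yields different integers depending on whether it is interpreted as unsigned or signed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking demo; names are illustrative only.
entity conv_demo is
end conv_demo;

architecture sim of conv_demo is
begin
  process
    variable slv : std_logic_vector(7 downto 0);
    variable n   : integer;
  begin
    -- integer -> unsigned -> std_logic_vector
    slv := std_logic_vector(to_unsigned(200, 8));
    assert slv = "11001000"
      report "to_unsigned(200, 8)" severity error;

    -- std_logic_vector -> unsigned -> integer
    n := to_integer(unsigned(slv));
    assert n = 200
      report "unsigned interpretation" severity error;

    -- std_logic_vector -> signed -> integer (two's complement)
    n := to_integer(signed(slv));
    assert n = -56
      report "signed interpretation" severity error;

    -- integer -> signed
    assert to_signed(-3, 8) = signed'("11111101")
      report "to_signed(-3, 8)" severity error;
    wait;
  end process;
end sim;
```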
Chapter 2
2.1
Prelude to Chapter
2.1.1
  design flows
  dataflow diagrams
  state machines
  memory arrays
  design example
  optimization
LEC-05 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01 to wk-02  VHDL
  wk-01  Overview and VHDL
  wk-02  VHDL Semantics
wk-03 to wk-05  Design and Optimization
  wk-03  Dataflow Diagrams; State Machines
  wk-04  Memory Design; Example Design
  wk-05  Optimizations
Functional Validation
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Concepts
Lecture Notes: Sections 2.3 to 2.3.9
  serial vs parallel algorithms and hardware
  dataflow diagrams
  area estimation
  performance estimation
  register allocation
  datapath, register, input, output allocation
  area / performance tradeoffs
  scheduling
Reading
Rushton VHDL for Logic Synthesis (On reserve in DC-Library).
2.2
Design Flow
2.2.1
Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 427. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.
The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.
[Figure: design flow. Each artifact has its own Modify/Analyze loop: Algorithm, High-Level Model, DP+Ctrl Code (dp/ctrl specific), Opt. RTL Code, Implementation; the flow ends in Hardware.]
2.2.2
Implementation Flows
Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe technology-specific parameters of the primitive building blocks (e.g. the delay and area of individual gates, PLAs, CLBs, flops, memory arrays). Mentor Graphics' product Leonardo Spectrum, Cadence's product BuildGates, and Synplicity's product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell separate tools that do place-and-route and other low-level (physical design) tasks.
These general-purpose synthesis tools do not (generally) do the final stages of the design, such as place-and-route and timing analysis, which are very specific to a given implementation technology. The implementation-technology-specific tools generally also produce a VHDL file that accurately models the chip. We will refer to this file as the implementation VHDL code.
2.2.3
Classes of Hardware
Each circuit tends to be dominated by either its datapath, control (state machine) or storage (memory).
Datapath
  Purpose: compute output data based on input data
  Each parcel of input produces one parcel of output
  Examples: arithmetic, decoders
Storage
  Purpose: hold data for future use
  Data is not modified while stored
  Examples: register files, FIFO queues
Control
  Purpose: modify internal state based on inputs, compute outputs from state and inputs
  Mostly individual signals, few data (vectors)
  Examples: bus arbiters, memory controllers
2.2.4
2.2.4.1
[Figure: three artifacts, each with its own Modify/Analyze loop: State Machine, Dataflow Diagram, Block Diagram]
2.3
2.3.1
Overview of Example
Requirement: compute the sum of 6 numbers: output = a + b + c + d + e + f
We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. hardware block diagram
5. state machine
6. high-level model
In software, the expression (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as (a + b) + (c + d) + (e + f).
But:
  hardware runs in parallel
  in an algorithmic description, parentheses can guide parallel vs serial execution
[Figures: serial and parallel dataflow diagrams for the sum of a, b, c, d, e, f, with the adders numbered along the longest path (slide titles: Performance Estimation, Area Estimation, Design Comparison)]
Design Comparison
Design     Expression               Adders on longest path   Adders used
Serial     (((((a+b)+c)+d)+e)+f)    5 (slower)               5
Parallel   (a+b)+(c+d)+(e+f)        3 (faster)               5
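The serial and parallel groupings can be written directly as combinational VHDL. The following is a minimal sketch, not code from the course: the entity name sum6, the port names, and the 8-bit unsigned width are assumptions made for illustration, and a synthesis tool may or may not preserve the parenthesization.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: the sum of six 8-bit numbers, written two ways.
entity sum6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z_serial         : out unsigned(7 downto 0);
    z_parallel       : out unsigned(7 downto 0)
  );
end sum6;

architecture main of sum6 is
begin
  -- Serial: a chain of five adders, five adders on the longest path.
  z_serial   <= ((((a + b) + c) + d) + e) + f;
  -- Parallel: still five adders, but only three on the longest path.
  z_parallel <= ((a + b) + (c + d)) + (e + f);
end main;
```

Both outputs carry the same value (modulo 2^8); only the structure suggested to the synthesis tool differs.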
2.3.2
Dataflow Diagrams
A disciplined approach for going beyond combinational logic for datapath-centric circuits
Purpose:
Provide a disciplined approach for designing datapath-centric circuits
Guide the design from algorithm to high-level model
Guide the design from high-level model to model with separated datapath and control
Estimate area and performance
Make tradeoffs between different design options
Background
Based on techniques from high-level synthesis tools
[Figures: dataflow diagrams for the six-input sum, with inputs a to f, intermediate signals x1 to x4, adders (+), and output z. Slide titles: Latency, Flip Flops, Registered Inputs, Datapath Components, Inputs, Outputs, Summary]
Latency: the first diagram has a latency of 6 clock cycles; the second has a latency of 4 clock cycles.
Unconnected signal tails are inputs
Horizontal lines mark clock cycle boundaries
Maximum number of blocks in a clock cycle is the total number of that component that are needed
Maximum number of signals that cross a cycle boundary is the total number of registers that are needed
Maximum number of unconnected signal tails in a clock cycle is the total number of inputs that are needed
Maximum number of unconnected signal heads in a clock cycle is the total number of outputs that are needed
2.3.3
[Figures: dataflow diagram for the six-input sum shown beside a waveform of clk, a, x1 to x5, and z over clock cycles 0 to 6, built up one clock cycle at a time]
Performance Equations
$$\text{Performance} \propto \frac{1}{\text{TimeExec}}$$
$$\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$$
Latency = Number of clock cycles from inputs to outputs
There is much more information on performance in Chapter 4, which is devoted to performance.
Latency: count horizontal lines in diagram
Min clock period (Max clock speed) limited by longest path in a clock cycle
[Figure: dataflow diagram with clock cycle boundaries, beside the waveform of clk, a, x1 to x5, and z]
2.3.4
[Figures: dataflow diagrams for the one-add design (one add per clock cycle) and the two-add design (two adds per clock cycle)]
NB: In the Two-add design, half of the last clock cycle is wasted.
Design Comparison
[Figures: dataflow diagrams and waveform for one add per clock cycle and two adds per clock cycle]
                One-add         Two-add
inputs          6               6
outputs         1               1
registers       6               6
adders          1               2
clock period    flop + 1 add    flop + 2 add
latency         6               4
Question: Under what circumstances would each of the design options (one add and two add) be the fastest?
Answer: time = latency * clock period compare execution times for both options
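As a worked version of this comparison, assume each flop contributes a delay $t_{flop}$ and each adder a delay $t_{add}$ to the clock period (both symbols are introduced here for illustration, and wiring and setup margins are ignored). With latencies of 6 and 4 clock cycles and clock periods of one flop plus one add and one flop plus two adds:

```latex
\begin{align*}
T_{\text{one-add}} &= 6\,(t_{flop} + t_{add}) \\
T_{\text{two-add}} &= 4\,(t_{flop} + 2\,t_{add}) \\
T_{\text{two-add}} < T_{\text{one-add}}
  &\iff 4\,t_{flop} + 8\,t_{add} < 6\,t_{flop} + 6\,t_{add} \\
  &\iff t_{add} < t_{flop}
\end{align*}
```

Under these assumptions the two-add design is faster when an adder is quicker than the flop overhead, the one-add design is faster when the adder is slower, and the two are equal when the delays match.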
2.3.5
If the design currently stores all of its inputs, and the environment's behaviour can be changed to delay sending some inputs, then the number of inputs and registers can be reduced.
[Figures: one-add and two-add dataflow diagrams before and after I/O optimization]
One-add design: the optimization reduces inputs from 6 to 2 and registers from 6 to 2.
Design Comparison (after I/O optimization)
                One-add         Two-add
inputs          2               3
outputs         1               1
registers       2               3
adders          1               2
clock period    flop + 1 add    flop + 2 add
latency         6               4
2.3.6
2.3.7
Figure 2.4: Dataflow diagram and building blocks for block diagram
[Figures: the dataflow diagram annotated step by step with allocated resources (slide titles: I/O Allocation, Datapath Allocation, Register Allocation, Allocation Completed)]
I/O Allocation: inputs a to f are assigned to ports i1, i2, i3; output z is assigned to port o1
Datapath Allocation: additions are assigned to adders a1 and a2
Register Allocation: signals that cross clock cycle boundaries are assigned to registers r1, r2, r3
Figure 2.5: Block diagram after I/O, datapath, and register allocation
[Figures: the block diagram built up one connection at a time while stepping through the allocated dataflow diagram]
Simulate the dataflow diagram, drawing connections between blocks when they communicate
Add muxes when a block has multiple drivers
Clean up the drawing and add the state machine (control)
The state machine keeps track of which clock cycle of the dataflow diagram is currently being executed.
Select signals on multiplexers
Instruction signals on arithmetic modules
Chip-enable lines on registers and flip-flops
[Figures: final block diagram with ports i1, i2, i3, o1, registers r1, r2, r3, adders a1, a2, and the control block ctrl; a second copy labels the regions by class of hardware: datapath, storage, control]
architecture main of big_add is
  -- (signal declarations omitted)
begin
  fsm : process ... end process;
  -- r1 loads from input i1 or from adder a2
  process (clk) begin
    if rising_edge(clk) then
      if r1_gets_in = '1' then
        r1 <= i1;
      else
        r1 <= a2;
      end if;
    end if;
  end process;
  -- r2 always loads from input i2
  process (clk) begin
    if rising_edge(clk) then
      r2 <= i2;
    end if;
  end process;
  -- r3 loads from input i3 or from adder a1
  process (clk) begin
    if rising_edge(clk) then
      if r3_gets_in = '1' then
        r3 <= i3;
      else
        r3 <= a1;
      end if;
    end if;
  end process;
  a1 <= r1 + r2;
  a2 <= a1 + r3;
  o1 <= r3;
end main;
In Section 2.4, we'll discuss how to build the control circuitry (finite state machine, represented by the fsm process).
2.3.8
Schedule: move functional blocks between clock cycles
Allows tradeoffs between performance and area
NOTE: Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms
NOTE: Serial algorithms tend to have less area than parallel algorithms
[Figures: serial (((((a+b)+c)+d)+e)+f) and parallel (a+b)+(c+d)+(e+f) dataflow diagrams; the parallel diagram before and after scheduling]
Design Analysis
                Parallel (before scheduling)   Parallel (after scheduling)
inputs          6                              4
outputs         1                              1
registers       6                              4
adders          3                              2
clock period    flop + 1 add                   flop + 1 add
latency         3                              3
2.3.9
LEC-06 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01 to wk-02: VHDL
  wk-01 Overview and VHDL
  wk-02 VHDL Semantics
wk-03 to wk-05: Design and Optimization
  wk-03 Dataflow Diagrams; State Machines
  wk-04 Memory Design; Example Design
  wk-05 Optimizations
Functional Validation
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
This lecture builds on material from Lec-05, where dataflow diagrams were introduced. The bulk of the lecture discusses finite state machine design: first how to build a state machine from a dataflow diagram, and then various ways of coding up state machines in VHDL.
Concepts
Lecture Notes: Sections 2.4 to 2.4.10
input/output protocols
deriving finite state machines from dataflow diagrams
coding state machines in VHDL
Background
Mano Digital Design
Section 6-4: Analysis of Clocked Sequential Circuits
Section 6-5: State Reduction and Assignment
Section 6-7: Design Procedure
Reading
Smith ASIC
By now, you should be done with Chapter 8 (Programmable ASIC Design Software) and Chapter 10 (VHDL)
Section 12.2: Synthesis (From Lec-02)
Section 12.6: VHDL Logic Synthesis (From Lec-02)
Section 12.7: Finite State Machine Synthesis
Chapter 8: Sequential VHDL
Chapter 9: Registers
Section 12.2: Finite State Machines
2.4
2.4.1
Moore Machines
Outputs are dependent upon only the state
No combinational paths from inputs to outputs
Outputs can be either flops or combinational
[Figure: Moore machine state diagram; the surviving label is s3/0]
Mealy Machines
Outputs are dependent upon both the state and the inputs
Combinational paths from inputs to outputs
Outputs must be combinational
[Figure: Mealy machine state diagram with states s0 to s3 and transitions labelled a/1, !a/0, and /0]
2.4.2
A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.
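The two-process style can be sketched as follows. This is an illustrative example rather than one from the notes: the entity name fsm_pair, its ports, and its behaviour (output z is asserted once input a has been '1' on two consecutive clock edges) are all assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: z = '1' after a has been '1' on two consecutive edges.
entity fsm_pair is
  port (
    clk, reset, a : in  std_logic;
    z             : out std_logic
  );
end fsm_pair;

architecture main of fsm_pair is
  type state_ty is (s0, s1, s2);
  signal state, next_state : state_ty;
begin
  -- Clocked process: holds the state; reset is tested on every clock edge.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;
      else
        state <= next_state;
      end if;
    end if;
  end process;

  -- Combinational process: next-state function.
  process (state, a) begin
    case state is
      when s0 =>
        if a = '1' then next_state <= s1; else next_state <= s0; end if;
      when s1 =>
        if a = '1' then next_state <= s2; else next_state <= s0; end if;
      when s2 =>
        if a = '1' then next_state <= s2; else next_state <= s0; end if;
    end case;
  end process;

  -- Moore output: depends only on the state.
  z <= '1' when state = s2 else '0';
end main;
```

Folding the case statement into the clocked process would give the single-process style; the pair-of-processes form keeps the next-state function visible as separate combinational logic.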
Design Decisions
Moore vs Mealy (Sections 2.4.3.1 and 2.4.3.2)
Implicit vs Explicit (Section 2.4.6)
State values in explicit state machines: Enumerated type vs constants (Section 2.4.4.1)
State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.4.4.2)
if ... then ... else
case
for ... loop
while ... loop
2.4.3
[Figures: Moore state diagram (surviving label s3/0) and Mealy state diagram (states s0 to s3, transitions labelled a/1, !a/0, and /0)]
2.4.4
State Encoding
Simulation
When doing functional simulation with enumerated types, simulators often display waveforms with pretty-printed values rather than bits (e.g. s0 and s1 rather than 11 and 10). However, when simulating a design that has been mapped to gates, the enumerated type disappears and you are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very difficult to debug the design.
signal t : std_logic;
...
case t is
  when '1' => ...
  when '0' => ...
end case;
will result in an error message about missing cases. You must provide for t being 'H', 'U', etc. The simplest thing to do is to make the last test when others. However, this opens you up to potential bugs if the enumerated type you are testing grows to include more values, which then end up unintentionally executing your when others branch, rather than having a special branch of their own in the case statement.
Unused Values
If the number of values you have in your datatype is not a power of two, then you will have some unused values that are representable. For example:
subtype state_ty is std_logic_vector(2 downto 0);
constant s0 : state_ty := "010";
constant s1 : state_ty := "000";
constant s2 : state_ty := "001";
constant s3 : state_ty := "011";
constant s4 : state_ty := "101";
signal state : state_ty;
This type only needs five unique values, but can represent eight different values. What should we do with the three representable values that we don't need? The safest thing to do is to code your design so that if an illegal value is encountered, the machine resets or enters an error state.
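One way to send illegal values back to a known state is a when others branch in the state register's case statement. The sketch below is an assumption-laden illustration: the entity safe_fsm, its ports, and the simple s0 to s4 cycling behaviour are invented, and s0 is given the distinct encoding "010" so that all five states differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical five-state machine that steps s0 -> s1 -> s2 -> s3 -> s4 -> s0.
entity safe_fsm is
  port (
    clk, reset : in  std_logic;
    state_out  : out std_logic_vector(2 downto 0)
  );
end safe_fsm;

architecture main of safe_fsm is
  subtype state_ty is std_logic_vector(2 downto 0);
  constant s0 : state_ty := "010";  -- assumed encoding, distinct from s3
  constant s1 : state_ty := "000";
  constant s2 : state_ty := "001";
  constant s3 : state_ty := "011";
  constant s4 : state_ty := "101";
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;
      else
        case state is
          when s0 => state <= s1;
          when s1 => state <= s2;
          when s2 => state <= s3;
          when s3 => state <= s4;
          when s4 => state <= s0;
          -- "100", "110", "111" and metavalues: recover to the initial state
          when others => state <= s0;
        end case;
      end if;
    end if;
  end process;

  state_out <= state;
end main;
```

A dedicated error state could replace s0 in the others branch if the rest of the design needs to know that an illegal value was seen.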
Binary: Conventional binary counter.
One-hot: Exactly one bit is asserted at any time.
Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding where the initial state is all 0s.
Gray: Transition between adjacent values requires exactly one bit flip.
Custom: Choose encoding to simplify combinational logic for a specific task.
Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).
One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.
Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.
2.4.5
This section designs the state machine for the big_add example used in dataflow diagrams (Section 2.3.7). We pick up from the VHDL code for the datapath in Section 2.3.7.1.
[Figure: allocated dataflow diagram and block diagram for big_add, with ports i1, i2, i3, o1, registers r1, r2, r3, adders a1, a2, and the control block ctrl]
Two control signals from state machine:
r1_gets_in: r1 reads from input or a2
r3_gets_in: r3 reads from input or a1
Simulate the dataflow diagram and record the required values of the signals.
cycle        1      2      3      4
r1_gets_in   true   false  false  -
r3_gets_in   true   true   -      false
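Putting the datapath and these control values together gives a complete, simulatable design. The following is one possible sketch, not the author's listing: the 8-bit unsigned ports, the synchronous reset input, and the two-bit counter used as the state machine are assumptions, and the unconstrained value of r3_gets_in in cycle 3 is arbitrarily chosen as '1'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity big_add is
  port (
    clk, reset : in  std_logic;                -- reset is an assumed input
    i1, i2, i3 : in  unsigned(7 downto 0);     -- 8-bit width is assumed
    o1         : out unsigned(7 downto 0)
  );
end big_add;

architecture main of big_add is
  signal cycle_count            : unsigned(1 downto 0);
  signal r1, r2, r3, a1, a2     : unsigned(7 downto 0);
  signal r1_gets_in, r3_gets_in : std_logic;
begin
  -- Control: counter value 0 corresponds to cycle 1 of the dataflow diagram.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        cycle_count <= "00";
      else
        cycle_count <= cycle_count + 1;        -- wraps from 3 back to 0
      end if;
    end if;
  end process;

  r1_gets_in <= '1' when cycle_count = 0 else '0';
  r3_gets_in <= '0' when cycle_count = 3 else '1';

  -- Datapath: registers with input muxes, then two chained adders.
  process (clk) begin
    if rising_edge(clk) then
      if r1_gets_in = '1' then r1 <= i1; else r1 <= a2; end if;
      r2 <= i2;
      if r3_gets_in = '1' then r3 <= i3; else r3 <= a1; end if;
    end if;
  end process;

  a1 <= r1 + r2;
  a2 <= a1 + r3;
  o1 <= r3;
end main;
```

In this sketch the environment presents a, b, c on i1, i2, i3 in cycle 1, then d and e on i2 and i3 in cycle 2, then f on i2 in cycle 3; the sum appears on o1 in the cycle after cycle 4.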
2.4.6
There are two broad categories of state machines in VHDL: explicit and implicit. Explicit state machines are a direct translation of the hardware: concurrent assignments for the next-state equations and a clocked process for the flops that hold the state. Implicit state machines are built with processes that have multiple wait statements. Explicit state machines are more cumbersome to write, but they are simpler to synthesize and more commonly used. Implicit state machines are concise and readable. Very few books or synthesis manuals describe multiple-wait-statement processes, but they are relatively well supported among synthesis tools.
2.4.7
Several examples of implicit state machines that could be used to drive r1_gets_in and r3_gets_in.
2.4.7.2 Counter
This example uses a counter in a process to keep track of the state, and then uses concurrent assignments for the control signals. The assignments to r1_gets_in and r3_gets_in could be done with conditional assignments, or a combinational process. Some of these alternatives are illustrated in Section 2.4.8.
----------------------------------------------------
process begin
  cycle_count <= to_unsigned(0, 2);
  --------------------------------
  wait until rising_edge(clk);
  --------------------------------
  while 3 > cycle_count loop
    cycle_count <= cycle_count + 1;
    wait until rising_edge(clk);
  end loop;
end process;
----------------------------------------------------
with cycle_count select
  r1_gets_in <= '1' when "00",
                '0' when others;
----------------------------------------------------
with cycle_count select
  r3_gets_in <= '0' when "11",
                '1' when others;
----------------------------------------------------
2.4.8
This is an explicit state machine. A clocked process is used to store the state and a concurrent assignment is used to calculate the next state. The datapath is the same as in Section 2.3.6. The control signals for the datapath (r1_gets_in and r3_gets_in) drive the two multiplexors, one for each register (r1 and r3). The values of r1_gets_in and r3_gets_in are determined by the current state of the machine. In this section we first write the explicit state machine, and then look at several different coding styles for communicating between the state machine and datapath.
[Code listings not recovered. Surviving fragments: the choices S3, S0 | S1, and others; the assigned values 1; 1; 0; 1; 0; -; -; 0;]
2.4.9
Reset
All circuits should have a reset signal that puts the circuit back into a good initial state. There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
2.4.10
Figure 2.7: Four phase handshaking protocol
Used when timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.
Valid-bit protocol
[Figure: waveform of clk, valid, and data]
Figure 2.8: Valid-bit protocol
A low overhead (both in area and performance) protocol. Consumer must always be able to accept incoming data. Often used in pipelined circuits. More complicated versions of the protocol can handle pipeline stalls.
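A single pipeline stage that follows the valid-bit protocol might look like the sketch below. It is illustrative only: the entity valid_stage, its port names, the 8-bit width, and the "add 1" computation are assumptions, and a real design would probably also reset valid_out.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical pipeline stage: adds 1 to each valid input parcel.
entity valid_stage is
  port (
    clk       : in  std_logic;
    valid_in  : in  std_logic;
    data_in   : in  unsigned(7 downto 0);
    valid_out : out std_logic;
    data_out  : out unsigned(7 downto 0)
  );
end valid_stage;

architecture main of valid_stage is
begin
  process (clk) begin
    if rising_edge(clk) then
      valid_out <= valid_in;        -- the valid bit travels with the data
      if valid_in = '1' then
        data_out <= data_in + 1;    -- data register loads only valid parcels
      end if;
    end if;
  end process;
end main;
```

The stage never stalls, which matches the requirement that the consumer must always be able to accept incoming data.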
Start/Done Protocol
[Figure: waveform of clk, start, data_in, done, and data_out]
Figure 2.9: Start/Done protocol
A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece of data at a time and the time to compute the result is unpredictable.
LEC-07 Preliminaries
Overview
This lecture builds on material from Lec-05, where dataflow diagrams were introduced. In this lecture, we show how to deal with memory reads and writes in dataflow diagrams. This ties in with data hazards in computer architecture.
Concepts
Lecture Notes: Sections 2.5 to 2.5.2.6
Background
Reading
Smith ASIC
2.5
2.5.1
[Figure: legend of dataflow diagram symbols: input port, output port, state signal, array read, array write; diagrams of the basic memory operations on mem]
Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.
The anti-dependency arrow producing mem on a read.
Reads and writes are dependent upon the entire previous value of the memory array.
The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.
The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.5.1.4. The apparent dependency on and production of an entire memory array is because we don't know which address in the array will be read from or written to. There are optimizations that can be performed when we know the address (Section 2.5.1.5).
[Figures: example program on memory array M and modifications of it (slide titles: Initial Program, Valid Modification). Surviving code fragments: M[3] := 31, C := M[3], M[3] := 32, M[0] := 01]
[Figures: dataflow diagrams of dependencies between pairs of memory operations, built from mem(wr) and mem(rd) blocks with signals rd_addr, wr_addr, wr_data, rd_data, data1, data2, wr2_addr, and data_out]
Figure 2.10: Memory array example code and initial dataflow diagram
[Figures: the dataflow diagram of Figure 2.10 reduced to minimal dependencies, then with the critical path marked and the reads and writes ordered step by step (slide titles: Minimal Dependencies, Critical Path, Middle Read, First Write, Complete Ordering)]
First and last read are obvious from the critical path. Last write is obvious.
Middle Read: Only three reads, so once the first and last have been picked, the middle one is determined.
First Write: The first write is the one closest to the start of the critical path, although because we know the addresses, we could reschedule the first two writes.
Figure 2.12: Memory array with orderings
The ordering of writes 2 and 3 is determined because both have 3 as their address.
Figure 2.13: Final version of Figure 2.10
Put as many parallel operations into the same clock cycle as allowed by resources (one write + one read, two reads, or one write for dual-port RAM).
Preserve dependencies by putting dependent operations in separate clock cycles.
2.5.2
Two-Dimensional Array
The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.
architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);
begin
  y <= mem( a );            -- comb read
  mem( a ) <= b;            -- comb write
  process (clk) begin
    if rising_edge(clk) then
      mem( c ) <= w;        -- write port #1
    end if;
  end process;
  process (clk) begin
    if rising_edge(clk) then
      mem( d ) <= v;        -- write port #2
    end if;
  end process;
  u <= mem( e );            -- read port #2
end main;
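For contrast, a memory description that avoids those problems could look like the sketch below: one clocked write port with a write enable and one registered read port. The entity name mem_hw, the port names, and the 32 x 8 size are assumptions for illustration, and whether a given synthesis tool maps this onto a dedicated RAM block depends on the tool and the target chip.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32 x 8 memory with one clocked write port and one read port.
entity mem_hw is
  port (
    clk     : in  std_logic;
    we      : in  std_logic;
    wr_addr : in  unsigned(4 downto 0);
    wr_data : in  std_logic_vector(7 downto 0);
    rd_addr : in  unsigned(4 downto 0);
    rd_data : out std_logic_vector(7 downto 0)
  );
end mem_hw;

architecture main of mem_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array (natural range <>) of data;
  signal mem : data_vector(31 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(wr_addr)) <= wr_data;   -- single clocked write port
      end if;
      rd_data <= mem(to_integer(rd_addr));     -- registered read port
    end if;
  end process;
end main;
```

Because mem is a signal, a read of the address being written in the same clock cycle returns the old contents in this sketch.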
Altera
Altera uses MegaFunctions to implement RAM in VHDL. A MegaFunction is a black-box description of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM components of different sizes. In E&CE 427 we will provide you with the VHDL code for the RAM components that you will need in Lab-3 and the Project.
The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:
Number of Elements   Word Size (bits)
2048                 1
1024                 2
512                  4
256                  8
128                  16
Xilinx
Use component instantiation to get these components: ram16x1s, ram16x1d
Other sizes are also available; consult the datasheet for your chip.
[Figures: building larger memories from NxW components. In one diagram, two NxW memories supply DataOut[W-1..0] and DataOut[2W-1..W]; in another, NxW memories with ports WE, A, DI, DO drive a single DataOut]
Question: Why do dual-ported memories usually not support writes on both ports?
LEC-08 Preliminaries
Overview
This lecture builds on material from the previous three lectures, where dataflow diagrams, finite state machines, and memory array design were described. This lecture takes a stack (push, pop, swap, top) from an algorithmic description to an RTL implementation in VHDL. The major new idea is working with dataflow diagrams for circuits that perform multiple operations.
Concepts
Lecture Notes: Sections 2.6 to 2.6.4.3
combining FSMs, datapath, and storage
Background
Reading
2.6
2.6.1
Stack Requirements
The stack shall have 16 elements.
The inputs shall be registered.
When a push operation is done, in the clock cycle following the push instruction, inp shall have the data that is to be pushed onto the stack.
Popping from an empty stack or pushing onto a full stack results in undefined behaviour.
When doing a tos or pop operation, the output outp shall have the tos data in the clock cycle after the tos instruction is input. At all other times the output is unconstrained.
In the clock cycle following reset being asserted (set to 1), the stack shall be empty.
2.6.2
Stack Algorithm
A simple Perl program to implement an algorithmic description of the stack. NB: You don't need to know Perl in E&CE 427. Perl is just one example of the many different software programming languages that can be used to create algorithmic descriptions of circuits.
if ( $line eq "tos" ) {
    print( $stack[$tos] );
} elsif ( $line eq "pop" ) {
    print( $stack[$tos] );
    $tos = $tos - 1;
} elsif ( $line eq "push" ) {
    $tos = $tos + 1;
    $line = <STDIN>;
    chop( $line );
    $stack[$tos] = $line;
} elsif ( $line eq "swap" ) {
    $tmp = $stack[$tos];
    $stack[$tos] = $stack[$tos-1];
    $stack[$tos-1] = $tmp;
}
2.6.3
[Figures: dataflow diagrams for the stack operations Pop, Push, Tos, and Swap over the signals stack, tos, data_in, and data_out, using stack(rd), stack(wr), +1, and -1 blocks]
Resource usage read from the dataflow diagrams:
Operation          Registers   ALUs
Pop                2           1
Push               3           1
Swap               5           1
Swap (Optimized)   4           1
Swap (Optimized) uses four registers (stack, tos, stack[tos], stack[tos-1]) and one ALU; it eliminates one register.
[Figures: datapath block diagrams for Pop, Push, Tos, Swap, and All Operations, built from the tos register (d, ce, q, with reset), the stack memory (we, a, di, do), temporary registers tmp1 and tmp2, a +1 / -1 adjuster, and the ports inp and outp]
2.6.4
It uses a 2-d array for the stack, rather than specialized memory components from the library. We are relying on the synthesis tool to build a state machine to drive the datapath. Sometimes, by writing code that is closer to gate-level hardware, we can improve peformance and/or area.
There are four different ways to structure your RTL code:
  Single process
  Separate datapath
  Separate control, storage, and datapath
  Fully disassembled
[Figures: one block diagram per structure, showing how the Next-State Funs, Control, Storage, and Datapath blocks are grouped.]
Stack RTL
To write the RTL code for the stack, consider the following options:
  Replacing the stack as an array with a component instantiation of a memory array from the FPGA libraries
  Defining a state machine and signals to control the datapath (e.g. define a state type and a signal of type state, and do assignments to current and next-state signals)
Question to ponder: does an explicit state machine result in better hardware?
Block Diagram
[Figure: block diagram of the stack. The control block drives tos_inc_dec_sel, tos_ce, tmp2_ce, stack_addr_sel, stack_data_sel, stack_we, and tmp1_ce. The datapath contains the tos register (with reset), the stack memory (we, a, di, do), the tmp1 and tmp2 registers, and the tos_adj adder (+1 / -1), with the internal signals stack_addr and stack_data_in and the ports inp and outp.]
Inventory
  tos, tmp1, tmp2
  stack
  tos_adj, stack_addr, stack_data_in
  tos_ce, tos_inc_dec_sel, stack_addr_sel, stack_data_sel, stack_we, tmp2_ce, tmp1_ce
Question: Why are some signals unsigned and others std_logic_vector?
Answer: Signals that are used as numbers (e.g. addresses for the memory array) are unsigned. Non-numeric signals are std_logic_vector.
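A minimal sketch of this convention, assuming a hypothetical 16 x 4 memory (the entity mem16x4 and its port names are illustrative and are not taken from the course code). The address is unsigned because it is converted to an integer index; the data is std_logic_vector because it is only stored and copied.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical non-top-level memory; all names are illustrative.
entity mem16x4 is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(3 downto 0);          -- numeric: used as an index
    din  : in  std_logic_vector(3 downto 0);  -- non-numeric: plain bits
    dout : out std_logic_vector(3 downto 0)
  );
end mem16x4;

architecture main of mem16x4 is
  type mem_ty is array (0 to 15) of std_logic_vector(3 downto 0);
  signal mem : mem_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(addr)) <= din;   -- unsigned converts directly to an index
      end if;
      dout <= mem(to_integer(addr));    -- registered read of the old contents
    end if;
  end process;
end main;
```

If addr were declared as std_logic_vector, each use as an index would need an extra conversion, for example to_integer(unsigned(addr)).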
SepFsm Initialization
process begin
  init : loop
    -------------------------------
    empty           <= '1';
    tos_inc_dec_sel <= '-';
    stack_addr_sel  <= '-';
    tos_ce          <= '1';
    stack_we        <= '0';
    stack_data_sel  <= "--";
    tmp1_ce         <= '-';
    tmp2_ce         <= '-';
SepFsm Pop
    -------------------------------
    loop
      -------------------------------
      wait until rising_edge(clk);
      next init when (reset = '1');
      -------------------------------
      case inp is
        when pop =>
          tos_inc_dec_sel <= '0';
          stack_addr_sel  <= '1';
          tos_ce          <= '1';
          stack_we        <= '0';
          stack_data_sel  <= "--";
          tmp1_ce         <= '-';
          tmp2_ce         <= '-';
SepFsm Push
        when push =>
          if (empty = '1') then
            tos_inc_dec_sel <= '-';
            stack_addr_sel  <= '0';
            tos_ce          <= '0';
          else
            tos_inc_dec_sel <= '1';
            stack_addr_sel  <= '1';
            tos_ce          <= '1';
          end if;
          stack_data_sel <= "--";
          stack_we       <= '0';
          tmp1_ce        <= '-';
          tmp2_ce        <= '-';
          -------------------------------
          wait until rising_edge(clk);
          next init when (reset = '1');
          -------------------------------
          empty           <= '0';
          tos_inc_dec_sel <= '-';
          stack_addr_sel  <= '0';
          tos_ce          <= '0';
          stack_data_sel  <= "00";
          stack_we        <= '1';
          tmp1_ce         <= '-';
          tmp2_ce         <= '-';
SepFsm Swap
        when swap =>
          ...
      end case;
    end loop;
  end loop;
end process;
SepFsm tmp1
-----------------------------------------------------
process (clk) begin
  if rising_edge(clk) then
    if (tmp1_ce = '1') then
      tmp1 <= stack_data_out;
    end if;
  end if;
end process;
SepFsm tmp2
-----------------------------------------------------
process (clk) begin
  if rising_edge(clk) then
    if (tmp2_ce = '1') then
      tmp2 <= stack_data_out;
    end if;
  end if;
end process;
SepFsm Tos
-----------------------------------------------------
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      tos <= to_unsigned(0, 4);
    elsif (tos_ce = '1') then
      tos <= tos_adj;
    end if;
  end if;
end process;
Dp-Op Declarations
architecture dp_op of stack is
  -----------------------------------------------------
  -- define the states
  type dp_op_ty is
    ( init_op, pop_op, push1_op, push2_op,
      swap_wr_tmp1_op, swap_wr_tmp2_op,
      swap_rd_tmp1_op, swap_rd_tmp2_op,
      nop_op );
  signal dp_op : dp_op_ty;
  signal tos, tos_adj, stack_addr : unsigned(3 downto 0);
  signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
    : std_logic_vector(3 downto 0);
  signal empty, stack_we : std_logic;
begin
Dp-Op Tos
-----------------------------------------------------
process (clk) begin
  if rising_edge(clk) then
    if (dp_op = init_op) then
      tos <= to_unsigned(0,4);
    elsif ( (dp_op = pop_op)
            OR (dp_op = push1_op and (empty = '0')) ) then
      tos <= tos_adj;
    end if;
  end if;
end process;
-----------------------------------------------------
tos_adj <= tos + to_unsigned(1,3) when (dp_op = push1_op)
      else tos - to_unsigned(1,3);
-----------------------------------------------------
(dp_op = pop_op)
((dp_op = push1_op) AND (empty = '0'))
(dp_op = swap_wr_tmp1_op)
(dp_op = swap_rd_tmp2_op)
Dp-Op Output
-----------------------------------------------------
outp <= stack_data_out;
-----------------------------------------------------
Explicit Declarations
architecture state of stack is
  type state_ty is
    ( init_st, pop_st, push1_st, push2_st,
      swap_wr_tmp1_st, swap_wr_tmp2_st,
      swap_rd_tmp1_st, swap_rd_tmp2_st,
      nop_st );
  signal state, state_n : state_ty;
  ... ...
Explicit Function
  -------------------------------------------------------
  function restart (inp : std_logic_vector(3 downto 0))
    return state_ty
  is begin
    case inp is
      when pop    => return(pop_st);
      when push   => return(push1_st);
      when swap   => return(swap_wr_tmp1_st);
      when others => return(nop_st);
    end case;
  end restart;
begin
Explicit Tos
process (clk) begin
  if rising_edge(clk) then
    if (state = init_st) then
      tos <= to_unsigned(0,4);
    elsif ( (state = pop_st)
            OR (state = push1_st and (empty = '0')) ) then
      tos <= tos_adj;
    end if;
  end if;
end process;
LEC-09 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
  wk-01 to wk-05: VHDL Design and Optimization
    wk-01: Overview and VHDL
    wk-02: VHDL Semantics
    wk-03 to wk-05: Design and Optimization
      wk-03: Dataflow Diagrams; State Machines
      wk-04: Memory Design; Example Design
      wk-05: Guidelines and Optimizations
  Functional Validation
  Performance Analysis, Prediction, and Optimization
  Timing Analysis
  Power Analysis and Reduction
  Faults and Testing
  Review
Concepts
Lecture Notes: Sections 2.7-2.9.4
  coding guidelines
  more vhdl features
  strength reduction
  mux-pushing
2.7
2.7.1 Design Process
Recommendation: Spend the time up front to plan a good design on paper. Use dataflow diagrams and state machines to predict performance and area.
This section gives guidelines for building robust, portable, and synthesizable VHDL code. Portability covers both different simulation and synthesis tools and different implementation technologies.
Remember, there is a world of difference between getting a design to work in simulation and getting it to work on a real FPGA. There is also a huge difference between getting a design to work in an FPGA for a few minutes of testing and getting thousands of products to work for months at a time in thousands of different environments around the world.
The coding guidelines here are designed to help you get your E&CE 427 project to work, as well as all of your subsequent industrial designs.
Finally, note that there are exceptions to every rule. You might find yourself in a situation where your particular circumstances (e.g. choice of tool, target technology, etc) would benefit from bending or breaking a guideline here. Within E&CE 427, of course, there won't be any such circumstances.
2.7.2 Signal Declarations
Signals vs Variables
Use signals, do not use variables.
reason: The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware.
Std Logic
Use std_logic signals, do not use bit or Boolean.
reason: std_logic is the most commonly used signal type across synthesis tools, simulation tools, and cell libraries.
Port Modes
Use in or out, do not use inout.
reason: inout signals are tri-state.
note: If you have an output signal that you also want to read from, you might be tempted to declare the direction of the signal to be inout. A better solution is to create a new internal signal that you both read from and write to. The output port is then simply driven from the internal signal.
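The note on port modes can be sketched as follows. This is an illustrative example only: the entity counter and the signal count_int are hypothetical names, not part of the course code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 4-bit counter; all names are illustrative.
entity counter is
  port (
    clk   : in  std_logic;
    reset : in  std_logic;
    count : out std_logic_vector(3 downto 0)  -- mode out, not inout
  );
end counter;

architecture main of counter is
  -- Internal signal that is both read from and written to.
  signal count_int : unsigned(3 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        count_int <= to_unsigned(0, 4);
      else
        count_int <= count_int + 1;  -- reads the internal signal, not the port
      end if;
    end if;
  end process;

  -- The output port is only driven from the internal signal.
  count <= std_logic_vector(count_int);
end main;
```

The port count is never read inside the architecture, so mode out is sufficient.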
Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector. Do not use signed or unsigned for primary inputs or outputs.
reason: Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned vectors in entities into std_logic_vectors. If you want the same testbench to work for both functional simulation and timing simulation, you must not use signed or unsigned signals in the top-level entity of your chip.
note: Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and inside architectures. It is only the top-level entity that should not use signed or unsigned signals.
2.7.3 Processes
For a combinational process, the sensitivity list should contain all of the signals that are read in the process.
reason: Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A tool that adheres to the standard will introduce latches if not all signals that are read from are included in the sensitivity list.
exception: In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.
Combinational Processes
For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.
reason: If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.
note: For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin.
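One common way to satisfy this guideline is a default assignment at the top of the process, so that every path through the process assigns the signal even when a branch is omitted. The sketch below uses that technique; the entity sel and its ports are hypothetical names chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 2:1 selector with an enable; all names are illustrative.
entity sel is
  port (
    en, s, a, b : in  std_logic;
    z           : out std_logic
  );
end sel;

architecture main of sel is
begin
  process (en, s, a, b)   -- every signal that the process reads
  begin
    z <= '0';             -- default assignment: covers every path
    if en = '1' then
      if s = '1' then
        z <= a;
      else
        z <= b;
      end if;
    end if;               -- no else branch, but z was already assigned
  end process;
end main;
```

Without the default assignment, the path where en is '0' would leave z unassigned and z would become a latch.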
Each signal should be assigned to in only one process.
reason: Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, short circuits, and other bad things.
exception: Multiple drivers are acceptable if your implementation technology has wired-ANDs or wired-ORs. FPGAs don't have wired-ANDs or wired-ORs.
Separate unrelated signals into different processes.
reason: Grouping assignments to unrelated signals into a single process can complicate the control circuitry for that process. Each branch in a case statement or if-then-else adds multiplexor or chip-enable circuitry.
2.7.4
Use flops, not latches (see section 1.7.2).
Use D-flops, not T, JK, etc (see section 1.7.2).
For every signal in your design, know whether it should be a flip-flop or combinational. Before simulating your design, examine the log file LOG/dc_shell.log to see if the flip-flops in your circuit match your expectations, and to check that you don't have any latches in your design.
2.7.5 State Machines
In a state machine, illegal and unreachable states should transition to the reset state.
reason: Creates more robust implementations. In the field, your circuit will be subjected to illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At some point in time, something weird will happen that will cause it to jump into an illegal state. Having a system crash and reboot is much better than having it generate incorrect outputs that aren't detected.
State Encoding
If your state machine has fewer than 16 states, use a one-hot encoding.
reason: For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2 n flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to determine if the circuit is in a particular state. For small values of n, a one-hot signal results in a smaller and faster circuit. For large values of n, the number of signals required for a one-hot design is too great a penalty to compensate for the simplicity of the decoding circuitry.
note: Using an enumerated type for states allows the synthesis tool to choose state encodings that it thinks will work well to balance area and clock speed. Quartus uses a modified one-hot encoding, where the bit that denotes the reset state is inverted. That is, when the reset bit is 0, the system is in the reset state, and when the reset bit is 1, the system is not in the reset state. The other bits have the normal polarity. The result is that when the system is in the reset state, all bits are 0, and when the system is in a non-reset state, two bits are 1.
note: Using your own encoding allows you to leverage knowledge about your design that the synthesis tool might not be able to deduce.
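The two state-machine guidelines above can be combined in a small sketch. The example assumes a hand-written one-hot encoding for a hypothetical three-state machine; the entity fsm, its ports, and the state names are illustrative and do not come from the course code. Note that a synthesis tool may still optimize away logic for states it considers unreachable, depending on its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state machine; all names are illustrative.
entity fsm is
  port (
    clk, reset, go : in  std_logic;
    busy           : out std_logic
  );
end fsm;

architecture main of fsm is
  -- Hand-written one-hot encoding: one flip-flop per state.
  subtype state_ty is std_logic_vector(2 downto 0);
  constant idle_st : state_ty := "001";
  constant run_st  : state_ty := "010";
  constant done_st : state_ty := "100";
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then              -- synchronous reset
        state <= idle_st;
      else
        case state is
          when idle_st =>
            if go = '1' then
              state <= run_st;
            end if;
          when run_st  => state <= done_st;
          when done_st => state <= idle_st;
          -- Illegal encodings (e.g. "000" or "011") go to the reset state.
          when others  => state <= idle_st;
        end case;
      end if;
    end if;
  end process;

  -- Decoding a one-hot state needs only a single bit.
  busy <= state(1);
end main;
```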
2.7.5.1 Reset
Include a reset signal in all clocked circuits.
reason: For most implementation technologies, when you power-up the circuit, you do not know what state it will start in. Also, if something goes wrong while the circuit is running, you need a way to get it into a guaranteed state.
For implicit state machines, check for reset after every wait statement.
reason: Missing a reset check after a wait statement means that your circuit might not notice a reset signal, or different signals could reset in different clock cycles, causing your circuit to get out of synch.
Connect reset to the important control signals in the design, such as the state signal. Do not reset every flip-flop.
reason: Using reset adds area and delay to a circuit. The fewer signals that need reset, the faster and smaller your design will be.
note: Connect the reset signal to critical flip-flops, such as the state signal. Datapath signals rarely need to be reset.
Synchronous Reset
Use synchronous, not asynchronous, reset.
reason: Creates more robust implementations. Signal propagation delays mean that asynchronous resets cause different parts of the circuit to be reset at different times. This can lead to glitches, which then might cause the circuit to move to an illegal state.
2.7.6
Put flip-flops on primary inputs and outputs of a chip.
reason: Creates more robust implementations. Signal delays between chips are unpredictable. Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting flip-flops on the inputs and outputs of a chip provides clean boundaries between circuits.
note: This only applies to primary inputs and outputs of a chip (the signals in the top-level entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or outputs. Within a chip, you do not need to put flip-flops on both inputs and outputs.
2.8
2.8.1 Vectors
VHDL supports reading from and assigning to slices (aka discrete subranges) of vectors.
  The slices on both sides of the assignment must have the same length.
  The direction (downto or to) of each slice must match the direction of the signal declaration.
  The direction of the target and expression may be different.
Declarations
  ----------------------------------------------------
  a, b       : in  std_logic_vector(15 downto 0);
  c, d, e    : out std_logic_vector(15 downto 0);
  ----------------------------------------------------
  ax, bx     : in  std_logic_vector(0 to 15);
  cx, dx, ex : out std_logic_vector(0 to 15);
  ----------------------------------------------------
  m, n       : in  unsigned(15 downto 0);
  p, q, r    : out unsigned(15 downto 0);
  ----------------------------------------------------
  w, x       : in  signed(15 downto 0);
  y, z       : out signed(15 downto 0)
  ----------------------------------------------------
Legal code
  c(3 downto 0) <= a(15 downto 12);
  cx(0 to 3)    <= a(15 downto 12);
  (e(3), e(4))  <= bx(12 to 13);
  (e(5), e(6))  <= b(13 downto 12);
Illegal code
  d(0 to 3)     <= a(15 to 12);         -- slice dirs must be same as decl
  e(3) & e(2)   <= b(12 to 13);         -- syntax error on &
  p(3 downto 0) <= (m + n)(3 downto 0); -- syntax error on )(
  z(3 downto 0) <= m(15 downto 12);     -- types on lhs and rhs must match
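Two of the illegal lines above have straightforward legal rewrites, sketched below under the assumption of 4-bit outputs. The entity slices and the signal sum are hypothetical names; the port names m, n, p, and z follow the declarations above but the widths of p and z are reduced for brevity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper entity; sum is an illustrative helper signal.
entity slices is
  port (
    m, n : in  unsigned(15 downto 0);
    p    : out unsigned(3 downto 0);
    z    : out signed(3 downto 0)
  );
end slices;

architecture main of slices is
  signal sum : unsigned(15 downto 0);
begin
  -- Name the intermediate result, then slice the named signal
  -- instead of slicing the expression (m + n) directly.
  sum <= m + n;
  p   <= sum(3 downto 0);

  -- Convert explicitly between the closely related types
  -- unsigned and signed so that both sides have the same type.
  z   <= signed(m(15 downto 12));
end main;
```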
2.8.2
Some constructs that are useful and will be described in later chapters and sections:
  for-generate : replicates hardware
  if-generate : conditionally generates hardware
  report : print a message on stderr while simulating
  assert : assertions about behaviour of signals, very useful with report statements
  generics : parameters to an entity that are defined at elaboration time
  attributes : predefined functions for different datatypes, for example the high and low indices of a vector
2.9
2.9.1 Strength Reduction
  is_odd, is_even : least significant bit
  is_neg, is_pos : most significant bit
NOTE: use is_odd(a) rather than a(0)
2.9.2
2.9.2.1 Mux-Pushing
Pushing multiplexors into the fanin of a signal can reduce area.
Before
  z <= a + b when (w = '1') else a + c;
After
  tmp <= b when (w = '1') else c;
  z <= a + tmp;
The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.
Subexpression Elimination
NOTE: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
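A minimal sketch of the note above, assuming two registered results that share the subexpression a + b. The entity cse and all signal names are hypothetical. The shared signal tmp is assigned outside the clocked process, so it stays combinational and the latency of y and z is unchanged.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; all names are illustrative.
entity cse is
  port (
    clk     : in  std_logic;
    a, b, c : in  unsigned(7 downto 0);
    y, z    : out unsigned(7 downto 0)
  );
end cse;

architecture main of cse is
  signal tmp : unsigned(7 downto 0);
begin
  -- Shared subexpression, assigned outside the clocked process:
  -- tmp is combinational, so no clock cycle is added.
  tmp <= a + b;

  process (clk)
  begin
    if rising_edge(clk) then
      y <= tmp + c;   -- was (a + b) + c
      z <= tmp - c;   -- was (a + b) - c
    end if;
  end process;
end main;
```

Moving the assignment to tmp inside the clocked process would turn tmp into a flip-flop, and y and z would then be computed from the previous cycle's value of a + b.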
To improve performance: If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
To reduce area: If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.
NOTE: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.
2.9.3 Arithmetic
VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism. Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
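Both points can be sketched in a few lines. The entity arith and its ports are hypothetical names used only for illustration; whether a given synthesis tool honours the parenthesization is tool-dependent.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; all names are illustrative.
entity arith is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    sum        : out unsigned(15 downto 0);
    low12      : out unsigned(11 downto 0)
  );
end arith;

architecture main of arith is
begin
  -- Parentheses suggest a balanced tree: (a + b) and (c + d) can be
  -- computed in parallel, followed by one more adder.
  sum <= (a + b) + (c + d);

  -- Trim the inputs to 12 bits before adding, rather than computing a
  -- 16-bit sum and trimming the result.
  low12 <= a(11 downto 0) + b(11 downto 0);
end main;
```

The lower 12 bits of a sum depend only on the lower 12 bits of the operands, so the trimmed adder produces the same low12 value as trimming a 16-bit sum.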
2.9.4 Pipelining
Pipelines will not be covered in E&CE 427. This subsection is provided for those who already understand the basics of pipelining. You can turn a dataflow diagram into a pipeline by making each clock cycle of the dataflow diagram a separate pipe stage. However, this can be complicated and error-prone. You need to worry about data hazards if you have state-holding registers in your algorithm. You need to worry about structural hazards if different instructions have different latencies.
LEC-10 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
  wk-01 to wk-05: VHDL
    wk-01: Overview and VHDL
    wk-02: VHDL Semantics
    wk-03 to wk-05: Design and Optimization
      wk-03: Dataflow Diagrams; State Machines
      wk-04: Memory Design; Example Design
      wk-05: Guidelines and Optimizations
  Functional Validation
  Performance Analysis, Prediction, and Optimization
  Timing Analysis
  Power Analysis and Reduction
  Faults and Testing
  Review
Overview
In this lecture we will go over some design guidelines and optimization techniques that are specific to FPGAs.
Concepts
Lecture Notes: Sections 2.10-2.11.2
  Coding guidelines for FPGAs
  Hardware for generic FPGAs
2.10 FPGA-Specific Guidelines
2.10.1 Generic FPGAs
2.10.1.1 Hardware
[Figures: a generic FPGA cell in several configurations. Signals shown: data_in, ctrl_in, data_out, carry_out, flop_data_in, and flop_data_out; blocks shown: comb (combinational logic) and a flip-flop with D and CE inputs.]
Flip-flops are almost free in FPGAs.
reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops. Usually each 4:1 combinational circuit has a flip-flop.
Use It or Lose It
Aim for using 80-90% of the cells on a chip.
reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
reason: If you use less than 80% of the cells, then probably:
  there are optimizations that will increase performance and still allow the design to fit on the chip; or
  you spent too much human effort on optimizing for low area; or
  you could use a smaller (cheaper!) chip.
exception: In E&CE 427 (unlike in real life), the mark is based on the actual number of cells used.
Area Estimation
You can estimate the area of a design by counting the number of flip-flops in the fanin of each flip-flop.
reason: Each set of four source signals requires one cell.
  Source flops   1  2  3  4  5  6  7  8  9  10  11
  Cells          1  1  1  1  2  2  2  3  3   3   4
note: This technique is generally an overestimate, because a single cell can drive several other cells (common subexpression elimination).
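The cell counts in the table are consistent with the closed form below. This formula is an inference from the table values, not something stated in the notes: it assumes that the first cell absorbs up to four source flops and that each further cell takes the output of the previous cell plus up to three more sources.

```latex
\[
  \text{cells}(n) \;=\; \max\!\left(1,\ \left\lceil \frac{n-1}{3} \right\rceil\right),
  \qquad \text{e.g.}\quad
  \text{cells}(4) = 1,\quad
  \text{cells}(5) = \lceil 4/3 \rceil = 2,\quad
  \text{cells}(11) = \lceil 10/3 \rceil = 4 .
\]
```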
  General purpose interconnect (configurable, slow)
  Carry chains and cascade chains (vertically adjacent cells, fast)
Cells not used for computation can be used as wires to shorten the length of the path between cells.
2.10.1.2 Generic Clocks
Characteristics of clock signals:
  High fanout (drive many gates)
  Long wires (destination gates scattered all over chip)
Characteristics of FPGAs:
  Very few gates that are large (strong) enough to support a high fanout.
  Very few wires that traverse the entire chip and can be connected to every flip-flop.
Clocks
Guidelines for clock signals on FPGAs:
Use just one clock signal.
reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.
Use only one edge of the clock signal.
reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.
2.10.1.3
Memory
For five or more years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated using the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.
Microprocessors
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.
  Hard ARM 922T with 200 MIPS
  PowerPC 405 with 420 D-MIPS
The Xilinx Virtex-II Pro has 4 PowerPCs and enough programmable hardware to implement a complete 32-bit microprocessor.
Arithmetic Circuitry
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.
Using these resources can significantly improve both the area and performance of a design.
[Table residue: 16, 18; 16 at 130MHz, 18 at ???MHz]
Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.
  Company   Product
  Altera    True-LVDS (1 Gbps)
  Xilinx    Rocket I/O (3 Gbps)
2.10.2 Altera APEX20K
  4-input lookup table (LUT)
  Carry-chain computation circuitry
  Cascade-chain computation circuitry
  Flip-flop with load, clear, clock-enable
LE Interconnect
  4 data inputs
  2 data outputs
  Carry in, carry out
  Cascade in, cascade out
  Clock, clock-enable
  Async clear, synch set (load), synch clear (reset)
  Global reset
2.11 Example Circuits
2.11.1 Ripple-Carry Adder
2.11.2 Barrel Shifter
Chapter 3
Functional Validation
LEC-11 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
  wk-01 to wk-02: VHDL
  wk-03 to wk-05: Design and Optimization
  wk-06: Functional Validation
    Lec-11: Datapath Validation and Testbenches
    Lec-12: Control Validation and Assertions
  Performance Analysis, Prediction, and Optimization
  Timing Analysis
  Power Analysis and Reduction
  Faults and Testing
  Review
Overview
The purpose of this lecture is to illustrate techniques to quickly and reliably detect bugs in datapath circuits. We will discuss validation of datapath circuits and introduce the notions of testbench, specification, and implementation. We'll illustrate a progression of techniques that can be used to go from very simple tests to more complete and complicated tests.
Concepts
  Specification
  Implementation
  Design Under Test (DUT)
  Unit Under Test (UUT)
  Test Bench
  Stimulus
  Manual tests
  Array of test vectors
  Generated tests
  Functional specification
  Relational specification
Reading
Smith's ASIC:
  10.2.7 : sample testbench
  13.1 : levels of temporal abstraction for simulation
  13.2 : simulation example
  13.5 : different simulation models for hardware
Rushton's VHDL for Logic Synthesis:
  Ch 13 : Testbenches
Ashenden's Designer's Guide to VHDL:
  Sect 1.4 : Testbenches
  Sect 6.2.1 : Testing the Behavioural Model of a Pipelined Multiplier Accumulator
  Sect 6.3.3 : Testing the Register-Transfer-Level Model of a Pipelined Multiplier Accumulator
  Sect 15.3 : Testing the Behavioural Model of a DLX Computer System
  Sect 15.5 : Testing the Register-Transfer-Level Model of a DLX Computer System
Janick Bergeron's verification guild website: http://www.janick.bergeron.com/guild/default.htm
3.1 Overview
3.1.1
Terminology
functional validation: checking that a design (e.g. RTL code) has the correct behaviour
  usually treats combinational circuitry as having zero-delay
  usually done by simulating the circuit with test vectors
  big challenges are simulation speed and test generation
formal verification: checking that a design has the correct behaviour for every possible input and internal state
  uses mathematics to reason about the circuit, rather than checking individual vectors of 1s and 0s
  capacity problems: only usable on detailed models of small circuits or abstract models of large circuits
  mostly a research topic, but some practical applications have been demonstrated
  tools include model checking and theorem proving
  formal verification is not a guarantee that the circuit will work correctly
performance validation: checking that the implementation has (at least) the desired performance
power validation: checking that the implementation has (at most) the desired power
equivalence verification (checking): checking that the design generated by a synthesis tool has the same behaviour as the RTL code
timing verification: checking that all of the paths in a circuit meet the timing constraints
3.1.2
Everyone should get a lecture on why their first industrial design won't work in the field. Here are a few reasons:
Unreachable States
1. You forgot to make your unreachable states transition to the initial (reset) state. Clock glitches, power surges, etc will occasionally cause your system to jump to a state that isn't defined or produce an illegal data value. When this happens, your design should reset itself, rather than crash or generate illegal outputs.
Untestable Registers
2. You have internal registers that you can't access or test. If you can set a register you must have some way of reading the register from outside the chip.
3.2 Test Cases
Test case / test vector : A combination of inputs and internal state values. Represents one possible test of the system.
Boundary conditions / corner cases : A test case that represents an unusual situation on input and/or internal state signals. Corner cases are likely to contain bugs.
Test scenario : A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit. For example, a scenario for an elevator controller might include a sequence of button pushes and movements between floors.
Test suite : A collection of test vectors that are run on a circuit.
3.2.1 Coverage
To be sure that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional validation.
Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?
Answer: The value of each combinational signal is determined by the flip-flops and inputs in its fanin. Once the values of the inputs and flip-flops are known, the value of each combinational signal can be calculated. Thus, the combinational signals do not add additional cases that we need to consider.
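A small worked instance shows how much this matters. The bit counts are made-up values for illustration and do not describe a circuit from these notes: take $n_i = 8$ input bits, $n_s = 4$ flip-flop bits, and $n_c = 20$ combinational signals.

```latex
\[
  2^{n_i + n_s} = 2^{8 + 4} = 4096
  \qquad\text{rather than}\qquad
  2^{n_i + n_s + n_c} = 2^{8 + 4 + 20} = 2^{32} = 4\,294\,967\,296 .
\]
```

Each additional input or flip-flop bit still doubles the number of cases, which is why full coverage is feasible only for small circuits.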
Definition (Coverage): The coverage that a suite of tests achieves on a circuit is the percentage of cases that are simulated by the tests. 100% coverage means that the circuit has been simulated for all combinations of values for input signals and internal signals.
NOTE: Coverage Terminology
There are many different types of coverage, which measure everything from the percentage of cases that are exercised to the number of output values that are exercised.
NOTE: Coverage Tools
There are many different commercial software programs that measure code and other types of coverage.
Company: Cadence; Cadence; Fintronic; interHDL; Summit Design; Synopsys; TransEDA; Verisity; Veritools; Aldec
Tool: Affirma Coverage Analyzer; DAI Coverscan; FinCov; Coverit; HDLScore; CoverMeter; Verification Navigator; SureCov; Express VCT, VeriCover; Riviera
Coverage: code, expressions, fsm code bought by Avant! ? code, events, variables code coverage (dead?) code and fsm code, block, values, fsm code, branch code, block
3.2.2
Three states: off, low, and high. The user can set the desired temperature to any value between 15°C and 25°C. There is a thermometer to measure the current temperature for values between 0°C and 40°C. The state machine in figure 3.1 describes the transitions between states.
[Figure 3.1: state machine with states off, low, and high; recoverable labels: "5 =< diff", "HIGH"]
Sample Scenario
[Figure: sample scenario showing des_tmp and the current state (off, low, high) over time; temperature axis ticks 23, 22, 20, 15, 13]
41
11
64
16
1024 4 4096
[Figure: sample scenario with adjusted ranges showing cur_temp and the current state (off, low, high) over time]
Notice that with adjusted ranges, there is very little change in behaviour.
[Figure: state machine with adjusted ranges; recoverable labels: "4 =< diff", "HIGH"]
Choosing data ranges to be powers of two reduces the number of illegal input and internal state values.
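A rough count shows why power-of-two ranges help. The encoding widths here are assumptions made for illustration (6 bits for cur_temp, 4 bits for des_tmp, 2 bits for the state); only the temperature ranges and the three states come from the example.

```latex
\begin{align*}
\text{Encodings available} &= 2^{6} \times 2^{4} \times 2^{2} = 64 \times 16 \times 4 = 4096 \\
\text{Legal with original ranges} &= 41 \times 11 \times 3 = 1353 \\
\text{Legal with power-of-two data ranges} &= 64 \times 16 \times 3 = 3072
\end{align*}
```

Under these assumptions, widening the two temperature ranges to fill their encodings leaves only the unused state encoding as a source of illegal values.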
3.2.3
Consider doing the functional simulation for a double precision (64-bit) floating-point divider.
Given Information
Data width | 64 bits
Number of gates in circuit | 10 000
Number of assembly-language instructions to simulate one gate for one test case | 100
Number of clock cycles required to execute one assembly-language instruction on the computer that is running the simulation | 0.5
Clock speed of computer that is running the simulation | 1 Gigahertz
Number of Cases
Question: How many cases must be considered?
Answer:
$$\text{NumTestsTot} = 2^{64} \times 2^{64} = 2^{128} \approx 3.4 \times 10^{38}\ \text{cases}$$
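Answer:
$$\text{TestTimeTot} = 10000\ \text{gates} \times 100\ \tfrac{\text{instrs}}{\text{gate}} \times 0.5\ \tfrac{\text{cycles}}{\text{instr}} \times 10^{-9}\ \tfrac{\text{secs}}{\text{cycle}} \times 3.4 \times 10^{38}\ \text{cases} = 1.7 \times 10^{35}\ \text{secs} \approx 5.4 \times 10^{27}\ \text{years}$$
(using about $3.16 \times 10^{7}$ seconds per year)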
Coverage
Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
Answer:
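1. Calculate number of seconds to simulate one test case on one computer:
$$\text{TestTime}_{1:1} = 10000\ \text{gates} \times 100\ \tfrac{\text{instrs}}{\text{gate}} \times 0.5\ \tfrac{\text{cycles}}{\text{instr}} \times 10^{-9}\ \tfrac{\text{secs}}{\text{cycle}} = 5 \times 10^{-4}\ \text{secs}$$
2. Calculate number of seconds to simulate one test case on ten computers:
$$\text{TestTime}_{1:10} = \frac{\text{TestTime}_{1:1}}{10} = 5 \times 10^{-5}\ \text{secs}$$
3. Number of tests per year using ten computers:
$$\text{NumTests}_{:10} = \frac{60\ \tfrac{\text{secs}}{\text{min}} \times 60\ \tfrac{\text{mins}}{\text{hour}} \times 24\ \tfrac{\text{hours}}{\text{day}} \times 365.25\ \tfrac{\text{days}}{\text{year}}}{\text{TestTime}_{1:10}} = \frac{3.16 \times 10^{7}\ \text{secs}}{5 \times 10^{-5}\ \text{secs}} \approx 6.3 \times 10^{11}\ \text{cases}$$
4. Calculate coverage achieved by running tests on ten computers for one year:
$$\text{Covg} = \frac{\text{NumTestsRun}}{\text{NumTestsTot}} = \frac{\text{NumTests}_{:10}}{\text{NumTestsTot}} = \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}} \approx 1.9 \times 10^{-27} \approx 2 \times 10^{-25}\,\%$$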
3.2.4
From Validating the Intel(R) Pentium(R) Microprocessor by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 427 web page.)
Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz. By tapeout, over 200 billion simulation cycles had been run on a network of computers. All of these simulations represent less than two minutes of running a real processor.
Research
Research challenges:
1. How to make simulations run faster?
2. How to choose test cases so that cases that are run are likely to detect bugs?
Research activities in functional validation:
1. Simulation acceleration
2. Coverage analysis
3. Test generation
4. Formal verification
Practice
Challenges in practice:
1. Writing specification
2. Identifying corner cases
3. Choosing test cases
4. Finding root cause of unexpected behaviour
3.3 Testbenches
A test bench (also known as a test rig, test harness, or test jig) is a collection of code used to simulate a circuit and check if it works correctly. Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of VHDL. Use the full power of VHDL to make your testbenches concise and powerful.
3.3.1
Implementation: Circuit that you're checking for bugs; also known as the design under test or unit under test.
Stimulus: Generates test vectors.
Specification: Describes desired behaviour of implementation.
Check: Checks whether implementation obeys specification.
- Testbenches usually do not have any inputs or outputs. Inputs are generated by stimulus. Outputs are analyzed by check, and relevant information is printed using report statements.
- Different circuits will use different stimuli, specifications, and checks.
- The roles of the specification and check are somewhat flexible. Most circuits will have complex specifications and simple checks. However, some circuits will have simple specifications and complex checks.
- If two circuits are supposed to have the same behaviour, then they can use the same stimuli, specification, and check.
- If two circuits are supposed to have the same behaviour, then one can be used as the specification for the other.
- Testbenches are restricted to stimulating only primary inputs and observing only primary outputs. To check the behaviour of internal signals, use assertions (Lec-12).
3.3.2
- Specification has same inputs and outputs as implementation.
- Specification is a clock-cycle accurate description of desired behaviour of implementation.
- Check is an equality test between outputs of specification and implementation.
Examples
- Execution modules: output is sum, difference, product, quotient, etc. of inputs
- DSP filters
- Instruction decoders
NOTE: Functional specification vs Reference model
"Functional specification" and "reference model" are often used interchangeably.
3.3.3
[Figure: testbench structure with stimulus, check, and implementation blocks]
- Relational testbenches, or relational specifications, are used when we do not want to specify the specific output values that the implementation must produce.
- Instead, we want to check that some relationship holds between the output and the input, or that some relationship holds amongst the output values (independent of the values of the input signals).
- Specification is usually just wires to feed the input signals to the check.
- Check is the brains and encodes the desired behaviour of the circuit.
Examples
- Carry-save adders: the two outputs are the sum of the three inputs, but do not specify exact values of each individual output.
- Arbiters: every request is eventually granted, but do not specify in which order requests are granted.
- One-hot encoding: exactly one bit of vector is a 1, but do not specify which bit is a 1.
NOTE: Relational specification vs relational testbench
"Relational specification" and "relational testbench" are often used interchangeably.
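The one-hot example lends itself to a short sketch of a relational check. The package below is a hypothetical helper, not part of the course code: the names onehot_pkg and is_one_hot are invented for illustration. It counts the '1' bits in a vector and accepts the vector only when exactly one bit is set, without saying which bit it is.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper package for a relational check.
package onehot_pkg is
  function is_one_hot(v : std_logic_vector) return boolean;
end onehot_pkg;

package body onehot_pkg is
  function is_one_hot(v : std_logic_vector) return boolean is
    variable count : natural := 0;
  begin
    -- Count the bits that are '1'; any other value is not counted.
    for i in v'range loop
      if v(i) = '1' then
        count := count + 1;
      end if;
    end loop;
    -- The relation: exactly one bit is '1', whichever bit it is.
    return count = 1;
  end is_one_hot;
end onehot_pkg;
```

A check process in a testbench could then compute something like ok <= is_one_hot(grant), where grant stands for whatever one-hot output vector the implementation drives. Note that this sketch does not treat 'H' as a '1' and does not reject 'X' or 'U' bits; a stricter check would need to decide how to handle those values.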
3.3.4
[Figure: testbench containing stimulus, check, and implementation]

    architecture main of athabasca_tb is
      component declaration for implementation;
      other declarations
    begin
      implementation instantiation;
      stimulus process;
      specification process (or component instantiation);
      check process;
    end main;
3.3.5 Datapath vs Control
Datapath Validation
Datapath circuits tend to be well-suited to reference-model style testbenches:
- Each set of inputs generates one set of outputs
- Each set of outputs is a function of just one set of inputs
Control Validation
Control circuits often pose problems for testbenches:
- Many more internal signals than outputs.
- The behaviour of the outputs provides a view into only a fragment of the current state of the circuit.
- It may take many clock cycles from when a bug is exercised inside the circuit until it generates a deviation from the correct behaviour on the outputs.
- When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause of the deviation (the root cause of the bug).
Assertions (Lec-12) can be used to check the behaviour of internal signals. Control circuits tend to use assertions to check correctness and rely on testbenches only to stimulate inputs.
3.4
Implementation

    library ieee;
    use ieee.std_logic_1164.all;

    entity and2 is
      port (
        a, b : in  std_logic;
        c    : out std_logic
      );
    end and2;

    architecture main of and2 is
    begin
      -- drive '1' only when both inputs are '1'
      c <= '1' when (a = '1' AND b = '1') else '0';
    end main;
3.4.1 A Spec-Less Testbench
(NOTE: this code has not been checked for correctness)
First, use waveform viewer to check that implementation generates reasonable outputs for a small set of inputs.

    library ieee;
    use ieee.std_logic_1164.all;

    entity and2_tb is
    end and2_tb;

    architecture main_tb of and2_tb is
      component and2
        port (
          a, b : in  std_logic;
          c    : out std_logic
        );
      end component;
      signal ta, tb, tc_impl : std_logic;
      signal ok : boolean;
    begin
      ---------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ---------------------------------------------
      stimulus : process
      begin
        ta <= '0'; tb <= '0';
        wait for 10 ns;
        ta <= '1'; tb <= '1';
        wait for 10 ns;
      end process;
      ---------------------------------------------
    end main_tb;

Use this testbench until the implementation generates solid Boolean values (no 'X' or 'U' data) and you have checked that a few simple test cases generate correct outputs.
3.4.2
Writing code to drive inputs and repetitively typing "wait for 10 ns;" can get tedious, so code up test vectors in an array.
(NOTE: this code has not been checked for correctness)

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --  a    b
          ( ('0', '0'),
            ('1', '1') );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
    end main_tb;

Use this testbench until checking the correctness of the outputs by hand using waveform viewer becomes difficult.
3.4.3
(NOTE: this code has not been checked for correctness)
After a few test vectors appear to be working correctly (via a manual check of waveforms on simulation), begin automatically checking that outputs are correct.

    architecture main_tb of and2_tb is
      ...
    begin
      ------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ------------------------------------------
      stimulus : process
        type test_datum_ty is record
          ra, rb, rc : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          -- a, b: inputs
          -- c   : expected output
          --  a    b    c
          ( ('0', '0', '0'),
            ('0', '1', '0'),
            ('1', '1', '1') );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta      <= test_vectors(i).ra;
          tb      <= test_vectors(i).rb;
          tc_spec <= test_vectors(i).rc;
          wait for 10 ns;
        end loop;
      end process;
      ------------------------------------------
      check : process (tc_impl, tc_spec)
      begin
        ok <= (tc_impl = tc_spec);
      end process;
      ------------------------------------------
    end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test case.
3.4.4
Rather than write the specification as part of stimulus, create a separate specification entity/architecture. The specification component then calculates the expected output values. (NOTE: if your simulation tool supports configurations, the spec and impl can share the same entity; we'll see this in section 3.5.)

    entity and2_spec is
      ...(same as and2 entity)...
    end and2_spec;

    architecture spec of and2_spec is
    begin
      c <= a AND b;
    end spec;

    architecture main_tb of and2_tb is
      component and2 ...;
      component and2_spec ...;
      signal ta, tb, tc_impl, tc_spec : std_logic;
      signal ok : boolean;
    begin
      ------------------------------------------
      impl : and2      port map (a => ta, b => tb, c => tc_impl);
      spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
      ------------------------------------------
      stimulus : process
        -- declarations go before the process "begin"
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --  a    b
          ( ('0', '0'),
            ('1', '1') );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
      ------------------------------------------
      check : process (tc_impl, tc_spec)
      begin
        ok <= (tc_impl = tc_spec);
      end process;
      ------------------------------------------
    end main_tb;
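The testbenches so far record the comparison in the ok signal, which still has to be inspected in a waveform viewer. As a sketch of how the check could print mismatches with assert and report instead, here is a complete, self-contained variant. It is an illustration under assumptions, not the course's code: the testbench name and2_report_tb is invented, the specification is written inline as an expression rather than as a separate entity, and the comparison is done inside the stimulus process after each wait so that outputs have settled.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (a, b : in std_logic; c : out std_logic);
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' and b = '1') else '0';
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench name; not from the course material.
entity and2_report_tb is
end and2_report_tb;

architecture main_tb of and2_report_tb is
  signal ta, tb, tc_impl, tc_spec : std_logic;
begin
  impl : entity work.and2 port map (a => ta, b => tb, c => tc_impl);

  -- Specification written directly as an expression of the inputs.
  tc_spec <= ta and tb;

  stimulus : process
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty loop
      for vb in std_test_ty loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
        -- Outputs have settled; print a message only on a mismatch.
        assert tc_impl = tc_spec
          report "and2 mismatch: impl=" & std_logic'image(tc_impl) &
                 " spec=" & std_logic'image(tc_spec)
          severity error;
      end loop;
    end loop;
    report "and2_report_tb: all vectors applied";
    wait;  -- suspend forever so the simulation can end
  end process;
end main_tb;
```

Checking inside the stimulus process avoids false alarms while signals are still changing; a separate check process, as in the notes, works too if it samples at a point where both outputs are stable.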
3.4.5
When it becomes tedious to write out each test vector by hand, we can automatically compute them. This example uses a pair of nested for loops to generate all four combinations of input values for two signals.

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        -- restrict std_logic to the two values used as stimuli
        subtype std_test_ty is std_logic range '0' to '1';
      begin
        for va in std_test_ty'low to std_test_ty'high loop
          for vb in std_test_ty'low to std_test_ty'high loop
            ta <= va;
            tb <= vb;
            wait for 10 ns;
          end loop;
        end loop;
      end process;
      ...
    end main_tb;
3.4.6 Relational Specification
Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value. To do this, we drop the spec process, and put the brains into the check process.

    architecture main_tb of and2_tb is
      ...
    begin
      ------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ------------------------------------------
      stimulus : process
        ...
      end process;
      ------------------------------------------
      -- sensitive to every signal the check reads
      check : process (tc_impl, ta, tb)
      begin
        ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
      end process;
      ------------------------------------------
    end main_tb;
LEC-12 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
    Lec-11 Datapath Validation and Testbenches
    Lec-12 Control Validation and Assertions
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
This lecture illustrates techniques for validating state machines by using a FIFO queue. The lecture goes over an implementation, specification, and testbench. The verification uses assertions and coverage monitors inside the implementation to improve the chances of catching bugs.
Concepts
Don't care conditions: Conditions or situations where we don't care what the implementation does.
Use of uninitialized data: Implementation should start with 'U' on all signals.
assert and report statements: Printing error messages to the screen.
Concepts (Cont'd)
Instrumentation code: Code that is added to design but will not appear in hardware. Used to measure (instrument) behaviour of internal signals in circuit. Often used to aid in validation, performance analysis, etc.
Coverage monitors: Processes that help check if test vectors are fully exercising behaviour of implementation.
Assertions: Properties that behaviour of internal signals should obey.
Concepts (Cont'd)
- Running multiple scenarios from one test bench
- General VHDL coding guidelines
- separate package and package body
- assert, report
- textio package: read, write,
Background
State machine design
Reading
None
3.5
In this section, we will explore the functional validation of state machines via a First-In First-Out queue. The VHDL code for the queue is on the web at: http://www.ece.uwaterloo.ca/ece427/exs/queue
- Control circuits have many internal signals. Testbenches are unable to access key information about the behaviour of a control circuit.
- Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect value and when an output signal shows the effect of the bug.
3.5.1
[Figure: queue example; recoverable labels: queue, Empty, Read 1, contents B to J, signals do_rd, data_rd, empty]
3.5.2 VHDL Coding
3.5.2.1 Package
Things to notice in queue package:
1. separation of package and body

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package queue_pkg is
      subtype data is std_logic_vector(3 downto 0);
      function to_data(i : integer) return data;
    end queue_pkg;

    package body queue_pkg is
      function to_data(i : integer) return data is
      begin
        return std_logic_vector(to_unsigned(i, 4));
      end to_data;
    end queue_pkg;
3.5.3
Validation things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions
3.5.4 Instrumentation Code

    process (clk)
    begin
      if rising_edge(clk) then
        prev_rd_idx <= rd_idx;
        prev_wr_idx <= wr_idx;
        prev_do_rd  <= do_rd;
        prev_do_wr  <= do_wr;
      end if;
    end process;

- Added to implementation to support validation
- Usually keeps track of previous values of signals
- Does not create hardware (optimized away during synthesis)
- Does not feed any output signals
- Must use synthesizable subset of VHDL
Naming Convention
NOTE: Naming convention for instrumentation
For assertions, signals are named prev_signame and signame, rather than next_signame and signame as is done for state machines. This is because for assertions we use the prev signals as history signals, to keep track of past events. In contrast, for state machines, we name the signals next, because the state machine computes the next values of signals.
3.5.5 Coverage Monitors
The goal of a coverage monitor is to check if a certain event is exercised in a simulation run. If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that will trigger the monitor. For example, for a circuit used in a microwave oven controller, we might want to make sure that we simulate the situation when the door is opened while the power is on.
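The microwave oven example could be monitored along the following lines. This is only a sketch: the entity name door_cov_monitor and the signals door_open and power_on are hypothetical, and in a real design the process would more likely sit inside the implementation's architecture. It keeps a history signal for the door and prints a note the first cycle the door is seen open while the power is on.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical monitor; entity and signal names are illustrative.
entity door_cov_monitor is
  port (clk, door_open, power_on : in std_logic);
end door_cov_monitor;

architecture main of door_cov_monitor is
  -- History signal, following the prev_ naming convention.
  signal prev_door_open : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      prev_door_open <= door_open;
      -- Event of interest: the door goes from closed to open
      -- while the power is on.
      if prev_door_open = '0' and door_open = '1' and power_on = '1' then
        report "coverage: door opened while power on" severity note;
      end if;
    end if;
  end process;
end main;
```

If the message never appears in the simulation transcript, the test suite has not exercised this situation and needs another test vector.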
[Figure: previous (Prev) and current (Now) positions of the rd and wr indices for two example events]
- wr_idx and rd_idx are far apart
- wr_idx and rd_idx are equal
- wr_idx catches rd_idx
- rd_idx catches wr_idx
- rd_idx wraps
- wr_idx wraps
3.5.6 Assertions
Assertion Template

    process (signals read)
    begin
      assert (required condition)
        report "error: message"
        severity warning;
    end process;
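As one possible instance of the template, the sketch below asserts that a read is never requested while the queue is empty. The property itself is an assumption chosen for illustration (the notes do not say whether the queue must tolerate such reads), and the wrapper entity queue_rd_assert is hypothetical; do_rd and empty are the signal names that appear in the queue figures.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper so the assertion can be compiled on its own.
entity queue_rd_assert is
  port (do_rd, empty : in std_logic);
end queue_rd_assert;

architecture main of queue_rd_assert is
begin
  -- Re-evaluated whenever one of the signals it reads changes.
  process (do_rd, empty)
  begin
    assert not (do_rd = '1' and empty = '1')
      report "error: read attempted while queue is empty"
      severity warning;
  end process;
end main;
```

The sensitivity list contains exactly the signals the condition reads, matching the "signals read" slot in the template.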
3.5.7
Functions

    function to_idx (i : natural range data_array'low to data_array'high)
      return idx_ty is
    begin
      return to_unsigned(i, idx_ty'length);
    end to_idx;

Conversion to Index
Without Function | With Function
rd_idx <= to_unsigned(5, 3); | rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.
Attributes

    function inc_idx (idx : idx_ty) return idx_ty is
    begin
      if idx < data_array'high then
        return (idx + 1);
      else
        return (to_idx(data_array'low));
      end if;
    end inc_idx;
Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine modes of signals. Modifying a signal within a procedure results in a tri-state signal. This is bad.
These functions can be used to read test vectors from a file and write results to a file.
3.5.8 Queue Specification
Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices. The specification should be obviously correct. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.
Things to Notice
Things to notice in queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)
Don't Care

    rd_data <= data_array(rd_idx) when (do_rd = '1')
               else (others => '-');
3.5.9 Queue Testbench
Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data 'U'
3. std_match to compare spec and impl data
With equality, '-' ≠ '1', but we want to use '-' to mean don't care in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
[Table: matching rules; recoverable entries: '0' and 'L' match, '1' and 'H' match, '-' matches everything]
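The difference between = and std_match can be seen in a few lines. The sketch below is self-contained and uses invented names (std_match_demo, spec_data, impl_data); std_match itself is the function provided by ieee.numeric_std. Both assertions are expected to pass silently, so the only output should be the final report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical demo entity; names are illustrative only.
entity std_match_demo is
end std_match_demo;

architecture main of std_match_demo is
begin
  process
    -- Specification says "don't care" for every bit.
    constant spec_data : std_logic_vector(3 downto 0) := "----";
    -- Implementation produced a concrete value.
    constant impl_data : std_logic_vector(3 downto 0) := "1010";
  begin
    -- Plain equality compares '-' literally, so the vectors differ.
    assert not (spec_data = impl_data)
      report "unexpected: = treated '-' as a wildcard"
      severity error;
    -- std_match treats '-' as matching any value.
    assert std_match(spec_data, impl_data)
      report "unexpected: std_match did not accept a don't care"
      severity error;
    report "std_match demo finished";
    wait;
  end process;
end main;
```

In a queue testbench the same call would compare the specification's rd_data against the implementation's rd_data.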
Chapter 4
4.1 Intro
4.1.1 Concepts
- definition of performance
- different ways of measuring performance
- comparing performance (speedup, n% faster)
- improving performance
- Amdahl's law (limits on performance improvements)
4.1.2 Background Material
4.1.3 Reading Material
Performance is not described in Smith's book. Hennessy and Patterson's Quantitative Computer Architecture (textbook for E&CE 429) has good information on performance.
LEC-13 Preliminaries
Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
    Lec-13 Computer Performance
    Lec-14 Digital Circuit Performance
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Concepts
- definition of performance
- different ways of measuring performance
- comparing performance (speedup, n% faster)
- improving performance
- Amdahl's law (limits on performance improvements)
- clock speed, program length, CPI, and performance
Background
Algebra, basic familiarity with assembly language
Reading
Performance is not described in Smith's book. Hennessy and Patterson's Quantitative Computer Architecture (textbook for E&CE 429) has good information on performance.
4.2 Defining Performance
$$\text{Performance} = \frac{\text{Work}}{\text{Time}}$$
You can double your performance by:
- doing twice the work in the same amount of time, OR
- doing the same amount of work in half the time
Benchmarking
$$\text{Performance} = \frac{\text{Work}}{\text{Time}}$$
Measuring time is easy, but how do we accurately measure work? The game of benchmarking is finding a definition of work that makes your system appear to get the most work done in the least amount of time.
Measure of Work | Measure of Performance
clock cycle | MHz
instruction | MIPs
synthetic program | Dhrystone, Whetstone, D-MIPs (Dhrystone MIPs)
real program | SPEC
travel 1/4 mile | drag race
SPEC Benchmarks
Definition (SPEC): Standard Performance Evaluation Corporation. MISSION: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems (http://www.spec.org).
4.3 Comparing Performance
We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).
Printer | Black and White | Colour
printer1 | 9 ppm | 6 ppm
printer2 | 12 ppm | 4 ppm
n% faster
BW Performance
Answer:
$$BW_1 = \frac{1}{9\ \text{ppm}} = 0.1111\ \text{min/page}$$
$$BW_2 = \frac{1}{12\ \text{ppm}} = 0.0833\ \text{min/page}$$
$$\text{BWFaster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{BW_1 - BW_2}{BW_2} = \frac{0.1111 - 0.08333}{0.08333} = 33\%\ \text{faster}$$
4.3.1
Question: If average workload is 90% BW and 10% Colour, which printer is faster and how much faster is it? A potentially helpful formula is the average time to do one of k different tasks:
$$T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i)$$
Answer:
$$T_{Avg1} = (\%BW)(BW_1) + (\%C)(C_1) = (0.90)(0.1111) + (0.10)(0.1667) = 0.1167\ \text{min/page}$$
$$T_{Avg2} = (\%BW)(BW_2) + (\%C)(C_2) = (0.90)(0.0833) + (0.10)(0.2500) = 0.1000\ \text{min/page}$$
$$\text{AvgFaster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{Avg_1 - Avg_2}{Avg_2} = \frac{0.1167 - 0.1000}{0.1000} = 16.7\%\ \text{faster}$$
4.3.2 Optimizing Performance
Question: If we want to optimize printer1 to match performance of printer2, should we optimize BW or Colour printing?
Answer:
Colour printing is slower, so it appears that we can save more time by optimizing colour printing. However, look at the extreme case of optimizing colour printing to be instantaneous for P1:
[Figure: bar chart of average time per page for P1 and P2; axis from 0.000 to 0.150 min/page]
Even if we made colour printing instantaneous for printer 1 and kept it the same for printer 2, printer 1 would not be measurably faster. Amdahl's law: Make the common case fast.
Optimizations need to take into account both run time and frequency of occurrence.
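The extreme case can be checked with the average-time formula, using the 90% BW and 10% colour workload and the page times computed earlier (0.1111 min/page for printer 1 BW, 0.1000 min/page average for printer 2). Setting printer 1's colour time to zero gives:

```latex
\[
T'_{Avg1} = (0.90)(0.1111) + (0.10)(0) = 0.1000\ \text{min/page} = T_{Avg2}
\]
```

Even a perfect colour engine only brings printer 1 level with printer 2, because colour pages are just 10% of the workload.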
NOTE: Hmmmm This question was actually humorous during the high-tech bubble...
Answer:
Hire more marketing people! Notice that colour printing on printer 1 is faster than on printer 2. So, marketing suggests that people are increasing the percentage of printing that is done in colour.
Question: Revised question: what percentage of printing must be done in colour for printer1 to beat printer2?
Answer:
$$T_{Avg1} = T_{Avg2}$$
$$(\%BW)(BW_1) + (\%C)(C_1) = (\%BW)(BW_2) + (\%C)(C_2)$$
$$\%BW = 1 - \%C$$
$$(1 - \%C)(BW_1) + (\%C)(C_1) = (1 - \%C)(BW_2) + (\%C)(C_2)$$
$$\%C = \frac{BW_1 - BW_2}{(BW_1 - BW_2) + (C_2 - C_1)} = \frac{0.1111 - 0.0833}{(0.1111 - 0.0833) + (0.2500 - 0.1667)} = 0.25$$
4.4
4.4.1 Mathematics
CPI | Cycles per instruction
NumInsts | Number of instructions
ClockSpeed | Clock speed
$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$
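A quick numeric instance of the Time equation, with values invented purely for illustration (a program of one billion instructions, a CPI of 1.5, and a 1 GHz clock):

```latex
\[
\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}
            = \frac{10^{9}\ \text{instrs} \times 1.5\ \tfrac{\text{cycles}}{\text{instr}}}{10^{9}\ \tfrac{\text{cycles}}{\text{sec}}}
            = 1.5\ \text{secs}
\]
```

Halving the CPI or doubling the clock speed would each halve the execution time, which is why the three terms have to be compared together.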
4.4.2
The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.
Answer: SPECint, SPECfp, and SPEC are measures of performance. Therefore, the higher the SPEC number, the higher the performance. The Fujitsu SPARC64 has higher performance.
Relative CPI
Question: What is the ratio between the CPIs of the two microprocessors?
Absolute CPI
Question: Can you determine the absolute (actual) CPI of either microprocessor?
4.4.3 Summary of Equations
$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$
$$T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i)$$
LEC-14 Preliminaries
Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
    Lec-13 Computer Performance
    Lec-14 Digital Circuit Performance
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
In this lecture we relate the general performance equations from Lec-13 to dataflow diagrams.
Concepts
- predicting performance for dataflow diagrams
- choosing clock speed in dataflow diagrams
- instruction scheduling
- dataflow diagrams with multiple instructions and performance
- design effort vs performance
4.5
4.5.1
4.5.1.1 Tradeoffs
When partitioning dataflow diagrams into clock cycles, we need to take both area and performance into account.
Goal | Action | Effect
Minimize area | decrease clock period | fewer operations per clock cycle, so fewer datapath components and more opportunities to reuse hardware
Increase scheduling flexibility | | more flexibility in grouping operations in clock cycles
Decrease percentage of clock cycle spent in flops (overhead: time in flops is not doing useful work) | | decreases number of flops that data traverses through
Decrease time to execute an instruction | ???? |
General Plan
Our general plan to find the clock period for maximum performance is:
1. Pick clock period to be delay through slowest component + delay through flop.
2. For each instruction, for each operation, schedule the operation in the earliest clock cycle possible without violating clock-period timing constraints.
3. Calculate average time to execute an instruction. Combine:
$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$
and:
$$\text{CPI}_{avg} = \sum_{i=1}^{k} \%_i \times \text{CPI}_i$$
to derive:
$$\text{Time} = \frac{\text{NumInsts} \times \sum_{i=1}^{k} (\%_i \times \text{CPI}_i)}{\text{ClockSpeed}}$$
4. If the maximum latency through the dataflow diagram is greater than 1, then increase clock period by the minimum amount needed to decrease latency by one clock period and return to Step 2.
5. If the maximum latency through the dataflow diagram is 1, then the clock period for highest performance is the clock period resulting in the fastest Time.
6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences of a component per instruction per clock cycle without increasing latency for any instruction.
4.5.2
[Figure: dataflow diagrams for the example instructions; recoverable labels: Instruction B; components g (50 ns), h (20 ns), i (40 ns)]
Scheduling (1)
55ns Clock Period
[Figure: schedule of f (30 ns), g (50 ns), h (20 ns), and i (40 ns) operations; clock-cycle boundaries labelled 55 ns and 75 ns]
Scheduling (2)
85ns Clock Period
[Figure: schedule of f (30 ns), g (50 ns), h (20 ns), and i (40 ns) operations; clock-cycle boundaries labelled 85 ns and 95 ns]
Scheduling (3)
155ns Clock Period
[Figure: schedule of f (30 ns), g (50 ns), h (20 ns), and i (40 ns) operations in a single 155 ns clock cycle]
Answer:
Clock period | CPIA | CPIB | Tavg
55 ns | 4 | 2 | 55 × (0.5 × 4 + 0.5 × 2)
75 ns | 3 | 2 | 75 × (0.5 × 3 + 0.5 × 2)
85 ns | 2 | 2 | 85 × (0.5 × 2 + 0.5 × 2)
95 ns | 2 | 1 | 95 × (0.5 × 2 + 0.5 × 1)
155 ns | 1 | 1 | 155 × (0.5 × 1 + 0.5 × 1)
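Evaluating the average time per instruction for each candidate clock period is straightforward arithmetic. The sketch below assumes that instructions A and B each make up 50% of the workload, which is what the 0.5 weights in this answer suggest, and uses the clock periods and CPI values listed for this problem.

```latex
\begin{align*}
T_{avg}(55\ \text{ns})  &= 55  \times (0.5 \times 4 + 0.5 \times 2) = 165\ \text{ns} \\
T_{avg}(75\ \text{ns})  &= 75  \times (0.5 \times 3 + 0.5 \times 2) = 187.5\ \text{ns} \\
T_{avg}(85\ \text{ns})  &= 85  \times (0.5 \times 2 + 0.5 \times 2) = 170\ \text{ns} \\
T_{avg}(95\ \text{ns})  &= 95  \times (0.5 \times 2 + 0.5 \times 1) = 142.5\ \text{ns} \\
T_{avg}(155\ \text{ns}) &= 155 \times (0.5 \times 1 + 0.5 \times 1) = 155\ \text{ns}
\end{align*}
```

Under these assumptions the shortest clock period is not the fastest overall, since the extra cycles per instruction outweigh the shorter cycle time.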
Answer:
[Figure: schedules of f (30 ns), g (50 ns), h (20 ns), and i (40 ns) operations for several clock periods; clock-cycle boundaries labelled 55 ns, 75 ns, 85 ns, 95 ns, and 135 ns]
CPIA 4 3 2 2 2 2 1
CPIB 3 3 3 2 2 1 1
A clock period of 155 ns results in the highest performance. For a clock period of 105 ns, we did not calculate the performance, because we could see that it would be worse than the performance with a clock period of 95 ns. The dataflow diagram with a 105 ns clock period has the same latency as the diagram with a clock period of 95 ns. If the dataflow diagram with the longer clock period has the same latency as the diagram with the shorter clock period, then the diagram with the longer clock period will have lower performance.
CPIA 3 3 2 2 1
CPIB 3 2 2 1 1
A clock period of 155 ns results in the lowest average execution time, and hence the highest performance. This is the same answer as the previous problem, but the total times for higher clock frequencies differ significantly between the two problems.
LEC-14:
4.5.3
20
Component Delays
2-input Mult   40ns
2-input Add    25ns
Register       5ns
9 C8 @ @ @ 8 A8 9 7 R8 9 Q7 7 9 9 @ @ 7
e
Algorithm b a b b d i j k l m
LEC-14:
4.5.3
21
NOTES
- There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
- You must put registers on your inputs; you do not need to register your outputs. The environment will directly connect your outputs (its inputs) to registers.
- Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.
LEC-14:
4.5.3
22
Questions
Question: What clock period will result in the best overall performance?
Question: Find a minimal set of resources that will achieve the performance you calculated.
Answer:
a b d
*
70ns a*b
*
b*d e (a*b) + (b*d) (a*b) + (b*d) + e (a*b)*((a*b) + (b*d) + e)
+ + *
CPI = 2 InstP
i j k
+ + +
70ns l m
*
CPI = 2 InstQ
LEC-14:
4.5.3
23
Resource Usage
Fastest execution time   140ns
Clock period             70ns
Inputs                   3
Outputs                  1
Registers                3
Adders                   2
Multipliers              2
LEC-14:
4.5.4
24
Performance vs Area
You are designing a 16-bit barrel shifter. You have the option of supporting an entire 15-bit shift in a single clock cycle (which gives a latency of 1 clock cycle), shifting 1 bit per clock cycle (which gives a latency of 15 clock cycles), or anything in between. You do the design and measure the following information:
Max Shift      1      3      7      15
Min Period     21ns   27ns   34ns   40ns
Area (CLBs)    13     36     57     53
Question: Which circuit gives you the best optimality, in terms of MIPs/CLB?
Answer: Assume that all shift amounts have the same probability of occurrence. Shift amounts can be anywhere from 0 (no shift) to 16 (shift all data out, leaving only zeroes). The data for the shift amounts and latencies were generated using Synopsys Design Compiler for a Xilinx FPGA.
Max Shift    1       3       7       15
Min Period   21ns    27ns    34ns    40ns
Latency      15      5       3       1
Time         315ns   135ns   102ns   40ns
MIPs         3.2     7.4     9.8     25
Area         13      36      57      53
MIPs/CLB     0.25    0.21    0.17    0.47
Max shift of 1
Task i is to shift by i bits. A shift amount of i requires i clock cycles.
$$T_{Avg} = \sum_{i=0}^{16} \%_i T_i = \sum_{i=0}^{16} \frac{1}{17} (i \times ClkPeriod) = \frac{1}{17} \, ClkPeriod \sum_{i=0}^{16} i = \frac{1}{17} \times 21 \times 136 = 168\,ns$$
LEC-14:
4.5.4
25
New assumptions:
1. All shift amounts have the same probability of occurrence.
2. The latency of a shift operation is dependent upon the shift amount.
3. Shift amounts can be anywhere from 0 (no shift) to 15 (shift least-significant bit to most-significant position).
4. Shifting by 0 requires 1 clock cycle.
Question: With the revised assumptions, which circuit gives you the best optimality, in terms of MIPs/CLB?
Answer:
Max shift of 1
Task i is to shift by i bits. A shift amount of i requires i clock cycles. The exception is i = 0, which requires 1 clock cycle.
$$T_{Avg} = \sum_{i=0}^{15} \%_i T_i = \frac{1}{16} ClkPeriod + \sum_{i=1}^{15} \frac{1}{16} (i \times ClkPeriod) = \frac{1}{16} ClkPeriod + \frac{1}{16} ClkPeriod \sum_{i=1}^{15} i = \frac{1}{16} \times 21 + \frac{1}{16} \times 21 \times 120 = 158\,ns$$
Max shift of 3
$T_i = i \times ClkPeriod$ and $\%_i = 0.20$
$$T_{Avg} = \sum_{i=1}^{5} \%_i T_i = \sum_{i=1}^{5} 0.20 \times i \times 27 = 81\,ns$$
Max shift of 7
$T_i = i \times ClkPeriod$ and $\%_i = 0.33$
$$T_{Avg} = \sum_{i=1}^{3} \%_i T_i = \sum_{i=1}^{3} 0.33 \times i \times ClkPeriod = 67\,ns$$
Max shift of 15
Shift amounts 0 to 15 are 1 task. Latency = 1, $T_i = ClkPeriod$ and $\%_i = 1.00$.
$$T_{Avg} = \sum_{i=1}^{1} \%_i T_i = 40\,ns$$
Max Shift   1     3     7     15
Latency     15    5     3     1
MIPs        6.3   12    15    25
Area        13    36    57    53
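The last step of this answer, dividing MIPs by area, can be checked with a short script. This is a minimal sketch in Python, which the notes themselves do not use; the names `designs`, `mips`, `mips_per_clb`, and `best_design` are illustrative assumptions, and the average times are the rounded values (158, 81, 67, 40 ns) from the calculations above.

```python
# Average time per shift (ns) and area (CLBs), keyed by maximum shift per cycle.
# The numbers are the rounded values from the notes; the names are hypothetical.
designs = {
    1: {"t_avg_ns": 158.0, "area_clbs": 13},
    3: {"t_avg_ns": 81.0, "area_clbs": 36},
    7: {"t_avg_ns": 67.0, "area_clbs": 57},
    15: {"t_avg_ns": 40.0, "area_clbs": 53},
}


def mips(t_avg_ns):
    """Millions of operations per second when one operation takes t_avg_ns."""
    return 1000.0 / t_avg_ns


def mips_per_clb(t_avg_ns, area_clbs):
    """Optimality metric: throughput divided by area."""
    return mips(t_avg_ns) / area_clbs


def best_design(table):
    """Return the key (max shift) with the highest MIPs/CLB."""
    return max(table, key=lambda k: mips_per_clb(**table[k]))


if __name__ == "__main__":
    for max_shift, d in designs.items():
        print(max_shift, round(mips(d["t_avg_ns"]), 1), round(mips_per_clb(**d), 2))
    print("best:", best_design(designs))
```

Because the inputs are rounded, the ranking between close candidates should be read with some caution.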
LEC-14:
4.5.5
27
Options
You have three options:
option 1: no change
option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.
LEC-14:
4.5.6
30
Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
Chapter 5
Timing Analysis
LEC-14:
5.1
PRELIMINARIES
34
5.1
Preliminaries
LEC-14:
5.1.1
Overview
35
5.1.1
Overview
- Clock skew
- Clock jitter
- Clock-to-Q delay (latch and flop)
- Setup time (latch and flop)
- Hold time (latch and flop)
- Capacitive load delay
- Critical path
- False path
- Setup, hold, and clock-to-Q times in hierarchical circuits
- Propagation delay
- Interconnect (wire) delay
- Load delay
- Elmore time constant
- Worst case timing
- Derating factors
- Speed binning
LEC-14:
5.1.2
Background Material
36
5.1.2
Background Material
- resistance, capacitance, voltage equations over time
- flip-flop timing, setup and hold times (Mano, Digital Design 6-3)
- digital view of CMOS transistor behaviors
- a tiny bit of calculus integration in Lec-12
LEC-14:
5.1.3
Reading Material
37
5.1.3
Reading Material
There is a tremendous amount of material on delay and timing scattered throughout Smith's book.
Chapter 2: transistor and logic review
  2.1: transistor review
  2.4: combinational logic cells
  2.5: sequential logic cells
Chapter 3: fundamentals of timing and delay
  3.1-3.2: transistors and delay
Chapter 5: timing and delay within cells
  5.1.5-5.1.7: Actel cells
  5.2.4: Xilinx LCA timing
  5.4.2: Altera MAX timing
Chapter 7: timing and delay between cells
  7.1: Actel interconnect
  7.2 7.4: Xilinx LCA timing
  7.4: Altera MAX timing (constant delay for all interconnect)
Chapter 13: simulation
  13.1: levels of temporal abstraction for simulation
  13.2: simulation example
  13.5: different simulation models for hardware
  13.6: delay models
  13.7: static timing analysis
Chapter 16
  16.1.2: clock trees and timing in floorplanning
Chapter 17
  17.1.2: timing in routing
Suggestion:
- skim/read Chs 2 and 3 to refresh/learn fundamentals of delay
- skim relevant sections of Chs 4, 7, 13, 16, 17
- read in depth: 5.1.5-5.1.7, 7.1, 13.2, 13.6, 13.7, 16.1.2
- read remaining sections as time and interest dictate
LEC-15 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
LEC-15 Preliminaries
Schedule
wk-01-02   VHDL
wk-03-05   Design and Optimization
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08-09   Timing Analysis
             Lec-15 Introduction
             Lec-16 Math, Physics and Applications
             Lec-17 Timing Analysis of Storage Elements
           Power Analysis and Reduction
           Faults and Testing
           Review
LEC-15 Preliminaries
Overview
This lecture introduces the fundamentals of timing analysis. In particular, how do we determine the fastest clock speed that a circuit will support?
LEC-15 Preliminaries
Concepts
- Minimum clock period
- Hold constraint
- Clock skew
- Clock latency
- Clock jitter
- Setup time
- Hold time
- Clock-to-Q time
- Cause and effect of timing violations
- Propagation delay
- Load delay
- Interconnect delay
- Critical path
- False path
LEC-15 Preliminaries
Background
For those who took E&CE-324, there is some overlap between the material in this chapter and the material in E&CE-324. In E&CE-427, we will focus on calculating the critical path of a circuit and on techniques to calculate the timing parameters of a storage device (e.g. latch or flop). One terminology difference: what was called margin in E&CE-324 will be called slack in E&CE-427.
LEC-15 Preliminaries
Reading Material
There is a tremendous amount of material on delay and timing scattered throughout Smith's book.
NOTE: Reading and exam. All of the exam material will come from the course notes, but it could be helpful to read the relevant sections in Smith's book to better understand the material.
Chapter 2: Transistor and logic review
  2.1: transistor review
  2.4: combinational logic cells
  2.5: sequential logic cells
Chapter 3: Fundamentals of timing and delay
  3.1-3.2: transistors and delay
Chapter 16: Floorplanning and placement
  16.1.2: clock trees and timing in floorplanning
Chapter 17: Routing
  17.1.2: timing in routing
LEC-15 Preliminaries
- skim/read Chs 2 and 3 to refresh/learn fundamentals of delay
- skim relevant sections of Chs 4, 7, 13, 16, 17
- read in depth:
    2.5.2 (setup and hold)
    6.5.1 (clocks)
    3.1 (timing model)
    13.2, 13.6, 13.7 (timing models and timing analysis)
    7.1 (interconnect delay)
    16.1.2 (interconnect delay)
    5.1.5-5.1.7 (timing analysis of storage devices)
LEC-15:
5.2
10
5.2
LEC-15:
5.2.1
11
5.2.1
LEC-15:
5.2.1
12
Fanin
y0 y1 y2 y3 y4 x
Definition fanin: The fanin of a gate or signal x is the set of all gates or signals y where an input of x is connected to an output of y.
LEC-15:
5.2.1
13
Fanout
y0 x y1 y2 y3 y4
Definition fanout: The fanout of a gate or signal x is the set of all gates or signals y where an output of x is connected to an input of y.
LEC-15:
5.2.1
14
y2 y3 y4
Definition immediate fanin/fanout: The phrases immediate fanout and immediate fanin mean that there is a direct connection between the gates.
LEC-15:
5.2.1
15
Definition transitive fanin/fanout: The phrases transitive fanout and transitive fanin mean that there is either a direct or indirect connection between the gates.
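The four definitions above can be made concrete on a small netlist. The following is a minimal Python sketch, assuming a circuit is represented as a dictionary that maps each gate or signal to the list of gates it drives; the representation, the function names, and the example netlist are all illustrative assumptions, not part of the notes.

```python
from collections import deque


def immediate_fanout(netlist, x):
    """Gates whose input is directly connected to an output of x."""
    return set(netlist.get(x, ()))


def transitive_fanout(netlist, x):
    """Gates reachable from x through one or more connections."""
    seen = set()
    queue = deque(netlist.get(x, ()))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(netlist.get(node, ()))
    return seen


def immediate_fanin(netlist, x):
    """Gates whose output is directly connected to an input of x."""
    return {src for src, dsts in netlist.items() if x in dsts}


def transitive_fanin(netlist, x):
    """Gates that reach x through one or more connections."""
    reverse = {}
    for src, dsts in netlist.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return transitive_fanout(reverse, x)


# Hypothetical netlist: a and b drive g1, b also drives g2, g1 and g2 drive g3.
netlist = {"a": ["g1"], "b": ["g1", "g2"], "g1": ["g3"], "g2": ["g3"], "g3": []}
```

Here `g3` is in the transitive fanout of `b` but not in its immediate fanout.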
LEC-15:
5.2.1
16
LEC-15:
5.2.2
Timing Constraints
17
5.2.2
Timing Constraints
For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown in table 5.1. Each of these timing parameters is described in more detail in section 5.2.3.
LEC-15:
5.2.2
Timing Constraints
18
CO
LEC-15:
5.2.2
Timing Constraints
19
Propagation Delay
Definition Propagation Delay: Sum of interconnect and load delay.
LEC-15:
5.2.2
Timing Constraints
20
Propagation Delay
Definition Slack: Difference between required value of timing parameter and actual value. A negative slack means that there is a timing violation. A positive slack means that the constraint for the timing parameter is satisfied. NB: Slack was called margin in E&CE 324. Both terms are used commonly.
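As a rough illustration of how slack relates to the clock period constraint, the sketch below adds up the delay terms named in this section (clock-to-Q, interconnect, load, setup, skew, jitter) and subtracts the sum from the clock period. It is a minimal Python sketch; the exact list of terms is an assumption, since the table of delays is not reproduced here, and the numbers in the usage example are hypothetical.

```python
def min_clock_period(t_co, interconnect, load, t_sud, skew, jitter):
    """Smallest clock period that covers the sum of the delays (assumed terms)."""
    return t_co + interconnect + load + t_sud + skew + jitter


def setup_slack(clock_period, **delays):
    """Positive: constraint satisfied. Negative: timing violation."""
    return clock_period - min_clock_period(**delays)


# Hypothetical delays in ns.
example = dict(t_co=2, interconnect=3, load=4, t_sud=1, skew=1, jitter=1)
```

With these example values the minimum period is 12 ns, so a 15 ns clock has a slack of 3 ns and a 10 ns clock has a slack of -2 ns.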
LEC-15:
5.2.2
Timing Constraints
21
CO
U VT
ClockPeriod
Skew
Jitter
Interconnect
Load
SUD
LEC-15:
5.2.2
Timing Constraints
22
Timing diagram (labels only): skew, jitter, hold (HO), clock-to-Q (CO), propagation (Load, Interconnect)
LEC-15:
5.2.3
23
5.2.3
LEC-15:
5.2.3
24
Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops. Clock skew is caused by the difference in interconnect delays to different points on the chip.
LEC-15:
5.2.3
25
LEC-15:
5.2.3
26
Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: different points in the clock generation circuitry.)
NOTE: Clock latency. Clock latency does not affect the limit on the minimum clock period.
LEC-15:
5.2.3
27
Definition Clock Jitter: Difference between actual clock period and ideal clock period.
LEC-15:
5.2.3
28
- temperature and voltage variations over time
- temperature and voltage variations across different locations on a chip
- manufacturing variations between different parts
- etc.
LEC-15:
5.2.4
29
Hold
clk q Clock-to-Q
LEC-15:
5.2.4
30
Forward Reference
In this section, we will use the definitions of setup, hold and clock-to-Q. Section 5.6 will show how to calculate setup, hold, and clock-to-Q times for flip-flops, latches, and other storage devices.
LEC-15:
5.2.4
33
NOTE: Require / Guarantee Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides its environment.
LEC-15:
5.2.4
34
LEC-15:
5.2.4
35
Good Timing
a clk b c d
a clk b
Clock-to-Q
c d
LEC-15:
5.2.4
36
Setup Violation
a clk b Clock-to-Q Prop Setup c d ??? ???
LEC-15:
5.2.4
37
Hold Violation
a clk b c d
a clk b
c d
???
LEC-15:
5.2.5
Propagation Delays
38
5.2.5
Propagation Delays
LEC-15:
5.2.5
Propagation Delays
39
Vi
Vo
Schematic
LEC-15:
5.2.5
Propagation Delays
40
Load Delays
1->0 0->1
0->1 1->0
Load capacitance is dependent on the fanout (how many other gates a gate drives) and how big the other gates are. Section 5.4.1 goes into more detail on timing models and equations for load delay.
LEC-15:
5.2.5
Propagation Delays
41
- Wire resistance is dependent upon the material and geometry of the wire.
- Wire capacitance is dependent on wire geometry, geometry of neighboring wires, and materials.
- Shorter wires are faster.
- Fatter wires are faster.
- FPGAs have special routing resources for long wires.
- CMOS processes use higher metal layers for long wires; these layers have wires with much larger cross sections than lower levels of metal.
LEC-15:
5.3
42
5.3
Three classes of paths:
- entry path: from an input to a flop
- stage path: from one flop to another flop
- exit path: from a flop to an output
LEC-15:
5.3
43
Entry Path
entry path: from an input to a flop. Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with System fmax. In Xilinx timing reports, this is reported as Maximum Delay.
LEC-15:
5.3
44
Stage Path
stage path: from one flop to another flop. In Quartus timing reports, this is reported as the period associated with Internal fmax. In Xilinx timing reports, this is reported as Clock to Setup and Maximum Frequency.
LEC-15:
5.3
45
Exit Path
exit path: from a flop to an output. Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with System fmax. In Xilinx timing reports, this is reported as Maximum Delay.
LEC-15:
5.3.1
46
5.3.1
a
b c
delay 2 4 4 6
Question: Assuming all delay and timing factors other than combinational logic delay are negligible:
- what is the critical path through this circuit?
- what is the delay along the path?
The answer to this question appears as Problem 5.3. In this circuit, it is extremely difficult to determine which path is the real critical path and which paths are false paths. There are many paths with reconvergent fanout, which greatly complicates the analysis. Most circuits are not nearly this difficult to analyze.
LEC-15:
5.3.2
47
5.3.2
LEC-15:
5.3.2
48
LEC-15:
5.3.2
49
- Start at the source node and traverse through the fanout to the destination node, annotating each intermediate node with the maximum delay to that node.
- The delay to the destination node is the delay of the critical path.
- The critical path is found by starting at the destination node and working backwards, choosing the node with maximum delay at each step.
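The three steps above amount to a longest-path search over the circuit graph. Below is a minimal Python sketch, assuming the circuit is given as a `fanin` dictionary (node to list of driving nodes) and a `delay` dictionary (node to gate delay); these names and the small example circuit are illustrative assumptions. The sketch finds only the topologically longest path and does not check whether that path is a false path.

```python
def critical_path(fanin, delay, dest):
    """Return (delay, path) of the longest topological path ending at dest."""
    arrival = {}

    def arrive(node):
        # Maximum delay from any source up to and including this node.
        if node not in arrival:
            arrival[node] = delay[node] + max(
                (arrive(p) for p in fanin[node]), default=0
            )
        return arrival[node]

    total = arrive(dest)

    # Work backwards from the destination, taking the slowest fanin each step.
    path = [dest]
    while fanin[path[-1]]:
        path.append(max(fanin[path[-1]], key=lambda p: arrival[p]))
    path.reverse()
    return total, path


# Hypothetical circuit: n1 = NOT a, n2 = AND(n1, b), n3 = XOR(n1, c), y = OR(n2, n3)
# with gate delays NOT = 2, AND = 4, OR = 4, XOR = 6 and zero-delay inputs.
fanin = {"a": [], "b": [], "c": [], "n1": ["a"],
         "n2": ["n1", "b"], "n3": ["n1", "c"], "y": ["n2", "n3"]}
delay = {"a": 0, "b": 0, "c": 0, "n1": 2, "n2": 4, "n3": 6, "y": 4}
```

For this example the annotated delay at `y` is 12, reached through `a`, `n1`, `n3`, `y`.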
LEC-15:
5.3.3
False Paths
52
5.3.3
False Paths
Definition false path: A path from a source signal to a destination signal such that changes on the source signal will not propagate along the path to cause a change on the destination signal.
There are two classes of false paths, static and dynamic. Static false paths are easier to detect, while dynamic false paths can be tedious and difficult to detect.
LEC-15:
5.3.3
False Paths
53
a b c
gate NOT AND OR XOR delay 2 4 4 6
f g
z y h
LEC-15:
5.3.3
False Paths
54
Answer
Answer:
LEC-15:
5.3.3
False Paths
55
The path from a to y has a delay of 16. Check if it is a false critical path.
LEC-15:
5.3.3
False Paths
56
The equation for y is !b!c, which does not contain a, so y is independent of a. In other words, changes on a do not lead to changes on y, so the path from a to y is a false path. We were able to use static analysis to determine that the path from a to y is a false path.
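The independence claim can be double-checked by brute force. The notes reach it by inspecting the simplified equation; the Python sketch below instead swaps in an exhaustive truth-table check, which is only practical for small functions. The helper name `depends_on` is a hypothetical choice, and `y` encodes the equation !b!c from the text.

```python
from itertools import product


def depends_on(func, names, var):
    """True if flipping `var` can change the output for some input assignment."""
    idx = names.index(var)
    for values in product([False, True], repeat=len(names)):
        if values[idx]:
            continue  # visit each pair of assignments once, starting from var = 0
        flipped = list(values)
        flipped[idx] = True
        if func(*values) != func(*flipped):
            return True
    return False


def y(a, b, c):
    """Output y = !b!c; a is accepted as an input but never used."""
    return (not b) and (not c)
```

A result of `False` for `a` matches the conclusion that every path from a to y is false; a result of `True` for `b` only says that y depends on b, not that any particular path from b is a real critical path.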
LEC-15:
5.3.3
False Paths
57
To find the next candidate critical path, recompute delay values along the false path. Leave all other delays the same as before. For each node along the false path, maintain two delay values. One delay is the value already calculated. The other delay value is the maximum delay to that node, ignoring the prefix of the false path. The prefix of a false path is the set of nodes whose fanin comes only from false paths.
LEC-15:
5.3.3
False Paths
58
Candidate Path
a b c 2 (0,2)!a (0,4) a (0,4) b 0 (4,8) 8 4 b 0 ab 4 !c 2 2 8 ab 8 !b 2 ab + !c 8 10 12 !a + !b !b!c z y
The next candidate is from b to y. Static analysis shows that b is in the equation for y, so static analysis cannot detect whether this is a false path. We must use dynamic analysis.
LEC-15:
5.3.3
False Paths
59
Answer:
LEC-15:
5.3.3
False Paths
60
LEC-15:
5.3.3
False Paths
61
Both rising and falling edges failed to generate a change on the output, so we have found another false path. NB: Pushing edges forward is not a smart way to explore candidate critical paths, because this technique does not help isolate the cause of the false path. Pushing edges backwards will identify the cause of the false path.
LEC-15:
5.3.3
False Paths
62
Try to push a rising edge backwards along path between b and y. Contradictory assignment for b, therefore false path.
LEC-15:
5.3.3
False Paths
63
Reconvergent Fanout
a b y c 0 z
Two paths from point of contradictory assignment to y. This is reconvergent fanout. Reconvergent fanout is most common cause of false paths. It also causes problems with fault-detection (Chapter 7).
LEC-15:
5.3.3
False Paths
64
Try to push a rising edge backwards along path, but put edge (not constant) on node in reconvergent fanout. Contradictory assignments to b.
LEC-15:
5.3.3
False Paths
65
To find the next candidate critical path, recompute the delay values for nodes along the false path. Leave all other delays the same as before. To recompute delay along a false path, ignore the prefix of the false path. The prefix is the set of nodes whose fanin comes only from false paths.
LEC-15:
5.3.3
False Paths
66
LEC-15:
5.3.3
False Paths
67
Next Candidates
a b c 2 2 !a 4 a 4 b 0 8 0 b 0 ab 0 !c 2 2 6 ab 8 !b 2 ab + !c 6 10 10 !a + !b !b!c z y
LEC-15:
5.3.3
False Paths
68
(*CHANGE ver2 (2002/12/02): corrected edge polarity on a *) Propagate a rising edge backwards. It works!
LEC-15:
5.3.3
False Paths
69
LEC-15:
5.3.3
False Paths
70
Summary
There are two paths with a delay of 10: one from a to z and one from c to y. We can push edges along both of these paths, so they are real critical paths. Note that different values on b result in different critical paths.
LEC-15:
5.3.3
False Paths
71
f h g
i k j
LEC-15:
5.3.3
False Paths
72
Answer
Answer:
a b c d e 0 /= 1 f g 1 0 h j i 1 k
LEC-15:
5.3.3
False Paths
73
LEC-15:
5.3.3
False Paths
74
First Candidate
Answer:
4 a0 b0 8 c 0 2 2 2 4 0 8 0 2 12 12 2 12 16 delay=8 12 x
delay=2
14
LEC-15:
5.3.3
False Paths
75
Second Candidate
4 a0,0 b0 c 0 2 0,2 0,4 0 0,8 0 0,8 2 2 6,12 12 2 6 10 delay=8 12 x
delay=2
14
The real critical path is the path from a to z, which has a delay of 14.
LEC-15:
5.3.3
False Paths
76
LEC-15:
5.3.3
False Paths
77
1 0
1 0
General rules
Figure 5.9: Rules for pushing rising and falling edges through gates
LEC-15:
5.3.3
False Paths
78
Answer:
a b a c b c
Falling edge on non-critical path will cause output to change before edge on critical path affects output.
LEC-15:
5.3.3
False Paths
79
LEC-15:
5.3.3
False Paths
80
1 glitch on output
constant 0 output
0 is controlling
LEC-15:
5.3.3
False Paths
81
Reconvergent for OR
0 0
1 is controlling
1 is controlling
1 is controlling
constant 1 output
0 glitch on output
LEC-15:
5.3.3
False Paths
82
0 glitch on output
0 is controlling
constant 0 output
LEC-15:
5.3.3
False Paths
83
1 is controlling
1 is controlling
1 is controlling
constant 1 output
0 glitch on output
LEC-15:
5.3.4
84
LEC-16 Preliminaries
LEC-16 Preliminaries
Overview
This lecture looks at the analog equations that affect delay and relates them up to the digital world.
LEC-16 Preliminaries
Concepts
- Timing model
- Data-dependent delay
- Propagation delay
- Load delay
- Interconnect delay
- Elmore time constant
- Extrinsic delay
- Intrinsic delay
- Worst case timing
- Derating factors
- Speed binning
LEC-16:
5.4
5.4
LEC-16:
5.4.1
5.4.1
Timing model
- pull-up resistor in p-tran
- pull-down resistor in n-tran
- parasitic capacitance
- load capacitance
LEC-16:
5.4.1
Vo
VDD
i ph
g f
Rpd Cp
t Cout
LEC-16:
5.4.1
LEC-16:
5.4.1
LEC-16:
5.4.1
10
- low-voltage (0) trip point of 0.35 Vdd
- high-voltage (1) trip point of 0.65 Vdd
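As a worked step, the trip points can be tied to the RC timing model. The following is a sketch under a first-order assumption: the output node discharges exponentially through the pull-down resistance $R_{pd}$ into the total capacitance $C_p + C_{out}$. The exponential model and the name $t_{fall}$ are assumptions of this sketch; the symbols $R_{pd}$, $C_p$, $C_{out}$ and $V_{DD}$ are the ones used in this section.

```latex
V_o(t) = V_{DD} \, e^{-t / \left( R_{pd} (C_p + C_{out}) \right)}

% Time for the output to fall from V_{DD} to the low trip point 0.35 V_{DD}:
0.35 \, V_{DD} = V_{DD} \, e^{-t_{fall} / \left( R_{pd} (C_p + C_{out}) \right)}
\quad\Longrightarrow\quad
t_{fall} = R_{pd} (C_p + C_{out}) \ln\frac{1}{0.35} \approx 1.05 \, R_{pd} (C_p + C_{out})
```

Under this assumption the falling delay is roughly one RC time constant, which is why the delay is often approximated as $R_{pd}(C_p + C_{out})$.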
LEC-16:
5.4.1
11
g f
s th q r f
Rpd Cp
0 35VDD
VDD
ih
g f
d q
LEC-16:
5.4.1
12
- A larger transistor has a lower resistance, but a higher capacitance.
- Resistance affects timing of source (driving) signals.
- Capacitance affects (mostly) timing of destination (load) signals.
- Decreasing resistance increases the current through drivers.
- Increasing capacitance slows down (dis)charging of load capacitors.
g f
TPD
Rpd Cp
Cout
` ` ` ` `
LEC-16:
5.4.1
13
LEC-16:
5.4.2
Data-Dependent Delay
14
5.4.2
Data-Dependent Delay
Sometimes the delay through a component is dependent upon the values on signals.
- In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries are generated at all.
- In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a 1.
- Some implementation technologies (e.g. NMOS and exotic latches) have faster transitions from 1->0 than 0->1.
LEC-16:
5.4.2
Data-Dependent Delay
15
NOTE: Asynchronous circuits Data dependent delays are one motivation for asynchronous circuits. Asynchronous circuits are still an active area of research, but are beginning to be used in commercial circuits.
LEC-16:
5.4.3
16
5.4.3
LEC-16:
5.4.3
17
$D_i$ = Elmore time constant for node $i$
$$D_i = \sum_{k=1}^{n} ER_{k,i} C_k$$
($n$ is the number of nodes in the circuit)
$ER_{k,i}$ = resistance along the path from node $i$ to the source that is also on the path from node $k$ to the source
w w
v u
Vi t
LEC-16:
5.4.3
18
k 1
If we:
Di
ERk,iCk
hf
` `
LEC-16:
5.4.3
19
G1
G2
Ra4 Ra1
G1
C3 Rw3
Ra3
G2 C1 Rw1 G1
Rpu
C2 Rw2 Ra2
Vi Cp Rpd
C1
C2
C3
CG2
G* C* Ra* Rw*
Question:
Answer:
LEC-16:
5.4.3
20
k 1
Ra1
Rw1 C1
Ra1
Rw1
Ra2
Rw2 C2
Ra1
Rw1
Ra2
Rw2
Ra3
Rw3 C3
Ra1
Rw1
Ra2
Rw2
Ra3
Rw3
ER C1 1,4
f g
f g
f g
b b b
D4
ERk,iCk
ER C2 2,4 ER C3 3,4 ER C4 4,4 Ra4 CG2
LEC-16:
5.4.3
21
approximate Rai
Ra j
D4
4 Ra CG2
3 Ra C3
2 Ra C2
Ra C1
h f g h f g h f g h f b g g f g f g h f
h g f g h
D4
approximate Ra
Rw Ra2 Ra3 C3
LEC-16:
5.4.3
22
Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?
Answer:
LEC-16:
5.4.3
23
$$D_i = \sum_{k=1}^{n} ER_{k,i} C_k$$
Assume all resistances and capacitances are the same values (R and C), and assume that all intermediate nodes are along the path between the two gates of interest. Then $ER_{k,i} = k R$ and
$$D_i = \sum_{k=1}^{n} k \, R C = \frac{n(n+1)}{2} R C \approx \frac{n^2}{2} R C$$
We see that the delay is proportional to the square of the number of antifuses along the path.
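The quadratic growth can be seen numerically. This is a minimal Python sketch, assuming a single chain of n identical RC sections with no side branches; `chain_elmore_delay` is a hypothetical helper name, and unit values of R and C are used only for illustration.

```python
def chain_elmore_delay(n, r=1.0, c=1.0):
    """Elmore delay at the far end of a chain of n identical RC sections.

    Node k shares k*r of resistance with the path to the last node,
    so the delay is the sum of k*r*c for k = 1..n.
    """
    return sum(k * r * c for k in range(1, n + 1))


def doubling_ratio(n):
    """How much the delay grows when the number of sections doubles."""
    return chain_elmore_delay(2 * n) / chain_elmore_delay(n)
```

For long chains `doubling_ratio` approaches 4: doubling the number of antifuses roughly quadruples the wire delay.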
LEC-16:
5.4.3
25
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2
LEC-16:
5.4.3
26
Answer:
R4 C5 G2 C1 R1 G1
C4
R3 C3 R5 R6 C7
G3 C6 R2 C2
7).
LEC-16:
5.4.3
27
3. Draw RC tree
G1 Rpu R1 n1 R2 n2 Cp Rpd G3 R5 n6 R6 C6 C1 C2 n3 R3 n4 R4 C3 C4 G2
Vi
n5 C5
n7 C7
5).
LEC-16:
5.4.3
28
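$$D_i = \sum_{k=1}^{n} ER_{k,i} C_k$$
$$D_5 = \sum_{k=1}^{7} ER_{k,5} C_k = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7$$
$ER_{1,5} = R$
$ER_{2,5} = 2R$
$ER_{3,5} = 2R$
$ER_{4,5} = 3R$
$ER_{5,5} = 4R$
$ER_{6,5} = 2R$
$ER_{7,5} = 2R$
$$D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$$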
LEC-16:
5.4.3
31
Delay from G1 to G3
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G3
LEC-16:
5.4.3
32
Answer:
7).
LEC-16:
5.4.3
33
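$$D_i = \sum_{k=1}^{n} ER_{k,i} C_k$$
$$D_7 = \sum_{k=1}^{7} ER_{k,7} C_k = ER_{1,7} C_1 + ER_{2,7} C_2 + ER_{3,7} C_3 + ER_{4,7} C_4 + ER_{5,7} C_5 + ER_{6,7} C_6 + ER_{7,7} C_7$$
$ER_{1,7} = R$
$ER_{2,7} = 2R$
$ER_{3,7} = 2R$
$ER_{4,7} = 2R$
$ER_{5,7} = 2R$
$ER_{6,7} = 3R$
$ER_{7,7} = 4R$
$$D_7 = R C_1 + 2R C_2 + 2R C_3 + 2R C_4 + 2R C_5 + 3R C_6 + 4R C_7$$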
LEC-16:
5.4.3
36
Delay to G2 vs G3
Question: Assuming all wire segments at same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?
Answer:
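$$D_7 = R C_1 + 2R C_2 + 2R C_3 + 2R C_4 + 2R C_5 + 3R C_6 + 4R C_7$$
$$D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$$
2. Difference in delays
$$D_7 - D_5 = -R C_4 - 2R C_5 + R C_6 + 2R C_7$$
3. Compare capacitances $C_4$ and $C_5$ against $C_6$ and $C_7$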
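The delays to G2 and G3 can be reproduced with a small Elmore calculator. This is a minimal Python sketch; the tree topology (n1 and n2 in series, a zero-resistance wire to n3, then one branch n4, n5 toward G2 and another branch n6, n7 toward G3) is an assumption inferred from the shared-resistance values listed above, the capacitance values are hypothetical, and the driver resistance is ignored as in the worked answer.

```python
def elmore_delay(parent, resistance, capacitance, target):
    """Elmore delay at `target`: sum over nodes k of (shared resistance) * C_k.

    parent[n] is the node between n and the source (None if n is attached
    to the source); resistance[n] is the resistance of the link into n.
    """
    def path_to_source(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path

    target_path = set(path_to_source(target))
    total = 0.0
    for k, c_k in capacitance.items():
        # Resistance on the path from k to the source that is also on the
        # path from the target to the source.
        shared = sum(resistance[n] for n in path_to_source(k) if n in target_path)
        total += shared * c_k
    return total


R = 1  # every antifuse has the same resistance; wire resistance is neglected
parent = {"n1": None, "n2": "n1", "n3": "n2", "n4": "n3",
          "n5": "n4", "n6": "n3", "n7": "n6"}
resistance = {"n1": R, "n2": R, "n3": 0, "n4": R, "n5": R, "n6": R, "n7": R}
# Hypothetical capacitances C1..C7, chosen distinct so each term is visible.
capacitance = {"n1": 1, "n2": 2, "n3": 3, "n4": 4, "n5": 5, "n6": 6, "n7": 7}
```

With these placeholder capacitances, `elmore_delay(..., "n5")` and `elmore_delay(..., "n7")` follow the D5 and D7 expressions term by term, and their difference equals -R C4 - 2R C5 + R C6 + 2R C7.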
LEC-16:
5.4.3
37
LEC-16:
5.5
38
5.5
LEC-16:
5.5.1
39
5.5.1
Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably. Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run. A speed bin is the clock speed that chips will be labeled with when sold. Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your overstressed hardware will).
LEC-16:
5.5.2
40
5.5.2
LEC-16:
5.5.2
41
into a single parameter (fanout). This is common, and fine. But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.
LEC-16:
5.5.2
42
D D
Delay Delay
LEC-16:
5.5.2
43
Temperature
As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.
Temp
Temp
LEC-16:
5.5.2
44
Supply Voltage
current age
Supply voltage
Supply voltage
LEC-16:
5.5.2
45
LEC-17 Preliminaries
LEC-17 Preliminaries
Concepts
Setup, hold, and clock-to-Q time calculations for the following circuits:
- Latch
- Master/slave flip-flop
We won't have time to cover all of these in lecture. Hierarchical FPGA is in Smith. Exotic flop is for your interest and buzz-word completeness in interviews; it will not be on the final exam.
LEC-17:
5.6
LEC-17:
5.6.1
Simple Latch
5.6.1
Simple Latch
- loading data: loads input data into storage circuitry; input data passes through to output
- using stored data: input signal is disconnected from output; storage circuitry drives output
clk o
Schematic
LEC-17:
5.6.1
Simple Latch
Storage mode
LEC-17:
5.6.1
Simple Latch
Implementing a Latch
s a b o a sel b
Latch implementation
LEC-17:
5.6.1
Simple Latch
Latch Glitching
d clk
NOTE: inverters on sel Both of the inverters on the sel signal are needed. Together, they prevent a glitch on the OR gate when sel is deasserted. If there was only one inverter, a glitch would occur. For more on this, see section 5.6.3.2
LEC-17:
5.6.1
Simple Latch
Loading 0
d=0 clk=1 1 1 0 1 1 0 0 o
LEC-17:
5.6.1
Simple Latch
10
Loading 1
d=1 clk=1 0 1 0 0 0 0 1 o
LEC-17:
5.6.1
Simple Latch
11
Storing 0
d clk=0 0 1 1 0 1 1 0 o=0
LEC-17:
5.6.1
Simple Latch
12
Storing 1
d clk=0 0 1 1 0 1 1 1 o=1
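The four cases above (loading 0, loading 1, storing 0, storing 1) can be summarized by a behavioral model of the multiplexer-based latch. This is a minimal Python sketch that ignores gate delays and glitching; the function names are hypothetical, and the polarity (the mux passes d when clk is 1 and feeds back o when clk is 0) is an assumption consistent with the loading and storing examples.

```python
def mux(sel, a, b):
    """Return a when sel is 1, otherwise b."""
    return a if sel else b


def latch_next(d, clk, o):
    """Next output of an active-high latch built from a mux with feedback."""
    return mux(clk, d, o)


def simulate(inputs, o=0):
    """Apply a sequence of (d, clk) pairs and return the output after each."""
    outputs = []
    for d, clk in inputs:
        o = latch_next(d, clk, o)
        outputs.append(o)
    return outputs
```

For example, `simulate([(1, 1), (0, 0), (0, 1), (1, 0)])` loads 1, holds it while the clock is low, loads 0, and then holds 0.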
LEC-17:
5.6.1
Simple Latch
16
Clock-to-Q
NOTE: Clock-to-Q for latches For latches, clock-to-Q times are measured with respect to the clock edge that connects the data input to the output. For active-high latches, this is a rising edge.
LEC-17:
5.6.2
20
5.6.2
Calculate clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit.
LEC-17:
5.6.3
21
5.6.3
LEC-17:
5.6.3
22
d clk
l2 qn s2 s1 q
LEC-17:
5.6.3
23
d clk
l2 qn s2 s1 q
setup d l1 l2 qn q s1 s2 clk cn c2
LEC-17:
5.6.3
24
d clk
l2 qn q
cn
s2 s1
setup d l1 l2 qn q s1 s2 clk cn c2
Minimum setup time: a change on d must arrive at s1 before cn is asserted. Otherwise, the change will affect the storage circuitry when the data input is disconnected. Setup time is the difference between the path from d to s1 and the path from clk to cn.
LEC-17:
5.6.3
25
LEC-17:
5.6.3
26
d clk
l2 qn q
d l1 l2 qn q s1 s2 clk cn c2
LEC-17:
5.6.3
27
d clk
l2 qn q
d l1 l2 qn q s1 s2 clk cn c2
LEC-17:
5.6.3
28
d l1 l2 qn q s1 s2 clk cn c2
Can't let a change on d affect l1 before c2 deasserts. Hold time is the difference between the path from clk to c2 and the path from d to l1.
LEC-17:
5.6.3
29
LEC-17:
5.6.4
30
LEC-17:
5.6.4
31
Symbol
s 1 0
Implementation 0
Open
0
Closed
Transmit 1
s i o
Transmit 0
LEC-17:
5.6.5
39
5.6.5
m
EN
d clk m clk_b q ??
??
LEC-17:
5.6.5
40
m
EN
TInv Tmd
LEC-17:
5.6.5
41
m
EN
d clk m clk_b q
Flop Clock-to-Q
Flop CO
TInv
Latch CO
LEC-17:
5.6.5
42
m
EN
d clk m clk_b q
Tinv
Tmd
Flop Setup
SUD
Flop
Tmd
Latch SUD
TInv
LEC-17:
5.6.5
43
m
EN
d clk m clk_b q
The hold time of the flip-flop is the same as the hold time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.
Flop HO
Latch HO
LEC-17:
5.6.6
44
LEC-17:
5.6.6
45
T_PD: delay from D-inputs to storage element
T_CLKD: delay from clk-input to storage element
T_OUT: delay from storage element to output

setup time = slowest D path - fastest clk path = T_PD Max - T_CLKD Min
hold time = slowest clk path - fastest D path = T_CLKD Max - T_PD Min
clock-to-Q delay = clk path + output path = T_CLKD + T_OUT
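The path differences above can be turned into a small calculator for a hierarchical storage cell. This is a minimal Python sketch, assuming the differences are added to the storage element's own (internal) setup, hold, and clock-to-Q times; the function and parameter names are hypothetical, and the internal values in the example are placeholders. Only the 3 ns data-path delay and the 2.6 ns clock-path delay come from the Actel example in these notes.

```python
def external_timing(t_sud_int, t_ho_int, t_co_int,
                    t_pd_min, t_pd_max, t_clkd_min, t_clkd_max, t_out=0.0):
    """Setup, hold, and clock-to-Q times seen from outside the cell.

    setup      grows by (slowest D path - fastest clk path)
    hold       grows by (slowest clk path - fastest D path)
    clock-to-Q grows by (clk path + output path)
    """
    setup = t_sud_int + (t_pd_max - t_clkd_min)
    hold = t_ho_int + (t_clkd_max - t_pd_min)
    clock_to_q = t_co_int + t_clkd_max + t_out
    return setup, hold, clock_to_q


# Placeholder internal values (ns); data path 3 ns and clock path 2.6 ns.
setup, hold, clock_to_q = external_timing(
    t_sud_int=0.4, t_ho_int=0.1, t_co_int=0.4,
    t_pd_min=3.0, t_pd_max=3.0, t_clkd_min=2.6, t_clkd_max=2.6,
)
```

With a 3 ns data path and a 2.6 ns clock path, the setup time grows by 0.4 ns and the clock-to-Q time by 2.6 ns; a negative hold time simply means the data may change slightly before the clock edge.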
LEC-17:
5.6.6
46
HO CO
SUD T HO T CO
SUD
PD Max
LEC-17:
5.6.6
47
Basic logic cells are called Logic Modules.
- ACT 1 family: one type of Logic Module (see Figure 5.1, Smith p. 192)
- ACT 2 and ACT 3 families: use two different types of Logic Module (see Figure 5.4, Smith p. 198)
    C-Module (Combinatorial Module): combinational logic, similar to the ACT 1 Logic Module but capable of implementing a five-input logic function
    S-Module (Sequential Module): C-Module + Sequential Element (SE) that can be configured as a flip-flop
LEC-17:
5.6.6
48
Actel Timing
ACT family (see Figure 5.5, Smith p. 200):
- Simple. Why? Only logic inside the chip.
- Not exact delay (no place and route or physical layout, hence not accounting for interconnect delay)
- Non-deterministic Actel architecture
- All primed parameters inside the S-Module are assumed
- Calculate tSUD, tH, and tCO
Of the combinational logic delay of 3 ns, 0.4 ns went into increasing the setup time, tSUD, and 2.6 ns went into increasing the clock-to-output delay, tCO. From outside, we can say that the combinational logic delay is buried in the flip-flop setup time.
LEC-17:
5.6.6
49
Actel Latch
d clk q d clk clr q
LEC-17:
5.6.6
50
d clk clr
LEC-17:
5.6.6
51
SE-Module
m se_clk se_clk_n
clk clr
LEC-17:
5.6.6
52
Other given timing parameters
C-Module delay (tPD)                          3ns
tCLKD (from clk to se_clk and se_clk_n)       2.6ns
LEC-17:
5.6.6
53
Answer:
See Smith p. 199. Use Smith's eqns 5.15 and 5.16, and assume t_CLKD = 2.6 ns, to calculate T_SUD, T_HO, and T_CO.
LEC-17:
5.6.7
Exotic Flop
54
5.6.7
Exotic Flop
q d clk
Inverter chain creates evaluation window in time when clock has just risen and p transistors are turned on. When clock is 0, internal nodes precharge to 1. Inverter loops are keepers, which store data value.
Chapter 6
55
LEC-18 Preliminaries
LEC-18 Preliminaries
Schedule
wk-01-02   VHDL
wk-03-05   Design and Optimization
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08-09   Timing Analysis
wk-09-10   Power Analysis and Reduction
             Lec-18 Introduction
             Lec-19 Data Encoding
             Lec-20 Clock Gating
wk-11-12   Faults and Testing
wk-13      Review
LEC-18 Preliminaries
- Activity Factor
- Switching Power Consumption
- Short-Circuiting Power Consumption
- Leakage Power Consumption
LEC-18 Preliminaries
Background Material
Basic electricity and magnetism equations for voltage, power, current, etc
LEC-18 Preliminaries
Reading Material
All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.
Smith 15.5
Mudge: Power: A first class design constraint. Trevor Mudge. Computer, vol. 34, no. 4, April 2001, pp. 52-57
http://www.eecs.umich.edu/tnm/papers/computer01.pdf
Infrared Expose: Thermal imaging of 29 200-MHz and 233-MHz notebooks. PC Online. 1997
http://www.zdnet.com/pcmag/features/notebook3/heat.htm
Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. Brooks, D.M.; Bose, P.; Schuster, S.E.; Jacobson, H.; Kudva, P.N.; Buyuktosunoglu, A.; Wellman, J.; Zyuban, V.; Gupta, M.; Cook, P.W. IEEE Micro Dec 2000.
http://ieeexplore.ieee.org/iel5/40/19226/00888701.pdf?isNumber=19226
Managing the Impact of Increasing Microprocessor Power Consumption. Stephen H. Gunther, Frank Binns, Douglas M. Carmean, and Jonathan C. Hall. Intel Technology Journal. 2001 Quarter 1.
http://developer.intel.com/technology/itj/q12001/articles/art 4.htm
the following are three papers from the 1998 Design Automation Conference (DAC) in a session on Power Dissipation and Distribution in High Performance Processors Power Considerations in the Design of the Alpha 21264 Microprocessor. Michael K. Gowan, Larry L. Biro, Daniel B. Jackson.
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p
LEC-18 Preliminaries
Reducing Power in High-Performance Microprocessors. Vivek Tiwari, Deo Singh, Suresh Rajgopal, Gaurav Mehta, Rakesh Patel, Franklin Baez. Design and Analysis of Power Distribution Networks in PowerPC(TM)Microprocessors. Abhijit Dharchoudhury, Rajendran Panda, David Blaauw, Ravi Vaidyanathan, Bogdan Tutuianu, David Bearden.
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p
LEC-18:
6.1
OVERVIEW
6.1
Overview
LEC-18:
6.1.1
6.1.1
Laptops, PDAs, cell phones, etc.: obvious!
Every watt above 40 W adds $1 to manufacturing cost.
Approximately 25% of the operating expense of a server farm goes to energy bills.
(Dis)comfort of Unix labs in E2.
Sandia Labs had to build a special sub-station when they took delivery of the Teraops massively parallel supercomputer (over 9000 Pentium Pros).
High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Willamette thermal throttling.
In 2000, information technology consumed 8% of total power in the US.
LEC-18:
6.1.2
6.1.2
All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.
AMD's Athlon PowerNow!: Reduce power consumption in laptops when running on battery by reducing clock speed and supply voltage.
Intel SpeedStep: Reduce power consumption in laptops when running on battery by reducing clock speed to 70-80% of normal.
Intel X-Scale: An ARM5-compatible microprocessor for low-power systems: http://developer.intel.com/design/intelxscale/
Synopsys PowerMill: A simulator that estimates power consumption of the circuit as it is simulated: http://www.synopsys.com/products/etg/powermill ds.html
Compaq Itsy
Satellites
LEC-18:
6.1.3
Power vs Energy
10
6.1.3
Power vs Energy
Most people talk about power reduction, but sometimes they mean power and sometimes energy.
Power minimization is usually about heat removal.
Energy minimization is usually about battery life or energy costs.

Type     Units    Equivalent Types   Equations
Energy   Joules   Work               $= \text{Volts} \times \text{Coulombs} = \tfrac{1}{2} \cdot C \cdot \text{Volts}^2$
Power    Watts    Energy / Time
LEC-18:
6.1.4
11
6.1.4
LEC-18:
6.1.4
12
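6.1.4.1 Do Batteries Store Energy or Power?
$\text{Energy} = \text{Volts} \times \text{Coulombs}$
$\text{Energy}_{\text{battery}} = \text{Amps} \times \text{Seconds} \times \text{Volts}$
$\text{Power} = \text{Energy} / \text{Time}$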
LEC-18:
6.1.4
13
Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency. (This assumes that all instructions perform the same amount of work!)
LEC-18:
6.1.5
14
Question: If I use the SpeedStep feature of my computer, and run at 600MHz with 60W of power, how much longer can I keep the computer running on one battery? How many more simulation steps can I run on one battery?
LEC-18:
6.2
POWER EQUATIONS
15
6.2
Power Equations
$\text{Power} = \text{DynamicPower} + \text{StaticPower}$

Dynamic Power (dependent upon clock speed):
  Switching Power: useful, charges up transistors
  Short Circuit Power: not useful, both N and P transistors are on
Static Power (independent of clock speed):
  Leakage Power: not useful, leaks around transistor

$\text{Power} = \text{SwitchPower} + \text{ShortPower} + \text{LeakagePower}$
LEC-18:
6.2.1
16
6.2.1
Dynamic power is proportional to how often signals change their value (switch). Roughly 20% of signals switch during a clock cycle. Need to take glitches into account when calculating activity factor. Equations for dynamic power contain clock speed and activity factor.
LEC-18:
6.2.2
Switching Power
17
6.2.2
Switching Power
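Charging a capacitor: energy dissipated is $\tfrac{1}{2} \cdot \text{CapLoad} \cdot \text{VoltSup}^2$
Discharging a capacitor: energy dissipated is $\tfrac{1}{2} \cdot \text{CapLoad} \cdot \text{VoltSup}^2$
f: frequency at which an inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith):
$\text{PwrSw} = f \cdot \text{CapLoad} \cdot \text{VoltSup}^2$
ClockSpeed: clock speed
ActFact: average number of times that a signal switches from 0→1 or from 1→0 during a clock cycle
$\text{PwrSw} = \tfrac{1}{2} \cdot \text{ActFact} \cdot \text{ClockSpeed} \cdot \text{CapLoad} \cdot \text{VoltSup}^2$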
LEC-18:
6.2.3
Short-Circuited Power
18
6.2.3
Short-Circuited Power
IShort
Vi
Vo
VoltSup VoltThresh
Gate Voltage
Charging
LEC-18:
6.2.4
Leakage Power
19
6.2.4
Leakage Power
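[Figure: inverter with input Vi and output Vo, showing N and P regions in the N-substrate and the leakage current ILeak]
$\text{ILeak} \propto e^{-q \cdot \text{VoltThresh} / (k \cdot T)}$
$\text{PwrLk} = \text{ILeak} \cdot \text{VoltSup}$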
6.2.5 Glossary

ClockSpeed
  def: Clock speed
  aka: f
VoltSup
  def: Supply voltage
  aka: V
VoltThresh
  def: Threshold voltage
  aka: Vth
  = voltage at which P transistors turn on
ILeak
  def: Leakage current
  aka: IS (reverse bias saturation current)
  $\propto e^{-q \cdot \text{VoltThresh} / (k \cdot T)}$
TimeShort
  def: short circuit time
  = Time that both N and P transistors are turned on when a signal changes value.
IShort
  def: Short circuit current
  aka: Ishort
  = Current that goes through the transistor network while both N and P transistors are turned on.
ActFact
  def: activity factor
  aka: A
  $= \text{NumTransitions} / (\text{NumSignals} \cdot \text{NumClockCycles})$
  Per signal: percentage of clock cycles when the signal changes value.
  Per clock cycle: percentage of signals that change value per clock cycle.
  Note: When measuring per circuit, sometimes approximate by looking only at flops, rather than every single signal.
CapLoad
  def: load capacitance
  aka: CL
PwrSw
  def: switching power (dynamic)
  $= \tfrac{1}{2} \cdot \text{ActFact} \cdot \text{ClockSpeed} \cdot \text{CapLoad} \cdot \text{VoltSup}^2$
PwrShort
  def: short-circuit power (dynamic)
  $= \text{ActFact} \cdot \text{ClockSpeed} \cdot \text{TimeShort} \cdot \text{IShort} \cdot \text{VoltSup}$
PwrLk
  def: leakage power (static)
  $= \text{ILeak} \cdot \text{VoltSup}$
Power
  def: total power
  $= \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$
MaxClockSpeed
  def: Maximum clock speed that an implementation technology can support.
  aka: fmax
  $\propto (\text{VoltSup} - \text{VoltThresh})^2 / \text{VoltSup}$
q
  def: electron charge
  $= 1.60218 \times 10^{-19}$ C
k
  def: Boltzmann's constant
  $= 1.38066 \times 10^{-23}$ J/K
T
  def: temperature in Kelvin
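The glossary equations can be checked numerically with a short sketch. The function names and the example values below (activity factor, clock speed, capacitance, currents) are illustrative assumptions, not figures from the notes; only the three formulas come from the glossary.

```python
def switching_power(act_fact, clock_speed, cap_load, volt_sup):
    """PwrSw = 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2."""
    return 0.5 * act_fact * clock_speed * cap_load * volt_sup ** 2


def short_circuit_power(act_fact, clock_speed, time_short, i_short, volt_sup):
    """PwrShort = ActFact * ClockSpeed * TimeShort * IShort * VoltSup."""
    return act_fact * clock_speed * time_short * i_short * volt_sup


def leakage_power(i_leak, volt_sup):
    """PwrLk = ILeak * VoltSup."""
    return i_leak * volt_sup


def total_power(act_fact, clock_speed, cap_load, time_short, i_short, i_leak, volt_sup):
    """Power = PwrSw + PwrShort + PwrLk."""
    return (switching_power(act_fact, clock_speed, cap_load, volt_sup)
            + short_circuit_power(act_fact, clock_speed, time_short, i_short, volt_sup)
            + leakage_power(i_leak, volt_sup))


# Hypothetical example values: 20% activity, 100 MHz clock, 1 pF load, 2.0 V supply.
full_voltage = switching_power(0.2, 100e6, 1e-12, 2.0)
half_voltage = switching_power(0.2, 100e6, 1e-12, 1.0)
print(full_voltage, half_voltage)
```

Halving VoltSup in this sketch cuts the switching term to a quarter, while the short-circuit and leakage terms only scale linearly with VoltSup.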
LEC-18:
6.2.6
23
6.2.6
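$\text{Power} = \left(\tfrac{1}{2} \cdot \text{ActFact} \cdot \text{ClockSpeed} \cdot \text{CapLoad} \cdot \text{VoltSup}^2\right) + \left(\text{ActFact} \cdot \text{ClockSpeed} \cdot \text{TimeShort} \cdot \text{IShort} \cdot \text{VoltSup}\right) + \left(\text{ILeak} \cdot \text{VoltSup}\right)$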
LEC-18:
6.2.6
24
Multiple Signals
To calculate dynamic power for n signals with different CapLoad, TimeShort, and IShort:
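$\text{DynamicPower} = \sum_{i=1}^{n} \left(\text{ActFact}_i \cdot \text{ClockSpeed} \cdot \text{TimeShort}_i \cdot \text{IShort}_i \cdot \text{VoltSup}\right) + \sum_{i=1}^{n} \left(\text{ActFact}_i \cdot \tfrac{1}{2} \text{CapLoad}_i \cdot \text{ClockSpeed} \cdot \text{VoltSup}^2\right)$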
LEC-18:
6.2.6
25
Average Power
If we know the average CapLoad, TimeShort, and IShort, then the above formula simplifies to:
$\text{DynamicPower} = n \cdot \left[\left(\text{ActFact}_{AVG} \cdot \text{ClockSpeed} \cdot \text{TimeShort}_{AVG} \cdot \text{IShort}_{AVG} \cdot \text{VoltSup}\right) + \left(\text{ActFact}_{AVG} \cdot \tfrac{1}{2} \text{CapLoad}_{AVG} \cdot \text{ClockSpeed} \cdot \text{VoltSup}^2\right)\right]$
If capacitances and short-circuit parameters don't have an even distribution, then don't average them. If high-capacitance signals have high activity factors, then averaging the equations will result in erroneously low predictions for power.
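The warning about averaging can be illustrated with a small sketch that evaluates both the per-signal sum and the averaged formula. The two signals and their parameter values are hypothetical, chosen so that the high-capacitance signal also has the high activity factor; short-circuit terms are set to zero to keep the arithmetic easy to follow.

```python
def dynamic_power(signals, clock_speed, volt_sup):
    """Per-signal sum of short-circuit and switching power.

    Each signal is a tuple (ActFact, CapLoad, TimeShort, IShort).
    """
    total = 0.0
    for act_fact, cap_load, time_short, i_short in signals:
        short_circuit = act_fact * clock_speed * time_short * i_short * volt_sup
        switching = act_fact * 0.5 * cap_load * clock_speed * volt_sup ** 2
        total += short_circuit + switching
    return total


def dynamic_power_from_averages(signals, clock_speed, volt_sup):
    """Same quantity estimated from the average of each parameter."""
    n = len(signals)
    act_avg, cap_avg, time_avg, i_avg = (sum(col) / n for col in zip(*signals))
    short_circuit = act_avg * clock_speed * time_avg * i_avg * volt_sup
    switching = act_avg * 0.5 * cap_avg * clock_speed * volt_sup ** 2
    return n * (short_circuit + switching)


# Hypothetical signals: a busy 4 pF net and a quiet 1 pF net.
signals = [(0.9, 4e-12, 0.0, 0.0), (0.1, 1e-12, 0.0, 0.0)]
exact = dynamic_power(signals, 100e6, 2.8)
averaged = dynamic_power_from_averages(signals, 100e6, 2.8)
print(exact, averaged, averaged / exact)
```

With these assumed values the averaged estimate is about two thirds of the per-signal result, which is the "erroneously low" prediction the slide warns about.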
LEC-19 Preliminaries
LEC-19:
6.3
LEC-19:
6.3
Analog Parameters
Power reduction parameters at the analog level:
capacitance: for example, Silicon on Insulator (SOI)
resistance: for example, copper wires
voltage: low-voltage circuits
LEC-19:
6.3
Analog Techniques
Power reduction techniques at the analog level:
dual-Vt: Two different supply voltages: high voltage for performance-critical portions of the design, low voltage for the remainder of the circuit.
exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
adiabatic circuits: Special circuitry that consumes power on 0→1 transitions, but not on 1→0 transitions. These sacrifice performance for reduced power.
clock trees: Up to 30% of total power can be consumed in clock generation and the clock tree.
LEC-19:
6.3
Digital Parameters
Power-reduction parameters at the digital level:
capacitance (number of gates)
activity factor
clock frequency
LEC-19:
6.3
Digital Techniques
Power-reduction techniques at the digital level:
multiple clocks: Put a high-speed clock in performance-critical parts of the design and a low-speed clock in the remainder of the circuit.
clock gating: Turn off the clock to portions of a chip when they are not being used.
data encoding: Gray coding vs one-hot vs fully encoded vs ...
asynchronous circuits: Get rid of clocks altogether.
Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html
LEC-19:
6.4
6.4
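If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:
$\text{Power} = \left(\text{ActFact} \cdot \text{ClockSpeed} \cdot \tfrac{1}{2} \text{CapLoad} \cdot \text{VoltSup}^2\right) + \left(\text{ActFact} \cdot \text{ClockSpeed} \cdot \text{TimeShort} \cdot \text{IShort} \cdot \text{VoltSup}\right) + \left(\text{ILeak} \cdot \text{VoltSup}\right)$
we see that the switching term depends on the square of VoltSup.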
LEC-19:
6.4
10
LEC-19:
6.4
11
Answer:
d = 20 ns: current delay along critical path
d' = ??: new delay along critical path
V = 2.8 V: current supply voltage
V' = 2.2 V: new supply voltage
Vt = 0.7 V: threshold voltage

$\text{MaxClockSpeed} \propto 1/d$
$\text{MaxClockSpeed} \propto (V - V_t)^2 / V$

$\dfrac{d'}{d} = \dfrac{(V - V_t)^2 / V}{(V' - V_t)^2 / V'}$

$d' = 20\,\text{ns} \times \dfrac{(2.8 - 0.7)^2 / 2.8}{(2.2 - 0.7)^2 / 2.2} \approx 31\,\text{ns}$
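The arithmetic in this answer can be reproduced with a few lines. This is a minimal sketch that assumes only the proportionality MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup from the glossary; the function names are illustrative.

```python
def relative_max_clock_speed(volt_sup, volt_thresh):
    """Quantity proportional to MaxClockSpeed: (V - Vt)^2 / V."""
    return (volt_sup - volt_thresh) ** 2 / volt_sup


def scaled_delay(delay, v_old, v_new, v_thresh):
    """Delay is inversely proportional to MaxClockSpeed, so scale by the speed ratio."""
    return delay * relative_max_clock_speed(v_old, v_thresh) / relative_max_clock_speed(v_new, v_thresh)


new_delay = scaled_delay(20e-9, v_old=2.8, v_new=2.2, v_thresh=0.7)
print(f"{new_delay * 1e9:.1f} ns")
```

The result is about 30.8 ns, which the slide rounds to 31 ns.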
LEC-19:
6.4
12
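To keep the clock speed up at a lower supply voltage, the threshold voltage must also be lowered, which increases the leakage current:
$\text{ILeak} \propto e^{-q \cdot \text{VoltThresh} / (k \cdot T)}$
And increasing the leakage current increases the power: $\text{Power} \propto \text{ILeak}$. So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power) and increasing ILeak (which has a linear effect on increasing power).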
LEC-19:
6.5
13
6.5
LEC-19:
6.5.1
14
LEC-19:
6.5.2
Example Problem
15
6.5.2
Example Problem
LEC-19:
6.5.2
Example Problem
16
Required behaviour You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)
Question: What is the relative amount of power consumption for the different options?
LEC-19:
6.5.2
Example Problem
17
Assumptions:
1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

Decimal  Gray  One-Hot           Binary
0        0000  0000000000000001  0000
1        0001  0000000000000010  0001
2        0011  0000000000000100  0010
3        0010  0000000000001000  0011
4        0110  0000000000010000  0100
5        0111  0000000000100000  0101
6        0101  0000000001000000  0110
7        0100  0000000010000000  0111
8        1100  0000000100000000  1000
9        1101  0000001000000000  1001
10       1111  0000010000000000  1010
11       1110  0000100000000000  1011
12       1010  0001000000000000  1100
13       1011  0010000000000000  1101
14       1001  0100000000000000  1110
15       1000  1000000000000000  1111
LEC-19:
6.5.2
Example Problem
18
6.5.2.3 Answer
LEC-19:
6.5.2
Example Problem
19
Outline of Thinking
Factors to consider that distinguish the options: capacitance and activity factor. Capacitance is dependent upon the number of signals, and whether a signal is combinational or a flop.
LEC-19:
6.5.2
Example Problem
20
d(1) PLA
d(2) PLA
d(3) PLA
PLA
done
The Gray and Binary counters have the same design, and the Gray counter will have the lower activity factor. Therefore, the Gray counter will have lower power than the Binary counter.
LEC-19:
6.5.2
Example Problem
21
However, we dont know how much lower the power of the Gray counter will be, and we dont know how much power the One-Hot counter will consume.
LEC-19:
6.5.2
Example Problem
22
Capacitance
Counter  Signal  Type   cap  number  subtotal cap
Gray     d()     PLAs   2    4       8
Gray     d()     Flops  1    4       4
Gray     done    PLAs   2    1       2
Gray     done    Flops  1    0       0
1-Hot    d()     PLAs   2    0       0
1-Hot    d()     Flops  1    16      16
1-Hot    done    PLAs   2    0       0
1-Hot    done    Flops  1    0       0
Binary   d()     PLAs   2    4       8
Binary   d()     Flops  1    4       4
Binary   done    PLAs   2    1       2
Binary   done    Flops  1    0       0
LEC-19:
6.5.2
Example Problem
23
LEC-19:
6.5.2
Example Problem
24
Activity Factor
LEC-19:
6.5.2
Example Problem
25
Gray coding
LEC-19:
6.5.2
Example Problem
26
One-hot coding
LEC-19:
6.5.2
Example Problem
27
Binary coding
LEC-19:
6.5.2
Example Problem
28
s s
LEC-19:
6.5.2
Example Problem
29
1-Hot
d() done
Binary
d() done
LEC-19:
6.5.2
Example Problem
30
Final Answer
If we choose Binary counting as the baseline, then the relative amounts of power are:
Gray      54%
One-Hot   35%
Binary    100%

If we choose One-Hot counting as the baseline, then the relative amounts of power are:
Gray      156%
One-Hot   100%
Binary    288%
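The final percentages can be approximately reproduced from the encoding table and the capacitance table. The sketch below makes two assumptions that are not stated in the surviving slide text: the PLA output for each d() bit has the same activity factor as its flop, and the done signal has an activity factor of 1/16. The second value was picked because it reproduces the quoted percentages; the slides that derive it did not survive extraction.

```python
def transitions_per_cycle(codes):
    """Average number of bits that flip per increment, including the wrap-around."""
    n = len(codes)
    flips = sum(bin(codes[i] ^ codes[(i + 1) % n]).count("1") for i in range(n))
    return flips / n


ENCODINGS = {
    "Gray": [i ^ (i >> 1) for i in range(16)],
    "One-Hot": [1 << i for i in range(16)],
    "Binary": list(range(16)),
}
WIDTH = {"Gray": 4, "One-Hot": 16, "Binary": 4}
# Subtotal capacitances from the capacitance table (PLAs + Flops).
D_CAP = {"Gray": 8 + 4, "One-Hot": 0 + 16, "Binary": 8 + 4}
DONE_CAP = {"Gray": 2, "One-Hot": 0, "Binary": 2}
DONE_ACT = 1 / 16  # assumption, see the lead-in


def relative_power(name):
    """Sum of capacitance * activity factor; the common f and V^2 factors cancel."""
    per_signal_activity = transitions_per_cycle(ENCODINGS[name]) / WIDTH[name]
    return D_CAP[name] * per_signal_activity + DONE_CAP[name] * DONE_ACT


power = {name: relative_power(name) for name in ENCODINGS}
for name in power:
    print(name, f"{100 * power[name] / power['Binary']:.0f}% of Binary",
          f"{100 * power[name] / power['One-Hot']:.0f}% of One-Hot")
```

The activity factors computed here are 1 bit per increment for Gray, 2 for One-Hot, and 1.875 for Binary, which is why Gray beats Binary on the same hardware.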
LEC-20 Preliminaries
clock gating: idea circuitry for clock gating power analysis of clock gating
LEC-20:
6.6
CLOCK GATING
6.6
Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.
LEC-20:
6.6.1
LEC-20:
6.6.1
LEC-20:
6.6.1
Increases area
| | |
LEC-20:
6.6.1
LEC-20:
6.6.2
6.6.2
Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.
i_data i_valid clk o_data
o_valid
LEC-20:
6.6.2
10
Question: How much power will be saved in the following clock-gating scheme?
70% of the time the main circuit has valid data
clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
clock gating circuit has 10% of the area of the main circuit
clock gating circuit has the same activity factor as the main circuit
neglect short-circuiting and leakage power
LEC-20:
6.6.2
11
LEC-20:
6.6.2
12
Definitions:
PwrMain: power for main circuit without clock gating
PwrMain': power for main circuit with clock gating
PwrClkFsm: power for clock enable state machine
Eff: effectiveness of clock gating
PctValid: percentage of clock cycles with valid data
PctClk: percentage of clock cycles that the clock toggles

Because PwrLk and PwrShort are neglected:
$\text{Pwr} = \text{PwrSw} + \text{PwrLk} + \text{PwrShort} = \text{PwrSw} = \tfrac{1}{2} \cdot A \cdot C \cdot V^2$

$\text{PctClk} = 1 - \text{Eff} \cdot (1 - \text{PctValid})$
Intuition: when Eff = 0%, PctClk = 100%; when Eff = 100%, PctClk = PctValid.

$A_{Main'} = \text{PctClk} \cdot A = (1 - 0.9 \cdot (1 - 0.7)) \cdot A = 0.73A$
$C_{Main} = C$
$A_{ClkFsm} = A$
$C_{ClkFsm} = 0.1C$

$\text{PwrTot} = \text{PwrMain'} + \text{PwrClkFsm} = \tfrac{1}{2} \cdot (0.73A) \cdot C \cdot V^2 + \tfrac{1}{2} \cdot A \cdot (0.1C) \cdot V^2$

$\dfrac{\text{PwrTot}}{\text{PwrMain}} = \dfrac{0.73A + 0.1A}{A} = 0.83$

So this clock-gating scheme saves 17% of the power.
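The clock-gating estimate can be wrapped in a small function so that other effectiveness and valid-data percentages can be tried. This is a sketch under the same assumptions as the question (only switching power counts, and capacitance scales with area); the function and parameter names are illustrative.

```python
def pct_clk(eff, pct_valid):
    """Fraction of cycles in which the gated clock toggles: 1 - Eff * (1 - PctValid)."""
    return 1 - eff * (1 - pct_valid)


def relative_power_with_gating(eff, pct_valid, fsm_area_ratio, fsm_activity_ratio=1.0):
    """PwrTot / PwrMain for a gated main circuit plus its clock-enable state machine."""
    main_gated = pct_clk(eff, pct_valid)              # activity of main circuit scales with PctClk
    clk_fsm = fsm_area_ratio * fsm_activity_ratio     # state machine is never gated
    return main_gated + clk_fsm


ratio = relative_power_with_gating(eff=0.9, pct_valid=0.7, fsm_area_ratio=0.1)
print(f"PwrTot / PwrMain = {ratio:.2f}, saving = {100 * (1 - ratio):.0f}%")
```

With a perfectly ineffective gating circuit (eff = 0) the ratio rises above 1, because the state machine adds capacitance without removing any clock activity.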
LEC-20:
6.6.2
14
LEC-20:
6.6.2
15
Valid-Bit Protocol
clk i_valid i_data clk i_valid i_data o_valid o_data o_valid o_data
i_valid: high when i_data has valid data. Signifies whether the circuit should pay attention to or ignore the data.
o_valid: high when o_data has valid data. Signifies whether the environment should pay attention to the output of the circuit.
For more on circuit protocols, see section 2.4.10.
cool_clk
wakeup_out
hot_clk: clock that always toggles
cool_clk: gated clock; sometimes toggles, sometimes stays low
wakeup: alerts the circuit that valid data will be arriving soon
clk_en: turns on cool_clk
Design Decisions
What level of granularity for gated clocks? Entire module? Individual pipe stages? Something in between?
When should the clocks turn off?
When should the clocks turn on?
Protocol for incoming wakeup signal?
Protocol for outgoing wakeup signal?
LEC-20:
6.6.2
31
Wakeup Protocol
Designers negotiate incoming and outgoing wakeup protocol with environment. Example wakeup protocol:
wakeup_in will arrive 1 clock cycle before valid data
wakeup_in will stay high until there have been at least 3 cycles of invalid data
same protocol for wakeup_out
LEC-20:
6.6.3
Design Problem
32
6.6.3
Design Problem
Design a clock enable state machine for a pipelined module whose latency varies from 5 to 10 clock cycles and that can hold a maximum of 6 instructions (parcels of data).
LEC-20:
6.6.3
Design Problem
33
Design Strategy
When designing clock gating circuitry, consider the two extreme cases:
a constant stream of valid data
the circuit is turned off and receives a single parcel of valid data

For a constant stream of valid data, the key is to not incur a large overhead in design complexity, area, or clock period when clocks will always be toggling.
For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data can percolate through the circuit. Also, we want to turn off the clock as soon as possible after the data leaves.
LEC-20:
6.6.3
Design Problem
34
LEC-20:
6.6.3
Design Problem
35
Scenario 1
1. Scenario: turned off and get one parcel.
   (a) Need to turn on and stay on until the parcel departs.
   (b) Idea #1 (parcel count): count the number of parcels inside the module; keep clocks toggling if there is a non-zero number of parcels.
   (c) Idea #2 (cycle count): count the number of clock cycles since the last valid parcel entered the module; once we hit 10 clock cycles without any valid parcels entering, we know that all parcels have exited. Keep clocks toggling if the counter is less than 10.
LEC-20:
6.6.3
Design Problem
36
Scenario 2
2. Scenario: constant stream of parcels.
   (a) Parcel count would require looking at the input and output streams and conditionally incrementing or decrementing the counter.
   (b) Cycle count would keep resetting the counter.
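The two ideas can be compared with a behavioural sketch of each clock-enable scheme. This is not the course's implementation: the cycle-accounting conventions (a parcel that "leaves after 5 clock cycles" is taken to occupy cycles 0 to 4, and the clock is still enabled in the cycle in which a parcel leaves) and all function names are assumptions made for illustration, and the wakeup protocol is not modelled.

```python
def cycle_count_clk_en(i_valid, max_latency=10):
    """Cycle count: keep clocks on until max_latency cycles pass with no new valid parcel."""
    counter = max_latency  # start idle: no parcel can still be inside the module
    clk_en = []
    for valid in i_valid:
        if valid:
            counter = 0            # a new parcel entered: restart the count
        elif counter < max_latency:
            counter += 1
        clk_en.append(counter < max_latency)
    return clk_en


def parcel_count_clk_en(i_valid, o_valid):
    """Parcel count: keep clocks on while at least one parcel is inside the module."""
    parcels = 0
    clk_en = []
    for entering, leaving in zip(i_valid, o_valid):
        parcels += int(entering)
        clk_en.append(parcels > 0)  # the cycle in which a parcel leaves still needs a clock
        parcels -= int(leaving)
    return clk_en


# Scenario 1: the module is off and a single parcel enters at cycle 0.
n_cycles = 15
i_valid = [cycle == 0 for cycle in range(n_cycles)]
o_valid = [cycle == 4 for cycle in range(n_cycles)]  # parcel occupies cycles 0..4

cycle_on = sum(cycle_count_clk_en(i_valid))
parcel_on = sum(parcel_count_clk_en(i_valid, o_valid))
print(cycle_on, parcel_on)
```

Under these assumptions the cycle-count scheme keeps the clock running for 10 cycles and the parcel-count scheme for 5, at the price of the parcel counter needing to watch both the input and output streams.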
LEC-20:
6.6.3
Design Problem
37
LEC-20:
6.6.3
Design Problem
38
0 1 2 0 0 0 0 1 2 3 4 0 1 2 3 4 5 6 7 8 9 10 cycle_clk_en
LEC-20:
6.6.3
Design Problem
39
Design Analysis
The two factors affecting power are activity factor and capacitance, assuming that:
both designs will be implemented on the same technology
leakage current is negligible
short-circuit power is negligible
LEC-20:
6.6.3
Design Problem
43
Power If parcel leaves after 5 clock cycles, cycle count will continue to power circuit for another 5 cycles (wasting power!). So, it looks like parcel count wins. However, we should carry out a detailed analysis to see how much difference there is between the two options.
y s
y w s
LEC-20:
6.6.3
Design Problem
44
Behavioural Analysis
Question:
Goal: determine what percentage of time cool_clk is toggling for each of the two design options.
Assuming:
60% of incoming data are valid
even distribution of latencies
average length of continuously valid data is 80 instructions
Answer:
LEC-20:
6.6.3
Design Problem
45
y y y
LEC-20:
6.6.3
Design Problem
46
y ~ w y t | w t
LEC-20:
6.6.3
Design Problem
47
~ w
LEC-20:
6.6.3
Design Problem
48
Wrapup
5. Summary

                      Parcel Count   Cycle Count
Capacitance           19.5           20
Percentage clocking   65.8%          67.7%

6. Parcel count wins on both capacitance and activity factor, therefore it has the lowest power consumption.
7. How much more power does the cycle count design consume?
$\dfrac{20 \times 0.677}{19.5 \times 0.658} \approx 1.055$, so the cycle count design consumes about 5.5% more power.
Chapter 7
49
LEC-20:
7.1
INTRODUCTION
50
7.1
Introduction
LEC-20:
7.1.1
51
7.1.1
The purpose of this lecture is to explain the sources of manufacturing faults, how the faults are caught, and the tradeoffs in trying to catch these faults. We will then introduce the mathematical models for the physical faults.
physical faults wired-AND wired-OR stronger wins mathematical model of fault causes of faults testing burn in bin sorting scan testing built-in self test IDDQ testing economics of testing locations of faults test vector to detect a fault single stuck-at faults undetectable faults redundant circuitry fault domination fault collapsing
fault equivalence gate collapsing node collapsing fault collapsing (intelligent collapsing) fault coverage test vector generation required test vectors order to run test vectors fault hiding scan testing scan chain testing procedure time to run a test boundary scan testing JTAG IEEE 1149 length of time to do a scan test hardware to do scan testing
LEC-20:
7.1.2
Background Material
52
7.1.2
Background Material
Karnaugh maps
LEC-20:
7.1.3
Reading Material
53
7.1.3
Smith ch14
Reading Material
LEC-21 Preliminaries
physical faults wired-AND wired-OR stronger wins mathematical model of fault causes of faults testing burn in bin sorting
scan testing built-in self test IDDQ testing economics of testing locations of faults test vector to detect a fault single stuck-at faults undetectable faults redundant circuitry
LEC-21 Preliminaries
Background Material
Karnaugh maps
LEC-21 Preliminaries
Reading Material
Smith ch14
LEC-21:
7.2
7.2
LEC-21:
7.2.1
7.2.1
LEC-21:
7.2.1
Shorted wires
Open wire
LEC-21:
7.2.1
Fabrication process (initial construction is bad) chemical mix impurities dust Manufacturing process (damage during construction) handling probing cutting mounting materials corrosion adhesion failure cracking peeling
LEC-21:
7.2.1
10
LEC-21:
7.2.1
11
The hope is that the extreme conditions will cause chips to break that would otherwise have broken in the customer's system soon after arrival. The trick is to create conditions that are extreme enough that bad chips will break, but not so extreme as to cause good chips to break.
LEC-21:
7.2.1
12
s w
LEC-21:
7.2.1
13
Scan Testing:
Load test vector from tester into chip.
Run chip on test data.
Unload result data from chip to tester.
Compare results from chip against those produced by simulation.
If results are different, then the chip was not manufactured correctly.

Built In Self Test (BIST) (Smith 14.7): Build circuitry on chip that generates tests and compares actual and expected results.

IDDQ Testing (Smith 14.3.6): Measure the quiescent current between VDD and GND. Variations from expected values indicate faults.
LEC-21:
7.2.1
14
Challenges
The challenges in testing:
test circuitry consumes chip area
test circuitry reduces performance
decrease the fault escapee rate of product that ships while having minimal impact on production cost and chip performance
an external tester can only look at I/O pins
the ratio of internal signals to I/O pins is increasing
some faults will only manifest themselves at high clock frequencies

"The crux of testing is to use yesterday's technology to find faults in tomorrow's chips." (Agilent engineer at ARVLSI 2001)
LEC-21:
7.2.1
15
LEC-21:
7.2.2
16
Economics of Testing
The ACHIP costs $10 without any testing.
Each board uses one ACHIP (plus lots of other chips that we don't care about).
68% of the manufactured ACHIPs do not have any faults.
For the ACHIP, it costs $1 per chip to catch half of the faults.
Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run).
If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP.
Board-level testing will detect 100% of the faults in an ACHIP.
LEC-21:
7.2.2
17
Economics of Testing
Question: What escapee fault rate will minimize the cost of the ACHIP?

For high-volume, small-area chips, testing can consume more than 50% of the total cost.

$\text{TotCost} = \text{NoTestCost} + \text{TestCost} + \text{EscapeeProb} \times \text{ReplaceCost}$

EscapeeProb   EscapeeProb × ReplaceCost
0.32          200 × 0.32 = $64
0.16          200 × 0.16 = $32
0.08          200 × 0.08 = $16
0.04          200 × 0.04 = $8
0.02          200 × 0.02 = $4
0.01          200 × 0.01 = $2
0.005         200 × 0.005 = $1
7.2.3
18
7.2.3
LEC-21:
7.2.3
19
Bad Circuits
[Figures: each fault drawn on a circuit with signals a, b, c, d]
open
wired-AND bridging short
wired-OR bridging short
stronger-wins bridging short (b is stronger)
short to VDD
short to GND
LEC-21:
7.2.3
20
BAD
OK
BAD
b BAD
BAD
OK
LEC-21:
7.2.3
21
[Figure: two layouts of the same schematic (signals b, e, f, g, h, i), one with fault locations L1 to L4 and one with fault locations L1 to L5]
For the same schematic, we can have either four or five different locations for potential faults, depending upon how the circuit is laid out.
LEC-21:
7.2.3
22
LEC-21:
7.2.4
Detecting a Fault
23
7.2.4
Detecting a Fault
To detect a fault, we compare the actual output of the circuit against the expected value.
LEC-21:
7.2.4
Detecting a Fault
24
Faulty circuit The only test vector that will detect the fault in the circuit is 110. Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.
| P
LEC-21:
7.2.4
Detecting a Fault
25
Another fault The test vector 110 can catch both this fault and the previous one.
Test vector: a = 1, b = 1, c = 0. Good circuit output: 1. Faulty circuit output: 0.
LEC-21:
7.2.5
26
The possible faults in a circuit are dependent upon the physical layout of the circuit.
There is a very wide variety of possible faults.
A single test vector can catch many different faults.
Need: a mathematical model for faults that is abstracted from the complexities of circuit layout and the plethora of possible faults, yet still detects most or all possible faults.
LEC-21:
7.2.5
27
LEC-21:
7.2.5
28
L10
L12
L11
If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider:
2 types of faults × 12 locations = 24
7.2.5
29
L10@0,1
L12@0,1
L11@0,1
If we allowed multiple faults, then we could have up to 12 different faults in the same circuit. How many faulty circuits would need to be considered? Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore, $3^{12} \approx 5.3 \times 10^5$ different circuits would need to be considered! If we allowed multiple faults of 4 different types at 12 different locations, then we would have $5^{12} \approx 2.4 \times 10^8$ different faulty circuits to consider!
7.2.5
30
s w y
s w
LEC-21:
7.2.6
LEC-21:
7.2.6
7.2.6.1 Algorithm
1. Compute the Karnaugh map for the correct circuit.
2. Compute the Karnaugh map for the faulty circuit.
3. Find the region of disagreement.
4. Any assignment in the region of disagreement is a test vector that will detect the fault.
5. Any assignment outside of the region of disagreement will result in the same output on both the correct and the faulty circuit.
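The algorithm amounts to comparing two truth tables, which can be sketched as follows. The example uses the function z = ab + bc and the faulty equation ab + c that these notes list later for the fault L5@1; the helper name and the convention of writing vectors as the string "abc" are assumptions made for illustration.

```python
from itertools import product


def region_of_disagreement(good, faulty, num_inputs=3):
    """Return every input assignment on which the good and faulty circuits differ."""
    vectors = []
    for bits in product((0, 1), repeat=num_inputs):
        if good(*bits) != faulty(*bits):
            vectors.append("".join(str(bit) for bit in bits))
    return vectors


def good(a, b, c):
    return (a & b) | (b & c)   # z = ab + bc


def faulty(a, b, c):
    return (a & b) | c         # z = ab + c


test_vectors = region_of_disagreement(good, faulty)
print(test_vectors)
```

Enumerating the truth table replaces drawing the two Karnaugh maps by hand; for three inputs the region of disagreement here contains two squares.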
LEC-21:
7.2.6
d e
a b c
a c
d e
b
Good circuit
a c
Faulty circuit
LEC-21:
7.2.7
Undetectable Faults
34
7.2.7
Undetectable Faults
Not all faults are detectable.
1. If a circuit is irredundant, then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If you are not trying to find all of the faults in a circuit, then a fault that you aren't looking for can mask a fault that you are looking for.
LEC-21:
7.2.7
Undetectable Faults
35
LEC-21:
7.2.7
Undetectable Faults
36
Timing Hazards
Static hazard Dynamic hazard Timing hazards are often removed by adding redundant circuitry.
LEC-21:
7.2.7
Undetectable Faults
37
Redundant Circuitry
a b
a b
1,1 1,0
c 1,0 1,0,1 d
e f g
d c
1,1
0,1
0,1
LEC-21:
7.2.7
Undetectable Faults
38
Redundant Circuitry
In this sum-of-products style circuit, each in the Karnaugh map.
a c b
AND
We can prevent this transition from causing a glitch by adding a cube that covers the two squares of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.
a c b c a b
LEC-21:
7.2.7
Undetectable Faults
39
Redundant Circuitry
a b c a b e h d c f d e
L1
f h g
Redundant circuit
LEC-21:
7.2.7
Undetectable Faults
40
Redundant Circuitry
L1@0 is undetectable.
Correct circuit: $ab + \bar{b}c$
Faulty circuit: $ab + \bar{b}c + ac$. With L1@0, $ac = 0$, so:
$ab + \bar{b}c + 0 = ab + \bar{b}c$
This is the same equation as the correct circuit.
LEC-21:
7.2.7
Undetectable Faults
41
a z z c
b c
a c
L1 L3
So, the signal b and the two extra XOR gates are redundant.
LEC-21:
7.2.7
Undetectable Faults
42
a z
c
b c
L1 L3
z c
eqn a a b b c c
K-map
a c b
diff w/ ckt
a c b
The lesson is that not all faults in redundant circuitry are undetectable.
v u v u
a c
b c
LEC-22 Preliminaries
node collapsing
redundant circuitry addendum fault domination fault collapsing fault equivalence gate collapsing
fault collapsing (intelligent collapsing) fault coverage test vector generation required test vectors order to run test vectors fault hiding
LEC-22:
7.3
FAULTS
7.3
Faults
LEC-22:
7.3.1
Locations of Faults
7.3.1
a b c
Locations of Faults
L4 L2 L5
At first, we will consider only the following faults: L2@1, L4@1, L5@1.
ab
bc
b
LEC-22:
7.3.1
Locations of Faults
fault      eqn       test vectors
1) L2@1    a + c     101, 100, 001
2) L4@1    a + bc    101, 100
3) L5@1    ab + c    101, 001
If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of the three faults.
LEC-22:
7.3.2
7.3.2
The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.
LEC-22:
7.3.2
fault      eqn       test vectors
1) L5@1    ab + c    101, 001
2) L6@1

Any test vector that detects L5@1 will also detect L6@1.
Definition: f1 dominates f2: any test vector that detects f1 will also detect f2.
L5@1 dominates L6@1. When choosing test vectors, we can ignore L6@1 and just include L5@1.
Question: What would happen if we ignored L5@1 and just included L6@1?
Answer: If we chose 100, 010, or 000 as our test vector to detect L6@1, then we would not detect L5@1.
LEC-22:
7.3.2
10
fault      eqn
1) L1@1    b
2) L3@1    b

The two faults above are equivalent.
Definition: f1 is equivalent to f2: f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.
When choosing test vectors, we can ignore one of the faults and just include the other.
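The definitions of domination and equivalence can be checked mechanically by comparing sets of detecting test vectors. The sketch below uses z = ab + bc with the faulty equations given in these notes for L1@1, L3@1, and L5@1; the equation z = 1 for L6@1 is an inference (L6 is assumed to be the output of the a·b AND gate), and the helper names are illustrative.

```python
from itertools import product


def good(a, b, c):
    return (a & b) | (b & c)   # z = ab + bc


FAULTY = {
    "L1@1": lambda a, b, c: b,              # a stuck at 1: b + bc = b
    "L3@1": lambda a, b, c: b,              # c stuck at 1: ab + b = b
    "L5@1": lambda a, b, c: (a & b) | c,    # ab + c
    "L6@1": lambda a, b, c: 1,              # assumed: output of the ab gate stuck at 1
}


def detecting_vectors(fault):
    """Set of vectors (written as 'abc') on which the faulty circuit differs from the good one."""
    return {"".join(map(str, bits))
            for bits in product((0, 1), repeat=3)
            if good(*bits) != FAULTY[fault](*bits)}


def dominates(f1, f2):
    """f1 dominates f2 (as defined in these notes): every test for f1 also detects f2."""
    return detecting_vectors(f1) <= detecting_vectors(f2)


def equivalent(f1, f2):
    """f1 and f2 are detected by exactly the same set of test vectors."""
    return detecting_vectors(f1) == detecting_vectors(f2)


print(dominates("L5@1", "L6@1"), dominates("L6@1", "L5@1"))
print(equivalent("L1@1", "L3@1"))
```

The asymmetry in the first line of output is why a test chosen only for L6@1, such as 000, can miss L5@1.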
LEC-22:
7.3.2
11
AND
@1
@0
@0
@1
NAND
LEC-22:
7.3.2
12
NOT-1
@0
NOT-0
With the net-fault model, which is the one we are using in E&CE 427, inverters and buffers are the only gates where node collapsing is relevant. With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.
LEC-22:
7.3.2
13
gate collapsing node collapsing general fault equivalence (intelligent collapsing) fault domination
LEC-22:
7.3.3
Fault Coverage
14
7.3.3
Fault Coverage
Definition: Fault coverage is the percentage of detectable faults that are detected by a set of test vectors.
\[ \text{FaultCoverage} = \frac{\text{DetectedFaults}}{\text{DetectableFaults}} \]
NOTE: In Smith's book, undetectable faults don't hurt your coverage. This is not universally true. Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.
7.3.4
[Figure: example circuit with inputs a, b, c, output z, and wire segments L1–L8]
Gate collapsing
[Figure: gate collapsing applied to the example circuit, removing stuck-at-0 faults that are equivalent across each gate]
Node Collapsing
Node collapsing: none applicable (no inverters or buffers).
[Figure: example circuit annotated with the remaining faults]
Intelligent Collapsing
Equivalent faults: L2@0 and L8@0; L1@1 and L3@1.
[Figure: example circuit annotated with the equivalent fault pairs]
Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1
     fault   eqn
1)   L2@1    a+c
2)   L3@1    b
3)   L4@1    a+bc
4)   L5@1    ab+c
5)   L6@0    bc
6)   L7@0    ab
7)   L8@0    0
8)   L8@1    1
[K-map and Diff w/ ckt columns of the table are not recoverable]
Remaining Faults
     fault   eqn
1)   L3@1    b
2)   L4@1    a+bc
3)   L5@1    ab+c
4)   L6@0    bc
5)   L7@0    ab
[Figure: example circuit annotated with the remaining faults L3@1, L4@1, L5@1, L6@0, and L7@0]
     fault   eqn
1)   L4@1    a+bc
2)   L5@1    ab+c
The intersection of the two difference regions is 101. Choosing 101 detects both L4@1 and L5@1. Add 101 to the suite of test vectors. The final set of test vectors is: 010, 110, 011, 101.
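The choice of test vectors can be cross-checked with an exhaustive search for the smallest set that detects every remaining fault. This is only a sketch: it assumes the circuit is z = ab + bc with segments L1 (a), L2 (b), L3 (c), L4/L5 (fanout of b), L6 = L1·L4, L7 = L5·L3, L8 = L6 + L7, and the names below are made up for illustration.

```python
from itertools import combinations, product

def simulate(a, b, c, faults=None):
    """Evaluate the assumed circuit z = ab + bc with optional stuck-at faults."""
    faults = faults or {}

    def seg(name, value):
        return faults.get(name, value)

    l1, l2, l3 = seg("L1", a), seg("L2", b), seg("L3", c)
    l4, l5 = seg("L4", l2), seg("L5", l2)
    l6 = seg("L6", l1 & l4)
    l7 = seg("L7", l5 & l3)
    return seg("L8", l6 | l7)

def detecting_vectors(faults):
    """Vectors 'abc' whose faulty output differs from the correct output."""
    return {
        f"{a}{b}{c}"
        for a, b, c in product((0, 1), repeat=3)
        if simulate(a, b, c) != simulate(a, b, c, faults)
    }

# Faults left after collapsing and domination, as listed in the notes.
REMAINING = {
    "L3@1": {"L3": 1},
    "L4@1": {"L4": 1},
    "L5@1": {"L5": 1},
    "L6@0": {"L6": 0},
    "L7@0": {"L7": 0},
}
ALL_VECTORS = ["".join(bits) for bits in product("01", repeat=3)]

def smallest_test_sets(faults):
    """Return every minimum-size vector set that detects all given faults."""
    detect = {name: detecting_vectors(f) for name, f in faults.items()}
    for size in range(1, len(ALL_VECTORS) + 1):
        found = [
            set(combo)
            for combo in combinations(ALL_VECTORS, size)
            if all(detect[name] & set(combo) for name in detect)
        ]
        if found:
            return found
    return []

print(smallest_test_sets(REMAINING))
```

Exhaustive search is practical here only because there are 8 possible vectors; for real circuits the set-cover problem needs heuristics.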
[Table: faults L1@0 through L8@1 (rows) against the test vectors 110, 010, 011, and 101 (columns), with a 1 marking each fault that a vector detects; the column totals are 5, 5, 5, and 6]
101 detects the most faults, so we should run it first. This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101). This leaves 110 and 011 with 5 faults each; we can run them in either order, then run 010. We settle on a final order for our test suite of: 101, 011, 110, 010.
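The ordering argument is a greedy selection: at each step, run the vector that detects the most faults not yet detected. The sketch below assumes the circuit is z = ab + bc with segments L1 (a), L2 (b), L3 (c), L4/L5 (fanout of b), L6 = L1·L4, L7 = L5·L3, L8 = L6 + L7. Ties (here 011 versus 110) are broken by list position, which is an arbitrary choice of this sketch.

```python
SEGMENTS = ["L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"]

def simulate(a, b, c, faults=None):
    """Evaluate the assumed circuit z = ab + bc with optional stuck-at faults."""
    faults = faults or {}

    def seg(name, value):
        return faults.get(name, value)

    l1, l2, l3 = seg("L1", a), seg("L2", b), seg("L3", c)
    l4, l5 = seg("L4", l2), seg("L5", l2)
    l6 = seg("L6", l1 & l4)
    l7 = seg("L7", l5 & l3)
    return seg("L8", l6 | l7)

def detects(vector, fault):
    """True if the vector 'abc' exposes the single stuck-at fault (segment, value)."""
    a, b, c = (int(bit) for bit in vector)
    name, value = fault
    return simulate(a, b, c) != simulate(a, b, c, {name: value})

def order_vectors(vectors):
    """Greedy ordering: most newly detected faults first."""
    undetected = {(name, value) for name in SEGMENTS for value in (0, 1)}
    remaining = list(vectors)
    order, counts = [], []
    while remaining:
        def newly_found(vec):
            return {f for f in undetected if detects(vec, f)}
        best = max(remaining, key=lambda vec: len(newly_found(vec)))
        found = newly_found(best)
        order.append(best)
        counts.append(len(found))
        undetected -= found
        remaining.remove(best)
    return order, counts

order, counts = order_vectors(["010", "011", "101", "110"])
print(order)   # order in which to run the vectors
print(counts)  # number of new faults each vector finds at its position
```

The counts show how many additional faults each vector contributes once the earlier vectors have run, which is the quantity that matters when a test can stop at the first failure.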
      fault   eqn      notes
1)    L1@0    bc
2)    L1@1    b
3)    L2@0    0        dominated by 1, 5
4)    L2@1    a+c      dominated by 8, 10
5)    L3@0    ab
6)    L3@1    b        same as 2
7)    L4@0    bc       same as 1
8)    L4@1    a+bc
9)    L5@0    ab       same as 5
10)   L5@1    ab+c
11)   L6@0    bc       same as 1
12)   L6@1    1
13)   L7@0    ab
14)   L7@1    1
15)   L8@0    0
16)   L8@1    1
[K-map and Diff w/ ckt columns of the table are not recoverable]
7.3.5
[Figure: circuit with inputs a, b, c and wire segments L1, L2, L3]
Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.
[Figure: circuit with inputs a, b, c, output z, and wire segments L1 and L3 marked]
Fault Hiding
Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0. In the presence of other faults, the set of test vectors that detects a fault will change.
[Table: faults L3@0 and L1@1,L3@0 with columns eqn, K-map, and Diff w/ ckt; the equation for L3@0 is ab]
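Fault hiding can be reproduced with a small simulation. The sketch assumes this is the same example circuit as before, z = ab + bc, with segments L1 (a), L2 (b), L3 (c), L4/L5 (fanout of b), L6 = L1·L4, L7 = L5·L3, L8 = L6 + L7; the function names are illustrative.

```python
from itertools import product

def simulate(a, b, c, faults=None):
    """Evaluate the assumed circuit z = ab + bc with optional stuck-at faults."""
    faults = faults or {}

    def seg(name, value):
        return faults.get(name, value)

    l1, l2, l3 = seg("L1", a), seg("L2", b), seg("L3", c)
    l4, l5 = seg("L4", l2), seg("L5", l2)
    l6 = seg("L6", l1 & l4)
    l7 = seg("L7", l5 & l3)
    return seg("L8", l6 | l7)

def detecting_vectors(faults):
    """Vectors 'abc' whose faulty output differs from the correct output."""
    return {
        f"{a}{b}{c}"
        for a, b, c in product((0, 1), repeat=3)
        if simulate(a, b, c) != simulate(a, b, c, faults)
    }

alone = detecting_vectors({"L3": 0})             # L3@0 by itself
together = detecting_vectors({"L1": 1, "L3": 0})  # L3@0 while L1 is stuck at 1
print(sorted(alone))
print(sorted(together))
print(alone & together)  # vectors that work in both situations
```

With these assumptions the two detection sets are disjoint: the vector chosen for L3@0 gives the correct output when L1@1 is also present, so the untested fault masks the tested one.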
LEC-23 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01–02   VHDL
wk-03–05   Design and Optimization
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08–09   Timing Analysis
wk-09–10   Power Analysis and Reduction
wk-11–12   Faults and Testing
             Lec-21 Introduction
             Lec-22 Detection and Test Generation
             Lec-23 Built-In Self Test (BIST)
             Lec-24 JTAG
wk-13      Review
7.4
7.4.1
Block Diagram
[Figure: built-in self test block diagram, built up over several slides, with environment inputs i_data(1) to i_data(3), internal signals d(0) to d(3), and outputs o_data(1) and o_data(2)]
7.4.1.1 Components
There is one test generator per group of inputs (or internal flops) that drive the same circuit to be tested. There is one signature analyzer per output (or internal flop).
NOTE: MISR. An exception to the above rule is a multiple input signature register (MISR), which can be used to analyze several outputs of the circuit under test (Smith 14.7.7).
The test generator and signature analyzer are both built with linear-feedback shift registers.
Test generator
generates a pseudo-random set of test vectors
for n output bits, generates all vectors from 1 to \(2^n - 1\) in a pseudo-random order
built with a linear-feedback shift register (shift-register portion is the input flops)
Signature analyzer
checks that the output it is examining has the correct results for the complete set of tests that are run
only has a meaningful result at the end of the entire test sequence
built with a linear-feedback shift register
similar to a hash function or a lossy compression function
if there are no faults, the signature analyzer will definitely say "ok" (no false negatives)
if there is a fault, the signature analyzer might say "ok" or might say "bad" (false positives are possible)
design tradeoff: more accurate signature analyzers require more hardware
Result checker
signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
the result checker looks at test vector inputs to detect the end of the test suite and outputs "all ok" if all signature analyzers report "ok" at that moment
implemented as an AND gate
Feedback Shift Register
Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.
Design parameters:
number of flip-flops
external or internal XOR
feedback taps (coefficients)
external-input or self-contained
reset or set
LFSR Example
[Figures: four example LFSRs built from three flip-flops (d0/q0, d1/q1, d2/q2), differing in reset versus set initialization and in whether there is an external input i]
Definition: pseudo random means the same elements in the same order every time, but the relationship between consecutive elements is apparently random.
Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.
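A three-flop maximal-length LFSR can be simulated in a few lines. The feedback choice here (q0 takes q1 XOR q2, and the other flops shift) is an assumption of this sketch; it was picked because, starting from the all-1s state, it reproduces the test-vector sequence 111, 011, 001, 100, 010, 101, 110 quoted later in these notes. The function name is hypothetical.

```python
def lfsr_sequence(n_steps, state=(1, 1, 1)):
    """Return n_steps successive states 'q0q1q2' of a 3-flop external-XOR LFSR."""
    q0, q1, q2 = state
    states = []
    for _ in range(n_steps):
        states.append(f"{q0}{q1}{q2}")
        # Feedback into the first flop; every other flop takes its neighbour's value.
        q0, q1, q2 = q1 ^ q2, q0, q1
    return states

sequence = lfsr_sequence(8)
print(sequence[:7])   # one full period
print(sequence[7])    # the sequence repeats from here
```

The same seven vectors come out in the same order on every run, which is the "pseudo" part; a maximal-length LFSR of n flops visits all \(2^n - 1\) nonzero states before repeating and never reaches the all-0s state.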
[Figure: maximal-length LFSR with set input and flip-flops d0/q0, d1/q1, d2/q2]
7.4.2
Test Generator
The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.
[Figure: the maximal-length LFSR (set, d0/q0, d1/q1, d2/q2) extended with a mode input and multiplexors that select between the shift path and the environment inputs i_d(0), i_d(1), i_d(2)]
7.4.3
Signature Analyzer
There are four things that change between different signature analyzers:
number of flops (more flops means more area and more accuracy)
choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
bubbles on input to AND gate for ok: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer
Vector
Signature Analyzer
This circuit:
[Figure: two-flop signature analyzer with reset, input i, flip-flops d0/q0 and d1/q1, and an AND gate producing ok]
Two flops; most analyzers use more. The HP boards in the 1970s used 37 flops!
Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
Also contains ok tester (AND gate).
Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of the bubble on the AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)
[Figure: signature analyzer waveforms for reset, clk, i (input bits i6 down to i0), d0, q0, d1, and q1]
7.4.4
Result Checker
The purpose of the result checker is to check the ok circuit at the end of the test sequence. To do this, we need to recognize the end of the test sequence. The simplest way to do this is to notice that the first test vector is all 1s and that the test vector sequence will repeat as long as the circuit is in test mode.
We want to sample the ok signal one clock cycle after the sequence is over. This is the same as the first clock cycle of the second test sequence. In this clock cycle, the output of the test generator will be all 1s and reset will be 0. We need to look at reset, because otherwise we could not distinguish the first sequence (when reset is 1) from the subsequent sequences.
[Figure: result checker combining reset, q0, q1, q2, and ok to produce all_ok]
7.4.5
Addition
Addition represents XOR.
expression        result
\(0 + 0\)         0
\(0 + 1\)         1
\(1 + 0\)         1
\(1 + 1\)         0
\(x + x\)         0
Multiplication
Multiplication represents concatenating shift registers.
expression              result
\(x^4 \times 1\)        \(x^4\)
\(x^2 \times x^3\)      \(x^5\)
Example
Calculate \((x^3 + x^2 + 1) \times (x^2 + x)\):
\[ (x^3 + x^2 + 1)\,x^2 + (x^3 + x^2 + 1)\,x = (x^5 + x^4 + x^2) + (x^4 + x^3 + x) = x^5 + x^3 + x^2 + x \]
7.4.6
The maximum exponent denotes the number of flops.
The other exponents denote the flops that tap off of the feedback line from the last flop.
[Figures: six example LFSRs with reset, with and without an external input i, using three or four flip-flops, each labelled with its characteristic polynomial p(x)]
See Smith's Fig 14.27 (p. 771), Fig 14.28 (p. 773), and Table 14.11 (p. 774).
The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent in the polynomial. Hence, MSB of the first partial product cancels the \(x^4\) of the second partial product, resulting in a coefficient of 0 for \(x^4\) in the answer.
\[ (x^2 + x)(x^3 + x^2 + 1) = x^2 (x^3 + x^2 + 1) + x (x^3 + x^2 + 1) = x^5 + x^3 + x^2 + x \]
7.4.7
bit sequence:   1        0        1        0        0        1        1
terms:          \(1x^6\)  \(0x^5\)  \(1x^4\)  \(0x^3\)  \(0x^2\)  \(1x^1\)  \(1x^0\)
polynomial:     \(x^6 + x^4 + x + 1\)
7.4.8
Division
With rules for multiplication and addition, we can define division. A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:
\[ m(x) = q(x)\,p(x) + r(x) \]
Long Division
In Galois fields, we do division just as with long division in elementary school. Given:
\[ m(x) = x^6 + x^4 + x^3, \qquad p(x) = x^4 + x \]
\[ \begin{aligned} (x^6 + x^4 + x^3) + x^2\,(x^4 + x) &= x^4 \\ x^4 + 1\,(x^4 + x) &= x \end{aligned} \]
Quotient: \( q(x) = x^2 + 1 \)
Remainder: \( r(x) = x \)
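Polynomial long division over GF(2) is easy to mechanize if each polynomial is stored as an integer whose bit i is the coefficient of \(x^i\). The sketch below uses that encoding (a representation chosen for this example, not something the notes prescribe) and checks the division of \(x^6 + x^4 + x^3\) by \(x^4 + x\).

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials held as bit masks (bit i = coefficient of x^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition of partial products is XOR
        a <<= 1
        b >>= 1
    return result

def gf2_divmod(m, p):
    """Long division of m(x) by p(x) over GF(2); returns (quotient, remainder)."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient, remainder = 0, m
    degree_p = p.bit_length() - 1
    while remainder.bit_length() - 1 >= degree_p:
        shift = remainder.bit_length() - 1 - degree_p
        quotient ^= 1 << shift       # next quotient term x^shift
        remainder ^= p << shift      # subtract (XOR) the shifted divisor
    return quotient, remainder

def poly_str(bits):
    """Render a bit mask as a polynomial string such as 'x^2 + 1'."""
    terms = []
    for i in range(bits.bit_length() - 1, -1, -1):
        if bits >> i & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms) or "0"

m = 0b1011000   # x^6 + x^4 + x^3
p = 0b10010     # x^4 + x
q, r = gf2_divmod(m, p)
print("q(x) =", poly_str(q))
print("r(x) =", poly_str(r))
print("check:", gf2_mul(q, p) ^ r == m)
```

Because addition and subtraction are both XOR in GF(2), the inner loop never needs a borrow, which is why the same operation maps directly onto shift registers with XOR gates.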
7.4.9
\[ m(x) = q(x)\,p(x) + r(x) \]
The sequence of output bits forms a quotient, q(x), of length n.
The flops in the analyzer form a remainder, r(x), of length l.
\[ m(x) + e(x) = q'(x)\,p(x) + r(x) \]
e(x) is the error polynomial: bits in the message that are flipped have a coefficient of 1 in e(x).
A faulty sequence m(x) + e(x) escapes detection only if it leaves the same remainder as m(x). That is, e(x) must be a multiple of p(x). The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
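The aliasing condition can be tried out numerically. In this sketch polynomials are integers whose bit i is the coefficient of \(x^i\), and the values of m(x), p(x), and the two error polynomials are illustrative choices, not data from the notes.

```python
def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over GF(2), polynomials as bit masks."""
    degree_p = p.bit_length() - 1
    while m.bit_length() - 1 >= degree_p:
        m ^= p << (m.bit_length() - 1 - degree_p)
    return m

def aliases(m, e, p):
    """True if the corrupted sequence m + e leaves the same signature as m."""
    return gf2_mod(m ^ e, p) == gf2_mod(m, p)

p = 0b10010           # analyzer polynomial x^4 + x (illustrative)
m = 0b1011000         # fault-free sequence x^6 + x^4 + x^3 (illustrative)
e_multiple = p << 1   # x^5 + x^2, a multiple of p(x)
e_single = 0b100      # x^2, a single flipped bit

print(aliases(m, e_multiple, p))   # error is a multiple of p(x)
print(aliases(m, e_single, p))     # error is not a multiple of p(x)
```

The first error pattern goes unnoticed because it is a multiple of p(x); the second changes the remainder and is caught. A p(x) of higher degree has fewer multiples among the possible error patterns of a given length, which is the accuracy-versus-area tradeoff described above.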
7.4.10 Summary
1 clock cycles.
LEC-24 Preliminaries
Schedule
wk-01–02   VHDL
wk-03–05   Design and Optimization
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08–09   Timing Analysis
wk-09–10   Power Analysis and Reduction
wk-11–12   Faults and Testing
             Lec-21 Introduction
             Lec-22 Detection and Test Generation
             Lec-23 Built-In Self Test (BIST)
             Lec-24 Scan Testing (JTAG)
wk-13      Review
JTAG
IEEE 1149
length of time to do a scan test
hardware to do scan testing
7.5
7.5.1
[Figure: normal circuit with inputs data_in(0) to data_in(3) and zeta_in(0) to zeta_in(3), two scan chains (scan chain 0 with mode0, scan_in0, scan_out0; scan chain 1 with mode1, scan_in1, scan_out1), and another circuit downstream]
7.5.2
Scan Chains
[Figures: the scan chains shown in Normal Mode and in Scan Mode, followed by a slide-by-slide walk through loading a test vector, running the test (Run Tests), and unloading the results through scan_out0 and scan_out1]
[Figures: worked example on a small circuit with signals a, b, c, d, y, and z, showing the Load steps on scan_in0 and scan_in1, the captured results appearing on scan_out0 and scan_out1, and a timing diagram of clk and mode0]
7.5.3
32
7.5.3
Adding scan circuitry 1. Registers around circuit to be tested are grouped into scan chains 2. Replace each op with mux + op 3. Flops and muxes wired together into scan chains 4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors Running test vectors 1. Put scan chain in scan mode 2. Load in test vector (one element of vector per clock cycle) 3. Put scan chain in normal mode 4. Run circuit for one clock cycle load result of test into ops 5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)
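The load, capture, and unload steps can be modelled in software. This is a behavioural sketch only: the class `ScanChain`, the function `run_tests`, and the bit-inverting stand-in for the circuit under test are all hypothetical, and real scan hardware has separate capture and launch flops that this model folds into one chain.

```python
class ScanChain:
    """A chain of mux + flop cells, modelled as a list of bits."""

    def __init__(self, length):
        self.flops = [0] * length

    def shift(self, scan_in):
        """One clock in scan mode: shift by one cell, return the bit leaving the chain."""
        scan_out = self.flops[-1]
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out

    def capture(self, circuit):
        """One clock in normal mode: the flops load the circuit's response."""
        self.flops = list(circuit(self.flops))

def run_tests(chain, circuit, vectors):
    """Apply each vector; return (captured results, total clock cycles)."""
    length = len(chain.flops)
    cycles = 0
    results = []
    for bit in reversed(vectors[0]):          # load the first vector
        chain.shift(bit)
        cycles += 1
    for index in range(len(vectors)):
        chain.capture(circuit)                # run circuit for one clock cycle
        cycles += 1
        following = vectors[index + 1] if index + 1 < len(vectors) else [0] * length
        unloaded = []
        for bit in reversed(following):       # unload while loading the next vector
            unloaded.append(chain.shift(bit))
            cycles += 1
        results.append(unloaded[::-1])
    return results, cycles

def invert_all(bits):
    """Stand-in circuit under test: inverts every bit."""
    return [1 - b for b in bits]

results, cycles = run_tests(ScanChain(3), invert_all, [[1, 0, 1], [0, 0, 1]])
print(results)
print(cycles)
```

For a chain of length n and V vectors the model spends n cycles on the initial load and n + 1 cycles per vector after that, which is why long chains dominate scan test time.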
7.5.4
An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.
Question:
Answer:
We can load and unload all of the scan chains at the same time, so time will be limited by the longest (22,000 bits). For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.
Q Q
'CQ
TimeTot
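One way to turn the description above into a number is to count clock cycles: one initial load of the longest chain, then for every test vector one run cycle plus one unload (overlapped with the next load). This cycle count is this sketch's reading of the procedure, not a formula recovered from the notes, and the function name is made up.

```python
def scan_test_time(chain_lengths, num_vectors, clock_hz):
    """Return (total clock cycles, seconds) for overlapped scan load/unload."""
    longest = max(chain_lengths)              # all chains shift in parallel
    cycles = longest + num_vectors * (1 + longest)
    return cycles, cycles / clock_hz

CHAINS = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
TEST_CLOCK_HZ = 800_000_000 * 80 // 100       # 80% of 800 MHz

cycles, seconds = scan_test_time(CHAINS, 500_000, TEST_CLOCK_HZ)
print(cycles)              # total clock cycles
print(round(seconds, 2))   # total time in seconds
```

Under these assumptions the test takes about 11.0 billion cycles, or roughly 17.2 seconds at 640 MHz; the single run cycle per vector is negligible next to the 22,000 shift cycles.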
7.6
Boundary Scan
Boundary scan originated as a technique to test wires on printed circuit boards (PCBs). The goal was to replace "bed-of-nails" style testing with a technique that would work for high-density PCBs (lots of small wires close together). It is now used to test both boards and chip internals, on both boundaries (I/O pins) and internal flops.
JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones.
Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.
4 required signals (scan pins: TDI, TDO, TCK, TMS)
1 optional signal (scan pin: TRST)
protocol to connect circuit under test to tester and other circuits
state machine to drive test circuitry on chip
Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports
7.6.1
1985   JETAG: Joint European Test Action Group
1986   JTAG (North American companies joined)
1990   JTAG 2.0 formed basis for IEEE 1149.1 "Test access port and boundary scan architecture"
7.6.2
Scan Pins
TDI    test data input: input test vector to chip
TDO    test data output: output result of test
TCK    test clock: clock signal that test runs on
TMS    test mode select: controls scan state machine
TRST   test reset (optional): resets the scan state machine
Overview
[Figure: chip containing scan registers and control logic, with the scan path running from TDI to TDO]
Expanded View
[Figure: chip with a boundary scan register (BSR) of boundary scan cells (BSC) around the circuit under test, a bypass register (BR), IDCODE register, instruction register (IR) built from IR cells (IRC), instruction decoder, and TAP controller, connected to TDI, TDO, TCK, and TMS]
7.6.3
JTAG Components
component        Smith        description
(top level)      Fig 14.8     Top level diagram
BSR              Fig 14.5     Boundary scan register. A chain of boundary scan cells (BSCs).
BSC              Fig 14.2     Boundary scan cell. Connects external input and scan signal to internal circuit. Acts as wire between external input and internal circuit in normal mode.
BR               Fig 14.3     Bypass-register cell. Allows direct connection from TDI to TDO. Acts as a wire when executing BYPASS instruction.
IDCODE                        Device identification register. Data register to hold manufacturer's name and chip identifier. Used in IDCODE instruction.
IR cell          Fig 14.4     Instruction register cell. Cells are combined together as a shift register to form an instruction register (IR).
IR               Fig 14.6     Instruction register. Two or more IR cells in a row. Holds data that is shifted in on TDI, sends this data in parallel to instruction decoder.
IDecode          Table 14.4   Instruction decoder. Reads instruction stored in instruction register (IR) and sends control signals to bypass register (BR) and boundary scan register (BSR).
TAP Controller   Fig 14.7     State machine that, together with instruction decoder, controls the scan circuitry.
7.6.4
Scan Instructions
EXTEST    Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE    Sample result data.
PRELOAD   Load test vector.
BYPASS    Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.
IDCODE    Output manufacturer and part number.
7.6.5
TAP Controller
The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7.
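The 16 states and their TMS-driven transitions can be written down as a lookup table. The table below follows the state names and transitions of the IEEE 1149.1 TAP controller as they are commonly documented; it is offered as a sketch to experiment with, not as a transcription of Smith's Fig 14.7, and the helper `tap_walk` is hypothetical.

```python
# state -> (next state when TMS = 0, next state when TMS = 1), sampled on each TCK edge
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def tap_walk(state, tms_bits):
    """Follow a sequence of TMS values from the given state; return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state

# From reset, TMS = 0, 1, 0, 0 reaches the state that shifts data from TDI to TDO.
print(tap_walk("Test-Logic-Reset", [0, 1, 0, 0]))

# Holding TMS high for five clocks returns to reset from any state.
print(all(tap_walk(s, [1] * 5) == "Test-Logic-Reset" for s in TAP_NEXT))
```

The second check illustrates why TRST can be optional: the controller can always be forced back to its reset state through TMS and TCK alone.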
7.6.6
descriptions of JTAG/IEEE
Texas Instruments introductory seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar1.pdf
Texas Instruments intermediate seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar2.pdf
Sun microSPARC-IIep scan-testing documentation: http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/
Intellitech JTAG overview: http://www.intellitech.com/resources/technology.html
Actel's JTAG description: http://www.actel.com/appnotes/97s05d15.pdf
Description of JTAG support on Motorola ColdFire microprocessor: http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf
7.7
7.7.1
Faults
Faults are manufacturing defects. Common occurrences are "opens" (wire is broken) and "shorts" (two wires are connected together).
When working with faults, we work with wire segments, not signals. In the circuit below, there are 8 different wire segments (L1–L8). Each wire segment corresponds to a logically distinct fault location. All physical faults on a segment affect the same set of signals, so they are grouped together into a "logical fault". If a signal has a fanout of 1, then there is one wire segment. A signal with a fanout of n, where n > 1, has n + 1 wire segments: one for the source signal and one for each gate of fanout.
[Figure: circuit with inputs a, b, c, output z, and wire segments L1–L8]
For signal b in the circuit here, the fanout is 2, so there are three wire segments (L2, L4, and L5).
Although there are many different bad behaviours that faults can lead to, the simple model of single stuck-at faults has proven very capable of finding real faults in real circuits.
single: assume that at most one wire segment in the circuit has a fault.
stuck-at-0 (s@0): assume that the faulty behaviour is that the segment is hardwired to 0.
stuck-at-1 (s@1): assume that the faulty behaviour is that the segment is hardwired to 1.
7.7.2
Testing
Faults are detected by stimulating circuits (the real, manufactured circuit, not a simulation!) with test vectors and checking that the real circuit gives the correct output.
Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical evidence demonstrate that testing a circuit for single stuck-at faults will also detect many other types of faults and will often detect multiple faults.
Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit. These redundant parts are added to prevent timing hazards. As such, a stuck-at fault in redundant circuitry will not affect the steady-state behaviour of the circuit, but could allow timing glitches to occur.
If a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no redundant circuitry.
It is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0, or if they have multiple faults. I think, but haven't seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a stuck-at fault that you aren't testing for can mask (hide) a fault that you are testing for.
There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
generate test vectors
override normal datapath to send test vectors, rather than normal inputs, as inputs to flops
compare outputs of flops to expected result
1. build Boolean equation (or Karnaugh map) of correct circuit
2. build Boolean equation (or Karnaugh map) of faulty circuit
3. compare equations (or Karnaugh maps); regions of difference represent test vectors that will detect the fault
Because it takes so much time to perform a scan test, reducing the number of test vectors that are needed is very important.
fault1 dominates fault2 is defined as: any test vector that will detect fault1 will also detect fault2.
Summary of Technique to Find and Order Test Vectors:
1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
Each linear feedback shift register has a characteristic polynomial that corresponds to the behaviour of the signal that is the input to the first flip-flop in the shift register. The exponents in the polynomial correspond to the delay: \(x^0\) is the input to the shift register, \(x^1\) is the output of the first flip-flop, \(x^2\) is the output of the second, etc. The coefficient is 1 if there's a feedback tap from the output of the flop.
[Figure: maximal-length LFSR with flip-flops q2, q1, q0]
Checking is done by building one signature analyzer circuit for each signal tested. The circuit returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this with complete accuracy would require storing \(2^n\) bits of information for each output for a circuit with n inputs. This would be as expensive as the original circuit. So, BIST uses mathematics similar to error correction/detection to approximate whether the outputs are correct. This technique is called signature analysis and originated with Hewlett-Packard in the 1970s.
The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit is designed to output a 1 at the end of the sequence of \(2^n - 1\) test results if the sequence of results matches the correct circuit. We could do this with an LFSR of \(2^n - 1\) flops, but as said before, this would be at least as expensive as duplicating the original circuit.
The checking LFSR is designed similarly to a hashing function or parity checking circuit. If it returns 0, then we know that there is a fault in the circuit. If it returns a 1, then there is probably not a fault in the circuit, but we can't say for sure.
There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the more flip-flops are required.
The LFSR here recognizes the sequence 1, 0, 1, 1, 1, 0, 0:
[Figure: signature-analyzer LFSR]
It could be used, in conjunction with the maximal-length LFSR above, to detect faults in a circuit that, when stimulated with the sequence 111, 011, 001, 100, 010, 101, 110, outputs the sequence 1, 0, 1, 1, 1, 0, 0.
7.7.3
Scan            Self Test
less hardware   more hardware
                faster
                ill-defined coverage
                test vectors are hard to modify
Chapter 8
Review
This chapter is a collection of information covering the major topics of the term. The "Topics List" section for each major area is meant to be relatively complete. The "notes" sections are less focused and are not indicative of the relative importance of the different topics we covered.
LEC-25 Preliminaries
LEC-25: Review
Lecture Notes Sections: 8.1–8.9
VHDL
Design and Optimization
Functional Validation
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
8.1
Analog effects in the digital world
  timing analysis
  power
  faults and testing
Design and optimization techniques
  Lec-05   Dataflow diagrams and high-level models
  Lec-06   State machines
  Lec-07   Memory arrays
  Lec-08   Design example (stack)
  Lec-09   Optimization and coding guidelines
  Lec-10   FPGA-specific optimizations
Functional Validation
  Lec-11   Datapath Validation
  Lec-12   Control Validation
Performance analysis and prediction
  Lec-13   Measuring performance, comparing optimizations
  Lec-14   Digital-circuit performance
8.2
VHDL
8.2
VHDL
simple syntax and semantics things that you should know simply by having done the miniproject and project synthesizing VHDL
match up VHDL code with hardware choose VHDL fragment to generate more optimal hardware identify whether a particular signal will be the output of combinational circuitry or a op identify whether a particular process is combinational or clocked VHDL semantics match up VHDL code with waveforms identify whether two VHDL fragments have same behaviour perform delta-cycle simulation of VHDL perform clock-cycle simulation of VHDL
8.3
8.4
Validation
test benches
assertions
coverage monitors
relational specification
functional specification
boundary conditions / corner cases
8.5
8.6
Timing Analysis
what affects delay
setup, hold, clock-to-Q times, skew, jitter, etc
  clock period
  clock skew
  clock jitter
  propagation delay
  load delay
  setup time
  hold time
  clock-to-Q time
critical path
  find the critical path through a circuit
  find the minimum clock period for a circuit
  find a pair of assignments to signals that exercises the critical path
false path
  determine whether a critical path is real or false
derating factors
8.7
Power
power vs energy
equations for power
  dynamic power
  static power
  switching power
  short circuit power
  leakage power
  activity factor
  leakage current
  threshold voltage
8.8
Testing
causes of faults
locations of faults
physical faults
mathematical models of faults
  single stuck-at fault
  will a test for a mathematical fault detect a physical fault?
Testing II
built-in self-testing
  linear feedback shift register
  characteristic polynomials
    addition
    multiplication
    division (quotient and remainder)
    relationship to hardware
  maximal length linear feedback shift register
  signature analyzer
  fault aliasing
  process and time to run a BIST test
test vector generation
  generate test vector to find a particular fault
  generate test vectors to find a set of faults
  fault collapsing
    gate collapsing
    node collapsing
    fault domination
  order test vectors to reduce test time
8.9
[Formula sheet: most entries are not recoverable from the extraction]
Formulas II
\( k = 1.38066 \times 10^{-23} \) J/K
\( q = 1.60218 \times 10^{-19} \) C
Part II
Chapter 1
VHDL Problems
SOL-01 Preliminaries
1.1
IEEE 1164
For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2–3 word description of the value. Values: '-', '#', '0', '1', 'A', 'h', 'H', 'L', 'Q', 'X', 'Z'.
Answer:
value   defined   description
'-'     yes       don't care
'#'     no
'0'     yes       strong 0
'1'     yes       strong 1
'A'     no
'h'     no
'H'     yes       weak 1
'L'     yes       weak 0
'Q'     no
'X'     yes       strong unknown
'Z'     yes       high impedance
NOTE: 'h' is not in the package, because characters are case sensitive. For example 'a' /= 'A'.
1.2
architecture main of montevido is
  signal i, j : std_logic;
begin
  i <= c0 XOR c1;
  j <= c0 XOR c1;
  process (a, i, j) begin
    if (a = '1') then
      p <= i AND j;
    else
      p <= NOT i;
    end if;
  end process;
  process (a, b0, b1) begin
    if rising_edge(a) then
      q <= b0 AND b1;
    end if;
  end process;
  process (a, c0, c1, d0, d1, e0, e1)
  begin
    if (a = '1') then
      r <= c0 OR c1;
      s <= d0 AND d1;
    else
      r <= e0 XOR e1;
    end if;
  end process;
  process begin
    wait until rising_edge(a);
    t <= b0 XOR b1;
    u <= NOT t;
    v <= NOT x;
  end process;
  process begin
    case l is
      when "00" =>
        wait until rising_edge(a);
        w <= b0 AND b1;
        x <= '0';
      when "01" =>
        wait until rising_edge(a);
        w <= '-';
        x <= '1';
      when "1-" =>
        wait until rising_edge(a);
        w <= c0 XOR c1;
        x <= '-';
    end case;
  end process;
  y <= c0 XOR c1;
  z <= x XOR w;
end main;
Answer:
[Table: each of the signals p, q, r, s, t, u, v, w, x, y, z classified as Latch, Combinational, or Flip-flop; the positions of the marks are not recoverable]
1.3
1.3
NOTES: 1. 2. 3. 4.
... represents a legal fragment of VHDL code assume all signals are properly declared the VHDL code is intendend to be legal, synthesizable code all signals are initially U
SOL-01:
1.3
architecture main of tinyckt is component bigckt ( ... ); signal ... : std_logic; begin p0 : process begin entity bigckt is wait until rising_edge(clk); port ( p0_a <= i; a, b : in std_logic; wait until rising_edge(clk); c : out std_logic end process; ); p1 : process begin end bigckt; wait until rising_edge(clk); p1_b <= p1_d; architecture main of bigckt is p1_c <= p1_b; begin p1_d <= s2_k; process (a, b) end process; begin p2 : process (p1_c, p3_h, p4_i, clk) begin if (a = 0) then if rising_edge(clk) then c <= 0; p2_e <= p3_h; else p2_f <= p1_c = p4_i; if (b = 1) then end if; c <= 1 end process; else p3 : process (i, s4_m) begin c <= 0; p3_g <= i; end if; p3_h <= s4_m; end if; end process; end process; p4 : process (clk, i) begin end main; if (clk = 1) then p4_i <= i; entity tinyckt is else port ( p4_i <= 0; clk : in std_logic; end if; i : in std_logic; end process; o : out std_logic huge : bigckt ); (a => p2_e, b => p1_d, c => h_y); end tinyckt; s1_j <= s3_l; s2_k <= p1_b XOR i; s3_l <= p2_f; s4_m <= p2_f; end main;
For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change
Answer:
NOTE: i doesn't affect the value of p2_f just before a rising edge of the clock, so i doesn't affect p2_e at all along the path that goes through p2_f.
[Table: one row per (source signal, destination signal) pair, with an X in one of the columns "no connection", "same clock cycle", "1 clock cycle" through "9 clock cycles", or "10 or more clock cycles". Pairs: i to p0_a, i to p1_b, i to p1_c, i to p2_e, i to p3_g, i to p4_i, s4_m to h_y, p1_b to p1_d, p2_f to s1_j.]
1.4
Arithmetic Overflow
Answer:
An overflow in 8-bit arithmetic happens when the carry into the most significant bit is different from the carry out of the most significant bit.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity overflow is
  port (
    num1, num2 : in signed(7 downto 0);
    cin : in std_logic;
    overflow : out std_logic
  );
end overflow;

architecture main of overflow is
  signal num1_ext, num2_ext, result : signed(8 downto 0);
begin
  num1_ext <= '0' & num1;
  num2_ext <= '0' & num2;
  result <= num1_ext + num2_ext + ("00000000" & cin);
  -- overflow: operands have the same sign bit, but the result's sign bit differs
  overflow <= not (num1_ext(7) xor num2_ext(7))
              and ( num1_ext(7) xor result(7) );
end main;
1.5
8-Bit Register
clock signal clk
input data vector d
output data vector q
synchronous active-high input reset
synchronous active-high input enable
Answer:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reg_8 is
  port (
    clk, reset, enable : in std_logic;
    d : in std_logic_vector(7 downto 0);
    q : out std_logic_vector(7 downto 0)
  );
end reg_8;

architecture main of reg_8 is
begin
  reg: process begin
    wait until (rising_edge(clk));
    if reset = '1' then
      q <= (others => '0');
    elsif enable = '1' then
      q <= d;
    end if;
  end process reg;
end main;
1.5.1
Asynchronous Reset
Modify your design so that the reset signal is asynchronous, rather than synchronous.
Answer:
reg : process (clk, reset) begin
  if reset = '1' then
    q <= (others => '0');
  elsif rising_edge(clk) then
    if enable = '1' then
      q <= d;
    end if;
  end if;
end process reg;
1.5.2
Discussion
Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.
1.5.3
Write a test bench to validate the functionality of the 8-bit register with synchronous reset.
Answer:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reg_8_tb is
end reg_8_tb;

architecture main of reg_8_tb is
  component reg_8 is
    port (
      clk : in std_logic;
      reset : in std_logic;
      enable : in std_logic;
      d : in std_logic_vector (7 downto 0);
      q : out std_logic_vector (7 downto 0));
  end component;
  signal clk, reset, enable : std_logic;
  signal d, q : std_logic_vector(7 downto 0);
begin
  uut : reg_8
    port map ( clk => clk, reset => reset, enable => enable, d => d, q => q );
  process begin
    clk <= '1'; reset <= '0';
    wait for 20 ns; -- time=20 ns
    clk <= '0'; reset <= '1'; enable <= '1'; d <= "10101011";
    wait for 20 ns; -- time=40 ns
    clk <= '1';
    wait for 20 ns; -- time=60 ns
    clk <= '0'; enable <= '0'; d <= "00001011";
    wait for 20 ns; -- time=80 ns
    clk <= '1';
    wait for 20 ns; -- time=100 ns
    clk <= '0'; enable <= '1';
    wait for 20 ns; -- time=120 ns
    clk <= '1';
1.6
VHDL Syntax
Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.
NOTES:
1) ... represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.
q2a
architecture main of anchiceratops is
  signal a, b, c : std_logic;
begin
  process begin
    wait until rising_edge(c);
    a <= if (b = '1') then
           ...
         else
           ...
         end if;
  end process;
end main;
ILLEGAL: if-then-else is a statement, not an expression, so can't have if-then-else on the right-hand side of an assignment.

q2b
architecture main of tulerpeton is
begin
  lab: for i in 15 downto 0 loop
    ...
  end loop;
end main;
ILLEGAL: loop statements are sequential, while architecture bodies contain concurrent statements.
q2c
architecture main of temnospondyl is
  component compa
    port (
      a : in std_logic;
      b : out std_logic
    );
  end component;
  signal p, q : std_logic;
begin
  coma_1 : compa
    port map (a => p, b => q);
  ...
end main;
LEGAL

q2d
architecture main of metaxygnathus is
  signal a : std_logic;
begin
  lab: if (a = '1') generate
    ...
  end generate;
end main;
ILLEGAL: condition for if-generate statements must be statically determined; testing the value of a signal is dynamic.

q2e
architecture main of pachyderm is
  function inv(a : std_logic)
    return std_logic is
  begin
    return(NOT a);
  end inv;
  signal p, b : std_logic;
begin
  p <= inv(b => a);
  ...
end main;
ILLEGAL: the argument to inv should be (a => b)

q2f
architecture main of apatosaurus is
  type state_ty is (S0, S1, S2);
  signal st : state_ty;
  signal p : std_logic;
begin
  case st is
    when S0 | S1 => p <= '0';
    when others => p <= '1';
  end case;
end main;
ILLEGAL: case statements are sequential; but the body of an architecture contains concurrent statements.
SOL-02 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
1.7
Clock-Cycle Simulation
Given the VHDL code for deinonychus and waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package deinonychus_pkg is
  type state_ty is
    (mango, guava, durian, papaya);
end deinonychus_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.deinonychus_pkg.all;

entity deinonychus is
  port (
    clk, reset, sel : in std_logic;
    a, b : in unsigned(15 downto 0);
    p : out unsigned(15 downto 0)
  );
end deinonychus;

architecture main of deinonychus is
  signal y, z : unsigned(15 downto 0);
  signal state : state_ty;
begin
  proc_herzog: process begin
    top_loop: loop
      wait until (rising_edge(clk));
      next top_loop when (reset = '1');
      state <= durian;
      wait until (rising_edge(clk));
      state <= papaya;
      while y < z loop
        wait until (rising_edge(clk));
        if sel = '1' then
          wait until (rising_edge(clk));
          next top_loop when (reset = '1');
          state <= mango;
        end if;
        state <= papaya;
      end loop;
    end loop;
  end process;
  proc_hillary: process (clk)
  begin
    if rising_edge(clk) then
      if (state = durian) then
        z <= a;
      else
        z <= z + 2;
      end if;
    end if;
  end process;
  y <= b;
  p <= y + z;
end main;
[Waveform diagram: time axis from 0 to 200 ns in 20 ns steps; rows reset, clk, sel, b, state, z, p; values are to be given at 55 ns, 107 ns, 147 ns, and 195 ns.]
Data row after sel: 01 0E 02 0C 04 0A 06 08 0E 02 0C 04 0A 06 08 0E 02 0C 04 0A
Data rows after b (two identical rows): 0F 03 0D 05 0B 07 01 0F 03 0D 05 0B 07 01 0F 03 0D 05 0B 07
Values after z and p: U, U, 2, 07, 6, 15, A, 11
1.8
Simulate the following VHDL code by drawing a timing diagram.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.
NOTES:
1. The initial value of all signals is 'U'.
2. The signal reset becomes '1' at 0 ns and then becomes '0' at 5 ns.

architecture main of pong_machine is
  signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
  process (clk) begin
    if rising_edge(clk) then
      ping_n <= ping_i;
      pong_n <= pong_i;
    end if;
  end process;
  process (pong_n, ping_n, reset) begin
    if (reset = '1') then
      ping_i <= '1';
      pong_i <= '0';
    else
      ping_i <= pong_n;
      pong_i <= ping_n;
    end if;
  end process;
  out_pong_proc : process (pong_i) begin
    pong <= pong_i;
  end process;
  ping <= ping_i;
end main;
1.9
Simulate the following VHDL code by completing the timing diagram on the next page.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.
5. Write t=5ns and t=10ns at the top of columns where time advances to 5 ns and 10 ns.
NOTES:
1. The initial values of all of the signals are shown in the timing diagram.
2. The only changes on clk, a, and b are:
   (a) At 5 ns, a changes from 0 to 1.
   (b) At 5 ns, b changes from 0 to 1.
   (c) At 10 ns, clk changes from 0 to 1.

entity femur is
  port (
    clk, a, b : in std_logic;
    f : out std_logic
  );
end femur;

architecture main of femur is
  signal c, d, e : std_logic;
begin
  proc_1 : process (a, b, c) begin
    c <= a and b;
    d <= a xor c;
  end process;
  proc_2 : process begin
    e <= d;
    wait until rising_edge(clk);
  end process;
  proc_3 : process (c, e) begin
    f <= c xor e;
  end process;
end main;
[Timing diagram: rows simulation round, simulation cycle, delta cycle, proc_external, proc_1, proc_2, proc_3, clk; column markers t=5 ns and t=10 ns; cells hold B and E markers and the process modes P, A, S.]
1.10
entity teradactyl is
  port (
    a : in std_logic;
    v : out std_logic
  );
end teradactyl;

architecture main of teradactyl is
  signal m : std_logic;
begin
  m <= a;
  v <= m;
end main;

architecture q3a of teradactyl is
  signal b, c, d : std_logic;
begin
  b <= a;
  c <= b;
  d <= c;
  v <= d;
end q3a;
SAME

architecture q3b of teradactyl is
  signal m : std_logic;
begin
  process (a, m) begin
    v <= m;
    m <= a;
  end process;
end q3b;
SAME

architecture q3c of teradactyl is
  signal m : std_logic;
begin
  process (a) begin
    m <= a;
  end process;
  process (m) begin
    v <= m;
  end process;
end q3c;
SAME
1.11
entity ichthyostega is
  port (
    clk : in std_logic;
    b, c : in signed(3 downto 0);
    v : out signed(3 downto 0)
  );
end ichthyostega;

architecture main of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end main;

architecture q4a of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    if (cx > 0) then
      wait until (rising_edge(clk));
      v <= bx;
    else
      wait until (rising_edge(clk));
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4a;
DIFFERENT: evaluations of cx > 0 and v <= bx are separated by a clock cycle.

architecture q4b of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4b;
DIFFERENT: each assignment statement (e.g. bx <= b) will execute every other clock cycle, rather than every clock cycle.

architecture q4c of ichthyostega is
  signal bx, cx, dx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    v <= dx;
  end process;
  dx <= bx when (cx > 0)
        else to_signed(-1, 4);
end q4c;
SAME
1.12
[Waveform diagram: signals clk, a, b, c]

q3a
architecture q3a of q3 is
begin
  process begin
    a <= '1';
    loop
      wait until rising_edge(clk);
      a <= NOT a;
    end loop;
  end process;
  b <= NOT a;
  c <= NOT b;
end q3a;
SAME

q3b
architecture q3b of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= b;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= a;
end q3b;
SAME

q3c
architecture q3c of q3 is
begin
  process begin
    a <= '0';
    b <= '1';
    wait until rising_edge(clk);
    b <= a;
    a <= b;
    wait until rising_edge(clk);
  end process;
  c <= NOT b;
end q3c;
SAME

q3d
architecture q3d of q3 is
begin
  process (b, clk) begin
    a <= NOT b;
  end process;
  process (a, clk) begin
    b <= NOT a;
  end process;
  c <= NOT b;
end q3d;
DIFFERENT: this code has combinational loops

q3e
architecture q3e of q3 is
begin
  process begin
    a <= '1';
    b <= '0';
    c <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    c <= NOT b;
    wait until rising_edge(clk);
  end process;
end q3e;
DIFFERENT: c is a constant 1

q3f
architecture q3f of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= not b;
end q3f;
DIFFERENT: a is a constant 1
1.13
For each of the circuits q2a through q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.
[Circuit diagrams q2a through q2d, clocked by clk]
1.14
For each of the fragments of VHDL q4a...q4d, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a
process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
  e <= NOT d;
end process;
Answer: Unsynthesizable: different conditions in wait statements in the same process. This would lead to a single flip-flop requiring multiple clock signals.

q4b
process begin
  while (c /= '1') loop
    if (b = '1') then
      wait until rising_edge(a);
      e <= d;
    else
      e <= NOT d;
    end if;
  end loop;
  e <= b;
end process;
Answer: Unsynthesizable: while loop around code where some paths have wait statements and some do not. Even having a while loop with a dynamic condition around code without a wait statement would be unsynthesizable, because it would lead to combinational loops in the hardware.
q4c
process (a, d) begin
  e <= d;
end process;
process (a, e) begin
  if rising_edge(a) then
    f <= NOT e;
  end if;
end process;

q4d
process (a) begin
  if rising_edge(a) then
    if b = '1' then
      e <= '0';
    else
      e <= d;
    end if;
  end if;
end process;
Answer: Synchronous reset (AND with bubble). The reset pin on a flip-flop is generally asynchronous, so a flop with a reset pin would be incorrect.
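The "AND with bubble" structure can be written out explicitly. The following is only a sketch: the entity name q4d_explicit and the signal e_next are illustrative assumptions that do not appear in the original, and the equivalence with q4d is meant for inputs that are '0' or '1'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity; only a, b, d, e come from fragment q4d.
entity q4d_explicit is
  port (
    a, b, d : in  std_logic;
    e       : out std_logic
  );
end q4d_explicit;

architecture main of q4d_explicit is
  signal e_next : std_logic;  -- assumed name for the flop's data input
begin
  -- synchronous reset: AND gate with an inverter (bubble) on b
  e_next <= d and (not b);

  -- plain D flip-flop with no reset pin
  process (a) begin
    if rising_edge(a) then
      e <= e_next;
    end if;
  end process;
end main;
```

When b is '1' the AND gate forces '0' into the flop at the next rising edge of a; when b is '0' the flop simply captures d.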
1.15
Datapath Design
Each of the three VHDL fragments q4a through q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):
read in source and destination addresses from i_src1, i_src2, i_dst
read operands op1 and op2 from memory
compute sum of operands sum
write sum to memory at destination address dst
write sum to output o_result
[Block diagram: inputs clk, i_src1, i_src2, i_dst; output o_result]
1.15.1
Correct Implementation?

-- This code is to be used for all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum, mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b : unsigned(7 downto 0);
...
process (clk) begin
  if rising_edge(clk) then
    src1 <= i_src1;
    src2 <= i_src2;
    dst <= i_dst;
    o_result <= sum;
  end if;
end process;
mem : ram256x16d
  port map (clk => clk,
            i_addr_a => mem_addr_a, i_addr_b => mem_addr_b,
            i_we_a => mem_we, i_data_a => mem_in_a,
            o_data_a => mem_out_a, o_data_b => mem_out_b);

q4a
op1 <= mem_out_a when state = "0010" else (others => '0');
op2 <= mem_out_b when state = "0010" else (others => '0');
sum <= op1 + op2 when state = "0100" else (others => '0');
mem_in_a <= sum when state = "1000" else (others => '0');
mem_addr_a <= dst when state = "1000" else src1;
mem_we <= '1' when state = "1000" else '0';
mem_addr_b <= src2;
process (clk) begin
  if rising_edge(clk) then
    if (load = '1') then
      state <= "1000";
    else
      -- rotate state vector one bit to left
      state <= state(2 downto 0) & state(3);
    end if;
  end if;
end process;
Answer: The circuit is not correct: all of the signals are combinational. Also, there could be initialization problems with state.
q4b
process (clk) begin
  if rising_edge(clk) then
    op1 <= mem_out_a;
    op2 <= mem_out_b;
  end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1' else src1;
mem_addr_b <= src2;
Answer:
q4c
process begin
  wait until rising_edge(clk);
  op1 <= mem_out_a;
  op2 <= mem_out_b;
  sum <= op1 + op2;
  mem_in_a <= sum;
end process;
process (load, dst, src1) begin
  if load = '1' then
    mem_addr_a <= dst;
  else
    mem_addr_a <= src1;
  end if;
end process;
mem_addr_b <= src2;
1.15.2
Smallest Area
Answer: Assuming that q4c includes mem_we: All of the circuits have an adder, memory, input flops, output flops, and a mux for mem_addr_a. The differences are in the flops and miscellaneous circuitry:
        flops   ands
q4a     1*4     5*4
q4b     2*8     0
q4c     4*8     0
1.15.3
Answer:
q4c has the shortest clock period, because it does the least amount of computation between flip-flops: all of the signals are flopped.
Chapter 2
Design Problems
SOL-03 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
2.1
Synthesis
2.1.1
Data Structures
If you have to write your own code (i.e. you do not have a library of memory components or a special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?
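One common choice is an array type whose elements are vectors, indexed by the register address. The sketch below is only an illustration under assumed parameters: the entity name regfile, the port names, and the 8-entry by 8-bit size are all hypothetical and are not taken from the course material.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 8-entry, 8-bit register file; sizes and names are assumptions.
entity regfile is
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  unsigned(2 downto 0);
    raddr : in  unsigned(2 downto 0);
    wdata : in  std_logic_vector(7 downto 0);
    rdata : out std_logic_vector(7 downto 0)
  );
end regfile;

architecture main of regfile is
  -- array type: one std_logic_vector element per register
  type reg_array_ty is array (0 to 7) of std_logic_vector(7 downto 0);
  signal regs : reg_array_ty;
begin
  -- synchronous write to the addressed register
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        regs(to_integer(waddr)) <= wdata;
      end if;
    end if;
  end process;

  -- combinational read of the addressed register
  rdata <= regs(to_integer(raddr));
end main;
```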
2.1.2
When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?
2.2
Design Guidelines
While you are grocery shopping you encounter your co-op supervisor from last year. She's now forming a startup company in Waterloo that will build digital circuits. She's writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines. What is your response to each question? What is your justification for your answer? What are the tradeoffs between the two options?
0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?
Answer: All projects should use silicon-based chips, because biological chips don't exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.
1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?
Answer: Synchronous reset. Synchronous reset leads to more robust designs. With asynchronous reset, a flop is reset whenever the reset signal arrives. Due to wire delays, signals will arrive at different flops at different times. If an asynchronous reset occurs at about the same time as a clock edge, some flops might be reset in one clock cycle and some in the next. This can lead to glitches and/or illegal values on internal state signals. The tradeoff is that asynchronous reset is often easier to code in VHDL and requires less hardware to implement.
2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?
Answer: Flops. Flip-flops lead to more robust designs than latches. Latches are level sensitive and act as wires when enabled. For a latch-based design to work correctly, there cannot be any overlap in the time when a consecutive pair of latches are enabled. If this happens, the value on a signal will leak through the latch and arrive at the next set of latches one clock phase too early. Thus, latch-based designs are more sensitive to the timing of clock signals. Another disadvantage of latches is that some FPGAs and cell libraries do not support them. In comparison, D-type flip-flops are (almost?) always supported. The tradeoff is that latches are smaller and faster than flip-flops. A common implementation of a flip-flop is a pair of latches in a master/slave combination.
3. Should all chips have registers on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.
Answer: Flops on outputs and inputs. Putting flops on inputs and outputs will make the clock speed of the chip less dependent on the propagation delay between chips. Flops can also be used to isolate the internals of the chip from glitches and other anomalous behaviour that can occur on the boards. The tradeoff is that flops consume area and will increase the latency through the chip.
4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.
Answer: Each project should adopt a convention of either using flops on inputs of modules or outputs of modules. It is rarely necessary to put flops on both inputs and outputs of modules on the same chip. This is because the wire delay between modules is usually less than a clock period. Putting flops on either the inputs or outputs is advantageous because it provides a standard design convention that makes it easier to glue modules together without violating timing constraints. If modules were allowed to have combinational circuitry on both inputs and outputs, the maximum clock speed of the design could not be determined until all of the modules were glued together. The tradeoff is that flops add area and latency. Sometimes there will be two modules where the combinational circuitry on the outputs of one can be combined with the combinational circuitry on the inputs of the second without violating timing constraints. This discipline prevents that optimization.
Aside: Sometimes, to meet performance targets, in situations such as this, a project will remove or move the flops between modules and do clock borrowing to fit the maximum amount of circuitry into a clock period. This is a rather low-level optimization that happens late in the design cycle. It can cause big headaches for functional validation and equivalence verification, because the specifications for modules are no longer clean and the boundaries between modules on the low-level design might be different from the boundaries in the high-level design.
5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?
Answer: Multiplexors. Multiplexors lead to more robust designs. Tri-state buffers rely on analog characteristics of devices to work correctly. Tri-state buffers can work incorrectly in the presence of voltage fluctuations or fabrication process variations. Multiplexors work on a purely Boolean level and as such are less sensitive to changes in voltages or fabrication processes. The tradeoff is that tri-state buffers are smaller and faster than multiplexors.
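The two styles can be compared side by side in VHDL. This is a minimal sketch with hypothetical names (bus_select, sel, x, y, z_mux, z_tri), none of which come from the original; it relies on std_logic being a resolved type so that z_tri may have two drivers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity contrasting a multiplexer with tri-state buffers.
entity bus_select is
  port (
    sel   : in  std_logic;
    x, y  : in  std_logic;
    z_mux : out std_logic;
    z_tri : out std_logic
  );
end bus_select;

architecture main of bus_select is
begin
  -- multiplexer: a single driver, purely Boolean selection
  z_mux <= x when sel = '1' else y;

  -- tri-state buffers: two drivers on one resolved signal;
  -- each driver releases the wire with 'Z' when it is not selected
  z_tri <= x when sel = '1' else 'Z';
  z_tri <= y when sel = '0' else 'Z';
end main;
```

In the tri-state version, correct operation depends on exactly one buffer driving the wire at a time, which is the property the answer above describes as relying on analog behaviour.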
2.3
Use the dataflow diagram below to answer questions 2.3.1 and 2.3.2.
[Dataflow diagram with f and g components and signals d, e]
2.3.1
Resource Usage
List the number of items for each resource used in the dataflow diagram.
2.3.2
Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
you may change the times when signals are read from the environment
you may not increase the resource usage (input ports, registers, output ports, f components, g components)
you may not increase the clock period
Answer:
[Dataflow diagram with inputs a, b, c, d, e and f and g components]
2.4
Your manager has given you the task of implementing the following pseudocode in an FPGA:
if is_odd(a + d)
  p = (a + d)*2 + ((b + c) - 1)/4;
else
  p = (b + c)*2 + d;
NOTES:
1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.
2.4.1
Maximum performance
What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports?
Answer:
Optimizations:
Multiplication by a constant power of 2 can be done without hardware, just connect the wires between the signals. For example, if we have a <= b*2;, we can do this with a(0) <= '0'; a(1) <= b(0); a(2) <= b(1); etc.
Testing if a signal is odd or even can be done simply by extracting the least significant bit of the signal.
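Both optimizations reduce to wiring. The sketch below shows one way to write them, assuming a hypothetical entity wire_ops with an 8-bit signed input b; the entity and the is_odd port are illustrative names, and the product is truncated to 8 bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: multiply by 2 and odd test, using only wires.
entity wire_ops is
  port (
    b      : in  signed(7 downto 0);
    a      : out signed(7 downto 0);
    is_odd : out std_logic
  );
end wire_ops;

architecture main of wire_ops is
begin
  -- a = b*2, truncated to 8 bits: every bit moves up one position
  -- and a '0' is shifted into the least significant bit
  a <= b(6 downto 0) & '0';

  -- a value is odd exactly when its least significant bit is '1'
  is_odd <= b(0);
end main;
```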
[Dataflow diagram with inputs b, c, d and constant 1]
[Dataflow diagram with inputs b and c] Data flow for even case
Even flow requires 4 clock cycles (3 cycles in the datapath plus one more because we have to have flops on both inputs and outputs). Therefore the total design will require 4 clock cycles.
What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?
Answer:
[Datapath diagram with a -1 constant, an xor gate, and an and gate]
2.4.2
Minimum area
What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?
Answer:
8b regs        3
6b regs        0
4b regs        0
1b regs        0
clock cycles   5
[Dataflow diagram with inputs a, d and constant -1]
2.5
Design a circuit that performs the following operation: P = (a+d) + ((b - c) - 1). Optimize your design for area.
Answer:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fsm1 is
  port (
    in1: in signed(3 downto 0);
    in2: in signed(3 downto 0);
    clk: in std_logic;
    p: out signed(4 downto 0)
  );
end fsm1;

architecture fsm1_arch of fsm1 is
  signal add_sel, sub_sel : std_logic;
  signal add1, add2, sub1, sub2, r1, r2: signed(4 downto 0);
begin
  fsm: process begin
    wait until rising_edge(clk);
    add_sel <= '-'; sub_sel <= '1';
    wait until rising_edge(clk);
    add_sel <= '1'; sub_sel <= '0';
    wait until rising_edge(clk);
    add_sel <= '0'; sub_sel <= '-';
  end process;
  reg: process begin
    wait until rising_edge(clk);
    r1 <= sub1 - sub2;
    r2 <= add1 + add2;
  end process;
  -- concurrent statements
  add1 <= ('0' & in1) when (add_sel = '1') else r1;
  add2 <= ('0' & in2) when (add_sel = '1') else r2;
  sub1 <= ('0' & in1) when (sub_sel = '1') else r1;
  sub2 <= ('0' & in2) when (sub_sel = '1') else to_signed(1,5);
  p <= r2;
end fsm1_arch;
SOL-04 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
2.6
2.6.1
Algorithm 1
q = M[b];
M[a] = b;
p = (M[b-1] * b) + M[b];
Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution time.
Answer:
1. a > b means that a ≠ b - 1, therefore can do the M[b-1] read in parallel with the M[a] write or with the M[b] read.
[Dataflow diagram for Algorithm 1: inputs M, a, b and constant -1; two memory reads M(rd) and one memory write M(wr); outputs M, q, p; annotated delays 25 ns, 60 ns, 65 ns; path delay 150 ns]
Critical path is from b to p: 150 ns.
5. Explore performance with different clock periods
[Dataflow diagram for a 70 ns clock period] 70 ns × 4 cycles = 280 ns
[Dataflow diagram for a 90 ns clock period] 90 ns × 3 cycles = 270 ns
6. Minimum latency is 3 clock cycles, because we can't do all memory operations in parallel and need registers on both inputs and outputs.
7. Best performance is with a clock period of 90 ns.
8. Resource usage:
Component        Quantity
Input            2
Output           2
Register         5 (including mem array)
Adder            1
Memory read      2
Memory write     1
Multiplication   1
Clock Period     90 ns
Latency          3 cycles
Execution Time   270 ns
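As a worked check of the figures above, execution time is the clock period multiplied by the latency in cycles. The symbols T_exec, T_clk, and N are assumed names introduced here for illustration only.

```latex
T_{\text{exec}} = T_{\text{clk}} \times N:
\qquad 70\,\text{ns} \times 4 = 280\,\text{ns},
\qquad 90\,\text{ns} \times 3 = 270\,\text{ns}
```

The longer 90 ns clock period wins because it saves a whole cycle, which outweighs the 20 ns added to each of the remaining three cycles.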
2.6.2
Algorithm 2
q = M[b];
M[a] = b;
p = (M[b-1] * b) + M[b];
Assuming a b, draw a dataflow diagram that is optimized for the fastest overall execution time.
Answer:
1. a b means that a ≠ b and a ≠ b-1, so there are no memory address conflicts to create dependencies and complications.
2. Explore performance with different clock periods
[Dataflow diagram for a 70 ns clock period; annotated delays 30 ns, 60 ns, 5 ns] 70 ns × 5 cycles = 350 ns
[Dataflow diagram for a 90 ns clock period; annotated delays 30 ns, 60 ns, 65 ns, 25 ns, 5 ns] 90 ns × 3 cycles = 270 ns
3. Without going to a triple-ported memory, we can't reduce latency below 3.
4. Area optimization: change b - 1 to b + (-1).
[Dataflow diagram after the area optimization; annotated delays 25 ns, 60 ns, 65 ns, 5 ns]
5. Resource usage:
Component        Quantity
Input            2
Output           1
Register         5 (including mem array)
Adder            1
Memory read      2
Memory write     1
Multiplication   1
Clock Period     90 ns
Latency          3 cycles
Execution Time   270 ns
SOL-05 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
2.7
2-bit adder
This question compares an FPGA implementation and a generic-gates implementation of a 2-bit full adder.
2.7.1
Generic Gates
Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.
2.7.2
Xilinx FPGA
Show the CLB implementation of a 2 bit adder in a Xilinx Spartan XCS10 FPGA by drawing the schematic of a CLB and showing the equations for the lookup tables.
2.8
Sketches of Problems
1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)
2. calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))
3. given a dataflow diagram, calculate the clock period that will result in the optimum performance
4. given an algorithm, design a dataflow diagram
5. given a dataflow diagram, design the datapath and finite state machine
6. optimize a dataflow diagram to improve performance or reduce resource usage
7. given an FSM diagram, pick the VHDL code that best implements the diagram (correct behaviour, simple, fast hardware), or critique the hardware
Chapter 3
SOL-06 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
3.1
3.1.1
1. Functionality: Briefly describe the functionality of a carry-save adder.
2. Testbench: Write a testbench for a 16-bit combinational carry-save adder.
3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.
NOTES:
(a) You do not need to support pipelined adders.
(b) VHDL generics might be useful.
SOL-06:
3.1.2
3.1.2
1. Functionality Briey describe the functionality of a trafc-light controller that has sensors to detect the presence of cars. Answer:
Given a normal trafc light, which spends a constant amount of time as green in direction, add the following two transitions to the system: (a) If the less-busy road does not have any cars present for t1 minutes, transition the trafc light to make the busier of the two roads as green. (b) If the busy road has a car waiting for t2 minutes, transition the trafc light to make the busier of the two roads as green. 2. Boundary Conditions Make a list of boundary conditions to check for your trafc light controller. Answer:
(a) A car arrives at the intersection and triggers the sensor, but makes a right turn before the light turns green in its direction. Should the light turn to green in the direction of the now vacant road, or stay green in the current direction? (b) Same as 1, but the makes a right turn after the other road already has a yellow light. Should the light turn to green in the direction of the now vacant road, or transition from yellow back to green, or very briey stay green in the vacant direction?
(c) The less-busy road is yellow, there is no car at the busy road, and a car arrives at the less-busy road. Same questions as in the first two situations.
3. Assertions: Make a list of assertions to check for your traffic-light controller.
Answer:
(a) if a light is green, the next colour will be yellow
(b) if a light is yellow, the next colour will be red
(c) if a light is red, the next colour will be green
(d) if no car has been at the less-busy road for at least t1 minutes, then the less-busy road is red
(e) if the car sensor has been continuously on for the busy road for at least t2 minutes, then the busy road is green
3.1.3
[Figures 3.1 and 3.2: state-machine diagrams; transitions are labelled input/output, e.g. */0, */1, 0/0]
Figure 3.3: A concurrent machine [two machines running concurrently, with states s0 to s2 and q0 to q4]
Answer each of the following questions for the three state machines in Figures 3.1 to 3.3.
(a) How many test scenarios (sequences of test vectors) would you need to fully validate the behaviour of the state machine?
(b) What is the maximum length (number of test vectors) in a test scenario for the state machine?
(c) Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?
Answer:
Fig 3.1
  scenario  sequence  expected behaviour
  1)        000       s0, s2, s3, s0
  2)        001       s0, s2, s3, s0
  3)        010       s0, s2, s3, s0
  4)        011       s0, s2, s3, s0
  5)        1000      s0, s1, s2, s3, s0
  6)        1001      s0, s1, s2, s3, s0
  ...
  12)       1111      s0, s1, s2, s3, s0
Fig 3.2
  scenario  sequence    expected behaviour
  1)        0000000000  s0, s1, s2, ..., s9, s0
  2)        0000000001  s0, s1, s2, ..., s9, s0
  ...
  1024)     1111111111  s0, s1, s2, ..., s9, s0
Fig 3.3
  scenario  sequence  expected behaviour
  1)        0...00    (s0,q0), (s1,q1), (s2,q2), (s0,q3), (s1,q4), (s2,q0), (s0,q1), (s1,q2), (s2,q3), (s0,q4), (s1,q0), (s2,q1), (s0,q2), (s1,q3), (s2,q4), (s0,q0)
  2)        0...01    same behaviour
  ...
  2^15)     1...11    same behaviour
Summary
           scenarios  max len  min flops
  Fig 3.1  12         4        2
  Fig 3.2  1024       10       4
  Fig 3.3  2^15       15       5 or 4
For Fig 3.3, if we implement each machine separately we need 5 flops: 2 for the S machine and 3 for the Q machine. If we merge the state machines, we need $\lceil \log_2 (3 \times 5) \rceil = 4$ flops.
One of the purposes of this exercise is to illustrate how many test vectors are required to exhaustively test the behaviour of even simple circuits. It also demonstrates how the structure of a circuit affects the number of test vectors needed. Size alone is not the determining factor.
2. State Machines in General: If a circuit has n signals of 1 bit each that are the outputs of flip-flops and m 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of states that the circuit can have?
Answer:
The maximum number of states for a circuit with n flops is $2^n$. The values of combinational signals are determined by the flops and the inputs, and so they don't contribute to the total number of states.
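The relationship between flip-flops and states runs in both directions, and a short calculation makes it concrete. The following Python sketch is illustrative only (the function names max_states and min_flops are assumptions); it reproduces the flop counts for the three-state and five-state machines of Fig 3.3, separately and merged.

```python
def max_states(num_flops: int) -> int:
    """Upper bound on the number of states reachable with num_flops flip-flops."""
    return 2 ** num_flops


def min_flops(num_states: int) -> int:
    """Smallest number of flip-flops that can encode num_states states.

    Equivalent to ceil(log2(num_states)), computed with integers to avoid
    floating-point rounding.
    """
    return (num_states - 1).bit_length()


if __name__ == "__main__":
    separate = min_flops(3) + min_flops(5)   # S machine plus Q machine
    merged = min_flops(3 * 5)                # one machine with 15 product states
    print(separate, merged, max_states(4))
```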
3.1.4
Additional Problem
3.1.5
You're on the functional validation team for a chip that will control a simple portable CD player. Your task is to create a plan for the functional validation of the signals in the entity cd_digital. You've been told that the player behaves just like all of the other CD players out there. If your test plan requires knowledge about any potential nonstandard features or behaviour, you'll need to document your assumptions.
[Figure: CD player front panel with a display showing track, min and sec, and buttons prev, stop, play, next and pwr]
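library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- provides the unsigned type

-- NOTE: next and open are reserved words in VHDL, so these two port
-- names would need to be renamed before the entity will compile.
entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev, stop, play, next, pwr : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    open : in std_logic;
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;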
Answer:
test1
  specification: when power is turned on, the display will show the number of tracks on the CD, and the minutes and seconds will show the total length of the CD.
  stimulus: power=0; wait; power=1; all other signals are 0.
  check: display outputs of circuit match specification
test2
  specification: when power is on, play starts the CD playing; display for track=1; min and sec show remaining time for song and start decrementing.
  stimulus: power=1; play=0; wait; play=1; all other signals are 0.
  check: display outputs of circuit match specification
test3
  specification: when power is on and CD is playing, next starts next song. Display for track increments; min and sec show remaining time for next song and start decrementing.
  stimulus: power=1; play=0; next=0; wait; play=1; wait; next=1; all other signals are 0.
  check: display outputs of circuit match specification
test4
  specification: when power is on and CD is playing, prev starts previous song. Display for track decrements; min and sec show remaining time for previous song and start decrementing.
  stimulus: power=1; play=0; prev=0; wait; play=1; wait; prev=1; all other signals are 0.
  check: display outputs of circuit match specification
test5
  specification: when power is on and CD is playing, stop causes CD to stop.
  stimulus: power=1; play=0; stop=0; wait; play=1; wait; stop=1; all other signals are 0.
  check: display outputs of circuit match specification
Justification for choices: These cases test the basic operations of the CD player. Each test focusses on a different aspect of the player's behaviour.
Answer:
case 1: press both prev and next while a CD is playing
case 2: open the case while a CD is playing
case 3: press play and stop at the same time
case 4: press any button other than power when the player is off
case 5: press next repeatedly until track counter wraps around
Role of corner cases: The purpose of corner cases is to test unusual situations that designers might not have thought of, and so are more likely to contain bugs than normal behaviour.
Chapter 4
SOL-07 Preliminaries
4.1
Farmer
A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard to the market.
Facts:
                                     big truck   small truck
  capacity of truck                  12 tonnes   6 tonnes
  speed when loaded with apples      15 kph      30 kph
  speed when unloaded (no apples)    38 kph      70 kph
  distance from orchard to market    120 km
  size of harvest                    85 tonnes
1. All of the loads of apples must be carried using the same truck.
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after the last load.
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels at either its fully loaded or its empty speed.
Question: Which truck will result in the least elapsed time and what percentage faster will the elapsed time be?
Answer:
$NumTrips = \lceil Harvest / Capacity \rceil$
$TimeTot = NumTrips \times (TimeLoaded + TimeUnloaded)$
All trips are for the same distance, so distance cancels out of the equations: $Time \propto 1 / Speed$.
$TimeTot_{Big} \propto \lceil 85/12 \rceil \times (1/15 + 1/38) = 8 \times 0.0930 = 0.7439$
$TimeTot_{Small} \propto \lceil 85/6 \rceil \times (1/30 + 1/70) = 15 \times 0.0476 = 0.7143$
The small truck will take less time.
$PctFaster = (Time_{Slow} - Time_{Fast}) / Time_{Fast} = (TimeTot_{Big} - TimeTot_{Small}) / TimeTot_{Small} = (0.7439 - 0.7143) / 0.7143 = 4.14\%$
Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it? If not, explain.
Answer:
Use two drivers.
Use a combination of the small truck and large truck to improve his utilization.
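The truck comparison can be checked numerically. The Python sketch below is an illustration, not part of the original solution; the function name total_time_hours is an assumption. It keeps the 120 km distance in the calculation instead of cancelling it, so the results are in hours.

```python
import math


def total_time_hours(harvest_t, capacity_t, loaded_kph, unloaded_kph, distance_km=120):
    """Elapsed time for all round trips with one truck (loading time ignored)."""
    num_trips = math.ceil(harvest_t / capacity_t)
    return num_trips * (distance_km / loaded_kph + distance_km / unloaded_kph)


big = total_time_hours(85, 12, 15, 38)
small = total_time_hours(85, 6, 30, 70)
pct_faster = 100 * (big - small) / small

if __name__ == "__main__":
    print(f"big truck:   {big:.1f} h")
    print(f"small truck: {small:.1f} h")
    print(f"small truck is {pct_faster:.2f}% faster")
```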
4.2
The BigLan network protocol runs at a data rate of 160 Mbps (megabits per second). Each BigLan packet contains 100 bytes of routing information and 1000 bytes of data. You are working on the DataChopper router, which has the following performance numbers:
  clock speed                                                              75 MHz
  number of clock cycles to process the routing information for a packet  500
  CPI for a byte of data                                                   4
4.2.1
Maximum Throughput
Which has a higher maximum throughput (as measured in bits per second), the network or your router, and how much faster is it?
Answer: The maximum data throughput of the two technologies, in terms of bits, can be calculated as follows:
1. BigLan Network Protocol
   Maximum data throughput = 160 Mbps × (8000 data bits / 8800 packet bits)
                           = 145.45 Mbps
2. DataChopper Router
   Time required for a packet = 500 clock cycles + 0.5 CPI per data bit × 8800 packet bits
                              = 500 clock cycles + 4400 clock cycles
                              = 4900 clock cycles
                              = 4900 clock cycles × 13.33 ns per cycle
                              = 65333 ns per packet
   Time per data bit          = 65333 ns per packet / 8000 data bits
                              = 8.167 ns per data bit
   Maximum data throughput    = 1 / 8.167 ns per data bit
                              = 122.45 Mbps
The network has a higher maximum throughput. What percentage higher?
   n% higher performance = (perf high - perf low) / perf low
                         = (145 - 122) / 122
                         = 19%
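The throughput comparison can be reproduced with a few lines of Python. This is a sketch under the same modelling choice as the worked answer, namely that the router spends 4 cycles on every byte of the packet (routing bytes included) on top of the 500 routing cycles; the constant and function names are illustrative assumptions.

```python
LINK_RATE_MBPS = 160.0    # BigLan raw data rate
ROUTING_BYTES = 100       # routing information per packet
CLOCK_MHZ = 75.0          # DataChopper clock speed
ROUTING_CYCLES = 500      # cycles to process the routing information
CYCLES_PER_BYTE = 4       # CPI for a byte


def network_data_rate_mbps(data_bytes: int) -> float:
    """Useful data rate of the network: raw rate scaled by the data fraction."""
    return LINK_RATE_MBPS * data_bytes / (data_bytes + ROUTING_BYTES)


def router_data_rate_mbps(data_bytes: int) -> float:
    """Useful data rate of the router for packets with data_bytes of payload."""
    cycles = ROUTING_CYCLES + CYCLES_PER_BYTE * (data_bytes + ROUTING_BYTES)
    packet_time_us = cycles / CLOCK_MHZ          # cycles / MHz gives microseconds
    return 8 * data_bytes / packet_time_us       # bits per microsecond = Mbps


network = network_data_rate_mbps(1000)
router = router_data_rate_mbps(1000)
pct_higher = 100 * (network - router) / router

if __name__ == "__main__":
    print(f"network {network:.2f} Mbps, router {router:.2f} Mbps, "
          f"network is {pct_higher:.0f}% higher")
```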
4.2.2
Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process).
Answer:
As packet size increases, the overhead associated with the constant routing delay becomes less significant. The data rate of the router will slowly approach that of the network, but it will never surpass the network throughput. If there were no overhead for routing, the peak data rate for the router would be 150 Mbps, compared to 160 Mbps for the network.
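The trend can be illustrated by sweeping the payload size. The Python sketch below assumes that the routing information stays at 100 bytes while the data portion grows, and that every byte costs 4 cycles as in the earlier calculation; the function names are illustrative.

```python
def router_rate_mbps(data_bytes: int) -> float:
    """Router data rate for a packet with 100 routing bytes plus data_bytes of data."""
    cycles = 500 + 4 * (data_bytes + 100)     # fixed routing work + 4 cycles per byte
    return 8 * data_bytes / (cycles / 75.0)   # 75 MHz clock, result in Mbps


def network_rate_mbps(data_bytes: int) -> float:
    """Network data rate for the same packet format at 160 Mbps raw."""
    return 160.0 * data_bytes / (data_bytes + 100)


if __name__ == "__main__":
    for size in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{size:>9} data bytes: router {router_rate_mbps(size):7.2f} Mbps, "
              f"network {network_rate_mbps(size):7.2f} Mbps")
```

The router rate climbs toward the 150 Mbps ceiling (75 MHz at 0.5 cycles per bit) and the network rate toward 160 Mbps.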
4.3
If performance doubles every two years, by what percentage does performance go up every month?
Answer:
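One way to approach this question is to treat the doubling as steady compound growth over 24 months, so the monthly growth factor is the 24th root of 2. The Python sketch below shows that calculation; it is an illustration under that assumption, not the original answer text.

```python
MONTHS_PER_DOUBLING = 24   # performance doubles every two years

monthly_factor = 2 ** (1 / MONTHS_PER_DOUBLING)
monthly_pct = 100 * (monthly_factor - 1)

if __name__ == "__main__":
    print(f"monthly improvement: {monthly_pct:.2f}%")
    # Compounding the monthly factor over 24 months recovers the doubling.
    print(f"check: {monthly_factor ** MONTHS_PER_DOUBLING:.3f}x after 24 months")
```

Note that simply dividing 100% by 24 (about 4.2%) overstates the monthly rate, because it ignores compounding.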
4.4
Microprocessors
The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction. Multiplies must be written in software using loops of shifts and adds. The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4. A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.
4.4.1
Average CPI
Question: What is the average CPI for the Y!v1? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.
Answer:
Use the following subscripts:
  1  Yme
  2  Y!v1
  3  Y!u2
The Yme is 10% faster than the Y!v1:
$Time_2 = 1.10 \times Time_1$
$Time = \frac{NumInst \times CPI}{ClockSpeed}$
$\frac{NumInst_2 \times CPI_2}{ClockSpeed_2} = 1.10 \times \frac{NumInst_1 \times CPI_1}{ClockSpeed_1}$
Both processors execute the same instructions, so $NumInst_2 = NumInst_1$:
$CPI_2 = 1.10 \times CPI_1 \times \frac{ClockSpeed_2}{ClockSpeed_1} = 1.10 \times 4 \times \frac{150MHz}{200MHz} = 3.3$
Common mistakes:
A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of the Y!u2 is 30% better than that of the Y!v1.
4.4.2
Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.
Answer:
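Solve for $CPI_3$. The Y!u2 is 30% faster than the Y!v1 and executes 10% fewer instructions:
$Time_2 = 1.3 \times Time_3$ and $NumInst_3 = 0.9 \times NumInst_2$
$\frac{NumInst_2 \times CPI_2}{ClockSpeed_2} = 1.3 \times \frac{0.9 \times NumInst_2 \times CPI_3}{ClockSpeed_3}$
$CPI_3 = \frac{CPI_2 \times ClockSpeed_3}{1.3 \times 0.9 \times ClockSpeed_2} = \frac{3.3 \times 180MHz}{1.3 \times 0.9 \times 150MHz} = 3.38$
Common mistakes: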
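Both CPI results follow from the relation Time = NumInst × CPI / ClockSpeed. The Python sketch below works through the two cases; it reads "10% faster" as Time(Y!v1) = 1.10 × Time(Yme) and "30% better" as Time(Y!v1) = 1.30 × Time(Y!u2), and the variable names are illustrative.

```python
# Yme: known clock speed and CPI
clock_yme_mhz, cpi_yme = 200.0, 4.0

# Y!v1: same instruction count as the Yme, and the Yme is 10% faster
clock_v1_mhz = 150.0
cpi_v1 = 1.10 * cpi_yme * clock_v1_mhz / clock_yme_mhz

# Y!u2: 10% fewer instructions than the Y!v1 and 30% better performance
clock_u2_mhz = 180.0
inst_ratio_u2 = 0.90      # NumInst(Y!u2) / NumInst(Y!v1)
speedup_u2 = 1.30         # Time(Y!v1) / Time(Y!u2)
cpi_u2 = cpi_v1 * clock_u2_mhz / (speedup_u2 * inst_ratio_u2 * clock_v1_mhz)

if __name__ == "__main__":
    print(f"CPI of Y!v1: {cpi_v1:.2f}")
    print(f"CPI of Y!u2: {cpi_u2:.2f}")
```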
4.4.3
Analysis
Which of the following do you think is most likely
1. the Y!u2 is basically the same as the Y!v1 except for the multiply
2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction
3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction
Answer: The most likely analysis is that the Y!u2 is basically the same as the Y!v1 except for the multiply. The Y!u2 has a slightly larger CPI than the Y!v1, which is in keeping with the addition of a multiply instruction: a multiply instruction probably has a larger-than-average CPI. The increase in clock speed likely comes from a new fabrication process, and would not have required significant changes to the design of the chip.
4.5
Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
 - you may change the times when signals are read from the environment
 - you may not increase the resource usage (input ports, registers, output ports, f components, g components)
 - you may not increase the clock period
[Figure: dataflow diagram with inputs a, b, c, d and e, built from f and g components]
4.6
This question deals with the implementation and optimization for the algorithm and library of circuit components shown below.
Algorithm:
  q = M[b];
  if (a > b) then
    M[a] = b;
    p = (M[b-1] * b) + M[b];
  else
    M[a] = b;
    p = M[b+1] * a;
  end;
Components:
  Component                   Delay
  Register                    5 ns
  Adder                       25 ns
  Subtracter                  30 ns
  ALU with ..., AND, XOR      40 ns
  Memory read                 60 ns
  Memory write                60 ns
  Multiplication              65 ns
  2:1 Multiplexor             5 ns
NOTES:
1. 25% of the time, a > b.
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and outputs.
5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p.
Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time. NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.
1. a ≤ b happens 75% of the time, so initially focus on this common case.
(a) a ≤ b means that a ≠ b+1, therefore can do the M[b+1] read in parallel with the M[a] write or with the M[b] read.
(b) But, could have a = b, so can't do the M[a] write in parallel with the M[b] read.
[Figure: dataflow diagram for the a ≤ b case; the path from b to p takes 150 ns and the multiplication takes 65 ns]
(c) Critical path is from b to p: 150ns + 5ns for mux on p = 155ns.
(d) Longest operation in diagram is multiplication: 65ns.
(e) Minimum clock period is 65ns + 5ns for register = 70ns.
[Figures: dataflow diagrams for the a ≤ b case at different clock periods]
  period    70 ns      75 ns      90 ns
  latency   5 cycles   4 cycles   3 cycles
  time      350 ns     300 ns     270 ns
(f) Minimum latency is 3 clock cycles, because we can't do all memory operations in parallel and need registers on both inputs and outputs.
(g) Best overall performance for the a ≤ b case is with a clock period of 90 ns.
2. Now try a > b with a 90 ns clock period.
(a) a > b means that a ≠ b and a ≠ b-1, so there are no memory address conflicts to create dependencies and complications.
[Figures: dataflow diagrams for the a > b case]
  period    90 ns      95 ns
  latency   4 cycles   3 cycles
  time      360 ns     285 ns
(b) Without going to a triple-ported memory, can't reduce latency below 3.
(c) Best performance for the a > b case is with a clock period of 95 ns.
3. Choose a 95 ns clock period, which gives a latency of 3 clock cycles for both options.
4. Optimize dataflow diagrams to reduce area without sacrificing performance.
[Figures: optimized dataflow diagrams for the two cases]
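The choice of a 95 ns clock can be checked by weighting the two cases by how often they occur. The Python sketch below uses the period and latency pairs from the two tables; it assumes that a schedule which fits in a given clock period also works, with the same number of cycles, at any longer period. The names are illustrative.

```python
# (clock period in ns) -> (latency in cycles), taken from the two tables
A_LE_B = {70: 5, 75: 4, 90: 3}   # common case, 75% of the time
A_GT_B = {90: 4, 95: 3}          # 25% of the time
P_A_GT_B = 0.25


def latency_at(options: dict, clock_ns: int) -> int:
    """Fewest cycles among the schedules whose required period fits the clock."""
    return min(cycles for period, cycles in options.items() if period <= clock_ns)


def average_time_ns(clock_ns: int) -> float:
    """Execution time averaged over the two cases at one shared clock period."""
    t_le = clock_ns * latency_at(A_LE_B, clock_ns)
    t_gt = clock_ns * latency_at(A_GT_B, clock_ns)
    return (1 - P_A_GT_B) * t_le + P_A_GT_B * t_gt


if __name__ == "__main__":
    for clock in (90, 95):
        print(f"{clock} ns clock: average execution time {average_time_ns(clock)} ns")
```

At 90 ns the a > b case needs a fourth cycle, which costs more on average than the extra 5 ns per cycle at 95 ns.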
Component Input Output Register Adder Subtracter ALU Memory read Memory write Multiplication 2:1 Multiplexor Clock Period Average Latency Average Execution Time
4.7
Multiply Instruction
You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip. If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed and will raise the cost.
[Figure: the FPGA option (FPGA alone) and the FPGA + MULT option (FPGA with external MULT chip)]
                                     FPGA option   FPGA + MULT option
  average CPI
  % of instrs that are multiplies
  CPI of multiply
  Clock speed
  Cost
4.7.1
Highest Performance
Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?
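Find the CPI for non-multiply (other) instructions from the FPGA option:
$CPI_{FPGA} = 0.1 \times CPI_{mult,FPGA} + 0.9 \times CPI_{other}$, which gives $CPI_{other} = 3.333$
$CPI_{FM} = 0.1 \times CPI_{mult,FM} + 0.9 \times CPI_{other} = 0.1 \times CPI_{mult,FM} + 0.9 \times 3.333 = 3.6$
$MIPs_{FPGA} = \frac{MHz_{FPGA}}{CPI_{FPGA}} = 40$
$MIPs_{FM} = \frac{MHz_{FM}}{CPI_{FM}} = \frac{160}{3.6} = 44.4$
$PctFaster = \frac{MIPs_{FM} - MIPs_{FPGA}}{MIPs_{FPGA}} = \frac{44.4 - 40}{40} = 11.1\%$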
4.7.2
Optimality
Which option, FPGA or FPGA+MULT, is more optimal (as measured in MIPs/$), and what percentage more optimal is the more optimal option?
Answer:
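$opt_{FPGA} = \frac{MIPs_{FPGA}}{Price_{FPGA}} = \frac{40}{20} = 2$
$opt_{FM} = \frac{MIPs_{FM}}{Price_{FM}} = \frac{44.4}{23} = 1.93$
$n\text{-}pct\text{-}optimal = \frac{opt_{FPGA} - opt_{FM}}{opt_{FM}} = \frac{2 - 1.932}{1.932} = 3.5\%$
The FPGA option is 3.5% more optimal than the FPGA+MULT option.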
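The performance and optimality comparisons can be recomputed from the intermediate figures that appear in the worked answers: 40 MIPs for the FPGA option, a 160 MHz clock and an average CPI of 3.6 for the FPGA+MULT option, and prices of 20 and 23. The Python sketch below takes those figures as given, since the original data table is not reproduced here, and the variable names are illustrative.

```python
# Figures taken from the worked answers (assumed, not from the data table)
mips_fpga = 40.0
clock_fm_mhz, cpi_fm = 160.0, 3.6
price_fpga, price_fm = 20.0, 23.0

mips_fm = clock_fm_mhz / cpi_fm                        # MIPs = MHz / CPI
pct_faster = 100 * (mips_fm - mips_fpga) / mips_fpga   # performance advantage of FPGA+MULT

opt_fpga = mips_fpga / price_fpga                      # MIPs per dollar
opt_fm = mips_fm / price_fm
pct_more_optimal = 100 * (opt_fpga - opt_fm) / opt_fm  # optimality advantage of FPGA

if __name__ == "__main__":
    print(f"FPGA+MULT: {mips_fm:.1f} MIPs, {pct_faster:.1f}% faster")
    print(f"MIPs/$: FPGA {opt_fpga:.2f}, FPGA+MULT {opt_fm:.2f}")
    print(f"FPGA is {pct_more_optimal:.1f}% more optimal")
```

With these figures the faster option is not the more cost-effective one: the extra 11% of performance costs 15% more.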
4.7.3
Performance Metrics
Explain whether MIPs is a good choice for the performance metric when making this decision.
Answer:
MIPs is a good metric for this example, because we are comparing two microprocessors that use the same instruction set and will be used in the same environment. In general, the disadvantage of MIPs is that it doesn't take into account that different instructions accomplish different amounts of work. This causes problems when comparing microprocessors that use different instruction sets (e.g. one with a cosine instruction and one without).
Chapter 5
SOL-08 Preliminaries
5.1
Terminology
Assume that the timing diagram shows the limits of the allowed times (either minimum or maximum). For each of the terms in the table below, answer which time periods (one or more of t1 to t9, or NONE) are examples of the term.
[Figure: timing diagram with intervals t1 to t9, marking where the signal is stable and where the signal may change]
5.2
5.3
Critical Path
[Figure: circuit with primary inputs a, b and c and signals d to m; a legend lists gate delays of 2, 4, 4 and 6]
Assume all delay and timing factors other than combinational logic delay are negligible.
5.3.1 Ignoring potential false paths, list the signals in the critical path through this circuit.
[Figure: the circuit annotated with signal arrival times: d = 6, e = 8, f = 8, g = 12, h = 4, i = 16, j = 18, k = 10, l = 16, m = 16]
5.3.2
5.3.3
Missing Factors
What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?
Answer:
5.3.4
False Path?
Is the critical path you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.
[Figure: the circuit with trial values assigned along the candidate path]
Answer:
Therefore, the first candidate critical path is a false path. NOTE: the rules for XOR require one of the inputs to remain stable, otherwise the output of the XOR will not change. Find the next candidate.
[Figure: the circuit annotated with arrival times for the next candidate path: d = 6, e = 6,8, f = 8, g = 10,12, h = 2, i = 16, j = 16, k = 10, l = 16, m = 16]
2. Quick check if g can change: Static equation for g a b bc Therefore, g can change.
(a) For f, have choice of 1 or 0→1 on f, because of reconvergent fanout.
(b) Try 1 first, because it's simpler.
(c) For g, have choice of 0 or 0→1 on d, because of reconvergent fanout.
(d) Try 0 first, because it's simpler.
(e) d is ok, 0 from both sides.
(f) Conflict on output of inverter.
3. Try 0→1 on i
4. Try 0→1 on i, 0→1 on f.
5. Try 0→1 on i, 0→1 on f, 0→1 on d. Conflict on d.
6. Try 0→1 on m. Conflict on e.
7. Try 0→1 on l.
(a) For e, have choice on b of whether to invert or not, because e is an xor.
(b) Because path from h is propagating 1→0 and e is 0→1, need to invert.
(c) For inversion, need to put a 1 on c.
(d) Conflict on b.
8. Need to get g to toggle.
(a) Static equation for g is a b bc. Only assignment that makes g=0 is abc. Only assignment that causes g to toggle because of change on b is a=0, b=0→1, c=0.
(b) Try to push rising edge on b through g to i, j, m, or l, with a=0 and c=0.
[Figure: the circuit with a=0, c=0 and a rising edge on b]
(c) Can't get rising edge on b to toggle both g and an output. Therefore, critical path does not go through both b and g.
9. Find next candidate path.
[Figure: the circuit annotated with arrival times for the next candidate path: d = 6, e = 6,8, g = 10, h = 4, i = 14, j = 16, k = 10, l = 14, m = 14, followed by the same circuit with trial values assigned]
10. Can't get rising edge on c to toggle both g and j. However, the rising edge can toggle i and l. Both the path from c to i and the path from c to l have a delay of 14.
11. The pair of assignments abc and abc will exercise the critical paths from c to i and c to l, both of which have a delay of 14.
5.4
Timing Models
In your next job, you have been told to use a fanout timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that. For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.
[Figure: schematic and layout of gate G1 driving gates G2, G3, G4 and G5]
  Symbol / Description     Capacitance   Resistance
  Gate                     Cg            0
  Interconnect level 2     Cx            0
  Interconnect level 1     Cy
  Antifuse
Assumptions:
The capacitance of a node on a wire is independent of where the node is located on the wire.
5.5
5.5.1
Worst-Case Commercial
Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature)
Answer: For worst-case commercial condition, assuming that TA = TJ, Logic Module delay, tPD, for ACT 3 Std with 4 fanout is 5.7 ns (see Smith Table 5.2). Assume this is the slowest path, then estimated critical path delay between registers, tCRIT (worst-case commercial) is:
tCRIT
5.5.2
Worst-Case Industrial
Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).
Answer: For worst-case industrial conditions, assuming that TA = TJ, the derating factor is 1.07 (see Table 5.3). Hence the delay tCRIT (worst-case industrial) is 7% greater than the worst-case commercial delay: 1.07 × 9.5 ns = 10.2 ns
5.5.3
Answer: For worst-case industrial conditions, the derating factor at 105 °C is found by linear interpolation between the values for 85 °C (1.07) and 125 °C (1.17). The interpolated derating factor is 1.12. Hence the delay is: tCRIT (worst-case industrial, TJ = 105 °C) = 1.12 × 9.5 ns = 10.6 ns.
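The interpolation step can be written out explicitly. The Python sketch below uses the two derating factors quoted in the answer (1.07 at 85 °C and 1.17 at 125 °C) and the 9.5 ns worst-case commercial delay; the function name derating_factor is an illustrative assumption.

```python
def derating_factor(temp_c: float,
                    t_lo: float = 85.0, k_lo: float = 1.07,
                    t_hi: float = 125.0, k_hi: float = 1.17) -> float:
    """Linear interpolation of the derating factor between two table entries."""
    return k_lo + (k_hi - k_lo) * (temp_c - t_lo) / (t_hi - t_lo)


T_CRIT_COMMERCIAL_NS = 9.5   # worst-case commercial delay used in the answers

if __name__ == "__main__":
    for temp in (85, 105, 125):
        k = derating_factor(temp)
        print(f"{temp} C: factor {k:.2f}, delay {k * T_CRIT_COMMERCIAL_NS:.1f} ns")
```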
SOL-09 Preliminaries
5.6
Short Answer
5.6.1
Wires in FPGAs
In an FPGA today, what percentage of the clock period is typically consumed by wire delay?
Answer: 40–60%
5.6.2
If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?
Transistors have gotten smaller, die size has remained roughly the same or even increased, and clock speeds are increasing. Signals are travelling roughly the same distance as before, but driving smaller capacitive loads. Thus, wire delay is not decreasing much, but capacitive load is decreasing. The clock period is decreasing, so the wire delay is taking up a larger percentage of the clock period and capacitive load delay is taking up a smaller percentage.
5.6.3
As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?
Answer: Increase. Justification: As temperature increases, atoms vibrate more, and so have a greater probability of colliding with the electrons flowing with the current. This increases resistivity, which increases delay.
5.7
5.7.1
Cause
5.7.2
Behaviour
What is the bad behaviour that results if a hold time violation occurs?
5.7.3
Rectication
If a circuit has a hold time violation, how would you correct the problem with minimal effort?
5.8
Latch Analysis
Does the circuit below behave like a latch? If not, explain why not. If so, calculate the hold time and answer whether it is active-high or active-low.
[Figure: circuit with inputs d and en]
Answer:
[Figure: the circuit in load mode and in store mode]
From the mode diagrams, if the circuit is a latch, it is active high, because latch is in load mode when en=1.
Now check if timing of circuit is correct. The critical transition is from load mode to store mode.
[Figure: the circuit with node labels d, en, cn, l1, s1 and q]
The hold time constraint must prevent a new value arriving at d before en sets l1 to 1. Delay along the data path is 0. Delay along the clock path is 1. Hold time is 1.
Summary: the circuit is a latch, its hold time is 1, and it is active high.
5.9
Chapter 6
Power Problems
SOL-10 Preliminaries
6.1
6.1.1
Short Answers
Answer:
Leakage power will increase, because the equation for the leakage power is: $LeakagePower \propto e^{-q V_{Th} / (k T)}$, where T is temperature.
Short circuiting power will increase because: As temperature increases, atoms vibrate more, and so have a greater probability of colliding with the electrons flowing with the current. This increases resistivity, which increases delay. Signals will rise and fall more slowly, which will increase the short circuiting time, and hence increase short circuiting power.
Answer: Increase transistor size so as to increase threshold voltage. This will require an increase in supply voltage, which will likely increase total power. Alternative: when increase transistor size, keep the supply voltage the same, but decrease performance. Alternative: change fabrication process and materials to reduce leakage current. This will likely be expensive. Alternative: Use dual-Vt fabrication process.
Answer:
If the circuitry has a high utilization rate, then the power consumed by the clock gating circuit could be more than that saved in the main circuit.
Alternative: Even if the utilization rate is low, the utilization pattern could prevent the clock gating circuitry from turning off the clock to the main circuit. For example, if the circuit receives new data every other clock cycle, it would have a utilization rate of 50%, but might need to be powered up 100% of the time.
Answer:
Gray coding is designed to reduce power, because only one bit changes when incrementing or decrementing. Program counters usually increment, rather than jump to completely different values. So, using gray coding should reduce power consumption. The downside is that the memory system probably doesn't use gray-coded addresses, so additional circuitry would be needed to convert between gray and binary codes. This will increase area and likely decrease performance. Additionally, the extra circuitry to do the translation might require more power than is saved by using gray coding.
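The saving in switching activity can be counted directly. The Python sketch below compares the number of bit transitions for a 4-bit counter stepping from 0 to 15 in binary and in gray code; the function names are illustrative, and the count covers only the counter bits, not any gray-to-binary conversion logic.

```python
def to_gray(n: int) -> int:
    """Convert a binary-coded integer to its reflected gray code."""
    return n ^ (n >> 1)


def bit_flips(sequence) -> int:
    """Total number of bits that change between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


binary_count = list(range(16))
gray_count = [to_gray(n) for n in binary_count]

if __name__ == "__main__":
    print("binary flips:", bit_flips(binary_count))
    print("gray flips:  ", bit_flips(gray_count))
```

For an incrementing sequence the gray-coded counter toggles exactly one bit per step, so the activity factor on the address lines drops accordingly.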
6.1.2
VLSI Gurus
The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1 ns. With their fabrication tweaks, they can decrease this to 0.85 ns.
6.1.2.1 Effect on Power If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)
Answer: Reducing the rise and fall time from 1 ns to 0.85 ns reduces the short-circuit time in the same proportion. Hence, the new short-circuit power is 85% of the original.
6.1.2.2 Critique
A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.
Answer: The plan was probably to increase clock speed by 15%. However, reducing Tshort by 0.15 ns can at most decrease the clock period by 2 × 0.15 = 0.30 ns, while the clock period is much greater than 1 ns. Therefore, it does not work.
6.1.3
Advertising Ratios
One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their products. This wide variety of different metrics has confused them. Explain whether each metric is a reasonable metric for customers to use when choosing a system. If the metric is reasonable, say whether bigger is better (e.g. 500 MIPs/mm² is better than 20 MIPs/mm²) or smaller is better (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which one type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.
SOL-11 Preliminaries
6.1.4
As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases. The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:
$MaxClockSpeed \propto \frac{(V - V_{Th})^2}{V}$
With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?
Answer:
$MaxClockSpeed \propto \frac{(V - V_{Th})^2}{V}$, where $V$ is the supply voltage and $V_{Th}$ is the threshold voltage.
$\frac{MaxClockSpeed_2}{MaxClockSpeed_1} = \frac{(1.5V - 0.8V)^2 / 1.5V}{(3V - 0.8V)^2 / 3V}$
$MaxClockSpeed_2 = 200MHz \times \frac{0.3267}{1.6133} = 40.5MHz$
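The scaling can be evaluated numerically. The Python sketch below assumes the proportionality MaxClockSpeed ∝ (V − VTh)² / V, with the constant of proportionality fixed by the 200 MHz measurement at 3 V; the function names are illustrative.

```python
def speed_factor(supply_v: float, threshold_v: float) -> float:
    """Quantity that the maximum clock speed is proportional to."""
    return (supply_v - threshold_v) ** 2 / supply_v


def scaled_clock_mhz(clock_mhz: float, v_old: float, v_new: float,
                     threshold_v: float) -> float:
    """Maximum clock speed after changing the supply voltage from v_old to v_new."""
    return clock_mhz * speed_factor(v_new, threshold_v) / speed_factor(v_old, threshold_v)


new_clock = scaled_clock_mhz(200.0, v_old=3.0, v_new=1.5, threshold_v=0.8)

if __name__ == "__main__":
    print(f"maximum clock speed at 1.5 V: {new_clock:.1f} MHz")
```

Halving the supply voltage cuts the clock speed by about a factor of five, because the supply gets much closer to the threshold voltage.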
6.1.5
6.1.6
 - You need to increase the clock speed of a chip by 10%.
 - You must not increase its dynamic power consumption.
 - The only design parameter you can change is supply voltage.
 - Assume that short-circuiting current is negligible.
Only need to reduce dynamic power, therefore neglect static (leakage) power.
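\[ \mathrm{Power} = \tfrac{1}{2}\, A\, C\, V^2 F \]
where $A$ is the activity factor, $C$ the capacitance, $V$ the supply voltage, and $F$ the clock speed. Keeping the power unchanged while the clock speed rises to $F_2 = 1.1\,F_1$:
\[ \tfrac{1}{2} A C V_1^2 F_1 = \tfrac{1}{2} A C V_2^2 (1.1\,F_1) \quad\Rightarrow\quad V_2 = \frac{V_1}{\sqrt{1.1}} \approx 0.95\,V_1 \]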
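As a quick numerical check of the voltage scaling, the sketch below assumes dynamic power is proportional to V² × ClockSpeed with activity factor and capacitance unchanged. The function name `voltage_ratio_for_constant_power` is hypothetical.

```python
import math


def voltage_ratio_for_constant_power(speed_ratio):
    """Return V2/V1 that keeps dynamic power constant.

    Assumption: Power is proportional to V^2 * F, so
    V1^2 * F1 == V2^2 * (speed_ratio * F1).
    """
    return 1.0 / math.sqrt(speed_ratio)


ratio = voltage_ratio_for_constant_power(1.10)  # 10% faster clock
print(f"V2 = {ratio:.3f} * V1")
```

Under this assumption the supply voltage has to drop by roughly 5% to absorb a 10% increase in clock speed.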
Answer: Decreasing the supply voltage will bring it closer to the threshold voltage. As the difference between the supply and threshold voltage decreases, it will limit the maximum frequency that the circuit can run at. This then leads to decreasing the threshold voltage, which will then increase the leakage current, and raise the static power dissipation:
6.1.7
In each low power approach described below identify which component(s) of the power equation is (are) being minimized and/or maximized:
Answer: Scaling the supply voltage (V) reduces the dynamic power
Answer: Resizing a transistor to increase the width-to-length ratio decreases the resistance of the transistor, which makes it faster. This means that the supply voltage can be reduced to save power while maintaining performance. However, increasing the width-to-length ratio increases the capacitance. After a certain point, the capacitance increase becomes more significant than the reduction in supply voltage, causing power to increase. Therefore, resizing is adjusting supply voltage and load capacitance to minimize their product in the switching power component.
Answer: When inputs are registered, the activity factor is decreased, which decreases the dynamic power.
Answer: Gray coding reduces the activity factor on signals that typically change by 1 or a small amount. Address signals have this behaviour, in contrast to data signals, where consecutive values are often completely different. Reducing the activity factor will reduce the dynamic power.
6.1.8
While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip. The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted. The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip. The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.
6.1.8.1 Hypothesis
Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.
6.1.8.2 Experiment
Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.
6.1.8.3 Reality
The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her suggestion is infeasible.
Chapter 7
SOL-12 Preliminaries
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
7.1
A modern (circa 1995) production tester costs US$5-10 million. This cost is depreciated over the life of the tester (usually five years in the States due to tax guidelines).
1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a day, 365 days per year, how much does one second of test time cost?
Answer:
\[ \mathrm{CostPerSecond} = \frac{\mathrm{TesterCost}}{5 \times 365 \times 24 \times 60 \times 60} \]
$0.031 for a US$5 million tester
$0.062 for a US$10 million tester
2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind schedule. After the chips begin shipping, the tester is used 100% of the time. What is the cost of testing the chips relative to the cost if the chips had been completed on time?
Answer: 6 months is 10% of a 5-year lifespan. Therefore the tester will test 90% of the total number of chips that it would normally test. The cost per chip for testing will be:
\[ \mathrm{NewTestCost} = \frac{\mathrm{OrigTestCost}}{0.90} = 111\%\ \mathrm{OrigTestCost} \]
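3. The dimensions of the die to be tested are 20 mm × 10 mm. The wafers are 200 mm in diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the number of die per wafer is equal to wafer area divided by chip area. What percentage of the fabrication + test cost is for test if the chip is on schedule and requires 1 minute to test?
Answer:
\[ \mathrm{DiePerWafer} = \frac{\mathrm{WaferArea}}{\mathrm{DieArea}} = \frac{\pi\,(200/2)^2}{10 \times 20} = 157 \]
\[ \mathrm{DieFabCost} = \frac{\$3000}{157} = \$19.10 \]
\[ \mathrm{DieTestCost} = \mathrm{TestTime} \times \mathrm{TestCostPerSec} = 60 \times \$0.062 = \$3.72 \]
\[ \mathrm{TestCostPct} = \frac{\mathrm{DieTestCost}}{\mathrm{DieTestCost} + \mathrm{DieFabCost}} = \frac{3.72}{3.72 + 19.10} = 16.3\% \]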
7.2
Given information:
The ACHIP costs $10 without any testing.
Each board uses two ACHIPs (plus lots of other chips that we don't care about).
68% of the manufactured ACHIPs do not have any faults.
For the ACHIP, it costs $1 per chip to catch half of the faults.
Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run).
If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace the ACHIP(s). (This is an approximation, based on the fact that the cost of the chip is much less than the total cost of $200.)
Board-level testing will detect 100% of the faults in an ACHIP.
What fault escapee rate will result in the lowest total cost for ACHIPs?
However, here we have two ACHIPs per board, so we need to use the escapee probability to compute the probability of a board needing to be replaced, and use that probability in the revised equation for total cost (TotCost).
The testing cost doubles, because we have two ACHIPs per board to test. The probability of a board having at least one bad ACHIP (and therefore needing to be replaced) is 1 - the probability that both ACHIPs are good.
\[ \mathrm{ReplaceProb} = 1 - (1 - \mathrm{EscapeeProb})^2 \]
The chips will have a lowest cost if either $8 or $16 is spent on testing and they have a fault escapee rate of 4% or 2%. We choose to spend $16 on testing, because that has a lower escapee rate for the same total cost. The lower escapee rate will improve our reputation for quality.
7.3
In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo ≥ 1), and average fanin of fi, what is the minimum number of faults that must be considered when using a single-stuck-at fault model?
Answer:
The minimum number of wire segments to connect a gate or input to fo other gates or outputs is fo + 1. (Assuming fo > 1. If fo = 1, then the minimum number of wire segments is 1.) With i inputs and g gates, this results in (i + g) × (fo + 1) wire segments. Each wire segment has two possible faults (stuck-at-1 and stuck-at-0), therefore there are 2 × (i + g) × (fo + 1) potential single-stuck-at faults that must be considered.
NOTE: the fanin degree does not directly factor into this equation. However, there is a relationship between the number of gates g, the number of inputs i, the depth of the circuitry, the fanout degree fo, and the fanin degree fi. For example, the maximum number of gates whose inputs are all primary inputs is i × fo / fi.
7.4
Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.
Answer:
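AND: either input @0 collapses with the output @0.
OR: either input @1 collapses with the output @1.
NAND: either input @0 collapses with the output @1.
NOR: either input @1 collapses with the output @0.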
A two-input mux does not have any controlling inputs, so it does not have any collapsible faults.
7.5
Given a correct circuit, and a non-stuck-at fault (e.g. bridging AND), will a single-stuck-at fault model detect the fault? If so, identify a single-stuck-at fault that will detect it, or explain why it can't be detected.
7.6
Undetectable Faults
Identify one of the undetectable single stuck-at faults in the circuit below, or say NONE if all single stuck-at faults are detectable.
[Figure: circuit with inputs a, b, c, lines L1 to L8, and output z]
7.7
Your task is to generate test vectors to detect faults in the circuit shown below. Your manager has said that manufacturing only has time to run three test vectors on the circuit.
[Figure: circuit with inputs a, b, c and lines L1 to L8]
7.7.1
Which test vectors should you run and in what order should you run them?
7.7.2
Write a brief statement (backed up with data) to support either staying with three test vectors or increasing the test suite to four vectors.
7.8
A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and two of 12,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 50% of full speed. Calculate the total test time.
Answer:
We can load and unload all of the scan chains at the same time, so time will be limited by the longest (30,000 bits). For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

Clock cycles   30,000   1     30,000   1     30,000   ...
Vector 1       Load     Run   Dump
Vector 2                      Load     ...
Vector 3                                     ...

TimeTot = 20.8 secs
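The cycle count behind the schedule can be sketched as below. The model is an assumption: one initial load of the longest chain, then for every vector one run cycle plus one shift of the longest chain, with each dump overlapped with the next load. The function name `scan_test_time` is hypothetical. Under this model with a 30,000-bit longest chain and a 600 MHz test clock the result is about 25.0 s; the 20.8 secs figure corresponds to 25,000-cycle shifts at the same test clock, so the answer is sensitive to which chain length is used.

```python
def scan_test_time(chain_lengths, num_vectors, full_clock_hz, speed_fraction):
    """Return (total clock cycles, seconds) for overlapped scan testing.

    Assumptions: all chains shift in parallel, so the longest chain sets
    the shift time; the dump of vector k overlaps the load of vector k+1.
    """
    longest = max(chain_lengths)
    # Initial load, then (run + shift) once per vector.
    cycles = longest + num_vectors * (1 + longest)
    seconds = cycles / (full_clock_hz * speed_fraction)
    return cycles, seconds


chains = [30_000, 20_000, 24_000, 25_000, 12_000, 12_000]
cycles, seconds = scan_test_time(chains, 500_000, 1.2e9, 0.5)
print(cycles, f"{seconds:.1f} s")
```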
7.9
BIST
In this problem, we will revisit the circuit from section 7.3.1, which is shown below. But, this time we'll use BIST to test the circuit, rather than analyzing the faults and then choosing test vectors to catch the potential faults.
[Figure: circuit under test with inputs a, b, c and lines L1 to L8]
7.9.1
Characteristic Polynomials
Derive the characteristic polynomials for the linear feedback shift registers shown below:
[Figure: two three-flop linear feedback shift registers, each with signals d0, q0, d1, q1, d2, q2 and a set input]
Answer: Both circuits have three flops, so their maximum exponent is $x^3$. A feedback tap on a signal $d_i$ corresponds to a coefficient of 1 on $x^i$ in the characteristic polynomial. The first circuit has feedback taps for d0, d1, and d2. This gives a characteristic polynomial of:
\[ x^3 + x^2 + x + 1 \]
The second circuit has taps on d0 and d1, but not one on d2:
\[ x^3 + x + 1 \]
7.9.2
Test Generation
Answer:
For an LFSR with n flops, the length of a maximal-length non-repeating sequence is $2^n - 1$. Both of the LFSRs under consideration have 3 flops, so we are looking for a sequence of 7 non-repeating values. We will first simulate the circuits to see their values, and then demonstrate how characteristic polynomials and division over Galois fields can be used to accomplish the same thing.

For $x^3 + x^2 + x + 1$:

      d0  q0  d1  q1  d2  q2
1)    1   1   0   1   0   1
2)    0   1   1   0   0   0
3)    0   0   0   1   1   0
4)    1   0   1   0   1   1
5)    1   1   0   1   0   1    same as 1)

For $x^3 + x + 1$:

      d0  q0  d1  q1  d2  q2
1)    1   1   0   1   1   1
2)    1   1   0   0   0   1
3)    0   1   1   0   0   0
4)    0   0   0   1   1   0
5)    1   0   1   0   0   1
6)    0   1   1   1   1   0
7)    1   0   1   1   1   1
8)    same as 1)
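The two simulations can be reproduced with a short sketch. It models each LFSR as polynomial division (shift left, then XOR the characteristic polynomial when a 1 shifts out of the top flop), which is an assumption about the wiring rather than a transcription of the schematics. States are read as (q2, q1, q0), and the name `lfsr_states` is illustrative.

```python
def lfsr_states(poly, width, seed):
    """List the states visited from `seed` until the seed recurs.

    poly: integer whose bit k is set when x^k appears in the
          characteristic polynomial (including the x^width term).
    The loop is bounded by 2**width steps in case the seed never recurs.
    """
    states = []
    state = seed
    for _ in range(2 ** width):
        states.append(state)
        state <<= 1
        if (state >> width) & 1:   # a 1 shifted out of the top flop
            state ^= poly          # apply the feedback taps
        if state == seed:
            break
    return states


seq_a = lfsr_states(0b1111, 3, 0b111)  # x^3 + x^2 + x + 1
seq_b = lfsr_states(0b1011, 3, 0b111)  # x^3 + x + 1
print([format(s, "03b") for s in seq_a])
print([format(s, "03b") for s in seq_b])
```

The first polynomial repeats after 4 states, while the second visits all 7 non-zero states, ending on 110 before returning to 111.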
For $x^3 + x + 1$, we see that it generates a sequence of 7 different values before repeating. The circuit has three flops, so the maximum length sequence of non-repeating values it can generate is $2^3 - 1$, which is 7. Thus, $x^3 + x + 1$ is a maximal length linear feedback shift register.
Format for division: the LFSR polynomial is the divisor and the message is the dividend, with the quotient written above the message and the remainder at the bottom.
For an LFSR with no external input and n flops, the first n coefficients of the message are the reset values of the LFSR, and all of the other remaining coefficients are 0. For a test vector generator LFSR, the reset values are all 1s. We hope to have a sequence of 7 unique remainders. With the three initial values in the LFSR flops, we require a message polynomial of 3 + 7 = 10 values.
The message polynomial is then:
\[ 1x^9 + 1x^8 + 1x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 0x^1 + 0x^0 \]
Dividing the message polynomial by the LFSR polynomial $1x^3 + 0x^2 + 1x^1 + 1x^0$ gives:
Quotient: $1x^6 + 1x^5 + 1x^2 + 1x^0$
Remainder: $1x^2 + 1x^1 + 1x^0$
The values on the flip flops inside an LFSR with n flops show up as the n-most-significant coefficients on the polynomials immediately below the subtraction lines in the long division. For example, after the second subtraction, the polynomial is: $0x^7 + 0x^6 + 1x^5 + 0x^4$. The three most significant coefficients are: 001 and the value on (q2,q1,q0) after two steps of execution is also 001.
7.9.3
Signature Analyzer
Given a signature analyzer equation of $x^2 + x + 1$, find the expected value of the flops in the signature analyzer at the end of the test sequence. Also, design the hardware for the signature analyzer and result checker.
Answer:
[Figure: test generator LFSR with flops q0, q1, q2, inputs i_d(0), i_d(1), i_d(2), and control signals set and mode]
Connect test generator to circuit.
Expected sequence of values from circuit:

      q0  q1  q2  z
1)    1   1   1   1    x^6
2)    1   0   1   0    x^5
3)    1   0   0   0    x^4
4)    0   1   0   0    x^3
5)    0   0   1   0    x^2
6)    1   1   0   1    x^1
7)    0   1   1   1    x^0

Polynomial for output sequence of circuit under test: $x^6 + x + 1$
Remainder of result sequence divided by signature analyzer is values in flops of signature analyzer at end of test sequence.
\[ m(x) = p(x)\,q(x) + r(x) \]
m(x): message (output of circuit under test)
p(x): polynomial of signature analyzer
q(x): quotient
r(x): remainder
\[ m(x) = x^6 + x + 1, \qquad p(x) = x^2 + x + 1 \]
Format for division: the signature analyzer polynomial is the divisor and the output of the circuit under test is the dividend, with the quotient written above and the remainder at the bottom.
Carry out the division of $1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0$ by $1x^2 + 1x^1 + 1x^0$:
Quotient: $1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0$
Remainder: $1x^1 + 0x^0$
Check division:
\[ q(x)\,p(x) + r(x) = (x^4 + x^3 + x + 1)(x^2 + x + 1) + x = (x^6 + 1) + x = x^6 + x + 1 = m(x) \]
Division was done correctly. The final value on the two flops in the signature analyzer will be the remainder: $1x^1 + 0x^0$ = 10.
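The long division can be double-checked mechanically. The sketch below performs polynomial division over GF(2) with each polynomial packed into an integer (bit k is the coefficient of x^k); the name `gf2_divmod` is illustrative and not taken from the notes.

```python
def gf2_divmod(message, divisor):
    """Divide two GF(2) polynomials stored as integers.

    Subtraction over GF(2) is XOR, so each step XORs a shifted copy of
    the divisor into the running remainder. Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor polynomial is zero")
    quotient = 0
    remainder = message
    divisor_len = divisor.bit_length()
    while remainder.bit_length() >= divisor_len:
        shift = remainder.bit_length() - divisor_len
        quotient |= 1 << shift
        remainder ^= divisor << shift
    return quotient, remainder


# Signature analyzer: (x^6 + x + 1) / (x^2 + x + 1)
q_sig, r_sig = gf2_divmod(0b1000011, 0b111)
# Test generator: (x^9 + x^8 + x^7) / (x^3 + x + 1)
q_gen, r_gen = gf2_divmod(0b1110000000, 0b1011)
print(bin(q_sig), bin(r_sig), bin(q_gen), bin(r_gen))
```

The first call reproduces the quotient x^4 + x^3 + x + 1 and remainder 10; the second reproduces the test generator division of the message polynomial by x^3 + x + 1.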
NOTE: When looking at the remainder (signature), we look at the outputs of the flops, representing the flop nearest the input as $x^0$.
Using hardware:
[Figure: signature analyzer and timing diagram, with signals clk, reset, i, d0, q0, d1, q1 and the quotient and remainder marked]
The quotients and the remainder calculated using long division match the ones that were calculated using the circuit. The values on the flops in the signature analyzer match, cycle by cycle, the two most significant coefficients on the intermediate remainders calculated during long division. The intermediate remainders are the polynomials below the subtraction lines. (When looking at the circuit, remember that for an LFSR with n flops, it takes n clock cycles for the circuit to become primed with the input sequence and match the long-division arithmetic.) The ok circuit for this signature analyzer is just a 2-input AND gate, because the remainder is 11.
[Figure: signature analyzer with ok circuit, with signals reset, i, d0, q0, d1, q1, and ok]
The result checker should check the ok signal one cycle after the last test vector. The last test vector in the sequence is 110. We can either look for 110 and delay by one clock cycle, or we can look for the first test vector (111) in the second iteration of the sequence. To make sure that we are looking at the second iteration of the sequence, and not the first, we look at reset.
[Figure: max-length LFSR (q0, q1, q2), circuit under test (z), signature analyzer (ok), and result checker output all_ok]
7.9.4
Answer:
We have a sequence of 7 bits coming from the circuit under test. This gives us $2^7 = 128$ possible sequences. Of these, 1 is the good sequence and 127 are faulty sequences. The signature analyzer stores 2 bits of data, which gives us 4 possible values. Thus, on average $128 / 4 = 32$ different result sequences will map to the same 2-bit signature. Of these 32 vectors, 1 is the good sequence and 31 are faulty sequences. Assume that each result sequence is equally likely to occur. (NOTE: this is a poor assumption; a full analysis would make each stuck-at fault equally likely, then compute the result vector for each fault.) With this assumption, there is a $31 / 127 \approx 24\%$ chance that a faulty sequence will result in the same signature as the good sequence. There is approximately a 24% chance that a faulty circuit will not be detected.
7.9.5
If we increase the size of the signature analyzer by one flip flop, by how much do we change the approximate probability of a fault not being detected?
Answer:
A signature analyzer with 3 bits of data gives us 8 possible values. Thus, on average $128 / 8 = 16$ different result sequences will map to the same 3-bit signature. Assuming that each result sequence is equally likely to occur, there is a $15 / 127 \approx 11.8\%$ chance that a faulty sequence will result in the same signature as the good sequence. There is approximately a 12% chance that a faulty circuit will not be detected. Thus, we have decreased the probability of a faulty circuit not being detected from 24% to 12%.
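The two escape probabilities follow the same counting argument, which the sketch below generalizes. It keeps the simplifying assumption that every result sequence is equally likely; the function name `aliasing_probability` is illustrative.

```python
from fractions import Fraction


def aliasing_probability(sequence_bits, signature_bits):
    """Chance that a faulty sequence shares the good sequence's signature.

    Assumption: all 2**sequence_bits result sequences are equally likely
    and they spread evenly over the 2**signature_bits signatures.
    """
    total = 2 ** sequence_bits
    per_signature = total // 2 ** signature_bits
    # Exclude the single good sequence from both counts.
    return Fraction(per_signature - 1, total - 1)


p2 = aliasing_probability(7, 2)
p3 = aliasing_probability(7, 3)
print(p2, f"{float(p2):.1%}", p3, f"{float(p3):.1%}")
```

Each extra flop in the signature analyzer roughly halves the escape probability under this model.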
7.9.6
Answer:
b.
[Long division of the faulty output sequence by the signature analyzer polynomial $1x^2 + 1x^1 + 1x^0$]
This remainder is the same as the remainder for the correct circuit, thus the fault will not be detected! In hardware:
[Figure: timing diagram for the signature analyzer, with signals clk, i, d0, q0, d1, q1 and the quotient and remainder marked]
7.9.7
Answer: For a maximal-length LFSR of n bits, it takes $2^n - 1$ clock cycles to generate the $2^n - 1$ test vectors, plus one cycle at the end to flop the results. This gives a total of $2^n$ clock cycles, which in our case is 8.
7.10
You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing has dictated is needed. What can you do to reduce the power consumption of the chip without negatively affecting performance or incurring significant design effort?
Answer: When in test mode, run the clock at a lower frequency so that the chip will consume less power. Add clock gating to the signature analyzer so that it is turned off when the chip is in normal mode.
7.11
[Figure: circuit under test, with labelled lines L1 to L15]
1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.
Answer:
[Figure: Karnaugh map (axis labels a, c)]
None of the minterms are completely covered by other minterms, so the circuit is irredundant and does not have undetectable faults. The two minterms ac and ab overlap, but neither is completely covered by other minterms. So, if one of them was stuck at 0, there would be at least one set of input values that would cause the faulty circuit to differ from the correct circuit.
2. Does the circuit have any static timing hazards?
Answer: Moving from abc to abc moves between minterms. Thus, there is a potential timing hazard.
[Figure: Karnaugh map (axis labels a, c) marking the potential glitch (static hazard)]
3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify any untestable single-stuck-at faults in the resulting circuit.
Answer:
[Figure: Karnaugh map (axis labels a, c) and the modified circuit, with the faults L9@0, L16@0, L13@0, and L19@0 marked]
The minterms ab and bc are both completely covered by other minterms. Thus, these minterms are redundant and are sources of undetectable faults. This gives us L13@0 and L19@0 as undetectable single stuck-at faults. Using gate collapsing, we see that the following faults are equivalent to L13@0: L9@0, L16@0. And the following are equivalent to L19@0: L17@0, L18@0. NOTE: although both L16@0 and L17@0 are undetectable, this does not mean that L2@0 is undetectable. L2@0 is equivalent to having both L16@0 and L17@0 at the same time. Check the Boolean equations if you are in doubt about this.
7.12
7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
If not, explain why. If so, describe such a fault.
Answer: Yes.
A fault that is only detectable with 000 will be detectable by scan testing but not by built-in self test. A fault that results in the same signature as the correct circuit will be detectable by scan testing but not by built-in self test.
7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
If not, explain why. If so, describe such a fault.
Answer: No. Any fault that is detectable by built-in self testing can be detected by scan testing, where the test vector that we scan in is the BIST test vector that triggers the fault. If scan testing is interpreted as boundary scan testing and built-in self test is allowed inside a chip, then there are faults that are detectable by built-in self test but not by boundary scan testing. These faults would be inside redundant sequential circuitry. But, this scenario was not intended to be part of this question.
7.13
Fault Testing
In this question, you will design and analyze built-in self test circuitry for the circuit-under-test shown below.
7.13.1
Answer:

cycle  d0  q0  d1  q1  value
1      1   1   0   1   3
2      0   1   1   0   1
3      1   0   1   1   2
4          1       1   3
7.13.2
Answer:
1.
7.13.3
Answer:
2. Simulating correct output sequence 011 through signature analyzer:

i    0  1  1
d0   0  1  0
q0   0  0  1  0

3. Equation for faulty circuit-under-test is ab.

a  b  output
1  1  1
0  1  0
0  1  0

4. Simulating faulty output sequence 100 through signature analyzer:

i    1  0  0
d0   1  1  1
q0   0  1  1  1

5. Output of signature analyzer is different from correct circuit, so the fault will be detected.
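The two signature analyzer traces can be reproduced with a one-flop model. The update rule d0 = i XOR q0 is an assumption inferred from the d0 and q0 values in the traces, not taken from a schematic, and the name `signature_trace` is hypothetical.

```python
def signature_trace(bits):
    """Return the q0 values of a one-flop signature analyzer.

    Assumption: the flop resets to 0 and loads d0 = i XOR q0 each cycle.
    The returned list starts with the reset value.
    """
    q0 = 0
    trace = [q0]
    for i in bits:
        q0 = i ^ q0   # d0 for this cycle becomes q0 on the next
        trace.append(q0)
    return trace


good = signature_trace([0, 1, 1])    # correct output sequence 011
faulty = signature_trace([1, 0, 0])  # faulty output sequence 100
print(good, faulty)
```

The final q0 values differ (0 for the correct sequence, 1 for the faulty one), which is why this particular fault is caught.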
7.13.4
Testing time
Answer:
1. reset circuit
2. run first of three test vectors
3. run second of three test vectors
4. run third of three test vectors
5. flop result from circuit under test into signature analyzer
5 clock cycles.