Good VLSI Design Test Power Tutorial

E&CE 427: Digital System Engineering
Mark Aagaard University of Waterloo Dept of Electrical and Computer Engineering 2003t1Winter March 24, 2003
Contents
I Lecture Notes
1 VHDL
LEC-02: Introduction to VHDL
  1.1 Prelude
    1.1.1 Topics in this Chapter
    1.1.2 Background Material
    1.1.3 Recommended Reading
  1.2 Introduction to VHDL
    1.2.1 VHDL Origins and History
    1.2.2 Semantics
    1.2.3 Synthesis of a Simulation-Based Language
    1.2.4 Solution to Synthesis Sanity
    1.2.5 VHDL Disadvantages
    1.2.6 VHDL Advantages
    1.2.7 VHDL and Other Languages
      1.2.7.1 VHDL vs Verilog
      1.2.7.2 VHDL vs SystemC
      1.2.7.3 VHDL vs Other Hardware Description Languages
      1.2.7.4 Summary of VHDL Evaluation
  1.3 Overview of Syntax
    1.3.1 Syntactic Categories
    1.3.2 Library Units
    1.3.3 Entities and Architecture
    1.3.4 Concurrent Statements
    1.3.5 Component Declaration and Instantiations
    1.3.6 Processes
    1.3.7 Sequential Statements
  1.4 Concurrent vs Sequential Statements
    1.4.1 Concurrent Assignment vs Process
    1.4.2 Conditional Assignment vs If Statements
    1.4.3 Selected Assignment vs Case Statement
    1.4.4 Coding Style
  1.5 Overview of Processes
    1.5.1 Combinational Process vs Clocked Process
    1.5.2 Latch Inference
    1.5.3 Combinational vs Flopped Signals
LEC-03: Details of Process Execution
  1.6 Details of Process Execution
    1.6.1 Definitions and Algorithm
      1.6.1.1 Temporal Granularities of Simulation
      1.6.1.2 Process Modes
      1.6.1.3 Simulation Algorithm
      1.6.1.4 Delta-Cycle Definitions
    1.6.2 Example: Process Execution
    1.6.3 Example: Need for Provisional Assignments
LEC-04: Hardware Building Blocks
  1.7 VHDL and Hardware Building Blocks
    1.7.1 Basic Building Blocks
    1.7.2 Deprecated Building Blocks for RTL
    1.7.3 Hardware and Code for Flops
      1.7.3.1 Flip-Flops vs Latches
      1.7.3.2 Flops with Waits and Ifs
      1.7.3.3 Flops with Synchronous Reset
      1.7.3.4 Flops with Chip-Enable
      1.7.3.5 Flops with Chip-Enable and Mux on Input
      1.7.3.6 Flops with Chip-Enable, Muxes, and Reset
    1.7.4 An Example Sequential Circuit
  1.8 Synthesizable vs Non-Synthesizable Code
    1.8.1 Unsynthesizable Code
      1.8.1.1 Initial Values
      1.8.1.2 Wait For
      1.8.1.3 Different Wait Conditions
      1.8.1.4 Multiple if rising_edge in Same Process
      1.8.1.5 if rising_edge and wait in Same Process
      1.8.1.6 if rising_edge with else Clause
      1.8.1.7 if rising_edge Inside a for Loop
      1.8.1.8 wait Inside of a for Loop
    1.8.2 Synthesizable, but Undesirable Hardware
      1.8.2.1 Asynchronous Reset
      1.8.2.2 Bad Form of Nested Ifs
      1.8.2.3 Deeply Nested Ifs
  1.9 Numbers, Arithmetic, Arrays, and Signals
    1.9.1 Arithmetic Packages
    1.9.2 Shift and Rotate Operations
    1.9.3 Overloading of Arithmetic
    1.9.4 Different Widths and Arithmetic
    1.9.5 Overloading of Comparisons
    1.9.6 Different Widths and Comparisons
    1.9.7 Type Conversion
2 RTL Design with VHDL
  2.1 Prelude to Chapter
    2.1.1 Topics in this Chapter
LEC-05: Dataflow Diagrams
  2.2 Design Flow
    2.2.1 Generic Design Flow
      Additional material in notes
    2.2.2 Implementation Flows
    2.2.3 Classes of Hardware
    2.2.4 Design Flow: Datapath vs Control vs Storage
      2.2.4.1 Datapath-Centric Design Flow
      2.2.4.2 Control-Centric Design Flow
      2.2.4.3 Storage-Centric Design Flow
  2.3 Dataflow Diagrams and High-Level Models
    2.3.1 Overview of Example
      2.3.1.1 Software vs Hardware Algorithms
      2.3.1.2 Serial vs Parallel
    2.3.2 Dataflow Diagrams
      2.3.2.1 Dataflow Diagrams Overview
      2.3.2.2 Area Estimation
    2.3.3 Dataflow Diagram Execution
      2.3.3.1 Performance Estimation
      2.3.3.2 Design Analysis
    2.3.4 Area / Performance Tradeoffs
    2.3.5 Optimize Inputs and Outputs
    2.3.6 From Dataflow Diagram to High-Level Model
    2.3.7 From Dataflow Diagram to DP+Ctrl Model
      2.3.7.1 Datapath for DP+Ctrl Model
    2.3.8 Dataflow Diagram Scheduling
    2.3.9 Summary: From Dataflow to Hardware
LEC-06: State Machine Design
  2.4 Finite State Machines in VHDL
    2.4.1 Mealy vs Moore State Machines
    2.4.2 State Machines and VHDL
      2.4.2.1 Implicit and Explicit State Machines
    2.4.3 Some Simple State Machines
      2.4.3.1 Implementing a Simple Moore Machine
      2.4.3.2 Implementing a Simple Mealy Machine
    2.4.4 State Encoding
      2.4.4.1 Constants vs Enumerated Type
      2.4.4.2 Encoding Schemes
    2.4.5 From Dataflow to State Machine
    2.4.6 Implicit vs Explicit State Machines
    2.4.7 Implicit State Machines
      2.4.7.1 Multi-Wait Process
      2.4.7.2 Counter
    2.4.8 Explicit State Machines
      2.4.8.1 State Machine
      2.4.8.2 Conditional Assignment
      2.4.8.3 Conditional Assignment with Don't Care
      2.4.8.4 Selected Assignment with Don't Care
      2.4.8.5 Case Statement
    2.4.9 Reset
    2.4.10 Input / Output Protocols
LEC-07: Memory Design
  2.5 Memory Arrays and RTL Design
    2.5.1 Memory Arrays and Dataflow Diagrams
      2.5.1.1 Legend for Dataflow Diagrams
      2.5.1.2 Basic Memory Operations
      2.5.1.3 Data Dependencies
      2.5.1.4 Definition of Three Types of Dependencies
      2.5.1.5 Dataflow Diagrams and Data Dependencies
      2.5.1.6 Example: Memory Array and Dataflow Diagram
    2.5.2 Memory Arrays in VHDL
      2.5.2.1 Two-Dimensional Array
      2.5.2.2 Memory Array in Hardware
      2.5.2.3 Example VHDL Code for Memory Array in Hardware
      2.5.2.4 Library Component
      2.5.2.5 Build Memory from Slices
      2.5.2.6 Dual-Ported Memory
LEC-08: Design Example: Stack
  2.6 Design Example: Stack
    2.6.1 Stack Requirements
      2.6.1.1 Stack Entity
      2.6.1.2 Stack Instructions
      2.6.1.3 Stack Instruction Encoding
      2.6.1.4 Miscellaneous Requirements
    2.6.2 Stack Algorithm
    2.6.3 Stack Dataflow Diagrams
      2.6.3.1 Initial Diagrams
      2.6.3.2 Partition into Clock Cycles
      2.6.3.3 High-Level Model
      2.6.3.4 Individual Block Diagrams
      2.6.3.5 Complete Block Diagram
    2.6.4 Stack: Register Transfer Level
      2.6.4.1 Stack: Separate Control, Datapath and Storage
      2.6.4.2 Stack: Datapath Operations
      2.6.4.3 Stack: Explicit State Machine
LEC-09: Guidelines and Optimization Techniques
  2.7 RTL Coding Guidelines
    2.7.1 Design Process
    2.7.2 Signal Declarations
    2.7.3 Processes
    2.7.4 Flip-Flops and Latches
      2.7.4.1 Multiplexors and Tri-State Signals
    2.7.5 State Machines
      2.7.5.1 Reset
    2.7.6 Inputs and Outputs
  2.8 Additional VHDL Features
    2.8.1 Vectors
    2.8.2 Still More VHDL Features
  2.9 General Optimization Techniques
    2.9.1 Strength Reduction
      2.9.1.1 Arithmetic Strength Reduction
      2.9.1.2 Boolean Strength Reduction
    2.9.2 Replication and Sharing
      2.9.2.1 Mux-Pushing
      2.9.2.2 Common Subexpression Elimination
      2.9.2.3 Computation Replication
    2.9.3 Arithmetic
    2.9.4 Pipelining
LEC-10: FPGA-Specific Guidelines and Optimization
  2.10 FPGA-Specific Guidelines
    2.10.1 Generic FPGAs
      2.10.1.1 Overview of Generic FPGA Hardware
      2.10.1.2 Generic Clocks
      2.10.1.3 Special Circuitry in FPGAs
    2.10.2 Altera APEX20K
  2.11 Example Circuits
    2.11.1 Ripple-Carry Adder
    2.11.2 Barrel Shifter
3 Functional Validation
LEC-11: Functional Validation of Datapath Circuits
  3.1 Overview
    3.1.1 Validation / Verification / Testing
    3.1.2 Why Your First Circuit Will Not Work
  3.2 Test Cases
    3.2.1 Coverage
    3.2.2 Heating System Example
      3.2.2.1 Number of Cases to Consider
      3.2.2.2 Representation Simplification
    3.2.3 Floating Point Divider Example
    3.2.4 Functional Validation Challenges
  3.3 Testbenches
    3.3.1 Overview of Test Benches
    3.3.2 Reference Model Style Testbench
    3.3.3 Relational Style Testbench
    3.3.4 Coding Structure of a Testbench
    3.3.5 Datapath vs Control
  3.4 Functional Validation for Datapath Circuits
    3.4.1 A Spec-Less Testbench
    3.4.2 Use an Array for Test Vectors
    3.4.3 Build Spec into Stimulus
    3.4.4 Have Separate Specification Entity
    3.4.5 Generate Test Vectors
    3.4.6 Relational Specification
LEC-12: Functional Validation of State Machines
  3.5 Functional Validation of Control Circuits
    3.5.1 Overview of Queues in Hardware
    3.5.2 VHDL Coding
      3.5.2.1 Package
      3.5.2.2 Other VHDL Coding
    3.5.3 Code Structure for Validation
    3.5.4 Instrumentation Code
    3.5.5 Coverage Monitors
    3.5.6 Assertions
    3.5.7 VHDL Coding Tips
    3.5.8 Queue Specification
    3.5.9 Queue Testbench
4 Performance Analysis and Optimization
  4.1 Intro
    4.1.1 Concepts
    4.1.2 Background Material
    4.1.3 Reading Material
LEC-13: Introduction to Performance Analysis
  4.2 Defining Performance
  4.3 Comparing Performance
    4.3.1 Performance for Different Tasks
    4.3.2 Optimizing Performance
  4.4 Clock Speed, CPI, Program Length, and Performance
    4.4.1 Mathematics
    4.4.2 Example: CISC vs RISC and CPI
    4.4.3 Summary of Equations
LEC-14: Performance and Dataflow Diagrams
  4.5 Performance Analysis and Dataflow Diagrams
    4.5.1 Dataflow Diagrams, CPI, and Clock Speed
      4.5.1.1 Tradeoffs
    4.5.2 Dataflow Diagram with Two Instructions
      4.5.2.1 Scheduling of Operations for Different Clock Periods
      4.5.2.2 Performance Computation for Different Clock Periods
      4.5.2.3 Example: Two Instructions Taking Similar Time
      4.5.2.4 Example: Same Total Time, Different Order for A
    4.5.3 Example: From Algorithm to Optimized Dataflow
    4.5.4 Optimality: Performance vs Area Tradeoffs
    4.5.5 Effect of Instruction Set on Performance
    4.5.6 Effect of Time to Market on Relative Performance
5 Timing Analysis
  5.1 Preliminaries
    5.1.1 Overview
    5.1.2 Background Material
    5.1.3 Reading Material
LEC-15: Introduction to Timing Analysis
  5.2 Delays and Definitions
    5.2.1 Related Background Definitions
    5.2.2 Timing Constraints
      5.2.2.1 Minimum Clock Period
      5.2.2.2 Hold Constraint
    5.2.3 Clock-Related Timing Definitions
      5.2.3.1 Clock Skew (Smith 6.5.1)
      5.2.3.2 Clock Latency (Smith 6.5.1)
      5.2.3.3 Clock Jitter (Smith pp873)
    5.2.4 Storage Related Timing Definitions (Smith 2.5.2)
      5.2.4.1 Setup Time
      5.2.4.2 Hold Time
      5.2.4.3 Clock-to-Q Time
      5.2.4.4 Example Timing Violations
    5.2.5 Propagation Delays
      5.2.5.1 Load Delays (Smith 3.1)
      5.2.5.2 Interconnect Delays (Smith 7.1)
  5.3 Critical Paths: False and True
    5.3.1 Critical Path Example
    5.3.2 Algorithm to Find Critical Path
      5.3.2.1 Critical Path Between Two Signals
      5.3.2.2 Critical Path Between Sets of Signals
    5.3.3 False Paths
      5.3.3.1 Static False Path Example
      5.3.3.2 Dynamic False Path Example
        CHANGE ver2 (2002/12/02): corrected edge polarity on a
      5.3.3.3 Another Dynamic False Path Example
      5.3.3.4 And Another Dynamic False Path Example
      5.3.3.5 Algorithm for False Path Detection
    5.3.4 Increasing the Accuracy of Critical Path Analysis
LEC-16: Math, Physics, and Applications of Timing Analysis
  5.4 Analog Effects in Timing Analysis
    5.4.1 Timing Model (Smith 3.1, 13.6)
      5.4.1.1 Equation for Output Voltage
      5.4.1.2 Extrinsic / Intrinsic Delays (Smith 13.6)
    5.4.2 Data-Dependent Delay
    5.4.3 Interconnect Delay (Smith 7.1)
      5.4.3.1 Elmore Time Constant (Smith 7.1.2)
      5.4.3.2 Interconnect with Single Fanout
      5.4.3.3 Interconnect with Multiple Gates in Fanout
      5.4.3.4 FPGAs, Interconnect, and Synthesis
  5.5 Practical Usage of Timing Analysis
    5.5.1 Speed Binning (Smith 5.1.6)
    5.5.2 Worst Case Timing (Smith 5.1.7)
      5.5.2.1 Fanout delay
      5.5.2.2 Derating Factors
LEC-17: Timing Analysis (Latches and Flip Flops)
  5.6 Timing Analysis of Latches and Flip Flops
    5.6.1 Simple Latch
    5.6.2 Clock-to-Q Time of a Simple Latch
    5.6.3 Setup Timing of a Simple Latch
      5.6.3.1 Hold Time of a Simple Latch
      5.6.3.2 Example of a Bad Latch
    5.6.4 Timing Analysis of a Transmission Gate Latch
      5.6.4.1 Transmission Gate (Smith 2.4.3)
      5.6.4.2 Transmission Gate Latch (Smith 2.5.1)
      5.6.4.3 Clock-to-Q Delay for Latch
      5.6.4.4 Setup and Hold Times for Latch
    5.6.5 Falling Edge Flip Flop (Smith 2.5.2)
      5.6.5.1 Behaviour of Flip-Flop
      5.6.5.2 Clock-to-Q of Flip-Flop
      5.6.5.3 Setup of Flip-Flop
      5.6.5.4 Hold of Flip-Flop
    5.6.6 Timing Analysis of FPGA Cells (Smith 5.1.5)
      5.6.6.1 Standard Timing Equations
      5.6.6.2 Hierarchical Timing Equations
      5.6.6.3 Actel Act 2 Logic Cell
      5.6.6.4 Timing Analysis of Actel Sequential Module
    5.6.7 Exotic Flop
6 Power Analysis and Design
LEC-18: Introduction to Power
  6.1 Overview
    6.1.1 Importance of Power and Energy
    6.1.2 Industrial Names and Products
    6.1.3 Power vs Energy
    6.1.4 Batteries, Power and Energy
      6.1.4.1 Do Batteries Store Energy or Power?
      6.1.4.2 Battery Life and Efficiency
    6.1.5 Example Problem: Battery Life and Power
  6.2 Power Equations
    6.2.1 Dynamic Power and Activity Factor
    6.2.2 Switching Power
    6.2.3 Short-Circuited Power
    6.2.4 Leakage Power
    6.2.5 Glossary
    6.2.6 Note on Power Equations
LEC-19: Data Encoding for Power Reduction
  6.3 Overview of Power Reduction Techniques
  6.4 Voltage Reduction for Power Reduction
  6.5 Data Encoding for Power Reduction
    6.5.1 How Data Encoding Can Reduce Power
    6.5.2 Example Problem
      6.5.2.1 Problem Statement
      6.5.2.2 Additional Information
      6.5.2.3 Answer
LEC-20: Clock Gating for Power Reduction
  6.6 Clock Gating
    6.6.1 Introduction to and Overview of Clock Gating
      6.6.1.1 Examples of Clock Gating
      6.6.1.2 Design Tradeoffs
      6.6.1.3 Functional Validation and Clock Gating
    6.6.2 Implementing Clock Gating
      6.6.2.1 Simple Power Analysis
      6.6.2.2 Valid-Bit Protocol
      6.6.2.3 Clock Gating and Big Circuit
      6.6.2.4 Designing Clock Gating Circuitry
    6.6.3 Design Problem
      6.6.3.1 Solution Sketch
7 Fault Testing and Testability
  7.1 Introduction
    7.1.1 Purpose and List of Concepts
    7.1.2 Background Material
    7.1.3 Reading Material
LEC-21: Introduction to Faults, Testing, and Testability
  7.2 Faults and Testing
    7.2.1 Overview of Faults and Testing
      7.2.1.1 Faults (Smith 14.3)
      7.2.1.2 Causes of Faults (Smith 14.3)
      7.2.1.3 Testing (Smith 14)
      7.2.1.4 Burn In (Smith 14.3.1)
      7.2.1.5 Bin Sorting (Smith 5.1.6)
      7.2.1.6 Testing Techniques (Smith 14)
      7.2.1.7 Design for Testability (DFT) (Smith 14.6)
    7.2.2 Example Problem: Economics of Testing (Smith 14.1)
    7.2.3 Physical Faults (Smith 14.3.3)
      7.2.3.1 Types of Physical Faults
      7.2.3.2 Locations of Faults
      7.2.3.3 Layout Affects Locations
      7.2.3.4 Naming Fault Locations
    7.2.4 Detecting a Fault
      7.2.4.1 Which Test Vectors will Detect a Fault?
      7.2.4.2 A Single Test-Vector Can Detect Several Faults
    7.2.5 Mathematical Models of Faults (Smith 14.3.4)
      7.2.5.1 Single Stuck-At Fault Model
    7.2.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
      7.2.6.1 Algorithm
      7.2.6.2 Example of Finding a Test Vector
    7.2.7 Undetectable Faults
      7.2.7.1 Redundant Circuitry
      7.2.7.2 Curious Redundant Circuitry and Fault Detection
LEC-22: Fault Detection and Test-Vector Generation
  7.3 Faults
    7.3.1 Locations of Faults
    7.3.2 Choosing Test Vectors (Smith 14.3.7)
      7.3.2.1 Fault Domination
      7.3.2.2 Fault Equivalence
      7.3.2.3 Gate Collapsing
      7.3.2.4 Node Collapsing
      7.3.2.5 Fault Collapsing Summary
    7.3.3 Fault Coverage
    7.3.4 Generate Test Vectors for 100% Coverage
      7.3.4.1 Collapse the Faults
      7.3.4.2 Check for Fault Domination
      7.3.4.3 Required Test Vectors
      7.3.4.4 Faults Not Covered by Required Test Vectors
      7.3.4.5 Order to Run Test Vectors
      7.3.4.6 Summary of Technique to Find and Order Test Vectors
      7.3.4.7 Complete Analysis
    7.3.5 One Fault Hiding Another
LEC-23: Built In Self Test
  7.4 Built In Self Test (Smith 14.7)
    7.4.1 Block Diagram
      7.4.1.1 Components
      7.4.1.2 Linear Feedback Shift Register (LFSR)
      7.4.1.3 Maximal-Length LFSR
    7.4.2 Test Generator
    7.4.3 Signature Analyzer
    7.4.4 Result Checker
    7.4.5 Arithmetic over Binary Fields
    7.4.6 Shift Registers and Characteristic Polynomials (Smith 14.7.5)
      7.4.6.1 Circuit Multiplication
    7.4.7 Bit Streams and Characteristic Polynomials
    7.4.8 Division
    7.4.9 Signature Analysis: Math and Circuits
    7.4.10 Summary
LEC-24: Scan Testing (JTAG)
  7.5 Scan Testing in General (Smith 14.6)
    7.5.1 Structure and Behaviour of Scan Testing
    7.5.2 Scan Chains
      7.5.2.1 Circuitry in Normal Mode
      7.5.2.2 Scan in Operation
      7.5.2.3 Scan in Operation with Example Circuit
    7.5.3 Summary of Scan Testing
    7.5.4 Example: Time to Test a Chip
  7.6 Boundary Scan
    7.6.1 Boundary Scan History
    7.6.2 Scan Pins
    7.6.3 Scan Registers and Cells
    7.6.4 Scan Instructions
    7.6.5 TAP Controller
    7.6.6 Other descriptions of JTAG/IEEE 1149.1
  7.7 Summary and Conclusions on Testing
    7.7.1 Faults
    7.7.2 Testing
      7.7.2.1 Scan Testing
      7.7.2.2 Built-In Self Test (BIST)
    7.7.3 Scan vs Self Test
8 Review
LEC-25: Review
  8.1 Overview of the Term
  8.2 VHDL
  8.3 Design and Optimization Techniques
  8.4 Validation
  8.5 Performance Prediction and Analysis
  8.6 Timing Analysis
  8.7 Power
  8.8 Testing
  8.9 Formulas to be Given on Final Exam
II Solutions to Tutorial Notes

1 VHDL Problems
SOL-01: VHDL Syntax
  1.1 IEEE 1164
  1.2 Flops, Latches, and Combinational Circuitry
  1.3 Counting Clock Cycles
  1.4 Arithmetic Overflow
  1.5 8-Bit Register
    1.5.1 Asynchronous Reset
    1.5.2 Discussion
    1.5.3 Testbench for Register
  1.6 VHDL Syntax
SOL-02: VHDL Semantics
  1.7 Clock-Cycle Simulation
  1.8 Delta-Cycle Simulation: Pong
  1.9 Delta-Cycle Simulation: Femur
  1.10 VHDL VHDL Behavioural Comparison: Teradactyl
  1.11 VHDL VHDL Behavioural Comparison: Ichtyostega
  1.12 Waveform VHDL Behavioural Comparison
  1.13 Hardware VHDL Comparison
  1.14 Synthesizable VHDL and Hardware
  1.15 Datapath Design
    1.15.1 Correct Implementation?
    1.15.2 Smallest Area
    1.15.3 Shortest Clock Period
1
3 1 2 3 6 9 10 12 13 14 16 1 2 5 7 10 12 15 18 20 22 23 31 33
CONTENTS
2 Design Problems
  SOL-03: Datapath and Control Design
  2.1 Synthesis
    2.1.1 Data Structures
    2.1.2 Own Code vs Libraries
  2.2 Design Guidelines
  2.3 Dataflow Diagram Optimization
    2.3.1 Resource Usage
    2.3.2 Optimization
  2.4 Dataflow Diagram Design
    2.4.1 Maximum performance
    2.4.2 Minimum area
  2.5 Design and Optimization
  SOL-04: Memory Design
  2.6 Dataflow Diagrams with Memory Arrays
    2.6.1 Algorithm 1
    2.6.2 Algorithm 2
  SOL-05: Optimization and FPGA Implementation
  2.7 2-bit adder
    2.7.1 Generic Gates
    2.7.2 Xilinx FPGA
  2.8 Sketches of Problems
3 Functional Validation Problems
  SOL-06: Functional Validation
  3.1 Functional Validation Problems
    3.1.1 Carry Save Adder
    3.1.2 Traffic Light Controller
    3.1.3 State Machines and Validation
    3.1.4 Additional Problem
    3.1.5 Test Plan Creation
      3.1.5.1 Early Tests
      3.1.5.2 Corner Cases
4 Performance Analysis and Optimization Problems
  SOL-07: Performance Analysis and Optimization
  4.1 Farmer
  4.2 Network and Router
    4.2.1 Maximum Throughput
    4.2.2 Packet Size and Performance
  4.3 Performance Short Answer
  4.4 Microprocessors
    4.4.1 Average CPI
    4.4.2 Why not you too?
    4.4.3 Analysis
  4.5 Dataflow Diagram Optimization
  4.6 Optimization with Memory Arrays
  4.7 Multiply Instruction
    4.7.1 Highest Performance
    4.7.2 Optimality
    4.7.3 Performance Metrics
5 Timing Analysis Problems
  SOL-08: Timing Analysis
  5.1 Terminology
  5.2 Critical Path and False Path
  5.3 Critical Path
    5.3.1 Ignoring potential false paths, list the signals in the critical path through this circuit.
    5.3.2 What is the combinational delay through the critical path?
    5.3.3 Missing Factors
    5.3.4 False Path?
  5.4 Timing Models
  5.5 Worst Case Conditions and Derating Factor
    5.5.1 Worst-Case Commercial
    5.5.2 Worst-Case Industrial
    5.5.3 Worst-Case Industrial, Non-Ambient Junction Temperature
  SOL-09: Timing Analysis (II)
  5.6 Short Answer
    5.6.1 Wires in FPGAs
    5.6.2 Age and Time
    5.6.3 Temperature and Delay
  5.7 Hold Time Violations
    5.7.1 Cause
    5.7.2 Behaviour
    5.7.3 Rectification
  5.8 Latch Analysis
  5.9 Combinational Timing (Smith 13.23)
6 Power Problems
  SOL-10: Power Analysis and Reduction
  6.1 Power Analysis and Reduction Problems
    6.1.1 Short Answers
      6.1.1.1 Power and Temperature
      6.1.1.2 Leakage Power
      6.1.1.3 Clock Gating
      6.1.1.4 Gray Coding
    6.1.2 VLSI Gurus
      6.1.2.1 Effect on Power
      6.1.2.2 Critique
    6.1.3 Advertising Ratios
  SOL-11: Power Analysis and Reduction
    6.1.4 Vary Supply Voltage
    6.1.5 Power Reality and Math (Smith prob 15.16)
    6.1.6 Clock Speed Increase Without Power Increase
      6.1.6.1 Supply Voltage
      6.1.6.2 Supply Voltage
    6.1.7 Power Reduction Strategies
      6.1.7.1 Supply Voltage
      6.1.7.2 Transistor Sizing
      6.1.7.3 Adding Registers to Inputs
      6.1.7.4 Gray Coding
    6.1.8 Power Consumption on New Chip
      6.1.8.1 Hypothesis
      6.1.8.2 Experiment
      6.1.8.3 Reality
7 Problems on Faults, Testing, and Testability
  SOL-12: Faults, Testing, and Testability
  7.1 Based on Smith q14.9: Testing Cost
  7.2 Testing Cost and Total Cost
  7.3 Minimum Number of Faults
  7.4 Smith q14.10: Fault Collapsing
  7.5 Mathematical Models and Reality
  7.6 Undetectable Faults
  7.7 Test Vector Generation
    7.7.1 Choice of Test Vectors
    7.7.2 Number of Test Vectors
  7.8 Time to do a Scan Test
  7.9 BIST
    7.9.1 Characteristic Polynomials
    7.9.2 Test Generation
    7.9.3 Signature Analyzer
    7.9.4 Probability of Catching a Fault
    7.9.5 Probability of Catching a Fault
    7.9.6 Detecting a Specific Fault
    7.9.7 Time to Run Test
  7.10 Power and BIST
  7.11 Timing Hazards and Testability
  7.12 Testing Short Answer
    7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
    7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
  7.13 Fault Testing
    7.13.1 Design test generator
    7.13.2 Design signature analyzer
    7.13.3 Determine if a fault is detectable
    7.13.4 Testing time
Part I
Lecture Notes
Chapter 1
VHDL: The Language
LEC-02 Preliminaries
LEC-02: Introduction to VHDL

Lecture Notes Sections: 1.1–1.5.3
University of Waterloo Dept of Electrical and Computer Engineering E&CE 427 Digital Systems Engineering 2003t1Winter
Schedule
wk-01–05   VHDL Design and Optimization
  wk-01    Overview and VHDL
  wk-02    VHDL Semantics
  wk-03    Dataflow Diagrams; State Machines
  wk-04    Memory Design; Example Design
  wk-05    Optimizations
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08      Timing Analysis
wk-08–10   Power Analysis and Reduction
wk-11–12   Faults and Testing
wk-13      Review
Concepts
synthesis simulation entity architecture process concurrent statement sequential statement
port type direction signal combinational process clocked process latch inference
1.1 Prelude

1.1.1 Topics in this Chapter
VHDL syntax
VHDL semantics
synthesizing VHDL
1.1.2 Background Material
Smith Chapters 1 and 2
1.1.3 Recommended Reading
Links to many VHDL resources are on the E&CE 427 web pages under Documentation. In addition to Smith, two other books on VHDL are on reserve in the Davis Centre Library:
  Designer's Guide to VHDL, Peter J. Ashenden
  VHDL for Logic Synthesis, Andrew Rushton
Relevant chapters in Smith: 8 (Software), 10 (VHDL), 12 (Synthesis); Appendix A.
Suggested reading order in Smith:
First pass:
  Ch 8
  10.5 entities and architectures
  10.10 sequential statements
  10.13 concurrent statements
  10.14 execution
  12.2 synthesis
  12.6 VHDL logic synthesis
Second pass:
  10.11 operators
  10.12 arithmetic
  12.7 FSM synthesis
  12.8 Memory synthesis
Third pass:
  10.1–10.4 intro to VHDL
  10.6 packages and libraries
  10.8 type declarations
  10.9 other declarations
  10.15 configurations and specifications
  10.16 example: engine controller
  remainder of Ch 12
Reference material:
  Table 10.27: VHDL summary
  Table 10.28: VHDL definitions
  Appendix A: VHDL syntax
1.2 Introduction to VHDL

1.2.1 VHDL Origins and History
VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit
"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
  Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)
VHDL is a lot more than synthesis of digital hardware
VHDL History
Developed by the United States Department of Defense as part of the very high speed integrated circuit (VHSIC) program in the early 1980s.
The Department of Defense intended VHDL to be used for the documentation, simulation and verification of electronic systems.
Goals:
  improve design process over schematic entry
  standardize design descriptions amongst multiple vendors
  portable and extensible
VHDL History (Cont'd)
Inspired by the Ada programming language
  large: 97 keywords, 94 syntactic rules
  verbose (designed by committee)
  static type checking, overloading
  complicated syntax: parentheses are used for both expression grouping and array indexing
Example:
  a <= b * (3 + c);   -- integer
  a <= (3 + c);       -- 1-element array of integers
VHDL History (Cont'd)
Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993, 2000.
In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164 (IEEE Standard 1164-1993), was developed.
  std_logic_1164 defines 9 different values for signals (See Smith Section 10.6.2)
In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3-1997).
  numeric_std defines arithmetic over std_logic vectors and integers.
    NB: This is the package that you should use for arithmetic. Don't use std_logic_arith: it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools.
  numeric_bit defines arithmetic over bit vectors and integers.
    We won't use bit signals in this course, so you don't need to worry about this package.
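As a rough illustration of the numeric_std recommendation, the sketch below adds an integer constant to an 8-bit vector. The entity name add_three and its ports are hypothetical and are not part of the course material.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity for illustration: adds the constant 3 to an 8-bit input.
entity add_three is
  port (
    a   : in  std_logic_vector(7 downto 0);
    sum : out std_logic_vector(7 downto 0)
  );
end add_three;

architecture main of add_three is
begin
  -- numeric_std provides "+" for an unsigned vector and a natural number.
  -- The result has the same width as the vector, so it wraps modulo 256.
  sum <= std_logic_vector(unsigned(a) + 3);
end main;
```

The explicit conversions (unsigned, std_logic_vector) are the price of static type checking: the code states whether the bits are to be interpreted as an unsigned or a signed number.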
1.2.2 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulation of the statement c <= a AND b; with signals a, b, and c]
But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.
Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).
[Figure: synthesis of the statement c <= a AND b; into a gate with inputs a, b and output c]
Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.
CAD Tools
CAD Tools allow designers to automate lower-level design processes in implementing the desired functionality of a system. NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.
Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesis of the statement c <= a AND b; into a gate with inputs a, b and output c]
But the VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.
[Figure: the statement c <= a AND b; surrounded by several circuits, each with inputs a, b and output c]
1.2.3 Synthesis of a Simulation-Based Language
Not all of VHDL is synthesizable
  c <= a AND b; (synthesizable)
  c <= a AND b AFTER 2 ns; (NOT synthesizable)
    how do you build a circuit with exactly 2 ns of delay through an AND gate?
  more examples of non-synthesizable code, and more details, are in section 1.8
Different synthesis tools support different subsets of VHDL
Some tools generate erroneous hardware for some code
  behaviour of hardware differs from VHDL semantics
Some tools generate unpredictable hardware
There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don't yet conform to it. (Most vendors still don't have full support for the 1993 extensions to VHDL!) For more info, see http://www.vhdl.org/siwg/.
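The difference between the two assignments can be seen in a small simulation-only sketch. The entity name delay_demo, the signal names, and the stimulus times are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only entity: it has no ports.
entity delay_demo is
end delay_demo;

architecture main of delay_demo is
  signal a, b      : std_logic := '0';
  signal c_ideal   : std_logic;
  signal c_delayed : std_logic;
begin
  -- Stimulus with explicit times: fine for simulation, not synthesizable.
  a <= '1' after 10 ns;
  b <= '1' after 20 ns;

  -- Synthesizable: describes an AND gate and says nothing about timing.
  c_ideal <= a AND b;

  -- Simulation only: the AFTER clause models a 2 ns delay, which a
  -- synthesis tool has no way to guarantee in real hardware.
  c_delayed <= a AND b after 2 ns;
end main;
```

In a simulator, c_delayed follows c_ideal by 2 ns. A synthesis tool will typically either reject the AFTER clause or ignore it.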
1.2.4 Solution to Synthesis Sanity
Pick a high-quality synthesis tool and study its documentation thoroughly
Learn the idioms of the tool
  Different VHDL code with same behaviour can result in very different circuits
Be careful if you have to port VHDL code from one tool to another
KISS: Keep It Simple Stupid
  VHDL examples in lectures will illustrate reliable coding techniques for the Synopsys tools (and most other tools as well).
  Follow the coding guidelines and examples from lecture
As you write VHDL, think about the hardware you expect to get.
  NB: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc)
1.2.5 VHDL Disadvantages
Some VHDL programs cannot be synthesized
Different tools support different subsets of VHDL.
Different tools generate different circuits for same code
VHDL is verbose
  Many characters to say something simple
VHDL is complicated and confusing
  Many different ways of saying the same thing
  Constructs that have similar purpose have very different syntax (case vs. select)
  Constructs that have similar syntax have very different semantics (variables vs signals)
Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs combinational)
  The infamous latch inference problem (See section 1.5.2 for more information)
1.2.6 VHDL Advantages
VHDL supports unsynthesizable constructs that are useful in writing testbenches and other non-hardware artifacts that we need in hardware design.
VHDL can be used throughout a large portion of the design process in different capacities, from specification to implementation to verification.
VHDL has static typechecking: many errors can be caught before synthesis and/or simulation. (In this respect, it is more similar to Java than to C.)
VHDL has a rich collection of datatypes
VHDL is a full-featured language with a good module system (libraries and packages).
VHDL has a well-defined standard.
1.2.7 VHDL and Other Languages

1.2.7.1 VHDL vs Verilog
Verilog is a simpler language: smaller language, simple circuits are easier to write
VHDL has more features than Verilog
  richer set of data types and strong type checking
  VHDL offers more flexibility and expressivity for constructing large systems.
The VHDL Standard is more standard than the Verilog Standard
  VHDL and Verilog have simulation-based semantics
  Simulation vendors generally conform to VHDL standard
  Some Verilog constructs don't simulate the same in different tools
VHDL is used more than Verilog in Europe and Japan
Verilog is used more than VHDL in North America
South-East Asia, India, South America: ?????
1.2.7.2 VHDL vs SystemC
SystemC looks like C: familiar syntax
C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable code as well?
If you think VHDL is hard to synthesize, try C....
SystemC simulation is slower than advertised
1.2.7.3 VHDL vs Other Hardware Description Languages
Superlog: A new language (still under active development) that is based on Verilog and C. Basic core comes from Verilog. C-like extensions included to make language more expressive and powerful. Developed by the Co-Design company.
Esterel: A language evolving from academia to commercial viability. Very clean semantics. Aimed at state machines, limited support for datapath operations.
1.2.7.4 Summary of VHDL Evaluation
VHDL is far from perfect and has lots of annoying characteristics
VHDL is a better language for education than Verilog because the static typechecking enforces good software engineering practices
The richness of VHDL will be useful in creating concise high-level models and powerful testbenches
1.3 Overview of Syntax
This section is just a brief overview of the syntax of VHDL, focussing on the constructs that are most commonly used. Read a book on VHDL and use online resources. (Look for VHDL under the Documentation tab in the E&CE 427 web pages for more information.)
1.3.1 Syntactic Categories
There are five major categories of syntactic constructs. (There are many, many minor categories and subcategories of constructs.)
Library units (section 1.3.2)
  Top-level constructs (packages, entities, architectures)
Concurrent statements (section 1.3.4)
  Statements executed at the same time (in parallel)
Sequential statements (section 1.3.7)
  Statements executed in series (one after the other)
Expressions
  Arithmetic (section 1.9), Boolean, Vectors, etc
Declarations
  Components, signals, variables, types, functions, ....
1.3.2 Library Units
Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations and otherwise bind together VHDL code.
Package body
  define the contents of a library
Packages
  determine which parts of the library are externally visible
Use clause
  use a library in an entity/architecture or another package
  technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
Entity (section 1.3.3)
  define interface to circuit
Architecture (section 1.3.3)
  define internal signals and gates of circuit
See Smith Section 10.6 for information on packages and use clauses.
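A minimal sketch of how these library units fit together is shown here. The names demo_pkg, word, invert, and inverter are hypothetical and chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Package: the externally visible declarations.
package demo_pkg is
  constant WIDTH : natural := 8;
  subtype word is std_logic_vector(WIDTH - 1 downto 0);
  function invert (w : word) return word;
end demo_pkg;

-- Package body: the implementation behind the declarations.
package body demo_pkg is
  function invert (w : word) return word is
  begin
    return NOT w;
  end invert;
end demo_pkg;

library ieee;
use ieee.std_logic_1164.all;
use work.demo_pkg.all;   -- use clause: precedes the entity keyword

-- Entity: the interface to the circuit.
entity inverter is
  port (
    d : in  word;
    q : out word
  );
end inverter;

-- Architecture: the internals of the circuit.
architecture main of inverter is
begin
  q <= invert(d);
end main;
```

Only what is declared in the package is visible to the entity; the function body stays hidden in the package body.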
1.3.3 Entities and Architecture
Each hardware module is described with an Entity/Architecture pair
[Figure 1.1: Entity and Architecture]
The syntax of VHDL is defined using a variation on Backus-Naur form (BNF). See Smith Appendix A.1 for a description of the rules for understanding VHDL grammar.
Entity: interface
  names, modes (in / out), types of externally visible signals of circuit
Architecture: internals
  structure and behaviour of module

  library ieee;
  use ieee.std_logic_1164.all;
  entity and_or is
    port (
      a, b, c : in  std_logic;
      z       : out std_logic
    );
  end and_or;
Figure 1.2: Example of an entity
  [ use_clause ]
  entity ENTITYID is
    [ port (
        SIGNALID : (in | out) TYPEID [ := expr ] ;
      ); ]
    [ declaration ]
  [ begin
    concurrent_statement ]
  end [ entity ] ENTITYID ;
Figure 1.3: Simplified grammar of entity
Architecture
  architecture main of and_or is
    signal x : std_logic;
  begin
    x <= a AND b;
    z <= x OR (a AND c);
  end main;
Figure 1.4: Example of architecture
  [ use_clause ]
  architecture ARCHID of ENTITYID is
    [ declaration ]
  begin
    [ concurrent_statement ]
  end [ architecture ] ARCHID ;
Figure 1.5: Simplified grammar of architecture
1.3.4 Concurrent Statements
Concurrent statements are used inside architectures
Concurrent statements execute in parallel (Figure 1.6)
Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally execute in parallel; VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
  1. samples its inputs
  2. computes the value of its output
  3. drives the output
  architecture main of bowser is
  begin
    x1 <= a AND b;
    x2 <= NOT x1;
    z  <= NOT x2;
  end main;

  architecture main of bowser is
  begin
    z  <= NOT x2;
    x2 <= NOT x1;
    x1 <= a AND b;
  end main;

[Circuit: a and b drive an AND gate producing x1; x1 is inverted to give x2; x2 is inverted to give z]
Figure 1.6: The order of concurrent statements doesn't matter
conditional assignment
  ... <= ... when ... else ...;
  normal assignment (... <= ...); if-then-else style (uses when)
  Smith Section 10.13.4
selected assignment
  with ... select ... <= ... when ... | ..., ... when others;
  case/switch style assignment
  Smith Section 10.13.4
component instantiation
  ...: ... port map ( ... => ..., ... );
  use an existing circuit
  section 1.3.5, Smith Section 10.13.6
for-generate
  ...: for ... in ... generate ... end generate;
  replicate some hardware
  Smith Section 10.13.7
if-generate
  ...: if ... generate ... end generate;
  conditionally create some hardware
  Smith Section 10.13.7
process
  process ... begin ... end process;
  the body of a process is executed sequentially
  Sections 1.3.6, 1.6; Smith Section 10.10
Figure 1.7: The most commonly used concurrent statements
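Three of the statements in Figure 1.7 are combined in the sketch below. The entity conc_demo and all of its port and signal names are hypothetical, chosen only to show the syntax in a complete design unit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration only.
entity conc_demo is
  port (
    sel : in  std_logic_vector(1 downto 0);
    en  : in  std_logic;
    d   : in  std_logic_vector(3 downto 0);
    y   : out std_logic;
    q   : out std_logic_vector(3 downto 0)
  );
end conc_demo;

architecture main of conc_demo is
  signal picked : std_logic;
begin
  -- selected assignment: a 4-to-1 multiplexer
  with sel select
    picked <= d(0) when "00",
              d(1) when "01",
              d(2) when "10",
              d(3) when others;

  -- conditional assignment: gate the multiplexer output
  y <= picked when en = '1' else '0';

  -- for-generate: replicate one inverter per bit of d
  invert : for i in 0 to 3 generate
    q(i) <= NOT d(i);
  end generate;
end main;
```

Because all three statements are concurrent, they could be written in any order without changing the behaviour.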
1.3.5 Component Declaration and Instantiations
There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax. Not all tools support the VHDL-93 syntax. In particular for E&CE 427, the Synopsys tools do not fully support the VHDL-93 syntax. See Smith Section 10.13.6 for more discussion on the syntax of component declaration and instantiation.
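To make the two syntaxes concrete, here is a sketch of a small testbench that instantiates the and_or entity of Figure 1.2 using the VHDL-87 style, with the VHDL-93 alternative shown in a comment. The testbench name and_or_tb, the label uut, and the stimulus values are assumptions; the sketch also assumes and_or has been compiled into the work library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: no ports, simulation only.
entity and_or_tb is
end and_or_tb;

architecture main of and_or_tb is
  -- VHDL-87 style: declare the component before instantiating it.
  component and_or
    port (
      a, b, c : in  std_logic;
      z       : out std_logic
    );
  end component;

  signal a, b, c, z : std_logic;
begin
  -- VHDL-87 style instantiation of the declared component.
  uut : and_or port map (a => a, b => b, c => c, z => z);

  -- VHDL-93 alternative, which needs no component declaration:
  --   uut : entity work.and_or(main)
  --     port map (a => a, b => b, c => c, z => z);

  stimulus : process
  begin
    a <= '1'; b <= '1'; c <= '0';
    wait for 10 ns;
    assert z = '1' report "expected z = 1 when a = b = 1" severity error;

    a <= '0'; b <= '1'; c <= '1';
    wait for 10 ns;
    assert z = '0' report "expected z = 0 when a = 0" severity error;

    wait;  -- suspend forever: end of stimulus
  end process;
end main;
```

The expected values follow from the architecture in Figure 1.4, where z is (a AND b) OR (a AND c).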
1.3.6 Processes
Processes are used to describe the behaviour of hardware
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
Example Process with Sensitivity List
  process (a, b, c)
  begin
    y <= a AND b;
    if (a = '1') then
      z1 <= b AND c;
      z2 <= NOT c;
    else
      z1 <= b OR c;
      z2 <= c;
    end if;
  end process;
Example Process with Wait Statements
  process
  begin
    y <= a AND b;
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
      y <= '0';
      wait until rising_edge(clk);
    else
      y <= a OR b;
    end if;
  end process;
Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.
Processes cannot have both a sensitivity list and a wait statement.
Sensitivity List
The sensitivity list contains the signals that are read in the process. An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. If you forget some signals, you will either end up with unpredictable hardware and simulation results (different results from different programs) or undesirable hardware (latches where you expected purely combinational hardware). For more on this topic, see sections 1.5.2 and 1.6.
There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.
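The guideline can be illustrated by placing an incomplete and a complete sensitivity list side by side. The entity sens_demo and its port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity comparing two versions of the same multiplexer.
entity sens_demo is
  port (
    a, b, sel : in  std_logic;
    z_bad     : out std_logic;
    z_good    : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Incomplete list: b and sel are read but not listed, so a simulator
  -- re-evaluates this process only when a changes.
  bad : process (a)
  begin
    if sel = '1' then
      z_bad <= a;
    else
      z_bad <= b;
    end if;
  end process;

  -- Complete list: every signal that is read appears in the list.
  good : process (a, b, sel)
  begin
    if sel = '1' then
      z_good <= a;
    else
      z_good <= b;
    end if;
  end process;
end main;
```

In simulation, z_bad can hold a stale value after b or sel changes, while z_good always behaves as a multiplexer. What a synthesis tool builds for the first process depends on the tool.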
Process Grammar
  [ PROCLAB : ] process ( sensitivity_list )
    [ declaration ]
  begin
    sequential_statement
  end process [ PROCLAB ] ;
Figure 1.8: Simplified grammar of process
1.3.7 Sequential Statements
Used inside processes and functions.
  wait                wait until ...;
  signal assignment   ... <= ...;
  if-then-else        if ... then ... elsif ... end if;
  case                case ... is when ... | ... => ...; when ... => ...; end case;
  loop                loop ... end loop;
  while loop          while ... loop ... end loop;
  for loop            for ... in ... loop ... end loop;
  next                next ...;
Figure 1.9: The most commonly used sequential statements
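Several of the statements in Figure 1.9 appear together in the sketch below. The entity seq_demo, its ports, and the choice of operations are hypothetical; the for loop relies on the rule that the last assignment to a signal in a pass through a process is the one that takes effect.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity for illustration only.
entity seq_demo is
  port (
    op    : in  std_logic_vector(1 downto 0);
    a, b  : in  std_logic;
    req   : in  std_logic_vector(3 downto 0);
    r     : out std_logic;
    grant : out std_logic_vector(1 downto 0);
    valid : out std_logic
  );
end seq_demo;

architecture main of seq_demo is
begin
  -- case statement with a multi-choice branch and an others branch
  logic_unit : process (op, a, b)
  begin
    case op is
      when "00"        => r <= a AND b;
      when "01" | "10" => r <= a OR b;
      when others      => r <= a XOR b;
    end case;
  end process;

  -- for loop and if statement: the highest-numbered request wins,
  -- because a later assignment overrides an earlier one
  encoder : process (req)
  begin
    grant <= "00";
    valid <= '0';
    for i in 0 to 3 loop
      if req(i) = '1' then
        grant <= std_logic_vector(to_unsigned(i, 2));
        valid <= '1';
      end if;
    end loop;
  end process;
end main;
```

Both processes assign every output on every pass, so they describe purely combinational hardware.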
1.4 Concurrent vs Sequential Statements
Concurrent assignments can be translated into sequential statements. But not all sequential statements can be translated into concurrent statements.
1.4.1 Concurrent Assignment vs Process
The two code fragments below have identical behaviour:

  architecture main of tiny is
  begin
    b <= a;
  end main;

  architecture main of tiny is
  begin
    process (a)
    begin
      b <= a;
    end process;
  end main;
1.4.2 Conditional Assignment vs If Statements
The two code fragments below have identical behaviour:
Concurrent Statements
  t <= <val1> when <cond>
       else <val2>;
Sequential Statements
  if <cond> then
    t <= <val1>;
  else
    t <= <val2>;
  end if;
1.4.3 Selected Assignment vs Case Statement
The two code fragments below have identical behaviour:
Concurrent Statements
  with <expr> select
    t <= <val1> when <choices1>,
         <val2> when <choices2>,
         <val3> when <choices3>;
Sequential Statements
  case <expr> is
    when <choices1> => t <= <val1>;
    when <choices2> => t <= <val2>;
    when <choices3> => t <= <val3>;
  end case;
1.4.4 Coding Style
Code that's easy to write with sequential statements, but difficult with concurrent:
  case <expr> is
    when <choice1> =>
      if <cond> then
        o <= <expr1>;
      else
        o <= <expr2>;
      end if;
    when <choice2> =>
      ...
  end case;
Overall structure:
  with <expr> select
    t <= ... when <choice1>,
         ... when <choice2>;
Failed attempt:
  with <expr> select
    t <= -- want to write:
         --   <val1> when <cond>
         --   else <val2>
         -- but conditional assignment
         -- is illegal here
         when c1,
         ... when c2;

Concurrent Statements (Cont'd)
Concurrent statement with correct behaviour, but messy:
  t <= <expr1> when (<expr> = <choice1> AND <cond>) else
       <expr2> when (<expr> = <choice1> AND NOT <cond>) else
       ... ;
Lesson: complicated, nested control constructs are easier with sequential statements than with concurrent statements.
1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!
  entity ENTITYID is
    interface declarations
  end ENTITYID;
  architecture ARCHID of ENTITYID is
  begin
    concurrent statements
    process begin
      sequential statements
    end process;
    concurrent statements
  end ARCHID;
Figure 1.10: Sequential statements in a process
Key concepts in VHDL semantics for processes:
  VHDL mimics hardware
  Hardware (gates) execute in parallel
  Processes execute in parallel with each other
  All possible orders of executing processes must produce the same simulation results (waveforms)
  If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must produce the same waveforms
It doesn't matter whether you are running on a single-threaded operating system, on a multi-threaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.
These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6) and lead to the phenomenon of latch-inference (Section 1.5.2).
  architecture
    procA: process
      stmtA1;
      stmtA2;
      stmtA3;
    end process;
    procB: process
      stmtB1;
      stmtB2;
    end process;

  Execution sequences:
    single threaded, procA before procB:          A1 A2 A3 B1 B2
    single threaded, procB before procA:          B1 B2 A1 A2 A3
    multithreaded, procA and procB in parallel:   A1 A2 A3 alongside B1 B2
Figure 1.11: Different process execution sequences
Figure 1.12: All execution orders must have same behaviour
Sections 1.5.1–1.5.3 discuss the hardware generated by processes. Sections 1.6–1.6.3 discuss the behaviour and execution of processes.
1.5.1 Combinational Process vs Clocked Process
Each synthesizable process is either combinational or clocked.
Combinational process:
  Executing the process takes part of one clock cycle
  Target signals are outputs of combinational circuitry
  A combinational process must have a sensitivity list
  A combinational process does not have any wait statements and does not have any events, rising_edges, or falling_edges in conditions for if or in case statements
  Hardware is just combinational circuitry
Clocked process:
  Executing the process takes one (or more) clock cycles
  Target signals are outputs of flops
  Process contains one or more wait or if rising_edge statements
  Hardware contains combinational circuitry and flip-flops
NOTE: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.
Example of Combinational Process
  process (a, b, c)
  begin
    p1 <= a;
    if (b = c) then
      p2 <= b;
    else
      p2 <= a;
    end if;
  end process;
Example Clocked Processes
  process
  begin
    wait until rising_edge(clk);
    b <= a;
  end process;

  process (clk)
  begin
    if rising_edge(clk) then
      b <= a;
    end if;
  end process;
1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

    process (a, b, c) begin
      if (a = '1') then
        z1 <= b;
        z2 <= b;
      else
        z1 <= c;
      end if;
    end process;

In this example, z2 is assigned only when a = '1', so it must hold its previous value on every pass where a is not '1'.

[Figure: circuit with inputs a, b, c and outputs z1, z2]
Figure 1.13: Example of latch inference
When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.

If you want a latch or a flip-flop for the signal, then latch inference is good. If you want combinational circuitry, then latch inference is bad.
Causes of Latch Inference

Generally, latch inference refers to the unintentional creation of latches. The usual cause of unintended latch inference is missing assignments to signals in if-then-else and case statements. Latch inference happens during elaboration. When using the Synopsys tools, look for "Inferred memory devices" in the output or log files.
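One common way to avoid an unintended latch is to give every target signal a value on every pass through the process. The sketch below reworks the latch-inference example under that assumption. The entity name no_latch and the choice of c as the default value for z2 are illustrative only and do not come from the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity name; the ports mirror the signals in Figure 1.13.
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c) begin
    -- Default assignment (assumed intent): z2 receives a value on every
    -- pass through the process, so no storage element is required.
    z2 <= c;
    if (a = '1') then
      z1 <= b;
      z2 <= b;   -- overrides the default on this branch
    else
      z1 <= c;
    end if;
  end process;
end main;
```

Because both z1 and z2 are now assigned on every path, the process should describe purely combinational circuitry.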
1.5.3 Combinational vs Flopped Signals

Signals assigned to in combinational processes are combinational. Signals assigned to in clocked processes are outputs of flip-flops. The one exception to this can occur in a clocked process that contains a signal that is assigned to in every branch of every if-then-else and case statement. Such a signal might be generated as combinational logic. Mixing combinational and clocked signals in the same process is bad design discipline, because it can lead to different results from different synthesis tools. So, if you follow good coding practices, you won't need to worry about this exception.
LEC-03: Details of Process Execution

Schedule
wk-06 wk-07 wk-08 wk-08 10 wk-11 12 wk-13
Overview
This lecture relates fragments of VHDL code to the basic building blocks of hardware: flip-flops, Boolean gates, arithmetic circuits, etc. The semantics of VHDL are behavioural, not structural, but by understanding the behavioural semantics of VHDL we can derive the relationship between VHDL code and netlists.
Concepts
- temporal granularities
- process modes
- simulation cycle
- simulation step
- delta cycle
- simulation round
- provisional assignment
1.6 Details of Process Execution

1.6.1 Definitions and Algorithm

1.6.1.1 Temporal Granularities of Simulation

This begins our discussion of the behaviour and execution of processes. There are several different granularities of time to analyze VHDL behaviour. In this course, we will discuss three major granularities: clock cycles, timing simulation, and delta cycles.
Clock Cycle
- smallest unit of time is a clock cycle
- combinational logic has zero delay
- flip-flops have a delay of one clock cycle
- used for simulation early in the design cycle
- fastest simulation run times
Timing Simulation
- smallest unit of time is a nanosecond, picosecond, or femtosecond
- combinational logic and wires have delay as computed by timing analysis tools
- flip-flops have setup, hold, and clock-to-Q timing parameters
- used for simulation when fine-tuning the design and confirming that timing constraints are satisfied
- slow simulation times for large circuits
Delta Cycles
- units of time are artifacts of VHDL semantics and simulation software
- simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
- VHDL semantics are defined in terms of these concepts

In assignments and exams, you will need to be able to simulate VHDL code at each of the three different levels of temporal granularity. In the laboratories and project, you will use simulation programs for both clock-cycle simulation and timing simulation. We don't have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....
Definitely Delta
For the remainder of Section 1.6, we'll look at only the delta-cycle view of the world.
1.6.1.2 Process Modes

Each process is in one of the following modes: active, suspended, or postponed.
NOTE: postponed. This use of the word "postponed" differs from that in the VHDL Standard. We won't be using postponed processes as defined in the Standard.
Process Modes
[Figure: state diagram of the three process modes. Transitions: postponed to active (activate), active to suspended (suspend), suspended to postponed (resume).]
Suspended
- Nothing to currently execute
- A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement
Postponed
- Wants to execute, but not currently active
- A process becomes active when the simulator chooses it from the pool of postponed processes
Active
- Currently executing
- A process stays active until it hits a wait statement or completes the execution of the last statement in the process, at which point it suspends
1.6.1.3 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the VHDL Standard. The most significant simplification is that this algorithm does not support delayed assignments. To support delayed assignments, each signal's provisional value would be generalized to an event wheel, which holds provisional assignments for multiple times in the future. A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.
Initialization
Simulations start at step 6 with all processes postponed and all signals with a default value ('U' for std_logic).
The Algorithm

1. All processes are suspended.
2. Each process looks at the signals that changed value and checks its sensitivity list or wait condition to see if it should resume.
3. Update signals with their provisional values.
4. Resume all suspended processes whose sensitivity list changed or wait condition became true.
5. If there are no postponed processes, then simulation time increments to the next scheduled event and the simulation continues at Step 1.
6. While there are postponed processes:
   (a) Pick one or more postponed processes to become active.
   (b) As a process executes, assignments to signals are provisional: new values do not become visible until step 3 in the next simulation cycle.
   (c) A process runs until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
7. Calculate the new simulation time:
   If zero-delay assignments were made in the current simulation cycle
   then simulation time does not advance
   else simulation time is set to the time of the next scheduled event
NOTE: Parallel execution. In n-threaded execution, at most n processes are active at a time.
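The effect of step 6(b) can be observed directly in a simulator. The following is a minimal sketch, not part of the course notes: the entity name provisional_demo and the signal s are hypothetical, and the initial value on s is used only because this is simulation-only code. It checks that a signal assignment is not visible until the process has suspended and the next simulation cycle has updated the signal.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only entity; nothing here comes from the notes.
entity provisional_demo is
end provisional_demo;

architecture sim of provisional_demo is
  signal s : std_logic := '0';
begin
  process begin
    s <= '1';                 -- step 6(b): provisional assignment only
    assert s = '0'            -- the visible value is still the old one
      report "assignment became visible too early" severity error;
    wait for 0 ns;            -- suspend; resume in the next delta cycle
    assert s = '1'            -- step 3 has now updated s
      report "assignment was not applied" severity error;
    wait;                     -- suspend forever
  end process;
end sim;
```

If the simulator follows the semantics described above, neither assertion should fire: the first read sees the old value, and the second read happens one delta cycle later at the same simulation time.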
1.6.1.4 Delta-Cycle Definitions

Definition (simulation step): Executing one sequential assignment.
Definition (simulation cycle): The operations that occur between the time when all processes are suspended, until all are suspended again.
Definition (delta cycle): A simulation cycle that does not advance simulation time. Equivalently: a simulation cycle with zero-delay assignments.
Definition (simulation round): A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of delta cycles.
NOTE: Official and unofficial terminology. Simulation cycle and delta cycle are official definitions in the VHDL Standard. Simulation step and simulation round are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.
1.6.2 Example: Process Execution

    library ieee;
    use ieee.std_logic_1164.all;

    entity bamboozle is
      port (
        a, b : in std_logic;
        e : out std_logic
      );
    end bamboozle;

    architecture main of bamboozle is
      signal c, d : std_logic;
    begin
      procA : process (a, b) begin
        c <= a AND b;
      end process;
      procB : process (b, c, d) begin
        d <= NOT c;
        e <= b AND d;
      end process;
    end main;

Figure 1.14: Example circuit for process execution
In this simulation run, a and b are external inputs with the following scheduled events:

    a: ('0' at 0 ns), ('1' at 10 ns), ('0' at 15 ns)
    b: ('1' at 0 ns), ('0' at 12 ns)

In this example, we will treat the external inputs as if they were driven by an external process.
[Figure: waveform of signals a, b, c, d, and e, with time markers at 0 ns, 10 ns, 12 ns, and 15 ns]
Run of external inputs
Legend
- process mode: S = suspended, P = postponed, A = active
- simulation-step pointer (one per process)
- each signal shows its visible value and its provisional-assignment value
- initial values of a, b, c, d, and e are all U

    procA: process (a, b) begin
      c <= a AND b;
    end process;

    procB: process (b, c, d) begin
      d <= NOT c;
      e <= b AND d;
    end process;

    procC: process begin
      a <= '0';
      b <= '1';
      wait for 10 ns;
      a <= '1';
      wait for 2 ns;
      b <= '0';
      wait for 3 ns;
      a <= '0';
    end process;
First simulation cycle (0 ns)
- Step 6: Initial conditions. All processes are postponed; a, b, c, d, and e are all U.
- Step 6(a): Activate procA
- Step 6(b): Provisional assignment to c (provisional U, visible U)
- Step 6(c): Suspend procA
- Step 6(a): Activate procC
- Step 6(b): Provisional assignment to a (provisional 0, visible U)
- Step 6(b): Provisional assignment to b (provisional 1, visible U)
- Step 6(c): Suspend procC
- Step 6(a): Activate procB
- Step 6(b): Provisional assignment to d (provisional U, visible U)
- Step 6(b): Provisional assignment to e (provisional U, visible U)
- Step 6(c): Suspend procB
- All processes suspended: end of simulation cycle
- Step 7: Simulation time remains at 0 ns
Second simulation cycle (0 ns)
- Step 1: Beginning of next simulation cycle
- Step 2: Check sensitivity lists for changes
- Step 3: Update signal values (a = 0, b = 1; c, d, and e remain U)
- Step 4: Resume procA and procB
- Step 6(a): Activate procA
- Step 6(b): Provisional assignment to c (provisional 0, visible U)
- Step 6(c): Suspend procA
- Step 6(a): Activate procB
- Step 6(b): Provisional assignment to d (provisional U, visible U)
- Step 6(b): Provisional assignment to e (provisional U, visible U)
- Step 7: All processes suspended; end of simulation cycle

Note: In the original waveform, the first simulation cycle is compacted into two columns. This is done only in this example to save space and is not standard practice.
Third simulation cycle (0 ns)
- Step 1: Begin next simulation cycle
- Step 2: Check sensitivity lists for changes
- Step 3: Update signal values (c = 0)
- Step 4: Resume procB
- Steps 6(a,b): Activate procB; provisional assignment to d (provisional 1, visible U)
- Step 6(b): Provisional assignment to e (provisional U, visible U)
- Step 7: All processes suspended; end of simulation cycle
Fourth simulation cycle (0 ns)
- Step 1: Begin next simulation cycle
- Step 2: Check sensitivity lists for changes
- Step 3: Update signals (d = 1)
- Step 4: Resume procB
- Steps 6(a,b): Activate procB; provisional assignment to d (provisional 1, visible 1)
- Step 6(b): Provisional assignment to e (provisional 1, visible U)
- Step 7: No changes to "sensitized" signals; time advances to 10 ns. The signal e is updated to 1, but no process is sensitive to e. The simulation cycles at 0 ns together form one simulation round.
First simulation cycle at 10 ns
- Step 1: Begin next simulation cycle (not shown)
- Step 2: Resume procC
- Step 2: Check sensitivity lists for changes (not shown)
- Step 3: Update signal values (not shown)
- Steps 6(a,b): Activate procC; provisional assignment to a (provisional 1, visible 0)
- Step 6(c): Suspend procC; end of simulation cycle

Second simulation cycle at 10 ns
- Step 1: Begin next simulation cycle (not shown)
- Step 2: Check sensitivity lists for changes
- Step 3: Update signal values (a = 1)
- Step 4: Resume procA
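The walkthrough of this example can be checked against a simulator by placing the three processes in one simulation-only architecture. This is a sketch under stated assumptions: the entity name bamboozle_sim and the check process are hypothetical additions, and a final wait has been added to procC so that it stops driving the inputs after 15 ns (the version on the slides loops back to its first statement).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical name: a self-contained wrapper around the three processes.
entity bamboozle_sim is
end bamboozle_sim;

architecture sim of bamboozle_sim is
  signal a, b, c, d, e : std_logic;   -- no initial values: all start at 'U'
begin
  procA : process (a, b) begin
    c <= a and b;
  end process;

  procB : process (b, c, d) begin
    d <= not c;
    e <= b and d;
  end process;

  procC : process begin
    a <= '0';
    b <= '1';
    wait for 10 ns;
    a <= '1';
    wait for 2 ns;
    b <= '0';
    wait for 3 ns;
    a <= '0';
    wait;          -- assumption: stop driving after 15 ns
  end process;

  -- Checker (not in the notes): once the simulation round at 0 ns has
  -- settled, the walkthrough ends with c = '0', d = '1', and e = '1'.
  check : process begin
    wait for 1 ns;
    assert c = '0' and d = '1' and e = '1'
      report "unexpected values after the 0 ns simulation round"
      severity error;
    wait;
  end process;
end sim;
```

A clock-cycle or timing waveform viewer will show only the settled values at each simulation time; the intermediate delta cycles listed in the walkthrough are usually not drawn.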
Note and Questions

NB: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.
Question: What are the different granularities of time that occur when doing delta-cycle simulation?
Answer: simulation step, simulation cycle, delta cycle, simulation round
Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?
Answer: same order as listed just above
1.6.3 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different results for different process execution orderings.

    architecture main of flotsam is
    begin
      p_c: process (a, b) begin
        c <= a AND b;
      end process;
      p_d: process (a, c) begin
        d <= a XOR c;
      end process;
    end main;

Figure 1.15: Circuit to illustrate need for provisional assignments
1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.
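If assignments are not visible within the same simulation cycle (correct: i.e. provisional assignments are used)

[Figure: process modes of p_c and p_d and waveforms of a, b, c, and d for each scheduling order, starting from a = b = c = d = 0]

- If p_c is scheduled before p_d, then d will have a 1 pulse.
- If p_d is scheduled before p_c, then d will have a 1 pulse.

In both orders, p_d reads the old value of c (0) in the first simulation cycle, so d is assigned 1 XOR 0 = 1. Once c has been updated to 1, p_d runs again and d returns to 1 XOR 1 = 0.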
If assignments are visible within the same simulation cycle (incorrect)

[Figure: process modes of p_c and p_d and waveforms of a, b, c, and d for each scheduling order, starting from a = b = c = d = 0]

- If p_c is scheduled before p_d, then d will stay constant 0, because p_d already sees c = 1 and computes 1 XOR 1 = 0.
- If p_d is scheduled before p_c, then d will have a 1 pulse.

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
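The one-delta-cycle pulse on d can be observed with a small testbench. This is a sketch only: the entity name flotsam_tb and the stimulus and monitor processes are hypothetical, the 10 ns stimulus time is arbitrary, and the initial values on the signals are acceptable here only because the code is for simulation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only wrapper around p_c and p_d.
entity flotsam_tb is
end flotsam_tb;

architecture sim of flotsam_tb is
  signal a, b, c, d : std_logic := '0';   -- "start with all signals at 0"
begin
  p_c : process (a, b) begin
    c <= a and b;
  end process;

  p_d : process (a, c) begin
    d <= a xor c;
  end process;

  -- Simultaneously change a and b to '1' (time chosen arbitrarily).
  stimulus : process begin
    wait for 10 ns;
    a <= '1';
    b <= '1';
    wait;
  end process;

  -- Print every value that d takes, with the simulation time.
  monitor : process (d) begin
    report "d = " & std_logic'image(d) & " at " & time'image(now);
  end process;
end sim;
```

Under the provisional-assignment semantics, the monitor should report d changing to '1' and then back to '0', both at 10 ns, regardless of the order in which the simulator schedules p_c and p_d.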
LEC-04: Hardware Building Blocks

Schedule
wk-06 wk-07 wk-08 wk-08 10 wk-11 12 wk-13
Overview
This lecture uses the VHDL semantics from Lecture 03 to describe how we determine what hardware will be synthesized from VHDL.
Concepts
Lecture Notes: Sections 1.7 to 1.9.7
- basic building blocks
- flip-flops and latches
- coding flip-flops
- coding sequential circuits
- good and bad coding practices
- arithmetic operations on signals
1.7 VHDL and Hardware Building Blocks
This section outlines the building blocks for register transfer level design and how to write VHDL code for the building blocks.
1.7.1 Basic Building Blocks

[Figure: schematic symbols for the basic building blocks, including a 2:1 mux (also: n-to-1 muxes), a flip-flop with D and CE pins, and memory arrays with WE, A, DI, and DO pins]

VHDL                                      Hardware
and, or, nand, nor, xor, xnor             AND, OR, NAND, NOR, XOR, XNOR
if-then-else, case statement,             multiplexer
  selected assignment,
  conditional assignment
+, -                                      adder, subtracter, negater
sll, srl, sla, sra, rol, ror              shifter, rotater
wait until, if-then-else, rising_edge     flip-flop
2-d array or library component            memory array, register file, queue

Figure 1.16: RTL Building Blocks
1.7.2 Deprecated Building Blocks for RTL

Latches
- Use flops, not latches
- Latch-based designs are susceptible to timing problems
- The transparent phase of a latch can let a signal "leak" through a latch, causing the signal to affect the output one clock cycle too early
- It's possible for a latch-based circuit to simulate correctly, but not work in real hardware, because the timing delays on the real hardware don't match those predicted in synthesis
T, JK, SR, etc. flip-flops
- Limit yourself to D-type flip-flops
- Most FPGA and ASIC cell libraries include only D-type flip-flops (however, the flip-flops in Altera's APEX FPGAs can be configured as D, T, JK, or SR flip-flops)
Tri-state buffers
- Use multiplexers, not tri-state buffers
- Tri-state designs are susceptible to stability and signal integrity problems
- Getting tri-state designs to simulate correctly is difficult; some library components don't support tri-state signals
- Tri-state designs rely on the code never letting two signals drive the bus at the same time
- It can be difficult to check that bus arbitration will always work correctly
- Manufacturing and environmental variability can make real hardware not work correctly even if it simulates correctly
- Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state signals at the board level
1.7.3 Hardware and Code for Flops

1.7.3.1 Flip-Flops vs Latches

flip-flop: Edge sensitive: output only changes on rising (or falling) edge of clock
latch: Level sensitive: output changes whenever clock is high (or low)

A common implementation of a flip-flop is a pair of latches (Master/Slave flop). Latches are sometimes called transparent latches, because they are transparent (input directly connected to output) when the clock is high. The clock to a latch is sometimes called the enable line. There is more information in the course notes on timing analysis for storage devices (Section 5.6).
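The difference between the two storage elements shows up in how they are typically coded. The sketch below is illustrative only: the entity flop_vs_latch and its port names are hypothetical, and the latch is included purely for contrast, since these notes recommend flops over latches.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: one data input feeding both storage elements.
entity flop_vs_latch is
  port (
    clk, d          : in  std_logic;
    q_flop, q_latch : out std_logic
  );
end flop_vs_latch;

architecture main of flop_vs_latch is
begin
  -- Edge sensitive: q_flop changes only on the rising edge of clk.
  flop : process (clk) begin
    if rising_edge(clk) then
      q_flop <= d;
    end if;
  end process;

  -- Level sensitive: q_latch follows d while clk is high and holds its
  -- value while clk is low (no assignment is made on that pass).
  latch : process (clk, d) begin
    if (clk = '1') then
      q_latch <= d;
    end if;
  end process;
end main;
```

The latch process is an instance of latch inference: q_latch is assigned on some passes and not on others, and there is no clock-edge test.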
1.7.3.2 Flops with Waits and Ifs

The two code fragments below synthesize to identical hardware (flops).

If

    process (clk) begin
      if rising_edge(clk) then
        q <= d;
      end if;
    end process;

Wait

    process begin
      wait until rising_edge(clk);
      q <= d;
    end process;
1.7.3.3 Flops with Synchronous Reset

The two code fragments below synthesize to identical hardware (flops with synchronous reset). Notice that the synchronous reset is really nothing more than an AND gate on the input.

If

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          q <= '0';
        else
          q <= d;
        end if;
      end if;
    end process;

Wait

    process begin
      wait until rising_edge(clk);
      if (reset = '1') then
        q <= '0';
      else
        q <= d;
      end if;
    end process;
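To make the AND-gate remark concrete, the reset can be written as an explicit gate in front of a plain flop. This is a sketch, not the form the notes recommend: the entity sync_reset_and and the signal d_gated are hypothetical names.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity and signal names.
entity sync_reset_and is
  port (
    clk, reset, d : in  std_logic;
    q             : out std_logic
  );
end sync_reset_and;

architecture main of sync_reset_and is
  signal d_gated : std_logic;
begin
  -- The reset path as an explicit AND gate on the flop input:
  -- reset = '1' forces '0', reset = '0' passes d through.
  d_gated <= d and (not reset);

  process (clk) begin
    if rising_edge(clk) then
      q <= d_gated;
    end if;
  end process;
end main;
```

For inputs that are '0' or '1' this should behave the same as the if-then-else versions. In simulation the two forms can differ when reset is 'U' or 'X', which is one reason to prefer the idiom that the synthesis tool recognizes.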
1.7.3.4 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).

If

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          q <= d;
        end if;
      end if;
    end process;

Wait

    process begin
      wait until rising_edge(clk);
      if (ce = '1') then
        q <= d;
      end if;
    end process;
1.7.3.5 Flops with Chip-Enable and Mux on Input

The two code fragments below synthesize to identical hardware (flops with chip-enable lines and muxes on inputs).

If

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          if (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait

    process begin
      wait until rising_edge(clk);
      if (ce = '1') then
        if (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;
1.7.3.6 Flops with Chip-Enable, Muxes, and Reset

The two code fragments below synthesize to identical hardware (flops with chip-enable lines, muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing more than a mux, or an AND gate on the input.

NB: The specific combination and order of tests is important to guarantee that the circuit synthesizes to a flop with a chip enable, as opposed to a level-sensitive latch testing the chip enable and/or reset followed by a flop.

NB: The chip-enable pin on the flop is connected to both ce and reset. If the chip-enable pin was not connected to reset, then the flop would ignore reset unless chip-enable was asserted.
Chip-Enable, Mux, Reset with If

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1' or reset = '1') then
          if (reset = '1') then
            q <= '0';
          elsif (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;
Chip-Enable, Mux, Reset with Wait

    process begin
      wait until rising_edge(clk);
      if (ce = '1' or reset = '1') then
        if (reset = '1') then
          q <= '0';
        elsif (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;
1.7.4 An Example Sequential Circuit

There are many ways to write VHDL code that synthesizes to the schematic in Figure 1.17. The two major choices in the styles are:
- Put all of the code in a single process, or have a collection of clocked processes, combinational processes, and concurrent statements.
- Use wait or if rising_edge for flip-flops.

Some examples of these different options are shown in Figures 1.18 to 1.21.
[Figure: schematic with signals sel, reset, a, c, and clk]

    entity and_not_reg is
      port (
        reset, clk, sel : in std_logic;
        c : out std_logic
      );
    end;

Schematic and entity for examples of different code organizations in Figures 1.18 to 1.21
Figure 1.17: Schematic and entity for and_not_reg
One Process

    architecture one_proc of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
        c <= NOT a;
      end process;
    end one_proc;

Figure 1.18: One process implementation of Figure 1.17
Two Processes with Wait

    architecture two_proc_wait of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
      end process;
      process begin
        wait until rising_edge(clk);
        c <= NOT a;
      end process;
    end two_proc_wait;

Figure 1.19: Two processes with wait implementation of Figure 1.17
Two Processes with If-Then-Else

    architecture two_proc_if of and_not_reg is
      signal a : std_logic;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          elsif (sel = '1') then
            a <= NOT a;
          else
            a <= a;
          end if;
        end if;
      end process;
      process (clk) begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
    end two_proc_if;

Figure 1.20: Two processes with if-then-else implementation of Figure 1.17
    architecture comb of and_not_reg is
      signal a, b, d : std_logic;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          else
            a <= d;
          end if;
        end if;
      end process;
      process (clk) begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
      d <= b when (sel = '1') else a;
      b <= NOT a;
    end comb;

Figure 1.21: Concurrent statement implementation of Figure 1.17
1.8 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns. It's important to use idioms that your synthesis tool recognizes. If you aren't careful, you could write code that has the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware. Section 1.7 described common idioms and the resulting hardware.
1.8.1 Unsynthesizable Code
1.8.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

    signal bad_signal : std_logic := '0';

Reason: In most implementation technologies, when a circuit powers up, the values on signals are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is powered up, all flip-flops will be '0'. For other FPGAs, the initial values can be programmed.
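A common way to avoid relying on an initial value is to give the signal a known value with a reset. The sketch below is only an illustration: the entity reset_example and its ports are hypothetical names, not part of the course code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only for this illustration.
entity reset_example is
  port (
    clk, reset, d : in  std_logic;
    q             : out std_logic
  );
end reset_example;

architecture main of reset_example is
  signal good_signal : std_logic;  -- no initial value in the declaration
begin
  process begin
    wait until rising_edge(clk);
    if (reset = '1') then
      good_signal <= '0';          -- the known value comes from reset
    else
      good_signal <= d;
    end if;
  end process;
  q <= good_signal;
end main;
```

The flop still powers up with an unknown value, so the environment is assumed to assert reset for at least one clock edge before q is used.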
1.8.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

    wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2ns delay in all environments.
1.8.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

    -- different clock signals
    process begin
      wait until rising_edge(clk1);
      x <= a;
      wait until rising_edge(clk2);
      x <= a;
    end process;

    -- different clock edges
    process begin
      wait until rising_edge(clk);
      x <= a;
      wait until falling_edge(clk);
      x <= a;
    end process;

Reason: Processes with multiple wait statements are turned into finite state machines. The wait statements denote transitions between states. The target signals in the process are outputs of flip-flops. Using different wait conditions would require the flip-flops to use different clock signals at different times. Multiple clock signals for a single flip-flop would be difficult to synthesize, inefficient to build, and fragile to operate.
1.8.1.4 Multiple if rising edges in Same Process

Multiple if rising edge statements in a process (UNSYNTHESIZABLE)

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      if rising_edge(clk) then
        q1 <= d1;
      end if;
    end process;

Reason: The idioms for synthesis tools generally expect just a single if rising edge statement in each process.
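One way to get the same two flops within the idiom is to merge the assignments under a single if rising edge statement. This is a minimal sketch; the entity name two_flops and its ports are assumptions made for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity wrapping the two flops from the example above.
entity two_flops is
  port (
    clk, d0, d1 : in  std_logic;
    q0, q1      : out std_logic
  );
end two_flops;

architecture main of two_flops is
begin
  process (clk) begin
    if rising_edge(clk) then  -- the only edge test in this process
      q0 <= d0;
      q1 <= d1;
    end if;
  end process;
end main;
```

Splitting the two assignments into two separate processes, each with its own if rising edge statement, should work equally well.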
1.8.1.5 if rising edge and wait in Same Process

An if rising edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      wait until rising_edge(clk);
      q0 <= d1;
    end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.
1.8.1.6 if rising edge with else Clause

The if statement has a rising edge condition and an else clause (UNSYNTHESIZABLE).

    process (clk) begin
      if rising_edge(clk) then
        q0 <= d0;
      else
        q0 <= d1;
      end if;
    end process;

Reason: q0 is supposed to be the output of a flip-flop in one case and the output of combinational circuitry in another.
1.8.1.7 if rising edge Inside a for Loop

An if rising edge statement in a for-loop (UNSYNTHESIZABLE in Synopsys)

    process (clk) begin
      for i in 0 to 7 loop
        if rising_edge(clk) then
          q <= d;
        end if;
      end loop;
    end process;

Synthesizable Alternative

A synthesizable alternative to an if rising edge statement in a for-loop is to put the if rising edge outside of the for loop.

    process (clk) begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q <= d;
        end loop;
      end if;
    end process;

Reason: just an idiom of the synthesis tool.

Synthesizable for loops are described in Rushton Section 8.7. For loops in general are described in Ashenden. Examples of for loops in E&CE 427 will appear when describing testbenches for functional validation.
1.8.1.8 wait Inside of a for loop

wait statements in a for loop (UNSYNTHESIZABLE)

    process begin
      for i in 0 to 7 loop
        wait until rising_edge(clk);
        x <= to_unsigned(i,4);
      end loop;
    end process;

Reason: Unknown. For-loops are generally unsynthesizable, but while-loops with the same behaviour are synthesizable.

NOTE (for loops): For loops are very useful in simulation, particularly for test benches.
Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.

    process begin
      -- output values from 0 to 4 on i
      -- sending one value out each clock cycle
      i <= to_unsigned(0,4);
      wait until rising_edge(clk);
      while (4 > i) loop
        i <= i + 1;
        wait until rising_edge(clk);
      end loop;
    end process;
1.8.2 Synthesizable, but Undesirable Hardware

NB: The results for the examples in this section are highly dependent upon the tool that you use and the target technology library.
1.8.2.1 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.

    process (reset, clk) begin
      if (reset = '1') then
        q <= '0';
      elsif rising_edge(clk) then
        q <= d1;
      end if;
    end process;
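For contrast, a synchronous reset puts the test for reset inside the test for the clock edge, as in the idioms of Section 1.7. The sketch below assumes a hypothetical entity sync_reset_flop; only the process body follows the pattern used in these notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration.
entity sync_reset_flop is
  port (
    clk, reset, d1 : in  std_logic;
    q              : out std_logic
  );
end sync_reset_flop;

architecture main of sync_reset_flop is
begin
  process (clk) begin           -- reset is not in the sensitivity list
    if rising_edge(clk) then
      if (reset = '1') then     -- reset is sampled only on the clock edge
        q <= '0';
      else
        q <= d1;
      end if;
    end if;
  end process;
end main;
```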
1.8.2.2 Bad Form of Nested Ifs

if rising edge statement inside another if (BAD HARDWARE)

In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is a flop.

    process (ce, clk) begin
      if (ce = '1') then
        if rising_edge(clk) then
          q <= d1;
        end if;
      end if;
    end process;
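A form that tools more commonly recognize as a flop with a chip-enable swaps the nesting, so that the edge test is outermost. This is a sketch under that assumption; the entity name ce_flop is hypothetical, and the result still depends on the tool and target library.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration.
entity ce_flop is
  port (
    clk, ce, d1 : in  std_logic;
    q           : out std_logic
  );
end ce_flop;

architecture main of ce_flop is
begin
  process (clk) begin
    if rising_edge(clk) then   -- edge test on the outside
      if (ce = '1') then       -- enable test on the inside
        q <= d1;
      end if;                  -- when ce = '0', q keeps its old value
    end if;
  end process;
end main;
```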
1.8.2.3 Deeply Nested Ifs

Deeply chained if-then-else statements can lead to long chains of dependent gates, rather than checking different cases in parallel.

Slow (maybe)

    if cond1 then
      stmts1
    elsif cond2 then
      stmts2
    elsif cond3 then
      stmts3
    elsif cond4 then
      stmts4
    end if;

Fast (hopefully)

If only one of the conditions can be true at a time, then try using a case statement or some other technique that allows the conditions to be evaluated in parallel.
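As a sketch of the case-statement alternative, assume the four conditions can be encoded as the values of a 2-bit select signal. The entity case_mux, its ports, and the encoding are all assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 4-way selector; the conditions are mutually exclusive
-- because they are the distinct values of sel.
entity case_mux is
  port (
    sel        : in  std_logic_vector(1 downto 0);
    a, b, c, d : in  std_logic;
    z          : out std_logic
  );
end case_mux;

architecture main of case_mux is
begin
  process (sel, a, b, c, d) begin
    case sel is
      when "00"   => z <= a;
      when "01"   => z <= b;
      when "10"   => z <= c;
      when others => z <= d;   -- covers "11" and metavalues
    end case;
  end process;
end main;
```

Whether this is actually faster than the if-then-else chain depends on the synthesis tool, which may produce the same hardware for both.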
1.9 Numbers, Arithmetic, Arrays, and Signals

VHDL includes all of the common arithmetic and logical operators. Use the VHDL arithmetic operators and let the synthesis tool choose the better implementation for you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic libraries. To use the operators, you must choose which arithmetic package you wish to use (Section 1.9.1). The arithmetic operators are overloaded, and you can usually use any mixture of constants and signals of different types that you need (Section 1.9.3). However, you might need to convert a signal from one type (e.g. std_logic_vector) to another type (e.g. integer) (Section 1.9.7).
1.9.1 Arithmetic Packages

Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.

To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic. numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package; otherwise the different definitions will clash and you can get strange error messages.
1.9.2 Shift and Rotate Operations

Shift and rotate operations are described with three-character acronyms:

    shift/rotate   left/right   arithmetic/logical

The shift right arithmetic (sra) operation preserves the sign of the operand by copying the most significant bit into lower bit positions. The shift left arithmetic (sla) operation is analogous, except that the least significant bit is copied.
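The behaviour of these operators can be checked with a small simulation-only sketch. It uses bit_vector, on which the VHDL-93 shift and rotate operators are predefined; the entity name shift_demo and the test values are made up for the example. For signed and unsigned vectors, numeric_std provides functions such as shift_left and shift_right instead.

```vhdl
-- Simulation-only sketch: each assert passes silently if the comment is right.
entity shift_demo is
end shift_demo;

architecture sim of shift_demo is
  constant x : bit_vector(7 downto 0) := "10010110";
  constant y : bit_vector(7 downto 0) := "10010111";
begin
  process begin
    -- srl: shift right logical, fills with '0'
    assert (x srl 2) = "00100101" report "srl failed" severity failure;
    -- sra: shift right arithmetic, copies the most significant bit
    assert (x sra 2) = "11100101" report "sra failed" severity failure;
    -- sla: shift left arithmetic, copies the least significant bit
    assert (x sla 2) = "01011000" report "sla (lsb 0) failed" severity failure;
    assert (y sla 2) = "01011111" report "sla (lsb 1) failed" severity failure;
    -- rol: rotate left, bits shifted out on the left re-enter on the right
    assert (x rol 2) = "01011010" report "rol failed" severity failure;
    wait;
  end process;
end sim;
```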
1.9.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers. Tables 1.1–1.4 show the different combinations of target and source types and widths that can be used.

Table 1.1: Overloading of Arithmetic Operations (+, -)

    target     src1       src2       result
    unsigned   unsigned   integer    OK
    unsigned   integer    unsigned   OK
    –          unsigned   signed     fails in analysis

In these tables, – means don't care.
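The overloading can be sketched in a few lines of simulation-only code. The entity overload_demo and the values are assumptions for the example; the operators themselves come from numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity overload_demo is
end overload_demo;

architecture sim of overload_demo is
  constant u1   : unsigned(7 downto 0) := to_unsigned(10, 8);
  signal t1, t2 : unsigned(7 downto 0);
begin
  t1 <= u1 + 5;       -- unsigned + integer: OK
  t2 <= 5 + u1;       -- integer + unsigned: OK
  -- An unsigned + signed expression is rejected in analysis, because
  -- numeric_std defines no "+" that mixes the two vector types.

  process begin
    wait for 1 ns;
    assert t1 = 15 report "unsigned + integer failed" severity failure;
    assert t2 = 15 report "integer + unsigned failed" severity failure;
    wait;
  end process;
end sim;
```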
1.9.4 Different Widths and Arithmetic

Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

    target   src1/2   src2/1   result
    narrow   wide     –        fails in elaboration
    wide     narrow   int      fails in elaboration
    wide     wide     –        OK
    narrow   narrow   narrow   OK
    narrow   narrow   int      OK

    Example vectors
    wide     unsigned(7 downto 0)
    narrow   unsigned(4 downto 0)
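The width rules follow from the fact that, in numeric_std, the result of + is as wide as its widest vector operand. The sketch below uses the example widths given above; the entity width_demo, the constant names, and the use of resize as a workaround are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity width_demo is
end width_demo;

architecture sim of width_demo is
  constant wide_c   : unsigned(7 downto 0) := to_unsigned(200, 8);
  constant narrow_c : unsigned(4 downto 0) := to_unsigned(17, 5);
  signal wide_t1, wide_t2 : unsigned(7 downto 0);
  signal narrow_t         : unsigned(4 downto 0);
begin
  wide_t1  <= wide_c + narrow_c;        -- 8-bit result into 8-bit target: OK
  narrow_t <= narrow_c + 3;             -- 5-bit result into 5-bit target: OK
  -- narrow_c + 3 is only 5 bits wide, so assigning it directly to an
  -- 8-bit target is a length mismatch. Widening the source first avoids it:
  wide_t2  <= resize(narrow_c, 8) + 3;

  process begin
    wait for 1 ns;
    assert wide_t1  = 217 report "wide + narrow failed" severity failure;
    assert narrow_t = 20  report "narrow + int failed" severity failure;
    assert wide_t2  = 20  report "resize failed" severity failure;
    wait;
  end process;
end sim;
```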
1.9.5 Overloading of Comparisons

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)

    src1       src2       result
    unsigned   integer    OK
    integer    unsigned   OK
    signed     integer    OK
    integer    signed     OK
    unsigned   signed     fails in analysis
    signed     unsigned   fails in analysis
1.9.6 Different Widths and Comparisons

Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)

    src1     src2   result
    wide     –      OK
    narrow   –      OK
1.9.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned, and to_signed are used to convert between integers, std_logic vectors, signed vectors, and unsigned vectors. The listing below summarizes the types of these functions.

    unsigned( val : std_logic_vector )            return unsigned;
    signed( val : std_logic_vector )              return signed;
    to_integer( val : signed )                    return integer;
    to_integer( val : unsigned )                  return integer;
    to_unsigned( val : natural; width : natural ) return unsigned;
    to_signed( val : integer; width : natural )   return signed;
The most common example of converting between two types arises when using a signal as an index into an array. To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer (Figure 1.22).

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal uns_sig : unsigned(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer(uns_sig) );
    ...

Figure 1.22: Using a signal as an index to array

To convert a std_logic_vector into an integer, you must first say whether the signal should be interpreted as signed or unsigned. As illustrated in figure 1.23, this is done by:

1. Convert the std_logic_vector signal to signed or unsigned, using the function signed or unsigned
2. Convert the signed or unsigned signal into an integer, using to_integer

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal std_sig : std_logic_vector(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
    ...

Figure 1.23: Using a std_logic_vector as an index to array
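The two-step conversion, and the trip back from integer to std_logic_vector, can be exercised in a short simulation-only sketch. The entity conv_demo, the constant slv, and the chosen values are assumptions for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity conv_demo is
end conv_demo;

architecture sim of conv_demo is
  constant slv : std_logic_vector(7 downto 0) := "00101010";  -- 42 if unsigned
begin
  process
    variable n    : integer;
    variable back : std_logic_vector(7 downto 0);
  begin
    -- std_logic_vector -> unsigned -> integer
    n := to_integer(unsigned(slv));
    assert n = 42 report "to_integer failed" severity failure;

    -- integer -> unsigned -> std_logic_vector
    back := std_logic_vector(to_unsigned(n + 1, 8));
    assert back = "00101011" report "to_unsigned failed" severity failure;

    -- the same bit pattern means different integers as signed and unsigned
    assert to_integer(signed'("11111111")) = -1
      report "signed interpretation failed" severity failure;
    assert to_integer(unsigned'("11111111")) = 255
      report "unsigned interpretation failed" severity failure;
    wait;
  end process;
end sim;
```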
Chapter 2
RTL Design with VHDL: From Requirements to Optimized Code
2.1 Prelude to Chapter

2.1.1

  - design flows
  - dataflow diagrams
  - state machines
  - memory arrays
  - design example
  - optimization
LEC-05: Dataflow Diagrams

Schedule

    wk-01–02   VHDL
                 wk-01  Overview and VHDL
                 wk-02  VHDL Semantics
    wk-03–05   Design and Optimization
                 wk-03  Dataflow Diagrams; State Machines
                 wk-04  Memory Design; Example Design
                 wk-05  Optimizations
    wk-06      Functional Validation
    wk-07      Performance Analysis, Prediction, and Optimization
    wk-08      Timing Analysis
    wk-08–10   Power Analysis and Reduction
    wk-11–12   Faults and Testing
    wk-13      Review

Concepts

  - serial vs parallel algorithms and hardware
  - dataflow diagrams
  - area estimation
  - performance estimation
  - register allocation
  - datapath, register, input, output allocation
  - area / performance tradeoffs
  - scheduling

Reading

Rushton, VHDL for Logic Synthesis (on reserve in DC-Library).

  - Chapter 1: Introduction
  - Chapter 2: The Register Transfer Level Design Cycle
2.2 Design Flow

2.2.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 427. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.
[Figure 2.1: Generic Design Flow. The flow runs from Requirements to Hardware through the artifacts Algorithm, High-Level Model, DP+Ctrl Code (dp/ctrl specific), Opt. RTL Code, and Implementation; each artifact has its own Modify/Analyze loop.]
Design Flow Artifacts

Additional material in notes

Table 2.1: Artifacts in the Design Flow

    Requirements             Description of what the customer wants
    Algorithm                Functional description of computation
    High-Level Model         HDL code with signals and clock cycles
    Dataflow Diagram         Picture of datapath behaviour
    Hardware Block Diagram   Picture of datapath structure
    State Machine            Picture of control behaviour
    DP+Ctrl RTL Code         Synthesizable HDL code
    Optimized RTL Code       HDL code written to meet design goals
    Implementation Code      All of the info to build a specific chip
2.2.2 Implementation Flows

Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe technology-specific parameters of the primitive building blocks (e.g. the delay and area of individual gates, PLAs, CLBs, flops, memory arrays). Mentor Graphics' product Leonardo Spectrum, Cadence's product BuildGates, and Synplicity's product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell separate tools that do place-and-route and other low-level (physical design) tasks.

These general-purpose synthesis tools do not (generally) do the final stages of the design, such as place-and-route and timing analysis, which are very specific to a given implementation technology. The implementation-technology-specific tools generally also produce a VHDL file that accurately models the chip. We will refer to this file as the implementation VHDL code.
Synopsys with Xilinx and Altera

With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file (xnf, Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with routed.vhd.

With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the implementation VHDL file is often .vho, for VHDL output.
Terminology: Behavioural and Structural

NOTE (behavioural and structural models): The phrases "behavioural model" and "structural model" are commonly used for what we'll call "high-level models" and "synthesizable models". In most cases, what people call structural code contains both structural and behavioural code. The technically correct definition of a structural model is an HDL program that contains only component instantiations and generate statements. Thus, even a program with c <= a AND b; is, strictly speaking, behavioural.
2.2.3 Classes of Hardware

Each circuit tends to be dominated by either its datapath, control (state machine), or storage (memory).

Datapath
  - Purpose: compute output data based on input data
  - Each parcel of input produces one parcel of output
  - Examples: arithmetic, decoders

Storage
  - Purpose: hold data for future use
  - Data is not modified while stored
  - Examples: register files, FIFO queues

Control
  - Purpose: modify internal state based on inputs, compute outputs from state and inputs
  - Mostly individual signals, few data (vectors)
  - Examples: bus arbiters, memory-controllers
2.2.4 Design Flow: Datapath vs Control vs Storage

All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1), but the details in the flow differ. This is particularly true for the transition from the high-level model to the model that separates the datapath and control circuitry.

The different classes of circuits all use dataflow diagrams, hardware block diagrams, and state machines. What differs is how much effort is put into each type of description and the order in which the different descriptions are used.
2.2.4.1 Datapath-Centric Design Flow

[Figure 2.2: Datapath-Centric Design Flow. From High-Level Model to DP+Ctrl RTL Code through Dataflow (Modify/Analyze), Block Diagram (Modify/Analyze), and State Machine.]
2.2.4.2 Control-Centric Design Flow

[Figure 2.3: Control-Centric Design Flow. From High-Level Model to DP+Ctrl RTL Code through State Machine, Dataflow Diagram, and Block Diagram, each with a Modify/Analyze loop.]
2.2.4.3 Storage-Centric Design Flow

In E&CE 427, we won't be discussing storage-centric design. Storage-centric design differs from datapath- and control-centric design in that storage-centric design focuses on building many replicated copies of small cells.

Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.
2.3 Dataflow Diagrams and High-Level Models

2.3.1 Overview of Example

Requirement: compute the sum of 6 numbers:

    output = a + b + c + d + e + f

We'll go through the following artifacts:

1. requirements
2. algorithm
3. dataflow diagram
4. hardware block diagram
5. state machine
6. high-level model
2.3.1.1 Software vs Hardware Algorithms

In software, the expression (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as (a + b) + (c + d) + (e + f).

But:
  - hardware runs in parallel
  - in an algorithmic description, parentheses can guide parallel vs serial execution
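In VHDL, the two parenthesizations can be written side by side as concurrent assignments. This is a minimal sketch: the entity sum6, the 8-bit width, and the output names are assumptions for illustration, and a synthesis tool is free to restructure either expression.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical combinational entity with both forms of the sum.
entity sum6 is
  port (
    a, b, c, d, e, f     : in  unsigned(7 downto 0);
    z_serial, z_parallel : out unsigned(7 downto 0)
  );
end sum6;

architecture main of sum6 is
begin
  -- chain: every adder waits for the one before it
  z_serial   <= (((((a + b) + c) + d) + e) + f);
  -- tree: a+b, c+d, and e+f can be computed at the same time
  z_parallel <= ((a + b) + (c + d)) + (e + f);
end main;
```

Both outputs carry the same value, modulo the 8-bit width; only the structure suggested to the tool differs.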
2.3.1.2 Serial vs Parallel

[Figure: dataflow graphs for the serial form (((((a+b)+c)+d)+e)+f), drawn as a chain of five adders, and the parallel form (a+b)+(c+d)+(e+f), drawn as a tree of five adders.]

Performance Estimation

    Serial:    5 adders on longest path (slower)
    Parallel:  3 adders on longest path (faster)

There is more information on performance in section 2.3.3.1, and all of chapter 4 is devoted to performance.
Area Estimation

    Serial:    5 adders used
    Parallel:  5 adders used

Design Comparison

                            Serial        Parallel
    adders on longest path  5 (slower)    3 (faster)
    adders used             5             5
2.3.2 Dataflow Diagrams

A disciplined approach for going beyond combinational logic for datapath-centric circuits.

2.3.2.1 Dataflow Diagrams Overview

Purpose:
  - Provide a disciplined approach for designing datapath-centric circuits
  - Guide the design from algorithm to high-level model
  - Guide the design from high-level model to model with separated datapath and control
  - Estimate area and performance
  - Make tradeoffs between different design options

Background:
  - Based on techniques from high-level synthesis tools
Dataflow Diagrams Overview

[Figure: dataflow diagram for the six-input sum, drawn as a chain of five adders with inputs a to f, intermediate signals x1 to x4, and output z. The slides listed below annotate this same diagram.]

Clock Cycle Boundaries
  Horizontal lines mark clock cycle boundaries.

Latency
  [Figure: first schedule of the diagram.] Latency = 6 clock cycles
  [Figure: second schedule of the diagram.] Latency = 4 clock cycles
  Question: Note the imbalanced clock cycle utilization.

Flip Flops
  Signals crossing clock boundaries are flip-flops.

Registered Inputs and Outputs
  Flops on both inputs and outputs.

Registered Inputs
  Flops on inputs, but not outputs (Latency = 5).

Datapath Components
  Blocks in clock cycles are datapath components.

Inputs
  Unconnected signal tails are inputs.

Outputs
  Unconnected signal heads are outputs.
2.3.2.2 Area Estimation
  - Maximum number of blocks in a clock cycle is total number of that component that are needed
  - Maximum number of signals that cross a cycle boundary is total number of registers that are needed
  - Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
  - Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed
2.3.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs

[Figure sequence: step-by-step execution of the dataflow diagram (inputs a to f, intermediate signals x1 to x5, output z) next to a waveform with clk and clock cycles 0 to 6.]

Execution Without Output Registers

[Figure sequence: the same execution and waveform for the diagram without output registers.]
2.3.3.1 Performance Estimation
Performance Equations

\[ \text{Performance} \propto \frac{1}{\text{TimeExec}} \]

\[ \text{TimeExec} = \text{Latency} \times \text{ClockPeriod} \]

Latency = number of clock cycles from inputs to outputs

There is much more information on performance in chapter 4, which is devoted to performance.
Performance of Dataflow Diagrams

  - Latency: count horizontal lines in diagram
  - Min clock period (max clock speed) limited by longest path in a clock cycle

[Figure: dataflow diagram with one adder per clock cycle, next to a waveform with clk and clock cycles 0 to 6.]
2.3.3.2 Design Analysis

[Figure: dataflow diagram and waveform for the design being analyzed.]

Design Analysis Cont'd

    num inputs         6
    num outputs        1
    num registers      6
    num adders         1
    min clock period   delay through flop and one adder
    latency            5 clock cycles
2.3.4 Area / Performance Tradeoffs

[Figure: two schedules of the six-input sum side by side, one add per clock cycle and two adds per clock cycle.]

NB: In the two-add design, half of the last clock cycle is wasted.
Two Adds per Clock Cycle

[Figure: dataflow diagram and waveform for the schedule with two adds per clock cycle.]
Design Comparison

[Figure: the one-add-per-clock-cycle and two-adds-per-clock-cycle dataflow diagrams side by side.]

                   One add per clock cycle   Two adds per clock cycle
    inputs         6                         6
    outputs        1                         1
    registers      6                         6
    adders         1                         2
    clock period   flop + 1 add              flop + 2 add
    latency        6                         4

Question: Under what circumstances would each of the design options (one add and two add) be the fastest?

Answer: time = latency * clock period; compare execution times for both options.
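As a worked version of the answer, assume the clock period is exactly the flop delay plus the adder delays on the longest path. The symbols t_f (flop delay) and t_a (adder delay) are introduced here for illustration and do not appear in the original notes.

```latex
\begin{align*}
T_{\text{one-add}} &= 6\,(t_f + t_a) = 6\,t_f + 6\,t_a \\
T_{\text{two-add}} &= 4\,(t_f + 2\,t_a) = 4\,t_f + 8\,t_a \\
T_{\text{one-add}} - T_{\text{two-add}} &= 2\,t_f - 2\,t_a
\end{align*}
```

Under these assumptions, the two-add design is faster when t_f > t_a (the flop overhead dominates), the one-add design is faster when t_a > t_f, and the two are equal when t_f = t_a.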
2.3.5 Optimize Inputs and Outputs

If the design currently stores all of its inputs, and the environment's behaviour can be changed to delay sending some inputs, then the number of inputs and registers can be reduced.

[Figure: the one-add dataflow diagram before and after I/O optimization.]

              One-add before I/O opt   One-add after I/O opt
    inputs    6                        2
    regs      6                        2
Design Comparison

[Figure: the one-add and two-add dataflow diagrams after I/O optimization.]

                   One-add after I/O opt   Two-add after I/O opt
    inputs         2                       3
    outputs        1                       1
    registers      2                       3
    adders         1                       2
    clock period   flop + 1 add            flop + 2 add
    latency        6                       4
2.3.6 From Dataflow Diagram to High-Level Model

Here we illustrate the process of going from a dataflow diagram to a high-level model. In the high-level model the entire circuit will be implemented in a single process. For larger circuits it may be beneficial to have separate processes for different groups of signals.

High-level models are distinguished from lower-level models in that the code for the datapath and control are intermingled. In a high-level model of a datapath-centric circuit, there will probably not be any code devoted to the state machine.
Hardware Recipe for Two-Add

Table 2.2: Hardware Recipe for Two-Add

    inputs                                 3
    adders                                 2
    registers                              3
    output                                 1
    registered inputs                      YES
    registered outputs                     YES
    clock cycles from inputs to outputs    4
High-Level Models of Datapaths

The following two fragments of VHDL code (the hlm and hlm2 architectures) are derived directly from the dataflow diagram labeled "Two-add after I/O opt" in section 2.3.5, after input/output, datapath, and register allocation have been done. The code between wait statements describes the work that is done in a clock cycle.

The hlm architecture combines the datapath and control in a single process with multiple wait statements in the process. Because the process is clocked, all of the signals that are assigned to in the process are registers. Combinational signals would need to be done using concurrent assignments or combinational processes.

The hlm2 architecture is derived from the hlm architecture by separating combinational and registered signals.
High-Level Model with Single Process

    architecture hlm of big_add is
      -- (declarations of r1, r2, r3 not shown)
    begin
      process begin
        -------------------------------
        wait until rising_edge(clk);
        -------------------------------
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        -------------------------------
        wait until rising_edge(clk);
        -------------------------------
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        r3 <= i3;
        -------------------------------
        wait until rising_edge(clk);
        -------------------------------
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        -------------------------------
        wait until rising_edge(clk);
        -------------------------------
        r3 <= (r1 + r2);
      end process;
      o1 <= r3;
    end hlm;
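To see the four-cycle schedule run, the hlm architecture can be wrapped in an entity and driven by a small self-checking testbench. Everything outside the process body is an assumption: the notes do not give the big_add entity, so the port list and 8-bit unsigned types are guesses, and the testbench big_add_tb and its input values are made up for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Assumed entity: port names follow the code above, widths are a guess.
entity big_add is
  port (
    clk        : in  std_logic;
    i1, i2, i3 : in  unsigned(7 downto 0);
    o1         : out unsigned(7 downto 0)
  );
end big_add;

architecture hlm of big_add is
  signal r1, r2, r3 : unsigned(7 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    r1 <= i1; r2 <= i2; r3 <= i3;                -- capture a, b, c
    wait until rising_edge(clk);
    r1 <= (r1 + r2) + r3; r2 <= i2; r3 <= i3;    -- a+b+c; capture d, e
    wait until rising_edge(clk);
    r1 <= (r1 + r2) + r3; r2 <= i2;              -- ...+d+e; capture f
    wait until rising_edge(clk);
    r3 <= (r1 + r2);                             -- final sum
  end process;
  o1 <= r3;
end hlm;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity big_add_tb is
end big_add_tb;

architecture sim of big_add_tb is
  signal clk        : std_logic := '0';
  signal done       : boolean   := false;
  signal i1, i2, i3 : unsigned(7 downto 0) := (others => '0');
  signal o1         : unsigned(7 downto 0);
begin
  -- 10 ns clock that stops toggling once the test is done
  clk <= not clk after 5 ns when not done else '0';

  dut : entity work.big_add(hlm)
    port map (clk => clk, i1 => i1, i2 => i2, i3 => i3, o1 => o1);

  process begin
    -- a, b, c = 1, 2, 3 are sampled on the first rising edge
    i1 <= to_unsigned(1, 8); i2 <= to_unsigned(2, 8); i3 <= to_unsigned(3, 8);
    wait until rising_edge(clk);
    -- d, e = 4, 5 are sampled on the second rising edge
    i2 <= to_unsigned(4, 8); i3 <= to_unsigned(5, 8);
    wait until rising_edge(clk);
    -- f = 6 is sampled on the third rising edge
    i2 <= to_unsigned(6, 8);
    wait until rising_edge(clk);
    wait until rising_edge(clk);   -- fourth edge: r3 receives the sum
    wait for 1 ns;
    assert o1 = 21 report "expected 1+2+3+4+5+6 = 21" severity failure;
    done <= true;
    wait;
  end process;
end sim;
```

The testbench changes its inputs just after each rising edge, so the design samples the previous values on that edge. This matches the assumption that the environment delivers d, e, and f in later clock cycles.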
High-Level Model with Combinational Signals

    architecture hlm2 of big_add is
      -- (declarations of r1, r2, r3, a1, a2 not shown)
    begin
      ----------------------------------
      process begin
        ----------------------------
        wait until rising_edge(clk);
        ----------------------------
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        ----------------------------
        wait until rising_edge(clk);
        ----------------------------
        r1 <= a2;
        r2 <= i2;
        r3 <= i3;
        ----------------------------
        wait until rising_edge(clk);
        ----------------------------
        r1 <= a2;
        r2 <= i2;
        ----------------------------
        wait until rising_edge(clk);
        ----------------------------
        r3 <= a1;
      end process;
      ----------------------------------
      a1 <= r1 + r2;
      a2 <= a1 + r3;
      ----------------------------------
      o1 <= r3;
      ----------------------------------
    end hlm2;
2.3.7 From Dataflow Diagram to DP+Ctrl Model
Dataflow Diagram and Datapath Blocks

[Figure 2.4: Dataflow diagram and building blocks for block diagram. The two-add dataflow diagram (inputs a to f, intermediate signals x1 to x4, output z) next to two unconnected adder blocks.]
I/O Allocation, Datapath Allocation, Register Allocation

[Figure sequence: the dataflow diagram annotated in turn with hardware inputs and outputs (i1, i2, i3, o1), adders (a1, a2), and registers (r1, r2, r3), next to the matching hardware blocks.]

Allocation Completed

    I/O Allocation        i1: a          i2: b, d, f    i3: c, e    o1: z
    Datapath Allocation   a1: x1, x3, z  a2: x2, x4
    Register Allocation   r1: a, x2, x4  r2: b, d, f    r3: c, e

Figure 2.5: Block diagram after I/O, datapath, and register allocation
Connect the Blocks

To connect the blocks:
  - Simulate the dataflow diagram, drawing connections between blocks when they communicate

Connect the Blocks and Add Muxes

  - Simulate the dataflow diagram, drawing connections between blocks when they communicate
  - Add muxes when a signal has multiple drivers

[Figure sequence: the annotated dataflow diagram is stepped through one clock cycle at a time while connections between i1, i2, i3, r1, r2, r3, a1, a2, and o1 are added to the block diagram.]

Done with Simulation

[Figure: block diagram with all connections and muxes in place.]
Add State Machine

The state machine keeps track of which clock cycle of the dataflow diagram is currently being executed.

  - Simulate the dataflow diagram, drawing connections between blocks when they communicate
  - Add muxes when a signal has multiple drivers
  - Clean up drawing, add state machine (control)

The state machine drives the datapath signals whose values are dependent upon which clock cycle of the dataflow diagram is being executed. Typical examples are:

  - Select signals on multiplexers
  - Instruction signals on arithmetic modules
  - Chip-enable lines on registers and flip-flops

[Figure: block diagram with a ctrl block added, next to the annotated dataflow diagram.]
LEC-05:
2.3.7
87
Classes of Hardware
[Figure: block diagram of the example circuit. Datapath: adders a1 and a2. Storage: registers r1, r2, r3. Control: ctrl. Inputs i1, i2, i3; output o1.]
Figure 2.6: Classes of hardware in example circuit

2.3.7.1 Datapath for DP+Ctrl Model

The following VHDL code is derived directly from the block diagram in figure 2.6.

architecture main of big_add is
  -- signal declarations omitted
begin
  fsm : process ... end process;

  -- r1 loads from input i1 or from adder a2
  process (clk) begin
    if rising_edge(clk) then
      if r1_gets_in = '1' then
        r1 <= i1;
      else
        r1 <= a2;
      end if;
    end if;
  end process;

  -- r2 always loads from input i2
  process (clk) begin
    if rising_edge(clk) then
      r2 <= i2;
    end if;
  end process;

  -- r3 loads from input i3 or from adder a1
  process (clk) begin
    if rising_edge(clk) then
      if r3_gets_in = '1' then
        r3 <= i3;
      else
        r3 <= a1;
      end if;
    end if;
  end process;

  a1 <= r1 + r2;
  a2 <= a1 + r3;
  o1 <= r3;
end main;

In section 2.4, we'll discuss how to build the control circuitry (finite state machine, represented by the fsm process).

From Dataflow to Hardware (Almost)

1. Create dataflow diagram
2. Optimize inputs and outputs
3. I/O allocation: assign dataflow signals to hardware inputs and outputs
4. Datapath allocation: assign dataflow blocks to components
5. Register allocation: assign dataflow signals to registers
6. Derive high-level model
7. Connect the hardware, add muxes where needed
8. Derive datapath for DP+Ctrl model
9. Build the state machine, connect to datapath + storage

2.3.8 Dataflow Diagram Scheduling

Schedule: move functional blocks between clock cycles

- Allows tradeoffs between performance and area

NOTE: Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms
NOTE: Serial algorithms tend to have less area than parallel algorithms

Serial: (((((a+b)+c)+d)+e)+f)

[Figure: serial chain of five adders over inputs a to f, followed by a second arrangement of the additions over the same inputs]

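The difference between a serial and a parallel form of the same sum can be sketched in VHDL as below. This is only an illustration: the entity name add6, the 8-bit width, and the particular tree grouping are assumptions, not taken from the notes, and a synthesis tool is free to restructure either expression.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: six operands, two ways of associating the additions.
entity add6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);  -- width is an assumption
    z_serial, z_tree : out unsigned(7 downto 0)
  );
end add6;

architecture main of add6 is
begin
  -- serial: five adders in a chain, longest path passes through five adders
  z_serial <= ((((a + b) + c) + d) + e) + f;

  -- tree: also five adders, but the longest path passes through three
  z_tree   <= ((a + b) + (c + d)) + (e + f);
end main;
```

Both outputs carry the same value (modulo 2^8). The tree form leaves more freedom when the additions are later moved between clock cycles, because fewer of them depend on each other.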
Design Analysis
[Figure: dataflow diagram of the additions over inputs a to f]

clock period   1 add
num adders     3

Scheduling to Optimize Area

[Figure: dataflow diagram before scheduling (original, inputs a to f) and after scheduling]

original:          6   1   6   3   flop + 1 add   3
after scheduling:  4   1   4   2   flop + 1 add   3

2.3.9 Summary: From Dataflow to Hardware

1. Create dataflow diagram
2. Schedule data operations
3. Optimize inputs and outputs
4. I/O allocation: assign dataflow signals to hardware inputs and outputs
5. Datapath allocation: assign dataflow blocks to components
6. Register allocation: assign dataflow signals to registers
7. Derive high-level model
8. Connect the hardware, add muxes where needed
9. Derive datapath for DP+Ctrl model
10. Build the state machine, connect to datapath + storage

LEC-06: State Machine Design

Schedule
wk-03 05
wk-06 wk-07 wk-08 wk-08 10 wk-11 12 wk-13
Overview
This lecture builds on material from Lec-05, where dataflow diagrams were introduced. The bulk of the lecture discusses finite state machine design: first how to build a state machine from a dataflow diagram, and then various ways of coding state machines in VHDL.
Concepts
- input/output protocols
- deriving finite state machines from dataflow diagrams
- coding state machines in VHDL
- state encoding
- explicit state machines
- implicit state machines
Background
Mano Digital Design
- Section 6-4: Analysis of Clocked Sequential Circuits
- Section 6-5: State Reduction and Assignment
- Section 6-7: Design Procedure
Reading
Smith ASIC
- By now, you should be done with Chapter 8 (Programmable ASIC Design Software) and Chapter 10 (VHDL)
- Section 12.2: Synthesis (From Lec-02)
- Section 12.6: VHDL Logic Synthesis (From Lec-02)
- Section 12.7: Finite State Machine Synthesis
Rushton VHDL for Logic Synthesis (On reserve in DC-Library)
- Chapter 8: Sequential VHDL
- Chapter 9: Registers
- Section 12.2: Finite State Machines

2.4 Finite State Machines in VHDL

2.4.1 Mealy vs Moore State Machines

Moore Machines
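- Outputs are dependent upon only the state
- No combinational paths from inputs to outputs
- Outputs can be either flops or combinational

[Figure: Moore state diagram, each state labelled state/output: s0/0 goes to s1/1 on a and to s2/0 on !a; s1 and s2 go to s3/0; s3 returns to s0]
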
Mealy Machines
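- Outputs are dependent upon both the state and the inputs
- Combinational paths from inputs to outputs
- Outputs must be combinational

[Figure: Mealy state diagram, each edge labelled input/output: s0 goes to s1 on a/1 and to s2 on !a/0; s1 and s2 go to s3 with output 0; s3 returns to s0 with output 0]

2.4.2 State Machines and VHDL
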
A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.

Design Decisions

- Moore vs Mealy (Sections 2.4.3.1 and 2.4.3.2)
- Implicit vs Explicit (Section 2.4.6)
- State values in explicit state machines: enumerated type vs constants (Section 2.4.4.1)
- State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.4.4.2)

How to Steer a State Machine

The following VHDL control constructs are useful to steer the transition from state to state:
- if ... then ... else
- case
- for ... loop
- while ... loop
- loop
- next
- exit

2.4.2.1 Implicit and Explicit State Machines

There are two general ways to code state machines: implicit and explicit.

Implicit State Machines

Some state machines do not have a specific state signal. Instead, the state machine uses multiple wait statements in a process to control the values driving the control signals needed by the datapath. These state machines are called implicit. The synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg.

Explicit State Machines

The alternative to an implicit state machine is an explicit style, where the engineer defines a signal to represent the state and provides code to store and update the state signal. In this case, each process has at most one wait statement.

2.4.3 Some Simple State Machines

2.4.3.1 Implementing a Simple Moore Machine

Entity and Diagram

[Figure: Moore state diagram: s0/0 goes to s1/1 on a and to s2/0 on !a; s1 and s2 go to s3/0; s3 returns to s0]

entity moore is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end moore;

Implicit State Machine

architecture main of moore is
begin
  process begin
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
    else
      z <= '0';
    end if;
    wait until rising_edge(clk);
    z <= '0';
    wait until rising_edge(clk);
  end process;
end main;

Explicit with Flopped Outputs

architecture main of moore_v2 is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  -- z is flopped, so each branch assigns the output of the NEXT state
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z     <= '1';
          else
            state <= s2;
            z     <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z     <= '0';
        when s3 =>
          state <= s0;
          z     <= '0';  -- s0 outputs 0 in the state diagram
      end case;
    end if;
  end process;
end main;

Explicit with Combinational Outputs

architecture main of moore_v3 is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;

  z <= '1' when (state = s1) else '0';
end main;

State Machine with Next Signals

architecture main of moore_v4 is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;

  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;

  z <= '1' when (state = s1) else '0';
end main;

Explicit with Combinational Process

architecture main of moore_v4 is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;

  process (state, a) begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;

  z <= '1' when (state = s1) else '0';
end main;

2.4.3.2 Implementing a Simple Mealy Machine

Entity and Diagram

[Figure: Mealy state diagram: s0 goes to s1 on a/1 and to s2 on !a/0; s1 and s2 go to s3 with output 0; s3 returns to s0 with output 0]

entity mealy is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end mealy;

Implicit State Machine

architecture main of mealy is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process begin
    state <= s0;
    wait until rising_edge(clk);
    if (a = '1') then
      state <= s1;
    else
      state <= s2;
    end if;
    wait until rising_edge(clk);
    state <= s3;
    wait until rising_edge(clk);
  end process;

  z <= '1' when (state = s0) and (a = '1') else '0';
end main;

Explicit State Machine

architecture main of mealy_v2 is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when others =>
          state <= s0;
      end case;
    end if;
  end process;

  z <= '1' when (state = s0) and (a = '1') else '0';
end main;

State Machine with Next Signal

architecture main of mealy_v3 is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;

  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;

  z <= '1' when (state = s0) and (a = '1') else '0';
end main;

2.4.4 State Encoding

2.4.4.1 Constants vs Enumerated Type

Using an enumerated type:

type state_ty is (s0, s1, s2, s3);
signal state : state_ty;

Using constants:

subtype state_ty is std_logic_vector(1 downto 0);
constant s0 : state_ty := "11";
constant s1 : state_ty := "10";
constant s2 : state_ty := "00";
constant s3 : state_ty := "01";
signal state : state_ty;

Providing Encodings for Enumerated Types

Many synthesizers allow the user to provide hints on how to encode the states, or allow the user to provide the desired encoding explicitly. These hints are given either through VHDL attributes or special comments in the code.

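As a sketch of the attribute route, the fragment below attaches an encoding to an enumerated type using an attribute named enum_encoding. That name is the one described in the IEEE 1076.6 synthesis standard and accepted by some tools; whether a given synthesizer honours it, or expects a differently named attribute or a special comment, is an assumption to check against the tool's manual. The package name state_pkg is hypothetical.

```vhdl
-- Hypothetical package: an enumerated state type with an encoding hint.
package state_pkg is
  type state_ty is (s0, s1, s2, s3);

  -- User-defined attribute; tools that recognise it read the string as
  -- one code per enumeration literal, in declaration order.
  attribute enum_encoding : string;
  attribute enum_encoding of state_ty : type is "11 10 00 01";
end state_pkg;
```

A simulator ignores the attribute, so functional simulation still shows s0 to s3 by name; only synthesis is affected.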
Simulation
When doing functional simulation with enumerated types, simulators often display waveforms with pretty-printed values rather than bits (e.g. s0 and s1 rather than 11 and 10). However, when simulating a design that has been mapped to gates, the enumerated type disappears and you are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very difficult to debug the design.

Covering All Cases

When writing case statements or selected assignments that test the value of std_logic signals, you will get an error unless you include a provision for values other than '1' and '0'. For example:

signal t : std_logic;
...
case t is
  when '1' => ...
  when '0' => ...
end case;

will result in an error message about missing cases. You must provide for t being 'H', 'U', etc. The simplest thing to do is to make the last test when others. However, this opens you up to potential bugs if the enumerated type you are testing grows to include more values, which then end up unintentionally executing your when others branch, rather than having a special branch of their own in the case statement.

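A minimal sketch of a case statement that covers every std_logic value is shown below. The entity sel2 and its ports are made up for illustration; driving 'X' in the others branch is one possible choice, not the only one.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-way selector controlled by t.
entity sel2 is
  port (
    t, x, y : in  std_logic;
    q       : out std_logic
  );
end sel2;

architecture main of sel2 is
begin
  process (t, x, y) begin
    case t is
      when '1'    => q <= x;
      when '0'    => q <= y;
      when others => q <= 'X';  -- covers 'U', 'X', 'Z', 'W', 'L', 'H', '-'
    end case;
  end process;
end main;
```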
Unused Values
If the number of values you have in your datatype is not a power of two, then you will have some unused values that are representable. For example:

subtype state_ty is std_logic_vector(2 downto 0);
constant s0 : state_ty := "010";
constant s1 : state_ty := "000";
constant s2 : state_ty := "001";
constant s3 : state_ty := "011";
constant s4 : state_ty := "101";
signal state : state_ty;

This type only needs five unique values, but can represent eight different values. What should we do with the three representable values that we don't need? The safest thing to do is to code your design so that if an illegal value is encountered, the machine resets or enters an error state.

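One way to code the "reset on illegal value" advice is a when others branch in the state register process, sketched below. The entity safe_fsm, the err output, the simple s0 to s4 cycle, and the particular encodings are all assumptions made for the sake of a compilable example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical five-state machine that recovers from unused encodings.
entity safe_fsm is
  port (
    clk : in  std_logic;
    err : out std_logic   -- pulses when an illegal state value was seen
  );
end safe_fsm;

architecture main of safe_fsm is
  subtype state_ty is std_logic_vector(2 downto 0);
  constant s0 : state_ty := "010";  -- illustrative encodings, all distinct
  constant s1 : state_ty := "000";
  constant s2 : state_ty := "001";
  constant s3 : state_ty := "011";
  constant s4 : state_ty := "101";
  signal state : state_ty := s0;
begin
  process (clk) begin
    if rising_edge(clk) then
      err <= '0';
      case state is
        when s0 => state <= s1;
        when s1 => state <= s2;
        when s2 => state <= s3;
        when s3 => state <= s4;
        when s4 => state <= s0;
        when others =>            -- "100", "110", "111" and non-binary values
          state <= s0;            -- return to the initial state
          err   <= '1';
      end case;
    end if;
  end process;
end main;
```

Some synthesis tools optimise away unreachable branches, so whether the recovery logic survives synthesis depends on tool settings.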
2.4.4.2 Encoding Schemes
- Binary: Conventional binary counter.
- One-hot: Exactly one bit is asserted at any time.
- Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding where the initial state is all 0s.
- Gray: Transition between adjacent values requires exactly one bit flip.
- Custom: Choose encoding to simplify combinational logic for specific task.

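For a four-state machine, the binary, Gray and one-hot schemes could be written as constants along the following lines. The package and constant names are hypothetical, and the specific bit patterns are just one valid instance of each scheme.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package: three encodings of the same four states.
package encodings_pkg is
  -- binary: conventional count 0, 1, 2, 3 (two flip-flops)
  constant bin_s0 : std_logic_vector(1 downto 0) := "00";
  constant bin_s1 : std_logic_vector(1 downto 0) := "01";
  constant bin_s2 : std_logic_vector(1 downto 0) := "10";
  constant bin_s3 : std_logic_vector(1 downto 0) := "11";

  -- Gray: consecutive states (and s3 back to s0) differ in exactly one bit
  constant gray_s0 : std_logic_vector(1 downto 0) := "00";
  constant gray_s1 : std_logic_vector(1 downto 0) := "01";
  constant gray_s2 : std_logic_vector(1 downto 0) := "11";
  constant gray_s3 : std_logic_vector(1 downto 0) := "10";

  -- one-hot: exactly one bit asserted per state (four flip-flops)
  constant hot_s0 : std_logic_vector(3 downto 0) := "0001";
  constant hot_s1 : std_logic_vector(3 downto 0) := "0010";
  constant hot_s2 : std_logic_vector(3 downto 0) := "0100";
  constant hot_s3 : std_logic_vector(3 downto 0) := "1000";
end encodings_pkg;
```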
Tradeoffs in Encoding Schemes
- Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).
- One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.
- Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.

2.4.5 From Dataflow to State Machine

This section designs the state machine for the big_add example used in dataflow diagrams (Section 2.3.7). We pick up from the VHDL code for the datapath in section 2.3.7.1.

[Figure: scheduled dataflow diagram for big_add, with the ctrl block driving the datapath]

Two control signals from state machine:
  r1_gets_in   r1 reads from input or a2
  r3_gets_in   r3 reads from input or a1

Simulate dataflow diagram and record required values of signals.

  cycle        1      2      3      4
  r1_gets_in   true   false  false
  r3_gets_in   true   true          false

Don't Care Values

NOTE: Don't care values. In cycle 3, we don't care what the value of r3_gets_in is. In cycle 4, we don't care what the value of r1_gets_in is. So we assign these signals '-' in these clock cycles, which is "don't care" in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 427 have used '-' quite successfully to decrease the area of their design. However, a few groups found that using '-' increased the size of their design, when they were expecting it to decrease the size. So, if you are tweaking your design to squeeze out the last few unneeded bits of area, pay close attention to whether using '-' hurts or helps.

2.4.6 Implicit vs Explicit State Machines
There are two broad categories of state machines in VHDL: explicit and implicit. Explicit state machines are a direct translation of the hardware: concurrent assignments for the next-state equations and a clocked process for the flops that hold the state. Implicit state machines are built with processes that have multiple wait statements. Explicit state machines are more cumbersome to write, but they are simpler to synthesize and more commonly used. Implicit state machines are concise and readable. Very few books or synthesis manuals describe multiple-wait-statement processes, but they are relatively well supported among synthesis tools.

2.4.7

Several examples of implicit state machines that could be used to drive r1_gets_in and r3_gets_in.

2.4.7.1 Multi-Wait Process

This example directly controls the signals from a multi-wait process. A process that contains wait statements cannot also have a sensitivity list, so none is given.

process begin
  -- cycle 1
  wait until rising_edge(clk);
  r1_gets_in <= '1';
  r3_gets_in <= '1';
  -- cycle 2
  wait until rising_edge(clk);
  r1_gets_in <= '0';
  r3_gets_in <= '1';
  -- cycle 3
  wait until rising_edge(clk);
  r1_gets_in <= '0';
  r3_gets_in <= '-';
  -- cycle 4
  wait until rising_edge(clk);
  r1_gets_in <= '-';
  r3_gets_in <= '0';
end process;

2.4.7.2 Counter
This example uses a counter in a process to keep track of the state, and then uses concurrent assignments for the control signals. The assignments to r1_gets_in and r3_gets_in could be done with conditional assignments, or a combinational process. Some of these alternatives are illustrated in section 2.4.8.

process begin
  cycle_count <= to_unsigned(0, 2);
  wait until rising_edge(clk);
  while 3 > cycle_count loop
    cycle_count <= cycle_count + 1;
    wait until rising_edge(clk);
  end loop;
end process;

-- r1 loads from the input in cycle 1 only
with cycle_count select
  r1_gets_in <= '1' when to_unsigned(0, 2),
                '0' when others;

-- r3 loads from the input in cycles 1 and 2
with cycle_count select
  r3_gets_in <= '1' when to_unsigned(0, 2) | to_unsigned(1, 2),
                '0' when others;

2.4.8
This is an explicit state machine. A clocked process is used to store the state and a concurrent assignment is used to calculate the next state. The datapath is the same as in section 2.3.6. The control signals for the datapath (r1_gets_in and r3_gets_in) drive the two multiplexers, one for each register (r1 and r3). The values of r1_gets_in and r3_gets_in are determined by the current state of the machine. In this section we first write the explicit state machine, and then look at several different coding styles for communicating between the state machine and datapath.

2.4.8.1 State Machine

This is the explicit state machine. It stays the same for all of the different examples here.

architecture main of big_add is
  type state_ty is (S0, S1, S2, S3);
  signal state_cur, state_nxt : state_ty;
  ...
begin
  process (clk) begin
    if rising_edge(clk) then
      state_cur <= state_nxt;
    end if;
  end process;

  with state_cur select
    state_nxt <= S1 when S0,
                 S2 when S1,
                 S3 when S2,
                 S0 when S3;

  ...r1_gets_in asn...
  ...r3_gets_in asn...
  ...datapath...
end main;

2.4.8.2 Conditional Assignment

The first coding example uses simple conditional assignments.

r1_gets_in <= '1' when state_cur = S0
         else '0';
r3_gets_in <= '1' when (state_cur = S0) or (state_cur = S1)
         else '0';

2.4.8.3 Conditional Assignment with Dont Care

The simple conditional assignment doesn't take advantage of the fact that the last state doesn't use the adder a1, so we don't care whether r1 reads from the input or from a2. We give the synthesis tool a chance to simplify the equations for r1_gets_in (and thereby hopefully reduce area) by putting a don't care value for r1_gets_in in the last state.

r1_gets_in <= '1' when state_cur = S0
         else '0' when (state_cur = S1) or (state_cur = S2)
         else '-';
r3_gets_in <= '1' when (state_cur = S0) or (state_cur = S1)
         else '0' when (state_cur = S3)
         else '-';

2.4.8.4 Selected Assignment with Dont Care

The conditional assignment code has many occurrences of state_cur in the conditions, which is ugly. So, use a case-like statement (the selected assignment).

with state_cur select
  r1_gets_in <= '1' when S0,
                '0' when S1 | S2,
                '-' when others;

with state_cur select
  r3_gets_in <= '0' when S3,
                '1' when S0 | S1,
                '-' when others;

2.4.8.5 Case Statement

The selected assignment code tests state_cur for both assignments, so try a case statement in a process, which allows multiple assignments within the case statement.

process (state_cur) begin
  case state_cur is
    when S0 =>
      r1_gets_in <= '1';
      r3_gets_in <= '1';
    when S1 =>
      r1_gets_in <= '0';
      r3_gets_in <= '1';
    when S2 =>
      r1_gets_in <= '0';
      r3_gets_in <= '-';
    when S3 =>
      r1_gets_in <= '-';
      r3_gets_in <= '0';
  end case;
end process;

Summary and Conclusion

After writing out the different options, the selected assignment style looks to be the best option for this example. The code is short, clean and easy to understand.
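
To show how the pieces fit together, here is a sketch of a complete big_add that combines the datapath of section 2.3.7.1 with the explicit state machine and the selected-assignment style. The entity ports, the 8-bit width and the initial value on state_cur are assumptions added to make the fragment self-contained; the notes do not give them.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity big_add is
  port (
    clk        : in  std_logic;
    i1, i2, i3 : in  unsigned(7 downto 0);  -- width is an assumption
    o1         : out unsigned(7 downto 0)
  );
end big_add;

architecture main of big_add is
  type state_ty is (S0, S1, S2, S3);
  signal state_cur              : state_ty := S0;  -- assumed start state, no reset
  signal state_nxt              : state_ty;
  signal r1_gets_in, r3_gets_in : std_logic;
  signal r1, r2, r3, a1, a2     : unsigned(7 downto 0);
begin
  -- state register
  process (clk) begin
    if rising_edge(clk) then
      state_cur <= state_nxt;
    end if;
  end process;

  -- next-state logic: S0 -> S1 -> S2 -> S3 -> S0
  with state_cur select
    state_nxt <= S1 when S0,
                 S2 when S1,
                 S3 when S2,
                 S0 when S3;

  -- control signals, with don't care where the loaded value is unused
  with state_cur select
    r1_gets_in <= '1' when S0,
                  '0' when S1 | S2,
                  '-' when others;
  with state_cur select
    r3_gets_in <= '1' when S0 | S1,
                  '0' when S3,
                  '-' when others;

  -- storage: r1 and r3 have input muxes, r2 always loads i2
  process (clk) begin
    if rising_edge(clk) then
      if r1_gets_in = '1' then r1 <= i1; else r1 <= a2; end if;
      r2 <= i2;
      if r3_gets_in = '1' then r3 <= i3; else r3 <= a1; end if;
    end if;
  end process;

  -- datapath
  a1 <= r1 + r2;
  a2 <= a1 + r3;
  o1 <= r3;
end main;
```

In simulation a '-' on a control signal simply fails the comparison with '1', so the else branch is taken; the freedom to pick either value only matters to the synthesis tool.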

2.4.9 Reset
All circuits should have a reset signal that puts the circuit back into a good initial state. There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
Reset with Implicit State Machine

With an implicit state machine, we need to insert a loop in the process and test for reset after each wait statement.

process begin
  init : loop                              -- outermost loop
    cycle_count <= to_unsigned(0, 2);
    wait until rising_edge(clk);
    next init when (reset = '1');          -- test for reset
    while 3 > cycle_count loop
      cycle_count <= cycle_count + 1;
      wait until rising_edge(clk);
      next init when (reset = '1');        -- test for reset
    end loop;
  end loop;
end process;

Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state:

process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state_cur <= S0;
    else
      state_cur <= state_nxt;
    end if;
  end if;
end process;

2.4.10 Input / Output Protocols

An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

Four phase handshaking protocol

[Figure: waveforms of rdy, data and ack]
Figure 2.7: Four phase handshaking protocol

Used when timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.

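The producer side of a four phase handshake can be sketched as a small explicit state machine. The signal names rdy, data and ack follow Figure 2.7; everything else (the entity name hs_producer, the send and din ports, the reset, the 8-bit width, and the assumption that ack is already synchronous to clk) is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical producer for a four phase handshake.
entity hs_producer is
  port (
    clk, reset : in  std_logic;
    send       : in  std_logic;                     -- request to transfer din
    din        : in  std_logic_vector(7 downto 0);
    ack        : in  std_logic;                     -- assumed synchronous to clk
    rdy        : out std_logic;
    data       : out std_logic_vector(7 downto 0)
  );
end hs_producer;

architecture main of hs_producer is
  type state_ty is (idle, wait_ack, wait_release);
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= idle;
        rdy   <= '0';
      else
        case state is
          when idle =>                 -- phase 1: present data, raise rdy
            if send = '1' then
              data  <= din;
              rdy   <= '1';
              state <= wait_ack;
            end if;
          when wait_ack =>             -- phase 2: consumer raises ack
            if ack = '1' then
              rdy   <= '0';            -- phase 3: producer drops rdy
              state <= wait_release;
            end if;
          when wait_release =>         -- phase 4: consumer drops ack
            if ack = '0' then
              state <= idle;
            end if;
        end case;
      end if;
    end if;
  end process;
end main;
```

Each transfer costs at least four clock cycles, which illustrates why the protocol is described as slow to execute.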
Valid-bit protocol
[Figure: waveforms of clk, valid and data]
Figure 2.8: Valid-bit protocol

A low overhead (both in area and performance) protocol. Consumer must always be able to accept incoming data. Often used in pipelined circuits. More complicated versions of the protocol can handle pipeline stalls.

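A single pipeline stage using the valid-bit protocol might look like the sketch below. The entity name valid_stage, the port names, the 8-bit width and the placeholder "+ 1" computation are assumptions chosen only to make the idea concrete.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical pipeline stage: the valid bit travels alongside the data.
entity valid_stage is
  port (
    clk       : in  std_logic;
    valid_in  : in  std_logic;
    data_in   : in  unsigned(7 downto 0);
    valid_out : out std_logic;
    data_out  : out unsigned(7 downto 0)
  );
end valid_stage;

architecture main of valid_stage is
begin
  process (clk) begin
    if rising_edge(clk) then
      valid_out <= valid_in;        -- one-cycle delay, same as the data
      if valid_in = '1' then
        data_out <= data_in + 1;    -- placeholder computation
      end if;
    end if;
  end process;
end main;
```

There is no path back to the producer, which is why the consumer must always be able to accept data.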
Start/Done Protocol
[Figure: waveforms of clk, start, data_in, done and data_out]
Figure 2.9: Start/Done protocol

A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece of data at a time and the time to compute the result is unpredictable.

LEC-07: Memory Design

Lecture Notes Sections: 2.5 to 2.5.2.6
Schedule
wk-03 05
wk-06 wk-07 wk-08 wk-08 10 wk-11 12 wk-13
Overview
This lecture builds on material from Lec-05, where dataflow diagrams were introduced. In this lecture, we show how to deal with memory reads and writes in dataflow diagrams. This ties in with data hazards in computer architecture.
Concepts
- memory arrays in dataflow diagrams
- data dependencies and hazards
- memory arrays in VHDL
Background
Reading
Smith ASIC
- Section 12.8: Memory Synthesis
- The remainder of Chapter 12

2.5 Memory Arrays and RTL Design

2.5.1 Memory Arrays and Dataflow Diagrams

2.5.1.1 Legend for Dataflow Diagrams

[Figure: legend of dataflow diagram symbols, each drawn around a name: input port, output port, state signal, array read name(rd), array write name(wr)]

2.5.1.2 Basic Memory Operations

[Figure: dataflow symbols for the two basic operations. The read, mem(rd), takes mem and addr and produces data, plus an anti-dependency arrow carrying mem. The write, mem(wr), takes mem, data and addr and produces mem.]

Memory Read:   data := mem[addr];
Memory Write:  mem[addr] := data;

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

Basic Memory Operations (Cont'd)

There are a few aspects of the basic memory operations that are potentially surprising:

- The anti-dependency arrow producing mem on a read.
- Reads and writes are dependent upon the entire previous value of the memory array.
- The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.5.1.4. The apparent dependency on and production of an entire memory array is because we don't know which address in the array will be read from or written to. There are optimizations that can be performed when we know the address (Section 2.5.1.5).

2.5.1.3 Data Dependencies

Instructions in a program can be reordered, so long as the data dependencies are preserved.

Data Dependencies (Cont'd)

Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

Initial Program

  M[2] := 21
  M[3] := 31
  A := M[2]
  B := M[0]
  M[3] := 32
  M[0] := 01
  C := M[3]

Initial Program with Dependencies

[Figure: the initial program with arrows marking the data dependencies between instructions]

Valid Modification

  M[2] := 21
  B := M[0]
  A := M[2]
  M[3] := 31
  M[3] := 32
  M[0] := 01
  C := M[3]

Valid (or Bad?) Modification

  M[2] := 21
  B := M[0]
  A := M[2]
  M[3] := 31
  C := M[3]
  M[3] := 32
  M[0] := 01

2.5.1.4 Definition of Three Types of Dependencies

There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

  Read after Write      Write after Write     Write after Read
  M[i] := ...           M[i] := ...           ... := M[i]
  ... := M[i]           M[i] := ...           M[i] := ...

2.5.1.5 Dataflow Diagrams and Data Dependencies

Read after Write Dependencies

Algo:
  mem[wr_addr] := data_in;
  data_out := mem[rd_addr];

[Figure: dataflow diagram for read after write: mem(wr) followed by mem(rd), producing mem and data_out]

Read after Write Optimization

Algo:
  mem[wr_addr] := data_in;
  data_out := mem[rd_addr];

[Figure: dataflow diagram with mem(wr) and mem(rd) side by side]
Optimization when rd_addr ≠ wr_addr

Write after Write Dependencies

Algo:
  mem[wr1_addr] := data1;
  mem[wr2_addr] := data2;

[Figure: dataflow diagram for write after write: mem(wr) with data1 and wr1_addr, followed by mem(wr) with data2 and wr2_addr, producing mem]

Write after Write Scheduling Option

Algo:
  mem[wr1_addr] := data1;
  mem[wr2_addr] := data2;

[Figure: dataflow diagram with the two mem(wr) operations in the opposite order]
Scheduling option when wr1_addr ≠ wr2_addr

Write after Read Dependencies

Algo:
  rd_data := mem[rd_addr];
  mem[wr_addr] := wr_data;

[Figure: dataflow diagram for write after read: mem(rd) followed by mem(wr), producing rd_data and mem]

Write after Read Optimization

Algo:
  rd_data := mem[rd_addr];
  mem[wr_addr] := wr_data;

[Figure: dataflow diagram with mem(rd) and mem(wr) side by side]
Optimization when rd_addr ≠ wr_addr

2.5.1.6 Example: Memory Array and Dataflow Diagram

Initial Dataflow Diagram

  1  M[2] := 21
  2  M[3] := 31
  3  A := M[2]
  4  B := M[0]
  5  M[3] := 32
  6  M[0] := 01
  7  C := M[3]

[Figure: initial dataflow diagram for the code above, a chain of M(wr) and M(rd) operations from mem to the final M and C]
Figure 2.10: Memory array example code and initial dataflow diagram

Dependency Arrow and Addresses

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.10 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to. In figure 2.11, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. These are the real dependencies and match those shown in the code fragment for figure 2.10. In figure 2.12 we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.

Optimize Dependencies for Known Addresses

[Figure: dataflow diagram of Figure 2.10 with dependencies removed where the addresses are known to differ]

Optimize Anti-Dependencies for Known Addresses

[Figure: the same diagram with anti-dependencies removed where the addresses are known to differ]

Minimal Dependencies
[Figure: M(rd) and M(wr) operations with only the address-based dependencies remaining]
Figure 2.11: Memory array with minimal dependencies

Question: What is the critical path?

Critical Path

[Figure: Figure 2.11 with the critical path highlighted]

Reads and Writes

[Figure: Figure 2.11 with the read and write operations marked]

Question: In what order should operations occur?
Question: Which operations must be first or last?

Obvious First and Last Operations

[Figure: Figure 2.11 with the first and last reads and the last write numbered]

First and last read are obvious from critical path. Last write is obvious.

Question: Are any operations forced into a specific order at this point?

Middle Read

[Figure: Figure 2.11 with all three reads numbered]

Only three reads, so once first and last have been picked, the middle one is determined.

Question: Which write should happen first?

First Write
[Figure: Figure 2.11 with the first write numbered]

The first write is the one closest to the start of the critical path, although because we know the addresses, we could reschedule the first two writes.

Question: Can we complete the ordering?

Complete Ordering

[Figure: all reads and writes numbered]
Figure 2.12: Memory array with orderings

The ordering of writes 2 and 3 is determined because both have 3 as their address.

Place Operations in Clock Cycles

[Figure: the operations of Figure 2.12 assigned to clock cycles 1 to 4]

Final Dataflow Diagram

[Figure: final dataflow diagram with each M(rd) and M(wr) placed in a clock cycle]
Figure 2.13: Final version of Figure 2.10

Put as many parallel operations into the same clock cycle as allowed by resources (one write + one read, two reads, or one write for dual-port RAM). Preserve dependencies by putting dependent operations in separate clock cycles.

2.5.2 Memory Arrays in VHDL

2.5.2.1 Two-Dimensional Array

A memory array can be written in VHDL as a two-dimensional array:

subtype data is std_logic_vector(7 downto 0);
type data_vector is array( natural range <> ) of data;
signal mem : data_vector(31 downto 0);

However, a two-dimensional array does not accurately capture the limitations on a memory array in hardware. The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.

architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);
begin
  y <= mem( a );                  -- comb read
  mem( a ) <= b;                  -- comb write

  process (clk) begin
    if rising_edge(clk) then
      mem( c ) <= w;              -- write port #1
    end if;
  end process;

  process (clk) begin
    if rising_edge(clk) then
      mem( d ) <= v;              -- write port #2
    end if;
  end process;

  u <= mem( e );                  -- read port #2
end main;
2.5.2.2 Memory Array in Hardware

Most simple memory arrays are single- or dual-ported, support just one write operation at a time, and have an interface protocol using a clock and write-enable.
[Figure: block symbols for a single-ported memory (WE, A, DI, DO) and a dual-ported memory (WE, A0, DI0, DO0, A1, DO1)]

2.5.2.3 Example VHDL Code for Memory Array in Hardware

library ieee;
use ieee.std_logic_1164.all;

package mem_pkg is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
end;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.mem_pkg.all;

entity mem is
  port (
    clk : in  std_logic;
    we  : in  std_logic;             -- write enable
    a   : in  unsigned(4 downto 0);  -- address
    di  : in  data;                  -- data_in
    do  : out data                   -- data_out
  );
end mem;

architecture main of mem is
  signal mem : data_vector(31 downto 0);
begin
  -- combinational read
  do <= mem( to_integer( a ) );

  -- clocked write, gated by write enable
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem( to_integer( a ) ) <= di;
      end if;
    end if;
  end process;
end main;

2.5.2.4 Library Component

Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop. Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than a two-dimensional array of flip-flops. These libraries exploit specialized hardware on the chips to implement the memory.

NB: To synthesize a reasonable implementation of a memory array with Synopsys, you must instantiate a vendor-supplied memory component. Some other synthesis tools can infer memory arrays from two-dimensional arrays and synthesize efficient implementations.

Recommended Design Process with Memory

1. high-level model with two-dimensional array
2. two-dimensional array packaged inside memory entity/architecture
3. vendor-supplied component
Altera
Altera uses MegaFunctions to implement RAM in VHDL. A MegaFunction is a black-box description of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM components of different sizes. In E&CE 427 we will provide you with the VHDL code for the RAM components that you will need in Lab-3 and the Project.
The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:

  Number of Elements   Word Size (bits)
  2048                 1
  1024                 2
  512                  4
  256                  8
  128                  16
Xilinx
Use component instantiation to get these components:

  Component   Number of Elements   Word Size (bits)   Description
  ram16x1s    16                   1                  single-ported memory
  ram16x1d    16                   1                  dual-ported memory

Other sizes are also available; consult the datasheet for your chip.
2.5.2.5 Build Memory from Slices

If the vendor's libraries of memory components do not include one that is the correct size for your needs, you can construct your own component from smaller ones.
Widen the Words in a Memory

[Figure: two N×W memories (pins WE, A, DI, DO) share WriteEn, Addr, and Clk. One takes DataIn[W-1..0] and drives DataOut[W-1..0]; the other takes DataIn[2W-1..W] and drives DataOut[2W-1..W].]
Figure 2.14: An N×2W memory from N×W components
Increase Number of Words in a Memory

[Figure: two N×W memories (pins WE, A, DI, DO). Signals shown: WriteEn, Addr[logN], Addr[logN-1..0], DataIn, Clk, DataOut.]
Figure 2.15: A 2N×W memory from N×W components
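The doubling scheme of Figure 2.15 can be sketched in VHDL as follows. This is a minimal sketch rather than course code: the entity name ram32x4s and the signals we_lo, we_hi, do_lo, and do_hi are hypothetical, the ram16x4s interface is the one used elsewhere in these notes, and the output mux assumes the component has a combinational (unregistered) read.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32x4 memory built from two 16x4 memories.
entity ram32x4s is
  port (
    clk, we  : in  std_logic;
    data_in  : in  std_logic_vector(3 downto 0);
    addr     : in  unsigned(4 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram32x4s;

architecture main of ram32x4s is
  component ram16x4s
    port (
      clk, we  : in  std_logic;
      data_in  : in  std_logic_vector(3 downto 0);
      addr     : in  unsigned(3 downto 0);
      data_out : out std_logic_vector(3 downto 0)
    );
  end component;
  signal we_lo, we_hi : std_logic;
  signal do_lo, do_hi : std_logic_vector(3 downto 0);
begin
  -- The top address bit selects which half is written ...
  we_lo <= we and not addr(4);
  we_hi <= we and addr(4);

  lo : ram16x4s port map (
    clk => clk, we => we_lo, data_in => data_in,
    addr => addr(3 downto 0), data_out => do_lo );
  hi : ram16x4s port map (
    clk => clk, we => we_hi, data_in => data_in,
    addr => addr(3 downto 0), data_out => do_hi );

  -- ... and which half drives the output (assumes unregistered reads).
  data_out <= do_hi when addr(4) = '1' else do_lo;
end main;
```

If the underlying memory registers its read data, the mux select would need to be delayed by the same number of cycles as the data.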
A 16×4 Memory from 16×1 Components

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity ram16x4s is
    port (
      clk, we  : in  std_logic;
      data_in  : in  std_logic_vector(3 downto 0);
      addr     : in  unsigned(3 downto 0);
      data_out : out std_logic_vector(3 downto 0)
    );
  end ram16x4s;

  architecture main of ram16x4s is
    component ram16x1s
      port (d              : in  std_logic;  -- data in
            a3, a2, a1, a0 : in  std_logic;  -- address
            we             : in  std_logic;  -- write enable
            wclk           : in  std_logic;  -- write clock
            o              : out std_logic   -- data out
      );
    end component;
  begin
    mem_gen:
    for i in 0 to 3 generate
      ram : ram16x1s
        port map (
          we => we, wclk => clk,
          a3 => addr(3), a2 => addr(2),
          a1 => addr(1), a0 => addr(0),
          ----------------------------------------------
          -- d and o are dependent on i
          d => data_in(i),
          o => data_out(i)
          ----------------------------------------------
        );
    end generate;
  end main;
2.5.2.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write. When doing a simultaneous read and write to the same address, the read will not see the data currently being written.
Question: Why do dual-ported memories usually not support writes on both ports?
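A behavioural model makes the read-during-write rule concrete. The sketch below is illustrative only: the entity name mem_dp is hypothetical, and the port names are chosen to mirror the A0/DI0/DO0/A1/DO1 pins of the dual-ported symbol shown earlier. Because the write is a signal assignment in a clocked process, a read of the same address in the write cycle still returns the old contents.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical model: port 0 reads and writes, port 1 only reads.
entity mem_dp is
  port (
    clk : in  std_logic;
    we  : in  std_logic;                     -- write enable (port 0)
    a0  : in  unsigned(4 downto 0);          -- port 0 address
    di0 : in  std_logic_vector(7 downto 0);  -- port 0 write data
    do0 : out std_logic_vector(7 downto 0);  -- port 0 read data
    a1  : in  unsigned(4 downto 0);          -- port 1 address
    do1 : out std_logic_vector(7 downto 0)   -- port 1 read data
  );
end mem_dp;

architecture main of mem_dp is
  type data_vector is array (natural range <>)
    of std_logic_vector(7 downto 0);
  signal mem : data_vector(31 downto 0);
begin
  -- Two independent combinational reads.
  do0 <= mem(to_integer(a0));
  do1 <= mem(to_integer(a1));

  -- A single clocked write; the array is updated after the clock edge,
  -- so a read in the same cycle does not see the data being written.
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(a0)) <= di0;
      end if;
    end if;
  end process;
end main;
```

As with any two-dimensional array, this model is for simulation and early design; the final design would instantiate a vendor component such as ram16x1d.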
LEC-08: Design Example: Stack

Overview
This lecture builds on material from the previous three lectures where dataflow diagrams, finite state machines, and memory array design were described. This lecture takes a stack (push, pop, swap, top) from an algorithmic description to an RTL implementation in VHDL. The major new idea is working with dataflow diagrams for circuits that perform multiple operations.
Concepts
Lecture Notes: Sections 2.6 to 2.6.4.3
combining FSMs, datapath, and storage
dataflow diagrams with multiple instructions
Background
Reading
2.6 Design Example: Stack
2.6.1 Stack Requirements
2.6.1.1 Stack Entity

VHDL entity for the stack:

  entity stack is
    port (
      reset, clk : in  std_logic;
      inp        : in  std_logic_vector(3 downto 0);
      outp       : out std_logic_vector(3 downto 0)
    );
  end stack;

The input signal inp is used for both instructions and data.
2.6.1.2 Stack Instructions

  push   put a new piece of data onto the top of the stack
  pop    remove the top piece of data from the stack
  swap   swap the top two pieces of data
  tos    output the current data on the top of the stack
2.6.1.3 Stack Instruction Encoding

VHDL package defining stack instructions:

  package stack_instr is
    constant pop  : std_logic_vector(3 downto 0) := "0001";
    constant push : std_logic_vector(3 downto 0) := "0010";
    constant tos  : std_logic_vector(3 downto 0) := "0100";
    constant swap : std_logic_vector(3 downto 0) := "1000";
  end stack_instr;
2.6.1.4 Miscellaneous Requirements

The stack shall have 16 elements.
The inputs shall be registered.
When a push operation is done, in the clock cycle following the push instruction, inp shall have the data that is to be pushed onto the stack.
Popping from an empty stack or pushing onto a full stack results in undefined behaviour.
When doing a tos or pop operation, the output outp shall have the tos data in the clock cycle after the tos instruction is input. At all other times the output is unconstrained.
In the clock cycle following reset being asserted (set to '1'), the stack shall be empty.
2.6.2 Stack Algorithm
The following simple Perl program is an algorithmic description of the stack.
NB: You don't need to know Perl in E&CE 427. Perl is just one example of the many different software programming languages that can be used to create algorithmic descriptions of circuits.
Stack Algorithm Preliminaries

  #! /usr/bin/perl -Wall
  local ($line, @stack, $stack, $tmp);
  $tos = 0;
Stack Algorithm Core
  while ($line = <STDIN>) {
    chop( $line );
    if ( $line eq "tos") {
      print( $stack[$tos] );
    } elsif ( $line eq "pop") {
      print( $stack[$tos] );
      $tos = $tos - 1;
    } elsif ( $line eq "push" ) {
      $tos = $tos + 1;
      $line = <STDIN>;
      chop( $line );
      $stack[$tos] = $line;
    } elsif ( $line eq "swap" ) {
      $tmp = $stack[$tos];
      $stack[$tos] = $stack[$tos-1];
      $stack[$tos-1] = $tmp;
    }
  }
Usage of Perl Stack

  input   output
  push
  3
  tos     3
  push
  4
  tos     4
  pop     4
  tos     3
2.6.3 Stack Dataflow Diagrams
2.6.3.1 Initial Diagrams

Do one diagram for each operation. Do the initial dataflow diagrams without any clock cycle information.
Pop
[Figure: dataflow diagram for Pop. Inputs: stack, tos. Operations: stack(rd), -1. Outputs: stack, data_out, tos.]
Push
[Figure: dataflow diagram for Push. Inputs: stack, data_in, tos. Operations: +1, stack(wr). Outputs: stack, tos.]
Tos
[Figure: dataflow diagram for Tos. Inputs: stack, tos. Operations: stack(rd). Outputs: stack, data_out, tos.]
Swap
[Figure: dataflow diagram for Swap. Inputs: stack, tos. Operations: -1, two stack(rd), two stack(wr). Outputs: stack, tos.]
Note: scheduling decision and anti-dependency arrows
2.6.3.2 Partition into Clock Cycles
Pop, Push
[Figure: dataflow diagrams for Pop and Push, partitioned into clock cycles.]
  Pop:  2 registers (stack, tos), 1 ALU
  Push: 3 registers (stack, tos, data_in), 1 ALU
Tos
[Figure: dataflow diagram for Tos, partitioned into clock cycles.]
  Tos:  registers (stack, tos)
Swap
[Figure: dataflow diagram for Swap, partitioned into clock cycles.]
  Swap version 1: 5 registers (stack, tos, stack[tos], stack[tos-1], tos-1), 1 ALU
Swap (Optimized)
[Figure: optimized dataflow diagram for Swap, with a second -1 operation in place of the stored tos-1.]
  Swap version 2 (optimized): 4 registers (stack, tos, stack[tos], stack[tos-1]), 1 ALU; eliminated one register
2.6.3.3 High-Level Model

This high-level model is taken directly from the dataflow diagrams and block diagrams. There is one process that combines control, datapath, and storage; except for the output (outp), which is done with a concurrent assignment statement.
Notice that there is a next init when (reset = '1'); after every wait statement. This is needed to get the circuit back to its initial state in the next clock cycle when reset is asserted.
First, we'll see the overall structure of the hlm architecture, and then the gory details.
Stack HLM Structure

  architecture hlm of stack is
    ...declarations...
  begin
    ----------------------------------------------
    process
    begin
      init : loop
        ...reset assignments...
        loop
          --------------------------------
          wait until rising_edge(clk);
          next init when (reset = '1');
          --------------------------------
          case inp is
            when pop    => ...pop code...
            when push   => ...push code...
            when swap   => ...swap code...
            when tos    => ...tos code...
            when others => next init;
          end case;
        end loop;
      end loop;
    end process;
    ----------------------------------------------
    outp <= stack(to_integer(tos));
    ----------------------------------------------
  end hlm;
Stack HLM Declarations

  architecture hlm of stack is
    ----------------------------------------------
    subtype data_ty is std_logic_vector(3 downto 0);
    type stack_ty is array (15 downto 0) of data_ty;
    ----------------------------------------------
    signal tos        : unsigned(3 downto 0);
    signal tmp1, tmp2 : data_ty;
    signal stack      : stack_ty;
    signal empty      : std_logic;
    ----------------------------------------------
  begin
Stack HLM: Pop

  when pop =>
    tos <= tos - 1;
Stack HLM: Push

  when push =>
    if (empty = '0') then
      tos <= tos + 1;
    end if;
    --------------------------------
    wait until rising_edge(clk);
    next init when (reset = '1');
    --------------------------------
    stack(to_integer(tos)) <= inp;
    empty <= '0';
Stack HLM: Swap

  when swap =>
    tmp1 <= stack(to_integer(tos-1));
    --------------------------------
    wait until rising_edge(clk);
    next init when (reset = '1');
    --------------------------------
    tmp2 <= stack(to_integer(tos));
    --------------------------------
    wait until rising_edge(clk);
    next init when (reset = '1');
    --------------------------------
    stack(to_integer(tos-1)) <= tmp2;
    --------------------------------
    wait until rising_edge(clk);
    next init when (reset = '1');
    --------------------------------
    stack(to_integer(tos)) <= tmp1;
Stack HLM: Tos

  when tos =>
    null;
Stack HLM: Others

          when others =>
            next init;
        end case;
      end loop;
    end loop;
  end process;
Stack HLM: Output

    ----------------------------------------------
    outp <= stack(to_integer(tos));
    ----------------------------------------------
  end hlm;
2.6.3.4 Individual Block Diagrams

Build one block diagram for each operation.
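Pop
[Figure: dataflow diagram and block diagram for Pop. Components shown: tos register, -1 decrementer, stack memory (pins we, a, di, do), output outp.]
Push
[Figure: dataflow diagram and block diagram for Push. Components shown: control, tos register (pins d, ce, q), +1 incrementer, stack memory (pins we, a, di, do), input inp.]
Tos
[Figure: dataflow diagram and block diagram for Tos. Components shown: tos register, stack memory (pins we, a, di, do) with we tied to 0, output outp.]
Swap Dataflow
[Figure: optimized dataflow diagram for Swap. Operations: two -1, two stack(rd), two stack(wr).]
Swap Block Diagram
[Figure: block diagram for Swap. Components shown: control, tos register, -1 decrementer, stack memory (pins we, a, di, do), registers tmp1 and tmp2 (pins d, ce).]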
2.6.3.5 Complete Block Diagram

Merge all of the block diagrams together, reusing components wherever possible.
Block Diagram for All Operations
[Figure: block diagram for all operations. Control outputs: tos_inc_dec_sel, tos_ce, tmp2_ce, stack_addr_sel, stack_data_sel, stack_we, tmp1_ce. Components shown: tos register (pins d, ce, q, reset r), -1/+1 adjuster, stack memory (pins we, a, di, do), registers tmp1 and tmp2 (pins d, ce), input inp, output outp.]
2.6.4 Stack: Register Transfer Level
The high-level model is synthesizable, but might be large and slow.

It uses a 2-d array for the stack, rather than specialized memory components from the library.
We are relying on the synthesis tool to build a state machine to drive the datapath.
Sometimes, by writing code that is closer to gate-level hardware, we can improve performance and/or area.
Structuring RTL Code

There are four different ways to structure your RTL code:

Single process
Separate datapath
Separate control, storage, and datapath
Fully disassembled
Single Process Structure
[Figure: control, storage, and datapath together in one block]
Separate Datapath
[Figure: control and storage in one block; datapath in a separate block]
Separate Control, Storage and Datapath
[Figure: control, storage, and datapath each in a separate block]
Fully Disassembled
[Figure: next-state functions, control storage, storage, and datapath each in a separate block]
Stack RTL
To write the RTL code for the stack, consider the following options:
Replacing the stack as an array with a component instantiation of a memory array from the FPGA libraries
Defining a state machine and signals to control the datapath (e.g. define a state type and a signal of type state, and do assignments to current and next-state signals)
Question to ponder: does an explicit state machine result in better hardware?
2.6.4.1 Stack: Separate Control, Datapath and Storage

This design is derived directly from the hardware block diagram. We separate the state machine and datapath using the control signals that drive the datapath (mux select lines, chip enables, etc).
The state machine drives signals that control the datapath. The state machine is very similar to that in the high-level model. In every state we assign values to the signals that control the datapath.
The datapath is done with concurrent statements. By using concurrent statements, rather than processes, for the datapath, we eliminate the need for the datapath assignments to have sensitivity lists, which simplifies the code.
This style works best when there are a large number of states and a small number of datapath components.
Block Diagram
[Figure: block diagram for all operations, as above, with the combinational signals tos_adj, stack_addr, and stack_data_in labelled.]
Inventory
  Registers       tos, tmp1, tmp2
  Memory          stack
  Combinational   tos_adj, stack_addr, stack_data_in
  FSM outputs     tos_ce, tos_inc_dec_sel, stack_addr_sel, stack_data_sel, stack_we, tmp2_ce, tmp1_ce
SepFsm Overview (1)

  architecture sepfsm of stack is
    ...declarations...
  begin
    ...component instantiation for memory...
    ...clocked process for state machine...
    ...clocked process for tmp1...
    ...clocked process for tmp2...
    ...clocked process for tos...
    ...concurrent assignment for tos_adj...
    ...concurrent assignment for stack_addr...
    ...concurrent assignment for stack_data_in...
  end sepfsm;
SepFsm Overview (2)

  architecture sepfsm of stack is
    ...declarations...
  begin
    ...component instantiation for memory...
    process
    begin
      init : loop
        ...initialization...
        loop
          --------------------------------
          wait until rising_edge(clk);
          next init when (reset = '1');
          --------------------------------
          case inp is
            when pop    => ...pop code...
            when push   => ...push code...
            when swap   => ...swap code...
            when tos    => ...tos code...
            when others => next init;
          end case;
        end loop;
      end loop;
    end process;
    ...clocked process for tmp1...
    ...clocked process for tmp2...
    ...clocked process for tos...
    ...concurrent assignment for tos_adj...
    ...concurrent assignment for stack_addr...
    ...concurrent assignment for stack_data_in...
  end sepfsm;
SepFsm Declarations (1)

  architecture sepfsm of stack is
    signal tos, tos_adj, stack_addr
      : unsigned(3 downto 0);
    signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
      : std_logic_vector(3 downto 0);

Question: Why are some signals unsigned and others std_logic_vector?
Answer: Signals that are used as numbers (e.g. addresses for memory array) are unsigned. Non-numeric signals are std_logic_vector.

    signal synch_reset, empty, tos_inc_dec_sel, stack_addr_sel,
           tos_ce, stack_we, tmp1_ce, tmp2_ce
      : std_logic;
    signal stack_data_sel : std_logic_vector(1 downto 0);
    ------------------------------------------------------
    component ram16x4s
      port (data_in  : in  std_logic_vector(3 downto 0);
            addr     : in  unsigned(3 downto 0);
            we       : in  std_logic;
            clk      : in  std_logic;
            data_out : out std_logic_vector(3 downto 0)
      );
    end component;
    ------------------------------------------------------
SepFsm Ram Instantiation

  begin
    stack : ram16x4s
      port map (
        ----------------------------------------------
        we       => stack_we,
        clk      => clk,
        ----------------------------------------------
        addr     => stack_addr,
        data_in  => stack_data_in,
        data_out => stack_data_out
        ----------------------------------------------
      );
SepFsm Initialization
    process
    begin
      init : loop
        --------------------------------
        empty           <= '1';
        tos_inc_dec_sel <= '-';
        stack_addr_sel  <= '-';
        tos_ce          <= '1';
        stack_we        <= '0';
        stack_data_sel  <= "--";
        tmp1_ce         <= '-';
        tmp2_ce         <= '-';
SepFsm Pop
        --------------------------------
        loop
          --------------------------------
          wait until rising_edge(clk);
          next init when (reset = '1');
          --------------------------------
          case inp is
            when pop =>
              tos_inc_dec_sel <= '0';
              stack_addr_sel  <= '1';
              tos_ce          <= '1';
              stack_we        <= '0';
              stack_data_sel  <= "--";
              tmp1_ce         <= '-';
              tmp2_ce         <= '-';
SepFsm Push
            when push =>
              if (empty = '1') then
                tos_inc_dec_sel <= '-';
                stack_addr_sel  <= '0';
                tos_ce          <= '0';
              else
                tos_inc_dec_sel <= '1';
                stack_addr_sel  <= '1';
                tos_ce          <= '1';
              end if;
              stack_data_sel  <= "--";
              stack_we        <= '0';
              tmp1_ce         <= '-';
              tmp2_ce         <= '-';
              --------------------------------
              wait until rising_edge(clk);
              next init when (reset = '1');
              --------------------------------
              empty           <= '0';
              tos_inc_dec_sel <= '-';
              stack_addr_sel  <= '0';
              tos_ce          <= '0';
              stack_data_sel  <= "00";
              stack_we        <= '1';
              tmp1_ce         <= '-';
              tmp2_ce         <= '-';
SepFsm Swap
            when swap =>
              ...
          end case;
        end loop;
      end loop;
    end process;
SepFsm tmp1
    ------------------------------------------------------
    process (clk) begin
      if rising_edge(clk) then
        if (tmp1_ce = '1') then
          tmp1 <= stack_data_out;
        end if;
      end if;
    end process;
SepFsm tmp2
    ------------------------------------------------------
    process (clk) begin
      if rising_edge(clk) then
        if (tmp2_ce = '1') then
          tmp2 <= stack_data_out;
        end if;
      end if;
    end process;
SepFsm Tos
    ------------------------------------------------------
    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          tos <= to_unsigned(0, 4);
        elsif (tos_ce = '1') then
          tos <= tos_adj;
        end if;
      end if;
    end process;
SepFsm Tos Adjustment

    ------------------------------------------------------
    tos_adj <=   tos + 1 when (tos_inc_dec_sel = '1')
            else tos - 1 ;
    ...
SepFsm Stack Address

    ------------------------------------------------------
    stack_addr <=   tos when (stack_addr_sel = '0')
               else tos_adj ;
SepFsm Stack Data

    ------------------------------------------------------
    stack_data_in <=   inp_intern when (stack_data_sel = "00")
                  else tmp1       when (stack_data_sel = "01")
                  else tmp2 ;
    ------------------------------------------------------
  end sepfsm;
2.6.4.2 Stack: Datapath Operations

The state machine in Section 2.6.4.1 controlled each datapath component individually. An alternative style is for the state machine to tell the datapath what state it is in, or what global collection of operations to perform, then each part of the datapath decodes this and takes the appropriate action. This style works best when there are a small number of states and a large number of datapath components.
Dp-Op Declarations
  architecture dp_op of stack is
    ------------------------------------------------------
    -- define the states
    type dp_op_ty is
      (init_op, pop_op, push1_op, push2_op,
       swap_wr_tmp1_op, swap_wr_tmp2_op,
       swap_rd_tmp1_op, swap_rd_tmp2_op,
       nop_op
      );
    signal dp_op : dp_op_ty;
    signal tos, tos_adj, stack_addr
      : unsigned(3 downto 0);
    signal inp_intern, stack_data_in, stack_data_out, tmp1, tmp2
      : std_logic_vector(3 downto 0);
    signal empty, stack_we : std_logic;
  begin
Dp-Op State Machine

    ------------------------------------------------------
    process
    begin
      init : loop
        --------------------------------
        empty <= '1';
        dp_op <= init_op;
        loop
          --------------------------------
          wait until rising_edge(clk);
          next init when (reset = '1');
          --------------------------------
          case inp is
            when pop =>
              dp_op <= pop_op;
            when push =>
              dp_op <= push1_op;
              --------------------------------
              wait until rising_edge(clk);
              next init when (reset = '1');
              --------------------------------
              -- stack(to_integer(tos)) <= inp;
              dp_op <= push2_op;
              empty <= '0';
            when swap =>
              ...
            ...
          end case;
        end loop;
      end loop;
    end process;
Dp-Op Input Storage

    ------------------------------------------------------
    process (clk) begin
      if rising_edge(clk) then
        inp_intern <= inp;
      end if;
    end process;
Dp-Op Tos
    ------------------------------------------------------
    process (clk) begin
      if rising_edge(clk) then
        if (dp_op = init_op) then
          tos <= to_unsigned(0,4);
        elsif (    (dp_op = pop_op)
                OR (dp_op = push1_op and (empty = '0'))
              ) then
          tos <= tos_adj;
        end if;
      end if;
    end process;
    ------------------------------------------------------
    tos_adj <=   tos + to_unsigned(1,3) when (dp_op = push1_op)
            else tos - to_unsigned(1,3) ;
    ------------------------------------------------------
Dp-Op Stack Address

    stack_addr <=   tos_adj when (    (dp_op = pop_op)
                                   OR ((dp_op = push1_op) AND (empty = '0'))
                                   OR (dp_op = swap_wr_tmp1_op)
                                   OR (dp_op = swap_rd_tmp2_op)
                                 )
               else tos ;
Dp-Op Data Input Register

    stack_data_in <=   inp_intern when (dp_op = push2_op)
                  else tmp1       when (dp_op = swap_rd_tmp1_op)
                  else tmp2 ;
Dp-Op Write Enable

    stack_we <=   '1' when (    (dp_op = push2_op)
                             OR (dp_op = swap_rd_tmp1_op)
                             OR (dp_op = swap_rd_tmp2_op)
                           )
             else '0';
Dp-Op Output
    ------------------------------------------------------
    outp <= stack_data_out;
    ------------------------------------------------------
Dp-Op RAM Instantiation

    stack : ram16x4s
      port map (
        ----------------------------------------------
        we       => stack_we,
        clk      => clk,
        ----------------------------------------------
        addr     => stack_addr,
        data_in  => stack_data_in,
        data_out => stack_data_out
        ----------------------------------------------
      );
  end dp_op;
2.6.4.3 Stack: Explicit State Machine

Here we drop the loop ... wait ... style of implicit state machines and build an explicit state machine with current and next state signals. Notice that the stack is such a simple design that each datapath operation in the Dp-Op architecture is used in only one state. This is a sign that the Dp-Op style is not well-suited to the stack. This example also illustrates the use of a function to capture common code. The function is used here to determine which state to go to next when a new input instruction arrives.
Explicit Declarations
  architecture state of stack is
    type state_ty is
      (init_st, pop_st, push1_st, push2_st,
       swap_wr_tmp1_st, swap_wr_tmp2_st,
       swap_rd_tmp1_st, swap_rd_tmp2_st,
       nop_st
      );
    signal state, state_n : state_ty;
    ...
    ...
Explicit Function
    ------------------------------------------------------
    function restart (inp : std_logic_vector(3 downto 0))
      return state_ty
    is
    begin
      case inp is
        when pop    => return(pop_st);
        when push   => return(push1_st);
        when swap   => return(swap_wr_tmp1_st);
        when others => return(nop_st);
      end case;
    end restart;
  begin
Explicit State Storage

    ------------------------------------------------------
    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          state   <= init_st;
          empty_n <= '1';
        else
          state   <= state_n;
          empty_n <= empty;
        end if;
      end if;
    end process;
Explicit Next State

    ------------------------------------------------------
    process (state, inp) begin
      case state is
        when init_st | pop_st | push2_st | swap_wr_tmp2_st | nop_st =>
          state_n <= restart(inp);
        when push1_st =>
          state_n <= push2_st;
        when swap_rd_tmp1_st =>
          state_n <= swap_rd_tmp2_st;
        when swap_rd_tmp2_st =>
          state_n <= swap_wr_tmp1_st;
        when swap_wr_tmp1_st =>
          state_n <= swap_wr_tmp2_st;
      end case;
    end process;
    ...
Explicit Tos
    process (clk) begin
      if rising_edge(clk) then
        if (state = init_st) then
          tos <= to_unsigned(0,4);
        elsif (    (state = pop_st)
                OR (state = push1_st and (empty = '0'))
              ) then
          tos <= tos_adj;
        end if;
      end if;
    end process;
Explicit Tos Adjustment

    ------------------------------------------------------
    tos_adj <=   tos + to_unsigned(1,3) when (state = push1_st)
            else tos - to_unsigned(1,3) ;
Explicit Stack Address

    ------------------------------------------------------
    stack_addr <=   tos_adj when (    (state = pop_st)
                                   OR ((state = push1_st) AND (empty = '0'))
                                   OR (state = swap_wr_tmp1_st)
                                   OR (state = swap_rd_tmp2_st)
                                 )
               else tos ;
    ...
  end state;
LEC-09: Guidelines and Optimization Techniques

Schedule
  wk-01 to 05   VHDL Design and Optimization
    wk-01         Overview and VHDL
    wk-02         VHDL Semantics
    wk-03 to 05   Design and Optimization
      wk-03         Dataflow Diagrams; State Machines
      wk-04         Memory Design; Example Design
      wk-05         Guidelines and Optimizations
  wk-06         Functional Validation
  wk-07         Performance Analysis, Prediction, and Optimization
  wk-08         Timing Analysis
  wk-08 to 10   Power Analysis and Reduction
  wk-11 to 12   Faults and Testing
  wk-13         Review
Concepts
Lecture Notes: Sections 2.7 to 2.9.4
coding guidelines
more VHDL features
strength reduction
mux-pushing
common subexpression elimination
replication
arithmetic optimizations
2.7 RTL Coding Guidelines
2.7.1 Design Process
Recommendation: Spend the time up front to plan a good design on paper. Use dataflow diagrams and state machines to predict performance and area.
This section gives guidelines for building robust, portable, and synthesizable VHDL code. Portability is both for different simulation and synthesis tools and for different implementation technologies.
Remember, there is a world of difference between getting a design to work in simulation and getting it to work on a real FPGA. And there is also a huge difference between getting a design to work in an FPGA for a few minutes of testing and getting thousands of products to work for months at a time in thousands of different environments around the world.
The coding guidelines here are designed both to help you get your E&CE 427 project to work and to carry over to your subsequent industrial designs.
Finally, note that there are exceptions to every rule. You might find yourself in a situation where your particular circumstances (e.g. choice of tool, target technology, etc) would benefit from bending or breaking a guideline here. Within E&CE 427, of course, there won't be any such circumstances.
2.7.2 Signal Declarations
Signals vs Variables
Use signals, do not use variables.
reason: The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware.
Std Logic
Use std_logic signals, do not use bit or Boolean.
reason: std_logic is the most commonly used signal type across synthesis tools, simulation tools, and cell libraries.
Port Modes
Use in or out, do not use inout.
reason: inout signals are tri-state.
note: If you have an output signal that you also want to read from, you might be tempted to declare the direction of the signal to be inout. A better solution is to create a new, internal, signal that you both read from and write to. Then, your output signal can just read from the internal signal.
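The internal-signal approach from the note can be sketched as follows. This is an illustrative example, not course code: the entity counter and the signal count_i are hypothetical names. The counter needs to read its own value, so it keeps that value in an internal signal and copies it to the out port.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical counter whose output value is also read internally.
entity counter is
  port (
    clk, reset : in  std_logic;
    count      : out std_logic_vector(3 downto 0)
  );
end counter;

architecture main of counter is
  -- Internal signal that is both read from and written to.
  signal count_i : unsigned(3 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        count_i <= (others => '0');
      else
        count_i <= count_i + 1;   -- reads the internal signal, not the port
      end if;
    end if;
  end process;

  -- The out port just reads from the internal signal.
  count <= std_logic_vector(count_i);
end main;
```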
Primary Inputs and Outputs of Chip
Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector. Do not use signed or unsigned for primary inputs or outputs.
reason: Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned vectors in entities into std_logic_vectors. If you want your same testbench to work for both functional simulation and timing simulation, you must not use signed or unsigned signals in the top-level entity of your chip.
note: Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and inside architectures. It is only the top-level entity that should not use signed or unsigned signals.
2.7.3 Processes
For a combinational process, the sensitivity list should contain all of the signals that are read in the process.
reason: Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A tool that adheres to the standard will introduce latches if not all signals that are read from are included in the sensitivity list.
exception: In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.
Combinational Processes
For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.
reason: If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.
note: For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin.
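One common way to meet both process guidelines is sketched below, under the assumption that default assignments at the top of the process are acceptable in your flow; the entity comb_example and its ports are hypothetical. Every signal read appears in the sensitivity list, and the defaults guarantee that y and z are assigned on every path, so no latch is inferred even though the individual branches do not mention both signals.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical combinational block with two outputs.
entity comb_example is
  port (
    sel  : in  std_logic_vector(1 downto 0);
    a, b : in  std_logic;
    y, z : out std_logic
  );
end comb_example;

architecture main of comb_example is
begin
  -- Sensitivity list names every signal the process reads.
  process (sel, a, b) begin
    -- Default assignments: y and z get a value on every path.
    y <= '0';
    z <= '0';
    case sel is
      when "00" =>
        y <= a;
      when "01" =>
        y <= b;
        z <= a;
      when others =>
        null;   -- defaults apply
    end case;
  end process;
end main;
```

Removing the two default assignments would leave z unassigned in the "00" branch, which is exactly the situation the guideline warns about.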
Single Assignment Rule
Each signal should be assigned to in only one process.
reason: Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, short circuits, and other bad things.
exception: Multiple drivers are acceptable if your implementation technology has wired-ANDs or wired-ORs. FPGAs don't have wired-ANDs or wired-ORs.
Separate Unrelated Signals
Separate unrelated signals into different processes.
reason: Grouping assignments to unrelated signals into a single process can complicate the control circuitry for that process. Each branch in a case statement or if-then-else adds multiplexor or chip-enable circuitry.
2.7.4 Flip-Flops and Latches
Use flops, not latches (see section 1.7.2).
Use D-flops, not T, JK, etc (see section 1.7.2).
Know Your Hardware
For every signal in your design, know whether it should be a flip-flop or combinational. Before simulating your design, examine the log file LOG/dc_shell.log to see if the flip-flops in your circuit match your expectations, and to check that you don't have any latches in your design.
2.7.4.1 Multiplexors and Tri-State Signals
Use multiplexors, not tri-state buffers (see section 1.7.2).
2.7.5 State Machines
In a state machine, illegal and unreachable states should transition to the reset state.
reason: Creates more robust implementations. In the field, your circuit will be subjected to illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At some point in time, something weird will happen that will cause it to jump into an illegal state. Having a system crash and reboot is much better than having it generate incorrect outputs that aren't detected.
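A minimal sketch of this guideline, assuming a hand-written binary encoding in which one code is unused: the entity fsm_safe, its ports, and its three states are hypothetical. The when others branch catches the unused code "11" (and any non-'0'/'1' values in simulation) and sends the machine back to the reset state.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state machine with an explicit binary encoding.
entity fsm_safe is
  port (
    clk, reset, go : in  std_logic;
    busy           : out std_logic
  );
end fsm_safe;

architecture main of fsm_safe is
  subtype state_ty is std_logic_vector(1 downto 0);
  constant init_st : state_ty := "00";
  constant run_st  : state_ty := "01";
  constant done_st : state_ty := "10";   -- "11" is unused (illegal)
  signal state, state_n : state_ty;
begin
  -- State register with synchronous reset.
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= init_st;
      else
        state <= state_n;
      end if;
    end if;
  end process;

  -- Next-state logic; state_n is assigned on every path.
  process (state, go) begin
    case state is
      when init_st =>
        if go = '1' then
          state_n <= run_st;
        else
          state_n <= init_st;
        end if;
      when run_st  => state_n <= done_st;
      when done_st => state_n <= init_st;
      when others  => state_n <= init_st;  -- illegal state: go to reset state
    end case;
  end process;

  busy <= '1' when state = run_st else '0';
end main;
```

Whether the recovery logic survives synthesis depends on the tool and its settings, so it is worth checking the synthesis report.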
State Encoding
If your state machine has fewer than 16 states, use a one-hot encoding.
reason: For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2 n flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to determine if the circuit is in a particular state. For small values of n, a one-hot signal results in a smaller and faster circuit. For large values of n, the number of signals required for a one-hot design is too great of a penalty to compensate for the simplicity of the decoding circuitry.
note: Using an enumerated type for states allows the synthesis tool to choose state encodings that it thinks will work well to balance area and clock speed. Quartus uses a modified one-hot encoding, where the bit that denotes the reset state is inverted. That is, when the reset bit is 0, the system is in the reset state and when the reset bit is a 1 the system is not in the reset state. The other bits have the normal polarity. The result is that when the system is in the reset state, all bits are 0 and when the system is in a non-reset state, two bits are 1.
note: Using your own encoding allows you to leverage knowledge about your design that the synthesis tool might not be able to deduce.
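For readers who want to see a hand-written one-hot encoding, here is a small sketch; the entity onehot_fsm, its ports, and its states are hypothetical, and the encoding is plain one-hot rather than the modified Quartus scheme described in the note. Decoding the run state needs only one bit of the state vector.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-state machine with a one-hot encoding.
entity onehot_fsm is
  port (
    clk, reset, go : in  std_logic;
    in_run         : out std_logic
  );
end onehot_fsm;

architecture main of onehot_fsm is
  subtype state_ty is std_logic_vector(2 downto 0);
  constant init_st : state_ty := "001";   -- one flip-flop per state
  constant run_st  : state_ty := "010";
  constant done_st : state_ty := "100";
  signal state : state_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= init_st;
      else
        case state is
          when init_st =>
            if go = '1' then
              state <= run_st;
            end if;
          when run_st  => state <= done_st;
          when done_st => state <= init_st;
          when others  => state <= init_st;  -- not one-hot: recover
        end case;
      end if;
    end if;
  end process;

  -- Only one bit is checked to test for the run state.
  in_run <= state(1);
end main;
```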
2.7.5.1 Reset
Include a reset signal in all clocked circuits.

Reason: For most implementation technologies, when you power up the circuit, you do not know what state it will start in. Also, if something goes wrong while the circuit is running, you need a way to get it into a guaranteed state.
Reset with Implicit State Machines
For implicit state machines, check for reset after every wait statement.

Reason: If a wait statement is not followed by a reset check, your circuit might not notice a reset signal, or different signals could reset in different clock cycles, causing your circuit to get out of sync.
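One way to follow this guideline is to wrap the whole sequence in a labelled loop and restart the loop whenever reset is seen. The sketch below is an assumption-laden illustration: the entity implicit_fsm, its ports, and the three-step behaviour are invented for the example, and support for multiple wait statements varies between synthesis tools.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical implicit state machine: each wait statement is one state.
entity implicit_fsm is
  port (
    clk, reset : in  std_logic;
    a          : in  std_logic_vector(7 downto 0);
    z          : out std_logic_vector(7 downto 0)
  );
end implicit_fsm;

architecture main of implicit_fsm is
begin
  process
  begin
    init : loop
      z <= (others => '0');          -- reset behaviour
      wait until rising_edge(clk);
      next init when reset = '1';    -- reset check after the first wait

      z <= a;
      wait until rising_edge(clk);
      next init when reset = '1';    -- reset check after the second wait

      z <= not a;
      wait until rising_edge(clk);
      next init when reset = '1';    -- reset check after the third wait
    end loop init;
  end process;
end main;
```

Because every wait is followed by the same check, a reset asserted in any clock cycle returns the process to the top of the loop at the next rising edge.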
Reset Only Important Flops
Connect reset to the important control signals in the design, such as the state signal. Do not reset every flip-flop.

Reason: Using reset adds area and delay to a circuit. The fewer signals that need reset, the faster and smaller your design will be.

Note: Datapath signals rarely need to be reset.
Synchronous Reset
Use synchronous, not asynchronous, reset.

Reason: Creates more robust implementations. Signal propagation delays mean that asynchronous resets cause different parts of the circuit to be reset at different times. This can lead to glitches, which then might cause the circuit to move to an illegal state.
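The two reset guidelines above can be combined in one clocked process. The sketch below is illustrative only; the entity sync_reset_demo and its port names are assumptions. The control flip-flop (valid_out) has a synchronous reset, while the datapath register (d_out) has none.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical register stage: reset reaches only the control bit.
entity sync_reset_demo is
  port (
    clk, reset : in  std_logic;
    valid_in   : in  std_logic;
    d_in       : in  std_logic_vector(7 downto 0);
    valid_out  : out std_logic;
    d_out      : out std_logic_vector(7 downto 0)
  );
end sync_reset_demo;

architecture main of sync_reset_demo is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Datapath: no reset, so no extra area or delay.
      d_out <= d_in;

      -- Control: synchronous reset, sampled only on the clock edge.
      if reset = '1' then
        valid_out <= '0';
      else
        valid_out <= valid_in;
      end if;
    end if;
  end process;
end main;
```

Downstream logic is expected to ignore d_out whenever valid_out is '0', which is why the datapath contents after reset do not matter.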
2.7.6 Inputs and Outputs
Put flip-flops on the primary inputs and outputs of a chip.

Reason: Creates more robust implementations. Signal delays between chips are unpredictable. Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting flip-flops on the inputs and outputs of a chip provides clean boundaries between circuits.

Note: This only applies to the primary inputs and outputs of a chip (the signals in the top-level entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or outputs. Within a chip, you do not need to put flip-flops on both inputs and outputs.
2.8 Additional VHDL Features
2.8.1 Vectors
VHDL supports reading from and assigning to slices (aka discrete subranges) of vectors.

The slices on both sides of the assignment must have the same number of elements. The direction (downto or to) of each slice must match the direction of the signal declaration. The direction of the target and expression may be different.
Declarations
  ----------------------------------------------------
  a, b       : in  std_logic_vector(15 downto 0);
  c, d, e    : out std_logic_vector(15 downto 0);
  ----------------------------------------------------
  ax, bx     : in  std_logic_vector(0 to 15);
  cx, dx, ex : out std_logic_vector(0 to 15);
  ----------------------------------------------------
  m, n       : in  unsigned(15 downto 0);
  p, q, r    : out unsigned(15 downto 0);
  ----------------------------------------------------
  w, x       : in  signed(15 downto 0);
  y, z       : out signed(15 downto 0)
  ----------------------------------------------------
Legal code
  c(3 downto 0) <= a(15 downto 12);
  cx(0 to 3)    <= a(15 downto 12);
  (e(3), e(4))  <= bx(12 to 13);
  (e(5), e(6))  <= b(13 downto 12);
Illegal code
  d(0 to 3)     <= a(15 to 12);           -- slice dirs must be same as decl
  e(3) & e(2)   <= b(12 to 13);           -- syntax error on &
  p(3 downto 0) <= (m + n)(3 downto 0);   -- syntax error on )(
  z(3 downto 0) <= m(15 downto 12);       -- types on lhs and rhs must match
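Each of the illegal statements above has a legal counterpart. The sketch below shows one possible rewrite of each; the entity slice_fixes, the intermediate signal sum, and the 4-bit output widths are assumptions made to keep the example short.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity showing legal versions of the illegal statements.
entity slice_fixes is
  port (
    a, b : in  std_logic_vector(15 downto 0);
    d, e : out std_logic_vector(3 downto 0);
    m, n : in  unsigned(15 downto 0);
    p    : out unsigned(3 downto 0);
    z    : out signed(3 downto 0)
  );
end slice_fixes;

architecture main of slice_fixes is
  signal sum : unsigned(15 downto 0);
begin
  -- Slice directions match the declarations (downto).
  d(3 downto 0) <= a(15 downto 12);

  -- Use an aggregate as the target instead of concatenation.
  (e(3), e(2))  <= b(13 downto 12);
  e(1 downto 0) <= "00";

  -- Name the sum first, then slice the named signal.
  sum <= m + n;
  p   <= sum(3 downto 0);

  -- Convert explicitly between unsigned and signed.
  z <= signed(m(15 downto 12));
end main;
```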
2.8.2 Still More VHDL Features
Some constructs that are useful and will be described in later chapters and sections:
- for-generate : replicates hardware
- if-generate : conditionally generates hardware
- report : prints a message on stderr while simulating
- assert : assertions about the behaviour of signals, very useful with report statements
- generics : parameters to an entity that are defined at elaboration time
- attributes : predefined functions for different datatypes, for example the high and low indices of a vector
2.9 General Optimization Techniques
2.9.1 Strength Reduction
Strength reduction replaces one operation with another that is simpler.
2.9.1.1 Arithmetic Strength Reduction

Operation                              Replacement
Multiply by a constant power of two    wired shift logical left
Multiply by a power of two             shift logical left
Divide by a constant power of two      wired shift logical right
Divide by a power of two               shift logical right
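A "wired shift" uses no gates at all: the bits are simply reconnected. The sketch below illustrates this for unsigned values and a constant factor of four; the entity times_four and its port names are assumptions for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: multiply and divide by four using only wires.
entity times_four is
  port (
    a    : in  unsigned(7 downto 0);
    prod : out unsigned(9 downto 0);
    quot : out unsigned(7 downto 0)
  );
end times_four;

architecture main of times_four is
begin
  -- a * 4: append two zero bits on the right.
  prod <= a & "00";

  -- a / 4: drop the two least significant bits, pad with zeros on the left.
  quot <= "00" & a(7 downto 2);
end main;
```

The division shown here truncates, and it applies to unsigned values only; signed division by a power of two needs separate treatment of the sign bit.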
2.9.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:
- is_odd, is_even : least significant bit
- is_neg, is_pos : most significant bit

NOTE: Use is_odd(a) rather than a(0).
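Wrapping the bit test in a named function keeps the intent visible in the code while still synthesizing to a single wire. The package below is a sketch: the function names follow the note above, but the package name bool_tests_pkg and the choice of numeric_std types are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package of Boolean tests that reduce to a single wire.
package bool_tests_pkg is
  function is_odd (a : unsigned) return boolean;
  function is_neg (a : signed) return boolean;
end bool_tests_pkg;

package body bool_tests_pkg is

  -- numeric_std treats the rightmost element as the least significant bit.
  function is_odd (a : unsigned) return boolean is
  begin
    return a(a'right) = '1';
  end is_odd;

  -- numeric_std treats the leftmost element as the sign bit.
  function is_neg (a : signed) return boolean is
  begin
    return a(a'left) = '1';
  end is_neg;

end bool_tests_pkg;
```

Using the 'right and 'left attributes rather than a literal index such as 0 keeps the functions correct for both "downto" and "to" declarations.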
2.9.2 Replication and Sharing
2.9.2.1 Mux-Pushing
Pushing multiplexors into the fanin of a signal can reduce area.

Before
  z <= a + b when (w = '1') else a + c;

After
  tmp <= b when (w = '1') else c;
  z   <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.
2.9.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

Before
  y <= a + b + c when (w = '1') else d;
  z <= a + c + d when (w = '1') else e;

After
  tmp <= a + c;
  y   <= b + tmp when (w = '1') else d;
  z   <= d + tmp when (w = '1') else e;
Subexpression Elimination
NOTE: Clocked subexpressions
Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
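The note above can be illustrated by keeping the shared subexpression in a concurrent assignment outside the clocked process. This is a sketch under assumptions: the entity cse_clocked and the 8-bit unsigned ports are invented for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical clocked version of the common-subexpression example.
entity cse_clocked is
  port (
    clk           : in  std_logic;
    w             : in  std_logic;
    a, b, c, d, e : in  unsigned(7 downto 0);
    y, z          : out unsigned(7 downto 0)
  );
end cse_clocked;

architecture main of cse_clocked is
  signal tmp : unsigned(7 downto 0);
begin
  -- Concurrent assignment: tmp is combinational, so no latency is added.
  tmp <= a + c;

  process (clk)
  begin
    if rising_edge(clk) then
      if w = '1' then
        y <= b + tmp;
        z <= d + tmp;
      else
        y <= d;
        z <= e;
      end if;
    end if;
  end process;
end main;
```

If the assignment to tmp were moved inside the "if rising_edge(clk)" block, y and z would use the value of a + c from the previous clock cycle.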
2.9.2.3 Computation Replication

To improve performance: If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.

To reduce area: If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

NOTE: Muxes are not free
Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.
2.9.3 Arithmetic
Arithmetic operators in VHDL are left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
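Both suggestions can be sketched in a few lines. The entity arith_tips and its port names are assumptions for illustration. Trimming the inputs before adding is safe here because the lower 12 bits of a sum depend only on the lower 12 bits of the operands.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example of balanced addition and input trimming.
entity arith_tips is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    sum4       : out unsigned(15 downto 0);
    low12      : out unsigned(11 downto 0)
  );
end arith_tips;

architecture main of arith_tips is
begin
  -- Parentheses suggest a balanced tree: two adders in parallel, then one.
  sum4 <= (a + b) + (c + d);

  -- Trim the inputs first, so only a 12-bit adder is built.
  low12 <= a(11 downto 0) + b(11 downto 0);
end main;
```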
2.9.4 Pipelining
Pipelines will not be covered in E&CE 427. This subsection is provided for those who already understand the basics of pipelining.

You can turn a dataflow diagram into a pipeline by making each clock cycle of the dataflow diagram a separate pipe stage. However, this can be complicated and error-prone. You need to worry about data hazards if you have state-holding registers in your algorithm. You need to worry about structural hazards if different instructions have different latencies.
LEC-10: FPGA-Specific Guidelines and Optimization

Schedule
wk-01–02   VHDL
  wk-01      Overview and VHDL
  wk-02      VHDL Semantics
wk-03–05   Design and Optimization
  wk-03      Dataflow Diagrams; State Machines
  wk-04      Memory Design; Example Design
  wk-05      Guidelines and Optimizations
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08      Timing Analysis
wk-08–10   Power Analysis and Reduction
wk-11–12   Faults and Testing
wk-13      Review

Overview
In this lecture we will go over some design guidelines and optimization techniques that are specific to FPGAs.

Concepts
Lecture Notes: Sections 2.10–2.11.2
- Coding guidelines for FPGAs
- Hardware for generic FPGAs
- Altera hardware
- Coding guidelines for Altera FPGAs
2.10 FPGA-Specific Guidelines
2.10.1 Generic FPGAs
2.10.1.1 Overview of Generic FPGA Hardware
Generic FPGA Cell

Cell = Logic Element (LE) in Altera = Configurable Logic Block (CLB) in Xilinx

[Figure: generic FPGA cell. Inputs carry_in, data_in, and ctrl_in; a combinational block (comb) and a flip-flop (D, CE); outputs data_out and carry_out]
Configurable Comb/Flop Connection

[Figure: generic FPGA cell with a configurable connection between the combinational block and the flip-flop. Signals: carry_in, comb_data_in, comb_data_out, flop_data_in, flop_data_out, ctrl_in, carry_out; flip-flop pins D, CE, R]

Separate Comb and Flop

[Figure: the cell configured as "Separate Comb and Flop"]

Connect Comb and Flop

[Figure: the cell configured as "Connect Comb and Flop"]

Flopped and Unflopped Outputs

[Figure: the cell configured as "Flopped and Unflopped Outputs"]

Generic FPGA Cell

[Figure: generic FPGA cell with flip-flop pins D, CE, R and outputs flop_data_out and carry_out]
Flip Flops Are Free
Flip-flops are almost free in FPGAs.

Reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops. Usually each 4-input, 1-output combinational circuit comes with a flip-flop.
Use It or Lose
Aim for using 80–90% of the cells on a chip.

Reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.

Reason: If you use less than 80% of the cells, then probably:
- there are optimizations that will increase performance and still allow the design to fit on the chip; or
- you spent too much human effort on optimizing for low area; or
- you could use a smaller (cheaper!) chip.

Exception: In E&CE 427 (unlike in real life), the mark is based on the actual number of cells used.
Area Estimation
You can estimate the area of a design by counting the number of flip-flops in the fanin of each flip-flop.

Reason: Each set of four source signals requires one cell.

Source flops   Cells
 1             1
 2             1
 3             1
 4             1
 5             2
 6             2
 7             2
 8             3
 9             3
10             3
11             4

Note: This technique is generally an overestimate, because a single cell can drive several other cells (common subexpression elimination).
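The table above is consistent with building a tree out of 4-input cells: the first cell absorbs up to four sources, and each additional cell absorbs three more sources plus the output of an earlier cell. The closed form below is inferred from the table rather than stated in the notes, with $n$ standing for the number of source flops:

```latex
\mathrm{cells}(n) = \max\!\left(1,\ \left\lceil \frac{n-1}{3} \right\rceil\right)
```

For example, $n = 8$ gives $\lceil 7/3 \rceil = 3$ cells and $n = 11$ gives $\lceil 10/3 \rceil = 4$ cells, matching the table.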
Local Connections for Generic Cell

NB: In these slides, the space between tightly grouped wires sometimes disappears, making a group of wires appear to be a single large wire.
- General-purpose interconnect (configurable, slow)
- Carry chains and cascade chains (vertically adjacent cells, fast)
Generic Blocks of Cells

Cells not used for computation can be used as wires to shorten the length of the path between cells.
2.10.1.2 Generic Clocks
Characteristics of clock signals:
- High fanout (drive many gates)
- Long wires (destination gates scattered all over the chip)

Characteristics of FPGAs:
- Very few gates that are large (strong) enough to support a high fanout
- Very few wires that traverse the entire chip and can be connected to every flip-flop
Clocks
Guideline for clock signals on FPGAs:
Use just one clock signal.

Reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.
Guideline for clock signals on FPGAs:
Use only one edge of the clock signal.

Reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.
2.10.1.3 Special Circuitry in FPGAs
Memory
For five or more years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated using the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.
Microprocessors
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

        Altera                      Xilinx: Virtex-II Pro
Hard    ARM 922T with 200 MIPS      PowerPC 405 with 420 D-MIPS
Soft    Nios with ?? MIPS           MicroBlaze with 100 D-MIPS

The Xilinx Virtex-II Pro has 4 PowerPCs and enough programmable hardware to implement a complete 32-bit microprocessor.
Arithmetic Circuitry
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

Altera: Mercury           16 × 16 at 130 MHz
Xilinx: Virtex-II Pro     18 × 18 at ??? MHz

Using these resources can significantly improve both the area and performance of a design.
Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

Vendor   Product
Altera   True-LVDS (1 Gbps)
Xilinx   Rocket I/O (3 Gbps)
2.10.2 Altera APEX20K
APEX20K Block Hierarchy

Chip
  52 Mega Logic Array Blocks (MegaLABs)
    1 Embedded System Block (ESB): memory and wide combinational functions
    16 Logic Array Blocks (LABs)
      10 Logic Elements (LEs)
        4-input lookup table
        Carry and cascade
        Flip-flop

Each level of hierarchy has its own interconnect (wires).
LE Computation and Storage

- 4-input lookup table (LUT)
- Carry-chain computation circuitry
- Cascade-chain computation circuitry
- Flip-flop with load, clear, clock-enable
LE Interconnect

- 4 data inputs
- 2 data outputs
- Carry in, carry out
- Cascade in, cascade out
- Clock, clock-enable
- Async clear, synch set (load), synch clear (reset)
- Global reset
2.11 Example Circuits
2.11.1 Ripple-Carry Adder

[Figure: Ripple-Carry Adder. Delay (ns) versus data width, both axes running from 0 to 70]
2.11.2 Barrel Shifter

This example illustrates:
- packages
- for-generate
- if-generate
- faking 2-dimensional arrays
Barrel Shifter Package

  library ieee;
  use ieee.std_logic_1164.all;

  package shift_const_pkg is
    constant width : integer := 28;
    constant depth : integer := 3;
  end shift_const_pkg;
Barrel Shifter Entity

  library ieee;
  use ieee.std_logic_1164.all;
  use work.shift_const_pkg.all;

  entity barrel_shift is
    port (
      clk : in  std_logic;
      di  : in  std_logic_vector(width - 1 downto 0);
      do  : out std_logic_vector(width - 1 downto 0);
      sel : in  std_logic_vector(depth - 1 downto 0)
    );
  end barrel_shift;
Barrel Shifter Architecture

  architecture main of barrel_shift is
    subtype word is std_logic_vector(width - 1 downto 0);
    -- x(d) is the data after d shift stages: a "2-dimensional" array
    type x_ty is array(depth downto 0) of word;
    signal x : x_ty;
  begin
    -- register the input and the output
    process (clk)
    begin
      if rising_edge(clk) then
        for w in width - 1 downto 0 loop
          x(0)(w) <= di(w);
          do(w)   <= x(depth)(w);
        end loop;
      end if;
    end process;

    -- stage d shifts right by 2**d positions when sel(d) is '1'
    for_d : for d in depth - 1 downto 0 generate
      for_w : for w in width - 1 downto 0 generate
        -- source bit would be beyond the msb: shift in a zero
        if_msb : if w + 2**d >= width generate
          x(d+1)(w) <= '0' when sel(d) = '1' else x(d)(w);
        end generate;
        -- normal case: take the bit 2**d positions to the left
        if_norm : if not (w + 2**d >= width) generate
          x(d+1)(w) <= x(d)(w + 2**d) when sel(d) = '1' else x(d)(w);
        end generate;
      end generate;
    end generate;
  end main;
Chapter 3
Functional Validation
LEC-11: Functional Validation of Datapath Circuits

Schedule
wk-01–02   VHDL
wk-03–05   Design and Optimization
wk-06      Functional Validation
             Lec-11 Datapath Validation and Testbenches
             Lec-12 Control Validation and Assertions
wk-07      Performance Analysis, Prediction, and Optimization
wk-08      Timing Analysis
wk-08–10   Power Analysis and Reduction
wk-11–12   Faults and Testing
wk-13      Review

Overview
The purpose of this lecture is to illustrate techniques to quickly and reliably detect bugs in datapath circuits. We will discuss validation of datapath circuits and introduce the notions of testbench, specification, and implementation. We'll illustrate a progression of techniques that can be used to go from very simple tests to more complete and complicated tests.
Concepts

- Specification
- Implementation
- Design Under Test (DUT)
- Unit Under Test (UUT)
- Test Bench
- Stimulus
- Manual tests
- Array of test vectors
- Generated tests
- Functional specification
- Relational specification
Background
Basic hardware and software debugging techniques
Reading (Smith)
Smith's ASIC:
- 10.2.7 : sample testbench
- 13.1 : levels of temporal abstraction for simulation
- 13.2 : simulation example
- 13.5 : different simulation models for hardware

Reading (Rushton and Ashenden)
Rushton's VHDL for Logic Synthesis:
- Ch 13 : Testbenches

Ashenden's Designer's Guide to VHDL:
- Sect 1.4 : Testbenches
- Sect 6.2.1 : Testing the Behavioural Model of a Pipelined Multiplier Accumulator
- Sect 6.3.3 : Testing the Register-Transfer-Level Model of a Pipelined Multiplier Accumulator
- Sect 15.3 : Testing the Behavioural Model of a DLX Computer System
- Sect 15.5 : Testing the Register-Transfer-Level Model of a DLX Computer System

Janick Bergeron's verification guild website: http://www.janick.bergeron.com/guild/default.htm
3.1 Overview
3.1.1 Validation / Verification / Testing
functional validation: checking that a design (e.g. RTL code) has the correct behaviour
- usually treats combinational circuitry as having zero delay
- usually done by simulating the circuit with test vectors
- big challenges are simulation speed and test generation
Terminology
formal verification: checking that a design has the correct behaviour for every possible input and internal state
- uses mathematics to reason about the circuit, rather than checking individual vectors of 1s and 0s
- capacity problems: only usable on detailed models of small circuits or abstract models of large circuits
- mostly a research topic, but some practical applications have been demonstrated
- tools include model checking and theorem proving
- formal verification is not a guarantee that the circuit will work correctly
performance validation: checking that the implementation has (at least) the desired performance
power validation: checking that the implementation has (at most) the desired power
equivalence verification (checking): checking that the design generated by a synthesis tool has the same behaviour as the RTL code
timing verification: checking that all of the paths in a circuit meet the timing constraints
Terminology Dogma (Formal Verification)

To the formal verification community, verification implies that all possible cases have been checked. In comparison, validation means that some, but not all, cases were checked. Obviously not everyone follows this convention...
Terminology Dogma (Hardware vs Software)

Note: In software, "testing" refers to running programs with specific inputs and checking if the program does the right thing. In hardware, testing usually means manufacturing testing, which is checking the circuits that come off of the manufacturing line.
3.1.2 Why Your First Circuit Will Not Work

Notes from Kenn Heinrich (UW E&CE grad)
Everyone should get a lecture on why their first industrial design won't work in the field. Here are a few reasons:
Unreachable States
1. You forgot to make your unreachable states transition to the initial (reset) state. Clock glitches, power surges, etc. will occasionally cause your system to jump to a state that isn't defined or produce an illegal data value. When this happens, your design should reset itself, rather than crash or generate illegal outputs.
Untestable Registers
2. You have internal registers that you can't access or test. If you can set a register, you must have some way of reading the register from outside the chip.
Cannot Isolate Your Chip

3. Another chip controls your chip, and the other chip is buggy. All of your external control lines should be able to be disabled, so that you can isolate the source of problems.
Insufficient Decoupling Capacitors

4. Not enough decoupling capacitors on your board. The analog world is cruel and unusual. Voltage spikes, current surges, crosstalk, etc. can all corrupt the integrity of digital signals. Trying to save a few cents on decoupling capacitors can cause headaches and significant financial costs in the future.
The Laboratory is Not Reality

5. You only tested your system in the lab, not in the real world. As a product, systems will need to run for months in the field. Simulation and simple lab testing won't catch all of the weirdness of the real world.
Unexplored Corner Cases

6. You didn't adequately test the corner cases and boundary conditions. Every corner case is as important as the main case. Even if some weird event happens only once every six months, if you do not handle it correctly, the bug can still make your system unusable and unsellable.
3.2 Test Cases
Test case / test vector: A combination of inputs and internal state values. Represents one possible test of the system.

Boundary conditions / corner cases: A test case that represents an unusual situation on input and/or internal state signals. Corner cases are likely to contain bugs.

Test scenario: A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit. For example, a scenario for an elevator controller might include a sequence of button pushes and movements between floors.

Test suite: A collection of test vectors that are run on a circuit.
3.2.1 Coverage
To be sure that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional validation.

Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?

Answer: The value of each combinational signal is determined by the flip-flops and inputs in its fanin. Once the values of the inputs and flip-flops are known, the value of each combinational signal can be calculated. Thus, the combinational signals do not add additional cases that we need to consider.

Definition (Coverage): The coverage that a suite of tests achieves on a circuit is the percentage of cases that are simulated by the tests. 100% coverage means that the circuit has been simulated for all combinations of values for input signals and internal signals.
NOTE: Coverage Terminology
There are many different types of coverage, which measure everything from the percentage of cases that are exercised to the number of output values that are exercised.
NOTE: Coverage Tools
There are many different commercial software programs that measure code and other types of coverage.

Company         Tool
Cadence         Affirma Coverage Analyzer
Cadence         DAI Coverscan
Fintronic       FinCov
interHDL        Coverit
Summit Design   HDLScore
Synopsys        CoverMeter
TransEDA        Verification Navigator
Verisity        SureCov
Veritools       Express VCT, VeriCover
Aldec           Riviera

Coverage types reported for these tools: code, expressions, fsm; code; bought by Avant!; ?; code, events, variables; code coverage (dead?); code and fsm; code, block, values, fsm; code, branch; code, block.
3.2.2 Heating System Example
This example is a simple heating system that might appear in a home.

- Three states: off, low, and high.
- The user can set the desired temperature to any value between 15°C and 25°C.
- There is a thermometer to measure the current temperature for values between 0°C and 40°C.
- The state machine in Figure 3.1 describes the transitions between states.
Transitions Between States

[Figure 3.1: Transitions between states. State diagram with states OFF, LOW, and HIGH, where diff = des_temp - cur_temp. Transition labels: 3 =< diff < 5; 5 =< diff; 7 =< diff; diff < -2; diff < -3]
Sample Scenario

[Figure 3.2: A sample scenario for the heating system. Waveform showing des_tmp, temperature values 23, 22, 20, 15, 13, and the current state stepping through off, high, low, off, low, high]
State and Signal Ranges

Item       Range            Num Values
state      off, low, high   3
cur_temp   0..40            41
des_temp   15..25           11

Figure 3.3: State and Signal Ranges
3.2.2.1 Number of Cases to Consider

Number of input values = 41 × 11 = 451
Number of states = 3
Number of cases = (number of inputs) × (number of states) = 451 × 3 = 1353

A total of 1353 cases to test.

Figure 3.4: Number of cases in heating systems
But, how many bits to represent 41, 11, or 3 values?

Signals are vectors of Boolean values. They must have $2^n$ possible values.

Item       Range            Num Values   Bits   Representable Values
state      off, low, high   3            2      4
des_temp   15..25           11           4      16
cur_temp   0..40            41           6      64

Figure 3.5: Number of bits for signals in heating systems
Actual Number of Cases to Consider

Number of input values = 64 × 16 = 1024
Number of states = 4
Number of cases = 1024 × 4 = 4096

Three times more cases to consider than originally thought.

Figure 3.6: Actual number of cases to consider
3.2.2.2 Representation Simplification

Two-thirds of the representable values are illegal or unused. Unused values lead to wasted area in the circuit and increase the validation effort.
Adjust Range to be Powers of Two

Item       Range             Num Values   Bits   Actual Num of Values
state      off, low, high    3            2      4
des_temp   15..25            11           4      16
           12..27            16           4      16
           17..24            8            3      8
cur_temp   0..40             41           6      64
           -20..43           64           6      64
           (17-7)..(24+4)    19           5      32
           (17-5)..(24+3)    16           4      16
Scenario with Adjusted Ranges

[Figure: the sample scenario repeated with adjusted ranges. Waveform showing des_tmp, temperature values 23, 22, 20, 15, 13, and the current state stepping through off, high, low, off, low, high]
Notice that with adjusted ranges, there is very little change in behaviour.
State Machine with Adjusted Ranges

[Figure: state diagram with states OFF, LOW, and HIGH and adjusted transition labels: 2 =< diff < 4; 4 =< diff; 5 =< diff; diff < -2; diff < -3]
Reduced Number of Cases to Consider

                                  Old    New
Number of legal input values      451    128
Number of illegal input values    573    0
Total number of input values      1024   128
Number of legal state values      3      3
Number of illegal state values    1      1
Total number of state values      4      4
Number of legal cases             1353   384
Number of illegal cases           2743   128
Total number of cases             4096   512

Choosing data ranges to be powers of two reduces the number of illegal input and internal state values.
3.2.3 Floating Point Divider Example
Consider doing the functional simulation for a double-precision (64-bit) floating-point divider.

Given Information
Data width                                                           64 bits
Number of gates in circuit                                           10 000
Number of assembly-language instructions to simulate one gate
  for one test case                                                  100
Number of clock cycles required to execute one assembly-language
  instruction on the computer that is running the simulation         0.5
Clock speed of computer that is running the simulation               1 Gigahertz
Number of Cases
Question: How many cases must be considered?
Answer:

item   bits   num values
src1   64     $2^{64} = 1.8\mathrm{E}19$
src2   64     $2^{64} = 1.8\mathrm{E}19$

$$\mathit{NumTestsTot} = \mathit{NumInputCases} \times \mathit{NumStateCases} = (2^{64} \times 2^{64}) \times 2^{0} = 3.4\mathrm{E}38 \text{ cases}$$
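Simulation Run Time

Question: How long will it take to simulate all of the different possible cases?
Answer:

$$\mathit{TestTimeTot} = 10000 \text{ gates} \times 100\ \frac{\text{instrs}}{\text{gate}} \times 0.5\ \frac{\text{cycles}}{\text{instr}} \times 1\mathrm{E}{-9}\ \frac{\text{secs}}{\text{cycle}} \times 3.4\mathrm{E}38 \text{ cases} = 1.7\mathrm{E}35 \text{ secs}$$

At about 3.16E7 seconds per year, this is roughly 5.4E27 years.

Learn the general technique, not the specific formula!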
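Coverage
Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
Answer:

1. Calculate the number of seconds to simulate one test case on one computer:

$$\mathit{TestTime}_{1:1} = 10000 \text{ gates} \times 100\ \frac{\text{instrs}}{\text{gate}} \times 0.5\ \frac{\text{cycles}}{\text{instr}} \times 1\mathrm{E}{-9}\ \frac{\text{secs}}{\text{cycle}} = 5\mathrm{E}{-4} \text{ secs}$$

2. Calculate the number of seconds to simulate one test using 10 computers:

$$\mathit{TestTime}_{1:10} = \frac{\mathit{TestTime}_{1:1}}{10 \text{ comps}} = \frac{5\mathrm{E}{-4} \text{ secs}}{10} = 5\mathrm{E}{-5} \text{ secs}$$

3. Calculate the number of tests per year using ten computers:

$$\text{secs per year} = 60\ \frac{\text{secs}}{\text{min}} \times 60\ \frac{\text{mins}}{\text{hour}} \times 24\ \frac{\text{hours}}{\text{day}} \times 365.25\ \frac{\text{days}}{\text{year}} = 3.16\mathrm{E}7 \text{ secs}$$

$$\mathit{NumTests}_{:10} = \frac{3.16\mathrm{E}7 \text{ secs}}{\mathit{TestTime}_{1:10}} = \frac{3.16\mathrm{E}7 \text{ secs}}{5\mathrm{E}{-5} \text{ secs}} = 6.3\mathrm{E}11 \text{ cases}$$

4. Calculate the coverage achieved by running tests on ten computers for one year:

$$\mathit{Covg} = \frac{\mathit{NumTestsRun}}{\mathit{NumTestsTot}} = \frac{\mathit{NumTests}_{:10}}{\mathit{NumTestsTot}} = \frac{6.3\mathrm{E}11}{3.4\mathrm{E}38} \approx 1.9\mathrm{E}{-27} = 1.9\mathrm{E}{-25}\,\%$$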
3.2.4 Functional Validation Challenges
From "Validating the Intel(R) Pentium(R) Microprocessor" by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 427 web page.)

- Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
- By tapeout, over 200 billion simulation cycles had been run on a network of computers.
- All of these simulations represent less than two minutes of running a real processor.
Research
Research challenges:
1. How to make simulations run faster?
2. How to choose test cases so that the cases that are run are likely to detect bugs?
Research activities in functional validation:
1. Simulation acceleration
2. Coverage analysis
3. Test generation
4. Formal verification
Practice
Challenges in practice:
1. Writing the specification
2. Identifying corner cases
3. Choosing test cases
4. Finding the root cause of unexpected behaviour
3.3 Testbenches
A test bench (also known as a test rig, test harness, or test jig) is a collection of code used to simulate a circuit and check if it works correctly. Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of VHDL. Use the full power of VHDL to make your testbenches concise and powerful.
3.3.1 Overview of Test Benches

[Figure: a testbench containing stimulus, specification, and check blocks around the implementation]

Implementation: Circuit that you're checking for bugs. Also known as design under test or unit under test.
Stimulus: Generates test vectors.
Specification: Describes the desired behaviour of the implementation.
Check: Checks whether the implementation obeys the specification.
Notes and observations
- Testbenches usually do not have any inputs or outputs. Inputs are generated by the stimulus. Outputs are analyzed by the check, and relevant information is printed using report statements.
- Different circuits will use different stimuli, specifications, and checks.
- The roles of the specification and check are somewhat flexible. Most circuits will have complex specifications and simple checks. However, some circuits will have simple specifications and complex checks.
- If two circuits are supposed to have the same behaviour, then they can use the same stimuli, specification, and check.
- If two circuits are supposed to have the same behaviour, then one can be used as the specification for the other.
- Testbenches are restricted to stimulating only primary inputs and observing only primary outputs. To check the behaviour of internal signals, use assertions (Lec-12).
3.3.2 Reference Model Style Testbench

[Figure: reference-model testbench containing stimulus, specification, and implementation blocks]

- Specification has the same inputs and outputs as the implementation.
- Specification is a clock-cycle accurate description of the desired behaviour of the implementation.
- Check is an equality test between the outputs of the specification and the implementation.
Examples
- Execution modules: output is sum, difference, product, quotient, etc. of inputs
- DSP filters
- Instruction decoders
NOTE: Functional specification vs reference model. The terms "functional specification" and "reference model" are often used interchangeably.
3.3.3 Relational Style Testbench

[Figure: relational testbench containing stimulus, check, and implementation blocks]

- Relational testbenches, or relational specifications, are used when we do not want to specify the specific output values that the implementation must produce.
- Instead, we want to check that some relationship holds between the output and the input, or that some relationship holds amongst the output values (independent of the values of the input signals).
- Specification is usually just wires to feed the input signals to the check.
- Check is the brains and encodes the desired behaviour of the circuit.
Examples
- Carry-save adders: the two outputs are the sum of the three inputs, but do not specify exact values of each individual output.
- Arbiters: every request is eventually granted, but do not specify in which order requests are granted.
- One-hot encoding: exactly one bit of the vector is a 1, but do not specify which bit is a 1.
NOTE: Relational specification vs relational testbench. The terms "relational specification" and "relational testbench" are often used interchangeably.
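The one-hot example can be sketched as a relational check. This is a minimal sketch, not code from the course: the entity name onehot_check_tb, the signal tv (standing in for the output of an implementation), and the helper function count_ones are all hypothetical names chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity onehot_check_tb is
end onehot_check_tb;

architecture main_tb of onehot_check_tb is
  -- tv stands in for the implementation output (hypothetical signal)
  signal tv : std_logic_vector(3 downto 0);
  signal ok : boolean;

  -- hypothetical helper: count the number of '1' bits in a vector
  function count_ones (v : std_logic_vector) return natural is
    variable n : natural := 0;
  begin
    for i in v'range loop
      if v(i) = '1' then
        n := n + 1;
      end if;
    end loop;
    return n;
  end count_ones;
begin
  -- stimulus: in a real testbench, tv would be driven by the implementation
  stimulus : process
  begin
    tv <= "0100"; wait for 10 ns;  -- exactly one bit set
    tv <= "0110"; wait for 10 ns;  -- two bits set
    tv <= "0000"; wait for 10 ns;  -- no bits set
    wait;                          -- stop after the last vector
  end process;

  -- relational check: exactly one bit is '1', without saying which bit
  check : process (tv)
  begin
    ok <= (count_ones(tv) = 1);
  end process;
end main_tb;
```

The check never names the bit that should be set, which is what makes it relational rather than a reference model.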
3.3.4 Coding Structure of a Testbench

[Figure: testbench containing specification, stimulus, check, and implementation blocks]

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;
3.3.5 Datapath vs Control
Datapath and control circuits tend to use different styles of testbenches.
Datapath Validation
Datapath circuits tend to be well-suited to reference-model style testbenches:
- Each set of inputs generates one set of outputs
- Each set of outputs is a function of just one set of inputs
Control Validation
Control circuits often pose problems for testbenches:
- Many more internal signals than outputs.
- The behaviour of the outputs provides a view into only a fragment of the current state of the circuit.
- It may take many clock cycles from when a bug is exercised inside the circuit until it generates a deviation from the correct behaviour on the outputs.
- When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause of the deviation (the root cause of the bug).
Assertions (Lec-12) can be used to check the behaviour of internal signals. Control circuits tend to use assertions to check correctness and rely on testbenches only to stimulate inputs.
3.4 Functional Validation for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate. The process scales well to very large circuits. The process allows validation to begin as soon as a circuit is simulatable, even before a complete specification has been written.
Implementation

library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port ( a, b : in  std_logic;
         c    : out std_logic );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1') else '0';
end main;
3.4.1 A Spec-Less Testbench
(NOTE: this code has not been checked for correctness)
First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

library ieee;
use ieee.std_logic_1164.all;

entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2
    port ( a, b : in  std_logic;
           c    : out std_logic );
  end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;

Use this testbench until the implementation generates solid Boolean values (no 'X' or 'U' data) and you have checked that a few simple test cases generate correct outputs.
3.4.2 Use an Array for Test Vectors
Writing code to drive inputs and repeatedly typing "wait for 10 ns;" can get tedious, so code up the test vectors in an array.
(NOTE: this code has not been checked for correctness)

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --  a    b
      ( ( '0', '0'),
        ( '1', '1') );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;

Use this testbench until checking the correctness of the outputs by hand using a waveform viewer becomes difficult.
3.4.3 Build Spec into Stimulus
(NOTE: this code has not been checked for correctness)
After a few test vectors appear to be working correctly (via a manual check of waveforms on simulation), begin automatically checking that outputs are correct:
- Add expected result to stimulus
- Add check process

architecture main_tb of and2_tb is
  ...
  -- the declarations must also include tc_spec (a std_logic signal)
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    type test_datum_ty is record
      ra, rb, rc : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      -- a, b: inputs
      -- c   : expected output
      --  a    b    c
      ( ( '0', '0', '0'),
        ( '0', '1', '0'),
        ( '1', '1', '1') );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta      <= test_vectors(i).ra;
      tb      <= test_vectors(i).rb;
      tc_spec <= test_vectors(i).rc;
      wait for 10 ns;
    end loop;
  end process;
  ------------------------------------------
  check : process (tc_impl, tc_spec)
  begin
    ok <= (tc_impl = tc_spec);
  end process;
  ------------------------------------------
end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test case.
3.4.4 Have Separate Specification Entity
Rather than write the specification as part of stimulus, create a separate specification entity/architecture. The specification component then calculates the expected output values. (NOTE: if your simulation tool supports configurations, the spec and impl can share the same entity; we'll see this in section 3.5.)

entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;

architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2      port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --  a    b
      ( ( '0', '0'),
        ( '1', '1') );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
  ------------------------------------------
  check : process (tc_impl, tc_spec)
  begin
    ok <= (tc_impl = tc_spec);
  end process;
  ------------------------------------------
end main_tb;
3.4.5 Generate Test Vectors
When it becomes tedious to write out each test vector by hand, we can automatically compute them. This example uses a pair of nested for loops to generate all four combinations of input values for two signals.

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    -- restrict std_logic to the two values '0' and '1'
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;
3.4.6 Relational Specification
Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value. To do this, we drop the spec process, and put the brains into the check process.

architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  -- there is no tc_spec any more, so the check reads the inputs directly
  check : process (ta, tb, tc_impl)
  begin
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;
LEC-12: Functional Validation of State Machines

Schedule
Weeks (in order): wk-01 to 02, wk-03 to 05, wk-06, wk-07, wk-08, wk-08 to 10, wk-11 to 12, wk-13
Topics (in order):
VHDL Design and Optimization
Functional Validation
  Lec-11 Datapath Validation and Testbenches
  Lec-12 Control Validation and Assertions
Performance Analysis, Prediction, and Optimization
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
This lecture illustrates techniques for validating state machines by using a FIFO queue. The lecture goes over an implementation, specification, and testbench. The verification uses assertions and coverage monitors inside the implementation to improve the chances of catching bugs.
Concepts
- Don't care conditions: conditions or situations where we don't care what the implementation does.
- Use of uninitialized data: implementation should start with 'U' on all signals.
- assert and report statements: printing error messages to the screen.
- Instrumentation code: code that is added to the design but will not appear in hardware. Used to measure (instrument) behaviour of internal signals in the circuit. Often used to aid in validation, performance analysis, etc.
- Coverage monitors: processes that help check if test vectors are fully exercising the behaviour of the implementation.
- Assertions: properties that the behaviour of internal signals should obey.
- Running multiple scenarios from one test bench
- General VHDL coding guidelines
VHDL Constructs and Ideas
- separate package and package body
- assert, report
- textio package: read, write, readline
- don't care ('-')
- std_match
Background
State machine design
Reading
None
3.5 Functional Validation of Control Circuits

Control circuits are often more challenging to validate than datapath circuits:
- Control circuits have many internal signals. Testbenches are unable to access key information about the behaviour of a control circuit.
- Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect value and when an output signal shows the effect of the bug.
In this section, we will explore the functional validation of state machines via a First-In First-Out queue. The VHDL code for the queue is on the web at: http://www.ece.uwaterloo.ca/ece427/exs/queue
3.5.1 Overview of Queues in Hardware

[Figure 3.7: Structure of queue (write and read ports)]
[Figure 3.8: Write Sequence]
[Figure 3.9: A Second Example Write]
[Figure 3.10: Example Read Sequence]
[Figure 3.11: Write Illustrating Index Wrap]
[Figure 3.12: Write Illustrating Full Queue]
[Figure 3.13: Queue Signals (do_rd, do_wr, rd_idx, wr_idx, data_rd, data_wr, mem, empty)]
[Figure 3.14: Incomplete Queue Blocks. Control circuitry not shown.]
3.5.2 VHDL Coding
3.5.2.1 Package
Things to notice in queue package:
1. separation of package and body

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;

package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;
3.5.2.2 Other VHDL Coding

VHDL coding techniques to notice in queue implementation:
1. type declaration for vectors
2. attributes
   (a) Smith pp. 420, 421 (Tables 10.14, 10.15)
   (b) 'low, 'high, 'length, ...
3. functions (reduce overall implementation and maintenance effort)
   (a) reduce redundant code
   (b) hide implementation details
   (c) (just like software engineering....)
3.5.3 Code Structure for Validation
Validation things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions

architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk)
  begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;
3.5.4 Instrumentation Code

process (clk)
begin
  if rising_edge(clk) then
    prev_rd_idx <= rd_idx;
    prev_wr_idx <= wr_idx;
    prev_do_rd  <= do_rd;
    prev_do_wr  <= do_wr;
  end if;
end process;

- Added to implementation to support validation
- Usually keeps track of previous values of signals
- Does not create hardware (optimized away during synthesis)
- Does not feed any output signals
- Must use synthesizable subset of VHDL
Naming Convention
NOTE: Naming convention for instrumentation. For assertions, signals are named prev_signame and signame, rather than next_signame and signame as is done for state machines. This is because for assertions we use the prev signals as history signals, to keep track of past events. In contrast, for state machines, we name the signals next, because the state machine computes the next values of signals.
3.5.5 Coverage Monitors
The goal of a coverage monitor is to check if a certain event is exercised in a simulation run. If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that will trigger the monitor. For example, for a circuit used in a microwave oven controller, we might want to make sure that we simulate the situation when the door is opened while the power is on.
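The microwave oven example could be monitored roughly as follows. This is a minimal sketch under stated assumptions: the entity oven_cov and the signals door_open and power_on are hypothetical names, and in practice the monitor process would sit inside the controller's own architecture rather than in a separate entity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- hypothetical wrapper entity so that the sketch is self-contained
entity oven_cov is
  port ( door_open, power_on : in std_logic );
end oven_cov;

architecture main of oven_cov is
begin
  -- coverage monitor: report when the door opens while the power is on
  process (door_open, power_on)
  begin
    if (door_open'event and door_open = '1') and (power_on = '1') then
      report "coverage: door opened while power on";
    end if;
  end process;
end main;
```

If this message never shows up in the simulation log, the test suite has not exercised the event and a new test vector is needed.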
Steps to Creating Coverage Monitors

1. Identify important events, conditions, transitions
2. Write instrumentation code to detect the event
3. Use report to write when the event happens
4. When the simulation is run, report statements will print when a coverage condition is detected
5. Pipe simulation results to a log file
6. Examine the log file and coverage monitors to find cases and transitions not tested by existing test vectors
7. Add test vectors to exercise missing cases
8. Idea: automate detection of missing cases using a Perl script to find coverage messages in the VHDL code that aren't in the log file
9. Real world: most commercial synthesis tools come with add-on packages that provide different types of coverage analysis
10. Research/entrepreneurial idea: based on missing coverage cases, find new test vectors to exercise the case
Coverage Events for Queue

[Figure: three diagrams showing the relative positions of wr and rd indices, Prev vs Now]

- wr_idx and rd_idx are far apart
- wr_idx and rd_idx are equal
- wr_idx catches rd_idx
- rd_idx catches wr_idx
- rd_idx wraps
- wr_idx wraps
Coverage Monitor Template

process (signals read)
begin
  if (condition) then
    report "coverage: message";
  elsif (condition) then
    report "coverage: message";
  else
    report "error: case fall through on message" severity warning;
  end if;
end process;
Coverage Monitor Code

Events related to rd_idx equals wr_idx.

process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
  if (rd_idx = wr_idx) then
    if ( prev_rd_idx = prev_wr_idx ) then
      report "coverage: read = write both moved";
    elsif ( rd_idx /= prev_rd_idx ) then
      report "coverage: Read caught write";
    elsif ( wr_idx /= prev_wr_idx ) then
      report "coverage: Write caught read";
    else
      report "error: case fall through on rd/wr catching" severity warning;
    end if;
  end if;
end process;
Coverage Monitor Code

Events related to rd_idx wrapping.

process (rd_idx)
begin
  if (rd_idx = low_idx) then
    report "coverage: rd mv to low";
  elsif (rd_idx = high_idx) then
    report "coverage: rd mv to high";
  else
    report "coverage: rd mv normal";
  end if;
end process;
3.5.6 Assertions
Assertions for Queue

1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was 1, or reset is 1.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was 1, or reset is 1.
5. And many others....
Assertion Template

process (signals read)
begin
  assert (required condition)
    report "error: message" severity warning;
end process;
Assertions: Read Index

process (rd_idx)
begin
  assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
    report "error: rd inc" severity warning;
  assert ((prev_do_rd = '1') or (reset = '1'))
    report "error: rd imp do_rd" severity warning;
end process;
Assertions: Write Index

process (wr_idx)
begin
  assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
    report "error: wr inc" severity warning;
  assert ((prev_do_wr = '1') or (reset = '1'))
    report "error: wr imp do_wr" severity warning;
end process;
3.5.7 VHDL Coding Tips
Vector Type Declaration

type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);
Functions

function to_idx (i : natural range data_array'low to data_array'high) return idx_ty is
begin
  return to_unsigned(i, idx_ty'length);
end to_idx;

Conversion to Index
  Without function:  rd_idx <= to_unsigned(5, 3);
  With function:     rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.
Attributes

function inc_idx (idx : idx_ty) return idx_ty is
begin
  if idx < data_array'high then
    return (idx + 1);
  else
    return (to_idx(data_array'low));
  end if;
end inc_idx;
Feedback Loops, and Functions

Coding guideline: use functions. Don't use procedures.
  inc as function:   wr_idx <= inc_idx(wr_idx);
  inc as procedure:  inc_idx(wr_idx);
Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine the modes of signals. Modifying a signal within a procedure results in a tri-state signal. This is bad.
File I/O (textio package)

TEXTIO defines the read, write, readline, and writeline procedures. These can be used to read test vectors from a file and write results to a file. Described in:
- Smith 10.6.3
- http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio
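As an illustration of reading test vectors from a file, here is a minimal sketch of a stimulus process for the two-input AND gate testbench. The entity name and2_file_tb and the file name and2_vectors.txt are assumptions made for this sketch, as is the file format (one vector per line, two bits separated by a space, e.g. "0 1"). The standard textio package reads type bit, so the sketch converts to std_logic with to_stdulogic from std_logic_1164.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use std.textio.all;

entity and2_file_tb is
end and2_file_tb;

architecture main_tb of and2_file_tb is
  signal ta, tb : std_logic;
begin
  stimulus : process
    -- hypothetical file name; each line holds two bits, e.g. "0 1"
    file vec_file   : text open read_mode is "and2_vectors.txt";
    variable ln     : line;
    variable va, vb : bit;
  begin
    while not endfile(vec_file) loop
      readline(vec_file, ln);   -- fetch one line of the file
      read(ln, va);             -- first field: value for a
      read(ln, vb);             -- second field: value for b
      ta <= to_stdulogic(va);
      tb <= to_stdulogic(vb);
      wait for 10 ns;
    end loop;
    wait;                       -- stop when the file is exhausted
  end process;
end main_tb;
```

Keeping vectors in a file means new test cases can be added without recompiling the testbench.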
3.5.8 Queue Specification
Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices. The specification should be obviously correct. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.
Write Index Update in Specification

We increment the write index on every write; we never wrap.

process (clk)
begin
  if rising_edge(clk) then
    if (reset = '1') then
      wr_idx <= 0;
    elsif (do_wr = '1') then
      wr_idx <= wr_idx + 1;
    end if;
  end if;
end process;
Things to Notice
Things to notice in queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)
Don't Care

rd_data <= data_array(rd_idx) when (do_rd = '1') else (others => '-');
3.5.9 Queue Testbench
Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data ('U')
3. std_match to compare spec and impl data
With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
[Table: values matched by std_match. Recoverable entries: '0' matches '0' and 'L'; '1' matches '1' and 'H'; '-' matches everything.]
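The difference between = and std_match can be seen in a small self-contained sketch. The entity name match_tb and the signals rd_data_spec, rd_data_impl, ok_eq, and ok_match are hypothetical names for illustration; std_match itself is provided by ieee.numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- provides std_match

entity match_tb is
end match_tb;

architecture main_tb of match_tb is
  signal rd_data_spec, rd_data_impl : std_logic_vector(3 downto 0);
  signal ok_eq, ok_match : boolean;
begin
  stimulus : process
  begin
    -- spec is all don't care: = fails, std_match succeeds
    rd_data_spec <= "----";  rd_data_impl <= "1010";  wait for 10 ns;
    -- identical values: both comparisons succeed
    rd_data_spec <= "1010";  rd_data_impl <= "1010";  wait for 10 ns;
    -- genuine mismatch: both comparisons fail
    rd_data_spec <= "1010";  rd_data_impl <= "1011";  wait for 10 ns;
    wait;
  end process;

  check : process (rd_data_spec, rd_data_impl)
  begin
    ok_eq    <= (rd_data_spec = rd_data_impl);          -- plain equality
    ok_match <= std_match(rd_data_spec, rd_data_impl);  -- '-' matches anything
  end process;
end main_tb;
```

Only ok_match treats the don't-care specification value as a pass, which is the behaviour the queue testbench needs.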
Stimulus Process Structure

The stimulus process runs multiple test sequences in a single simulation run.

stimulus : process
  type test_datum_ty is record
    r_reset, ... normal fields ...
  end record;
  type test_vectors_ty is array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    ( -- reset ...
      ( '1', normal fields),
      ( '0', normal fields),
      ...
      -- wr_idx passes rd_idx (overwrite entries)
      -- reset ...
      ( '1', normal fields),
      ( '0', normal fields),
      ...
    );
begin
  for i in test_vectors'range loop
    if (test_vectors(i).r_reset = '1') then
      ... reset code ...
    end if;
    reset <= '0';
    ... normal sequence ...
    wait until rising_edge(clk);
  end loop;
end process;

After reset is asserted, set signals to 'U'.
Chapter 4
Performance Analysis and Optimization
4.1 Intro
4.1.1 Concepts
- definition of performance
- different ways of measuring performance
- comparing performance (speedup, n% faster)
- improving performance
- Amdahl's law (limits on performance improvements)
4.1.2 Background Material
Algebra, basic familiarity with assembly language
4.1.3 Reading Material
Performance is not described in Smith's book. Hennessy and Patterson's "Computer Architecture: A Quantitative Approach" (textbook for E&CE 429) has good information on performance.
LEC-13: Introduction to Performance Analysis

Schedule
Weeks (in order): wk-01 to 02, wk-03 to 05, wk-06, wk-07, wk-08, wk-08 to 10, wk-11 to 12, wk-13
Topics (in order):
VHDL Design and Optimization
Functional Validation
Performance Analysis, Prediction, and Optimization
  Lec-13 Computer Performance
  Lec-14 Digital Circuit Performance
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
No more VHDL in lectures!

This lecture introduces the concepts behind performance measurement and illustrates the importance of mathematical analysis when making performance tradeoffs. This lecture overlaps with some material in the computer architecture course E&CE 429. The second lecture on performance will apply performance analysis to dataflow diagrams and so will not overlap with E&CE 429.
Concepts
- definition of performance
- different ways of measuring performance
- comparing performance (speedup, n% faster)
- improving performance
- Amdahl's law (limits on performance improvements)
- clock speed, program length, CPI, and performance
Background
Algebra, basic familiarity with assembly language
Reading
Performance is not described in Smith's book. Hennessy and Patterson's "Computer Architecture: A Quantitative Approach" (textbook for E&CE 429) has good information on performance.
4.2 Defining Performance
$$ \text{Performance} = \frac{\text{Work}}{\text{Time}} $$
You can double your performance by:
- doing twice the work in the same amount of time, OR
- doing the same amount of work in half the time
Benchmarking
$$ \text{Performance} = \frac{\text{Work}}{\text{Time}} $$
Measuring time is easy, but how do we accurately measure work? The game of benchmarking is finding a definition of work that makes your system appear to get the most work done in the least amount of time.

Measure of Work      Measure of Performance
clock cycle          MHz
instruction          MIPS
synthetic program    Dhrystone, Whetstone, D-MIPS (Dhrystone MIPS)
real program         SPEC
travel 1/4 mile      drag race
SPEC Benchmarks
Definition. SPEC: Standard Performance Evaluation Corporation. MISSION: "To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems." http://www.spec.org
4.3 Comparing Performance
We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers....)

                   printer1   printer2
Black and White    9ppm       12ppm
Colour             6ppm       4ppm

Question: Which printer is faster at B&W and how much faster is it?
$$ n\%\ \text{faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} $$
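BW Performance
Answer:
$$ BW_1 = \frac{1}{9\,\text{ppm}} = 0.1111\ \text{min/page} $$
$$ BW_2 = \frac{1}{12\,\text{ppm}} = 0.0833\ \text{min/page} $$
$$ BW_{Faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{BW_1 - BW_2}{BW_2} = \frac{0.1111 - 0.0833}{0.0833} = 33\%\ \text{faster} $$
Printer 2 is 33% faster than printer 1 at black and white printing.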
4.3.1 Performance for Different Tasks
Question: If the average workload is 90% BW and 10% Colour, which printer is faster and how much faster is it? A potentially helpful formula is the average time to do one of k different tasks:
$$ T_{Avg} = \sum_{i=1}^{k} \%_i \times T_i $$
Answer:
$$ T_{Avg1} = \%_{BW} \times BW_1 + \%_{C} \times C_1 = (0.90 \times 0.1111) + (0.10 \times 0.1667) = 0.1167\ \text{min/page} $$
$$ T_{Avg2} = \%_{BW} \times BW_2 + \%_{C} \times C_2 = (0.90 \times 0.0833) + (0.10 \times 0.2500) = 0.1000\ \text{min/page} $$
$$ Avg_{Faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{T_{Avg1} - T_{Avg2}}{T_{Avg2}} = \frac{0.1167 - 0.1000}{0.1000} = 16.7\%\ \text{faster} $$
Printer 2 is 16.7% faster than printer 1 on this workload.
4.3.2 Optimizing Performance
Question: If we want to optimize printer1 to match the performance of printer2, should we optimize BW or Colour printing?
Answer:
Colour printing is slower, so it appears that we can save more time by optimizing colour printing. However, look at the extreme case of optimizing colour printing to be instantaneous for P1:
[Figure: bar chart of average time per page (axis 0.000 to 0.150 min/page) for P1 and P2]
Even if we made colour printing instantaneous for printer 1 and kept it the same for printer 2, printer 1 would not be measurably faster.
Amdahl's law: Make the common case fast.
Optimizations need to take into account both run time and frequency of occurrence.
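The extreme case can be checked with the average-time formula from Section 4.3.1. The following is a worked sketch, assuming the same 90% BW and 10% Colour workload and setting the colour time of printer 1 to zero; the primed symbol is introduced here only for this illustration.

```latex
% Printer 1 with instantaneous colour printing (C_1 = 0), 90/10 workload
T'_{Avg1} = (0.90 \times BW_1) + (0.10 \times 0)
          = 0.90 \times 0.1111
          = 0.1000\ \text{min/page}

% Printer 2 is unchanged
T_{Avg2}  = (0.90 \times 0.0833) + (0.10 \times 0.2500)
          = 0.1000\ \text{min/page}
```

Under these assumptions the two averages are equal to four decimal places: removing the colour time entirely only brings printer 1 level with printer 2, because colour is just 10% of the workload.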
Optimization without Engineering

Question: If you have to fire all of the engineers because your stock price plummeted, how can you get printer1 to be faster than printer2?
NOTE: Hmmmm. This question was actually humorous during the high-tech bubble...
Answer:
Hire more marketing people! Notice that colour printing on printer 1 is faster than on printer 2. So, marketing suggests that people are increasing the percentage of printing that is done in colour.
Question: Revised question: what percentage of printing must be done in colour for printer1 to beat printer2?
Answer: Find the break-even point, where the two average times are equal:
$$ T_{Avg1} = T_{Avg2} $$
$$ \%_{BW} \times BW_1 + \%_{C} \times C_1 = \%_{BW} \times BW_2 + \%_{C} \times C_2 $$
$$ \%_{BW} = 1 - \%_{C} $$
$$ (1 - \%_{C}) \times BW_1 + \%_{C} \times C_1 = (1 - \%_{C}) \times BW_2 + \%_{C} \times C_2 $$
$$ BW_1 - BW_2 = \%_{C} \times (BW_1 - BW_2 + C_2 - C_1) $$
$$ \%_{C} = \frac{BW_1 - BW_2}{BW_1 - BW_2 + C_2 - C_1} = \frac{0.1111 - 0.0833}{0.1111 - 0.0833 + 0.2500 - 0.1667} = 0.25 $$
Printer 1 beats printer 2 when more than 25% of printing is done in colour.
4.4 Clock Speed, CPI, Program Length, and Performance
4.4.1 Mathematics
CPI:         Cycles per instruction
NumInsts:    Number of instructions
ClockSpeed:  Clock speed
$$ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} $$
4.4.2 Example: CISC vs RISC and CPI

                   Clock Speed   SPECint
AMD Athlon         1.2GHz        409
Fujitsu SPARC64    675MHz        443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.

SPECint and Performance
Question: Which of the two processors has higher performance?
Answer: SPECint, SPECfp, and SPEC are measures of performance. Therefore, the higher the SPEC number, the higher the performance. The Fujitsu SPARC64 has higher performance.
Relative CPI
Question: What is the ratio between the CPIs of the two microprocessors?
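One way to approach this question is sketched below. The slides do not give an answer, so this is a worked sketch under stated assumptions: SPECint is taken to be proportional to 1/Time for the same benchmark programs, and the Sparc version of a program is taken to have 1.2 times as many instructions as the IA-32 version (the 20% figure above). The subscripts A (Athlon) and S (SPARC64) are introduced here for illustration.

```latex
% From Time = NumInsts * CPI / ClockSpeed and Performance proportional to 1/Time:
%   CPI is proportional to ClockSpeed / (NumInsts * Performance)
\frac{CPI_A}{CPI_S}
  = \frac{ClockSpeed_A}{ClockSpeed_S}
    \times \frac{NumInsts_S}{NumInsts_A}
    \times \frac{SPECint_S}{SPECint_A}
  = \frac{1200}{675} \times 1.2 \times \frac{443}{409}
  \approx 2.31
```

Under these assumptions the Athlon needs roughly 2.3 times as many clock cycles per instruction as the SPARC64, which is consistent with a CISC instruction doing more work per instruction than a RISC instruction.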
Absolute CPI
Question: Can you determine the absolute (actual) CPI of either microprocessor?
4.4.3 Summary of Equations
Time to perform a task:
$$ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} $$
Average time to do one of k different tasks:
$$ T_{Avg} = \sum_{i=1}^{k} \%_i \times T_i $$
Performance:
$$ \text{Performance} = \frac{\text{Work}}{\text{Time}} $$
Speedup:
$$ \text{Speedup} = \frac{T_{Slow}}{T_{Fast}} $$
TFast is n% faster than TSlow:
$$ n\%\ \text{faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} $$
LEC-14: Performance and Dataflow Diagrams

Schedule
Weeks (in order): wk-01 to 02, wk-03 to 05, wk-06, wk-07, wk-08, wk-08 to 10, wk-11 to 12, wk-13
Topics (in order):
VHDL Design and Optimization
Functional Validation
Performance Analysis, Prediction, and Optimization
  Lec-13 Computer Performance
  Lec-14 Digital Circuit Performance
Timing Analysis
Power Analysis and Reduction
Faults and Testing
Review
Overview
In this lecture we relate the general performance equations from Lec-13 to dataflow diagrams.
Concepts
- predicting performance for dataflow diagrams
- choosing clock speed in dataflow diagrams
- instruction scheduling
- dataflow diagrams with multiple instructions and performance
- design effort vs performance
4.5 Performance Analysis and Dataflow Diagrams
4.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock speed of a circuit doesn't necessarily improve its performance. In this section we will work through several example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock cycles.
4.5.1.1 Tradeoffs
When partitioning dataflow diagrams into clock cycles, we need to take both area and performance into account.

Goal:   Minimize area
Action: decrease clock period
Effect: fewer operations per clock cycle, so fewer datapath components and more opportunities to reuse hardware

Goal:   Increase scheduling flexibility
Action: increase clock period
Effect: more flexibility in grouping operations in clock cycles

Goal:   Decrease percentage of clock cycle spent in flops (overhead: time in flops is not doing useful work)
Action: increase clock period
Effect: decreases number of flops that data traverses through

Goal:   Decrease time to execute an instruction
Action: ????
Effect: depends on dataflow diagram
General Plan
Our general plan to find the clock period for maximum performance is:
1. Pick the clock period to be the delay through the slowest component + the delay through a flop.
2. For each instruction, for each operation, schedule the operation in the earliest clock cycle possible without violating clock-period timing constraints.
3. Calculate the average time to execute an instruction. Combine:
$$ \text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}} $$
and:
$$ \text{CPI}_{avg} = \sum_{i=1}^{k} \%_i \times \text{CPI}_i $$
to derive:
$$ \text{Time} = \frac{\text{NumInsts} \times \sum_{i=1}^{k} (\%_i \times \text{CPI}_i)}{\text{ClockSpeed}} $$
4. If the maximum latency through the dataflow diagram is greater than 1, then increase the clock period by the minimum amount needed to decrease latency by one clock period and return to Step 2.
5. If the maximum latency through the dataflow diagram is 1, then the clock period for highest performance is the clock period resulting in the fastest Time.
6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences of a component per instruction per clock cycle without increasing latency for any instruction.
4.5.2 Dataflow Diagram with Two Instructions

The circuit supports two instructions, A and B (e.g. multiply and divide). At any point in time, the circuit is doing either A or B; it does not need to support doing A and B simultaneously. The dataflow for each instruction and the delay through the components (f, g, h, i) that the instructions use are listed below. The delay through a register is 5ns. Each operation (A and B) occurs 50% of the time. Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest overall performance.
Instruction A: f (30ns) → g (50ns) → h (20ns) → g (50ns)
Instruction B: i (40ns) → g (50ns)
4.5.2.1 Scheduling of Operations for Different Clock Periods
Each row shows how the operations are grouped into clock cycles; "|" marks a clock-cycle boundary.

Clock Period   Instruction A      Instruction B
55ns           f | g | h | g      i | g
75ns           f | g h | g        i | g
85ns           f g | h g          i | g
95ns           f g | h g          i g
155ns          f g h g            i g
4.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?
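Answer:
$$T_{avg} = (\%_A \times CPI_A + \%_B \times CPI_B) \times ClockPeriod$$

Clock Period   CPIA   CPIB   Tavg
55ns           4      2      (0.5 × 4 + 0.5 × 2) × 55  = 165ns
75ns           3      2      (0.5 × 3 + 0.5 × 2) × 75  = 187.5ns
85ns           2      2      (0.5 × 2 + 0.5 × 2) × 85  = 170ns
95ns           2      1      (0.5 × 2 + 0.5 × 1) × 95  = 143ns
155ns          1      1      (0.5 × 1 + 0.5 × 1) × 155 = 155ns

A clock period of 95ns results in the lowest average execution time, and hence the highest performance.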
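The table can be reproduced with a short script. The following is a minimal sketch, assuming each instruction is a linear chain of operations that is scheduled greedily (each operation goes in the earliest cycle where it still fits) and that the 5ns register delay is charged once per clock cycle. The function names `cpi` and `t_avg` are illustrative and do not come from the notes.

```python
from typing import Dict, List

REG_DELAY_NS = 5  # register delay stated in the example


def cpi(delays: List[int], period: int, reg: int = REG_DELAY_NS) -> int:
    """Number of clock cycles needed by a linear chain of operations.

    Operations are packed greedily: an operation starts a new cycle only
    when adding it would exceed the time left after the register delay.
    """
    budget = period - reg
    cycles, used = 1, 0
    for d in delays:
        if d > budget:
            raise ValueError("clock period is too short for one component")
        if used + d > budget:
            cycles += 1
            used = 0
        used += d
    return cycles


def t_avg(insts: Dict[str, List[int]], freq: Dict[str, float], period: int) -> float:
    """Average time per instruction: sum(%i * CPIi) * clock period."""
    return period * sum(freq[name] * cpi(d, period) for name, d in insts.items())


# Component delays in ns for the two instructions of this example.
INSTS = {"A": [30, 50, 20, 50], "B": [40, 50]}
FREQ = {"A": 0.5, "B": 0.5}

if __name__ == "__main__":
    for p in (55, 75, 85, 95, 155):
        print(p, cpi(INSTS["A"], p), cpi(INSTS["B"], p), t_avg(INSTS, FREQ, p))
```

The greedy packing matches the CPI values in the table for this example; a dataflow diagram with parallel branches would need a more general scheduler.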
4.5.2.3 Example: Two Instructions Taking Similar Time
A and B take similar amounts of time

Question: For the flow below, which clock speed will result in the highest overall performance?
A: f (30ns) → g (50ns) → h (20ns) → g (50ns)
B: i (40ns) → g (50ns) → i (40ns)
Answer:
("|" marks a clock-cycle boundary.)

Clock Period   Schedule of A     Schedule of B   CPIA   CPIB   Tavg
55ns           f | g | h | g     i | g | i       4      3      193ns
75ns           f | g h | g       i | g | i       3      3      225ns
85ns           f g | h g         i | g | i       2      3      213ns
95ns           f g | h g         i g | i         2      2      190ns
105ns          f g h | g         i g | i         2      2      NO GAIN
135ns          f g h | g         i g i           2      1      203ns
155ns          f g h g           i g i           1      1      155ns
A clock period of 155 ns results in the highest performance. For a clock period of 105 ns, we did not calculate the performance, because we could see that it would be worse than the performance with a clock period of 95 ns. The dataflow diagram with a 105 ns clock period has the same latency as the diagram with a clock period of 95 ns. If the dataflow diagram with the longer clock period has the same latency as the diagram with the shorter clock period, then the diagram with the longer clock period will have lower performance.
4.5.2.4 Example: Same Total Time, Different Order for A
Example: Different Order of Operations for A

Question: For the flow below, which clock speed will result in the highest overall performance?
A: 30ns → 20ns → 50ns → 50ns
B: 40ns → 50ns → 40ns
Answer:

Clock Period   CPIA   CPIB   Tavg
55ns           3      3      165ns
95ns           3      2      238ns
105ns          2      2      210ns
135ns          2      1      203ns
155ns          1      1      155ns

A clock period of 155 ns results in the lowest average execution time, and hence the highest performance. This is the same answer as the previous problem, but the total times for higher clock frequencies differ significantly between the two problems.
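One way to avoid missing a useful clock period is to enumerate every period at which some run of consecutive operations just fits into a cycle, and evaluate each one. The sketch below does this by brute force for the example above. It is not the step-by-step procedure of the General Plan, only an exhaustive check under the same assumptions (linear chains, greedy scheduling, 5ns register delay); the names `candidate_periods` and `best_period` are made up for illustration.

```python
from typing import Dict, List, Tuple

REG_DELAY_NS = 5  # register delay stated in the example


def cpi(delays: List[int], period: int, reg: int = REG_DELAY_NS) -> int:
    """Cycles needed when operations are packed greedily into cycles."""
    budget = period - reg
    cycles, used = 1, 0
    for d in delays:
        if used + d > budget:
            cycles += 1
            used = 0
        used += d
    return cycles


def candidate_periods(insts: Dict[str, List[int]], reg: int = REG_DELAY_NS) -> List[int]:
    """Register delay plus every sum of consecutive operation delays that is
    at least as long as the slowest single component."""
    slowest = max(d for delays in insts.values() for d in delays)
    periods = set()
    for delays in insts.values():
        for start in range(len(delays)):
            total = 0
            for d in delays[start:]:
                total += d
                if total >= slowest:
                    periods.add(total + reg)
    return sorted(periods)


def best_period(insts: Dict[str, List[int]], freq: Dict[str, float]) -> Tuple[int, Dict[int, float]]:
    """Return the period with the lowest average time, plus all results."""
    results = {
        p: p * sum(freq[n] * cpi(d, p) for n, d in insts.items())
        for p in candidate_periods(insts)
    }
    return min(results, key=results.get), results


INSTS = {"A": [30, 20, 50, 50], "B": [40, 50, 40]}
FREQ = {"A": 0.5, "B": 0.5}

if __name__ == "__main__":
    best, results = best_period(INSTS, FREQ)
    for period, t in results.items():
        print(f"{period}ns -> Tavg = {t}ns")
    print("best period:", best)
```

The enumeration also visits 75ns and 125ns, which the table omits; neither improves on 155ns.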
4.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

Component      Delay
2-input Mult   40ns
2-input Add    25ns
Register       5ns

Instruction   Algorithm                            Frequency of Occurrence
InstP         (a × b) × ((a × b) + (b × d) + e)    75%
InstQ         (i + j + k + l) × m                  25%
NOTES
- There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
- You must put registers on your inputs; you do not need to register your outputs.
- The environment will directly connect your outputs (its inputs) to registers.
- Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.
Questions
Question: What clock period will result in the best overall performance?
Question: Find a minimal set of resources that will achieve the performance you calculated.
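Answer:
InstP (70ns clock period, CPI = 2)
  Cycle 1: read a, b, d; compute a*b, b*d, and (a*b) + (b*d)
  Cycle 2: read e; compute (a*b) + (b*d) + e, then (a*b)*((a*b) + (b*d) + e)
InstQ (70ns clock period, CPI = 2)
  Cycle 1: read i, j, k; compute i + j + k
  Cycle 2: read l, m; add l, then multiply by m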
Resource Usage

Fastest execution time   140ns
Clock period             70ns
Inputs                   3
Outputs                  1
Registers                3
Adders                   2
Multipliers              2
4.5.4 Optimality: Performance vs Area Tradeoffs
You are designing a 16-bit barrel shifter. You have the option of supporting an entire 15-bit shift in a single clock cycle (which gives a latency of 1 clock cycle), shifting 1 bit per clock cycle (which gives a latency of 15 clock cycles), or anything in between. You do the design and measure the following information:

Max Shift   Min Period   Area (CLBs)
1           21ns         13
3           27ns         36
7           34ns         57
15          40ns         53

Question: Which circuit gives you the best optimality, in terms of MIPs/CLB?
Answer: Assume that all shift amounts have the same probability of occurrence. Shift amounts can be anywhere from 0 (no shift) to 16 (shift all data out, leaving only zeroes). The data for the shift amounts and latencies were generated using Synopsys Design Compiler for a Xilinx FPGA.

Max shift of 1: Task i is to shift by i bits. A shift amount of i requires i clock cycles.
$$T_{Avg} = \sum_{i=0}^{16} \%_i \times T_i = \sum_{i=0}^{16} \frac{1}{17} \times i \times ClkPeriod = \frac{1}{17} \times ClkPeriod \times \sum_{i=0}^{16} i = \frac{1}{17} \times 21 \times 136 = 168\,ns$$

Using the maximum latency of each circuit:

Max Shift   Min Period   Latency   Time    MIPs   Area   MIPs/CLB
1           21ns         15        315ns   3.2    13     0.25
3           27ns         5         135ns   7.4    36     0.21
7           34ns         3         102ns   9.8    57     0.17
15          40ns         1         40ns    25     53     0.47
New assumptions:
1. All shift amounts have the same probability of occurrence.
2. The latency of a shift operation is dependent upon the shift amount.
3. Shift amounts can be anywhere from 0 (no shift) to 15 (shift least-significant bit to most-significant position).
4. Shifting by 0 requires 1 clock cycle.
Question: With the revised assumptions, which circuit gives you the best optimality, in terms of MIPs/CLB?
Answer:

Max shift of 1
Task i is to shift by i bits. A shift amount of i requires i clock cycles. The exception is i = 0, which requires 1 clock cycle.
$$T_{Avg} = \sum_{i=0}^{15} \%_i \times T_i = \frac{1}{16} \times ClkPeriod + \sum_{i=1}^{15} \frac{1}{16} \times i \times ClkPeriod = \frac{1}{16} \times 21 + \frac{1}{16} \times 21 \times 120 \approx 158\,ns$$

Max shift of 3
Shift amount   0–3   4–6   7–9   10–12   13–15
Latency        1     2     3     4       5
5 different tasks. $T_i = i \times ClkPeriod$ and $\%_i = 0.20$.
$$T_{Avg} = \sum_{i=1}^{5} \%_i \times T_i = \sum_{i=1}^{5} 0.20 \times i \times 27 = 81\,ns$$

Max shift of 7
Shift amount   0–7   8–14   15
Latency        1     2      3
3 different tasks. $T_i = i \times ClkPeriod$ and $\%_i = 0.33$.
$$T_{Avg} = \sum_{i=1}^{3} \%_i \times T_i = \sum_{i=1}^{3} 0.33 \times i \times 34 = 67\,ns$$

Max shift of 15
Shift amount   0–15
Latency        1
1 task. $T_i = ClkPeriod$ and $\%_i = 1.00$.
$$T_{Avg} = \sum_{i=1}^{1} \%_i \times T_i = 40\,ns$$

Max Shift   Min Period   Latency   Time    MIPs   Area   MIPs/CLB
1           21ns         15        158ns   6.3    13     0.48
3           27ns         5         81ns    12     36     0.33
7           34ns         3         67ns    15     57     0.26
15          40ns         1         40ns    25     53     0.47
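The revised-assumption table can be recomputed with a few lines of code. This is a minimal sketch that follows the same simplification as the answer above: every task (latency class) of a design is treated as equally likely. The dictionary layout and function names are assumptions made for the example. Small differences from the table (for example 68ns instead of 67ns for a max shift of 7) come from using 1/3 rather than the rounded 0.33.

```python
from typing import Dict, List, Tuple

# max shift per cycle -> (min clock period in ns, area in CLBs)
DESIGNS: Dict[int, Tuple[int, int]] = {1: (21, 13), 3: (27, 36), 7: (34, 57), 15: (40, 53)}

# max shift per cycle -> latency (in clock cycles) of each equally likely task
TASK_LATENCIES: Dict[int, List[int]] = {
    1: [1] + list(range(1, 16)),  # shift by 0 costs 1 cycle, shift by i costs i
    3: [1, 2, 3, 4, 5],
    7: [1, 2, 3],
    15: [1],
}


def t_avg_ns(max_shift: int) -> float:
    """Average execution time: mean task latency times the clock period."""
    period, _area = DESIGNS[max_shift]
    latencies = TASK_LATENCIES[max_shift]
    return period * sum(latencies) / len(latencies)


def mips_per_clb(max_shift: int) -> float:
    """Optimality metric: (1000 / Tavg in ns) MIPs divided by area in CLBs."""
    _period, area = DESIGNS[max_shift]
    return (1000.0 / t_avg_ns(max_shift)) / area


if __name__ == "__main__":
    for m in DESIGNS:
        print(m, round(t_avg_ns(m), 1), round(mips_per_clb(m), 2))
    print("best:", max(DESIGNS, key=mips_per_clb))
```

Under these assumptions the 1-bit-per-cycle design narrowly beats the single-cycle design in MIPs/CLB, which agrees with the last column of the table.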
4.5.5 Effect of Instruction Set on Performance
Example: Changing Instruction Set and Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate (MAC) to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.) Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add. Additionally, you know:

Instruction   CPI          %
ADD           0.8 CPIavg   15%
MUL           1.2 CPIavg   5%
Other         1.0 CPIavg   80%
Options
You have three options:
option 1: no change
option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.
Question: Which option will result in the highest overall performance?
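The notes do not give a worked answer here, so the following is only one possible way to set up the comparison. It assumes that each fusable pair replaces one MUL and one ADD with a single MAC, that the percentages refer to the original instruction mix, and that "50% greater" means a MAC CPI of 1.8 CPIavg. Times are expressed relative to CPIavg × the original clock period; all names are illustrative.

```python
from typing import Optional

# Original instruction mix: name -> (fraction of instructions, CPI in units of CPIavg)
BASE = {"ADD": (0.15, 0.8), "MUL": (0.05, 1.2), "OTHER": (0.80, 1.0)}
FUSABLE_FRACTION = 0.5  # half of the multiplies are followed by a fusable add


def relative_time(mac_cpi: Optional[float] = None, period_scale: float = 1.0) -> float:
    """Execution time per original instruction, relative to CPIavg * period.

    mac_cpi=None models option 1 (no MAC instruction).
    """
    add_f, add_cpi = BASE["ADD"]
    mul_f, mul_cpi = BASE["MUL"]
    oth_f, oth_cpi = BASE["OTHER"]
    if mac_cpi is None:
        cycles = add_f * add_cpi + mul_f * mul_cpi + oth_f * oth_cpi
    else:
        fused = FUSABLE_FRACTION * mul_f  # MUL+ADD pairs that become one MAC
        cycles = ((add_f - fused) * add_cpi + (mul_f - fused) * mul_cpi
                  + fused * mac_cpi + oth_f * oth_cpi)
    return cycles * period_scale


if __name__ == "__main__":
    print("option 1:", relative_time())
    print("option 2:", relative_time(mac_cpi=1.2, period_scale=1.2))
    print("option 3:", relative_time(mac_cpi=1.2 * 1.5))
```

With this setup, option 3 comes out marginally ahead of option 1, and option 2 is clearly the slowest because the 20% longer clock period applies to every instruction. A different reading of the problem statement could change these numbers.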
4.5.6 Effect of Time to Market on Relative Performance
Example: Time to Market and Optimizations

Assume that performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.
Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
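One way to approach this question is to find the delay after which the market's exponential growth has consumed the 7% gain. The sketch below assumes market performance grows as 2^(t/18) with t in months, so the break-even slip satisfies 2^(t/18) = 1.07. The function name is illustrative, and the result is a calculation under that assumption rather than an answer quoted from the notes.

```python
import math


def break_even_slip_months(improvement: float, doubling_months: float = 18.0) -> float:
    """Schedule slip at which market growth equals the optimization's gain.

    improvement is the relative speedup, e.g. 0.07 for 7%.
    """
    return doubling_months * math.log2(1.0 + improvement)


if __name__ == "__main__":
    slip = break_even_slip_months(0.07)
    print(f"break-even slip: {slip:.2f} months")
```

Under this model the optimization only pays off if it delays the launch by less than about 1.76 months.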
Chapter 5
Timing Analysis

5.1 Preliminaries

5.1.1 Overview
- Clock Skew
- Clock Jitter
- Clock-to-Q delay (latch and flop)
- Setup Time (latch and flop)
- Hold time (latch and flop)
- Capacitive Load delay
- Critical path
- False path
- Setup, hold, and clock-to-Q times in hierarchical circuits
- Propagation delay
- Interconnect (Wire) delay
- Load delay
- Elmore time constant
- Worst case timing
- Derating factors
- Speed binning
5.1.2 Background Material
- resistance, capacitance, voltage equations over time
- flip-flop timing, setup and hold times (Mano, Digital Design 6-3)
- digital view of CMOS transistor behaviors
- a tiny bit of calculus (integration in Lec-12)
5.1.3 Reading Material
There is a tremendous amount of material on delay and timing scattered throughout Smith's book.
Chapter 2: transistor and logic review
  2.1: transistor review
  2.4: combinational logic cells
  2.5: sequential logic cells
Chapter 3: fundamentals of timing and delay
  3.1–3.2: transistors and delay
Chapter 5: timing and delay within cells
  5.1.5–5.1.7: Actel cells
  5.2.4: Xilinx LCA timing
  5.4.2: Altera MAX timing
Chapter 7: timing and delay between cells
  7.1: Actel interconnect
  7.2–7.4: Xilinx LCA timing
  7.4: Altera MAX timing (constant delay for all interconnect)
Chapter 13: simulation
  13.1: levels of temporal abstraction for simulation
  13.2: simulation example
  13.5: different simulation models for hardware
  13.6: delay models
  13.7: static timing analysis
Chapter 16
  16.1.2: clock trees and timing in floorplanning
Chapter 17
  17.1.2: timing in routing
Suggestion:
- skim/read Chs 2 and 3 to refresh/learn fundamentals of delay
- skim relevant sections of Chs 4, 7, 13, 16, 17
- read in depth: 5.1.5–5.1.7, 7.1, 13.2, 13.6, 13.7, 16.1.2
- read remaining sections as time and interest dictates
LEC-15: Introduction to Timing Analysis

Schedule
wk-01–02   VHDL
wk-03–05   Design and Optimization
wk-06      Functional Validation
wk-07      Performance Analysis, Prediction, and Optimization
wk-08–09   Timing Analysis
             Lec-15 Introduction
             Lec-16 Math, Physics and Applications
             Lec-17 Timing Analysis of Storage Elements
wk-09–10   Power Analysis and Reduction
wk-11–12   Faults and Testing
wk-13      Review
Overview
This lecture introduces the fundamentals of timing analysis. In particular, how do we determine the fastest clock speed that a circuit will support?
Concepts
- Minimum clock period
- Hold constraint
- Clock skew
- Clock latency
- Clock jitter
- Setup time
- Hold time
- Clock-to-Q time
- Cause and effect of timing violations
- Propagation delay
- Load delay
- Interconnect delay
- Critical path
- False path
Background
For those who took E&CE-324, there is some overlap between the material in this chapter and the material in E&CE-324. In E&CE-427, we will focus on calculating the critical path of a circuit and on techniques to calculate the timing parameters of a storage device (e.g. latch or flop). One terminology difference: what was called "margin" in E&CE-324 will be called "slack" in E&CE-427.
Reading Material
There is a tremendous amount of material on delay and timing scattered throughout Smith's book.
NOTE: Reading and exam. All of the exam material will come from the course notes, but it could be helpful to read the relevant sections in Smith's book to better understand the material.
Chapter 2: Transistor and logic review
  2.1: transistor review
  2.4: combinational logic cells
  2.5: sequential logic cells
Chapter 3: Fundamentals of timing and delay
  3.1–3.2: transistors and delay
Chapter 5: Timing and delay within cells
  5.1.5–5.1.7: Actel cells
  5.2.4: Xilinx LCA timing
  5.4.2: Altera MAX timing
Chapter 7: Timing and delay between cells
  7.1: Actel interconnect
  7.2–7.4: Xilinx LCA timing
  7.4: Altera MAX timing (constant delay for all interconnect)
Chapter 13: Simulation
  13.1: levels of temporal abstraction for simulation
  13.2: simulation example
  13.5: different simulation models for hardware
  13.6: delay models
  13.7: static timing analysis
Chapter 16: Floorplanning and placement
  16.1.2: clock trees and timing in floorplanning
Chapter 17: Routing
  17.1.2: timing in routing
Suggested Strategy for Reading
- skim/read Chs 2 and 3 to refresh/learn fundamentals of delay
- skim relevant sections of Chs 4, 7, 13, 16, 17
- read in depth:
    2.5.2 (setup and hold)
    6.5.1 (clocks)
    3.1 (timing model)
    13.2, 13.6, 13.7 (timing models and timing analysis)
    7.1 (interconnect delay)
    16.1.2 (interconnect delay)
    5.1.5–5.1.7 (timing analysis of storage devices)
- read remaining sections as time and interest dictates
5.2 Delays and Definitions

5.2.1 Related Background Definitions
Fanin
[Figure: gates y0 to y4 driving the inputs of x]
Definition (fanin): The fanin of a gate or signal x is the set of all gates or signals y where an input of x is connected to an output of y.
Fanout
[Figure: x driving the inputs of gates y0 to y4]
Definition (fanout): The fanout of a gate or signal x is the set of all gates or signals y where an output of x is connected to an input of y.
Immediate Fanin and Fanout

Figure 5.1: Immediate Fanin of x
Figure 5.2: Immediate Fanout of x
Definition (immediate fanin/fanout): The phrases "immediate fanout" and "immediate fanin" mean that there is a direct connection between the gates.
Transitive Fanin and Fanout
Figure 5.3: Transitive Fanin
Figure 5.4: Transitive Fanout
Definition (transitive fanin/fanout): The phrases "transitive fanout" and "transitive fanin" mean that there is either a direct or indirect connection between the gates.
Immediate vs. Transitive

NOTE: Immediate vs transitive fanin and fanout. Be careful to distinguish between immediate fan(in/out) and transitive fan(in/out). If "fanin" or "fanout" is not qualified with "immediate" or "transitive", check which of the two is meant. In E&CE 427, "fan(in/out)" will mean immediate fan(in/out).
5.2.2 Timing Constraints
For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown in table 5.1. Each of these timing parameters is described in more detail in section 5.2.3.
Minimum Clock Period

Name           Symbol   Definition
Skew                    Difference in arrival times for different clock signals
Jitter                  Difference in clock period over time
Clock-to-Q     T_CO     Delay from clock signal to Q output of flop
Interconnect            Delay along wire
Load                    Delay due to load (fanout/consumers/readers)
Setup          T_SUD    Length of time prior to clock/enable that data must be stable
Table 5.1: Summary of delay factors for minimum clock period
Propagation Delay
Definition (Propagation Delay): Sum of interconnect and load delay.
Slack
Definition (Slack): Difference between the required value of a timing parameter and its actual value. A negative slack means that there is a timing violation. A positive slack means that the constraint for the timing parameter is satisfied. NB: Slack was called "margin" in E&CE 324. Both terms are used commonly.
5.2.2.1 Minimum Clock Period

[Figure: timing diagram for clk1, clk2, a, and b over one clock period, showing skew, jitter, clock-to-Q, propagation (wire + load), setup, and slack. Legend: signal is stable, signal may change, signal may rise, signal may fall.]
$$ClockPeriod > Skew + Jitter + T_{CO} + Interconnect + Load + T_{SUD}$$
5.2.2.2 Hold Constraint

[Figure: timing diagram for clk1, clk2, a, and b showing skew, jitter, hold, clock-to-Q, propagation, and slack. Legend: signal is stable, signal may change, signal may rise, signal may fall.]
$$Skew + Jitter + T_{HO} \le T_{CO} + Interconnect + Load$$
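The two constraints can be turned into slack calculations. This is a minimal sketch, assuming the minimum clock period is the sum of skew, jitter, clock-to-Q, interconnect, load, and setup, and assuming the hold constraint requires clock-to-Q plus propagation delay to be at least skew plus jitter plus hold. The class name and the example values (in ns) are hypothetical and not taken from the notes.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    """Timing parameters of one flop-to-flop path, all in ns."""
    skew: float
    jitter: float
    t_co: float          # clock-to-Q time of the launching flop
    interconnect: float  # wire delay
    load: float          # load delay
    t_su: float          # setup time of the capturing flop
    t_ho: float          # hold time of the capturing flop

    def min_clock_period(self) -> float:
        return (self.skew + self.jitter + self.t_co
                + self.interconnect + self.load + self.t_su)

    def setup_slack(self, clock_period: float) -> float:
        # Positive slack: the clock period constraint is satisfied.
        return clock_period - self.min_clock_period()

    def hold_slack(self) -> float:
        # For a hold check, interconnect and load should be the minimum
        # (fastest) values for the path; the same fields are reused here.
        return (self.t_co + self.interconnect + self.load) - (self.skew + self.jitter + self.t_ho)


if __name__ == "__main__":
    path = PathTiming(skew=0.2, jitter=0.1, t_co=0.5, interconnect=1.0,
                      load=2.0, t_su=0.4, t_ho=0.3)
    print("min clock period:", path.min_clock_period())
    print("setup slack at 5.0ns:", path.setup_slack(5.0))
    print("hold slack:", path.hold_slack())
```

A negative value from either slack method would indicate a timing violation for that constraint.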
5.2.3 Clock-Related Timing Definitions
5.2.3.1 Clock Skew (Smith 6.5.1)

[Figure: waveforms of clk1 to clk4 showing the skew between their edges]
Definition (Clock Skew): The difference in arrival times for the same clock edge at different flip-flops. Clock skew is caused by the difference in interconnect delays to different points on the chip.
Clock Tree Design

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.
5.2.3.2 Clock Latency (Smith 6.5.1)

[Figure: master clock, intermediate clock, and final clock waveforms showing the latency between them]
Definition (Clock Latency): The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: different points in the clock generation circuitry.)
NOTE: Clock latency. Clock latency does not affect the limit on the minimum clock period.
5.2.3.3 Clock Jitter (Smith pp873)

[Figure: ideal clock compared with a clock with jitter]
Definition (Clock Jitter): Difference between actual clock period and ideal clock period.
Causes of Clock Jitter

Clock jitter is caused by:
- temperature and voltage variations over time
- temperature and voltage variations across different locations on a chip
- manufacturing variations between different parts
- etc.
5.2.4 Storage Related Timing Definitions (Smith 2.5.2)

Storage devices (latches, flip-flops, memory arrays, etc.) define setup, hold, and clock-to-Q times.
Figure 5.5: Setup, hold, and clock-to-Q times for a flip-flop
Forward Reference
In this section, we will use the definitions of setup, hold, and clock-to-Q. Section 5.6 will show how to calculate setup, hold, and clock-to-Q times for flip-flops, latches, and other storage devices.
5.2.4.1 Setup Time

Definition (Setup Time, T_SUD): Latest time before the arrival of the clock edge (flip-flop), or the deasserting of the enable line (latch), that the input data is required to be stable in order for the storage device to work correctly. If the setup time is violated, the current input data will not be stored; the input data from the previous clock cycle might remain stored.
5.2.4.2 Hold Time

Definition (Hold Time, T_HO): Latest time after the arrival of the clock edge (flip-flop), or the deasserting of the enable line (latch), that the input data is required to remain stable in order for the storage device to work correctly. If the hold time is violated, the current input data will not be stored; the input data from the next clock cycle might slip through and be stored.
5.2.4.3 Clock-to-Q Time

Definition (Clock-to-Q Time, T_CO): Earliest time after the arrival of the clock edge (flip-flop), or the asserting of the enable line (latch), when the output data is guaranteed to be stable.
NOTE: Require / Guarantee Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides its environment.
5.2.4.4 Example Timing Violations
[Waveforms for a, clk, b, c, d with the clock-to-Q, propagation, setup, and hold intervals marked]
Figure 5.6: Good Timing
[Waveforms for a, clk, b, c, d with the clock-to-Q, propagation, and setup intervals marked; the values of c and d are unknown (???)]
Figure 5.7: Setup Violation
[Waveforms for a, clk, b, c, d with the clock-to-Q, propagation, and hold intervals marked; the stored value is unknown (???)]
Figure 5.8: Hold Violation
5.2.5 Propagation Delays
5.2.5.1 Load Delays (Smith 3.1)

Delay is proportional to load capacitance. Consider the timing of a simple inverter with a load.
[Figure: schematic of an inverter with input Vi and output Vo driving a load capacitance]
- Input 1→0 (output 0→1): charge the output capacitance
- Input 0→1 (output 1→0): discharge the output capacitance
Load capacitance is dependent on the fanout (how many other gates a gate drives) and on how big the other gates are. Section 5.4.1 goes into more detail on timing models and equations for load delay.
5.2.5.2 Interconnect Delays (Smith 7.1)

Wires, also known as interconnect, have resistance, and there is a capacitance between parallel wires. Both of these factors increase delay.
- Wire resistance is dependent upon the material and geometry of the wire.
- Wire capacitance is dependent on wire geometry, geometry of neighboring wires, and materials.
- Shorter wires are faster.
- Fatter wires are faster.
- FPGAs have special routing resources for long wires.
- CMOS processes use higher metal layers for long wires; these layers have wires with much larger cross sections than lower levels of metal.
More on this in section 5.4.3.
5.3 Critical Paths: False and True

Definition (critical path): The slowest path on the chip between flops or between flops and pins. The critical path limits the maximum clock speed.
Three classes of paths:
- entry path: from an input to a flop
- stage path: from one flop to another flop
- exit path: from a flop to an output
Entry Path
entry path: from an input to a flop
- Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax".
- In Xilinx timing reports, this is reported as "Maximum Delay".
Stage Path
stage path: from one flop to another flop
- In Quartus timing reports, this is reported as the period associated with "Internal fmax".
- In Xilinx timing reports, this is reported as "Clock to Setup" and "Maximum Frequency".
Exit Path
exit path: from a flop to an output
- Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax".
- In Xilinx timing reports, this is reported as "Maximum Delay".
5.3.1 Critical Path Example

[Figure: combinational circuit with inputs a, b, c and signals d, f, g, h, i, j, k, l, m]

gate    delay
NOT     2
AND     4
OR      4
XOR     6

Question: Assuming all delay and timing factors other than combinational logic delay are negligible:
- what is the critical path through this circuit?
- what is the delay along the path?
The answer to this question appears as Problem 5.3. In this circuit, it is extremely difficult to determine which path is the real critical path and which paths are false paths. There are many paths with reconvergent fanout, which greatly complicates the analysis. Most circuits are not nearly this difficult to analyze.
5.3.2 Algorithm to Find Critical Path

5.3.2.1 Critical Path Between Two Signals

The following is an algorithm to find the critical path from a source signal to a destination signal.
Basic Idea to Find Critical Path
- Start at the source node and traverse through the fanout to the destination node, annotating each intermediate node with the maximum delay to that node.
- The delay to the destination node is the delay of the critical path.
- The critical path is found by starting at the destination node and working backwards, choosing the node with the maximum delay at each step.
1. Start at the source signal.
2. Set the current time to 0.
3. Annotate the node with the current time.
4. For each node in the immediate fanout of the current node:
   (a) set the current time to the time of the current node plus the interconnect delay (if any) to the input of the fanout node
   (b) annotate the input of the fanout node with the current time
5. For each node that has times on all of its inputs but not a time for itself:
   (a) annotate the output of the node with the maximum time on the inputs to the node plus the delay through the node
   (b) go to step 4
6. To find the critical path, work backwards through the fanin from the destination node, choosing the fanin node with the maximum delay at each step.
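The annotate-then-backtrack idea can be sketched in a few lines. The example below assumes zero interconnect delay (as in the examples in this section), uses the gate delays from the table in Section 5.3.1, and runs on a small hypothetical netlist that is not one of the circuits in the notes. It finds the longest structural path only; it does not check whether that path is a false path.

```python
from typing import Dict, List, Tuple

GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}

# Hypothetical netlist: node -> (gate type, list of fanin nodes)
NETLIST: Dict[str, Tuple[str, List[str]]] = {
    "n1": ("NOT", ["a"]),
    "n2": ("AND", ["n1", "b"]),
    "n3": ("AND", ["b", "c"]),
    "y": ("OR", ["n2", "n3"]),
}


def arrival_times(netlist: Dict[str, Tuple[str, List[str]]], sources: List[str]) -> Dict[str, int]:
    """Annotate every node with the maximum delay from the sources."""
    times = {s: 0 for s in sources}
    pending = dict(netlist)
    while pending:
        # A node is ready once all of its inputs have been annotated.
        ready = [n for n, (_, fanin) in pending.items() if all(f in times for f in fanin)]
        if not ready:
            raise ValueError("netlist has a loop or an undriven input")
        for n in ready:
            gate, fanin = pending.pop(n)
            times[n] = max(times[f] for f in fanin) + GATE_DELAY[gate]
    return times


def critical_path(netlist: Dict[str, Tuple[str, List[str]]], times: Dict[str, int], dest: str) -> List[str]:
    """Work backwards from dest, following the fanin with the largest time."""
    path = [dest]
    while path[-1] in netlist:
        _, fanin = netlist[path[-1]]
        path.append(max(fanin, key=lambda f: times[f]))
    return path[::-1]


if __name__ == "__main__":
    t = arrival_times(NETLIST, ["a", "b", "c"])
    print(t["y"], critical_path(NETLIST, t, "y"))
```

For sets of sources and destinations, the same annotation pass is run from all sources at once and the backtracking starts at the destination with the largest annotation.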
5.3.2.2 Critical Path Between Sets of Signals

To find the critical path from a set of source signals to a set of destination signals, run the above algorithm, but start from all source nodes. The destination of the critical path is the destination node that is annotated with the greatest delay. Run the back-tracking procedure for this destination signal of the critical path.
5.3.3 False Paths
Definition (false path): A path from a source signal to a destination signal such that changes on the source signal will not propagate along the path to cause a change on the destination signal.
There are two classes of false paths: static and dynamic. Static false paths are easier to detect, while dynamic false paths can be tedious and difficult to detect.
5.3.3.1 Static False Path Example

Question: Ignoring the behaviour of the gates, find the critical path through the circuit.
[Figure: circuit with inputs a, b, c, internal signals f, g, h, and outputs y, z]

gate    delay
NOT     2
AND     4
OR      4
XOR     6

Answer:
Annotate Paths with Delays
[Figure: the circuit with each node annotated with the maximum delay to that node]
The path from a to y has a delay of 16. Check if it is a false critical path.
False Path from a to y

[Figure: the circuit annotated with delays and with the Boolean function of each node, including !a, !b, !c, ab, ab + !c, !a + !b, and y = !b!c]
The equation for y is !b!c, which does not contain a, so y is independent of a. In other words, changes on a do not lead to changes on y, and the path from a to y is a false path. We were able to use static analysis to determine that the path from a to y is a false path.
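The static check amounts to asking whether the output's Boolean function depends on the source input at all. For small circuits this can be tested by brute force over the truth table. The sketch below uses a hypothetical expression for y that mentions a but simplifies to !b!c; it is not the exact gate structure of the circuit in the figure, and the function names are illustrative.

```python
from itertools import product
from typing import Callable, List


def depends_on(func: Callable[..., bool], names: List[str], var: str) -> bool:
    """True if flipping `var` can change the output for some input values."""
    idx = names.index(var)
    for values in product([False, True], repeat=len(names)):
        flipped = list(values)
        flipped[idx] = not flipped[idx]
        if func(*values) != func(*flipped):
            return True
    return False


def y(a: bool, b: bool, c: bool) -> bool:
    # Hypothetical structure: (!a + !b) AND (!b AND !c), which reduces to !b!c.
    return (not a or not b) and (not b and not c)


if __name__ == "__main__":
    for v in ("a", "b", "c"):
        print(v, depends_on(y, ["a", "b", "c"], v))
```

If `depends_on` returns False for an input, every structural path from that input to the output is a static false path. A True result says nothing about any particular path, which is why dynamic analysis is still needed.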
Find Next Candidate Path

[Figure: the circuit with delays recomputed along the false path]
To find the next candidate critical path, recompute the delay values along the false path. Leave all other delays the same as before. For each node along the false path, maintain two delay values. One delay is the value already calculated. The other delay value is the maximum delay to that node, ignoring the prefix of the false path. The prefix of a false path is the set of nodes whose fanin comes only from false paths.
Candidate Path
[Figure: the circuit with pairs of delay values, such as (0,2), (0,4), and (4,8), on the nodes along the false path]
The next candidate is from b to y. Static analysis shows that b is in the equation for y, so static analysis cannot detect whether this is a false path. We must use dynamic analysis.
5.3.3.2 Dynamic False Path Example

Question: Determine if the critical path you found in the previous question is a real critical path or a false path. If it is a false path, find the real critical path and its delay.
Answer:
Test Candidate Path

Try to push a rising edge from source to destination, assigning values to nodes not on the critical path that allow the rising edge to propagate.
[Figure: the circuit with an edge applied at the source and constant values assigned to the other nodes]
The rising edge fails to generate a change on y.
Test Candidate Path

Try to push a falling edge from source to destination.
[Figure: the circuit with a falling edge applied at the source and constant values assigned to the other nodes]
Both rising and falling edges failed to generate a change on the output; therefore, we have found another false path. NB: Pushing edges forward is not a smart way to explore candidate critical paths, because this technique does not help isolate the cause of the false path. Pushing edges backwards will identify the cause of the false path.
Test Candidate Path Intelligently

[Figure: the circuit with values assigned while pushing an edge backwards from y]
Try to push a rising edge backwards along the path between b and y. This leads to a contradictory assignment for b; therefore, the path is a false path.
Reconvergent Fanout
[Figure: the circuit with the two paths from the contradictory node to y highlighted]
There are two paths from the point of contradictory assignment to y. This is reconvergent fanout. Reconvergent fanout is the most common cause of false paths. It also causes problems with fault detection (Chapter 7).
Pushing Edge with Reconvergent Fanout

[Figure: the circuit with an edge assigned to the node in the reconvergent fanout]
Try to push a rising edge backwards along the path, but put an edge (not a constant) on the node in the reconvergent fanout. This still leads to contradictory assignments to b.
Next Candidate Path

[Figure: the circuit with delays recomputed after discarding the second false path]
To find the next candidate critical path, recompute the delay values for nodes along the false path. Leave all other delays the same as before. To recompute the delay along a false path, ignore the prefix of the false path. The prefix is the set of nodes whose fanin comes only from false paths.
Shortcut for Candidate Paths

As a shortcut, you do not need to maintain two delay values for nodes in the suffix of the false path. The suffix is the set of nodes that fan out only to the false path. The nodes in the suffix do not need to maintain their old delay value. They only need their new delay value.
Next Candidates
[Figure: the circuit with the recomputed delays; the largest remaining delay is 10]
Test First Candidate

[Figure: the circuit with values assigned while pushing an edge backwards along the first candidate path]
Propagate a rising edge backwards. It works!
Test Second Candidate

[Figure: the circuit with values assigned while pushing an edge backwards along the second candidate path]
Propagate a rising edge backwards. It works!
Summary
There are two paths with a delay of 10: one from a to z and one from c to y. We can push edges along both of these paths, so they are real critical paths. Note that different values on b result in different critical paths.
5.3.3.3 Another Dynamic False Path Example

Question: Find the false critical path in the circuit below.
[Figure: circuit with inputs a, b, c, d, e and signals f, g, h, i, j, k]
Answer:
[Figure: the circuit annotated with the values assigned while pushing an edge backwards; the figure marks a conflict (0 /= 1) between two required values]
5.3.3.4 And Another Dynamic False Path Example

Question: Find the real critical path in the circuit below.
[Figure: circuit with inputs a, b, c and signals y, z, x; one component is labelled delay=8 and one is labelled delay=2]
First Candidate
Answer:
[Figure: the circuit annotated with the maximum delay to each node]
This is a false path; we saw it before in an earlier problem.
Second Candidate
[Figure: the circuit with delays recomputed along the false path]
The real critical path is the path from a to z, which has a delay of 14.
5.3.3.5 Algorithm for False Path Detection

To determine if a path through a circuit is a false path:
1. Start at the destination node of the path and try to push a 1→0 or 0→1 edge backwards along the candidate critical path.
2. Follow the critical path backwards. At each gate, assign values (0 or 1) to the non-critical input signals according to the rules in Figure 5.9. If there is reconvergent fanout, a 1→0 or 0→1 edge may be assigned to non-critical inputs; otherwise, use just 0 or 1.
3. If different values are assigned to the same signal, then the candidate critical path is a false path.
4. If no signal is assigned different values, then the assignments calculated along the path give values that will exercise the critical path.
5. Push the values on non-critical nodes back to the primary inputs to obtain an assignment that will exercise the critical path.
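A simplified version of this check can be written as a brute-force search. The sketch below tests static sensitization only: it looks for an assignment to the other primary inputs that holds every side input along the path at its non-controlling value (1 for AND, 0 for OR). It does not implement the additional rules that allow edges on reconvergent side inputs, so it can be stricter than the procedure above. The netlist is hypothetical and all names are illustrative.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Hypothetical netlist: node -> (gate type, list of fanin nodes)
NETLIST: Dict[str, Tuple[str, List[str]]] = {
    "n1": ("AND", ["a", "b"]),
    "n2": ("NOT", ["b"]),
    "y": ("AND", ["n1", "n2"]),
    "z": ("OR", ["n1", "c"]),
}
INPUTS = ["a", "b", "c"]
NON_CONTROLLING = {"AND": 1, "OR": 0}


def evaluate(assign: Dict[str, int]) -> Dict[str, int]:
    """Compute the value of every node for one primary-input assignment."""
    vals = dict(assign)

    def val(node: str) -> int:
        if node not in vals:
            gate, fanin = NETLIST[node]
            ins = [val(f) for f in fanin]
            if gate == "AND":
                vals[node] = int(all(ins))
            elif gate == "OR":
                vals[node] = int(any(ins))
            else:  # NOT
                vals[node] = 1 - ins[0]
        return vals[node]

    for node in NETLIST:
        val(node)
    return vals


def sensitizing_assignment(path: List[str]) -> Optional[Dict[str, int]]:
    """Return values for the other inputs that sensitize `path`, or None."""
    source = path[0]
    others = [i for i in INPUTS if i != source]
    for bits in product([0, 1], repeat=len(others)):
        base = dict(zip(others, bits))
        ok = True
        for src_val in (0, 1):  # side inputs must hold for both source values
            vals = evaluate({**base, source: src_val})
            for prev, node in zip(path, path[1:]):
                gate, fanin = NETLIST[node]
                for side in fanin:
                    if side != prev and vals[side] != NON_CONTROLLING[gate]:
                        ok = False
        if ok:
            return base
    return None


if __name__ == "__main__":
    print(sensitizing_assignment(["a", "n1", "y"]))  # b would need to be 1 and 0
    print(sensitizing_assignment(["a", "n1", "z"]))
```

In this netlist the path a → n1 → y needs b = 1 at n1 and b = 0 at y, which is the same kind of contradictory assignment that marks a false path in step 3.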
Rules for Pushing Edges

[Figure panels: General rules; Additional rules for reconvergent fanout.]
Figure 5.9: Rules for pushing rising and falling edges through gates
Reconvergent Fanout Rules

Question: Why do the rules for reconvergent fanout have only rising edges for AND gates and falling edge for OR gates?
Answer:
[Figure: gate and waveform sketch for signals a, b, c.]
Falling edge on non-critical path will cause output to change before edge on critical path affects output.
Analyzing Rules for Reconvergent Fanout

The pictures below show all combinations of output edge (rising or falling) and input values (constant 1, constant 0, rising edge, falling edge) for AND, OR, NAND, and NOR gates. The pictures that are crossed out illustrate combinations of inputs and outputs that are contradictory to the behaviour of the gate.
Reconvergent for AND

[Figure: input/output combinations for the AND gate. Annotations in the figure: "0 is controlling", "1 glitch on output", "constant 0 output".]
Reconvergent for OR
[Figure: input/output combinations for the OR gate. Annotations in the figure: "1 is controlling", "constant 1 output", "0 glitch on output".]
Reconvergent for NAND

[Figure: input/output combinations for the NAND gate. Annotations in the figure: "0 is controlling", "0 glitch on output", "constant 0 output".]
Reconvergent for NOR

[Figure: input/output combinations for the NOR gate. Annotations in the figure: "1 is controlling", "constant 1 output", "0 glitch on output".]
5.3.4 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far we have assumed that all signals have the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.
LEC-16: Math, Physics, and Applications of Timing Analysis

Schedule
wk-09 10 wk-11 12 wk-13
Overview
This lecture looks at the analog equations that affect delay and relates them up to the digital world.
Concepts
- Timing model
- Data dependent delay
- Propagation delay
- Load delay
- Interconnect delay
- Elmore time constant
- Extrinsic delay
- Intrinsic delay
- Worst case timing
- Derating factors
- Speed binning
5.4 Analog Effects in Timing Analysis
5.4.1 Timing Model (Smith 3.1, 13.6)

[Figure: Timing model of an inverter with input Vi, output Vo, Rpu, Rpd, Cp, and Cout.]
Rpu: pull up resistor in p-tran
Rpd: pull down resistor in n-tran
Cp: parasitic capacitance
Cout: load capacitance
5.4.1.1 Equation for Output Voltage

Output voltage when Vo discharges through Rpd (Equation 3.1 from Smith).
$V_o(t) = V_{DD}\, e^{-t / (R_{pd}(C_p + C_{out}))}$
Measuring Delay Through an Inverter

[Figure: Vin and Vout waveforms with the voltage axis marked at 0, 0.35 Vdd, 0.65 Vdd, and Vdd.]
To measure delay through inverter, what voltage levels do we use?
Defining Trip Points

Definition, trip points: A high or 1 trip point is the voltage level where an upwards transition means the signal represents a 1. A low or 0 trip point is the voltage level where a downwards transition means the signal represents a 0.
[Figure: voltage axis marked at 0, 0.35 Vdd, 0.65 Vdd, and Vdd.]
Picking Trip Points

We need to pick our trip points; these determine the start and stop times for measuring delay. Pick the trip points to simplify the delay equation. Pick trip points of 0.35/0.65:
- low-voltage (0) trip point of 0.35 Vdd
- high-voltage (1) trip point of 0.65 Vdd
Trip Points and Delay Equation

Delay equation for falling output with 0.35 trip point:
$0.35\, V_{DD} = V_{DD}\, e^{-T_{PD} / (R_{pd}(C_p + C_{out}))}$
Solving for $T_{PD}$, using $\ln(1/0.35) \approx 1$, doing some more approximations:
$T_{PD} \approx R_{pd}(C_p + C_{out})$
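To see how good the approximation for the 0.35 trip point is, the short Python sketch below compares the exact solution of the exponential discharge equation with the product Rpd (Cp + Cout). The component values are illustrative assumptions, not values from the notes, and the function name is hypothetical.

```python
import math


def falling_delay(r_pd, c_p, c_out, trip=0.35):
    """Time for Vo = VDD * exp(-t / (Rpd * (Cp + Cout))) to fall to trip * VDD."""
    tau = r_pd * (c_p + c_out)
    return -tau * math.log(trip)


# Illustrative values only: 1 kOhm pull-down, 10 fF parasitic, 40 fF load.
r_pd, c_p, c_out = 1e3, 10e-15, 40e-15
exact = falling_delay(r_pd, c_p, c_out)
approx = r_pd * (c_p + c_out)
print(f"exact  = {exact:.3e} s")
print(f"approx = {approx:.3e} s")
print(f"ratio  = {exact / approx:.4f}")
```

The ratio is ln(1/0.35), about 1.05, so picking 0.35 as the trip point makes the delay roughly equal to the RC product.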
Some Rough Intuition
$T_{PD} \approx R_{pd}(C_p + C_{out})$
- A larger transistor has a lower resistance, but a higher capacitance.
- Resistance affects timing of source (driving) signals.
- Capacitance affects (mostly) timing of destination (load) signals.
- Decreasing resistance increases the current through drivers.
- Increasing capacitance slows down (dis)charging of load capacitors.
5.4.1.2 Extrinsic / Intrinsic Delays (Smith 13.6)

Definition, intrinsic delay: Delay resulting from pull(up/down) resistor and parasitic capacitance.
Definition, extrinsic delay: Delay resulting from load capacitance.
5.4.2 Data-Dependent Delay
Sometimes the delay through a component is dependent upon the values on signals.
- In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries are generated at all.
- In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a 1.
- Some implementation technologies (e.g. NMOS and exotic latches) have faster transitions from 1->0 than 0->1.
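The ripple-carry case can be illustrated with a small Python sketch that counts how many bit positions a carry ripples through for a given pair of operands. It is only a model of the data dependence, assuming delay grows with the length of the longest carry chain; the function name is hypothetical.

```python
def longest_carry_chain(a, b, width):
    """Longest run of bit positions that one carry ripples through in a + b."""
    carry = 0
    run = 0
    longest = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:
            # This position generates a carry regardless of the incoming carry.
            carry = 1
            run = 1
        elif (x ^ y) and carry:
            # This position propagates the incoming carry one step further.
            run += 1
        else:
            carry = 0
            run = 0
        longest = max(longest, run)
    return longest


print(longest_carry_chain(0b1111, 0b0001, 4))  # carry generated at the LSB ripples to the MSB
print(longest_carry_chain(0b0101, 0b1010, 4))  # no carries generated at all
```

A timing simulation that never applies operands like the first pair would never exercise the adder's critical path.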
Analysis and Accuracy

Because of these effects, the most accurate delay analysis requires looking at the actual data values that will occur in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder.
NOTE: Asynchronous circuits. Data-dependent delays are one motivation for asynchronous circuits. Asynchronous circuits are still an active area of research, but are beginning to be used in commercial circuits.
5.4.3 Interconnect Delay (Smith 7.1)
5.4.3.1 Elmore Time Constant (Smith 7.1.2)

Elmore time constants are used to analyze interconnect delay with intermediate connections and/or fanout.
$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
$D_i$: Elmore time constant for node $i$ ($n$ is the number of nodes in the circuit)
$ER_{k,i}$: resistance along the path from node $i$ to the source that is also on the path from node $k$ to the source
$V_i(t) \approx e^{-t / D_i}$: the voltage on node $i$ (capacitor $i$) at time $t$
Elmore Time Constant
$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
If we:
- approximate $V_i(t)$ as an exponential waveform, and
- use 0.35/0.65 trip points
then the delay from the source to node $i$ is $D_i$ seconds.
5.4.3.2 Interconnect with Single Fanout

This is similar to the example in Smith 7.1.3, except that Smith has one more wire segment (L4) between the gates.
[Figure: gate G1 driving gate G2 through antifuses Ra1 to Ra4 and wire segments with resistances Rw1 to Rw3 and capacitances C1 to C3, together with the equivalent RC model (Vi, Rpu, Rpd, Cp at G1; CG2 at G2).]
G*: gate
C*: capacitance on wire
Ra*: resistance through antifuse
Rw*: resistance through wire
Question:
Calculate delay from gate 1 to gate 2
Answer:
Gate 2 represents node 4 on the RC tree.
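$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
$D_4 = ER_{1,4} C_1 + ER_{2,4} C_2 + ER_{3,4} C_3 + ER_{4,4} C_4$, where $C_4 = C_{G2}$
$D_4 = (R_{a1} + R_{w1}) C_1$
$\quad + (R_{a1} + R_{w1} + R_{a2} + R_{w2}) C_2$
$\quad + (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3}) C_3$
$\quad + (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3} + R_{a4}) C_{G2}$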
Approximate $R_a \gg R_w$:
$D_4 = R_{a1} C_1 + (R_{a1} + R_{a2}) C_2 + (R_{a1} + R_{a2} + R_{a3}) C_3 + (R_{a1} + R_{a2} + R_{a3} + R_{a4}) C_{G2}$
Approximate $R_{ai} = R_{aj}$:
$D_4 = R_a C_1 + 2 R_a C_2 + 3 R_a C_3 + 4 R_a C_{G2}$
Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?
Answer:
Doubling Antifuses Answer

$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
Assume all resistances and capacitances have the same values (R and C), and assume that all intermediate nodes are along the path between the two gates of interest. Then $ER_{k,i} = k R$ and:
$D_i = \sum_{k=1}^{n} k\, R\, C$

Antifuse Doubling (Cont'd)

Using the mathematical theorem:
$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$
We simplify the delay equation:
$D_i = \sum_{k=1}^{n} k\, R\, C = \frac{n(n+1)}{2} R C \approx \frac{n^2}{2} R C$
We see that the delay is proportional to the square of the number of antifuses along the path, so doubling the number of antifuses roughly quadruples the wire delay.
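The quadratic growth can be checked numerically with a short Python sketch. It assumes, as the answer does, a chain of n identical segments with resistance R and capacitance C; the function name and the unit values are illustrative.

```python
def chain_delay(n, r=1.0, c=1.0):
    """Elmore delay to the last node of a chain of n identical RC segments."""
    return sum(k * r * c for k in range(1, n + 1))


for n in (4, 8, 16, 32):
    closed_form = n * (n + 1) / 2
    print(n, chain_delay(n), closed_form)

# Ratio of delays when the number of antifuses doubles.
print(chain_delay(8) / chain_delay(4), chain_delay(32) / chain_delay(16))
```

The ratio approaches 4 as n grows, which matches the n-squared behaviour of the closed form.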
5.4.3.3 Interconnect with Multiple Gates in Fanout

[Figure: source inverter G1 driving gates G2 and G3 through antifuse interconnect.]
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.
Answer:
1. There are a total of 7 nodes in the circuit ($n = 7$).
2. Label interconnect with resistance and capacitance identifiers.
[Figure: interconnect from G1 to G2 and G3 labelled with resistances R1 to R6 and capacitances C1 to C7.]
3. Draw RC tree
[Figure: RC tree driven by G1 (Vi, Rpu, Rpd, Cp) with nodes n1 to n7, resistances R1 to R6, and capacitances C1 to C7; G2 is at node n5 and G3 is at node n7.]
4. G2 is node 5 in the circuit ($i = 5$).
5. Elmore delay equations
$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
$D_5 = \sum_{k=1}^{7} ER_{k,5}\, C_k$
$\quad = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7$
6. Elmore resistances
ER 1,5 = R1 = R
ER 2,5 = R1 + R2 = 2R
ER 3,5 = R1 + R2 = 2R
ER 4,5 = R1 + R2 + R3 = 3R
ER 5,5 = R1 + R2 + R3 + R4 = 4R
ER 6,5 = R1 + R2 = 2R
ER 7,5 = R1 + R2 = 2R
7. Plug resistances into delay equations
$D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$
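The same calculation can be sketched in Python for an arbitrary RC tree. The tree below is an assumption inferred from the Elmore resistances listed in this example (node 3 is treated as hanging off node 2 through negligible wire resistance, and the branches to G2 and G3 split at node 3); the function name, the dictionary encoding, and the unit values of R and C are all illustrative.

```python
def elmore_delay(parent, resistance, capacitance, target):
    """Elmore delay D_target = sum over k of ER[k, target] * C[k].

    parent[k]      : parent of node k (node 0 is the source)
    resistance[k]  : resistance between node k and its parent
    capacitance[k] : capacitance at node k
    """
    def path_edges(node):
        edges = set()
        while node != 0:
            edges.add(node)  # each edge is identified by its child node
            node = parent[node]
        return edges

    target_edges = path_edges(target)
    delay = 0.0
    for k in parent:
        shared = target_edges & path_edges(k)
        delay += sum(resistance[e] for e in shared) * capacitance[k]
    return delay


R, C = 1.0, 1.0
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 3, 7: 6}
resistance = {1: R, 2: R, 3: 0.0, 4: R, 5: R, 6: R, 7: R}
capacitance = {k: C for k in parent}

print(elmore_delay(parent, resistance, capacitance, 5))  # delay to G2 (node 5)
print(elmore_delay(parent, resistance, capacitance, 7))  # delay to G3 (node 7)
```

With equal capacitances the two delays come out the same, because the coefficients 1, 2, 2, 3, 4, 2, 2 and 1, 2, 2, 2, 2, 3, 4 both sum to 16.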
Delay from G1 to G3
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G3
Answer:
1. G3 is node 7 in the circuit ($i = 7$).
2. Elmore delay equations
$D_i = \sum_{k=1}^{n} ER_{k,i}\, C_k$
$D_7 = \sum_{k=1}^{7} ER_{k,7}\, C_k$
$\quad = ER_{1,7} C_1 + ER_{2,7} C_2 + ER_{3,7} C_3 + ER_{4,7} C_4 + ER_{5,7} C_5 + ER_{6,7} C_6 + ER_{7,7} C_7$
3. Elmore resistances
ER 1,7 = R1 = R
ER 2,7 = R1 + R2 = 2R
ER 3,7 = R1 + R2 = 2R
ER 4,7 = R1 + R2 = 2R
ER 5,7 = R1 + R2 = 2R
ER 6,7 = R1 + R2 + R5 = 3R
ER 7,7 = R1 + R2 + R5 + R6 = 4R
4. Plug resistances into delay equations
$D_7 = R C_1 + 2R C_2 + 2R C_3 + 2R C_4 + 2R C_5 + 3R C_6 + 4R C_7$
Delay to G2 vs G3
Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?
Answer:
1. Equations for delay to G2 ($D_5$) and G3 ($D_7$)
$D_5 = R C_1 + 2R C_2 + 2R C_3 + 3R C_4 + 4R C_5 + 2R C_6 + 2R C_7$
$D_7 = R C_1 + 2R C_2 + 2R C_3 + 2R C_4 + 2R C_5 + 3R C_6 + 4R C_7$
2. Difference in delays
$D_5 - D_7 = R C_4 + 2R C_5 - R C_6 - 2R C_7$
3. Compare capacitances
$C_4 \approx C_6$ and $C_5 \approx C_7$
4. Conclusion: delays are approximately equal.
5.4.3.4 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect. When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.
5.5 Practical Usage of Timing Analysis
5.5.1 Speed Binning (Smith 5.1.6)
Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably. Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run. A speed bin is the clock speed that chips will be labeled with when sold. Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your overstressed hardware will).
5.5.2 Worst Case Timing (Smith 5.1.7)
5.5.2.1 Fanout delay

Table 5.2 (Fanout delay) combines two separate parameters:
- capacitive load delay
- interconnect delay
into a single parameter (fanout). This is common, and fine. But when reading a table such as this, you need to know whether fanout delay combines both capacitive load delay and interconnect delay, or is just capacitive load.
5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.
Temp increases => Delay increases
Supply voltage increases => Delay decreases
Temperature
As temperature goes up, atoms vibrate more, and so have a greater probability of colliding with the electrons flowing with the current.
Temp increases => Resistivity of wires increases
Temp increases => Delay increases
Supply Voltage
Supply voltage increases => current increases (V = IR)
Current increases => time to charge load capacitors to threshold voltage decreases
Supply voltage increases => Delay decreases
Derating Factor Definition

A derating factor is a number used to adjust timing numbers to account for different temperature and voltage conditions. Excerpt from Table 5.3 in the book (Actel Act 3 derating factors):

Derating factor   Temp    Vdd
1.17              125C    4.5V
1.00              70C     5.0V
0.63              -55C    5.5V
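Applying a derating factor is a single multiplication, as the Python sketch below shows. It assumes the excerpt is read as three (temperature, Vdd) operating points and that the nominal delay is the one quoted for the 1.00 condition; the 10 ns nominal delay and the function name are illustrative.

```python
# Derating factors from the excerpt above, keyed by (temperature, supply voltage).
DERATING = {
    ("125C", "4.5V"): 1.17,
    ("70C", "5.0V"): 1.00,
    ("-55C", "5.5V"): 0.63,
}


def derate(nominal_delay_ns, temp, vdd):
    """Scale a delay quoted at the factor-1.00 condition to another condition."""
    return nominal_delay_ns * DERATING[(temp, vdd)]


nominal = 10.0  # illustrative nominal path delay in ns
for (temp, vdd), factor in DERATING.items():
    print(temp, vdd, round(derate(nominal, temp, vdd), 2), "ns")
```

The hot, low-voltage corner gives the longest delay, which is why worst-case timing is checked there.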
LEC-17: Timing Analysis (Latches and Flip Flops)

Schedule
wk-09 10 wk-11 12 wk-13
Concepts
Setup, hold, and clock-to-Q time calculations for the following circuits:
- Latch
- Master/Slave flip flop
- Exotic flops
- Hierarchical FPGA cell
We won't have time to cover all of these in lecture. Hierarchical FPGA is in Smith. Exotic flop is for your interest and buzz-word completeness in interviews; it will not be on the final exam.
5.6 Timing Analysis of Latches and Flip Flops
5.6.1 Simple Latch
Two modes for latch:
- loading data: loads input data into storage circuitry; input data passes through to output
- using stored data: input signal is disconnected from output; storage circuitry drives output
[Figure: Schematic of the latch (clk, o).]
Two Modes for Latch

[Figure: the latch in loading / pass-through mode and in storage mode.]
Implementing a Latch
[Figure: Multiplexor: symbol and implementation (inputs a and b, select sel, output o).]
[Figure: Latch implementation (inputs d and clk, output o).]
Latch Glitching
[Figure: latch implementation with inputs d and clk.]
NOTE: Inverters on sel. Both of the inverters on the sel signal are needed. Together, they prevent a glitch on the OR gate when sel is deasserted. If there were only one inverter, a glitch would occur. For more on this, see section 5.6.3.2.
Loading 0
[Figure: latch annotated with signal values for loading 0 (d=0, clk=1).]
Loading 1
[Figure: latch annotated with signal values for loading 1 (d=1, clk=1).]
Storing 0
[Figure: latch annotated with signal values for storing 0 (clk=0, o=0).]
Storing 1
[Figure: latch annotated with signal values for storing 1 (clk=0, o=1).]
Timing Analysis Strategy

The key to calculating setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexor)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexor)
Storage Path with Gating Gate

[Figure: latch with the storage path and its gating gate highlighted (clk=0).]
Load Path with Gating Gate

[Figure: latch with the load path and its gating gate highlighted (clk=1).]
Clock-to-Q
NOTE: Clock-to-Q for latches. For latches, clock-to-Q times are measured with respect to the clock edge that connects the data input to the output. For active-high latches, this is a rising edge.
Setup and Hold

NOTE: Setup and hold time for latches. For latches, hold time and setup time are measured with respect to the clock edge that disconnects the data input from the output. For active-high latches, this is a falling edge. Hold time is concerned with the next data value sneaking in before the latch goes into storage mode. Setup time is concerned with the previous data value still being in the storage circuitry when the input is disconnected.
Requirements and Guarantees

NOTE: Requirements vs. Guarantees. For a storage device, the setup and hold times are requirements that the device imposes upon its environment. The clock-to-Q time is a guarantee. If the environment satisfies the setup and hold times, then the storage device guarantees that it will satisfy the clock-to-Q time.
Storage Devices and Signals

NOTE: Storage devices vs. Signals. We can talk about the setup and hold time of a signal or of a storage device. For a storage device, the setup and hold times are requirements that it imposes upon all environments in which it operates. For an individual signal in a circuit, there is a setup and hold time, which is the amount of time that the signal is stable before and after a clock edge.

5.6.2 Clock-to-Q Time of a Simple Latch

[Figure: latch schematic with signals d, clk, l1, l2, c2, cn, s1, s2, qn, q.]
Figure 5.10: Latch for Clock-to-Q Analysis

[Figure: timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing the clock-to-Q interval.]
Calculate clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit.

5.6.3 Setup Timing of a Simple Latch

Figure 5.11: Latch for Setup Analysis

[Figure: timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing setup + slack.]
Figure 5.12: Setup OK: goal is to store
[Figure: latch schematic and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing setup with negative slack.]
Figure 5.13: Setup Violation
[Figure: latch schematic and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing the setup time.]
Figure 5.14: Minimum Setup Time
[Figure: latch schematic and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing the minimum setup time.]
The new data value must arrive at s1 before cn is asserted. Otherwise, the previous data value will affect the storage circuitry when the data input is disconnected. Setup time is the difference between the delay of the path from d to s1 and the delay of the path from clk to cn.
5.6.3.1 Hold Time of a Simple Latch

[Figure: latch schematic with signals d, clk, cn, s2, s1, l1, c2, l2, qn, q.]
Figure 5.15: Latch for Hold Analysis

[Figure: timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing hold + slack.]
Figure 5.16: Hold OK: goal is to store
[Figure: latch schematic and timing diagram for d, l1, l2, qn, q, s1, s2, clk, cn, c2 showing hold with negative slack.]
Figure 5.17: Hold violation: slips through to q
[Figure: latch schematic and timing diagram showing the hold time.]
Figure 5.18: Minimum Hold Time

We can't let the next data value affect l1 before c2 deasserts. Hold time is the difference between the delay of the path from clk to c2 and the delay of the path from d to l1.
5.6.3.2 Example of a Bad Latch

[Figure: schematic and timing diagram of the bad latch, with signals d, clk, l1, l2, c2, cn, s1, s2, qn, q.]
5.6.4 Timing Analysis of a Transmission Gate Latch
5.6.4.1 Transmission Gate (Smith 2.4.3)
[Figure: transmission gate symbol and implementation (select s, input i, output o).]
[Figure: Transmission gate as switch: open, closed, transmit 1, transmit 0.]
5.6.4.2 Transmission Gate Latch (Smith 2.5.1)

[Figure: transmission gate latch with inputs d and clk and output q.]
Loading Data into Latch

[Figure: transmission gate latch annotated with signal values while loading data.]
Using Stored Data from Latch

[Figure: transmission gate latch annotated with signal values while using stored data.]
5.6.4.3 Clock-to-Q Delay for Latch

[Figure: transmission gate latch with the clock-to-Q path highlighted.]
5.6.4.4 Setup and Hold Times for Latch
Setup Time for Latch

[Figure: transmission gate latch with path1 and path2 marked for setup analysis.]
Setup time = path1 - path2
Hold Time for Latch
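[Figure: transmission gate latch with path1 and path2 marked for hold analysis.]
Hold time = path1 - path2

5.6.5 Falling Edge Flip Flop (Smith 2.5.2)

[Figure: flip flop built from two latches (EN inputs) with signals d, clk, m, clk_b, q, and its timing diagram.]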
5.6.5.1 Behaviour of Flip-Flop

[Figure: flip flop schematic and timing diagram for d, clk, m, clk_b, q showing Tinv, Tmd, latch setup, and latch clock-to-Q.]
TInv: delay through an inverter
Tmd: propagation delay from m to d
5.6.5.2 Clock-to-Q of Flip-Flop

[Figure: flip flop schematic and timing diagram for d, clk, m, clk_b, q showing Tinv, latch clock-to-Q, and flop clock-to-Q.]
$T_{CO}^{Flop} = T_{Inv} + T_{CO}^{Latch}$
5.6.5.3 Setup of Flip-Flop

d clk
EN
m
EN
d clk m clk_b q
Tinv
Tmd
Latch Setup clock path data path
Flop Setup
SUD
Flop
Tmd
Latch SUD
TInv
5.6.5.4 Hold of Flip-Flop

[Figure: flip flop schematic and timing diagram for d, clk, m, clk_b, q showing the hold time for the latch and the hold time for the flop.]
The hold time of the flip flop is the same as the hold time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.
$T_{HO}^{Flop} = T_{HO}^{Latch}$
5.6.6 Timing Analysis of FPGA Cells (Smith 5.1.5)
5.6.6.1 Standard Timing Equations
T_PD: delay from D-inputs to storage element
T_CLKD: delay from clk-input to storage element
T_OUT: delay from storage element to output
T_SUD: setup time = slowest D path - fastest clk path
$T_{SUD} = T_{PD\,Max} - T_{CLKD\,Min}$
T_HO: hold time = slowest clk path - fastest D path
$T_{HO} = T_{CLKD\,Max} - T_{PD\,Min}$
T_CO: delay clk to Q = clk path + output path
$T_{CO} = T_{CLKD} + T_{OUT}$
5.6.6.2 Hierarchical Timing Equations

Add combinational logic to inputs, clock, and outputs of storage element.
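[Figure: storage element with setup, hold, and clock-to-Q times t_SUD, t_HO, t_CO, surrounded by combinational logic: t_PD from the data inputs to d, t_CLKD from clk to the element's clk, and t_OUT from q to the output.]
$T_{SUD} = t_{SUD} + t_{PD\,Max} - t_{CLKD\,Min}$
$T_{HO} = t_{HO} + t_{CLKD\,Max} - t_{PD\,Min}$
$T_{CO} = t_{CO} + t_{CLKD\,Max} + t_{OUT\,Max}$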
5.6.6.3 Actel Act 2 Logic Cell

Timing analysis of Actel Act 2 logic cell (Smith 5.1.5).
Actel ACT
- Basic logic cells are called Logic Modules.
- ACT 1 family: one type of Logic Module (see Figure 5.1, Smith p. 192)
- ACT 2 and ACT 3 families: use two different types of Logic Module (see Figure 5.4, Smith p. 198)
  - C-Module (Combinatorial Module): combinational logic similar to the ACT 1 Logic Module, but capable of implementing a five-input logic function
  - S-Module (Sequential Module): C-Module + Sequential Element (SE) that can be configured as a flip-flop
Actel Timing
ACT family (see Figure 5.5, Smith p. 200):
- Simple. Why? Only logic inside the chip.
- Not exact delay (no place and route or physical layout, hence not accounting for interconnection delay).
- Non-deterministic Actel architecture.
- All primed parameters inside the S-Module are assumed.
- Calculate tSUD, tH, and tCO.
- The combinational logic delay of 3 ns: 0.4 ns went into increasing the setup time, tSUD, and 2.6 ns went into increasing the clock-output delay, tCO.
- From outside we can say that the combinational logic delay is buried in the flip-flop setup time.
Actel Latch
[Figure: Simple Actel-style latch (d, clk, q).]
[Figure: Actel latch with active-low clear (d, clk, clr, q).]
[Figure: Actel flop with active-low clear (d, clk, clr).]
[Figure: Actel sequential module: a C-Module (inputs d00, d01, d10, d11, a1, b1, a0, b0) driving an SE-Module (m, se_clk, se_clk_n) with clk and clr inputs.]
5.6.6.4 Timing Analysis of Actel Sequential Module

Timing parameters for Actel latch with active-low clear:
T_SUD = 0.4 ns
T_HO = 0.9 ns
T_CO = 0.4 ns
Other given timing parameters:
C-Module delay (t_PD) = 3 ns
t_CLKD (from clk to se_clk and se_clk_n) = 2.6 ns
Timing of Actel Module

Question: What are the setup, hold, and T_CO times for the entire Actel sequential module?
Answer:
See Smith p. 199. Use Smith's eqn 5.15, 5.16, and assume t_CLKD = 2.6 ns.
T_SUD = 0.8 ns
T_HO = 0.5 ns
T_CO = 3.0 ns
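These numbers can be reproduced with the hierarchical timing equations. The Python sketch below assumes a single value for each of the C-Module delay and the clock delay (so minimum and maximum coincide) and no logic after the storage element (t_out = 0); the function and parameter names are hypothetical.

```python
def module_timing(t_sud, t_ho, t_co, t_pd, t_clkd, t_out=0.0):
    """Setup, hold, and clock-to-Q of a storage element wrapped in logic.

    t_sud, t_ho, t_co : parameters of the storage element itself
    t_pd              : delay from the data inputs to the storage element
    t_clkd            : delay from the clock input to the storage element
    t_out             : delay from the storage element to the output
    """
    setup = t_sud + t_pd - t_clkd
    hold = t_ho + t_clkd - t_pd
    clock_to_q = t_co + t_clkd + t_out
    return setup, hold, clock_to_q


# Actel latch parameters and module delays quoted above, in ns.
setup, hold, clock_to_q = module_timing(t_sud=0.4, t_ho=0.9, t_co=0.4,
                                        t_pd=3.0, t_clkd=2.6)
print(round(setup, 1), round(hold, 1), round(clock_to_q, 1))
```

The data-path delay lengthens the setup time and shortens the hold time, while the clock-path delay does the opposite and also adds directly to clock-to-Q.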
5.6.7 Exotic Flop
[Figure: exotic flop schematic with inputs d and clk and output q.]
Inverter chain creates evaluation window in time when clock has just risen and p transistors are turned on. When clock is 0, internal nodes precharge to 1. Inverter loops are keepers, which store data value.
Chapter 6
Power Analysis and Design
LEC-18: Introduction to Power

Schedule
wk-01 to 02: VHDL
wk-03 to 05: Design and Optimization
wk-06: Functional Validation
wk-07: Performance Analysis, Prediction, and Optimization
wk-08 to 09: Timing Analysis
wk-09 to 10: Power Analysis and Reduction
  Lec-18 Introduction
  Lec-19 Data Encoding
  Lec-20 Clock Gating
wk-11 to 12: Faults and Testing
wk-13: Review
Purpose and List of Concepts

The purpose of this lecture is to convey the importance that power consumption plays in a wide spectrum of digital systems and to introduce the physical equations used to model power consumption.
- Power
- Energy
- Battery Energy
- Heat Removal
- Static Power Consumption
- Dynamic Power Consumption
- Activity Factor
- Switching Power Consumption
- Short-Circuiting Power Consumption
- Leakage Power Consumption
Background Material
Basic electricity and magnetism equations for voltage, power, current, etc
Reading Material
All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.
Smith 15.5
Mudge: Power: A first class design constraint. Trevor Mudge. Computer, vol. 34, no. 4, April 2001, pp. 52-57
http://www.eecs.umich.edu/tnm/papers/computer01.pdf
For more info (optional) :
Infrared Expose: Thermal imaging of 29 200-MHz and 233-MHz notebooks. PC Online. 1997
http://www.zdnet.com/pcmag/features/notebook3/heat.htm
Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. Brooks, D.M.; Bose, P.; Schuster, S.E.; Jacobson, H.; Kudva, P.N.; Buyuktosunoglu, A.; Wellman, J.; Zyuban, V.; Gupta, M.; Cook, P.W. IEEE Micro Dec 2000.
http://ieeexplore.ieee.org/iel5/40/19226/00888701.pdf?isNumber=19226
Managing the Impact of Increasing Microprocessor Power Consumption. Stephen H. Gunther, Frank Binns, Douglas M. Carmean, and Jonathan C. Hall. Intel Technology Journal. 2001 Quarter 1.
http://developer.intel.com/technology/itj/q12001/articles/art 4.htm
The following are three papers from the 1998 Design Automation Conference (DAC) in a session on Power Dissipation and Distribution in High Performance Processors:
Power Considerations in the Design of the Alpha 21264 Microprocessor. Michael K. Gowan, Larry L. Biro, Daniel B. Jackson.
http://www.sigda.acm.org/Archives/ProceedingArchives/Dac/Dac98/papers/1998/dac98/p
Reducing Power in High-Performance Microprocessors. Vivek Tiwari, Deo Singh, Suresh Rajgopal, Gaurav Mehta, Rakesh Patel, Franklin Baez.
Design and Analysis of Power Distribution Networks in PowerPC(TM) Microprocessors. Abhijit Dharchoudhury, Rajendran Panda, David Blaauw, Ravi Vaidyanathan, Bogdan Tutuianu, David Bearden.
6.1 Overview
6.1.1 Importance of Power and Energy
- Laptops, PDA, cell-phones, etc: obvious!
- Every watt above 40W adds $1 to manufacturing cost
- Approx 25% of operating expense of server farm goes to energy bills
- (Dis)Comfort of Unix labs in E2
- Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros)
- High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Willamette thermal throttling
- In 2000, information technology consumed 8% of total power in US.
6.1.2 Industrial Names and Products
All of the articles and papers below are linked to from the Documentation page on the E&CE 427 web site.
- AMD's Athlon PowerNow!: Reduce power consumption in laptops when running on battery by reducing clock speed and supply voltage.
- Intel Speedstep: Reduce power consumption in laptops when running on battery by reducing clock speed to 70-80% of normal.
- Intel X-Scale: An ARM5-compatible microprocessor for low-power systems: http://developer.intel.com/design/intelxscale/
- Synopsys PowerMill: A simulator that estimates power consumption of the circuit as it is simulated: http://www.synopsys.com/products/etg/powermill ds.html
- Compaq Itsy
- Satellites
6.1.3 Power vs Energy
Most people talk about power reduction, but sometimes they mean power and sometimes energy.
- Power minimization is usually about heat removal
- Energy minimization is usually about battery life or energy costs

Type     Units    Equivalent Types   Equations
Power    Watts    Energy / Time      Volts x I; Joules / sec
Energy   Joules   Work               Volts x Coulombs; (1/2) C Volts^2
6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?
$\text{Power} = \text{Energy} / \text{Time}$
Batteries are rated in Amp-hours at a voltage.
$\text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts}$
$\quad = (\text{Coulombs} / \text{Seconds}) \times \text{Seconds} \times \text{Volts}$
$\quad = \text{Coulombs} \times \text{Volts}$
$\quad = \text{Energy}$
Batteries store energy.
6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease the energy consumed. Work and energy have the same units; therefore, to extend battery life, we truly want to improve efficiency. Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?
$\frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}}$
Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency. (This assumes that all instructions perform the same amount of work!)
6.1.5 Example Problem: Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery?
Question: If I use the SpeedStep feature of my computer, and run at 600MHz with 60W of power, how much longer can I keep the computer running on one battery? How many more simulation steps can I run on one battery?
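One way to work through both questions is the Python sketch below. It rests on an assumption that the problem statement does not give: the processor executes one instruction per clock cycle. The function and parameter names are hypothetical.

```python
def battery_run(volts, amp_hours, watts, clock_hz, instr_per_step,
                instr_per_cycle=1.0):
    """Return (runtime in seconds, simulation steps) for one battery charge."""
    energy_j = volts * amp_hours * 3600.0  # Amp-hours at a voltage -> Joules
    runtime_s = energy_j / watts           # Power = Energy / Time
    steps = runtime_s * clock_hz * instr_per_cycle / instr_per_step
    return runtime_s, steps


full = battery_run(10, 2.5, 70, 700e6, 1e6)
slow = battery_run(10, 2.5, 60, 600e6, 1e6)
print(f"700MHz/70W: {full[0]:.0f} s, {full[1]:.0f} steps")
print(f"600MHz/60W: {slow[0]:.0f} s, {slow[1]:.0f} steps")
print(f"extra time: {slow[0] - full[0]:.0f} s, extra steps: {slow[1] - full[1]:.0f}")
```

Under that assumption the battery holds 90,000 J, and both settings complete about 900,000 simulation steps: the slower setting runs roughly 214 s longer but does no additional work, because power and clock speed were scaled by the same ratio.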
6.2 Power Equations
$\text{Power} = \text{DynamicPower} + \text{StaticPower}$
$\quad = (\text{SwitchPower} + \text{ShortPower}) + \text{LeakagePower}$

Dynamic Power: dependent upon clock speed
Static Power: independent of clock speed
Switching Power: useful, charges up transistors
Short Circuit Power: not useful, both N and P transistors are on
Leakage Power: not useful, leaks around transistor
6.2.1 Dynamic Power and Activity Factor
- Dynamic power is proportional to how often signals change their value (switch).
- Roughly 20% of signals switch during a clock cycle.
- Need to take glitches into account when calculating activity factor.
- Equations for dynamic power contain clock speed and activity factor.
6.2.2 Switching Power
[Figure: inverter charging a capacitor (input 1->0, output 0->1, load CapLoad) and discharging a capacitor (input 0->1, output 1->0, load CapLoad).]
$\text{energy to (dis)charge capacitor} = \frac{1}{2}\, \text{CapLoad} \times \text{VoltSup}^2$
f: frequency at which inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith)
$\text{average switching power} = f \times \text{CapLoad} \times \text{VoltSup}^2$
ClockSpeed: clock speed
ActFact: average number of times that signal switches from 0->1 or from 1->0 during a clock cycle
$\text{average switching power} = \frac{1}{2}\, \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$
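A quick numeric check of the switching power equation, as a Python sketch. The activity factor, clock speed, load capacitance, and supply voltage used here are illustrative assumptions, not values from the notes.

```python
def switching_power(act_fact, clock_speed, cap_load, volt_sup):
    """Average switching power: 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2."""
    return 0.5 * act_fact * clock_speed * cap_load * volt_sup ** 2


# Illustrative signal: activity factor 0.2, 100 MHz clock, 50 fF load, 3.3 V supply.
p_full = switching_power(0.2, 100e6, 50e-15, 3.3)
p_half_v = switching_power(0.2, 100e6, 50e-15, 3.3 / 2)
p_half_f = switching_power(0.2, 50e6, 50e-15, 3.3)

print(f"{p_full:.3e} W")
print("halving the supply voltage:", p_full / p_half_v)
print("halving the clock speed:   ", p_full / p_half_f)
```

Because the supply voltage is squared, halving it cuts switching power by a factor of four, while halving the clock speed only halves it.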
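6.2.3 Short-Circuited Power
[Figure: inverter (Vi, Vo) with short-circuit current IShort while charging a capacitor; gate voltage waveform between GND and VoltSup, marking VoltThresh and VoltSup - VoltThresh, the regions where the P-transistor and the N-transistor are on, and the interval TimeShort.]
$\text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$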
6.2.4 Leakage Power
[Figure: Cross section of inverter showing parasitic diode (Vi, Vo, N, P, N-substrate).]
[Figure: Leakage current through parasitic diode (I versus V, with ILeak marked).]
$\text{ILeak} \propto e^{-q\, \text{VoltThresh} / (k\, T)}$
$\text{PwrLk} = \text{ILeak} \times \text{VoltSup}$
6.2.5 Glossary
ClockSpeed: def = Clock speed; aka f
VoltSup: def = Supply voltage; aka V
VoltThresh: def = Threshold voltage; aka Vth; = voltage at which P transistors turn on
ILeak: def = Leakage current; aka IS (reverse bias saturation current); $\propto e^{-q\, \text{VoltThresh} / (k\, T)}$
TimeShort: def = short circuit time; = Time that both N and P transistors are turned on when signal changes value.
IShort: def = Short circuit current; aka Ishort; = Current that goes through transistor network while both N and P transistors are turned on.
ActFact: def = activity factor; aka A; = NumTransitions / (NumSignals x NumClockCycles)
  Per signal: percentage of clock cycles when signal changes value.
  Per clock cycle: percentage of signals that change value per clock cycle.
  Note: When measuring per circuit, sometimes approximate by looking only at flops, rather than every single signal.
CapLoad: def = load capacitance; aka CL
PwrSw: def = switching power (dynamic); $= \frac{1}{2}\, \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$
PwrShort: def = short-circuit power (dynamic); $= \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$
PwrLk: def = leakage power (static); $= \text{ILeak} \times \text{VoltSup}$
Power: def = total power; $= \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$
MaxClockSpeed: def = Maximum clock speed that an implementation technology can support; aka fmax; $\propto (\text{VoltSup} - \text{VoltThresh})^2 / \text{VoltSup}$
q: def = electron charge; $= 1.60218 \times 10^{-19}$ C
k: def = Boltzmann's constant; $= 1.38066 \times 10^{-23}$ J/K
T: def = temperature in Kelvin
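6.2.6 Note on Power Equations

The power equation:
$\text{Power} = \text{DynamicPower} + \text{StaticPower}$
$\quad = \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$
$\quad = (\text{ActFact} \times \text{ClockSpeed} \times \frac{1}{2}\, \text{CapLoad} \times \text{VoltSup}^2) + (\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}) + (\text{ILeak} \times \text{VoltSup})$
is for an individual signal.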
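Multiple Signals
To calculate dynamic power for n signals with different CapLoad, TimeShort, and IShort:
$\text{DynamicPower} = \sum_{i=1}^{n} (\text{ActFact}_i \times \text{ClockSpeed} \times \frac{1}{2}\, \text{CapLoad}_i \times \text{VoltSup}^2) + \sum_{i=1}^{n} (\text{ActFact}_i \times \text{ClockSpeed} \times \text{TimeShort}_i \times \text{IShort}_i \times \text{VoltSup})$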
6.2.6
25
Average Power
If know average CapLoad, TimeShort, and IShort, then the above formula simplies to: DynamicPower n ActFactAV G
If capacitances and short-circuit parameters dont have an even distribution, then dont average them. If high-capacitance signals have high-activity factors, then averaging the equations will result in erroneously low predictions for power.
s u
ActFactAV G
ClockSpeed
TimeShortAV G
IShortAV G
v
VoltSup
1 2 CapLoadAV G
ClockSpeed
VoltSup2
s u
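The warning about averaging can be checked numerically. The sketch below is a minimal illustration in Python: it compares the per-signal sum with the averaged formula for two signals. The signal values, the function names, and the choice to set short-circuit terms to zero are assumptions made for this example only.

```python
def dynamic_power(signals, clock_speed, volt_sup):
    """Sum switching and short-circuit power over all signals.

    Each signal is a tuple (act_fact, cap_load, time_short, i_short).
    """
    total = 0.0
    for act_fact, cap_load, time_short, i_short in signals:
        pwr_sw = act_fact * clock_speed * 0.5 * cap_load * volt_sup ** 2
        pwr_short = act_fact * clock_speed * time_short * i_short * volt_sup
        total += pwr_sw + pwr_short
    return total


def averaged_dynamic_power(signals, clock_speed, volt_sup):
    """Apply the averaged formula: n times the power of one 'average' signal."""
    n = len(signals)
    avg_signal = tuple(sum(column) / n for column in zip(*signals))
    return n * dynamic_power([avg_signal], clock_speed, volt_sup)


# Hypothetical data: the high-capacitance signal also has the high activity factor.
signals = [(0.9, 10.0, 0.0, 0.0), (0.1, 1.0, 0.0, 0.0)]
exact = dynamic_power(signals, clock_speed=1.0, volt_sup=1.0)
approx = averaged_dynamic_power(signals, clock_speed=1.0, volt_sup=1.0)
print(exact, approx)
```

With these made-up numbers the averaged formula predicts noticeably less power than the per-signal sum, which is the situation the slide warns about.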
LEC-19: Data Encoding for Power Reduction

Lecture Notes Sections: 6.2.6 6.5.2.3

The purpose of this lecture is to give an overview of power reduction techniques and then examine the design process for a common power reduction technique, data encoding.
6.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.
Analog Parameters
Power reduction parameters at the analog level:
- capacitance: for example, Silicon on Insulator (SOI)
- resistance: for example, copper wires
- voltage: low-voltage circuits
Analog Techniques
Power reduction techniques at the analog level:
- dual-Vt: Two different supply voltages: high voltage for performance-critical portions of the design, low voltage for the remainder of the circuit.
- exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
- adiabatic circuits: Special circuitry that consumes power on 0→1 transitions, but not 1→0 transitions. These sacrifice performance for reduced power.
- clock trees: Up to 30% of total power can be consumed in clock generation and the clock tree.
Digital Parameters
Power-reduction parameters at the digital level:
- capacitance (number of gates)
- activity factor
- clock frequency
Digital Techniques
Power-reduction techniques at the digital level:
- multiple clocks: Put a high-speed clock in performance-critical parts of the design and a low-speed clock in the remainder of the circuit.
- clock gating: Turn off the clock to portions of a chip when they are not being used.
- data encoding: Gray coding vs one-hot vs fully encoded vs ...
- asynchronous circuits: Get rid of clocks altogether.
Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html
6.4 Voltage Reduction for Power Reduction
If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

$$Power = (ActFact \times ClockSpeed \times \tfrac{1}{2} CapLoad \times VoltSup^2) + (ActFact \times ClockSpeed \times TimeShort \times IShort \times VoltSup) + (ILeak \times VoltSup)$$

we observe: $Power \propto VoltSup^2$
Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit. In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage.

$$MaxClockSpeed \propto \frac{(VoltSup - VoltThresh)^2}{VoltSup}$$
Effect of Decreasing Supply Voltage on Delay

Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.
Answer:
d = 20 ns    current delay along critical path
d' = ??      new delay along critical path
V = 2.8 V    current supply voltage
V' = 2.2 V   new supply voltage
Vt = 0.7 V   threshold voltage

$$MaxClockSpeed \propto \frac{1}{d} \qquad MaxClockSpeed \propto \frac{(V - V_t)^2}{V} \qquad \Rightarrow \qquad d \propto \frac{V}{(V - V_t)^2}$$
$$d' = d \times \frac{V'}{(V' - V_t)^2} \times \frac{(V - V_t)^2}{V} = 20\,ns \times \frac{2.2\,V}{(2.2\,V - 0.7\,V)^2} \times \frac{(2.8\,V - 0.7\,V)^2}{2.8\,V} \approx 31\,ns$$
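The delay calculation above can be reproduced with a few lines of Python. This is a sketch that assumes only the proportionality MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup; the function name `scaled_delay` is made up for this example.

```python
def scaled_delay(delay, v_old, v_new, v_thresh):
    """Scale a critical-path delay when the supply voltage changes.

    Delay is inversely proportional to MaxClockSpeed, which is
    proportional to (VoltSup - VoltThresh)**2 / VoltSup.
    """
    speed_old = (v_old - v_thresh) ** 2 / v_old
    speed_new = (v_new - v_thresh) ** 2 / v_new
    return delay * speed_old / speed_new


# Values from the question: 20 ns at 2.8 V, threshold 0.7 V, new supply 2.2 V.
new_delay = scaled_delay(20.0, 2.8, 2.2, 0.7)
print(round(new_delay, 1))  # about 30.8 ns, which the slide rounds to 31 ns
```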
Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage. However, as threshold voltage drops, leakage current increases:

$$ILeak \propto e^{-q \cdot VoltThresh / (k \cdot T)}$$

And increasing the leakage current increases the power: $Power \propto ILeak$.
So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power) and increasing ILeak (which has a linear effect on increasing power).
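To get a feel for how strongly leakage depends on the threshold voltage, the following Python sketch evaluates the ratio of leakage currents for two threshold voltages, using the proportionality ILeak ∝ e^(−q·VoltThresh/(k·T)) and the constants from the glossary. The temperature of 300 K and the example threshold voltages are assumptions chosen for illustration.

```python
import math

Q = 1.60218e-19  # electron charge (C)
K = 1.38066e-23  # Boltzmann's constant (J/K)


def leakage_ratio(vt_old, vt_new, temperature=300.0):
    """Return ILeak(vt_new) / ILeak(vt_old) under the exponential model."""
    return math.exp(-Q * (vt_new - vt_old) / (K * temperature))


# Assumed example: drop the threshold voltage from 0.7 V to 0.6 V at 300 K.
ratio = leakage_ratio(0.7, 0.6)
print(ratio)
```

Under this simple model, a 0.1 V drop in threshold voltage multiplies the leakage current by a factor of several tens, which is why the balance described above matters.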
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor. The most common example is Gray coding where exactly one bit changes value each clock cycle when counting.
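The one-bit-per-step property of Gray coding can be checked with a short Python sketch. The binary-reflected Gray code used here (n XOR n>>1) produces the same 4-bit sequence as the Gray column in the example problem; the helper names are made up for this illustration.

```python
def binary_to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


gray = [binary_to_gray(i) for i in range(16)]

# Transitions per clock cycle for a 4-bit counter, including the wrap-around step.
gray_toggles = [bits_changed(gray[i], gray[(i + 1) % 16]) for i in range(16)]
binary_toggles = [bits_changed(i, (i + 1) % 16) for i in range(16)]

print(gray_toggles)         # one bit changes in every step
print(sum(binary_toggles))  # total bit changes for binary counting
```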
6.5.2 Example Problem
6.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)
[Figure: Required behaviour. Waveform of clk and done over clock cycles 1 to 33.]
You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)
Question: What is the relative amount of power consumption for the different options?
6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.
[Figure: FPGA cell containing a PLA and a flip-flop]
1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

Decimal  Gray  One-Hot           Binary
0        0000  0000000000000001  0000
1        0001  0000000000000010  0001
2        0011  0000000000000100  0010
3        0010  0000000000001000  0011
4        0110  0000000000010000  0100
5        0111  0000000000100000  0101
6        0101  0000000001000000  0110
7        0100  0000000010000000  0111
8        1100  0000000100000000  1000
9        1101  0000001000000000  1001
10       1111  0000010000000000  1010
11       1110  0000100000000000  1011
12       1010  0001000000000000  1100
13       1011  0010000000000000  1101
14       1001  0100000000000000  1110
15       1000  1000000000000000  1111
6.5.2.3 Answer
Outline of Thinking
Factors to consider that distinguish the options: capacitance and activity factor. Capacitance is dependent upon the number of signals, and whether a signal is combinational or a flop.
Sketch Out the Circuitry

Name the output done and the count digits d().
[Figure: Block diagram for Gray and Binary counters. PLA cells for d(0), d(1), d(2), d(3), and a PLA for done.]
[Figure: Block diagram for One-Hot counter. Cells for d(0), d(1), ..., d(15), and done.]
Observation:
The Gray and Binary counters have the same design, and the Gray counter will have the lower activity factor. Therefore, the Gray counter will have lower power than the Binary counter.
However, we don't know how much lower the power of the Gray counter will be, and we don't know how much power the One-Hot counter will consume.
Capacitance

Counter  Signal  Type   cap  number  subtotal cap
Gray     d()     PLAs   2    4       8
                 Flops  1    4       4
         done    PLAs   2    1       2
                 Flops  1    0       0
1-Hot    d()     PLAs   2    0       0
                 Flops  1    16      16
         done    PLAs   2    0       0
                 Flops  1    0       0
Binary   d()     PLAs   2    4       8
                 Flops  1    4       4
         done    PLAs   2    1       2
                 Flops  1    0       0
One-Hot Activity Factor

NOTE: Activity factor for One-Hot counter. Because all clock cycles have the same number of transitions for the One-Hot counter, we could have calculated the activity factor as two transitions per sixteen signals.
Activity Factor

Gray Coding Activity Factor
[Figure: Gray coding waveforms for clk, d(0), d(1), d(2), d(3), done]
Transitions per 16 clock cycles: d(0) 8/16, d(1) 4/16, d(2) 2/16, d(3) 2/16, done 2/16
One-Hot Activity Factor
[Figure: One-hot coding waveforms for clk, d(0), d(1), d(2), ..., done]
Transitions per 16 clock cycles: d(0) 2/16, d(1) 2/16, d(2) 2/16, ... 2/16, done 2/16
Binary Coding Activity Factor
[Figure: Binary coding waveforms for clk, d(0), d(1), d(2), d(3), done]
Transitions per 16 clock cycles: d(0) 16/16, d(1) 8/16, d(2) 4/16, d(3) 2/16, done 2/16
Summary of Activity Factors

Counter  Signal  Type   act fact
Gray     d()     PLAs   1/4 (1 of 4 signals changes in each clock cycle)
                 Flops  1/4 (1 of 4 signals changes in each clock cycle)
         done    PLAs   2 transitions / 16 clock cycles
                 Flops  -
1-Hot    d()     PLAs   -
                 Flops  2 transitions / 16 clock cycles
         done    PLAs   -
                 Flops  -
Binary   d()     PLAs   (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
                 Flops  (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
         done    PLAs   2 transitions / 16 clock cycles
                 Flops  -
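The per-signal transition counts can be derived directly from the encodings. The Python sketch below counts, for each bit of a counter, how many times it changes over one full 16-cycle period (including the wrap-around back to the first value), and then computes the activity factor as NumTransitions / (NumSignals × NumClockCycles). The function names are assumptions made for this example.

```python
def transitions_per_bit(codes, width):
    """Count transitions of each bit over one full period of the code sequence."""
    counts = [0] * width
    n = len(codes)
    for i in range(n):
        diff = codes[i] ^ codes[(i + 1) % n]  # bits that change on this step
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts


def activity_factor(codes, width):
    """NumTransitions / (NumSignals * NumClockCycles)."""
    return sum(transitions_per_bit(codes, width)) / (width * len(codes))


binary = list(range(16))
gray = [i ^ (i >> 1) for i in range(16)]
one_hot = [1 << i for i in range(16)]

print(transitions_per_bit(gray, 4), activity_factor(gray, 4))
print(transitions_per_bit(binary, 4), activity_factor(binary, 4))
print(transitions_per_bit(one_hot, 16), activity_factor(one_hot, 16))
```

Index 0 of each returned list corresponds to d(0).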
Putting it all Together

Counter  Signal  Type   subtotal cap  act fact  power
Gray     d()     PLAs   8             1/4       2
                 Flops  4             1/4       1
         done    PLAs   2             2/16      4/16
                 Flops  0             -         0
         Total                                  3.25
1-Hot    d()     PLAs   0             -         0
                 Flops  16            2/16      2
         done    PLAs   0             -         0
                 Flops  0             -         0
         Total                                  2
Binary   d()     PLAs   8             0.47      3.76
                 Flops  4             0.47      1.88
         done    PLAs   2             2/16      0.25
                 Flops  0             -         0
         Total                                  5.87
Final Answer
Using the totals 3.25 (Gray), 2 (One-Hot), and 5.87 (Binary):
If we choose Binary counting as the baseline, then the relative amounts of power are: Gray 55%, One-Hot 34%, Binary 100%.
If we choose One-Hot counting as the baseline, then the relative amounts of power are: Gray 163%, One-Hot 100%, Binary 294%.
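The totals in the example can be recomputed as capacitance × number of cells × activity factor, summed over the rows that have cells. The Python sketch below does this with the capacitances from the problem (PLA = 2, flop = 1); it uses the unrounded Binary activity factor 30/64 rather than 0.47, so the Binary total differs slightly in the last digit from a hand calculation with rounded values. The function and variable names are made up for this illustration.

```python
CAP_PLA = 2   # capacitance of a combinational cell (twice that of a flop)
CAP_FLOP = 1  # capacitance of a flip-flop


def counter_power(rows):
    """Relative power: sum of cap * number_of_cells * activity_factor."""
    return sum(cap * number * act for cap, number, act in rows)


gray = counter_power([
    (CAP_PLA, 4, 1 / 4),    # d() PLAs
    (CAP_FLOP, 4, 1 / 4),   # d() flops
    (CAP_PLA, 1, 2 / 16),   # done PLA
])
one_hot = counter_power([
    (CAP_FLOP, 16, 2 / 16),  # d() flops
])
binary = counter_power([
    (CAP_PLA, 4, 30 / 64),   # d() PLAs
    (CAP_FLOP, 4, 30 / 64),  # d() flops
    (CAP_PLA, 1, 2 / 16),    # done PLA
])

print(gray, one_hot, binary)
```

Relative power for any baseline follows by division, for example `gray / binary` or `binary / one_hot`.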
LEC-20: Clock Gating for Power Reduction


The purpose of this lecture is to outline the design process for a common power reduction technique, clock-gating, and to analyze the success of the design.
- clock gating: idea
- circuitry for clock gating
- power analysis of clock gating
6.6 Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.
6.6.1 Introduction to and Overview of Clock Gating

6.6.1.1 Examples of Clock Gating

Condition                                           Circuitry turned off
O/S in standby mode                                 Everything except core state (PC, registers, caches, etc)
No floating point instructions for k clock cycles   floating point circuitry
Instruction cache miss                              Instruction decode circuitry
No instruction in pipe stage i                      Pipe stage i+1
6.6.1.2 Design Tradeoffs

+ Can significantly reduce activity factor (Synopsys PowerCompiler claims that it can cut power to 50-80% of the ungated level)
- Increases design complexity: design effort, bugs!
- Increases area
- Increases clock skew
6.6.1.3 Functional Validation and Clock Gating

It's a functional bug to turn a clock off when it's needed for valid data. It's functionally OK, but wasteful, to turn a clock on when it's not needed. (About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock gating.) Nicolas Mokhoff. EE Times. June 27, 2001. http://www.edtn.com/story/OEG20010621S0080
6.6.2 Implementing Clock Gating
Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.
[Figure: Without clock gating. Circuit with inputs i_data, i_valid, clk and outputs o_data, o_valid.]
[Figure: With clock gating. Circuit with i_* and o_* signals clocked by cool_clk; a Clock Enable State Machine driven by clk and i_wakeup produces clk_en.]
6.6.2.1 Simple Power Analysis

Sample problem:
Question: How much power will be saved in the following clock-gating scheme?
- 70% of the time the main circuit has valid data
- clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
- clock gating circuit has 10% of the area of the main circuit
- clock gating circuit has the same activity factor as the main circuit
- neglect short-circuiting and leakage power
Answer to Simple Power Analysis

Answer:
1. Set up main equations
   PwrMain    = power for main circuit without clock gating
   PwrMain'   = power for main circuit with clock gating
   PwrClkFsm  = power for clock enable state machine

   $$PwrTot = PwrMain \qquad PwrTot' = PwrMain' + PwrClkFsm$$
   $$Pwr = PwrSw + PwrLk + PwrShort \quad (PwrLk \text{ and } PwrShort \text{ negligible}) \quad \Rightarrow \quad Pwr = \tfrac{1}{2} A\, C\, V^2$$
   $$PwrTot = \tfrac{1}{2} A_{Main} C_{Main} V^2 = \tfrac{1}{2} A\, C\, V^2$$
   $$PwrTot' = \tfrac{1}{2} A'_{Main} C_{Main} V^2 + \tfrac{1}{2} A_{ClkFsm} C_{ClkFsm} V^2 = \tfrac{1}{2} A' C\, V^2 + \tfrac{1}{2} A\, (0.1\,C)\, V^2$$
   $$\frac{PwrTot'}{PwrTot} = \frac{A' C + A\,(0.1\,C)}{A\, C} = \frac{A' + 0.1 A}{A}$$

2. Find new activity factor for main circuit (A'):
   Eff       = effectiveness of clock gating
   PctValid  = percentage of clock cycles with valid data
   PctClk    = percentage of clock cycles that clock toggles

   $$PctClk = 1 - Eff\,(1 - PctValid)$$
   Intuition: when Eff = 0%, PctClk = 100%; when Eff = 100%, PctClk = PctValid.
   $$A' = PctClk \times A = \left(1 - Eff\,(1 - PctValid)\right) A = \left(1 - 0.9\,(1 - 0.7)\right) A = 0.73 A$$

3. Find ratio of new total power to previous total power:
   $$\frac{PwrTot'}{PwrTot} = \frac{A' + 0.1 A}{A} = \frac{0.73 A + 0.1 A}{A} = 0.83$$

4. Final answer: new power is 83% of original power
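The calculation above generalizes to other effectiveness and area figures. The following Python sketch evaluates the ratio of gated to ungated power under the same assumptions as the sample problem (same activity factor for the clock enable state machine as for the ungated main circuit, negligible short-circuit and leakage power). The function and parameter names are made up for this example.

```python
def gated_power_ratio(pct_valid, eff, fsm_area_ratio):
    """Ratio of total power with clock gating to power without it.

    pct_valid:      fraction of clock cycles with valid data
    eff:            fraction of invalid-data cycles in which the clock is off
    fsm_area_ratio: capacitance of the clock enable state machine
                    relative to the main circuit
    """
    pct_clk = 1 - eff * (1 - pct_valid)  # fraction of cycles the gated clock toggles
    return pct_clk + fsm_area_ratio


ratio = gated_power_ratio(pct_valid=0.7, eff=0.9, fsm_area_ratio=0.1)
print(round(ratio, 2))  # 0.83
```

Note that with `eff=0` the ratio exceeds 1: the state machine then costs power without saving any.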
6.6.2.2 Valid-Bit Protocol

We need a mechanism to tell a circuit when to pay attention to its data inputs, e.g. when it is supposed to decode and execute an instruction, or write data to a memory array.
[Figure: Valid-Bit Protocol. Circuit with inputs clk, i_valid, i_data and outputs o_valid, o_data, with waveforms for the same signals.]
i_valid: high when i_data has valid data; signifies whether the circuit should pay attention to or ignore the data.
o_valid: high when o_data has valid data; signifies whether the environment should pay attention to the output of the circuit.
For more on circuit protocols, see section 2.4.10.
Microscopic Analysis of Valid-Bit Propagation
[Figure: circuit and waveforms for clk, i_valid, o_valid]
[Figures: Which Clock Edges Are Needed? / Minimal Sequence of Clock Edges? / Too Few Clock Edges / Minimal Sequence of Clock Edges!]
6.6.2.3 Clock Gating and Big Circuit

Before Clock Gating
[Figure: circuit with data_in, valid_in, clk, data_out, valid_out, and waveforms for clk, valid_in, data_in, valid_out, data_out; data_out values are marked as don't care or uninitialized when not valid]

After Clock Gating: Circuitry
[Figure: circuit with data_in, valid_in, data_out, valid_out clocked by cool_clk; a Clock Enable State Machine with inputs hot_clk and wakeup_in and outputs clk_en and wakeup_out]

hot_clk: clock that always toggles
cool_clk: gated clock; sometimes toggles, sometimes stays low
wakeup: alerts circuit that valid data will be arriving soon
clk_en: turns on cool_clk
After Clock Gating: New Signals
[Figure: waveforms for hot_clk, wakeup_in, valid_in, data_in, clk_en, cool_clk, valid_out, data_out, wakeup_out]
[Figures: New Signal: Wakeup (no, not you) / New Signal: Clock Enable, Cool Clock / New Signal: Wakeup Out / After Clock Gating: New Signals]
6.6.2.4 Designing Clock Gating Circuitry

Design Decisions
- What level of granularity for gated clocks? Entire module? Individual pipe stages? Something in between?
- When should the clocks turn off?
- When should the clocks turn on?
- Protocol for incoming wakeup signal?
- Protocol for outgoing wakeup signal?
Wakeup Protocol
Designers negotiate incoming and outgoing wakeup protocol with environment. Example wakeup protocol:
- wakeup_in will arrive 1 clock cycle before valid data
- wakeup_in will stay high until there have been at least 3 cycles of invalid data
- same protocol for wakeup_out
6.6.3 Design Problem
Design a clock enable state machine for a pipelined module whose latency varies from 5 to 10 clock cycles and that can hold a maximum of 6 instructions (parcels of data).
Design Strategy
When designing clock gating circuitry, consider the two extreme cases:
- a constant stream of valid data
- circuit is turned off and receives a single parcel of valid data
For a constant stream of valid data, the key is to not incur a large overhead in design complexity, area, or clock period when clocks will always be toggling. For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data can percolate through the circuit. Also, we want to turn off the clock as soon as possible after data leaves.
6.6.3.1 Solution Sketch

Scenario 1
1. Scenario: turned off and get one parcel.
   (a) Need to turn on and stay on until parcel departs
   (b) idea #1 (parcel count): count number of parcels inside module; keep clocks toggling if there is a non-zero number of parcels.
   (c) idea #2 (cycle count): count number of clock cycles since last valid parcel entered module; once we hit 10 clock cycles without any valid parcels entering, we know that all parcels have exited. Keep clocks toggling if counter is less than 10.

Scenario 2
1. Scenario: constant stream of parcels
   (a) parcel count would require looking at input and output stream and conditionally incrementing or decrementing counter
   (b) cycle count would keep resetting counter
Waveforms for Parcel Count
[Figure: waveforms for i_valid, o_valid, parcel_count, parcel_clk_en over clock cycles 1 to 24]

Waveforms for Cycle Count
[Figure: waveforms for i_valid, o_valid, cycle_count, cycle_clk_en over clock cycles 1 to 24; cycle_count values shown: 0 1 2 0 0 0 0 1 2 3 4 0 1 2 3 4 5 6 7 8 9 10]
Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for counter. Counter must be able to increment and decrement. Equations for counter action (increment/decrement/no-change):

i_valid  o_valid  action
0        0        no change
0        1        decrement
1        0        increment
1        1        no change

Combined increment and decrement can be done with a half-adder (AND, NOR, OR) and one XOR gate. Count each normal gate as one unit of capacitance, XOR as 1.5 units of capacitance, and a flop as 2 units of capacitance. (This information would be given on an exam.) To perform both increment and decrement, we will need 4.5 units of capacitance per bit for the combinational circuitry and 2 units of capacitance per bit for the flop. This gives a total of 6.5 units of capacitance per bit.
Cycle Count Design

Need to count (0..10) cycles, therefore need 4 bits for counter. Counter must be able to increment and reset. Increment on each clock cycle, unless we get i_valid, in which case reset. To perform increment, we will need just half adders, which is 3 gates of capacitance per bit for the combinational circuitry. After adding a flop, there is a total of 5 units of capacitance per bit.
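The behaviour of the two counters can be sketched as a cycle-by-cycle simulation. The Python below is a behavioural model only, not the hardware design: it assumes that the counters update once per clock cycle from the current i_valid and o_valid values, that the cycle counter starts at 10 and saturates there when idle, and that the clock enable is derived as "count greater than 0" (parcel count) or "count less than 10" (cycle count). The function names and the example input traces are made up for illustration.

```python
def parcel_count_trace(i_valid, o_valid):
    """Parcel counter: increment when a parcel enters, decrement when one leaves."""
    count, trace = 0, []
    for iv, ov in zip(i_valid, o_valid):
        if iv and not ov:
            count += 1
        elif ov and not iv:
            count -= 1
        # both or neither asserted: no change
        trace.append(count)
    return trace


def cycle_count_trace(i_valid, limit=10):
    """Cycle counter: reset on i_valid, otherwise increment up to the limit."""
    count, trace = limit, []  # assumed idle state: counter already at the limit
    for iv in i_valid:
        if iv:
            count = 0
        elif count < limit:
            count += 1
        trace.append(count)
    return trace


def parcel_clk_en(trace):
    return [count > 0 for count in trace]


def cycle_clk_en(trace, limit=10):
    return [count < limit for count in trace]


# Assumed i_valid trace, chosen so that the cycle counter steps through
# 0 1 2 0 0 0 0 1 2 3 4 0 1 ... 10.
i_valid = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1] + [0] * 10
cycles = cycle_count_trace(i_valid)
print(cycles)
print(cycle_clk_en(cycles))

# A single parcel that enters in the first cycle and leaves in the sixth.
parcels = parcel_count_trace([1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1])
print(parcels, parcel_clk_en(parcels))
```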
Design Analysis
Assuming that:
- both designs will be implemented on the same technology
- leakage current is negligible
- short-circuit power is negligible
The two factors affecting power are activity factor and capacitance.
Design Analysis (Cont'd)

Capacitance
Design        num bits  circuit                       total cap
parcel count  3 bit     counter (0..6), inc/dec       3 × 6.5 = 19.5
cycle count   4 bit     counter (0..10), half adders  4 × 5 = 20

Parcel count wins on capacitance.

Power
If a parcel leaves after 5 clock cycles, cycle count will continue to power the circuit for another 5 cycles (wasting power!). So, it looks like parcel count wins. However, we should carry out a detailed analysis to see how much difference there is between the two options.
Behavioural Analysis
Assuming:
- 60% of incoming data are valid
- even distribution of latencies
- average length of continuously valid data is 80 instructions
Question:
Which design option has lower power?
Answer:
Goal: determine what percentage of time cool_clk is toggling for each of the two design options.
Construct Average Waveform

1. Assume that all three of the circuits in question (main circuit without clock gating, and the clock enable state machines) have the same activity factor.
2. Construct average waveform for cool clock.
   (a) 60% of incoming data are valid
   (b) average length of valid data is 80 instructions
   (c) length of window for average data is:
       $$WindowLength = \frac{ValidLength}{PctValid} = \frac{80}{0.6} = 133 \text{ cycles}$$
[Figure: average waveform with 80 cycles of valid data in a window of 133 clock cycles]
Parcel Count Clocking

3. Calculate percentage of clock cycles that parcel count circuit is powered.
   (a) Clock will run for: 80 clock cycles + average latency - 1 + 1 cycle to clear out last parcel. The first clock cycle of latency of the last parcel is counted in the 80 clock cycles. The last clock cycle clears out the last valid parcel by flopping in an invalid parcel. See section 6.6.2.2.
   (b) Minimum latency is 5, max is 10, distribution is even. Therefore average latency is 7.5.
   (c) Clock will run for: 80 + 7.5 - 1 + 1 = 87.5 cycles.
   (d) Percentage clocking is 87.5 / 133 = 65.8%
Cycle Count Clocking

4. Calculate percentage of clock cycles that cycle count circuit is powered.
   (a) Clock will run for: 80 clock cycles + 10 - 1 for powering last parcel + 1 cycle to clear out last parcel = 90.0 clock cycles
   (b) Percentage clocking is 90.0 / 133 = 67.7%
Wrapup
5. Summary
                      Parcel Count  Cycle Count
   Capacitance        19.5          20
   Percentage clocking 65.8%        67.7%
6. Parcel count wins on both capacitance and activity factor, therefore it has the lowest power consumption.
7. How much more power does the cycle count design consume?
   $$n\%\ \text{more power} = \frac{CycPwr - PclPwr}{PclPwr} = \frac{(20 \times 0.677) - (19.5 \times 0.658)}{19.5 \times 0.658} = 5.5\%$$
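The comparison of the two designs can be reproduced numerically. The Python sketch below follows the steps of the analysis: it uses the rounded window length of 133 cycles, treats relative power as capacitance × percentage clocking (same activity factor assumed for both state machines), and compares the two designs. The function and variable names are assumptions made for this example.

```python
def percent_clocking(valid_length, tail_cycles, window_length):
    """Fraction of the window during which the gated clock is toggling."""
    return (valid_length + tail_cycles) / window_length


WINDOW = 133  # 80 / 0.6, rounded as in the analysis

# Parcel count: average latency 7.5, minus 1 cycle already counted, plus 1 to clear out.
parcel_pct = percent_clocking(80, 7.5 - 1 + 1, WINDOW)
# Cycle count: always waits for the maximum latency of 10 cycles.
cycle_pct = percent_clocking(80, 10 - 1 + 1, WINDOW)

parcel_power = 19.5 * parcel_pct  # capacitance * percentage clocking
cycle_power = 20.0 * cycle_pct
extra = (cycle_power - parcel_power) / parcel_power

print(round(parcel_pct * 100, 1), round(cycle_pct * 100, 1), round(extra * 100, 1))
```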
Chapter 7
Fault Testing and Testability

7.1 Introduction

7.1.1
The purpose of this lecture is to explain the sources of manufacturing faults, how the faults are caught, and the tradeoffs in trying to catch these faults. We will then introduce the mathematical models for the physical faults.
Topics: physical faults, wired-AND, wired-OR, stronger wins, mathematical model of fault, causes of faults, testing, burn in, bin sorting, scan testing, built-in self test, IDDQ testing, economics of testing, locations of faults, test vector to detect a fault, single stuck-at faults, undetectable faults, redundant circuitry, fault domination, fault collapsing, fault equivalence, gate collapsing, node collapsing, fault collapsing (intelligent collapsing), fault coverage, test vector generation, required test vectors, order to run test vectors, fault hiding, scan testing, scan chain, testing procedure, time to run a test, boundary scan testing, JTAG, IEEE 1149, length of time to do a scan test, hardware to do scan testing
7.1.2 Background Material
Karnaugh maps

7.1.3 Reading Material
Smith ch14
LEC-21: Introduction to Faults, Testing, and Testability

Schedule
wk-01 02  VHDL
wk-03 05  Design and Optimization
wk-06     Functional Validation
wk-07     Performance Analysis, Prediction, and Optimization
wk-08 09  Timing Analysis
wk-09 10  Power Analysis and Reduction
wk-11 12  Faults and Testing
            Lec-21 Introduction
            Lec-22 Detection and Test Generation
            Lec-23 Built-In Self Test (BIST)
            Lec-24 JTAG
wk-13     Review

The purpose of this lecture is to explain the sources of manufacturing faults, how the faults are caught, and the tradeoffs in trying to catch these faults. We will then introduce the mathematical models for the physical faults.
Topics: physical faults, wired-AND, wired-OR, stronger wins, mathematical model of fault, causes of faults, testing, burn in, bin sorting, scan testing, built-in self test, IDDQ testing, economics of testing, locations of faults, test vector to detect a fault, single stuck-at faults, undetectable faults, redundant circuitry
Background Material: Karnaugh maps
Reading Material: Smith ch14
7.2 Faults and Testing

7.2.1 Overview of Faults and Testing

7.2.1.1 Faults (Smith 14.3)

During manufacturing, faults can occur that make the physical product behave incorrectly.
Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.
[Figure: Good wires / Shorted wires / Open wire]
7.2.1.2 Causes of Faults (Smith 14.3)
- Fabrication process (initial construction is bad): chemical mix, impurities, dust
- Manufacturing process (damage during construction): handling, probing, cutting, mounting
- Materials: corrosion, adhesion failure, cracking, peeling
7.2.1.3 Testing (Smith 14)

Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.
7.2.1.4 Burn In (Smith 14.3.1)

Some chips that come off the manufacturing line will work for a short period of time and then fail.
Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temps, high and low voltages, high and low clock speeds) before and during testing. The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early use by customers.
[Figure: Soon to break wire]
The hope is that the extreme conditions will cause chips to break that would otherwise have broken in the customer's system soon after arrival. The trick is to create conditions that are extreme enough that bad chips will break, but not so extreme as to cause good chips to break.
7.2.1.5 Bin Sorting (Smith 5.1.6)

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) at the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz. Overclocking is taking a chip rated at n MHz and running it at (1 + x) n MHz. (Sure your computer often crashes and loses your assignment, but just think how much more productive you are when it is working...)
7.2.1.6 Testing Techniques (Smith 14)

Scan Testing or Boundary Scan Testing (BST, JTAG) (Smith 14.2, 14.6):
- Load test vector from tester into chip
- Run chip on test data
- Unload result data from chip to tester
- Compare results from chip against those produced by simulation
- If results are different, then chip was not manufactured correctly
Built In Self Test (BIST) (Smith 14.7): Build circuitry on chip that generates tests and compares actual and expected results.
IDDQ Testing (Smith 14.3.6): Measure the quiescent current between VDD and GND. Variations from expected values indicate faults.
Challenges
The challenges in testing:
- test circuitry consumes chip area
- test circuitry reduces performance
- decrease fault escapee rate of product that ships while having minimal impact on production cost and chip performance
- external tester can only look at I/O pins
- ratio of internal signals to I/O pins is increasing
- some faults will only manifest themselves at high clock frequencies
"The crux of testing is to use yesterday's technology to find faults in tomorrow's chips." Agilent engineer at ARVLSI 2001.
7.2.1.7 Design for Testability (DFT) (Smith 14.6)

Scan testing and self-testing require adding extra circuitry to chips. Design for test is the process of adding this circuitry in a disciplined and correct manner. A hot area of research, which is becoming mainstream practice, is developing synthesis tools to automatically add the testing circuitry.
7.2.2 Example Problem: Economics of Testing (Smith 14.1)

Given information:
- The ACHIP costs $10 without any testing
- Each board uses one ACHIP (plus lots of other chips that we don't care about)
- 68% of the manufactured ACHIPs do not have any faults
- For the ACHIP, it costs $1 per chip to catch half of the faults
- Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run)
- If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP
- Board-level testing will detect 100% of the faults in an ACHIP

Economics of Testing
Question: What escapee fault rate will minimize the cost of the ACHIP?

$$TotCost = NoTestCost + TestCost + EscapeeProb \times ReplaceCost$$

NoTestCost  TestCost  EscapeeProb  EscapeeProb × ReplaceCost  TotCost
$10         $0        32%          0.32 × 200 = $64           $74
$10         $1        16%          0.16 × 200 = $32           $43
$10         $2        8%           0.08 × 200 = $16           $28
$10         $4        4%           0.04 × 200 = $8            $22
$10         $8        2%           0.02 × 200 = $4            $22
$10         $16       1%           0.01 × 200 = $2            $28
$10         $32       0.5%         0.005 × 200 = $1           $43

The total cost is minimized at $22, with an escapee rate of either 4% or 2%.
For high-volume, small-area chips, testing can consume more than 50% of the total cost.
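The cost table can be generated from the given information. The Python sketch below is one way to do this: it starts with no testing, then charges $1 for the first halving of the escapee rate and doubles the test cost for each further halving. The function name and parameter names are made up for this example.

```python
def total_costs(chip_cost=10.0, fault_prob=0.32, replace_cost=200.0, levels=7):
    """Return (test_cost, escapee_prob, total_cost) for each level of testing."""
    rows = []
    test_cost, escapee_prob = 0.0, fault_prob
    for _ in range(levels):
        total = chip_cost + test_cost + escapee_prob * replace_cost
        rows.append((test_cost, escapee_prob, total))
        # The next level halves the escapees: $1 for the first halving, then doubling.
        test_cost = 1.0 if test_cost == 0 else test_cost * 2
        escapee_prob /= 2
    return rows


rows = total_costs()
for test_cost, escapee_prob, total in rows:
    print(f"test ${test_cost:>4.0f}  escapees {escapee_prob:6.3%}  total ${total:.0f}")

best = min(total for _, _, total in rows)
```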
7.2.3 Physical Faults (Smith 14.3.3)

7.2.3.1 Types of Physical Faults

[Figure: Good circuit with signals a, b, c, d]
[Figure: Bad circuits, each drawn on signals a, b, c, d]
- open
- wired-AND bridging short
- wired-OR bridging short
- stronger-wins bridging short (b is stronger)
- short to VDD
- short to GND
7.2.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.
[Figure: Three different locations for potential faults on signal b; affected gates are marked BAD and unaffected gates OK]
7.2.3.3 Layout Affects Locations

[Figure: one schematic with signals a to i and two layouts, with fault locations L1 to L4 in one layout and L1 to L5 in the other]
For the same schematic, we can have either four or five different locations for potential faults, depending upon how the circuit is laid out.
7.2.3.4 Naming Fault Locations

Two ways to name a fault location:
- pin-fault model: Faults are modelled as occurring on input and output pins of gates.
- net-fault model: Faults are modelled as occurring on segments of wires.
In E&CE 427, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.
7.2.4 Detecting a Fault
To detect a fault, we compare the actual output of the circuit against the expected value.

7.2.4.1 Which Test Vectors will Detect a Fault?

[Figure: Good circuit and Faulty circuit, each with inputs a, b, c and internal signals d, e]

a  b  c  good  faulty
0  0  0  0     0
0  0  1  1     1
0  1  0  0     0
0  1  1  1     1
1  0  0  0     0
1  0  1  1     1
1  1  0  1     0
1  1  1  1     1

The only test vector that will detect the fault in the circuit is 110. Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.
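Finding the detecting test vectors amounts to comparing the good and faulty outputs for every input combination. The Python sketch below does this by brute force. The Boolean expressions are inferred from the truth table (good output = ab + c, faulty output = c); interpreting the fault as the output of the AND gate being stuck at 0 is an assumption, as are the function names.

```python
from itertools import product


def good(a, b, c):
    """Good circuit, as inferred from the truth table: (a AND b) OR c."""
    return (a & b) | c


def faulty(a, b, c):
    """Faulty circuit: assumed fault forces the AND-gate output to 0."""
    return 0 | c


def detecting_vectors(good_fn, faulty_fn, num_inputs):
    """Return every input vector on which the two circuits disagree."""
    return [
        vector
        for vector in product((0, 1), repeat=num_inputs)
        if good_fn(*vector) != faulty_fn(*vector)
    ]


vectors = detecting_vectors(good, faulty, 3)
print(vectors)  # [(1, 1, 0)]
```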
7.2.4.2 A Single Test-Vector Can Detect Several Faults

[Figure: Another fault, in the circuit with inputs a, b, c and internal signals d, e]

a  b  c  good  faulty
1  1  0  1     0

The test vector 110 can catch both this fault and the previous one.
7.2.5
Mathematical Models of Faults (Smith 14.3.4)
26
7.2.5 Mathematical Models of Faults (Smith 14.3.4)

Goal: develop reliable and predictable technique for detecting faults in circuits. Problems:
Need: a mathematical model for faults that is abstracted from complexities of circuit layout and plethora of possible faults, yet still detects most or all possible faults.
The possible faults in a circuit are dependent upon the physical layout of the circuit. A very wide variety of possible faults A single test vector can catch many different faults
7.2.5.1 Single Stuck-At Fault Model

Two simplifying assumptions:
1. A maximum of one fault per tested circuit
2. All faults are either:
   (a) stuck-at 1: short to VDD
   (b) stuck-at 0: short to GND
Example of Stuck-At Faults

[Figure: circuit with inputs a, b, c, d and fault locations L1 to L12]
If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider:
12 fault locations × 2 types of faults = 24 possible faults.
Problems with Multiple Faults

[Figure: circuit with inputs a, b, c, d; each of the locations L1 to L12 may be stuck at 0 or 1]
If we allowed multiple faults, then we could have up to 12 different faults in the same circuit. How many faulty circuits would need to be considered? Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore, $3^{12} \approx 5.3 \times 10^5$ different circuits would need to be considered! If we allowed multiple faults of 4 different types at 12 different locations, then we would have $5^{12} \approx 2.4 \times 10^8$ different faulty circuits to consider!
LEC-21:
7.2.5
30
Faults and Possible Circuits

There are $2^{2^4} \approx 6.6 \times 10^4$ different Boolean functions of four inputs. Thus, there are $6.6 \times 10^4$ possible equations for circuits with four inputs and one output. This is much less than the number of faulty circuit models that would be generated by the simultaneous-faults-at-every-location models. So both of the simultaneous-faults-at-every-location models are too extreme.
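The counts quoted for the different fault models can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the text; the variable names are illustrative.

```python
# Count the circuit variants under each fault model for 12 fault locations.
locations = 12

single_stuck_at = locations * 2        # one fault: stuck-at-0 or stuck-at-1
multi_stuck_at = 3 ** locations        # each location: good, @0, or @1
multi_four_types = 5 ** locations      # each location: good or one of 4 fault types
functions_of_4_inputs = 2 ** (2 ** 4)  # one output bit per row of a 16-row truth table

print(single_stuck_at)        # 24
print(multi_stuck_at)         # 531441
print(multi_four_types)       # 244140625
print(functions_of_4_inputs)  # 65536
```

Both multiple-fault counts exceed the 65536 distinct four-input functions, which is why those models are considered too extreme.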
7.2.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
7.2.6.1 Algorithm
1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in region of disagreement is a test vector that will detect the fault
5. any assignment outside of region of disagreement will result in the same output on both correct and faulty circuit
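The algorithm above can be sketched in software by comparing truth tables instead of Karnaugh maps, since the region of disagreement is simply the set of input assignments where the two outputs differ. This is a minimal sketch: the circuit z = ab + bc and the chosen fault (one b input stuck at 1, giving z = ab + c) are assumptions picked for illustration, and the function names are hypothetical.

```python
from itertools import product

def good(a, b, c):
    # Assumed fault-free circuit: z = ab + bc
    return (a & b) | (b & c)

def faulty(a, b, c):
    # Assumed faulty circuit: b input of the second AND gate stuck at 1, z = ab + c
    return (a & b) | c

def find_test_vectors(good_fn, faulty_fn, n_inputs):
    """Return every input assignment on which the two circuits disagree."""
    vectors = []
    for bits in product((0, 1), repeat=n_inputs):
        if good_fn(*bits) != faulty_fn(*bits):
            vectors.append("".join(str(x) for x in bits))
    return vectors

vectors = find_test_vectors(good, faulty, 3)
print(vectors)  # ['001', '101']
```

Any vector in the returned list detects the fault; any vector outside it produces the same output on both circuits.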
7.2.6.2 Example of Finding a Test Vector

[Figure: good circuit, faulty circuit, and Karnaugh maps showing the difference between the good and faulty circuits]
7.2.7 Undetectable Faults
Not all faults are detectable.
1. If a circuit is irredundant then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If we are not trying to find all of the faults in a circuit, then a fault that we aren't looking for can mask a fault that we are looking for.
7.2.7.1 Redundant Circuitry
Timing Hazards
  Static hazard
  Dynamic hazard
Timing hazards are often removed by adding redundant circuitry.
Redundant Circuitry
[Figure: irredundant circuit with inputs a, b, c and internal signals d, e, f, g]
[Figure: illustration of timing hazard, with a and c held at 1 while b changes from 1 to 0]
Glitch on g is caused because the AND gate for e turns off before f turns on.
Redundant Circuitry
In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map.
[Figure: Karnaugh map of the irredundant circuit]
We can prevent this transition from causing a glitch by adding a cube that covers the two squares of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.
[Figure: Karnaugh map with the added cube]
Redundant Circuitry
[Figure: redundant circuit with the added AND gate h and fault location L1 on its output]
Redundant circuit: no more timing hazards.
Redundant Circuitry
L1@0 is undetectable.
Correct circuit: $ab + \bar{b}c + ac$
Faulty circuit: with L1@0, $ac = 0$, so the equation becomes $ab + \bar{b}c + 0 = ab + \bar{b}c$
This is the same function as the correct circuit.
7.2.7.2 Curious Redundant Circuitry and Fault Detection

The two circuits below have the same steady-state behaviour.
[Figure: two circuits with output z; the redundant version adds input b and two extra XOR gates, with fault locations L1, L2, L3]
So, the signal b and the two extra XOR gates are redundant.
Detectable Faults in Redundant Circuitry

In the redundant circuit, a stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.
[Figure: redundant circuit with fault locations L1, L2, L3, and a table giving the equation, K-map, and difference from the correct circuit for faults L2@0 and L2@1]
The lesson is that not all faults in redundant circuitry are undetectable.
LEC-22: Fault Detection and Test-Vector Generation

Schedule
wk-13

The purpose of this lecture is to demonstrate the fundamental techniques for detecting faults in circuits. Subsequent lectures will build on these fundamentals to show applications of these techniques.
Topics: redundant circuitry addendum, fault domination, fault equivalence, fault collapsing (gate collapsing, node collapsing, intelligent collapsing), fault coverage, test vector generation, required test vectors, order to run test vectors, fault hiding
7.3 Faults
7.3.1 Locations of Faults
Throughout this lecture we'll be using the circuit below:
[Figure: circuit with inputs a, b, c implementing ab + bc, with fault locations L2, L4, L5 marked, and its Karnaugh map]
At first, we will consider only the following faults: L2@1, L4@1, L5@1.
Simple Analysis of L2@1, L4@1, L5@1

[Figure: circuit with fault locations L2, L4, L5; K-map and difference-map columns of the table are not reproduced]
      fault   eqn       test vectors
1)    L2@1    a + c     101, 001, 100
2)    L4@1    a + bc    101, 100
3)    L5@1    ab + c    101, 001
Choose Test Vector

If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of the three faults.
7.3.2 Choosing Test Vectors (Smith 14.3.7)
The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.
7.3.2.1 Fault Domination

      fault   eqn       test vectors
1)    L5@1    ab + c    101, 001
2)    L6@1    1         101, 001, 100, 010, 000
Any test vector that detects L5@1 will also detect L6@1.
Definition: f1 dominates f2 if any test vector that detects f1 will also detect f2.
L5@1 dominates L6@1. When choosing test vectors we can ignore L6@1 and just include L5@1.
Question: What would happen if we ignored L5@1 and just included L6@1?
Answer: If we chose 100, 010, or 000 as our test vector to detect L6@1, then we would not detect L5@1.
7.3.2.2 Fault Equivalence

      fault   eqn
1)    L1@1    b
2)    L3@1    b
The two faults above are equivalent.
Definition: f1 is equivalent to f2 if f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.
When choosing test vectors we can ignore one of the faults and just include the other.
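Domination and equivalence can both be checked by comparing the sets of test vectors that detect each fault. The sketch below models the lecture's circuit z = ab + bc with a single injected stuck-at fault. The mapping of location names to nets (L1 = a, L2 = b stem, L3 = c, L4 and L5 = the two b branches, L6 and L7 = AND outputs, L8 = z) is inferred from the fault equations in these notes rather than stated explicitly, so treat it as an assumption.

```python
from itertools import product

def circuit(a, b, c, fault=None):
    """Evaluate z = ab + bc with an optional stuck-at fault (location, value)."""
    def net(name, value):
        # Override the net's value if this is the faulty location.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    l1 = net("L1", a)
    l2 = net("L2", b)
    l3 = net("L3", c)
    l4 = net("L4", l2)        # b branch into the first AND gate
    l5 = net("L5", l2)        # b branch into the second AND gate
    l6 = net("L6", l1 & l4)
    l7 = net("L7", l5 & l3)
    return net("L8", l6 | l7)

def detecting_vectors(fault):
    """Set of input vectors (as 'abc' strings) whose output differs under the fault."""
    return {"".join(map(str, v)) for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, fault=fault)}

def dominates(f1, f2):
    # Definition used in these notes: every vector that detects f1 also detects f2.
    return detecting_vectors(f1) <= detecting_vectors(f2)

def equivalent(f1, f2):
    return detecting_vectors(f1) == detecting_vectors(f2)

print(sorted(detecting_vectors(("L5", 1))))  # ['001', '101']
print(dominates(("L5", 1), ("L6", 1)))       # True
print(equivalent(("L1", 1), ("L3", 1)))      # True
```

Note that domination is not symmetric: `dominates(("L6", 1), ("L5", 1))` is false, which is the point of the question and answer above.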
7.3.2.3 Gate Collapsing

A 1 on the input to an OR gate will force the output to be 1. A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate. By looking at the functionality of a gate, we can find equivalent faults.
Definition: Gate collapsing is the technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.
Sets of collapsible faults for common gates:
  AND: stuck-at-0 on any input and stuck-at-0 on the output
  OR: stuck-at-1 on any input and stuck-at-1 on the output
Question: What is the set of collapsible faults for a NAND gate?
7.3.2.4 Node Collapsing

When two segments affect the same set of gates (ignoring any gates between the two segments), then faults on the two segments can be collapsed. With an inverter or buffer, the segment on the input affects the same gates as the output. Therefore, faults on the input and output segments are equivalent.
[Figure: sets of collapsible faults for nodes, shown for an inverter with its input stuck at 1 and stuck at 0]
With the net-fault model, which is the one we are using in E&CE 427, inverters and buffers are the only gates where node collapsing is relevant. With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.
7.3.2.5 Fault Collapsing Summary

When calculating the test vectors to detect a set of faults, apply the fault collapsing techniques of:
  gate collapsing
  node collapsing
  general fault equivalence (intelligent collapsing)
  fault domination
to reduce the number of faults that you must examine.
7.3.3 Fault Coverage
Definition: Fault coverage is the percentage of detectable faults that are detected by a set of test vectors.
$$\text{FaultCoverage} = \frac{\text{DetectedFaults}}{\text{DetectableFaults}}$$
NOTE: In Smith's book, undetectable faults don't hurt your coverage. This is not universally true. Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.
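As a small illustration of the definition (with detectable faults as the denominator), the sketch below computes coverage from two sets of fault labels. The labels `f1` to `f8` are hypothetical and do not refer to any circuit in these notes.

```python
def fault_coverage(detected, detectable):
    """Return the percentage of detectable faults that were detected."""
    detectable = set(detectable)
    if not detectable:
        raise ValueError("fault coverage is undefined with no detectable faults")
    # Only faults that are actually detectable count toward the numerator.
    return 100.0 * len(set(detected) & detectable) / len(detectable)

# Hypothetical fault labels, for illustration only.
detectable = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"}
detected = {"f1", "f2", "f3", "f4", "f5", "f6"}
coverage = fault_coverage(detected, detectable)
print(coverage)  # 75.0
```

Under the alternative definition mentioned in the note, the second argument would be the set of all possible faults instead.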
7.3.4 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day. We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient. The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research. A trendy idea is to use Genetic Algorithms (inspired by how DNA works) to generate test vectors that catch the maximum number of faults. The classic algorithm is the D algorithm invented by Roth in 1966 (Smith 14.5.1, 14.5.2). An enhanced version is PODEM (Path-Oriented Decision Making), which supports reconvergent fanout and was developed by Goel in 1981 (Smith 14.5.3).
[Figure: Example Circuit with Fault Locations (L1 to L8) and Karnaugh Map]
7.3.4.1 Collapse the Faults
Potential Fault Locations
Initial circuit with potential faults:
[Figure: circuit with stuck-at-0 and stuck-at-1 faults at every location, L1@0,1 to L8@0,1]
Gate collapsing
[Figure: gate collapsing applied to each of the three gates]
Sets of equivalent faults found by gate collapsing:
  L1@0, L4@0, L6@0
  L3@0, L5@0, L7@0
  L6@1, L7@1, L8@1
Node Collapsing
Node collapsing: none applicable (no inverters or buffers).
Remaining faults: L1@1, L2@0,1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0,1
Intelligent Collapsing
[Figure: circuits for the two pairs of faults collapsed by inspection]
L2@0, L8@0: both L2@0 and L8@0 result in the equation 0.
L1@1, L3@1: both L1@1 and L3@1 result in the equation b.
Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0,1
7.3.4.2 Check for Fault Domination

      fault   eqn       domination
1)    L2@1    a + c     dominated by L4@1, L5@1
2)    L3@1    b
3)    L4@1    a + bc
4)    L5@1    ab + c
5)    L6@0    bc
6)    L7@0    ab
7)    L8@0    0         dominated by L6@0, L7@0
8)    L8@1    1         dominated by L2@1, L3@1, L4@1, L5@1
Remove dominated faults
Dominated faults: (L2@1, L8@0, L8@1).
Remaining Faults
      fault   eqn
1)    L3@1    b
2)    L4@1    a + bc
3)    L5@1    ab + c
4)    L6@0    bc
5)    L7@0    ab
[Figure: circuit with the remaining faults marked: L3@1, L4@1, L5@1, L6@0, L7@0]
7.3.4.3 Required Test Vectors

If we have any faults that are detected by just one test vector, then we must include that test vector in our suite.
Definition: A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.
Required vectors:
      fault   vector
      L3@1    010
      L6@0    110
      L7@0    011
7.3.4.4 Faults Not Covered by Required Test Vectors

      fault   eqn       test vectors
1)    L4@1    a + bc    101, 100
2)    L5@1    ab + c    101, 001
The intersection of the two difference regions is 101. Choosing 101 detects both L4@1 and L5@1. Add 101 to the suite of test vectors.
Final set of test vectors is: 010, 110, 011, 101.
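The selection of required vectors and the final covering vector can be sketched directly from the detection sets of the five remaining faults. The dictionary below restates the test vectors listed in these notes; the shortcut of intersecting the leftover sets only works when a single vector happens to cover every leftover fault, as it does here, so it is not a general covering algorithm.

```python
# Test vectors (as 'abc' strings) that detect each remaining fault.
detects = {
    "L3@1": {"010"},
    "L4@1": {"100", "101"},
    "L5@1": {"001", "101"},
    "L6@0": {"110"},
    "L7@0": {"011"},
}

# A vector is required if it is the only vector that detects some fault.
required = sorted({next(iter(vs)) for vs in detects.values() if len(vs) == 1})

# Faults that none of the required vectors detect.
uncovered = {f: vs for f, vs in detects.items() if not vs & set(required)}

# Vectors that detect every uncovered fault at once.
common = set.intersection(*uncovered.values()) if uncovered else set()

suite = required + sorted(common)[:1]
print(required)           # ['010', '011', '110']
print(sorted(uncovered))  # ['L4@1', 'L5@1']
print(suite)              # ['010', '011', '110', '101']
```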
7.3.4.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults. Build a table showing which faults each test vector will detect.

                        Test Vector
        fault      110   010   011   101
 1)     L1@0        1
 2)     L1@1              1
 3)     L2@0        1           1
 4)     L2@1                          1
 5)     L3@0                    1
 6)     L3@1              1
 7)     L4@0        1
 8)     L4@1                          1
 9)     L5@0                    1
10)     L5@1                          1
11)     L6@0        1
12)     L6@1              1           1
13)     L7@0                    1
14)     L7@1              1           1
15)     L8@0        1           1
16)     L8@1              1           1
Faults detected     5     5     5     6
101 detects the most faults, so we should run it first. This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101). This leaves 110 and 011 with 5 faults each; we can run them in either order, then run 010. We settle on a final order for our test suite of: 101, 011, 110, 010.
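The ordering step is a greedy choice: repeatedly pick the vector that detects the most faults not already caught by earlier vectors. The sketch below reuses a model of z = ab + bc with one injected stuck-at fault; the assignment of location names to nets is inferred from the fault equations in these notes and should be read as an assumption. Ties are broken by list position, so the two middle vectors may come out in either order.

```python
def circuit(a, b, c, fault=None):
    """Evaluate z = ab + bc with an optional stuck-at fault (location, value)."""
    def net(name, value):
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)   # the two branches of b
    l6, l7 = net("L6", l1 & l4), net("L7", l5 & l3)
    return net("L8", l6 | l7)

# All 16 single stuck-at faults: L1..L8, each stuck at 0 or 1.
faults = [(f"L{i}", v) for i in range(1, 9) for v in (0, 1)]

def detected_by(vector):
    bits = [int(ch) for ch in vector]
    return {f for f in faults if circuit(*bits) != circuit(*bits, fault=f)}

def greedy_order(vectors):
    """Order vectors so each one adds the most not-yet-detected faults."""
    remaining, seen, order = list(vectors), set(), []
    while remaining:
        best = max(remaining, key=lambda v: len(detected_by(v) - seen))
        order.append(best)
        seen |= detected_by(best)
        remaining.remove(best)
    return order

order = greedy_order(["110", "010", "011", "101"])
print(order)  # ['101', '110', '011', '010']
```

The printed order differs from the one chosen in the text only in the tie between 110 and 011, which the text notes can be run in either order.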
7.3.4.6 Summary of Technique to Find and Order Test Vectors

1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
7.3.4.7 Complete Analysis

In case you don't trust the fault collapsing analysis, here's the complete analysis.
        fault   eqn       notes
 1)     L1@0    bc
 2)     L1@1    b
 3)     L2@0    0         dominated by 1, 5
 4)     L2@1    a + c     dominated by 8, 10
 5)     L3@0    ab
 6)     L3@1    b         same as 2
 7)     L4@0    bc        same as 1
 8)     L4@1    a + bc
 9)     L5@0    ab        same as 5
10)     L5@1    ab + c
11)     L6@0    bc        same as 1
12)     L6@1    1         dominated by 8, 10
13)     L7@0    ab        same as 5
14)     L7@1    1         same as 12
15)     L8@0    0         same as 3
16)     L8@1    1         same as 12
7.3.5 One Fault Hiding Another
[Figure: example circuit with fault locations L1 to L8]
Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.
[Figure: circuit with fault locations L1 and L3 marked]
Fault Hiding
Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0. In the presence of other faults, the set of test vectors to detect a fault will change.
      fault(s)        eqn
      L3@0            ab
      L1@1, L3@0      b
[K-maps and difference maps for the two cases are not reproduced]
LEC-23: Built In Self Test

Change Log
Schedule
wk-13

The purpose of this lecture is to connect the theory of testing and testability to the technique of built in self test (BIST). We'll also relate gate-level circuit behaviour to Galois theory, which is a field of mathematics used in information theory (encryption, compression, etc). A meta-level lesson here is that advanced mathematical concepts can sometimes be used to invent new types of circuits, or to better understand existing circuits. Finally, we see that theories created long before the advent of computers are often applied in computing theory.
Topics: linear feedback shift register (LFSR), built-in self test (BIST), characteristic polynomials, Galois fields
7.4 Built In Self Test (Smith 14.7)
7.4.1 Block Diagram
Generic Testing Circuit
[Figure: a test generator, selected by mode, drives inputs i_data(0) to i_data(3) of the circuit under test in place of d(0) to d(3); a result checker examines o_data(0) to o_data(2) and produces all_ok]
Circuit in Normal Mode
[Figure]
Circuit in Test Mode
[Figure]
Circuit with BIST
[Figure: an LFSR test generator drives the circuit under test; signature analyzers 0 to 2 examine o_data(0) to o_data(2) and produce ok(0) to ok(2); the result checker produces all_ok]
BIST in Normal Mode
[Figure]
BIST in Test Mode
[Figure]
7.4.1.1 Components
There is one test generator per group of inputs (or internal flops) that drive the same circuit to be tested. There is one signature analyzer per output (or internal flop).
NOTE: MISR. An exception to the above rule is a multiple input signature register (MISR), which can be used to analyze several outputs of the circuit under test. (Smith 14.7.7)
The test generator and signature analyzer are both built with linear-feedback shift registers.
Test generator
  generates a pseudo-random set of test vectors
  for n output bits, generates all vectors from 1 to $2^n - 1$ in a pseudo-random order
  built with a linear-feedback shift register (shift-register portion is the input flops)
Signature analyzer
  checks that the output it is examining has the correct results for the complete set of tests that are run
  only has a meaningful result at the end of the entire test sequence
  built with a linear-feedback shift register
  similar to a hash function or a lossy compression function
  if there are no faults, the signature analyzer will definitely say ok (no false negatives)
  if there is a fault, the signature analyzer might say ok or might say bad (false positives are possible)
  design tradeoff: more accurate signature analyzers require more hardware
Result checker
  signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
  the result checker looks at test vector inputs to detect the end of the test suite and outputs all_ok if all signature analyzers report ok at that moment
  implemented as an AND gate
7.4.1.2 Linear Feedback Shift Register (LFSR)
Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.
Design parameters:
  number of flip-flops
  external or internal XOR
  feedback taps (coefficients)
  external-input or self-contained
  reset or set
LFSR Examples
[Figure: three-flop LFSR, external-XOR, input, reset]
[Figure: three-flop LFSR, external-XOR, no input, set]
[Figure: three-flop LFSR, internal-XOR, input, set]
[Figure: three-flop LFSR, internal-XOR, input, reset]
LFSRs in E&CE 427

In E&CE 427, we'll use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields. External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated as Galois fields.
7.4.1.3 Maximal-Length LFSR

Definition (maximal-length linear feedback shift register): An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.
Definition (pseudo random): The same elements in the same order every time, but the relationship between consecutive elements is apparently random.
Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.
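Whether a given LFSR is maximal-length can be checked by stepping it from the all-1s state and counting the distinct states visited before it repeats. The sketch below models a self-contained internal-XOR LFSR; the update rule (the last flop feeds back into flop 0 and is XORed into the flops listed in `taps`) and the tap numbering are assumptions made for illustration, since the exact wiring of the figures is not recoverable from the text.

```python
def lfsr_states(n_flops, taps):
    """Return the states visited by an internal-XOR LFSR started at all 1s.

    State element k is flop q_k. On each clock, q0 takes the value of the
    last flop, and q_k takes q_(k-1), XORed with the last flop if k is in taps.
    """
    state = [1] * n_flops
    start = tuple(state)
    visited = []
    while True:
        visited.append(tuple(state))
        fb = state[-1]
        state = [fb] + [state[k - 1] ^ (fb if k in taps else 0)
                        for k in range(1, n_flops)]
        if tuple(state) == start:
            return visited

for taps in ({1}, {2}, {1, 2}):
    states = lfsr_states(3, taps)
    print(sorted(taps), len(states))
# [1] 7
# [2] 7
# [1, 2] 4
```

With three flops, a single tap at either position visits all 7 non-zero states, while tapping both positions gives a cycle of only 4 states, so the choice of taps matters.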
Maximal-Length LFSR Circuits
The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.
[Figure: maximal-length internal-XOR LFSR]
[Figure: maximal-length internal-XOR LFSR]
Maximal Length LFSR Characteristics
Maximal-length LFSRs:
  set to all 1s initially
  self contained (no external i input)
Maximal-Length LFSR Timing Diagram
[Figure: timing diagram for a 3-flop maximal-length LFSR, showing reset, clk, d0, q0, d1, q1, q2, and val over clock cycles 1 to 8; val takes the values 7, 6, 4, 1, 2, 5, 3, 7, 6]
7.4.2 Test Generator
The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.
[Figure: a maximal-length internal-XOR LFSR]
[Figure: a test generator: maximal-length internal-XOR LFSR with muxes on data inputs, selected by mode, with external inputs i_d(0) to i_d(2)]
[Figure: a test generator, reset not shown]
7.4.3 Signature Analyzer
There are four things that change between different signature analyzers:
  number of flops (more flops: more area, more accuracy)
  choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
  bubbles on input to AND gate for ok: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer
  Vector
[Figure: two-flop signature analyzer with input i, reset, flops d0/q0 and d1/q1, and an AND gate producing ok]
This circuit:
  Two flops. Most analyzers use more; the HP boards in the 1970s used 37 flops!
  Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
  Also contains ok tester (AND gate).
  Expected output of LFSR at end of test sequence is: q0=0 and q1=1, or "01". (We know this because of the bubble on the AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)
Signature Analyzer Timing
[Figure: timing diagram with reset and clk; the signal values in successive clock cycles are tabulated below]
  i     i6   i5   i4     i3     i2     i1     i0      -
  d0    i6   i5   i4i6   356    245    1346   02356   -
  q0    0    i6   i5     i4i6   356    245    1346    02356
  d1    0    i6   i5i6   i4i5   346    2356   1245    -
  q1    0    0    i6     i5i6   i4i5   346    2356    1245
Adjacent terms are XORed together, and the longer entries drop the letter i: 356 = i3 ⊕ i5 ⊕ i6, 2356 = i2 ⊕ i3 ⊕ i5 ⊕ i6, etc.
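The entries in the timing diagram can be reproduced with a short symbolic simulation, where each flop holds the set of input indices currently XORed together. The wiring used here (d0 = i xor q1, d1 = q0 xor q1) is inferred from the timing entries rather than stated in the notes, so it is an assumption.

```python
def simulate_signature(n_bits=7):
    """Symbolically shift inputs i6..i0 through the two-flop analyzer.

    Each flop value is a frozenset of input indices; XOR of two values is
    the symmetric difference of the sets. Both flops start at 0 (empty set).
    """
    q0, q1 = frozenset(), frozenset()
    history = [(q0, q1)]
    for k in reversed(range(n_bits)):      # i6 arrives first, i0 last
        i = frozenset({k})
        q0, q1 = i ^ q1, q0 ^ q1           # d0 = i xor q1, d1 = q0 xor q1
        history.append((q0, q1))
    return history

def label(term):
    """Render a set of indices in the notes' shorthand, e.g. {3, 5, 6} -> '356'."""
    return "".join(str(k) for k in sorted(term)) or "0"

history = simulate_signature()
final_q0, final_q1 = history[-1]
print(label(final_q0), label(final_q1))  # 02356 1245
```

The final values match the last column of the diagram: q0 holds i0, i2, i3, i5, i6 XORed together and q1 holds i1, i2, i4, i5.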
7.4.4 Result Checker
The purpose of the result checker is to check the ok signal at the end of the test sequence. To do this, we need to recognize the end of the test sequence. The simplest way to do this is to notice that the first test vector is all 1s and that the test vector sequence will repeat as long as the circuit is in test mode. We want to sample the ok signal one clock cycle after the sequence is over. This is the same as the first clock cycle of the second test sequence. In this clock cycle, the output of the test generator will be all 1s and reset will be 0. We need to look at reset, because otherwise we could not distinguish the first sequence (when reset is 1) from the subsequent sequences.
[Figure: result checker combining reset, q0, q1, q2, and ok to produce all_ok]
7.4.5 Arithmetic over Binary Fields
Galois fields! Two operations: × and +. Two values: 0 and 1.
Addition
+ represents XOR
  expression    result
  0 + 0         0
  0 + 1         1
  1 + 0         1
  1 + 1         0
  x + x         0
Multiplication
× represents concatenating shift registers
  expression    result
  x^4 × 1       x^4
  x^2 × x^3     x^5
Example
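Calculate $(x^3 + x^2 + 1) \times (x^2 + x)$:
$$
\begin{aligned}
(x^3 + x^2 + 1) \times (x^2 + x) &= (x^5 + x^4 + x^2) + (x^4 + x^3 + x) \\
&= x^5 + x^3 + x^2 + x
\end{aligned}
$$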
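Multiplication over the binary field can be sketched with integers used as coefficient bitmasks (bit k set means the term x^k is present), where addition is XOR and multiplying by x is a left shift. The operands below are the ones reconstructed for the example, (x^3 + x^2 + 1) and (x^2 + x); the function names are illustrative.

```python
def gf2_mul(a, b):
    """Multiply two polynomials over GF(2), each given as a bitmask."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current shifted copy of a
        a <<= 1              # multiply a by x
        b >>= 1
    return result

def poly_str(p):
    """Render a bitmask as a polynomial, e.g. 0b1101 -> 'x^3 + x^2 + 1'."""
    terms = []
    for k in range(p.bit_length() - 1, -1, -1):
        if (p >> k) & 1:
            terms.append("1" if k == 0 else "x" if k == 1 else f"x^{k}")
    return " + ".join(terms) or "0"

product = gf2_mul(0b1101, 0b110)   # (x^3 + x^2 + 1) * (x^2 + x)
print(poly_str(product))           # x^5 + x^3 + x^2 + x
```

The two x^4 terms from the partial products cancel because 1 + 1 = 0 in this arithmetic.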
7.4.6 Shift Registers and Characteristic Polynomials (Smith 14.7.5)

Given a linear feedback shift register with l flops, the feedback register can be represented as a polynomial $p(x)$ with maximum exponent $x^l$. The polynomial represents the behaviour of the output of the last flip-flop. The exponent on the variable x represents the number of clock cycles of delay.
From polynomials to hardware:
  The maximum exponent denotes the number of flops.
  The other exponents denote the flops that tap off of the feedback line from the last flop.
[Figures: six example LFSRs, each labelled with its characteristic polynomial p(x)]
See Smith's Fig 14.27 (p. 771), 14.28 (p. 773), and Table 14.11 (p. 774).
7.4.6.1 Circuit Multiplication

Redoing the multiplication example as circuits:
$$(x^2 + x) \times (x^3 + x^2 + 1) = x \times (x^3 + x^2 + 1) + x^2 \times (x^3 + x^2 + 1) = x^5 + x^3 + x^2 + x$$
[Figure: LFSR circuits for the operands, the partial products, and the result]
The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent in the polynomial. Hence, the MSB of the first partial product cancels the $x^4$ of the second partial product, resulting in a coefficient of 0 for $x^4$ in the answer.
7.4.7 Bit Streams and Characteristic Polynomials

A bit stream, or bit sequence, can be represented as a polynomial. The oldest (first) bit in a sequence of n bits is represented by $x^{n-1}$ and the youngest (last) bit is $x^0$.
The bit sequence 1010011 can be represented as $x^6 + x^4 + x + 1$:
  bit     1      0      1      0      0      1      1
  term    1x^6   0x^5   1x^4   0x^3   0x^2   1x^1   1x^0
7.4.8 Division
With rules for multiplication and addition, we can define division. A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of $m \div p$ iff:
$$m(x) = q(x) \times p(x) + r(x)$$
Long Division
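In Galois fields, we do division just as with long division in elementary school.
Given: $m(x) = x^6 + x^4 + x^3$ and $p(x) = x^4 + x$.
Calculate the quotient $q(x)$ and remainder $r(x)$ for $m(x) \div p(x)$:
$$
\begin{aligned}
m(x) &= 1x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0 \\
m(x) + x^2 \times p(x) &= (x^6 + x^4 + x^3) + (x^6 + x^3) = x^4 \\
x^4 + 1 \times p(x) &= x^4 + (x^4 + x) = x
\end{aligned}
$$
Quotient: $q(x) = x^2 + 1$
Remainder: $r(x) = x$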
Long Division (Check)

Check result:
$$q(x) \times p(x) + r(x) = (x^2 + 1) \times (x^4 + x) + x = (x^6 + x^4 + x^3 + x) + x = x^6 + x^4 + x^3 = m(x)$$
The mathematics for an LFSR without an input i:
  same polynomial as if the circuit had an input
  input sequence is all 0s
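The long division can be sketched with integers as coefficient bitmasks (bit k set means x^k is present), where subtraction is the same as addition, namely XOR. The operands below are the ones from the example, m(x) = x^6 + x^4 + x^3 and p(x) = x^4 + x; the function names are illustrative.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials given as bitmasks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def gf2_divmod(m, p):
    """Return (quotient, remainder) of m(x) / p(x) over GF(2)."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    q = 0
    deg_p = p.bit_length() - 1
    while m.bit_length() - 1 >= deg_p:
        shift = m.bit_length() - 1 - deg_p
        q |= 1 << shift          # record x^shift in the quotient
        m ^= p << shift          # subtract (XOR) the shifted divisor
    return q, m

m = 0b1011000   # x^6 + x^4 + x^3
p = 0b10010     # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))            # 0b101 0b10
print(gf2_mul(q, p) ^ r == m)    # True
```

The result 0b101 is x^2 + 1 and 0b10 is x, and the last line repeats the check that q(x) × p(x) + r(x) gives back m(x).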
7.4.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a message, $m(x)$, which is a sequence of n bits represented as a polynomial. After n shifts through an LFSR with l flops:
$$m(x) = q(x) \times p(x) + r(x)$$
  The sequence of output bits forms a quotient, $q(x)$, of length n.
  The flops in the analyzer form a remainder, $r(x)$, of length l.
  The remainder is the signature.
Input Streams and Error Polynomials
An input stream with an error can be represented as $m(x) + e(x)$:
  $e(x)$ is the error polynomial
  bits in the message that are flipped have a coefficient of 1 in $e(x)$
$$m(x) + e(x) = q'(x) \times p(x) + r'(x)$$
Input Streams and Error Polynomials

The error $e(x)$ will be detected if it results in a different signature (remainder).
$m(x)$ and $m(x) + e(x)$ will have the same remainder iff $e(x) \bmod p(x) = 0$.
That is, $e(x)$ must be a multiple of $p(x)$. The larger $p(x)$ is, the smaller the chances that $e(x)$ will be a multiple of $p(x)$.
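The condition for a missed error can be illustrated by computing signatures as remainders of polynomial division, with integers as coefficient bitmasks. The message and characteristic polynomial are the ones from the long-division example; the two error patterns are hypothetical, chosen so that one is not a multiple of p(x) and the other is.

```python
def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over GF(2), using bitmasks."""
    deg_p = p.bit_length() - 1
    while m.bit_length() - 1 >= deg_p:
        m ^= p << (m.bit_length() - 1 - deg_p)
    return m

p = 0b10010            # p(x) = x^4 + x
m = 0b1011000          # m(x) = x^6 + x^4 + x^3
signature = gf2_mod(m, p)

e_detected = 0b100     # hypothetical error: one flipped bit, e(x) = x^2
e_missed = p << 1      # hypothetical error: e(x) = x * p(x), a multiple of p(x)

sig_detected = gf2_mod(m ^ e_detected, p)
sig_missed = gf2_mod(m ^ e_missed, p)

print(bin(signature))             # 0b10
print(sig_detected != signature)  # True: the error changes the signature
print(sig_missed == signature)    # True: the error is masked
```

Adding an error to the message is an XOR of the bitmasks, so the signature changes exactly when the error's own remainder is non-zero.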
7.4.10 Summary
Adding Test Circuitry

1. Pick number of flops for generator
2. Build generator (maximal-length linear feedback shift register)
3. Pick number of flops for signature analysis
4. Pick coefficients (feedback taps) for analyzer
5. Based on generator, circuit under test, and signature analyzer, determine expected output of analyzer
6. Based on expected output of analyzer, build result checker
Running Test Vectors

1. Put circuit in test mode
2. Set reset = 1
3. Run one clock cycle, set reset = 0
4. Run one clock cycle for each test vector
5. At end of test sequence, the all_ok signal should be 1
6. To run n test vectors requires n + 1 clock cycles.
LEC-24: Scan Testing (JTAG)

Schedule
  wk-01 to 02   VHDL
  wk-03 to 05   Design and Optimization
  wk-06         Functional Validation
  wk-07         Performance Analysis, Prediction, and Optimization
  wk-08 to 09   Timing Analysis
  wk-09 to 10   Power Analysis and Reduction
  wk-11 to 12   Faults and Testing
                  Lec-21 Introduction
                  Lec-22 Detection and Test Generation
                  Lec-23 Built-In Self Test (BIST)
                  Lec-24 Scan Testing (JTAG)
  wk-13         Review

The purpose of this lecture is to connect the theory of testing and testability to the current techniques of scan testing and the IEEE Standard 1149.1 (aka JTAG).
Topics: scan testing, scan chain, testing procedure, time to run a test, boundary scan testing, JTAG, IEEE 1149, length of time to do a scan test, hardware to do scan testing
7.5 Scan Testing in General (Smith 14.6)
7.5.1 Structure and Behaviour of Scan Testing
[Figure: Normal Circuit. Another circuit #0 drives data_in(0) to data_in(3) of the circuit under test, whose outputs zeta_in(0) to zeta_in(3) feed another circuit #1 and yet another circuit]
[Figure: Circuit with Scan Chains Added. Scan chain 0 (mode0, scan_in0, scan_out0) sits on the inputs of the circuit under test and scan chain 1 (mode1, scan_in1, scan_out1) on its outputs]
7.5.2 Scan Chains
[Figure: scan chains around the circuit under test, with data_in(0) to data_in(3), zeta_in(0) to zeta_in(3), mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1]
7.5.2.1 Circuitry in Normal Mode
[Figure: Normal Mode]
[Figure: Scan Mode]
7.5.2.2 Scan in Operation
[Figure: Circuit under test with scan chains]
From Test Vector to Results
[Figure: timing diagram of clk, mode0, scan_out0, scan_in0, scan_out1, scan_in1, showing current vector0 shifted in and current results1 shifted out]
Load Test Vector
[Figure: current vector0 shifted into scan chain 0 through scan_in0]
Run Test Vector Through Circuit
[Figure]
Unload Test Vector
[Figure: current results1 shifted out through scan_out1]
Unload Prev and Load Current
Optimization: unload and load at the same time.
[Figure: current vector0 and current vector1 shifted in while previous results0 and previous results1 are shifted out]
Run Tests
[Figure]
Unload Current and Load Next
[Figure: next test vector0 and next test vector1 shifted in]
Behaviour of Scan Testing
[Figure: timing diagram of clk, mode0, scan_out0, scan_in0, scan_out1, scan_in1. Previous results are unloaded while the current vectors are loaded; then current results are unloaded while the next test vectors are loaded]
7.5.2.3 Scan in Operation with Example Circuit
[Figure: circuit under test, with signals a, b, c, d, y, z]
[Figure: circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1)]
The remaining figures step through a scan test on this circuit, each showing clk and mode0:
  Start Loading Test Vector
  Load (three further shift cycles)
  Run Test Vector
  Test Values Propagate
  Flop-In Result, Start (Un)loading Test Vector
  Continue (Un)loading Test Vector (two cycles)
  Finish (Un)loading Test Vector
  Run Next Test Vector
7.5.3 Summary of Scan Testing
Adding scan circuitry 1. Registers around circuit to be tested are grouped into scan chains 2. Replace each op with mux + op 3. Flops and muxes wired together into scan chains 4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors Running test vectors 1. Put scan chain in scan mode 2. Load in test vector (one element of vector per clock cycle) 3. Put scan chain in normal mode 4. Run circuit for one clock cycle load result of test into ops 5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)
7.5.4 Example: Time to Test a Chip

An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.

Question: Calculate the total test time.

Answer: We can load and unload all of the scan chains at the same time, so the time is limited by the longest chain (22,000 bits). For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

\[
\begin{aligned}
\mathit{TimeTot} &= \mathit{ClockPeriod} \times \bigl(\mathit{MaxLengthVec} + \mathit{NumVecs} \times (\mathit{MaxLengthVec} + 1)\bigr) \\
&= \frac{1}{0.80 \times 800 \times 10^{6}} \times \bigl(22{,}000 + 500{,}000 \times (22{,}000 + 1)\bigr) \\
&\approx 17\ \text{secs}
\end{aligned}
\]
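The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `scan_test_time` and its parameter names are assumptions made for illustration.

```python
def scan_test_time(clock_hz, speed_fraction, chain_lengths, num_vectors):
    """Return (seconds, cycles) for a scan test with all chains shifted in parallel."""
    longest = max(chain_lengths)
    # Each vector costs (longest + 1) cycles: shift it in while the previous
    # result shifts out, plus one capture cycle. The final unload costs
    # another `longest` cycles.
    cycles = num_vectors * (longest + 1) + longest
    return cycles / (clock_hz * speed_fraction), cycles


chains = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
seconds, cycles = scan_test_time(800e6, 0.80, chains, 500_000)
print(cycles, round(seconds, 2))
```

The cycle count is 11,000,522,000, which at 640 MHz is a little over 17 seconds, in agreement with the answer above.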
7.6 Boundary Scan

Boundary scan originated as a technique to test wires on printed circuit boards (PCBs). The goal was to replace "bed-of-nails" style testing with a technique that would work for high-density PCBs (lots of small wires close together).

It is now used to test both boards and chip internals, both on boundaries (I/O pins) and on internal flops.
Boundary Scan with JTAG

Standardized by IEEE (1149) and previously by JTAG:
- 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
- 1 optional signal (Scan Pin: TRST)
- protocol to connect circuit under test to tester and other circuits
- state machine to drive test circuitry on chip
- Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones.

Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.
7.6.1 Boundary Scan History

1985  JETAG: Joint European Test Action Group
1986  JTAG (North American companies joined)
1990  JTAG 2.0 formed basis for IEEE 1149.1 "Test access port and boundary scan architecture"
7.6.2 Scan Pins

TDI   test data input: input test vector to chip
TDO   test data output: output result of test
TCK   test clock: clock signal that test runs on
TMS   test mode select: controls scan state machine
TRST  test reset (optional): resets the scan state machine
[Figure: Overview. Labels: chip, scan registers, normal input pins, circuit under test, normal output pins, control, TDI, TCK, TMS, TDO.]

[Figure: Expanded View. Labels: chip, BSR, BSC cells around the circuit under test, control, TDI, TDO, TCK, TMS, BR, IR, IRC, Instruction Decoder, IDCODE, TAP Controller.]
7.6.3 Scan Registers and Cells

TDR                 Test data register. The boundary scan registers on a chip.
DR (Fig 14.2)       Data register cell. Often used as a boundary scan cell (BSC).

JTAG Components

(Fig 14.8)          Top level diagram.
BSR (Fig 14.5)      Boundary scan register. A chain of boundary scan cells (BSCs).
BSC (Fig 14.2)      Boundary scan cell. Connects external input and scan signal to internal circuit. Acts as a wire between external input and internal circuit in normal mode.
BR (Fig 14.3)       Bypass-register cell. Allows direct connection from TDI to TDO. Acts as a wire when executing the BYPASS instruction.
IDCODE              Device identification register. Data register to hold the manufacturer's name and chip identifier. Used in the IDCODE instruction.
IR cell (Fig 14.4)  Instruction register cell. Cells are combined together as a shift register to form an instruction register (IR).
IR (Fig 14.6)       Instruction register. Two or more IR cells in a row. Holds data that is shifted in on TDI, sends this data in parallel to the instruction decoder.
IDecode (Table 14.4) Instruction decoder. Reads the instruction stored in the instruction register (IR) and sends control signals to the bypass register (BR) and boundary scan register (BSR).
TAP (Fig 14.7)      TAP Controller. State machine that, together with the instruction decoder, controls the scan circuitry.
7.6.4 Scan Instructions

EXTEST   Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE   Sample result data.
PRELOAD  Load test vector.
BYPASS   Directly connect TDI to TDO. This is used when several chips are daisy chained together, to skip loading data into some chips.
IDCODE   Output manufacturer and part number.

This is the set of required instructions; other instructions are optional.
7.6.5 TAP Controller

The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7.
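Since Fig 14.7 is not reproduced here, the following Python table is a sketch of the 16-state machine as it is commonly described for IEEE 1149.1; the state names and the helper `tap_walk` are assumptions of this sketch and should be checked against the figure. Each state has one successor for TMS = 0 and one for TMS = 1, taken on each TCK edge.

```python
# Next-state table: state -> (next state if TMS = 0, next state if TMS = 1)
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def tap_walk(state, tms_bits):
    """Follow a sequence of TMS values (one per TCK edge) from `state`."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


# From reset, TMS = 0, 1, 0, 0 reaches the data-register shift state.
print(tap_walk("Test-Logic-Reset", [0, 1, 0, 0]))
```

One property of this table is that holding TMS at 1 for five clock cycles returns the controller to Test-Logic-Reset from any state, which is why TRST can be optional.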
7.6.6 Other Descriptions of JTAG/IEEE 1149.1

Texas Instruments introductory seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar1.pdf
Texas Instruments intermediate seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar2.pdf
Sun microSPARC-IIep scan-testing documentation: http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/
Intellitech JTAG overview: http://www.intellitech.com/resources/technology.html
Actel's JTAG description: http://www.actel.com/appnotes/97s05d15.pdf
Description of JTAG support on Motorola ColdFire microprocessor: http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf
7.7
Summary and Conclusions on Testing
7.7.1 Faults

Faults are manufacturing defects. Common occurrences are opens (a wire is broken) and shorts (two wires are connected together).

When working with faults, we work with wire segments, not signals. In the circuit below, there are 8 different wire segments (L1 to L8). Each wire segment corresponds to a logically distinct fault location. All physical faults on a segment affect the same set of signals, so they are grouped together into a logical fault. If a signal has a fanout of 1, then there is one wire segment. A signal with a fanout of n, where n > 1, has n + 1 wire segments: one for the source signal and one for each gate of fanout.

[Figure: example circuit with wire segments L1 to L8.]

For signal b in the circuit here, the fanout is 2, so there are three wire segments (L2, L4, and L5).

Although there are many different bad behaviours that faults can lead to, the simple model of single stuck-at faults has proven very capable of finding real faults in real circuits.

single            assume that at most one wire segment in the circuit has a fault.
stuck-at-0 (s@0)  assume that the faulty behaviour is that the segment is hardwired to 0.
stuck-at-1 (s@1)  assume that the faulty behaviour is that the segment is hardwired to 1.
7.7.2 Testing

Faults are detected by stimulating circuits (the real, manufactured circuit, not a simulation!) with test vectors and checking that the real circuit gives the correct output.

Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical evidence demonstrate that testing a circuit for single stuck-at faults will also detect many other types of faults and will often detect multiple faults.

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit. These redundant parts are added to prevent timing hazards. As such, a stuck-at fault in redundant circuitry will not affect the steady-state behaviour of the circuit, but could allow timing glitches to occur.

If a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no redundant circuitry.

It is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0, or if they have multiple faults.

I think, but haven't seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a stuck-at fault that you aren't testing for can mask (hide) a fault that you are testing for.

There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
- generating test vectors
- overriding the normal datapath to send test vectors, rather than normal inputs, as inputs to flops
- comparing outputs of flops to the expected result
7.7.2.1 Scan Testing

In scan testing, the generation and checking are done off-chip. This has the advantage of flexibility and reduced on-chip hardware, but increases the length of time required to run a test.

We want to individually drive and read every flop in the circuit. Even without using any I/O pins for testing purposes, chips are already I/O bound, so scan testing must be very frugal in its use of pins. Flops are connected together in a scan chain with one input pin and one output pin.

If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

ScanLength = number of flip-flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan   = number of clock cycles to run test suite

\[
\mathit{TimeScan} = \mathit{NumVectors} \times (\mathit{ScanLength} + 1) + \mathit{ScanLength}
\]

To find a test vector that will detect a fault:
1. build the Boolean equation (or Karnaugh map) of the correct circuit
2. build the Boolean equation (or Karnaugh map) of the faulty circuit
3. compare the equations (or Karnaugh maps); regions of difference represent test vectors that will detect the fault

Because it takes so much time to perform a scan test, reducing the number of test vectors that are needed is very important.

"fault1 dominates fault2" is defined as: any test vector that will detect fault1 will also detect fault2.
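The three steps for finding a test vector can be tried out by brute force on a small circuit. The Python sketch below uses a hypothetical circuit, z = (a AND b) OR c, which is not one of the circuits from the lectures; the wire names "a", "b", "c", "n1", and "z" and the function names are made up for illustration.

```python
from itertools import product


def circuit(a, b, c, fault=None):
    """Hypothetical circuit z = (a AND b) OR c, with an optional stuck-at fault.

    `fault` is a (wire_name, stuck_value) pair. Wires: inputs "a", "b", "c",
    the AND-gate output "n1", and the circuit output "z".
    """
    def wire(name, value):
        # A faulty wire ignores its driver and keeps the stuck value.
        return fault[1] if fault and fault[0] == name else value

    a, b, c = wire("a", a), wire("b", b), wire("c", c)
    n1 = wire("n1", a & b)
    return wire("z", n1 | c)


def detecting_vectors(fault):
    """Input vectors (a, b, c) where the faulty and correct circuits differ."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, fault=fault)]


print(detecting_vectors(("n1", 0)))   # vectors that detect n1 stuck-at-0
print(detecting_vectors(("n1", 1)))   # vectors that detect n1 stuck-at-1
```

In this made-up circuit the only vector that detects n1 stuck-at-0 is (1, 1, 0), and that vector also detects z stuck-at-0, so under the definition above n1 s@0 dominates z s@0.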
Summary of Technique to Find and Order Test Vectors:
1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
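The ordering step (step 8) can be sketched as a greedy loop. In the Python below, the fault table `detects` is hypothetical data invented for illustration, and `order_vectors` is an assumed name; a greedy choice is one reasonable reading of the step, not necessarily the exact procedure used in the lectures.

```python
def order_vectors(detects):
    """Greedy ordering: repeatedly pick the vector that catches the most
    faults not already caught by the vectors chosen before it."""
    remaining = set().union(*detects.values())
    order = []
    while remaining:
        # sorted() makes ties deterministic: the first vector in sorted order wins.
        best = max(sorted(detects), key=lambda v: len(detects[v] & remaining))
        order.append(best)
        remaining -= detects[best]
    return order


# Hypothetical fault table: test vector -> set of faults it detects.
detects = {
    "000": {"L1@1", "L4@1"},
    "011": {"L2@0", "L5@0", "L7@0"},
    "101": {"L1@0", "L2@0", "L3@0", "L7@0"},
    "110": {"L3@0", "L6@1"},
}
print(order_vectors(detects))
```

Vector "011" detects three faults in total, but two of them are already covered once "101" has been chosen, so it drops behind "000" in the ordering. This is the effect the NOTE in step 8 warns about.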
7.7.2.2 Built-In Self Test (BIST)

With built-in self test, the circuit tests itself. Both test vector generation and checking are done using linear feedback shift registers (LFSRs).

The figure below shows an LFSR that generates all possible 3-bit vectors except 000. (An n-bit LFSR that generates 2^n - 1 different vectors is called a maximal-length LFSR.) Assume that reset initializes the circuit to 111. The sequence that is generated is: 111, 011, 001, 100, 010, 101, 110. This sequence is repeated, so the number after 110 is 111.

[Figure: 3-bit LFSR with flip-flop outputs q2, q1, q0.]

Each linear feedback shift register has a characteristic polynomial that corresponds to the behaviour of the signal that is the input to the first flip-flop in the shift register. The exponents in the polynomial correspond to the delay: x^0 is the input to the shift register, x^1 is the output of the first flip-flop, x^2 is the output of the second, etc. The coefficient is 1 if there's a feedback tap from the output of the flop.

Checking is done by building one signature analyzer circuit for each signal tested. The circuit returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this with complete accuracy would require storing 2^n bits of information for each output for a circuit with n inputs. This would be as expensive as the original circuit. So, BIST uses mathematics similar to error correction/detection to approximate whether the outputs are correct. This technique is called signature analysis and originated with Hewlett-Packard in the 1970s.

The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit is designed to output a 1 at the end of the sequence of 2^n - 1 test results if the sequence of results matches the correct circuit. We could do this with an LFSR of 2^n - 1 flops, but as said before, this would be at least as expensive as duplicating the original circuit.
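The sequence 111, 011, 001, 100, 010, 101, 110 can be reproduced in a few lines of Python. The feedback used here (the new left bit is the XOR of the middle and right bits, and everything shifts right) is inferred from the listed sequence, because the figure itself is not reproduced; which written bit position corresponds to q2, q1, or q0 is left open as an assumption.

```python
def lfsr_sequence(start=(1, 1, 1)):
    """Shift-right 3-bit LFSR: new left bit = XOR of the middle and right bits.

    Returns the states visited, as strings, until the start state recurs.
    """
    state, seen = start, []
    while state not in seen:
        seen.append(state)
        left, middle, right = state
        state = (middle ^ right, left, middle)
    return ["".join(map(str, s)) for s in seen]


print(lfsr_sequence())
```

Starting from 111 the model visits seven distinct states and never reaches 000, which is what makes this a maximal-length LFSR for n = 3.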
The checking LFSR is designed similarly to a hashing function or parity checking circuit. If it returns 0, then we know that there is a fault in the circuit. If it returns a 1, then there is probably not a fault in the circuit, but we can't say for sure. There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the more flip-flops are required.

The LFSR here recognizes the sequence 1, 0, 1, 1, 1, 0, 0:

[Figure: signature-analyzer LFSR whose input is the output from the circuit under test; flip-flop output q2 is labelled.]

It could be used, in conjunction with the maximal-length LFSR above, to detect faults in a circuit that, when stimulated with the sequence 111, 011, 001, 100, 010, 101, 110, outputs the sequence 1, 0, 1, 1, 1, 0, 0.
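The "probably not a fault" caveat can be made concrete with a small experiment. The Python sketch below is not the analyzer from the figure: it models a generic 3-bit serial-input signature register with hypothetical feedback taps, and it compares the final register contents with the expected signature instead of producing a single output bit.

```python
from itertools import product


def signature(bits, taps=(0, 2), width=3):
    """Serial-input signature register (hypothetical taps, for illustration).

    Each cycle the register shifts by one place; the bit entering stage 0 is
    the circuit output XORed with the tapped stages.
    """
    state = [0] * width
    for bit in bits:
        feedback = bit
        for t in taps:
            feedback ^= state[t]
        state = [feedback] + state[:-1]
    return tuple(state)


good = (1, 0, 1, 1, 1, 0, 0)   # fault-free output sequence from the text
expected = signature(good)

# Every 7-bit output sequence that this check would accept.
accepted = [s for s in product((0, 1), repeat=len(good))
            if signature(s) == expected]
aliases = [s for s in accepted if s != good]
print(len(accepted), len(aliases))
```

Seven output bits are compressed into a three-bit signature, so 16 of the 128 possible output sequences share the expected signature: the correct one and 15 faulty ones that alias to it. Adding flip-flops to the register reduces that fraction, which is the accuracy versus area tradeoff described above.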
7.7.3 Scan vs Self Test

Scan                               Self Test
less hardware                      more hardware
slower                             faster
well defined coverage              ill defined coverage
test vectors are easy to modify    test vectors are hard to modify
Chapter 8
Review

This chapter is a collection of information covering the major topics of the term. The Topics List section for each major area is meant to be relatively complete. The notes sections are less focused and are not indicative of the relative importance of the different topics we covered.

LEC-25: Review
Lecture Notes Sections: 8.1 to 8.9

wk-01 to 02   VHDL
wk-03 to 05   Design and Optimization
wk-06         Functional Validation
wk-07         Performance Analysis, Prediction, and Optimization
wk-08 to 09   Timing Analysis
wk-09 to 10   Power Analysis and Reduction
wk-11 to 12   Faults and Testing
wk-13         Review
8.1 Overview of the Term

The purely digital world
- VHDL
- design process
- optimization techniques
- functional validation
- performance analysis

Analog effects in the digital world
- timing analysis
- power
- faults and testing
Topics and Lectures

Design techniques
  Lec-01  Introduction and overview
  Lec-02  VHDL syntax and synthesis
  Lec-03  VHDL simulation semantics
  Lec-04  Hardware building blocks
Design and optimization techniques
  Lec-05  Dataflow diagrams and high-level models
  Lec-06  State machines
  Lec-07  Memory arrays
  Lec-08  Design example (stack)
  Lec-09  Optimization and coding guidelines
  Lec-10  FPGA-specific optimizations
Functional validation
  Lec-11  Datapath validation
  Lec-12  Control validation
Performance analysis and prediction
  Lec-13  Measuring performance, comparing optimizations
  Lec-14  Digital-circuit performance
Timing analysis
  Lec-15  Definitions, equations, sources of delay
  Lec-16  Math, physics and applications
  Lec-17  Storage
Power
  Lec-18  Power and energy analysis
  Lec-19  Data encoding for power reduction
  Lec-20  Clock gating
Testing and testability
  Lec-21  Faults; fault models; testability
  Lec-22  Fault detection and test vector generation
  Lec-23  Built-in self test (I)
  Lec-24  Built-in self test (II)
8.2 VHDL

simple syntax and semantics
- things that you should know simply by having done the miniproject and project

synthesizing VHDL
- match up VHDL code with hardware
- choose VHDL fragment to generate more optimal hardware
- identify whether a particular signal will be the output of combinational circuitry or a flop
- identify whether a particular process is combinational or clocked

VHDL semantics
- match up VHDL code with waveforms
- identify whether two VHDL fragments have same behaviour
- perform delta-cycle simulation of VHDL
- perform clock-cycle simulation of VHDL
8.3 Design and Optimization Techniques

- from algorithm to dataflow diagram
- from dataflow diagram to hardware
- optimizing dataflow diagrams
- finite state machines and hardware
- calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)
- calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))
- given a dataflow diagram, calculate the clock period that will result in the optimum performance
- given an algorithm, design a dataflow diagram
- given a dataflow diagram, design the datapath and finite state machine
- optimize a dataflow diagram to improve performance or reduce resource usage
8.4 Validation

- test benches
- assertions
- coverage monitors
- relational specification
- functional specification
- boundary conditions / corner cases
8.5 Performance Prediction and Analysis

- time to execute a program
- definition of performance
- speedup
- n% faster
- calculating performance of different tasks and average task
- choosing which task to optimize to best improve overall performance
- CPI calculations
- performance increase over time
- design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)
- CPI calculations
- MIPs calculations
- clock speed vs. performance
- optimality
- performance / area tradeoffs
8.6 Timing Analysis

- what affects delay
- setup, hold, clock-to-Q times, skew, jitter, etc
- clock period
- clock skew
- clock jitter
- propagation delay
- load delay
- setup time
- hold time
- clock-to-Q time
- critical path
- find the critical path through a circuit
- find the minimum clock period for a circuit
- find a pair of assignments to signals that exercises the critical path
- false path
- determine whether a critical path is real or false
- derating factors
8.7 Power

- power vs energy
- equations for power
- dynamic power
- static power
- switching power
- short circuit power
- leakage power
- activity factor
- leakage current
- threshold voltage
- power reduction techniques: clock gating, data encoding
8.8 Testing

- causes of faults
- locations of faults
- physical faults: open, short, wired AND, wired OR, stronger wins
- mathematical models of faults
- single stuck-at fault
- will a test for a mathematical fault detect a physical fault?
- testable / untestable fault
- fault masking
- redundant circuitry
- timing hazards
- economics of testing
- fault coverage

Testing II

built-in self-testing
- linear feedback shift register
- characteristic polynomials: addition, multiplication, division (quotient and remainder), relationship to hardware
- maximal length linear feedback shift register
- signature analyzer
- fault aliasing
- process and time to run a BIST test

test vector generation
- generate test vector to find a particular fault
- generate test vectors to find a set of faults
- fault collapsing: gate collapsing, node collapsing
- fault domination
- order test vectors to reduce test time
8.9
Formulas to be Given on Final Exam
p
106
i
i 0
LEC-25:
8.9
1 2
R A t
10
1 38066
q e k
1 60218
10
Formulas II
23 19
J/K C 14
LEC-25:
8.9
Part II
Solutions to Tutorial Notes
Chapter 1
VHDL Problems
SOL-01 Preliminaries
SOL-01: VHDL Syntax

1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2 to 3 word description of the value.

Values: -, #, 0, 1, A, h, H, L, Q, X, Z.

Answer:

Value   In std_logic_1164?   Description
-       Yes                  don't care
#       No
0       Yes                  strong 0
1       Yes                  strong 1
A       No
h       No
H       Yes                  weak 1
L       Yes                  weak 0
Q       No
X       Yes                  strong unknown
Z       Yes                  high impedance

NOTE: h is not in the package, because characters are case sensitive. For example, 'a' /= 'A'.
1.2 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

    entity montevido is
      port (
        a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
        l : in std_logic_vector (1 downto 0);
        p, q, r, s, t, u, v, w, x, y, z : out std_logic
      );
    end montevido;

    architecture main of montevido is
      signal i, j : std_logic;
    begin
      i <= c0 XOR c1;
      j <= c0 XOR c1;
      process (a, i, j) begin
        if (a = '1') then
          p <= i AND j;
        else
          p <= NOT i;
        end if;
      end process;
      process (a, b0, b1) begin
        if rising_edge(a) then
          q <= b0 AND b1;
        end if;
      end process;
      process (a, c0, c1, d0, d1, e0, e1)
      begin
        if (a = '1') then
          r <= c0 OR c1;
          s <= d0 AND d1;
        else
          r <= e0 XOR e1;
        end if;
      end process;
      process begin
        wait until rising_edge(a);
        t <= b0 XOR b1;
        u <= NOT t;
        v <= NOT x;
      end process;
      process begin
        case l is
          when "00" =>
            wait until rising_edge(a);
            w <= b0 AND b1;
            x <= '0';
          when "01" =>
            wait until rising_edge(a);
            w <= '-';
            x <= '1';
          when "1-" =>
            wait until rising_edge(a);
            w <= c0 XOR c1;
            x <= '-';
        end case;
      end process;
      y <= c0 XOR c1;
      z <= x XOR w;
    end main;
Answer:

Signal   Latch   Combinational   Flip-flop
p                X
q                                X
r                X
s        X
t                                X
u                                X
v                                X
w                                X
x                                X
y                X
z                X
1.3 Counting Clock Cycles

This question refers to the VHDL code shown below.

NOTES:
1. "..." represents a legal fragment of VHDL code
2. assume all signals are properly declared
3. the VHDL code is intended to be legal, synthesizable code
4. all signals are initially 'U'

    entity bigckt is
      port (
        a, b : in std_logic;
        c : out std_logic
      );
    end bigckt;

    architecture main of bigckt is
    begin
      process (a, b)
      begin
        if (a = '0') then
          c <= '0';
        else
          if (b = '1') then
            c <= '1';
          else
            c <= '0';
          end if;
        end if;
      end process;
    end main;

    entity tinyckt is
      port (
        clk : in std_logic;
        i : in std_logic;
        o : out std_logic
      );
    end tinyckt;

    architecture main of tinyckt is
      component bigckt ( ... );
      signal ... : std_logic;
    begin
      p0 : process begin
        wait until rising_edge(clk);
        p0_a <= i;
        wait until rising_edge(clk);
      end process;
      p1 : process begin
        wait until rising_edge(clk);
        p1_b <= p1_d;
        p1_c <= p1_b;
        p1_d <= s2_k;
      end process;
      p2 : process (p1_c, p3_h, p4_i, clk) begin
        if rising_edge(clk) then
          p2_e <= p3_h;
          p2_f <= p1_c = p4_i;
        end if;
      end process;
      p3 : process (i, s4_m) begin
        p3_g <= i;
        p3_h <= s4_m;
      end process;
      p4 : process (clk, i) begin
        if (clk = '1') then
          p4_i <= i;
        else
          p4_i <= '0';
        end if;
      end process;
      huge : bigckt
        (a => p2_e, b => p1_d, c => h_y);
      s1_j <= s3_l;
      s2_k <= p1_b XOR i;
      s3_l <= p2_f;
      s4_m <= p2_f;
    end main;
For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change
SOL-01:
1.3
affects the destination signal?
Answer:
NOTE: i doesnt affect the value of p2 f just before a rising edge of clock, so i doesnt affect p2 e at all along the path that goes through p2 f source signal destination signal no connection same clock cycle 1 clock cycle 2 clock cycle 3 clock cycle 4 clock cycle 5 clock cycle 6 clock cycle 7 clock cycle 8 clock cycle 9 clock cycle 10 or more clock cycles i p0 a i p1 b i p1 c i p2 e i p3 g X X X X X i p4 i X X X s4 m hy p1 b p1 d p2 f s1 j X
1.4 Arithmetic Overflow

Implement a circuit to detect overflow in 8-bit signed arithmetic.

Answer:

An overflow in 8-bit arithmetic happens when the carry into the most significant bit is different from the carry out of the most significant bit.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity overflow is
      port (
        num1, num2 : in signed(7 downto 0);
        cin : in std_logic;
        ovrflw : out std_logic
      );
    end overflow;

    architecture main of overflow is
      signal num1_ext, num2_ext, result : signed(8 downto 0);
    begin
      -- extend the operands to 9 bits before adding
      num1_ext <= '0' & num1;
      num2_ext <= '0' & num2;
      result <= num1_ext + num2_ext + ("00000000" & cin);
      -- overflow: the operands have the same sign, but the sign of the result differs
      ovrflw <= not (num1_ext(7) xor num2_ext(7))
                and (num1_ext(7) xor result(7));
    end main;
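The VHDL answer tests the sign bits, while the explanation is phrased in terms of carries. The following Python sketch checks exhaustively that the two formulations agree with each other and with a direct range check; the function names are assumptions made for this illustration.

```python
def to_signed8(x):
    """Interpret the low 8 bits of x as a two's-complement number."""
    x &= 0xFF
    return x - 256 if x & 0x80 else x


def overflow_by_sign(a, b, cin):
    """Sign test: the operands share a sign that the result does not have."""
    r = (a + b + cin) & 0xFF
    a7, b7, r7 = (a >> 7) & 1, (b >> 7) & 1, (r >> 7) & 1
    return (not (a7 ^ b7)) and bool(a7 ^ r7)


def overflow_by_carry(a, b, cin):
    """Carry test: carry into the MSB differs from carry out of the MSB."""
    carry_in_msb = ((a & 0x7F) + (b & 0x7F) + cin) >> 7
    carry_out_msb = ((a & 0xFF) + (b & 0xFF) + cin) >> 8
    return carry_in_msb != carry_out_msb


def overflow_by_range(a, b, cin):
    """Reference: the true sum does not fit in 8-bit signed range."""
    total = to_signed8(a) + to_signed8(b) + cin
    return not (-128 <= total <= 127)


# Exhaustive comparison over all 8-bit operand pairs and both carry-in values.
mismatches = [(a, b, c)
              for a in range(256) for b in range(256) for c in (0, 1)
              if len({overflow_by_sign(a, b, c),
                      overflow_by_carry(a, b, c),
                      overflow_by_range(a, b, c)}) != 1]
print(len(mismatches))
```

An empty `mismatches` list means the three tests give the same answer for every input, so either formulation can be used in hardware.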
1.5 8-Bit Register

Implement an 8-bit register that has:
- clock signal clk
- input data vector d
- output data vector q
- synchronous active-high input reset
- synchronous active-high input enable

Answer:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity reg_8 is
      port (
        clk, reset, enable : in std_logic;
        d : in std_logic_vector (7 downto 0);
        q : out std_logic_vector (7 downto 0)
      );
    end reg_8;

    architecture main of reg_8 is
    begin
      reg: process begin
        wait until (rising_edge(clk));
        if reset = '1' then
          q <= (others => '0');
        elsif enable = '1' then
          q <= d;
        end if;
      end process reg;
    end main;
1.5.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.

Answer:

    reg : process (clk, reset) begin
      if reset = '1' then
        q <= (others => '0');
      elsif rising_edge(clk) then
        if enable = '1' then
          q <= d;
        end if;
      end if;
    end process reg;
1.5.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.
1.5.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.

Answer:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity reg_8_tb is
    end reg_8_tb;

    architecture main of reg_8_tb is
      component reg_8 is
        port (
          clk : in std_logic;
          reset : in std_logic;
          enable : in std_logic;
          d : in std_logic_vector (7 downto 0);
          q : out std_logic_vector (7 downto 0)
        );
      end component;
      signal clk, reset, enable : std_logic;
      signal d, q : std_logic_vector(7 downto 0);
    begin
      uut : reg_8 port map (
        clk => clk, reset => reset, enable => enable, d => d, q => q );
      process begin
        clk <= '1'; reset <= '0';
        wait for 20 ns;  -- time=20 ns
        clk <= '0'; reset <= '1'; enable <= '1'; d <= "10101011";
        wait for 20 ns;  -- time=40 ns
        clk <= '1';
        wait for 20 ns;  -- time=60 ns
        clk <= '0'; enable <= '0'; d <= "00001011";
        wait for 20 ns;  -- time=80 ns
        clk <= '1';
        wait for 20 ns;  -- time=100 ns
        clk <= '0'; enable <= '1';
        wait for 20 ns;  -- time=120 ns
        clk <= '1';
        -- the listing ends here in the source; the three closing lines
        -- below are added so that the fragment compiles
        wait;
      end process;
    end main;
1.6 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES:
1) "..." represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.

q2a
    architecture main of anchiceratops is
      signal a, b, c : std_logic;
    begin
      process begin
        wait until rising_edge(c);
        a <= if (b = '1') then
               ...
             else
               ...
             end if;
      end process;
    end main;
ILLEGAL: if-then-else is a statement, not an expression, so can't have if-then-else on the right-hand side of an assignment.

q2b
    architecture main of tulerpeton is
    begin
      lab: for i in 15 downto 0 loop
        ...
      end loop;
    end main;
ILLEGAL: loop statements are sequential, while architecture bodies contain concurrent statements.

q2c
    architecture main of temnospondyl is
      component compa
        port (
          a : in std_logic;
          b : out std_logic
        );
      end component;
      signal p, q : std_logic;
    begin
      coma_1 : compa port map (a => p, b => q);
      ...
    end main;
LEGAL

q2d
    architecture main of metaxygnathus is
      signal a : std_logic;
    begin
      lab: if (a = '1') generate
        ...
      end generate;
    end main;
ILLEGAL: the condition for if-generate statements must be statically determined; testing the value of a signal is dynamic.

q2e
    architecture main of pachyderm is
      function inv(a : std_logic)
        return std_logic is
      begin
        return(NOT a);
      end inv;
      signal p, b : std_logic;
    begin
      p <= inv(b => a);
      ...
    end main;
ILLEGAL: the argument to inv should be (a => b).

q2f
    architecture main of apatosaurus is
      type state_ty is (S0, S1, S2);
      signal st : state_ty;
      signal p : std_logic;
    begin
      case st is
        when S0 | S1 => p <= '0';
        when others => p <= '1';
      end case;
    end main;
ILLEGAL: case statements are sequential, but the body of an architecture contains concurrent statements.
SOL-02: VHDL Semantics

SOL-02:
1.7
CLOCK-CYCLE SIMULATION
1.7
Clock-Cycle Simulation
Given the VHDL code for deinonychus and waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.
SOL-02:
1.7
architecture main of deinonychus is signal y, z : unsigned(15 downto 0) signal state : state_ty; begin proc_herzog: process begin top_loop: loop wait until (rising_edge(clk)); library ieee; next top_loop when (reset = 1 use ieee.std_logic_1164.all; state <= durian; use ieee.numeric_std.all; wait until (rising_edge(clk)); state <= papaya; package deinonychus_pkg is while y < z loop type state_ty is wait until (rising_edge(clk)) (mango, guava, durian, papaya); if sel = 1 then end deinonychus_pkg; wait until (rising_edge(clk next top_loop when (reset = library ieee; state <= mango; use ieee.std_logic_1164.all; end if; use ieee.numeric_std.all; state <= papaya; use work.deinonychus_pkg.all; end loop; end loop; entity deinonychus is end process; port ( proc_hillary: process (clk) clk, reset, sel : in std_logic; begin a, b : in unsigned(15 downto 0); if rising_edge(clk) then p : out unsigned(15 downto 0) if (state = durian) then ); z <= a; end deinonychus; else z <= z + 2; end if; end if; end process; y <= b; p <= y + z; end main;
SOL-02:
0 reset clk
1.7
20
40 60 80 100 120 140 160 180 200
sel
01 0E 02 0C 04 0A 06 08 0E 02 0C 04 0A 06 08 0E 02 0C 04 0A
b state
0F 03 0D 05 0B 07 01 0F 03 0D 05 0B 07 01 0F 03 0D 05 0B 07
0F 03 0D 05 0B 07 01 0F 03 0D 05 0B 07 01 0F 03 0D 05 0B 07
z p
U
U
2
07
6
15
A
11
55ns
107ns
147ns
195ns
Answer:

Time     y    z    p
55ns     7    U    U
107ns    5    2    7
147ns    F    6    15
195ns    7    A    11
1.8 Delta-Cycle Simulation: Pong

Simulate the following VHDL code by drawing a timing diagram.

INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.

NOTES:
1. The initial value of all signals is 'U'.
2. The signal reset becomes '1' at 0 ns and then becomes '0' at 5 ns.

    architecture main of pong_machine is
      signal ping_i, ping_n, pong_i, pong_n : std_logic;
    begin
      process (clk) begin
        if rising_edge(clk) then
          ping_n <= ping_i;
          pong_n <= pong_i;
        end if;
      end process;
      process (pong_n, ping_n, reset) begin
        if (reset = '1') then
          ping_i <= '1';
          pong_i <= '0';
        else
          ping_i <= pong_n;
          pong_i <= ping_n;
        end if;
      end process;
      out_pong_proc : process (pong_i) begin
        pong <= pong_i;
      end process;
      ping <= ping_i;
    end main;
SOL-02:
1.9
DELTA-CYCLE SIMULATION: FEMUR
1.9
Delta-Cycle Simulation: Femur
Simulate the following VHDL code by completing the timing diagram on the next page. INSTRUCTIONS: 1. The simulation is to be done at the granularity of simulation-steps. 2. Show all changes to process modes and signal values. 3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process. 4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round. 5. Write t=5ns and t=10ns at the top of columns where time advances to 5 ns and 10 ns. NOTES: 1. The initial value of all of the signals are shown in the timing diagram. 2. The only changes on clk, a, and b are: (a) At 5 ns, a changes from 0 to 1. (b) At 5 ns, b changes from 0 to 1. (c) At 10 ns, clk changes from 0 to 1.
SOL-02:
1.9
entity femur is port ( clk, a, b : in std_logic; f : out std_logic ); end femur; architecture main of femur is signal c, d, e : std_logic; begin proc_1 : process (a, b, c) begin c <= a and b; d <= a xor c; end process; proc_2 : process begin e <= d; wait until rising_edge(clk); end process; proc_3 : process (c, e) begin f <= c xor e; end process; end main;
[Timing diagram (solution). Rows: simulation round, simulation cycle, delta cycle, proc_external, proc_1, proc_2, proc_3, clk. Columns are marked where time advances to t=5 ns and t=10 ns.]
1.10 VHDL-VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour as it does in the main architecture of teradactyl?
NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity teradactyl is
  port (
    a : in std_logic;
    v : out std_logic
  );
end teradactyl;

architecture main of teradactyl is
  signal m : std_logic;
begin
  m <= a;
  v <= m;
end main;

architecture q3a of teradactyl is
  signal b, c, d : std_logic;
begin
  b <= a;
  c <= b;
  d <= c;
  v <= d;
end q3a;
SAME
architecture q3b of teradactyl is
  signal m : std_logic;
begin
  process (a, m) begin
    v <= m;
    m <= a;
  end process;
end q3b;
SAME

architecture q3c of teradactyl is
  signal m : std_logic;
begin
  process (a) begin
    m <= a;
  end process;
  process (m) begin
    v <= m;
  end process;
end q3c;
SAME
1.11 VHDL-VHDL Behavioural Comparison: Ichthyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?
NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity ichthyostega is
  port (
    clk : in std_logic;
    b, c : in signed(3 downto 0);
    v : out signed(3 downto 0)
  );
end ichthyostega;

architecture main of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end main;

architecture q4a of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    if (cx > 0) then
      wait until (rising_edge(clk));
      v <= bx;
    else
      wait until (rising_edge(clk));
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4a;
DIFFERENT: evaluations of cx > 0 and v <= bx are separated by a clock cycle.
architecture q4b of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4b;
DIFFERENT: each assignment statement (e.g. bx <= b) will execute every other clock cycle, rather than every clock cycle.

architecture q4c of ichthyostega is
  signal bx, cx, dx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    v <= dx;
  end process;
  dx <= bx when (cx > 0) else to_signed(-1, 4);
end q4c;
SAME
1.12 Waveform VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3d has the same behaviour as the timing diagram.
NOTES:
1) Same behaviour means that the signals a, b, and c have the same values at the end of each clock cycle in steady-state simulation (ignore any irregularities in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc. are properly defined and declared.
4) All of the code fragments are legal, synthesizable VHDL code.
[Timing diagram: waveforms for clk, a, b, c]
q3a
architecture q3a of q3 is
begin
  process begin
    a <= '1';
    loop
      wait until rising_edge(clk);
      a <= NOT a;
    end loop;
  end process;
  b <= NOT a;
  c <= NOT b;
end q3a;
SAME

q3b
architecture q3b of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= b;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= a;
end q3b;
SAME
q3c
architecture q3c of q3 is
begin
  process begin
    a <= '0';
    b <= '1';
    wait until rising_edge(clk);
    b <= a;
    a <= b;
    wait until rising_edge(clk);
  end process;
  c <= NOT b;
end q3c;
SAME

q3d
architecture q3d of q3 is
begin
  process (b, clk) begin
    a <= NOT b;
  end process;
  process (a, clk) begin
    b <= NOT a;
  end process;
  c <= NOT b;
end q3d;
DIFFERENT: this code has combinational loops
q3e
architecture q3e of q3 is
begin
  process
  begin
    a <= '1';
    b <= '0';
    c <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    c <= NOT b;
    wait until rising_edge(clk);
  end process;
end q3e;
DIFFERENT: c is a constant 1

q3f
architecture q3f of q3 is
begin
  process
  begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= not b;
end q3f;
DIFFERENT: a is a constant 1
1.13 Hardware VHDL Comparison

entity q2 is
  port (
    a, clk, reset : in std_logic;
    d : out std_logic
  );
end q2;

architecture main of q2 is
  signal b, c : std_logic;
begin
  b <= '0' when (reset = '1') else a;
  process (clk) begin
    if rising_edge(clk) then
      c <= b;
      d <= c;
    end if;
  end process;
end main;
For each of the circuits q2a through q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.
[Circuit diagrams q2a, q2b, q2c, and q2d: each has inputs reset, a, and clk, and output d.]
1.14 Synthesizable VHDL and Hardware
For each of the fragments of VHDL q4a through q4d, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.
q4a
process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
  e <= NOT d;
end process;
Answer: Unsynthesizable: different conditions in wait statements in the same process. This would lead to a single flip-flop requiring multiple clock signals.

q4b
process begin
  while (c /= '1') loop
    if (b = '1') then
      wait until rising_edge(a);
      e <= d;
    else
      e <= NOT d;
    end if;
  end loop;
  e <= b;
end process;
Answer: Unsynthesizable: while loop around code where some paths have wait statements and some do not. Even having a while loop with a dynamic condition around code without a wait statement would be unsynthesizable, because it would lead to combinational loops in the hardware.
q4c
process (a, d) begin
  e <= d;
end process;
process (a, e) begin
  if rising_edge(a) then
    f <= NOT e;
  end if;
end process;
Answer: Flop with inverter on input

q4d
process (a) begin
  if rising_edge(a) then
    if b = '1' then
      e <= '0';
    else
      e <= d;
    end if;
  end if;
end process;
Answer: Synchronous reset (AND with bubble). The Reset pin on a flip-flop is generally asynchronous, so a flop with a reset pin would be incorrect.
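The distinction drawn in the q4d answer can be sketched by placing the two reset styles side by side. This is a minimal illustration, not part of the original question: the entity reset_styles and its port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only to contrast the two reset styles.
entity reset_styles is
  port (
    clk, rst, d : in  std_logic;
    q_sync      : out std_logic;
    q_async     : out std_logic
  );
end reset_styles;

architecture main of reset_styles is
begin
  -- Synchronous reset: rst is examined only on a rising clock edge,
  -- so it becomes gating logic in front of the flop's D input
  -- (same structure as q4d).
  sync_proc : process (clk) begin
    if rising_edge(clk) then
      if rst = '1' then
        q_sync <= '0';
      else
        q_sync <= d;
      end if;
    end if;
  end process;

  -- Asynchronous reset: rst is in the sensitivity list and is tested
  -- before the clock edge, so it typically maps to the flop's reset pin.
  async_proc : process (clk, rst) begin
    if rst = '1' then
      q_async <= '0';
    elsif rising_edge(clk) then
      q_async <= d;
    end if;
  end process;
end main;
```

In the first process the reset test sits inside the rising_edge branch, which is why a flop with a dedicated reset pin would not match the behaviour of q4d.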
1.15 Datapath Design
Each of the three VHDL fragments q4a through q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):
1. read in source and destination addresses from i_src1, i_src2, i_dst
2. read operands op1 and op2 from memory
3. compute sum of operands sum
4. write sum to memory at destination address dst
5. write sum to output o_result
[Block diagram: inputs clk, i_src1, i_src2, i_dst; output o_result]
1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a through q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load=1.
NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #1.
3. The control circuitry that controls the datapath will output a signal load, which will be 1 when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a through q4c.
5. All of the VHDL is legal, synthesizable code.
-- This code is to be used for all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
       mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b : unsigned(7 downto 0);
...
process (clk) begin
  if rising_edge(clk) then
    src1 <= i_src1;
    src2 <= i_src2;
    dst <= i_dst;
    o_result <= sum;
  end if;
end process;
mem : ram256x16d
  port map (clk => clk,
            i_addr_a => mem_addr_a,
            i_addr_b => mem_addr_b,
            i_we_a => mem_we,
            i_data_a => mem_in_a,
            o_data_a => mem_out_a,
            o_data_b => mem_out_b);

q4a
op1 <= mem_out_a when state = "0010" else (others => '0');
op2 <= mem_out_b when state = "0010" else (others => '0');
sum <= op1 + op2 when state = "0100" else (others => '0');
mem_in_a <= sum when state = "1000" else (others => '0');
mem_addr_a <= dst when state = "1000" else src1;
mem_we <= '1' when state = "1000" else '0';
mem_addr_b <= src2;
process (clk) begin
  if rising_edge(clk) then
    if (load = '1') then
      state <= "1000";
    else
      -- rotate state vector one bit to left
      state <= state(2 downto 0) & state(3);
    end if;
  end if;
end process;
Answer: The circuit is not correct: all of the signals are combinational. Also, there could be initialization problems with state.
q4b
process (clk) begin
  if rising_edge(clk) then
    op1 <= mem_out_a;
    op2 <= mem_out_b;
  end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1' else src1;
mem_addr_b <= src2;
Answer: The circuit is correct. load = 1 in clock cycle 2.
q4c
process begin
  wait until rising_edge(clk);
  op1 <= mem_out_a;
  op2 <= mem_out_b;
  sum <= op1 + op2;
  mem_in_a <= sum;
end process;
process (load, dst, src1) begin
  if load = '1' then
    mem_addr_a <= dst;
  else
    mem_addr_a <= src1;
  end if;
end process;
mem_addr_b <= src2;
Answer:
If we take the code exactly as is: the circuit is incorrect, because mem_we is missing.
If we assume that mem_we is added: the circuit is correct. Need load = 1 in cycle 4.
1.15.2 Smallest Area

Of all of the circuits (q4a through q4c), including both correct and incorrect circuits, predict which will have the smallest area. If you don't have sufficient information to predict the relative areas, explain what additional information you would need to predict the area prior to synthesizing the designs.
Answer: Assuming that q4c includes mem_we: All of the circuits have an adder, memory, input flops, output flops, and a mux for mem_addr_a. The differences are in the flops and misc circuitry:
        flops   ands
q4a     1*4     5*4
q4b     2*8     0
q4c     4*8     0
From this analysis, q4a has the smallest area.
1.15.3 Shortest Clock Period

Of all of the circuits (q4a through q4c), including both correct and incorrect circuits, predict which will have the shortest clock period. If you don't have sufficient information to predict the relative periods, explain what additional information you would need to predict the period prior to performing any synthesis or timing analysis of the designs.
Answer: q4c has the shortest clock period, because it does the least amount of computation between flip-flops: all of the signals are flopped.
Chapter 2
Design Problems
SOL-03: Datapath and Control Design

2.1 Synthesis
This question is about using VHDL to implement memory structures on FPGAs.
2.1.1 Data Structures
If you have to write your own code (i.e. you do not have a library of memory components or a special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?
2.1.2 Own Code vs Libraries
When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?
2.2 Design Guidelines
While you are grocery shopping you encounter your co-op supervisor from last year. She's now forming a startup company in Waterloo that will build digital circuits. She's writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines. What is your response to each question? What is your justification for your answer? What are the tradeoffs between the two options?

0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?
Answer: All projects should use silicon-based chips, because biological chips don't exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.

1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?
Answer: Synchronous reset. Synchronous reset leads to more robust designs. With asynchronous reset, a flop is reset whenever the reset signal arrives. Due to wire delays, signals will arrive at different flops at different times. If an asynchronous reset occurs at about the same time as a clock edge, some flops might be reset in one clock cycle and some in the next. This can lead to glitches and/or illegal values on internal state signals. The tradeoff is that asynchronous reset is often easier to code in VHDL and requires less hardware to implement.

2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?
Answer: Flops. Flip-flops lead to more robust designs than latches. Latches are level sensitive and act as wires when enabled. For a latch-based design to work correctly, there cannot be any overlap in the time when a consecutive pair of latches are enabled. If this happens, the value on a signal will leak through the latch and arrive at the next set of latches one clock phase too early. Thus, latch-based designs are more sensitive to the timing of clock signals. Another disadvantage of latches is that some FPGAs and cell libraries do not support them. In comparison, D-type flip-flops are (almost?) always supported. The tradeoff is that latches are smaller and faster than flip-flops. A common implementation of a flip-flop is a pair of latches in a master/slave combination.

3. Should all chips have registers on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.
Answer: Flops on outputs and inputs. Putting flops on inputs and outputs will make the clock speed of the chip less dependent on the propagation delay between chips. Flops can also be used to isolate the internals of the chip from glitches and other anomalous behaviour that can occur on the boards. The tradeoff is that flops consume area and will increase the latency through the chip.

4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project
choose its own technique? By register we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.
Answer: Each project should adopt a convention of either using flops on inputs of modules or outputs of modules. It is rarely necessary to put flops on both inputs and outputs of modules on the same chip. This is because the wire delay between modules is usually less than a clock period. Putting flops on either the inputs or outputs is advantageous because it provides a standard design convention that makes it easier to glue modules together without violating timing constraints. If modules were allowed to have combinational circuitry on both inputs and outputs, the maximum clock speed of the design could not be determined until all of the modules were glued together. The tradeoff is that flops add area and latency. Sometimes there will be two modules where the combinational circuitry on the outputs of one can be combined with the combinational circuitry on the inputs of the second without violating timing constraints. This discipline prevents that optimization.
Aside: Sometimes, to meet performance targets, in situations such as this, a project will remove or move the flops between modules and do clock borrowing to fit the maximum amount of circuitry into a clock period. This is a rather low-level optimization that happens late in the design cycle. It can cause big headaches for functional validation and equivalence verification, because the specifications for modules are no longer clean and the boundaries between modules on the low-level design might be different from the boundaries in the high-level design.

5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?
Answer: Multiplexors. Multiplexors lead to more robust designs. Tri-state buffers rely on analog characteristics of devices to work correctly. Tri-state buffers can work incorrectly in the presence of voltage fluctuations or fabrication process variations. Multiplexors work on a purely Boolean level and as such are less sensitive to changes in voltages or fabrication processes. The tradeoff is that tri-state buffers are smaller and faster than multiplexors.
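The two options in guideline 5 can be sketched in VHDL as follows. This is a minimal illustration under stated assumptions: the entity bus_styles and all of its port names are hypothetical and do not come from the course material.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only to contrast the two styles.
entity bus_styles is
  port (
    a, b       : in  std_logic_vector(7 downto 0);
    sel        : in  std_logic;  -- multiplexor select
    en_a, en_b : in  std_logic;  -- tri-state enables
    y_mux      : out std_logic_vector(7 downto 0);
    y_tri      : out std_logic_vector(7 downto 0)
  );
end bus_styles;

architecture main of bus_styles is
begin
  -- Multiplexor: exactly one source is selected by Boolean logic.
  y_mux <= a when sel = '1' else b;

  -- Tri-state bus: two drivers share y_tri. Each driver releases the
  -- bus with 'Z' when disabled. If en_a and en_b are both '1' and the
  -- data differ, the resolved value is 'X' (bus contention).
  y_tri <= a when en_a = '1' else (others => 'Z');
  y_tri <= b when en_b = '1' else (others => 'Z');
end main;
```

The multiplexor version cannot produce contention, which is one way to read the robustness argument above; the tri-state version relies on the enables never overlapping.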
2.3 Dataflow Diagram Optimization
Use the dataflow diagram below to answer questions 2.3.1 and 2.3.2.
[Dataflow diagram: signals a, b, c, d, e; components f and g]
2.3.1 Resource Usage
List the number of items for each resource used in the dataflow diagram.
Answer:
input ports     3
output ports    1
registers       4
f components    2
g components    1
2.3.2 Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
1. you may change the times when signals are read from the environment
2. you may not increase the resource usage (input ports, registers, output ports, f components, g components)
3. you may not increase the clock period
Answer:
[Optimized dataflow diagram: signals a, b, c, d, e; components f and g]
2.4 Dataflow Diagram Design
Your manager has given you the task of implementing the following pseudocode in an FPGA:
  if is_odd(a + d)
    p = (a + d)*2 + ((b + c) - 1)/4;
  else
    p = (b + c)*2 + d;
NOTES:
1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.
2.4.1 Maximum performance
What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports?
Answer:
Optimizations:
Multiplication by a constant power of 2 can be done without hardware; just connect the wires between the signals. For example, if we have a <= b*2;, we can do this with a(0) <= '0'; a(1) <= b(0); a(2) <= b(1); etc.
Testing if a signal is odd or even can be done simply by extracting the least significant bit of the signal.
[Figure: Data flow for odd case; labels b, c, d, 1]
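The two optimizations above can be sketched in VHDL as pure wiring. This is an illustrative sketch only: the entity shift_and_odd and its port names are hypothetical, and the top bit of b is simply dropped, which matches the assumption that no overflow occurs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, named only for this illustration.
entity shift_and_odd is
  port (
    b      : in  signed(7 downto 0);
    times2 : out signed(7 downto 0);  -- b*2, assuming no overflow
    is_odd : out std_logic            -- '1' when b is odd
  );
end shift_and_odd;

architecture main of shift_and_odd is
begin
  -- Multiply by 2: every bit of b moves up one position and a '0'
  -- enters at bit 0. No gates are needed, only wires.
  times2 <= b(6 downto 0) & '0';

  -- Odd test: the least significant bit is '1' exactly when b is odd
  -- (this also holds for negative two's complement values).
  is_odd <= b(0);
end main;
```

Neither assignment needs an ALU, multiplier, or divider, so neither adds clock cycles to the schedule.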
[Figure: Data flow for even case; labels b, c]
Even flow requires 4 clock cycles (3 cycles in the datapath plus one more because we have to have flops on both inputs and outputs). Therefore the total design will require 4 clock cycles.
What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?
Answer:
clock cycles    4
ALUs            2
dividers        0
multipliers     0
[Figure: Dataflow for entire circuit; labels c, -1, xor, and]
2.4.2 Minimum area
What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?
Answer:
8b regs         3
6b regs         0
4b regs         0
1b regs         0
clock cycles    5
[Figure: dataflow diagram; labels a, d, -1]
2.5 Design and Optimization
Design a circuit that performs the following operation: P = (a+d) + ((b - c) - 1) Optimize your design for area.
Answer:
VHDL code for implementing: P = (a+d) + ((b-c)-1)
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fsm1 is
  port (
    in1 : in signed(3 downto 0);
    in2 : in signed(3 downto 0);
    clk : in std_logic;
    p   : out signed(4 downto 0)
  );
end fsm1;

architecture fsm1_arch of fsm1 is
  signal add_sel, sub_sel : std_logic;
  signal add1, add2, sub1, sub2, r1, r2 : signed(4 downto 0);
begin
  fsm : process begin
    wait until rising_edge(clk);
    add_sel <= '-'; sub_sel <= '1';
    wait until rising_edge(clk);
    add_sel <= '1'; sub_sel <= '0';
    wait until rising_edge(clk);
    add_sel <= '0'; sub_sel <= '-';
  end process;
  reg : process begin
    wait until rising_edge(clk);
    r1 <= sub1 - sub2;
    r2 <= add1 + add2;
  end process;
  -- concurrent statements
  add1 <= ('0' & in1) when (add_sel = '1') else r1;
  add2 <= ('0' & in2) when (add_sel = '1') else r2;
  sub1 <= ('0' & in1) when (sub_sel = '1') else r1;
  sub2 <= ('0' & in2) when (sub_sel = '1') else to_signed(1,5);
  p <= r2;
end fsm1_arch;
SOL-04: Memory Design

2.6 Dataflow Diagrams with Memory Arrays

Component                                   Delay
Register                                    5 ns
Adder                                       25 ns
Subtracter                                  30 ns
ALU with [symbols missing], AND, XOR        40 ns
Memory read                                 60 ns
Memory write                                60 ns
Multiplication                              65 ns
2:1 Multiplexor                             5 ns

NOTES:
1. The inputs of the algorithms are a and b.
2. The outputs of the algorithms are p and q.
3. You must register both your inputs and outputs.
4. You may choose to read your input data values at any time and produce your outputs at any time.
5. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port. M supports synchronous write and asynchronous read.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.
2.6.1 Algorithm 1
  q = M[b];
  M[a] = b;
  p = (M[b-1] * b) + M[b];
Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution time.
Answer:
1. a > b means that a ≠ b - 1, therefore can do M[b-1] read in parallel with M[a] write or with M[b] read.
2. But, could have a = b, so can't do M[a] write in parallel with M[b] read.
3. Initial dataflow diagram:
[Figure: initial dataflow diagram; labels M, a, b, 1, M(wr), M(rd), M(rd)]
4. Find the critical path
[Figure: dataflow diagram annotated with delays (25 ns, 60 ns, 60 ns, 65 ns); path marked 150 ns]
Critical path is from b to p: 150 ns.
5. Explore performance with different clock periods
[Figure: dataflow diagram scheduled into clock cycles, annotated with delays (25 ns, 60 ns, 65 ns, and 5 ns registers)]
period    latency     time
70 ns     4 cycles    280 ns
[Figure: second schedule of the same dataflow diagram; its period, latency, and time entries did not survive in the source]
6. Minimum latency is 3 clock cycles, because we can't do all memory operations in parallel and need registers on both inputs and outputs.
7. Best performance is with clock period of 90 ns.
8. Resource usage:
Component         Quantity
Input             2
Output            2
Register          5 (including mem array)
Adder             1
Memory read       2
Memory write      1
Multiplication    1
Clock Period      90 ns
Latency           3 cycles
Execution Time    270 ns
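The execution times quoted for the two schedules follow from multiplying the clock period by the latency in cycles. As a worked check, using only the figures given above (the symbol names are introduced here for illustration):

```latex
\[
T_{\text{exec}} = T_{\text{clk}} \times N_{\text{cycles}}
\]
\[
70\,\text{ns} \times 4 = 280\,\text{ns},
\qquad
90\,\text{ns} \times 3 = 270\,\text{ns}
\]
```

The longer clock period wins here because it saves a whole cycle: the 20 ns added to each of three cycles costs less than the fourth 70 ns cycle it removes.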
2.6.2 Algorithm 2
  q = M[b];
  M[a] = b;
  p = (M[b-1] * b) + M[b];
Assuming a b, draw a dataflow diagram that is optimized for the fastest overall execution time.
Answer:
1. The assumption on a and b means that a ≠ b and a ≠ b-1, so there are no memory address conflicts to create dependencies and complications.
2. Explore performance with different clock periods
[Figures: two schedules of the dataflow diagram with a subtracter (30 ns), two M(rd) operations (60 ns each), M(wr), and further delays of 65 ns, 25 ns, and 5 ns registers; the period, latency, and time entries did not survive in the source]
3. Without going to a triple-ported memory, we can't reduce latency below 3.
4. Area optimization: change b - 1 to b + (-1).
[Figure: dataflow diagram after the area optimization, with b + (-1) computed by a 25 ns adder]
5. Resource usage:
Component         Quantity
Input             2
Output            1
Register          5 (including mem array)
Adder             1
Memory read       2
Memory write      1
Multiplication    1
Clock Period      90 ns
Latency           3 cycles
Execution Time    270 ns
SOL-05: Optimization and FPGA Implementation

2.7 2-bit adder
This question compares FPGA and generic-gate implementations of a 2-bit full adder.
2.7.1 Generic Gates
Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.
2.7.2 Xilinx FPGA
Show the CLB implementation of a 2 bit adder in a Xilinx Spartan XCS10 FPGA by drawing the schematic of a CLB and showing the equations for the lookup tables.
2.8 Sketches of Problems
1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)
2. calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))
3. given a dataflow diagram, calculate the clock period that will result in the optimum performance
4. given an algorithm, design a dataflow diagram
5. given a dataflow diagram, design the datapath and finite state machine
6. optimize a dataflow diagram to improve performance or reduce resource usage
7. given an fsm diagram, pick VHDL code that best implements the diagram (correct behaviour, simple, fast hardware) or critique hardware
Chapter 3
Functional Validation Problems
SOL-06: Functional Validation

3.1 Functional Validation Problems
3.1.1 Carry Save Adder
1. Functionality: Briefly describe the functionality of a carry-save adder.
2. Testbench: Write a testbench for a 16-bit combinational carry save adder.
3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.
NOTES:
(a) You do not need to support pipelined adders.
(b) VHDL generics might be useful.
3.1.2 Traffic Light Controller
1. Functionality: Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence of cars.
Answer:
Given a normal traffic light, which spends a constant amount of time as green in each direction, add the following two transitions to the system:
(a) If the less-busy road does not have any cars present for t1 minutes, transition the traffic light to make the busier of the two roads green.
(b) If the busy road has a car waiting for t2 minutes, transition the traffic light to make the busier of the two roads green.
2. Boundary Conditions: Make a list of boundary conditions to check for your traffic light controller.
Answer:
(a) A car arrives at the intersection and triggers the sensor, but makes a right turn before the light turns green in its direction. Should the light turn to green in the direction of the now vacant road, or stay green in the current direction?
(b) Same as (a), but the car makes a right turn after the other road already has a yellow light. Should the light turn to green in the direction of the now vacant road, or transition from yellow back to green, or very briefly stay green in the vacant direction?
(c) If the less-busy road is yellow, there's no car at the busy road, and a car arrives at the less-busy road. Same questions as the first two situations.
3. Assertions: Make a list of assertions to check for your traffic light controller.
Answer:
(a) if a light is green, the next colour will be yellow
(b) if a light is yellow, the next colour will be red
(c) if a light is red, the next colour will be green
(d) if no car has been at the less-busy road for at least t1 minutes then the less-busy road is red.
(e) if the car sensor has been continuously on for the busy road for at least t2 minutes then the busy road is green.
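Assertions (a) through (c) can be sketched as a small simulation-only checker. This is an illustration under stated assumptions: the entity light_checker, the one-hot signals green, yellow, and red for a single road, and the per-clock sampling are all hypothetical. The timer-based assertions (d) and (e) are left out because they depend on how t1 and t2 are measured.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical checker for one road's light; names are assumptions.
entity light_checker is
  port (
    clk                : in std_logic;
    green, yellow, red : in std_logic  -- one-hot colour of the light
  );
end light_checker;

architecture main of light_checker is
  -- Colour observed on the previous clock edge.
  signal prev_green, prev_yellow, prev_red : std_logic := '0';
begin
  process (clk) begin
    if rising_edge(clk) then
      -- (a) green may only stay green or change to yellow
      if prev_green = '1' then
        assert (green = '1' or yellow = '1')
          report "green was not followed by yellow" severity error;
      end if;
      -- (b) yellow may only stay yellow or change to red
      if prev_yellow = '1' then
        assert (yellow = '1' or red = '1')
          report "yellow was not followed by red" severity error;
      end if;
      -- (c) red may only stay red or change to green
      if prev_red = '1' then
        assert (red = '1' or green = '1')
          report "red was not followed by green" severity error;
      end if;
      prev_green  <= green;
      prev_yellow <= yellow;
      prev_red    <= red;
    end if;
  end process;
end main;
```

Because a colour normally lasts many clock cycles, each check allows the light to hold its colour and only flags a change to the wrong next colour.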
3.1.3 State Machines and Validation
1. Three Different State Machines
[Figure 3.1: A very simple machine (states s0 through s3)]
[Figure 3.2: A very big machine (states s0 through s9)]
[Figure 3.3: A concurrent machine (states s0 through s2 and q0 through q4)]
[Figure 3.4: Legend. Transitions are labelled input/output; * = don't care]
Answer each of the following questions for the three state machines in Figures 3.1 through 3.3.
(a) How many test scenarios (sequences of test vectors) would you need to fully validate the behaviour of the state machine?
(b) What is the maximum length (number of test vectors) in a test scenario for the state machine?
(c) Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?
Answer:
Fig 3.1: max len 4, min flops 2
  scenario   sequence   expected behaviour
  1)         000        s0, s2, s3, s0
  2)         001        s0, s2, s3, s0
  3)         010        s0, s2, s3, s0
  4)         011        s0, s2, s3, s0
  5)         1000       s0, s1, s2, s3, s0
  6)         1001       s0, s1, s2, s3, s0
  ...
  12)        1111       s0, s1, s2, s3, s0
Fig 3.2: max len 10
  scenario   sequence     expected behaviour
  1)         0000000000   s0, s1, s2 ..., s9, s0
  2)         0000000001   s0, s2, s2 ..., s9, s0
  1024)      1111111111   s0, s1, s2 ..., s9, s0
Fig 3.3: max len 15, min flops 5 or 4
  scenario   sequence   expected behaviour
  1)         0...00     (s0,q0), (s1,q1), (s2,q2), (s0,q3), (s1,q4), (s2,q0), (s0,q1), (s1,q2), (s2,q3), (s0,q4), (s1,q0), (s2,q1), (s0,q2), (s1,q3), (s2,q4), (s0,q0)
  2)         0...01     same behaviour
  2^15)      1..11      same behaviour
For Fig 3.3, if we implement each machine separately we need 5 flops, 2 for the S machine and 3 for the Q machine. If we merge the state machines, we need ceil(log2(3 × 5)) = 4 flops.
One of the purposes of this exercise is to illustrate how many test vectors it requires to exhaustively test the behaviour of even simple circuits. Also, this demonstrates how the structure of a circuit affects the number of test vectors needed. Size alone is not the determining factor.
2. State Machines in General
If a circuit has n signals of 1 bit each that are the outputs of flip-flops and m 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of states that the circuit can have?
Answer:
The maximum number of states for a circuit with n flops is 2^n. The values of combinational signals are determined by the flops and the inputs, and so they don't contribute to the total number of states.
3.1.4
Additional Problem
3.1.5
Test Plan Creation
You're on the functional validation team for a chip that will control a simple portable CD-player. Your task is to create a plan for the functional validation for the signals in the entity cd_digital. You've been told that the player behaves just like all of the other CD players out there. If your test plan requires knowledge about any potential nonstandard features or behaviour, you'll need to document your assumptions.

[Figure: CD player with display fields track, min, sec and buttons prev, stop, play, next, pwr]
entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev, stop, play, next, pwr : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    open : in std_logic;
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;
3.1.5.1 Early Tests

Describe five tests that you would run as soon as the VHDL code is simulatable. For each test, describe your specification, stimulus, and check. Summarize why your collection of tests should be the first tests that are run.
Answer:

test1
  specification: when power is turned on, the display will show the number of tracks on the CD, and the minutes and seconds will show the total length of the CD.
  stimulus: power=0; wait; power=1, all other signals are 0.
  check: display outputs of circuit match specification

test2
  specification: when power is on, play starts CD playing, display for track=1, min and sec show remaining time for song and start decrementing.
  stimulus: power=1; play=0; wait; play=1, all other signals are 0.
  check: display outputs of circuit match specification

test3
  specification: when power is on and CD is playing, next starts next song. Display for track increments, min and sec show remaining time for next song and start decrementing.
  stimulus: power=1; play=0; next=0; wait; play=1; wait; next=1, all other signals are 0.
  check: display outputs of circuit match specification

test4
  specification: when power is on and CD is playing, prev starts previous song. Display for track decrements, min and sec show remaining time for previous song and start decrementing.
  stimulus: power=1; play=0; prev=0; wait; play=1; wait; prev=1, all other signals are 0.
  check: display outputs of circuit match specification

test5
  specification: when power is on and CD is playing, stop causes CD to stop.
  stimulus: power=1; play=0; stop=0; wait; play=1; wait; stop=1, all other signals are 0.
  check: display outputs of circuit match specification

justification for choices: These cases test the basic operations of the CD player. Each test focusses on a different aspect of the player's behaviour.
3.1.5.2 Corner Cases

Describe five corner cases or boundary conditions, and explain the role of corner cases and boundary conditions in functional validation.
NOTES:
1. You may reference your answer for question 3.1.5.1 in this question.
2. If you do not know what a corner case or boundary condition is, you may earn partial credit by checking this box and explaining five things that you would do in functional validation.

Answer:
case 1: press both prev and next while a CD is playing
case 2: open the case while a CD is playing
case 3: press play and stop at the same time
case 4: press any button other than power when the player is off
case 5: press next repeatedly until track counter wraps around
role of corner cases: The purpose of corner cases is to test unusual situations that designers might not have thought of, and that are therefore more likely to contain bugs than normal behaviour.
Chapter 4
Performance Analysis and Optimization Problems
SOL-07: Performance Analysis and Optimization

Lecture Notes Sections: 4–4.7.3
4.1
Farmer
A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard to the market.

Facts:
                                         big truck    small truck
capacity of truck                        12 tonnes    6 tonnes
speed when loaded with apples            15 kph       30 kph
speed when unloaded (no apples)          38 kph       70 kph

distance to market    120 km
amount of apples      85 tonnes

NOTES:
1. All of the loads of apples must be carried using the same truck.
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after the last load.
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels at either its fully loaded or its empty speed.
Question: Which truck will result in the least elapsed time and what percentage faster will the elapsed time be?
Answer:
$\text{NumTrips} = \lceil \text{Harvest} / \text{Capacity} \rceil$
All trips are for the same distance, so distance cancels out of the equations: $\text{Time} \propto 1 / \text{Speed}$
$\text{TimeTot} = \text{NumTrips} \times (\text{TimeLoaded} + \text{TimeUnloaded})$
$\text{TimeTot}_{\text{Big}} = \lceil 85/12 \rceil \times (1/15 + 1/38) = 8 \times 0.0930 = 0.7439$
$\text{TimeTot}_{\text{Small}} = \lceil 85/6 \rceil \times (1/30 + 1/70) = 15 \times 0.0476 = 0.7143$
The small truck will take less time.
$\text{PctFaster} = (\text{TimeSlow} - \text{TimeFast}) / \text{TimeFast} = (\text{TimeTot}_{\text{Big}} - \text{TimeTot}_{\text{Small}}) / \text{TimeTot}_{\text{Small}} = (0.7439 - 0.7143) / 0.7143 = 4.14\%$

Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it? If not, explain.

Answer:
Use two drivers.
Use a combination of the small truck and large truck to improve his utilization.
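The elapsed-time comparison can be checked numerically. The sketch below is one way to do it in Python; the function and variable names are illustrative, and it keeps the 120 km distance in the calculation, so the times come out in hours rather than in the distance-normalized units used above:

```python
from math import ceil

def total_time_hours(harvest_t, capacity_t, loaded_kph, unloaded_kph, distance_km):
    """Elapsed time for all round trips, ignoring loading and unloading."""
    trips = ceil(harvest_t / capacity_t)              # a partial load still needs a trip
    return trips * (distance_km / loaded_kph + distance_km / unloaded_kph)

big = total_time_hours(85, 12, 15, 38, 120)
small = total_time_hours(85, 6, 30, 70, 120)
pct_faster = (big - small) / small * 100              # relative to the faster option

print(f"big truck: {big:.1f} h, small truck: {small:.1f} h")
print(f"small truck is {pct_faster:.2f}% faster")
```

Because both trucks cover the same distance, dropping the distance factor (as the derivation does) changes the absolute times but not the percentage.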
4.2
Network and Router
The BigLan network protocol runs at a data rate of 160 Mbps (Mega bits per second). Each BigLan packet contains 100 Bytes of routing information and 1000 Bytes of data. You are working on the DataChopper router, which has the following performance numbers:

clock speed                                                               75 MHz
number of clock cycles to process the routing information for a packet    500
CPI for a byte of data                                                    4
4.2.1
Maximum Throughput
Which has a higher maximum throughput (as measured in bits per second), the network or your router, and how much faster is it?

Answer: The maximum data throughput of the two technologies in terms of bits can be calculated as follows:

1. BigLan Network Protocol
Maximum data throughput = 160 Mbps * (8000 data bits / 8800 packet bits)
                        = 145.45 Mbps

2. DataChopper Router
Time required for a packet = 500 clock cycles + 0.5 CPI per data bit * 8800 packet bits
                           = 500 clock cycles + 4400 clock cycles
                           = 4900 clock cycles
                           = 4900 clock cycles * 13.33 ns per cycle
                           = 65333 ns per packet
Time required for a data bit = 65333 ns per packet / 8000 data bits
                             = 8.167 ns per data bit
Maximum data throughput = 1 / 8.167 ns per data bit
                        = 122.45 Mbps

The network has a higher maximum throughput. What percentage higher?
n% higher performance = (perf high - perf low) / perf low
                      = (145 - 122) / 122
                      = 19%
The network has 19% higher maximum performance.
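The throughput comparison above can be reproduced with a few lines of Python. This is a sketch using the figures from the problem statement; the constant and variable names are illustrative, and the per-byte cost is applied to all packet bytes, as in the worked answer:

```python
# Figures from the problem statement
NETWORK_MBPS = 160.0     # BigLan raw data rate
ROUTING_BYTES = 100      # routing information per packet
DATA_BYTES = 1000        # data per packet
CLOCK_MHZ = 75.0         # DataChopper clock speed
ROUTING_CYCLES = 500     # cycles to process the routing information
CYCLES_PER_BYTE = 4      # CPI for a byte (0.5 cycles per bit)

packet_bits = 8 * (ROUTING_BYTES + DATA_BYTES)    # 8800 bits
data_bits = 8 * DATA_BYTES                        # 8000 bits

# Network: only the data fraction of each packet counts as useful throughput
network_mbps = NETWORK_MBPS * data_bits / packet_bits

# Router: cycles per packet, then data bits per cycle times MHz gives Mbps
cycles_per_packet = ROUTING_CYCLES + (CYCLES_PER_BYTE / 8) * packet_bits
router_mbps = data_bits * CLOCK_MHZ / cycles_per_packet

pct_higher = (network_mbps - router_mbps) / router_mbps * 100
print(f"network {network_mbps:.2f} Mbps, router {router_mbps:.2f} Mbps, "
      f"network is {pct_higher:.0f}% higher")
```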
4.2.2
Packet Size and Performance
Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process).
Answer:
As packet size increases, the overhead associated with the constant routing delay will become less significant. The data rate of the router will slowly approach that of the network, but it will never surpass the network throughput. If there were no overhead for routing, the peak data rate for the router would be 150 Mbps, compared to 160 Mbps for the network.
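To see the trend described above, the sketch below evaluates both data rates for growing packet sizes. It assumes that the routing information stays at 100 Bytes and that only the data portion grows; the function names are illustrative:

```python
ROUTING_BYTES = 100      # fixed routing information per packet (assumed constant)
ROUTING_CYCLES = 500     # fixed routing cost in clock cycles
CYCLES_PER_BYTE = 4
CLOCK_MHZ = 75.0
NETWORK_MBPS = 160.0

def router_data_mbps(data_bytes: int) -> float:
    """Router data throughput for a packet carrying data_bytes of data."""
    cycles = ROUTING_CYCLES + CYCLES_PER_BYTE * (ROUTING_BYTES + data_bytes)
    return 8 * data_bytes * CLOCK_MHZ / cycles

def network_data_mbps(data_bytes: int) -> float:
    """Network data throughput for the same packet."""
    return NETWORK_MBPS * data_bytes / (ROUTING_BYTES + data_bytes)

for size in (1000, 10_000, 100_000, 1_000_000):
    print(size, round(router_data_mbps(size), 2), round(network_data_mbps(size), 2))
```

The router figure climbs toward 75 MHz / 0.5 cycles per bit = 150 Mbps, while the network figure climbs toward 160 Mbps, so the router stays below the network at every packet size.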
4.3
Performance Short Answer
If performance doubles every two years, by what percentage does performance go up every month?
Answer:
Performance after $t$ months is proportional to $2^{t/24}$, where $t$ is measured in months.
After one month: $2^{1/24} = 1.029$
Therefore, performance goes up by 2.9% each month.
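A quick numerical check of the monthly growth rate, as a small Python sketch (the variable names are illustrative):

```python
MONTHS_TO_DOUBLE = 24

monthly_factor = 2 ** (1 / MONTHS_TO_DOUBLE)     # growth factor for one month
monthly_pct = (monthly_factor - 1) * 100

print(f"{monthly_factor:.3f} -> {monthly_pct:.1f}% per month")

# Compounding the monthly factor for 24 months recovers the doubling
two_year_factor = monthly_factor ** MONTHS_TO_DOUBLE
```

Note that the monthly rate is lower than 100% / 24 = 4.2%, because the growth compounds.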
4.4
Microprocessors
The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction. Multiplies must be written in software using loops of shifts and adds. The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4. A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.
4.4.1
Average CPI
Question: What is the average CPI for the Y!v1? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

Answer:
Use the following subscripts:
1   Yme
2   Y!v1
3   Y!u2

The Yme is 10% faster than the Y!v1:
$(\text{Time}_2 - \text{Time}_1) / \text{Time}_1 = 0.10$, so $\text{Time}_2 = 1.10 \times \text{Time}_1$
$\text{Time} = \text{NumInst} \times \text{CPI} / \text{ClockSpeed}$
$\text{NumInst}_2 \times \text{CPI}_2 / \text{ClockSpeed}_2 = 1.10 \times \text{NumInst}_1 \times \text{CPI}_1 / \text{ClockSpeed}_1$
Both processors support the same instructions, so $\text{NumInst}_2 = \text{NumInst}_1$.
Solve for $\text{CPI}_2$:
$\text{CPI}_2 = 1.10 \times (\text{ClockSpeed}_2 / \text{ClockSpeed}_1) \times \text{CPI}_1 = 1.10 \times (150\,\text{MHz} / 200\,\text{MHz}) \times 4 = 3.3$

Common mistakes:
Swapping performance of Yme and Y!v1.

A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of Y!u2 is 30% better than that of the Y!v1.
4.4.2
Why not you too?
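Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

Answer:
The Y!u2 is 30% faster than the Y!v1:
$\text{Time}_2 = 1.3 \times \text{Time}_3$
$\text{NumInst}_2 \times \text{CPI}_2 / \text{ClockSpeed}_2 = 1.3 \times \text{NumInst}_3 \times \text{CPI}_3 / \text{ClockSpeed}_3$
The multiply instruction eliminates 10% of the instructions: $\text{NumInst}_3 = 0.9 \times \text{NumInst}_2$
Solve for $\text{CPI}_3$:
$\text{CPI}_3 = \frac{1}{1.3} \times \frac{\text{ClockSpeed}_3}{\text{ClockSpeed}_2} \times \frac{\text{NumInst}_2}{\text{NumInst}_3} \times \text{CPI}_2 = \frac{1}{1.3} \times \frac{180\,\text{MHz}}{150\,\text{MHz}} \times \frac{1}{0.9} \times 3.3 = 3.38$

Common mistakes:
Comparing performance of Y!u2 to Yme, rather than Y!v1.
Saying that time for Y!u2 is 70% of Y!v1.
Forgetting to take into account reduced number of instructions.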
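Both CPI calculations rest on the same relation, Time = NumInst × CPI / ClockSpeed. The Python sketch below solves that relation for an unknown CPI; the helper `solve_cpi` and its parameter names are illustrative, not something from the original notes:

```python
def solve_cpi(ref_cpi, ref_mhz, mhz, time_ratio, inst_ratio=1.0):
    """CPI of a processor relative to a reference processor.

    time_ratio: run time divided by the reference run time
    inst_ratio: instruction count divided by the reference instruction count
    Derived from NumInst * CPI / ClockSpeed = time_ratio * reference time.
    """
    return time_ratio * ref_cpi * (mhz / ref_mhz) / inst_ratio

# Y!v1 vs Yme: Yme is 10% faster, so the Y!v1 takes 1.10x the Yme time
cpi_yv1 = solve_cpi(ref_cpi=4, ref_mhz=200, mhz=150, time_ratio=1.10)

# Y!u2 vs Y!v1: 30% faster (time ratio 1/1.3) with 10% fewer instructions
cpi_yu2 = solve_cpi(ref_cpi=cpi_yv1, ref_mhz=150, mhz=180,
                    time_ratio=1 / 1.3, inst_ratio=0.9)

print(f"CPI Y!v1 = {cpi_yv1:.2f}, CPI Y!u2 = {cpi_yu2:.2f}")
```

Passing a time ratio of 0.7 instead of 1/1.3 reproduces the common mistake listed above and gives a noticeably different CPI.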
4.4.3
Analysis
Question: Which of the following do you think is most likely, and why?

1. the Y!u2 is basically the same as the Y!v1 except for the multiply
2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction
3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction

Answer: The most likely analysis is that the Y!u2 is basically the same as the Y!v1 except for the multiply. The Y!u2 has a slightly larger CPI than the Y!v1, which is in keeping with the addition of a multiply instruction: a multiply instruction probably has a larger-than-average CPI. The increase in clock speed likely comes from a new fabrication process, and would not have required significant changes to the design of the chip.
4.5
Dataflow Diagram Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
you may change the times when signals are read from the environment
you may not increase the resource usage (input ports, registers, output ports, f components, g components)
you may not increase the clock period

[Figure: dataflow diagram with inputs a, b, c, d, e and f and g components, shown "Before Optimization" and "After Optimization"]
4.6
Optimization with Memory Arrays
This question deals with the implementation and optimization for the algorithm and library of circuit components shown below.

Algorithm:
  q = M[b];
  if (a > b) then
    M[a] = b;
    p = (M[b-1] * b) + M[b];
  else
    M[a] = b;
    p = M[b+1] * a;
  end;

Component                      Delay
Register                       5 ns
Adder                          25 ns
Subtracter                     30 ns
ALU with …, AND, XOR           40 ns
Memory read                    60 ns
Memory write                   60 ns
Multiplication                 65 ns
2:1 Multiplexor                5 ns

NOTES:
1. 25% of the time, a > b.
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and outputs.
5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.
8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p.
Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time. NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.
Answer:

a > b (25%)                    a ≤ b (75%)
q = M[b];                      q = M[b];
M[a] = b;                      M[a] = b;
p = (M[b-1] * b) + M[b];       p = M[b+1] * a;

1. a ≤ b happens 75% of the time, so initially focus on the common case a ≤ b.
(a) a ≤ b means that a ≠ b+1, therefore can do M[b+1] read in parallel with M[a] write or with M[b] read.
(b) But, could have a = b, so can't do M[a] write in parallel with M[b] read.

[Figure: dataflow diagram for the a ≤ b case; the path from b to p passes through an adder (25 ns), a memory read (60 ns), and a multiplier (65 ns), for a total of 150 ns]

(c) Critical path is from b to p: 150 ns + 5 ns for mux on p = 155 ns.
(d) Longest operation in diagram is multiplication: 65 ns.
(e) Minimum clock period is 65 ns + 5 ns for register = 70 ns.

[Figure: three dataflow diagrams for the a ≤ b case, one for each clock period in the table below]

period     70 ns       75 ns       90 ns
latency    5 cycles    4 cycles    3 cycles
time       350 ns      300 ns      270 ns

(f) Minimum latency is 3 clock cycles, because can't do all memory operations in parallel and need registers on both inputs and outputs.
(g) Best overall performance for a ≤ b case is with clock period of 90 ns.

2. Now try a > b with 90 ns clock period.
(a) a > b means that a ≠ b and a ≠ b-1, so no memory address conflicts to create dependencies and complications.

[Figure: two dataflow diagrams for the a > b case, one for each clock period in the table below]

period     90 ns       95 ns
latency    4 cycles    3 cycles
time       360 ns      285 ns

(b) Without going to a triple-ported memory, can't reduce latency below 3.
(c) Best performance for a > b case is with clock period of 95 ns.

3. Choose 95 ns clock period, which gives a latency of 3 clock cycles for both options.
4. Optimize dataflow diagrams to reduce area without sacrificing performance.

[Figure: area-optimized dataflow diagrams for the a ≤ b and a > b cases]
5. Merge dataflow diagrams.

[Figure: merged dataflow diagram. Optimal performance (Period = 95 ns)]

Component                  Quantity
Input                      2
Output                     2
Register                   5
Adder                      1
Subtracter                 1
ALU                        0
Memory read                2
Memory write               1
Multiplication             1
2:1 Multiplexor            2
Clock Period               95 ns
Average Latency            3 cycles
Average Execution Time     285 ns

[Figure: merged dataflow diagram. Suboptimal area (two multipliers)]

[Figure: merged dataflow diagram. Suboptimal performance (Period = 100 ns)]
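The clock-period choice in steps 1 to 3 comes down to comparing period × latency for each case and then weighting by how often each case occurs. The Python sketch below redoes that arithmetic with the period/latency pairs from the two tables; the weighted average for the 90 ns option is a derived figure that the original answer does not state explicitly, and all names are illustrative:

```python
P_A_GT_B = 0.25                       # a > b happens 25% of the time

# (clock period in ns) -> latency in cycles, taken from the tables in the answer
a_le_b = {70: 5, 75: 4, 90: 3}
a_gt_b = {90: 4, 95: 3}

def exec_time_ns(period_ns: int, latency_cycles: int) -> int:
    return period_ns * latency_cycles

times_le = {p: exec_time_ns(p, lat) for p, lat in a_le_b.items()}
times_gt = {p: exec_time_ns(p, lat) for p, lat in a_gt_b.items()}

# Average execution time if the whole design runs at 90 ns
avg_90 = (1 - P_A_GT_B) * times_le[90] + P_A_GT_B * times_gt[90]
# At 95 ns both cases take 3 cycles (step 3 of the answer)
avg_95 = exec_time_ns(95, 3)

print(times_le, times_gt, avg_90, avg_95)
```

Under these numbers the 95 ns clock wins on average even though 90 ns is best for the common case alone, because the extra cycle in the a > b case costs more than the 5 ns per cycle saved.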
4.7
Multiply Instruction
You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip. If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed and will raise the cost.

[Figure: block diagrams of the FPGA option and the FPGA + MULT option]

                                      FPGA option    FPGA + MULT option
average CPI                           5              ???
% of instrs that are multiplies       10%            10%
CPI of multiply                       20             6
Clock speed                           200 MHz        160 MHz
Cost                                  $20            $23
4.7.1
Highest Performance
Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?
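Answer:
MIPs for FPGA option:
$\text{MIPs}_{\text{FPGA}} = \text{MHz}_{\text{FPGA}} / \text{CPI}_{\text{FPGA}} = 200 / 5 = 40$

Find MIPs for FPGA+MULT option:
$\text{MIPs}_{\text{FM}} = \text{MHz}_{\text{FM}} / \text{CPI}_{\text{FM}}$

Find CPI for FPGA+MULT option:
$\text{CPI}_{\text{FM}} = \text{Pct}_{\text{mult}} \times \text{CPI}_{\text{FM,mult}} + \text{Pct}_{\text{other}} \times \text{CPI}_{\text{other}}$

Find CPI for non-multiply (other) instructions:
$\text{CPI}_{\text{FPGA}} = \text{Pct}_{\text{mult}} \times \text{CPI}_{\text{FPGA,mult}} + \text{Pct}_{\text{other}} \times \text{CPI}_{\text{other}}$
$\text{CPI}_{\text{other}} = (\text{CPI}_{\text{FPGA}} - \text{Pct}_{\text{mult}} \times \text{CPI}_{\text{FPGA,mult}}) / \text{Pct}_{\text{other}} = (5 - 0.1 \times 20) / 0.9 = 3.333$

$\text{CPI}_{\text{FM}} = 0.1 \times 6 + 0.9 \times 3.333 = 3.6$
$\text{MIPs}_{\text{FM}} = 160 / 3.6 = 44.4$

$\text{MIPs}_{\text{FM}} > \text{MIPs}_{\text{FPGA}}$, therefore the FPGA+MULT is the higher performance option.
$(\text{MIPs}_{\text{FM}} - \text{MIPs}_{\text{FPGA}}) / \text{MIPs}_{\text{FPGA}} = (44.4 - 40) / 40 = 11.1\%$
The FPGA+MULT option is 11% faster than the FPGA option.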
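The MIPs comparison can be recomputed step by step. The Python sketch below follows the same order as the derivation (CPI of the other instructions, then CPI and MIPs of the FPGA+MULT option); the variable names are illustrative:

```python
PCT_MULT = 0.10                      # fraction of instructions that are multiplies
PCT_OTHER = 1 - PCT_MULT

# FPGA option
cpi_fpga, cpi_mult_fpga, mhz_fpga = 5.0, 20.0, 200.0
mips_fpga = mhz_fpga / cpi_fpga

# CPI of the non-multiply instructions, which is the same for both options
cpi_other = (cpi_fpga - PCT_MULT * cpi_mult_fpga) / PCT_OTHER

# FPGA+MULT option
cpi_mult_fm, mhz_fm = 6.0, 160.0
cpi_fm = PCT_MULT * cpi_mult_fm + PCT_OTHER * cpi_other
mips_fm = mhz_fm / cpi_fm

pct_faster = (mips_fm - mips_fpga) / mips_fpga * 100
print(f"CPI_other={cpi_other:.3f} CPI_FM={cpi_fm:.1f} "
      f"MIPs_FM={mips_fm:.1f} ({pct_faster:.1f}% faster)")
```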
4.7.2
Optimality
Which option, FPGA or FPGA+MULT, is more optimal (as measured in MIPs/$), and what percentage more optimal is the more optimal option?
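Answer:
$\text{opt}_{\text{FPGA}} = \text{MIPs}_{\text{FPGA}} / \text{Price}_{\text{FPGA}} = 40 / 20 = 2$
$\text{opt}_{\text{FM}} = \text{MIPs}_{\text{FM}} / \text{Price}_{\text{FM}} = 44.4 / 23 = 1.93$
$(\text{opt}_{\text{FM}} - \text{opt}_{\text{FPGA}}) / \text{opt}_{\text{FPGA}} = (1.93 - 2) / 2 = -0.034$
The FPGA+MULT option delivers fewer MIPs/$ than the FPGA option: it is 3.4% less optimal. Equivalently, the FPGA option is the more optimal one, by about 3.5% relative to the FPGA+MULT option.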
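A short Python check of the MIPs/$ figures, using the MIPs values from section 4.7.1 and the prices from the problem statement. The variable names are illustrative; the sign of the relative difference is what decides which option is more optimal:

```python
mips_fpga, price_fpga = 200 / 5, 20.0        # 40 MIPs for $20
mips_fm, price_fm = 160 / 3.6, 23.0          # 44.4 MIPs for $23

opt_fpga = mips_fpga / price_fpga            # MIPs per dollar
opt_fm = mips_fm / price_fm

# Relative difference of FPGA+MULT with respect to FPGA (negative means worse)
fm_vs_fpga = (opt_fm - opt_fpga) / opt_fpga
# Relative advantage of the better option over the worse one
fpga_vs_fm = (opt_fpga - opt_fm) / opt_fm

print(f"opt_FPGA={opt_fpga:.2f} opt_FM={opt_fm:.2f} "
      f"FM vs FPGA={fm_vs_fpga:+.3f} FPGA vs FM={fpga_vs_fm:+.3f}")
```

The extra $3 for the multiplier chip (15% more cost) outweighs the 11% gain in MIPs, so the plain FPGA option comes out ahead on this metric.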
4.7.3
Performance Metrics
Explain whether MIPs is a good choice for the performance metric when making this decision.
Answer:
MIPs is a good metric for this example, because we are comparing two microprocessors that use the same instruction set and will be used in the same environment. In general, the disadvantage of MIPs is that it doesn't take into account that different instructions accomplish different amounts of work. This causes problems when comparing microprocessors that use different instruction sets (e.g. one with a cosine instruction and one without).
Chapter 5
Timing Analysis Problems
SOL-08: Timing Analysis

5.1
Terminology
Assume that the timing diagram shows the limits of the allowed times (either minimum or maximum). For each of the terms in the table below, answer which time periods (one or more of t1–t9 or NONE) are examples of the term.

[Figure: timing diagram for clk1, clk2, a, and b, with labelled intervals t1 through t11; the legend distinguishes "signal is stable" from "signal may change"]

clock skew
clock period
setup time
hold time
5.2
Critical Path and False Path
Find the critical path through the following circuit:

[Figure: circuit with inputs a, b, c]
5.3
Critical Path

[Figure: circuit with inputs a, b, c and signals d through m]

gate    delay
NOT     2
AND     4
OR      4
XOR     6
Assume all delay and timing factors other than combinational logic delay are negligible.
5.3.1 Ignoring potential false paths, list the signals in the critical path through this circuit.
[Figure: circuit annotated with arrival times: d 6, e 8, f 8, g 12, h 4, k 10, i 16, l 16, m 16, j 18]
Critical path is: b, e, g, j
5.3.2 What is the combinational delay through the critical path?

Delay: 18
5.3.3
Missing Factors
What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?
Answer:
wire delay
clock skew
clock jitter
5.3.4
False Path?
Is the critical path you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.

[Figure: circuit with value assignments for the candidate critical path]

Answer:
Contradictory assignment to e. Critical path requires 0 while input to j requires 1.
Therefore, the first candidate critical path is a false path.
NOTE: rules for XOR require one of the inputs to remain stable, otherwise output of XOR will not change.
Find next candidate.
[Figure: circuit annotated with arrival times: d 6, e 6,8, f 8, g 10,12, h 2, k 10, i 16, j 16, l 16, m 16]

1. There are four paths with a delay of 16. All go through g.
2. Quick check if g can change: Static equation for g a b bc. Therefore, g can change.
3. Try 0→1 on i
(a) For f, have choice of 1 or 0→1 on f, because of reconvergent fanout.
(b) Try 1 first, because it's simpler.
(c) For g, have choice of 0 or 0→1 on d, because of reconvergent fanout.
(d) Try 0 first, because it's simpler.
(e) d is ok, 0 from both sides
(f) Conflict on output of inverter.
[Figure: circuit with the value assignments tried in steps 4 and 5]

4. Try 0→1 on i with 0→1 on f. Conflict on d.
5. Try 0→1 on i, 0→1 on f, 0→1 on d. Conflict on d.
6. Try 0→1 on m. Conflict on e.
7. Try 0→1 on l.
(a) For e, have choice on b of whether to invert or not, because e is an xor.
(b) Because path from h is propagating 1→0 and e is 0→1, need to invert.
(c) For inversion, need to put a 1 on c.
(d) Conflict on b

[Figure: circuit with the value assignments tried in steps 6 and 7]

8. Need to get g to toggle.
(a) Static equation for g is a b bc. Only assignment that makes g=0 is abc. Only assignment that causes g to toggle because of change on b is a=0, b=1→0, c=0.
(b) Try to push rising edge on b through g to i, j, m, or l; with a=0 and c=0.

[Figure: circuit with a=0, c=0, and a rising edge on b]

(c) Can't get rising edge on b to toggle both g and an output. Therefore, critical path does not go through both b and g.
9. Find next candidate path.

[Figure: circuit annotated with arrival times for the next candidate: e 6,8, g 10, k 10, i 14, l 14, m 14, j 16]
[Figure: circuit with the value assignments for a rising edge on c]

10. Can't get rising edge on c to toggle both g and j. However, the rising edge can toggle i and l. Both the path from c to i and from c to l have a delay of 14.
11. The pair of assignments abc and abc will exercise the critical paths from c to i and c to l, both of which have a delay of 14.
5.4
Timing Models
In your next job, you have been told to use a fanout timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that. For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.
[Figure: schematic and layout of the circuit with gates G1–G5]

Symbol description        Capacitance    Resistance
Gate                      Cg             0
Interconnect level 2      Cx             0
Interconnect level 1      Cy
Antifuse
Assumptions:
The capacitance of a node on a wire is independent of where the node is located on the wire.
5.5 Worst Case Conditions and Derating Factor

Assume that we have a Std speed grade Actel A1415 (an ACT 3 part) Logic Module that drives 4 other Logic Modules:
5.5.1
Worst-Case Commercial
Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature).

Answer: For worst-case commercial conditions, assuming that TA = TJ, the Logic Module delay, tPD, for ACT 3 Std with 4 fanout is 5.7 ns (see Smith Table 5.2). Assume this is the slowest path; then the estimated critical path delay between registers, tCRIT (worst-case commercial), is:
$t_{CRIT} = t_{PD} + t_{SUD} + t_{CO} = 5.7\,\text{ns} + 0.8\,\text{ns} + 3.0\,\text{ns} = 9.5\,\text{ns}$
5.5.2
Worst-Case Industrial
Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).
Answer: For worst-case industrial conditions, assuming that TA = TJ, the derating factor is 1.07 (see Table 5.3). Hence the delay tCRIT (worst-case industrial) is 7% greater than the worst-case commercial delay: $1.07 \times 9.5 = 10.2$ ns
5.5.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under the worst-case industrial conditions (assuming that the junction temperature is 105 °C).

Answer: For worst-case industrial conditions, the derating factor at 105 °C is found by linear interpolation between the values for 85 °C (1.07) and 125 °C (1.17). The interpolated derating factor is 1.12. Hence the delay is: tCRIT (worst-case industrial, TJ = 105 °C) $= 1.12 \times 9.5 = 10.6$ ns.
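The three delay estimates in section 5.5 can be reproduced with a small Python sketch. It uses only the numbers quoted in the answers (5.7 ns, 0.8 ns, 3.0 ns, and the derating factors 1.07 and 1.17); the function name `interpolate` and the variable names are illustrative:

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Worst-case commercial: tCRIT = tPD + tSUD + tCO
t_crit_commercial = 5.7 + 0.8 + 3.0                     # ns

# Worst-case industrial with TJ = TA: derating factor 1.07
t_crit_industrial = 1.07 * t_crit_commercial

# Worst-case industrial with TJ = 105 C: interpolate between 85 C and 125 C
factor_105 = interpolate(105, 85, 1.07, 125, 1.17)
t_crit_105 = factor_105 * t_crit_commercial

print(f"{t_crit_commercial:.1f} ns, {t_crit_industrial:.1f} ns, "
      f"factor {factor_105:.2f} -> {t_crit_105:.1f} ns")
```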
SOL-09: Timing Analysis (II)

5.6
Short Answer
5.6.1
Wires in FPGAs
In an FPGA today, what percentage of the clock period is typically consumed by wire delay?
Answer: 40–60%
5.6.2
Age and Time
If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?
Answer: Decreased.
Justification: Transistors have gotten smaller, die size has remained roughly the same or even increased, and clock speeds are increasing. Signals are travelling roughly the same distance as before, but driving smaller capacitive loads. Thus, wire delay is not decreasing much, but capacitive load is decreasing. The clock period is decreasing, so the wire delay is taking up a larger percentage of the clock period and capacitive load delay is taking up a smaller percentage.
5.6.3
Temperature and Delay
As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?
Answer: Increase.
Justification: As temperature increases, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current. This increases resistivity, which increases delay.
5.7
Hold Time Violations
5.7.1
Cause
What is the cause of a hold time violation?
5.7.2
Behaviour
What is the bad behaviour that results if a hold time violation occurs?
5.7.3
Rectification
If a circuit has a hold time violation, how would you correct the problem with minimal effort?
5.8
Latch Analysis
Does the circuit below behave like a latch? If not, explain why not. If so, calculate the hold time and answer whether it is active-high or active-low.
[Figure: circuit with inputs d and en]

Gate Delays
AND    4
OR     2
NOT    1

Answer:

[Figure: circuit values in load mode and in store mode]

From the mode diagrams, if the circuit is a latch, it is active high, because the latch is in load mode when en=1.
Now check if timing of circuit is correct. The critical transition is from load mode to store mode.
[Figure: circuit with node labels d, l1, q, s1, en, cn, and timing diagram for the transition from load to store mode]

Hold time constraint must prevent new value arriving at d before en sets l1 to 1. Delay along data path is 0. Delay along clock path is 1. Hold time is 1.

circuit is latch?    Y
hold time            1
latch type           active high
5.9
Combinational Timing (Smith 13.23)
Chapter 6
Power Problems
SOL-10: Power Analysis and Reduction

6.1 Power Analysis and Reduction Problems
6.1.1
Short Answers
6.1.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase, stay the same, or decrease?
Answer:
Power will increase.
Justification:
Leakage power will increase, because the equation for the leakage power is: $P_{\text{leak}} \propto e^{-q V_{th} / (kT)}$ where T is temperature.
Short circuiting power will increase because: as temperature increases, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current. This increases resistivity, which increases delay. Signals will rise and fall more slowly, which will increase the short circuiting time, and hence increase short circuiting power.
6.1.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in the next generation of chips that the company fabricates. The prize for the person who submits the suggestion that makes the best tradeoff between leakage power and other design goals is to have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your idea require in order to achieve the reduction in leakage power?
Answer: Increase transistor size so as to increase threshold voltage. This will require an increase in supply voltage, which will likely increase total power.
Alternative: when increasing transistor size, keep the supply voltage the same, but decrease performance.
Alternative: change fabrication process and materials to reduce leakage current. This will likely be expensive.
Alternative: use a dual-Vt fabrication process.
6.1.1.3 Clock Gating

In what situations could adding clock-gating to a circuit increase power consumption?
Answer:
If the circuitry has a high utilization rate, then the power consumed by the clock gating circuit could be more than that saved in the main circuit.
Alternative: Even if the utilization rate is low, the utilization pattern could prevent the clock gating circuitry from turning off the clock to the main circuit. For example, if the circuit receives new data every other clock cycle, it would have a utilization rate of 50%, but might need to be powered up 100% of the time.
6.1.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray coding?
Answer:
Gray coding is designed to reduce power, because only one bit changes when incrementing or decrementing. Program counters usually increment, rather than jump to completely different values. So, using Gray coding should reduce power consumption. The downside is that the memory system probably doesn't use Gray-coded addresses, so additional circuitry would be needed to convert between Gray and binary codes. This will increase area and likely decrease performance. Additionally, the extra circuitry to do the translation might require more power than is saved by using Gray coding.
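The "only one bit changes" property can be illustrated by counting bit flips for a counter that steps through 0 to 15. The Python sketch below uses the standard binary-reflected Gray code (n XOR n>>1); the function names are illustrative, and bit flips are only a rough proxy for switching power:

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def bit_flips(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def total_flips(values) -> int:
    """Bit flips summed over consecutive pairs in the sequence."""
    return sum(bit_flips(x, y) for x, y in zip(values, values[1:]))

counts = list(range(16))                       # a 4-bit counter incrementing 0..15
binary_flips = total_flips(counts)
gray_flips = total_flips([to_gray(n) for n in counts])

print(f"binary: {binary_flips} flips, Gray: {gray_flips} flips")
```

The count covers only the counter register itself; any Gray-to-binary converter feeding the memory system would add its own switching activity, which is the tradeoff the answer describes.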
6.1.2
VLSI Gurus
The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1 ns. With their fabrication tweaks, they can decrease this to 0.85 ns.

6.1.2.1 Effect on Power
If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)

Answer: Reducing the rise/fall time from 1 ns to 0.85 ns reduces the short circuit time by the same proportion. Hence, the new short circuit power is 85% of the original.
6.1.2.2 Critique
A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.

Answer: The plan was probably to increase clock speed by 15%. However, reducing Tshort by 0.15 ns can at most decrease the clock period by $2 \times 0.15 = 0.30$ ns, while the clock period is much greater than 1 ns. Therefore, it does not work.
6.1.3 Advertising Ratios
One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their products. This wide variety of different metrics has confused them. Explain whether each metric is a reasonable metric for customers to use when choosing a system. If the metric is reasonable, say whether bigger is better (e.g. 500 MIPs/mm² is better than 20 MIPs/mm²) or smaller is better (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which one type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.
MIPs/mm²
MIPs/Watt
Watts/cm³
SOL-11: Power Analysis and Reduction

Lecture Notes Sections: 6.1.4 to 6.1.8.3
6.1.4 Vary Supply Voltage
As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases. The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:
\[ \mathrm{MaxClockSpeed} \propto \frac{(V - V_t)^2}{V} \]
With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?
Answer:
\[ \mathrm{MaxClockSpeed} \propto \frac{(V - V_t)^2}{V} \]
Where $V$ is supply voltage and $V_t$ is threshold voltage.
\[ \frac{\mathrm{MaxClockSpeed}_{1.5\mathrm{V}}}{\mathrm{MaxClockSpeed}_{3\mathrm{V}}} = \frac{(1.5\,\mathrm{V} - 0.8\,\mathrm{V})^2 / 1.5\,\mathrm{V}}{(3\,\mathrm{V} - 0.8\,\mathrm{V})^2 / 3\,\mathrm{V}} \approx 0.20 \]
\[ \mathrm{MaxClockSpeed}_{1.5\mathrm{V}} \approx 0.20 \times 200\,\mathrm{MHz} \approx 40\,\mathrm{MHz} \]
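The scaling calculation can be checked numerically. This is a minimal sketch, assuming the maximum clock speed is proportional to (V − Vt)²/V, where V is the supply voltage and Vt is the threshold voltage; the function and parameter names are illustrative.

```python
def max_clock_speed(f_ref_hz, v_ref, v_new, v_t):
    """Scale a measured clock speed from supply voltage v_ref to v_new.

    Assumes MaxClockSpeed is proportional to (V - Vt)^2 / V.
    """
    def speed_factor(v):
        return (v - v_t) ** 2 / v

    return f_ref_hz * speed_factor(v_new) / speed_factor(v_ref)

f_new = max_clock_speed(200e6, v_ref=3.0, v_new=1.5, v_t=0.8)
print(f"{f_new / 1e6:.1f} MHz")
```

Halving the supply voltage cuts the clock speed by roughly a factor of five here, because the supply ends up close to the threshold voltage.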
6.1.5 Power Reality and Math (Smith prob 15.16)
6.1.6 Clock Speed Increase Without Power Increase

The following are given:
You need to increase the clock speed of a chip by 10%
You must not increase its dynamic power consumption
The only design parameter you can change is supply voltage
Assume that short-circuiting current is negligible
6.1.6.1 Supply Voltage

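How much do you need to decrease the supply voltage by to achieve this goal?
Answer: Total power is the sum of switching, short-circuit, and leakage power.
Only need to reduce dynamic power, therefore neglect static (leakage) power.
Neglect short circuiting current.
This leaves the switching power, where $A$ is the activity factor, $C$ the load capacitance, $V$ the supply voltage, and $F$ the clock speed:
\[ \mathrm{Power} = \tfrac{1}{2} A\, C\, V^2 F \]
The power must stay the same when the clock speed rises to $F_2 = 1.1 F_1$:
\[ \tfrac{1}{2} A\, C\, V_2^2 (1.1 F_1) = \tfrac{1}{2} A\, C\, V_1^2 F_1 \]
\[ V_2 = V_1 \sqrt{\frac{1}{1.1}} \approx 0.953\, V_1 \]
We need to decrease the supply voltage to be 95.3% of its original value.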
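The result can be reproduced with a short calculation. This is a minimal sketch, assuming that leakage and short-circuit power are neglected and that switching power is proportional to V² × F; the function name is illustrative.

```python
import math

def voltage_scale_for_constant_power(freq_scale):
    """Return V2/V1 that keeps switching power (proportional to V^2 * F) constant
    when the clock speed is multiplied by freq_scale."""
    return math.sqrt(1.0 / freq_scale)

ratio = voltage_scale_for_constant_power(1.10)
print(f"V2 / V1 = {ratio:.3f}")
```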
What problems will you encounter if you continue to decrease the supply voltage?
Answer: Decreasing the supply voltage will bring it closer to the threshold voltage. As the difference between the supply and threshold voltage decreases, it will limit the maximum frequency that the circuit can run at. To recover speed, the threshold voltage must then be decreased, which will increase the leakage current and raise the static power dissipation.
6.1.7 Power Reduction Strategies
In each low-power approach described below, identify which component(s) of the power equation is (are) being minimized and/or maximized:

Designers scaled down the supply voltage of their ASIC.
Answer: Scaling down the supply voltage (V) reduces the dynamic power, because switching power is proportional to V².
6.1.7.2 Transistor Sizing

The transistors were made larger.
Answer: Resizing a transistor to increase its width-to-length ratio decreases the resistance of the transistor, which makes it faster. This means that the supply voltage can be reduced to save power while maintaining performance. However, increasing the width-to-length ratio increases the capacitance. After a certain point, the capacitance increase becomes more significant than the reduction in supply voltage, causing power to increase. Therefore, resizing adjusts supply voltage and load capacitance to minimize their product in the switching power component.
6.1.7.3 Adding Registers to Inputs

All inputs to functional units are registered.
Answer: When inputs are registered, the functional units see their inputs change at most once per clock cycle, so the activity factor is decreased, which decreases the dynamic power.
6.1.7.4 Gray Coding

Gray coding of signals is used for address signals.
Answer: Gray coding reduces the activity factor on signals that typically change by 1 or a small amount. Address signals have this behaviour, in contrast to data signals, where consecutive values are often completely different. Reducing the activity factor will reduce the dynamic power.
6.1.8 Power Consumption on New Chip
While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip. The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted. The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip. The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.
6.1.8.1 Hypothesis
Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.
6.1.8.2 Experiment
Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.
6.1.8.3 Reality
The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her suggestion is infeasible.
Chapter 7
Problems on Faults, Testing, and Testability
SOL-12: Faults, Testing, and Testability

7.1 Based on Smith q14.9: Testing Cost
A modern (circa 1995) production tester costs US$5-10 million. This cost is depreciated over the life of the tester (usually five years in the States due to tax guidelines).
1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a day, 365 days per year, how much does one second of test time cost?
Answer:
\[ \mathrm{CostPerSecond} = \frac{\mathrm{PurchaseCost}}{\mathrm{Lifespan}} = \frac{\$5 \times 10^6}{5 \times 365 \times 24 \times 60 \times 60\ \mathrm{s}} \]
$0.031 for a US$5 million tester
$0.062 for a US$10 million tester
2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind schedule. After the chips begin shipping, the tester is used 100% of the time. What is the cost of testing the chips relative to the cost if the chips had been completed on time?
Answer: 6 months is 10% of a 5 year lifespan. Therefore the tester will test 90% of the total number of chips that it would normally test. The cost per chip for testing will be:
\[ \mathrm{NewTestCost} = \frac{1}{0.90} \times \mathrm{OrigTestCost} = 111\% \times \mathrm{OrigTestCost} \]
3. The dimensions of the die to be tested are 20mm × 10mm. The wafers are 200mm in diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the number of die per wafer is equal to wafer area divided by chip area. What percentage of the fabrication + test cost is for test if the chip is on schedule and requires 1 minute to test?
Answer:
\[ \mathrm{DiePerWafer} = \frac{\mathrm{WaferArea}}{\mathrm{DieArea}} = \frac{\pi (200/2)^2}{10 \times 20} = 157 \]
\[ \mathrm{DieFabCost} = \frac{\mathrm{WaferFabCost}}{\mathrm{DiePerWafer}} = \frac{\$3000}{157} = \$19.10 \]
\[ \mathrm{DieTestCost} = \mathrm{TestCostPerSec} \times \mathrm{TestTime} = \$0.062 \times 60 = \$3.72 \]
\[ \mathrm{TestCostPct} = \frac{\mathrm{DieTestCost}}{\mathrm{DieTestCost} + \mathrm{DieFabCost}} = \frac{\$3.72}{\$3.72 + \$19.10} = 16.3\% \]
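The tester cost arithmetic can be reproduced as follows. This is a minimal sketch with illustrative function names; it assumes a five-year lifespan, continuous use, and the $0.062 per second figure (US$10 million tester) for the per-die calculation. The unrounded per-second costs come out slightly higher than the $0.031 and $0.062 quoted above.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def cost_per_second(purchase_cost, lifespan_years=5):
    """Depreciation cost of one second of tester time, in dollars."""
    return purchase_cost / (lifespan_years * SECONDS_PER_YEAR)

def fraction_spent_on_test(cost_per_sec, test_seconds, wafer_cost,
                           wafer_diameter_mm, die_width_mm, die_height_mm):
    """Test cost as a fraction of (test + fabrication) cost for one die."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    # Die per wafer approximated as wafer area divided by die area.
    die_per_wafer = math.floor(wafer_area / (die_width_mm * die_height_mm))
    die_fab_cost = wafer_cost / die_per_wafer
    die_test_cost = cost_per_sec * test_seconds
    return die_test_cost / (die_test_cost + die_fab_cost)

print(f"${cost_per_second(5e6):.4f} per second for a $5M tester")
pct = 100 * fraction_spent_on_test(0.062, 60, 3000, 200, 20, 10)
print(f"test is {pct:.1f}% of fabrication + test cost")
```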
7.2 Testing Cost and Total Cost
Given information:
The ACHIP costs $10 without any testing
Each board uses two ACHIPs (plus lots of other chips that we don't care about)
68% of the manufactured ACHIPs do not have any faults
For the ACHIP, it costs $1 per chip to catch half of the faults
Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of tests that are run)
If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace the ACHIP(s) (This is an approximation, based on the fact that the cost of the chip is much less than the total cost of $200).
Board-level testing will detect 100% of the faults in an ACHIP
What fault escapee rate will result in the lowest total cost for ACHIPs?
Answer: From section 7.2.2:
\[ \mathrm{TotCost} = \mathrm{NoTestCost} + \mathrm{TestCost} + \mathrm{EscapeeProb} \times \mathrm{ReplaceCost} \]
However, here we have two ACHIPs per board, so we need to use the escapee probability to compute the probability of a board needing to be replaced. The revised equation for total cost is:
\[ \mathrm{TotCost} = \mathrm{NoTestCost} + \mathrm{TestCost} + \mathrm{ReplaceProb} \times \mathrm{ReplaceCost} \]
The testing cost doubles, because we have two ACHIPs per board to test. The probability of a board having at least one bad ACHIP (and therefore needing to be replaced) is 1 minus the probability that both ACHIPs are good:
\[ \mathrm{ReplaceProb} = 1 - (1 - \mathrm{EscapeeProb})^2 \]

NoTestCost  TestCost         EscapeeProb  ReplaceProb  AvgReplaceCost  TotCost
$10         $0               32%          54%          $108            $118
$10         2 × $1 = $2      16%          29%          $58             $70
$10         2 × $2 = $4      8%           15%          $30             $44
$10         2 × $4 = $8      4%           8%           $16             $34
$10         2 × $8 = $16     2%           4%           $8              $34
$10         2 × $16 = $32    1%           2%           $4              $46
$10         2 × $32 = $64    0.5%         1%           $2              $76

The chips will have the lowest cost if either $8 or $16 is spent on testing and they have a fault escapee rate of 4% or 2%. We choose to spend $16 on testing, because that has a lower escapee rate for the same total cost. The lower escapee rate will improve our reputation for quality.
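The cost table can be regenerated programmatically. This is a minimal sketch, assuming ReplaceProb = 1 − (1 − EscapeeProb)² rounded to a whole percent before multiplying by the $200 replacement cost, as the tabulated values suggest; the function and parameter names are illustrative.

```python
def total_cost_table(no_test_cost=10, replace_cost=200,
                     initial_escapee_pct=32.0, rows=7):
    """Return rows of (test_cost, escapee_pct, replace_pct, total_cost)."""
    table = []
    escapee_pct = initial_escapee_pct
    per_chip_test = 0
    for _ in range(rows):
        test_cost = 2 * per_chip_test            # two ACHIPs per board
        p = escapee_pct / 100
        # Board is replaced if at least one of the two chips is bad.
        replace_pct = round(100 * (1 - (1 - p) ** 2))
        avg_replace_cost = replace_pct * replace_cost // 100
        total = no_test_cost + test_cost + avg_replace_cost
        table.append((test_cost, escapee_pct, replace_pct, total))
        # Each halving of the escapee rate doubles the per-chip test cost.
        per_chip_test = 1 if per_chip_test == 0 else 2 * per_chip_test
        escapee_pct /= 2
    return table

for test_cost, escapee, replace, total in total_cost_table():
    print(f"test ${test_cost:>2}  escapee {escapee:>4}%  replace {replace:>2}%  total ${total}")
```

Sweeping the table this way makes the trade-off visible: testing cost grows geometrically while the expected replacement cost shrinks geometrically, so the total has a minimum in between.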
7.3 Minimum Number of Faults
In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo > 1), and average fanin of fi, what is the minimum number of faults that must be considered when using a single-stuck-at fault model?
Answer:
The minimum number of wire segments to connect a gate or input to fo other gates or outputs is fo + 1. (Assuming fo > 1. If fo = 1, then the minimum number of wire segments is 1.) With i inputs and g gates, this results in (i + g) × (fo + 1) wire segments. Each wire segment has two possible faults (stuck-at-1 and stuck-at-0), therefore there are 2 × (i + g) × (fo + 1) potential single-stuck-at faults that must be considered.
NOTE: the fanin degree does not directly factor into this equation. However, there is a relationship between the number of gates g, the number of inputs i, the depth of the circuitry, the fanout degree fo, and the fanin degree fi. For example, the maximum number of gates whose inputs are all primary inputs is (i × fo) / fi.
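The fault count can be written as a small function. This is a minimal sketch, assuming every input and gate drives fo + 1 wire segments (one stem plus fo branches) when fo > 1 and a single segment when fo = 1; the function name is illustrative.

```python
def min_stuck_at_faults(i, g, fo):
    """Minimum number of single-stuck-at faults for i inputs and g gates
    with average fanout fo."""
    segments_per_driver = 1 if fo == 1 else fo + 1
    wire_segments = (i + g) * segments_per_driver
    return 2 * wire_segments  # stuck-at-0 and stuck-at-1 on each segment

print(min_stuck_at_faults(i=3, g=5, fo=2))
```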
7.4 Smith q14.10: Fault Collapsing
Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.
Answer:
AND: input @0, input @0, output @0
OR: input @1, input @1, output @1
NAND: input @0, input @0, output @1
NOR: input @1, input @1, output @0
A two-input mux does not have any controlling inputs, so it does not have any collapsible faults.
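The collapsing rule can be checked by brute force over truth tables. This is a minimal sketch with illustrative names: an input stuck-at fault collapses with an output stuck-at fault when the gate with that input fixed produces a constant output.

```python
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}

def collapsing_faults(gate):
    """Return (input_stuck_value, output_stuck_value) pairs that are equivalent."""
    f = GATES[gate]
    pairs = []
    for in_val in (0, 1):
        for out_val in (0, 1):
            # With one input stuck at in_val, the fault is equivalent to the
            # output stuck at out_val if the output no longer depends on the
            # other input.
            if all(f(in_val, b) == out_val for b in (0, 1)):
                pairs.append((in_val, out_val))
    return pairs

for name in GATES:
    print(name, collapsing_faults(name))
```

Each gate has exactly one controlling input value, which is why exactly one pair is found per gate; a mux has no such value, so the same search would find nothing for its data inputs.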
7.5 Mathematical Models and Reality
Given a correct circuit and a non-stuck-at fault (e.g. bridging AND), will a single-stuck-at fault model detect the fault? If so, identify a single-stuck-at fault whose test will detect it, or explain why it can't be detected.
7.6 Undetectable Faults
Identify one of the undetectable single stuck-at faults in the circuit below, or say NONE if all single stuck-at faults are detectable.
[Figure: circuit with inputs a, b, c, output z, and lines L1 to L8]
7.7 Test Vector Generation
Your task is to generate test vectors to detect faults in the circuit shown below. Your manager has said that manufacturing only has time to run three test vectors on the circuit.
[Figure: circuit with inputs a, b, c and lines L1 to L8]
7.7.1 Choice of Test Vectors
Which test vectors should you run and in what order should you run them?
7.7.2 Number of Test Vectors
Write a brief statement (backed up with data) to support either staying with three test vectors or increasing the test suite to four vectors.
7.8 Time to do a Scan Test
A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and two of 12,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 50% of full speed. Calculate the total test time.
Answer:
We can load and unload all of the scan chains at the same time, so time will be limited by the longest (30,000 bits). For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

Clock cycles   30,000   1     30,000   1     30,000   ...
Vector 1       Load     Run   Dump
Vector 2                      Load     Run   Dump
Vector 3                                     Load     ...

\[ \mathrm{TimeTot} = \mathrm{ClockPeriod} \times \big(\mathrm{MaxLengthVec} + \mathrm{NumVecs} \times (\mathrm{MaxLengthVec} + 1)\big) \]
\[ \mathrm{TimeTot} = \frac{1}{0.50 \times 1.2 \times 10^9} \times \big(30{,}000 + 500{,}000 \times (30{,}000 + 1)\big) = 25.0\ \mathrm{secs} \]
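The timing formula can be evaluated directly. This is a minimal sketch with illustrative names, assuming all chains are shifted in parallel, each vector needs one run cycle, and the unload of one vector overlaps the load of the next.

```python
def scan_test_time(chain_lengths, num_vectors, clock_hz, speed_fraction):
    """Total scan test time in seconds."""
    longest = max(chain_lengths)
    # One (load + run) per vector, plus a final unload of the last result.
    cycles = longest + num_vectors * (longest + 1)
    return cycles / (clock_hz * speed_fraction)

chains = [30_000, 20_000, 24_000, 25_000, 12_000, 12_000]
t = scan_test_time(chains, num_vectors=500_000, clock_hz=1.2e9, speed_fraction=0.50)
print(f"{t:.1f} secs")
```

Because only the longest chain matters, balancing the chain lengths is the obvious way to shorten the test without changing the number of vectors.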
7.9 BIST
In this problem, we will revisit the circuit from section 7.3.1, which is shown below. But, this time we'll use BIST to test the circuit, rather than analyzing the faults and then choosing test vectors to catch the potential faults.
[Figure: circuit with inputs a, b, c and lines L1 to L8]
7.9.1 Characteristic Polynomials
Derive the characteristic polynomials for the linear feedback shift registers shown below:
[Figure: two three-flop linear feedback shift registers with signals d0, q0, d1, q1, d2, q2 and set/reset inputs]
Answer: Both circuits have three flops, so their maximum exponent is $x^3$. A feedback tap on a signal $d_i$ corresponds to a coefficient of 1 on $x^i$ in the characteristic polynomial. The first circuit has feedback taps for d0, d1, and d2. This gives a characteristic polynomial of: $x^3 + x^2 + x + 1$
The second circuit has taps on d0 and d1, but not one on d2: $x^3 + x + 1$
7.9.2 Test Generation
Do either of the circuits generate a maximal-length non-repeating sequence?
Answer:
For an LFSR with n flops, the length of a maximal-length non-repeating sequence is $2^n - 1$. Both of the LFSRs under consideration have 3 flops, so we are looking for a sequence of 7 non-repeating values. We will first simulate the circuits to see their values, and then demonstrate how characteristic polynomials and division over Galois fields can be used to accomplish the same thing.

For $x^3 + x^2 + x + 1$:
      d0  q0  d1  q1  d2  q2
1)    1   1   0   1   0   1
2)    0   1   1   0   0   0
3)    0   0   0   1   1   0
4)    1   0   1   0   1   1
      1   1   0   1   0   1    same as 1)
We see that it repeats after 4 values.

For $x^3 + x + 1$:
      d0  q0  d1  q1  q2
1)    1   1   0   1   1
2)    1   1   0   0   1
3)    0   1   1   0   0
4)    0   0   0   1   0
5)    1   0   1   0   1
6)    0   1   1   1   0
7)    1   0   1   1   1
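The two simulations can be reproduced in software. This is a minimal sketch, assuming the LFSR structure implied by the tabulated values: d0 is driven by the last flop's output, and each tapped d_i is the previous flop's output XORed with the last flop's output. The function name and the `taps` encoding are illustrative.

```python
def lfsr_states(taps, seed=None):
    """Simulate an LFSR and return its states (q0, q1, ...) up to the first repeat.

    taps[i] is True when d_i has a feedback tap. taps[0] is always the
    feedback path itself.
    """
    n = len(taps)
    state = tuple(seed) if seed is not None else (1,) * n  # reset to all 1s
    seen = []
    while state not in seen:
        seen.append(state)
        feedback = state[-1]
        nxt = [feedback]
        for i in range(1, n):
            nxt.append(state[i - 1] ^ feedback if taps[i] else state[i - 1])
        state = tuple(nxt)
    return seen

first = lfsr_states((True, True, True))    # taps on d0, d1, d2
second = lfsr_states((True, True, False))  # taps on d0, d1 only
print(len(first), len(second))
```

Counting distinct states before the first repeat is a direct test of maximal length: an n-flop LFSR is maximal length exactly when it visits all 2^n − 1 non-zero states.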
For $x^3 + x + 1$, we see that it generates a sequence of 7 different values before repeating. The circuit has three flops, so the maximum length sequence of non-repeating values it can generate is $2^3 - 1$, which is 7. Thus, $x^3 + x + 1$ is a maximal length linear feedback shift register.
Format for division: the LFSR polynomial is the divisor and the message is the dividend; the division produces a quotient and a remainder.
For an LFSR with no external input and n flops, the first n coefficients of the message are the reset values of the LFSR, and all of the other remaining coefficients are 0. For a test vector generator LFSR, the reset values are all 1s. We hope to have a sequence of 7 unique remainders. With the three initial values in the LFSR flops, we require a message polynomial of 3 + 7 = 10 values.
The message polynomial is then: 1x^9 1x^8 1x^7 0x^6 0x^5 0x^4 0x^3 0x^2 0x^1 0x^0
Carry out the division (divisor 1x^3 0x^2 1x^1 1x^0):

Quotient term   Subtracted               Polynomial below the subtraction line
1x^6            1x^9 0x^8 1x^7 1x^6      1x^8 0x^7 1x^6 0x^5
1x^5            1x^8 0x^7 1x^6 1x^5      0x^7 0x^6 1x^5 0x^4
0x^4            (nothing)                0x^6 1x^5 0x^4 0x^3
0x^3            (nothing)                1x^5 0x^4 0x^3 0x^2
1x^2            1x^5 0x^4 1x^3 1x^2      0x^4 1x^3 1x^2 0x^1
0x^1            (nothing)                1x^3 1x^2 0x^1 0x^0
1x^0            1x^3 0x^2 1x^1 1x^0      1x^2 1x^1 1x^0

Quotient: 1x^6 1x^5 1x^2 1x^0
Remainder: 1x^2 1x^1 1x^0
The values on the flip flops inside an LFSR with n flops show up as the n most significant coefficients on the polynomials immediately below the subtraction lines in the long division. For example, after the second subtraction, the polynomial is: 0x^7 0x^6 1x^5 0x^4. The three most significant coefficients are: 001 and the value on (q2,q1,q0) after two steps of execution is also 001.
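The long division over GF(2) can be carried out mechanically. This is a minimal sketch, assuming polynomials are stored as integers whose bit k holds the coefficient of x^k; the function name is illustrative.

```python
def gf2_divmod(dividend, divisor):
    """Divide two GF(2) polynomials and return (quotient, remainder).

    Polynomials are ints: bit k is the coefficient of x^k. Subtraction in
    GF(2) is XOR, so each step XORs a shifted copy of the divisor.
    """
    quotient = 0
    divisor_degree = divisor.bit_length() - 1
    while dividend.bit_length() - 1 >= divisor_degree:
        shift = dividend.bit_length() - 1 - divisor_degree
        quotient |= 1 << shift
        dividend ^= divisor << shift
    return quotient, dividend

message = 0b1110000000   # x^9 + x^8 + x^7
lfsr_poly = 0b1011       # x^3 + x + 1
q, r = gf2_divmod(message, lfsr_poly)
print(f"quotient {q:b}, remainder {r:b}")
```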
7.9.3 Signature Analyzer

Given a signature analyzer equation of $x^2 + x + 1$, find the expected value of the flops in the signature analyzer at the end of the test sequence. Also, design the hardware for the signature analyzer and result checker.
Answer:
[Figure: test generator LFSR with flops q0, q1, q2, inputs i_d(0), i_d(1), i_d(2), and set/mode controls]
Expected sequence of values from circuit:
      q0  q1  q2  z
1)    1   1   1   1    x^6
2)    1   0   1   0    x^5
3)    1   0   0   0    x^4
4)    0   1   0   0    x^3
5)    0   0   1   0    x^2
6)    1   1   0   1    x^1
7)    0   1   1   1    x^0
Polynomial for output sequence of circuit under test: $x^6 + x + 1$
Connect the test generator to the circuit. The remainder of the result sequence divided by the signature analyzer polynomial is the value in the flops of the signature analyzer at the end of the test sequence.
Format for division:
m(x)   message (output of circuit under test)   $x^6 + x + 1$
p(x)   polynomial of signature analyzer         $x^2 + x + 1$
q(x)   quotient
r(x)   remainder
Carry out the division (divisor 1x^2 1x^1 1x^0):

Quotient term   Subtracted          Polynomial below the subtraction line
1x^4            1x^6 1x^5 1x^4      1x^5 1x^4 0x^3
1x^3            1x^5 1x^4 1x^3      0x^4 1x^3 0x^2
0x^2            (nothing)           1x^3 0x^2 1x^1
1x^1            1x^3 1x^2 1x^1      1x^2 0x^1 1x^0
1x^0            1x^2 1x^1 1x^0      1x^1 0x^0

Quotient: 1x^4 1x^3 1x^1 1x^0
Remainder: 1x^1 0x^0
Check division:
\[ q(x)\,p(x) + r(x) = (x^4 + x^3 + x + 1)(x^2 + x + 1) + x = x^6 + x + 1 = m(x) \]
Division was done correctly. The final value on the two flops in the signature analyzer will be the remainder: 1x^1 0x^0 = 10.
NOTE: When looking at the remainder (signature), we look at the outputs of the flops, representing the flop nearest the input as $x^0$.
Using hardware:
[Figure: signature analyzer circuit with input i, flops q0 and q1, and reset]

clk
i     1 0 0 0 0 1 1
d0    1 0 1 1 0 0 0 0
q0    0 1 0 1 1 0 0 0
d1    0 1 1 0 1 1 1 0
q1    0 0 1 1 0 1 1 1

The q1 row carries the quotient, and the final values of q1 and q0 are the remainder.
Signature analyzer and timing diagram
The quotients and the remainder calculated using long division match the ones that were calculated using the circuit. The values on the flops in the signature analyzer match, cycle by cycle, the two most significant coefficients on the intermediate remainders calculated during long division. The intermediate remainders are the polynomials below the subtraction lines. (When looking at the circuit, remember that for an LFSR with n flops, it takes n clock cycles for the circuit to become primed with the input sequence and match the long-division arithmetic.) The ok circuit for this signature analyzer is a 2-input AND gate with the q0 input inverted (ok = q1 AND NOT q0), because the remainder is 10.
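The hardware behaviour can be simulated to compare against the long division. This is a minimal sketch, assuming the signature analyzer is an LFSR with the serial input XORed into d0 and feedback taken from the last flop; the function name and the integer encoding of the polynomial (bit k is the coefficient of x^k) are illustrative.

```python
def signature_analyzer(bits, poly):
    """Shift `bits` (highest power first) through a signature analyzer.

    Returns the flop values [q0, q1, ...] after the last bit, where q0 is
    the flop nearest the input and corresponds to x^0.
    """
    n = poly.bit_length() - 1
    q = [0] * n                       # flops reset to 0
    for bit in bits:
        feedback = q[-1]
        nxt = [bit ^ (feedback & (poly & 1))]
        for k in range(1, n):
            nxt.append(q[k - 1] ^ (feedback & ((poly >> k) & 1)))
        q = nxt
    return q

z_good = [1, 0, 0, 0, 0, 1, 1]        # x^6 + x + 1
flops = signature_analyzer(z_good, 0b111)
print("q0, q1 =", flops)
```

The final flop values can be compared with the remainder from the long division, reading q1 as the coefficient of x^1 and q0 as the coefficient of x^0.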
[Figure: Signature analyzer with ok circuit]
The result checker should check the ok signal one cycle after the last test vector. The last test vector in the sequence is 110. We can either look for 110 and delay by one clock cycle, or we can look for the first test vector (111) in the second iteration of the sequence. To make sure that we are looking at the second iteration of the sequence, and not the first, we look at reset.
[Figure: Result checker circuit option 1: max-length LFSR (q0, q1, q2), circuit under test (z), signature analyzer (ok), all_ok]
[Figure: Result checker circuit option 2: max-length LFSR (q0, q1, q2), reset, circuit under test (z), signature analyzer (ok), all_ok]
7.9.4 Probability of Catching a Fault
Find the approximate probability of a fault not being detected
Answer:
We have a sequence of 7 bits coming from the circuit under test. This gives us $2^7 = 128$ possible sequences. Of these, 1 is the good sequence and 127 are faulty sequences. The signature analyzer stores 2 bits of data, which gives us 4 possible values. Thus, on average 128/4 = 32 different result sequences will map to the same 2-bit signature. Of these 32 vectors, 1 is the good sequence and 31 are faulty sequences. Assume that each result sequence is equally likely to occur. (NOTE: this is a poor assumption, a full analysis would make each stuck-at fault equally likely, then compute the result vector for each fault.) With this assumption, there is a 31/127 ≈ 24% chance that a faulty sequence will result in the same signature as the good sequence. There is approximately a 24% chance that a faulty circuit will not be detected.
7.9.5
If we increase the size of the signature analyzer by one flip flop, by how much do we change the approximate probability of a fault not being detected?
Answer:
A signature analyzer with 3 bits of data gives us 8 possible values. Thus, on average 128/8 = 16 different result sequences will map to the same 3-bit signature. Assuming that each result sequence is equally likely to occur, there is a 15/127 ≈ 11.8% chance that a faulty sequence will result in the same signature as the good sequence. There is approximately a 12% chance that a faulty circuit will not be detected. Thus, we have decreased the probability of a faulty circuit not being detected from 24% to 12%.
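The aliasing estimate generalizes to any sequence length and signature width. This is a minimal sketch under the same simplifying assumption that every result sequence is equally likely; the function name is illustrative.

```python
from fractions import Fraction

def alias_probability(sequence_bits, signature_bits):
    """Probability that a faulty result sequence has the good signature."""
    total_sequences = 2 ** sequence_bits
    sequences_per_signature = total_sequences // (2 ** signature_bits)
    faulty_with_good_signature = sequences_per_signature - 1
    faulty_sequences = total_sequences - 1
    return Fraction(faulty_with_good_signature, faulty_sequences)

for sig_bits in (2, 3):
    p = alias_probability(7, sig_bits)
    print(f"{sig_bits}-bit signature: {p} = {float(p):.1%}")
```

Each additional flop roughly halves the aliasing probability, which matches the drop from about 24% to about 12% above.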
7.9.6 Detecting a Specific Fault
Determine if L7@0 is detectable
Answer:
There is an error somewhere in this solution

Equation for faulty circuit: a AND b.
Faulty sequence of values from circuit:
a   b   c   z
1   1   1   1    x^6
1   0   1   0    x^5
1   0   0   0    x^4
0   1   0   0    x^3
0   0   1   0    x^2
1   1   0   1    x^1
0   1   1   0    x^0
Polynomial for result sequence: $x^6$
Compute remainder
[Long division of the result sequence by 1x^2 1x^1 1x^0; worksheet layout not recoverable]
Quotient: 1x^4 1x^3 1x^1
Remainder: 1x^1 1x^0
This remainder is the same as the remainder for the correct circuit, thus the fault will not be detected!
In hardware:

clk
i     1 0 0 0 0 1 1
d0    1 0 1 1 0 0 0 0
q0    0 1 0 1 1 0 0 0
d1    0 1 1 0 1 1 1 0
q1    0 0 1 1 0 1 1 1
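Since this solution is flagged as containing an error, it may help to recompute both remainders directly from the tabulated z columns. This is a minimal sketch, assuming the z values are read top to bottom as the coefficients of x^6 down to x^0 and that the signature polynomial is x^2 + x + 1 from section 7.9.3; the function name is illustrative.

```python
def gf2_remainder(bits, poly):
    """Remainder of a GF(2) polynomial division.

    `bits` lists coefficients from the highest power down to x^0;
    `poly` is an int whose bit k is the coefficient of x^k.
    """
    value = int("".join(str(b) for b in bits), 2)
    degree = poly.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        value ^= poly << (value.bit_length() - 1 - degree)
    return value

SIGNATURE_POLY = 0b111                     # x^2 + x + 1
z_good = [1, 0, 0, 0, 0, 1, 1]             # fault-free circuit
z_faulty = [1, 0, 0, 0, 0, 1, 0]           # z = a AND b, from the table above

good = gf2_remainder(z_good, SIGNATURE_POLY)
faulty = gf2_remainder(z_faulty, SIGNATURE_POLY)
print(f"good {good:02b}, faulty {faulty:02b}, detected: {good != faulty}")
```

Under these assumptions the two remainders can be compared directly; if they differ, the signature analyzer would flag the fault, which may help locate the error noted at the top of this solution.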
7.9.7 Time to Run Test
Find the number of clock cycles to run the test
Answer: For a maximal-length LFSR of n bits, it takes $2^n - 1$ clock cycles to generate the $2^n - 1$ test vectors, plus one cycle at the end to flop the results. This gives a total of $2^n$ clock cycles, which in our case is 8.
7.10 Power and BIST
You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing has dictated is needed. What can you do to reduce the power consumption of the chip without negatively affecting performance or incurring significant design effort?
Answer: When in test mode, run the clock at a lower frequency so that the chip will consume less power. Add clock gating to the signature analyzer so that it is turned off when the chip is in normal mode.
7.11 Timing Hazards and Testability

This question deals with the following circuit:
[Figure: circuit with lines L1 to L15]
1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.
Answer:
[Figure: diagram labelled a and c]
None of the minterms are completely covered by other minterms, so the circuit is irredundant and does not have undetectable faults. The two minterms ac and ab overlap, but neither is completely covered by other minterms. So, if one of them was stuck at 0, there would be at least one set of input values that would cause the faulty circuit to differ from the correct circuit.
2. Does the circuit have any static timing hazards?
Answer: Moving from abc to abc moves between minterms. Thus, there is a potential timing hazard.
[Figure: Potential glitch (static hazard), labelled a and c]
3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify any untestable single-stuck-at faults in the resulting circuit.
Answer:
[Figure: circuit with added circuitry; lines L1 to L19, with faults L9@0, L13@0, L16@0, L17@0, L18@0, and L19@0 marked]
The minterms ab and bc are both completely covered by other minterms. Thus, these minterms are redundant and are sources of undetectable faults. This gives us L13@0 and L19@0 as undetectable single stuck-at faults. Using gate collapsing, we see that the following faults are equivalent to L13@0: L9@0, L16@0. And the following are equivalent to L19@0: L17@0, L18@0.
NOTE: although both L16@0 and L17@0 are undetectable, this does not mean that L2@0 is undetectable. L2@0 is equivalent to having both L16@0 and L17@0 at the same time. Check the Boolean equations if you are in doubt about this.
7.12 Testing Short Answer
7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
If not, explain why. If so, describe such a fault.
Answer: Yes.
A fault that is only detectable with 000 will be detectable by scan testing but not by built-in self test. A fault that results in the same signature as the correct circuit will be detectable by scan testing but not by built-in self test.
7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
If not, explain why. If so, describe such a fault.
Answer: No. Any fault that is detectable by built-in self testing can be detected by scan testing where the test vector that we scan in is the BIST test vector that triggers the fault. If scan testing is interpreted as boundary scan testing and built-in self test is allowed inside a chip, then there are faults that are detectable by built-in self test but not by boundary scan testing. These faults would be inside redundant sequential circuitry. But, this scenario was not intended to be part of this question.
7.13 Fault Testing
In this question, you will design and analyze built-in self test circuitry for the circuit-under-test shown below.
7.13.1 Design test generator

Draw the schematic for a 2-bit maximal-length linear feedback shift register and demonstrate that it is maximal length.
Answer:

clk
d0      1 0 1
q0      1 1 0 1
d1      0 1 1
q1      1 0 1 1
value   3 1 2 3

The register visits 3, 1, and 2 before returning to 3. These are all $2^2 - 1 = 3$ non-zero values, so the LFSR is maximal length.
7.13.2 Design signature analyzer

Design a signature analyzer circuit for a characteristic polynomial of $x + 1$.
Answer:
7.13.3 Determine if a fault is detectable

Is a stuck-at-1 fault on the output of the inverter detectable with the circuitry that you've designed?
Answer:
This solution has an error

1. Equation for correct circuit-under-test is a a b output 1 1 0 1 1 0 0 1 1 b.
2. Simulating correct output sequence 011 through signature analyzer: i 0 1 1 d0 0 1 0 q0 0 0 1 0 3. Equation for faulty circuit-under-test is ab a b output 1 1 1 0 1 0 0 1 0
4. Simulating faulty output sequence 100 through signature analyzer: i 1 0 0 d0 1 1 1 q0 0 1 1 1 5. Output of signature analyzer is different from correct circuit, so the fault will be detected.
ab.
7.13.4 Testing time

How many clock cycles does your BIST circuitry require to test the circuit under test? Explain how each clock cycle is used.
Answer:
1. reset circuit
2. run first of three test vectors
3. run second of three test vectors
4. run third of three test vectors
5. flop result from circuit under test into signature analyzer
5 clock cycles.
