
Reconfigurable Computing

By

Prof. K. R. Saraf
Syllabus
Module I: - Types of computing and introduction to RC: General Purpose Computing,
Domain-Specific Processors, Application Specific Processors; Reconfigurable
Computing, Fields of Application; Reconfigurable Device Characteristics,
Configurable, Programmable, and Fixed-Function Devices; General-Purpose
Computing, General-Purpose Computing Issues

Module II: - Metrics: Density, Diversity, and Capacity; Interconnects, Requirements, Delays in VLSI Structures; Partitioning and Placement, Routing; Computing Elements, LUTs, LUT Mapping, ALU and CLBs; Retiming, Fine-grained & Coarse-grained structures; Multi-context

Module III: - Different architectures for fast computing viz. PDSPs, RALU,
VLIW, Vector Processors, Memories, CPLDs, FPGAs, Multi-context FPGA, Partial
Reconfigurable Devices; Structure and Composition of Reconfigurable Computing
Devices: Interconnect, Instructions, Contexts, Context switching, RP space model

Module IV: - Reconfigurable devices for Rapid prototyping, Non-frequently reconfigurable systems, Frequently reconfigurable systems; Compile-time
reconfiguration, Run-time reconfiguration; Architectures for Reconfigurable
computing: TSFPGA, DPGA, Matrix; Applications of reconfigurable computing:
Various hardware implementations of Pattern Matching such as the Sliding
Windows Approach, Automaton-Based Text Searching. Video Streaming
Types of Computing
1) General Purpose Computing
2) Application Specific IC
General Purpose Computing
General-purpose computing devices allow us to
(1) customize computation after fabrication and
(2) conserve area by reusing expensive active circuitry for different functions in
time.
 General-purpose computers have served us well over the past couple of
decades.
 Broad applicability has led to widespread use and volume commoditization.
 Flexibility allows a single machine to perform a multitude of functions and be deployed into applications not anticipated at the time the device was designed or manufactured.
 The flexibility inherent in general-purpose machines was a key component
of the computer revolution.
Application Specific IC
 Despite the fact that processor performance steadily increases, we often
find it necessary to prop up these general-purpose devices with specialized
processing assists, generally in the form of specialized co-processors or
ASICs.

 For the past several years, industry and academia have focused largely on
the task of building the highest performance processor, instead of trying to
build the highest performance general-purpose computing engine. When
active area was extremely limited, this was a very sensible approach.

 However, as silicon real estate continues to increase far beyond the space
required to implement a competent processor, it is time to re-evaluate
general-purpose architectures in light of shifting resource availability and
cost.
Reconfigurable Computing
 In particular, an interesting space has opened between the
extremes of general-purpose processors and specialized ASICs.

 That space is the domain of reconfigurable computing and offers all the benefits of general-purpose computing with greater performance density than traditional processors.

 This space is most easily seen by looking at the binding time for
device function.

 ASICs bind function to active silicon at fabrication time, making the silicon useful only for the designated function.
Reconfigurable Computing
 Processors bind functions to active silicon only for the duration of a single
cycle, a restrictive model which limits the amount the processor can
accomplish in a single cycle while requiring considerable on-chip resources
to hold and distribute instructions.
 Reconfigurable devices allow functions to be bound at a range of intervals
within the final system depending on the needs of the application.
 This flexibility in binding time allows reconfigurable devices to make better
use of the limited device resources including instruction distribution.
 Consequently, reconfigurable computing architectures offer:
 More application-specific adaptation than conventional processors
 Greater computational density than conventional processors
 More and broader reuse of silicon than ASICs
 Better opportunities to ride hardware and algorithmic technology curves
than ASICs
 Better match to current technology costs than ASICs or processors
Reconfigurable Computing
 Standard Definition: A reconfigurable computer is a device which computes by using a post-fabrication, spatially programmed connection of compute elements.
 It refers to systems incorporating some form of hardware programmability that customizes how the hardware is used through a number of physical control points.
 These control points can be changed periodically in order to execute different
applications using the same hardware.
 Its key feature is the ability to perform computations in hardware to increase
performance, while retaining much of the flexibility of a software solution.
 The innovative development of FPGAs whose configuration could be re-
programmed an unlimited number of times spurred the invention of a new
field in which many different hardware algorithms could execute, in turn, on a
single device, just as many different software algorithms can run on a
conventional processor.
Reconfigurable Computing
 The speed advantage of direct hardware execution on the FPGA –
routinely 10X to 100X the equivalent software algorithm – attracted the
attention of the supercomputing community as well as Digital Signal
Processing systems developers.

 RC researchers found that FPGAs offer significant advantages over microprocessors and DSPs for high performance, low volume applications, particularly for applications that can exploit customized bit widths and massive instruction-level parallelism.
Advantages of RC
 Even in applications with fixed algorithms, reconfigurable devices may offer advantages over a full-custom or ASIC approach.
 FPGAs are process drivers, and therefore typically a process generation ahead of ASICs.
 Increasing NRE costs for ASICs and full-custom designs have pushed the "cross-over" point far out.
 Time to market advantage.
 Programmability aids project risk management and extends product lifetimes.

Disadvantages of RC
 Reconfiguration time might be critical in run-time reconfigurable systems.
 Low utilization of hardware in configurable systems.
Qn. 1. State and explain characteristics of Reconfigurable
Devices. (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
 Reconfigurable Device Characteristics
 Broadly considered, reconfigurable devices fill their silicon area with a
large number of computing primitives, interconnected via a configurable
network.

 The operation of each primitive can be programmed, as well as the interconnect pattern.

 Computational tasks can be implemented spatially (relating to space) on the device, with intermediates flowing directly from the producing function to the receiving function.

 Since we can put thousands of reconfigurable units on a single die, significant data flow may occur without crossing chip boundaries.

 To first order, one can think about turning an entire task into a hardware dataflow and mapping it onto the reconfigurable substrate.
Qn. 1. State and explain characteristics of Reconfigurable
Devices. (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
 Reconfigurable computing generally provides spatially-oriented processing
rather than the temporally-oriented processing (limited by time) typical of
programmable architectures such as microprocessors.

 The key differences between reconfigurable machines and conventional processors are:

 Instruction Distribution – Rather than broadcasting a new instruction to the functional units on every cycle, instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and effectively deliver more instructions into active silicon on each cycle.

 Spatial routing of intermediates – As space permits, intermediate values are routed in parallel from producing function to consuming function rather than forcing all communication to take place in time through a central resource bottleneck.
Qn.1. State and explain characteristics of Reconfigurable
Devices. (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
 More, often finer-grained, separately programmable building blocks –
Reconfigurable devices provide a large number of separately
programmable building blocks allowing a greater range of computations to
occur per time step. This effect is largely enabled by the compressed
instruction distribution.

 Distributed deployable resources, eliminating bottlenecks – Resources such as memory, interconnect, and functional units are distributed and deployable based on need rather than being centralized in large pools. Independent, local access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a central resource bottleneck.
Qn.2. What are the key differences between
reconfigurable machine and conventional processors,
explain them (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

 RC has more application-specific adaptation than conventional processors.

 Greater computational density than conventional processors.

 Instruction Distribution – Rather than broadcasting a new instruction to the functional units on every cycle, instructions are locally configured, allowing the reconfigurable device to compress instruction stream distribution and effectively deliver more instructions into active silicon on each cycle.

 Spatial routing of intermediates – As space permits, intermediate values are routed in parallel from producing function to consuming function rather than forcing all communication to take place in time through a central resource bottleneck.
Qn.2. What are the key differences between reconfigurable
machine and conventional processors, explain them

 More, often finer-grained, separately programmable building blocks – Reconfigurable devices provide a large number of separately programmable building blocks, allowing a greater range of computations to occur per time step. This effect is largely enabled by the compressed instruction distribution.

 Distributed deployable resources, eliminating bottlenecks – Resources such as memory, interconnect, and functional units are distributed and deployable based on need rather than being centralized in large pools. Independent, local access allows reconfigurable designs to take advantage of high, local, parallel on-chip bandwidth, rather than creating a central resource bottleneck.
Qn.3 What are configurable, programmable and fixed
function devices? Explore each with suitable example in
brief. (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
• Combining all the observations, we can categorize the circumstances under
which the various structures are preferred.

• Fixed Function, Limited Operation Diversity, High Throughput – When the function and data granularity (granularity is the level of detail represented by the data; high granularity means a minute, sometimes atomic, grade of detail) to be computed are well understood and fixed, and when the function can be economically implemented in space, dedicated hardware provides the most computational capacity per unit area to the application.

• Variable Function, Low Diversity – If the function required is unknown or varying, but the instruction or data diversity is low, the task can be mapped directly onto a reconfigurable computing device, which can efficiently extract high computational density.
Qn.3 What are configurable, programmable and fixed
function devices? Explore each with suitable example in
brief. (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• Space Limited, High Entropy (a numerical measure of uncertainty or lack of predictability) – If we are limited spatially and the function to be computed has high operation and data diversity, we are forced to reuse the limited active space heavily and accept limited instruction and data bandwidth. In this regime, conventional processor organizations are most effective, since they dedicate considerable space to on-chip instruction memory.
Qn. Compare ASIC, GPP, FPGA and RD with respect to design
efforts, power consumption, throughput and NRE.
Parameter | ASIC | GPP | FPGA | RD
1. Full Form | Application Specific IC | General Purpose Processor | Field Programmable Gate Array | Reconfigurable Device
2. Design Efforts | More design efforts required | Less design efforts | Less design efforts | Moderate design efforts
3. Power Consumption | Low | More | Moderate | Low
4. Computation Throughput | High | Less | Moderate | Moderate
5. NRE Cost | High | Moderate | Low | Low
Qn. Comparison of ASIC, ASSP, Configurable, DSP, FPGA, MCU and
RISC
Qn. Compare RISC-styled processor, VLIW processor, DSP
processor and with respect to limited instruction and data
bandwidth, abstraction overhead and data movement consume
capacity and coarse grained data path network.
Parameter | RISC-styled processor | VLIW processor | DSP processor | FPGA
1) Limited instruction and data bandwidth | It has limited instructions and data bandwidth | | |
2) Abstraction overhead | | | |
3) Data movement consume capacity | | | |
4) Coarse grained data path network | | | |
Qn. Compare RISC-styled processor, VLIW processor, DSP
processor and FPGA with respect to limited instruction and data
bandwidth, abstraction overhead and data movement consume
capacity and coarse grained data path network.
Ans:RISC-styled processor (45 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Limited Instructions and Data Bandwidth: coupled with long latencies
to off-chip resources, limited bandwidth can cause the processor to stall
waiting on the information which it needs to determine the course of the
computation or the data it needs to operate upon. As a result, the
number of instructions completed per cycle is always less than the
number of ALUs available.

 Abstraction Overhead and Data Movement Consume Capacity: procedure calls, data marshaling, traps, and data conversion consume capacity without directly providing capacity to the problem. Operations which simply move data around do not provide capacity to the problem either. Since we have a mix of the instructions shown in Table 1 below, we will never achieve the peak condition where every operation provides 2 gate-evaluations to the computational task.
Quantitative studies tell us, for given systems or application sets, average values for the number of gate evaluations per instruction and the number of instructions executed per clock cycle. This gives us an expected computational capacity of roughly:

Expected capacity ≈ Peak capacity × (Instructions per Issue Slot) × (Gate Evaluations per ALU Bit Op) / 2
For example, while HaL’s SPARC64 should be able to issue 4 instructions per cycle, in practice it only issues 1.2 instructions per cycle on common workloads [EG95]. Thus Instructions per Issue Slot = 1.2/4 = 30%, resulting in a 70% reduction from expected peak capacity.
Assuming the integer DLX instruction mixes given in Appendix C of [HP90] are typical, we can calculate (Gate Evaluations) per (Datapath Bit) by weighting the instructions by their provided capacities from Table 1. In Table 4 below we see that one ALU bit op in these applications is roughly 0.6 gate evaluations.

If this effect is jointly typical with the instructions per issue slot number above, then we would yield at most 0.3 × 0.3 ≈ 9% of the theoretical peak functional density. For the HaL case, this reduces the yielded capacity to roughly 1.4 gate-evaluations per λ²·s.
 There are, of course, several additional aspects which prevent almost any application from achieving even this expected peak capacity, and which cause many applications to not even come close to it.

 Coarse Grained Data Path Network: the word-oriented datapaths in conventional processors prevent efficient interaction of arbitrary bits – e.g. a simple operation like XOR-ing together two fields in a data word requires a mask, shift, and XOR operation even though the whole operation may amount to only a few gate evaluations. Returning to the HaL example above, we note that the processor has a 64-bit datapath. When running code developed for 32-bit SPARCs, half of the datapath is idle, further reducing the yielded capacity by 50%, to 0.7 gate-evaluations per λ²·s.
Qn: State and explain issues in reconfigurable network design.
(83 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
1. Provide adequate flexibility – The network must be capable of implementing the interconnection topology required by the programmed logic design with acceptable interconnect delays.

2. Use configuration memory efficiently – Space required for configuration memory can account for a reasonable fraction of the array real-estate. However, configuration encodings can be tight and do not have to take up substantial area relative to that required for wires and switches.

3. Balance bisection bandwidth – As discussed above, interconnect wiring takes space and can, in some topologies, dominate the array size. The wiring topology should be chosen to balance interconnect bandwidth with array size and expected design interconnect requirements.

4. Minimize delays – The delay through the routing network can easily be
the dominant delay in a programmable technology. Care is required to
minimize interconnect delays. Two significant factors of delay are:
Qn: State and explain issues in reconfigurable network design.
(a) Propagation and fan out delay – Interconnect delay on a wire is
proportional to distance and capacitive loading (fan out). This makes
interconnect delay roughly proportional to distance run, especially when
there are regular taps into the signal run. Consequently, small/short signal
runs are faster than long signal runs.

(b) Switched element delay – Each programmable switching element in a path (e.g. crossbar, multiplexor) adds delay. This delay is generally much larger than the propagation or fan out delay associated with covering the same physical distance. Consequently, one generally wants to minimize the number of switch elements in a path, even if this means using some longer signal runs. Switching can be used to reduce fan out on a line by segmenting tracks, and large fan out can be used to reduce switching by making a signal always available in several places. Minimizing the interconnect delay, therefore, always requires technology dependent tradeoffs between the amount of switching and the length of wire runs.
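A rough, hypothetical way to see this tradeoff (this first-order model is not taken from the source) is to write the delay of a routed connection as the sum of the switch delays along the path plus a term roughly proportional to the wire distance covered:

\[
t_{path} \;\approx\; \sum_{i=1}^{N_{sw}} t_{switch,i} \;+\; k_{wire}\cdot d_{path}
\]

Here N_sw is the number of programmable switches in the path, t_switch,i their individual delays, d_path the total wire length, and k_wire an effective per-unit-length propagation constant; minimizing t_path means trading fewer switch crossings against somewhat longer wire runs.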
Q. Explore in detail the evolution of general purpose computing
with VLSI technology.
OR
Q. What are the limitations of General Purpose Processors? Why
evolution in it is necessary? What are characteristics of an ideal
reconfigurable device?
OR
Q. Write a short note on General purpose computing issues.
OR
Q. Give empirical view of general purpose computing
architectures in the age of MOS VLSI. Compare their performance
OR
Q. What are general purpose computing issues?
OR
Q. What are the typical issues and constraints while development
of general purpose reconfigurable device?
(27 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
General-purpose computing devices are specifically intended for those cases where, economically, we cannot or need not dedicate sufficient spatial resources to support an entire computational task, or where we do not know enough about the required task or tasks prior to fabrication to hardwire the functionality.
The key ideas behind general-purpose processing are:
1. Defer binding of functionality until device is employed – i.e. after fabrication
2. Exploit temporal reuse of limited functional capacity

Delayed binding and temporal reuse work closely together and occur at many
scales to provide the characteristics we now expect from general-purpose
computing devices.

 We are quite accustomed to exploiting these properties so that unique hardware is not required for every task or application.

This basic theme recurs at many different levels in our conventional systems:
Market Level – Rather than dedicating a machine design to a single
application or application family, the design effort may be utilized for many
different applications.
System Level – Rather than dedicating an expensive machine to a single
application, the machine may perform different applications at different times
by running different sets of instructions.
Application Level – Rather than spending precious real estate to build a
separate computational unit for each different function required, central
resources may be employed to perform these functions in sequence with an
additional input, an instruction, telling it how to behave at each point in time.
Algorithm Level – Rather than fixing the algorithms which an application uses,
an existing general-purpose machine can be reprogrammed with new
techniques and algorithms as they are developed.

User Level – Rather than fixing the function of the machine at the supplier, the instruction stream specifies the function, allowing the end user to use the machine as best suits his needs.

Machines may be used for functions which the original designers did not
conceive. Further, machine behavior may be upgraded in the field without
incurring any hardware or hardware handling costs. In the past, processors
were virtually the only devices which had these characteristics and served as
general-purpose building blocks. Today, many devices, including
reconfigurable components, also exhibit the key properties and benefits
associated with general-purpose computing. These devices are economically
interesting for all of the above reasons.

 General-Purpose Computing Issues: - There are two key features associated with general-purpose computers which distinguish them from their specialized counterparts.
The way these aspects are handled plays a large role in distinguishing various general-purpose computing architectures.
1 Interconnect
In general-purpose machines, the data paths between functional units cannot be hardwired.

 Different tasks will require different patterns of interconnect between the functional units.

 Within a task, individual routines and operations may require different interconnectivity of functional units.

 General-purpose machines must provide the ability to direct data flow between units.

 In the extreme of a single functional unit, memory locations are used to perform this routing function.

 As more functional units operate together on a task, spatial switching is required to move data among functional units and memory.

 The flexibility and granularity of this interconnect is one of the big factors determining the yielded capacity on a given application.
2 Instructions
Since general-purpose devices must provide different operations over
time, either within a computational task or between computational tasks,
they require additional inputs, instructions, which tell the silicon how to
behave at any point in time.

 Each general-purpose processing element needs one instruction to tell it what operation to perform and where to find its inputs.

As we will see, the handling of this additional input is one of the key
distinguishing features between different kinds of general-purpose
computing structures.

When the functional diversity is large and the required task throughput is
low, it is not efficient to build up the entire application dataflow spatially
in the device.

 Rather, we can realize applications, or collections of applications, by sharing and reusing limited hardware resources in time (see Figure 2.1) and only replicating the less expensive memory for instruction and intermediate data storage.
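As a small, hypothetical illustration of this temporal reuse (the entity and signal names below are invented for this sketch, not taken from the source), a single functional unit can be shared in time, with a per-cycle op code acting as the instruction that tells it how to behave:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- One functional unit reused in time: the "op" input is the instruction
-- that selects the unit's behaviour on each clock cycle.
entity shared_alu is
  port (
    clk  : in  std_logic;
    op   : in  std_logic_vector(1 downto 0);  -- per-cycle instruction
    a, b : in  unsigned(7 downto 0);
    y    : out unsigned(7 downto 0)
  );
end entity shared_alu;

architecture rtl of shared_alu is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case op is
        when "00"   => y <= a + b;     -- this cycle: add
        when "01"   => y <= a - b;     -- next cycle: subtract
        when "10"   => y <= a and b;   -- later: bitwise AND
        when others => y <= a xor b;   -- later: bitwise XOR
      end case;
    end if;
  end process;
end architecture rtl;

The same active area thus performs different functions at different times, at the cost of storing and distributing the instruction.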
Q. Write a short note on Regular and Irregular Computing Tasks.
(32 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

Computing tasks can be classified informally by their regularity.

Regular tasks perform the same sequence of operations repeatedly.

Regular tasks have few data dependent conditionals such that all data
processed requires essentially the same sequence of operations with highly
predictable flow of execution.

Nested loops with static bounds and minimal conditionals are the typical
example of regular computational tasks.

 Irregular computing tasks, in contrast, perform highly data dependent operations.

The operations required vary widely according to the data processed and the
flow of operation control is difficult to predict.
Q. Explore design metrics such as density, diversity, and capacity.
Give mathematical analysis and circuit schematics to support.
(32 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

 The goal of general-purpose computing devices is to provide a common computational substrate which can be used to perform any particular task.

 In this section, we look at characterizing the amount of computation provided by a given general-purpose structure.

 To perform a particular computational task, we must extract a certain amount of computational work from our computing device.

 If we were simply comparing tasks in terms of a fixed processor and instruction set, we might measure this computational work in terms of instruction evaluations.

 If we were comparing tasks to be implemented in a gate-array, we might compare the number of gates required to implement the task and the number of time units required to complete it.
Here, we want to consider the portion of the computational work done for the
task on each cycle of execution of diverse, general-purpose computing devices
of a certain size.
The ultimate goal is to roughly quantify the computational capacity per unit
area provided to various application types by the computational organizations
under consideration.

 That is, we are trying to answer the question: “What is the general-purpose computing capacity provided by this computing structure?”

 We have two immediate problems in answering this question:
1. How do we characterize general-purpose computing tasks?
2. How do we characterize capacity?

 The first question is difficult since it places few bounds on the properties of the computational tasks.

 We can, however, talk about the performance of computing structures in terms of some general properties which various important subclasses of general-purpose computing tasks exhibit.

We thus end up addressing more focused questions, but ones which give us
insight into the properties and conditions under which various computational
structures are favorable.
We will address the second question – that of characterizing device capacity –
using a “gate evaluation” metric.

That is, we consider the minimum size circuit which would be required to
implement a task and count the number of gates which must be evaluated to
realize the computational task.

 We assume the collection of gates available are all k-input logic gates, and use k = 4.

 This models our 4-LUT as discussed in Section 2.4, as well as more traditional gate-array logic.

 The results would not change characteristically using a different, finite value of k, as long as k ≥ 2.
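For concreteness, a small, hypothetical behavioural sketch of such a 4-LUT is shown below (the entity, generic and port names are illustrative, not a vendor primitive): a 16-bit truth table is indexed by the four inputs, so each lookup realizes one gate evaluation of an arbitrary 4-input function.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Behavioural 4-input lookup table: INIT holds the 16-entry truth table.
entity lut4 is
  generic (
    INIT : std_logic_vector(15 downto 0) := x"8000"  -- default: 4-input AND
  );
  port (
    a, b, c, d : in  std_logic;
    y          : out std_logic
  );
end entity lut4;

architecture behav of lut4 is
  signal sel : std_logic_vector(3 downto 0);
begin
  sel <= d & c & b & a;                     -- form the 4-bit index
  y   <= INIT(to_integer(unsigned(sel)));   -- one gate evaluation per lookup
end architecture behav;

Reprogramming INIT changes the function computed without changing the surrounding structure, which is exactly the post-fabrication binding discussed earlier.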
1 Functional Density
Functional capacity is a space-time metric which tells us how much
computational work a structure can do per unit time.

Correspondingly, our “gate-evaluation” metric is a unit of space-time capacity.

 That is, we can get 4 “gate-evaluations” in one “gate-delay” out of 4 parallel AND gates (Figure 2.5), or in 4 “gate-delays” out of a single AND gate (Figure 2.6).

 That is, if a device can provide capacity Dcap gate evaluations per second, optimally, to the application, a task requiring Nge gate evaluations can be completed in time:

Ttask = Nge / Dcap

In practice, limitations on the way the device’s capacity may be employed will often cause the task to take longer, resulting in a lower yielded capacity. If the task takes time TTask-actual to perform NTask_ge gate evaluations, then the yielded capacity is:

Dyielded = NTask_ge / TTask-actual

The capacity which a particular structure can provide generally increases with
area.
To understand the relative benefits of each computing structure or architecture
independent of the area in a particular implementation, we can look at the
capacity provided per unit area.

 We normalize area in units of λ², where λ is half the minimum feature size, to make the results independent of the particular process feature size.
Our metric for functional density, then, is capacity per unit area, measured in terms of gate-evaluations per unit space-time, i.e. in units of gate-evaluations/λ²·s.

 The general expression for computing functional density, given an operational cycle time tcycle for a unit of silicon of size A evaluating Nge gate evaluations per cycle, is:

Dfd = Nge / (A × tcycle)

 This capacity definition is very logic centric, not directly accounting for interconnect capacity.
We are treating interconnect as a generality overhead which shows up in the additional area associated with each compute element in order to provide general-purpose functionality.

 This is somewhat unsatisfying since interconnect capacity plays a big role in defining how effectively one uses logic capacity.

 Unfortunately, interconnect capacity is not as cleanly quantifiable as logic capacity, so we make this sacrifice to allow easier quantification.

As noted, the focus here is on functional density.

This density metric tells us the relative merits of dedicating a portion of our
silicon budget to a particular computing architecture.

The density focus makes the implicit assumption that capacity can be
composed and scaled to provide the aggregate computational throughput
required for a task.

To the extent this is true, architectures with the highest capacity density that
can actually handle a problem should require the least size and cost.
Whether or not a given architecture or system can actually be composed and
scaled to deliver some aggregate capacity requirement is also an interesting
issue, but one which is not the focus here.

 Remember also that resource capacity is primarily interesting when the resource is limited.

We look at computational capacity to the extent we are compute limited.

When we are i/o limited, performance may be much more controlled by i/o
bandwidth and buffer space which is used to reduce the need for i/o.

 In the expression above: Nge = number of gate evaluations per cycle, A = silicon area (in units of λ²), and tcycle = operational cycle time.
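As a purely illustrative calculation with hypothetical numbers (not measurements of any real device), a block of silicon occupying 10^9 λ² that completes 1024 gate evaluations every 10 ns cycle would provide:

\[
D_{fd} = \frac{N_{ge}}{A \cdot t_{cycle}}
       = \frac{1024}{10^{9}\,\lambda^{2} \times 10^{-8}\,\text{s}}
       \approx 100\ \text{gate-evaluations}/\lambda^{2}\,\text{s}
\]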
2 Functional Diversity – Instruction Density

Functional density alone only tells us how much raw throughput we get. It says
nothing about how many different functions can be performed and on what
time scale.

 For this reason, it is also interesting to look at the functional diversity or instruction density. Here we use functional diversity to indicate how many distinct function descriptions are resident per unit of area.
This tells us how many different operations can be performed within part or all of an IC without going outside of the region or chip for additional instructions.
We thus define functional diversity as:

Functional Diversity = Ninstr / A

 To first order, we count instructions as native processor instructions or processing element configurations, assuming a nominally 4-LUT equivalent logic block for the logic portion of the processing element.
3 Data Density
Space must also be allocated to hold data for the computation. This area also
competes with logic, interconnect, and instructions for silicon area. Thus, we
will also look at data density when examining the capacity of an architecture.

If we put Nbits of data for an application into a space A, we get a data density:

Data Density = Nbits / A


Qn. What are the limitations of conventional FPGA that are responsible
for not to perform an application at its peak.
(57 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

FPGA capacity has not changed dramatically over time, but the sample size is small.
There is a slight upward trend, which is probably representative of the relative youth of the architecture.

This peak, too, is not achievable for every application.

Some effects which may prevent an application from achieving this peak
include:
Limited interconnect – When the network connectivity is inadequate, all of a
device’s capacity cannot be used.

This may require either that cells buffer and transmit data or simply that cells
go unused in the area.

 Conventional FPGA interconnect routes most applications with over 80% utilization.
Pipeline efficiency limits – In heavily pipelined systems, capacity can be
required to pass data across pipeline stages to all of its points of consumption.
This transit capacity consumes device capacity without contributing to the
evaluation capacity required for the application.
Limited ability to pipeline operations – Some tasks have cyclic dependencies
where a result is required before the next round of computation can begin.
Unless several orthogonal tasks are interleaved on the FPGA, the cyclic path length limits the rate at which resources can be reused and, in turn, prevents the application from fully utilizing the FPGA's capacity.
Limited I/O Bandwidth – The I/O cycle time on most FPGAs is higher than the
logic cycle time.
Data transfer to and from the FPGA may limit the capacity which can actually
be applied to a problem.
Limited need for this functionality – If a piece of functionality implemented in
an FPGA is not required at the rate and frequency achievable on the FPGA,
the FPGA can be employed far below its available capacity.
Need for additional functionality – When the FPGA cannot hold the required
functional diversity for a task and must be reprogrammed in order to complete
the task, the device goes partially or entirely unused during the
reprogramming cycle.
Qn. What are conventional interconnect? What are their limitations?
Which scheme of interconnect is most suitable for RD?
(84 andrew dehon general purpose computing)

 Conventional FPGA interconnect takes a hybrid approach, with a mix of short, neighbor connections and longer connections.

Full connectivity is not supported even within the interconnect of a single tile.
Typically, the interconnect block includes:
 A hierarchy of line lengths – some interconnect lines span (the amount of space that something covers) a single cell, some a small number of cells, and some an entire row or column
 Limited, but not complete, opportunity for corner turns
 Limited opportunity to link together shorter segments for longer routes
 Options for the value generated by the LUT to connect to some lines of each hierarchical length in each direction – perhaps including some local interconnect lines dedicated to the local LUT output
 Opportunity to select the k-LUT inputs from most of the lines converging in the interconnect block
 The amount of interconnect in each of the two dimensions is not necessarily the same.

 The figure below shows these features in a caricature (a depiction in which certain striking characteristics are exaggerated) of conventional FPGA interconnect.
One of the key differences between FPGAs and traditional “multiprocessor”
networks is that
 FPGA interconnect paths are locked down serving a single function.
 The FPGA must be able to simultaneously route all source-sink connections using unique resources to realize the connectivity required by the design.

 Another key difference is that the interconnection pattern is known a priori, before execution, so offline partitioning and placement can be used to exploit locality and thereby reduce the interconnect requirements.
Qn. What do you mean by coarse and fine grained architecture? What
are the expected characteristics of RD? (wikipedia)
Granularity is the extent to which a material or system is composed of
distinguishable pieces or grains.

It can either refer to the extent to which a larger entity is subdivided, or the
extent to which groups of smaller indistinguishable entities have joined
together to become larger distinguishable entities.

For example, a kilometer broken into centimeters has finer granularity than a
kilometer broken into meters.

 In contrast, molecules of photographic emulsion may clump together to form distinct noticeable granules, reflecting coarser granularity.

 Coarse-grained materials or systems have fewer, larger discrete components than fine-grained materials or systems.

 A coarse-grained description of a system regards large subcomponents, while a fine-grained description regards smaller components of which the larger ones are composed.
The terms granularity, coarse, and fine are relative, used when comparing
systems or descriptions of systems. An example of increasingly fine
granularity: a list of nations in the United Nations, a list of all states/provinces
in those nations, a list of all cities in those states, etc.

 Coarse-grained - larger components than fine-grained, large subcomponents. Simply wraps one or more fine-grained services together into a more coarse-grained operation.

 Fine-grained - smaller components of which the larger ones are composed; lower level service.

 It is better to have more coarse-grained service operations, which are composed of fine-grained operations.
Qn. Explain delta delay, intrinsic delay, interconnect delay pertaining to RD.

 Delta Delay is the default signal assignment propagation delay when no delay is explicitly prescribed.

 Delta is an infinitesimal VHDL time unit, so that all signal assignments can result in signals assuming their values at a future time, e.g.:

output <= NOT input; -- output assumes new value in one delta cycle.
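To expand on that one-line example, here is a small, hypothetical self-contained sketch (entity and signal names invented for illustration) showing how a change ripples through two assignments in successive delta cycles at the same simulation time:

library ieee;
use ieee.std_logic_1164.all;

-- Two chained inverters with no explicit "after" clause: a change on
-- "input" reaches "b" one delta cycle later and "c" two delta cycles
-- later, all at the same simulation time.
entity delta_demo is
  port (
    input : in  std_logic;
    c     : out std_logic
  );
end entity delta_demo;

architecture behav of delta_demo is
  signal b : std_logic;
begin
  b <= not input;  -- scheduled for the next delta cycle
  c <= not b;      -- takes effect one delta cycle after b updates
end architecture behav;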

 Intrinsic delay is the delay internal to the gate, from the input pin of the cell to the output pin of the cell.

 It is defined as the delay between an input and output pair of a cell when a near-zero slew is applied to the input pin and the output does not see any load condition. It is predominantly caused by the internal capacitance associated with the cell's transistors.
This delay is largely independent of the size of the transistors forming the gate, because increasing the size of the transistors also increases the internal capacitances.
Interconnect Delay
Interconnections contribute to the delay time two-fold: by capacitive loading and by distributed-RC delay.
The interconnect capacitance CM·LM (capacitance per unit length times line length) adds to the capacitive load of the driving stages.
This effect dominates for short interconnections.
Interconnect capacitance can be determined accurately with two- and three-dimensional device simulation. For some purposes there are also analytical models, which can be calibrated to simulated or measured data.

 The interconnect RC delay dominates for long interconnections.
Clearly, long interconnections will be avoided by inserting repeaters to eliminate the square-law length dependence and to regenerate the signal.

 However, as the feature size is scaled down, the RC product of the wiring remains roughly constant, so that the interconnect RC delay does not scale with feature size.

 This is a serious problem in deep-sub-micron technologies, where interconnect delay starts to dominate.
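A standard first-order sketch (a textbook model, hedged here rather than taken from this source) makes the square-law dependence and the effect of repeaters explicit. For a line of length L with resistance r and capacitance c per unit length, driven through N repeaters each adding delay t_rep:

\[
t_{wire} \approx \tfrac{1}{2}\, r\, c\, L^{2} \qquad \text{(unbuffered distributed RC line)}
\]
\[
t_{wire,rep} \approx N\left(\tfrac{1}{2}\, r\, c\left(\tfrac{L}{N}\right)^{2} + t_{rep}\right)
= \frac{r\, c\, L^{2}}{2N} + N\, t_{rep}
\]

Choosing N roughly proportional to L turns the quadratic term into one that grows only linearly with wire length, which is why repeaters are inserted on long runs.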
Qn. What are research challenges in RC?

Trusted Hardware
Problem Statement Due to the immense costs associated with fabricating chips
in the latest technology node, the design, manufacturing and testing of
integrated circuits has moved across a number of companies residing in
various countries. This makes it difficult to secure the supply chain and opens
up a number of possibilities for security threats.

The problem boils down to the question: how can these fabless companies assure that the chip received from the foundry is exactly the same as the design given to the foundry − nothing more, nothing less?

Physical attacks
Problem Statement Physical attacks monitor characteristics of the system in an
attempt to glean (collect) information about its operation.

 These attacks can be invasive (aggressive, disturbing), semi-invasive or non-invasive. Invasive attacks include probing stations, chemical solvents, lasers, ion beams and/or microscopes to tap into the device and gain access to its internal workings.
This involves de-packaging and possibly removing layers of the device to
enable probing or imaging.

Side channel attacks are less invasive attacks that exploit physical
phenomena such as power, timing, electromagnetic radiation, thermal profile,
and acoustic characteristics of the system

 Design Tool Subversion (Destabilization, Damage)
Problem Statement Programming an FPGA relies on a large number of sophisticated CAD tools that have been developed by a substantial number of people across many different companies and organizations.

In the end, these different design tools and IP produce a set of inter-operating
cores, and as Ken Thompson said in his Turing Award Lecture, “You can’t
trust code that you did not totally create yourself”.
You can only trust your final system as much as your least-trusted design
path.
Subversion of the design tools could easily result in malicious hardware being
inserted into the final implementation.
Alternatively, you must develop ways in which you can trust the hardware and
the design tools.
Design Theft
Problem Statement The bit stream necessarily holds all the information that is
required to program the FPGA to perform a specific task. This data is often
proprietary information, and the result of many months/years of work.
Therefore, protecting the bit stream is the key to eliminating intellectual
property theft. Furthermore, the bit stream may contain sensitive data, e.g. the
cryptographic key.

System Security
Problem Statement The problem that we are dealing with here is the ability to
make assurances that during the deployment of the device it does not do
anything potentially harmful. This requires that we perform monitoring of the
system and provide the system with information on what it can and cannot do.
Qn. Write a short note on Instruction and interconnect growth
(109 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

 Interconnect requirements grow faster than interconnect description (configuration) requirements.

 This is one reason that single context devices can afford to use sparse interconnect encodings. Since the wires and switches are the dominant and limiting resource, additional configuration bits are not costly. In the wire limited case, we may have free area under long routing channels for memory cells. In fact, dense encoding of the configuration space has the negative effect that control signals must be routed from the configuration memory cells to the switching points. The closer we try to squeeze the bit stream encoding to its minimum, the less locality we have available between configuration bits and controlled switches. These control lines compete with network wiring, exacerbating the routing problems on a wire dominated layout.
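One hedged way to see why wires outgrow their descriptions (a sketch based on Rent's rule, which the later questions also reference, rather than the exact relation from the source): the wiring crossing a partition of N logic blocks typically grows as a power law, while the configuration needed to control a switch grows only logarithmically in the number of options it selects among.

\[
IO \approx c \cdot N^{p}, \quad 0.5 < p < 1
\qquad\text{versus}\qquad
\text{bits per switch point} \approx \log_{2}(\text{options})
\]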
Qn. Write a short note on RP space area model
(126 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Qn. Write a short note on Granularity
(132 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• Fine-Grained Reconfigurable Architectures and Coarse-Grained Reconfigurable Architectures

• Reconfigurable computing architectures are divided into two groups, according to the level of hardware abstraction or granularity used:
i. Fine-Grained Reconfigurable Architectures
ii. Coarse-Grained Reconfigurable Architectures

• Fine-grained reconfigurable devices
• These are hardware devices whose functionality can be modified at a very low level of granularity. For instance, the device can be modified so as to add or remove a single inverter or a single two-input NAND gate.

• Fine-grained reconfigurable devices are mostly represented by programmable logic devices (PLD). These include programmable logic arrays (PLA), programmable array logics (PAL), complex programmable logic devices (CPLD) and field programmable gate arrays (FPGA).

• On fine-grained reconfigurable devices, the number of gates is often used as the unit of measure for the capacity of an RPU.
Qn. Write a short note on Granularity
(132 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• Example: FPGA
• Issues: Bit level operation => Several processing units => Large routing
overhead => Low silicon area efficiency => Use up more Power

• High volume of configuration data needed => Large configuration memory => More power dissipation and long configuration time

• Coarse-grained reconfigurable devices: also called pseudo reconfigurable devices.
• These devices usually consist of a set of coarse-grained elements such as ALUs that allow for the efficient implementation of a dataflow function.

• Hardware abstraction is based on a set of fixed blocks (like functional units, processor cores and memory tiles)

• Reconfiguration is merely the reprogramming of the interconnections between these blocks

• Coarse-grained reconfigurable devices do not have any established measure of comparison.
Qn. Write a short note on Granularity
(132 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• However, the number of PEs on a device as well as their characteristics such as the granularity, the amount of memory on the device and the variety of peripherals may provide an indication of the quality of the device.

• Coarse-grained architecture overcomes the disadvantages of fine-grained architecture:

• Provides a multiple-bit wide datapath and complex operators

• Avoids routing overhead

• Needs fewer lines -> lower area use


Qn. Write a short note on Granularity
(132 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• The granularity of the reconfigurable logic is defined as the size of the smallest functional unit (configurable logic block, CLB) that is addressed by the mapping tools.

• High granularity, which can also be known as fine-grained, often implies greater flexibility when implementing algorithms into the hardware. However, there is a penalty associated with this in terms of increased power, area and delay, due to the greater quantity of routing required per computation.

• Fine-grained architectures work at the bit-manipulation level, whilst coarse-grained processing elements (reconfigurable data path units, rDPU) are better optimised for standard data path applications.

• One of the drawbacks of coarse-grained architectures is that they tend to lose some of their utilization and performance if they need to perform smaller computations than their granularity provides; for example, a one-bit add on a four-bit wide functional unit would waste three bits. This problem can be solved by having a coarse-grain array (reconfigurable datapath array, rDPA) and an FPGA on the same chip.
Qn. Write a short note on Granularity
(132 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• Coarse-grained architectures (rDPA) are intended for the implementation of algorithms needing word-width data paths (rDPU).

• As their functional blocks are optimized for large computations and typically
comprise word wide arithmetic logic units (ALU), they will perform these
computations more quickly and with more power efficiency than a set of
interconnected smaller functional units; this is due to the connecting wires
being shorter, resulting in less wire capacitance and hence faster and lower
power designs.

• A potential undesirable consequence of having larger computational blocks is that when the size of the operands does not match the algorithm, an inefficient utilization of resources can result.

• Often the type of applications to be run are known in advance allowing the
logic, memory and routing resources to be tailored to enhance the
performance of the device whilst still providing a certain level of flexibility for
future adaptation. Examples of this are domain specific arrays aimed at
gaining better performance in terms of power, area, throughput than their
more generic finer grained FPGA cousins by reducing their flexibility.
Qn. Write a short note on Contexts
(136 Andrew Dehon Reconf. Arch. For General Purp. Comp.) (99 RC FPGA By Andrew Dehon)

• Computational density is heavily dependent on the number of instruction contexts supported. Architectures which support substantially more contexts than required by the application allow a large amount of silicon area dedicated to instruction memory to go unused. Architectures which support too few contexts will leave active computing and switching resources idle, waiting for the time when they are needed.

• For RTR systems, the overhead of serial programming may be prohibitive. An attractive alternative may be to provide storage in the device for multiple configurations simultaneously, facilitating configuration prefetching and fast reconfiguration.

• A multi-context device (sometimes called “time-multiplexed”) contains multiple planes (contexts) of configuration data. Each configuration point of the device is controlled by a multiplexer that chooses between the context planes. Two configuration points for a 4-context device are shown in the figure below.
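Since the figure is not reproduced here, the hypothetical VHDL sketch below (entity and signal names invented for illustration) models a single configuration point of a 4-context device: four context planes each hold one configuration bit, the active-context select chooses which plane drives the fabric, and an inactive plane can be loaded in the background while another executes.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Model of one configuration point in a 4-context device.
entity config_point is
  port (
    clk        : in  std_logic;
    write_en   : in  std_logic;
    write_ctx  : in  unsigned(1 downto 0);  -- context plane being loaded
    write_bit  : in  std_logic;
    active_ctx : in  unsigned(1 downto 0);  -- context currently in execution
    config_out : out std_logic              -- bit seen by the logic/routing fabric
  );
end entity config_point;

architecture rtl of config_point is
  type plane_t is array (0 to 3) of std_logic;
  signal planes : plane_t := (others => '0');
begin
  -- Background loading: an inactive plane can be written while another executes.
  process (clk)
  begin
    if rising_edge(clk) then
      if write_en = '1' then
        planes(to_integer(write_ctx)) <= write_bit;
      end if;
    end if;
  end process;

  -- Single-cycle context switch: changing active_ctx re-selects the output bit.
  config_out <= planes(to_integer(active_ctx));
end architecture rtl;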
Qn. Write a short note on Contexts
(136 Andrew Dehon Reconf. Arch. For General Purp. Comp.) (99 RC FPGA By Andrew Dehon)
Qn. Write a short note on Contexts
(136 Andrew Dehon Reconf. Arch. For General Purp. Comp.) (99 RC FPGA By Andrew Dehon)

• Multi-context devices have two main benefits over single-context devices. First, they permit background loading of configuration data during circuit operation, overlapping computation with reconfiguration.

• Second, they can switch between stored configurations quickly—some in a single clock cycle—dramatically reducing reconfiguration overhead if the next configuration is present in one of the alternate contexts. However, if the next needed configuration is not present, there is still a significant penalty while the data is loaded.

• For that reason, either all needed contexts must fit in the available
hardware or some control must determine when contexts should be loaded
in order to minimize the number of wasted cycles stalling while
reconfiguration completes.

• One of the drawbacks of multi-context architectures is that the additional configuration data and the required multiplexing occupy valuable area that could otherwise be used for logic or routing.
Qn. Write a short note on Contexts
(136 Andrew Dehon Reconf. Arch. For General Purp. Comp.) (99 RC FPGA By Andrew Dehon)

• Therefore, although multi-contexting can facilitate the use of an FPGA as virtual hardware, the physical capacity of a multi-context FPGA device is less than that of a single-context device of the same area.

• For example, a 4-context device has only 80 percent of the “active area”
(simultaneously usable logic/routing resources) that a single-context device
occupying the same fixed silicon area has. A multi-context device limited to
one active and one inactive context (a single SRAM plus a flip-flop) would
have the advantages of background loading and fast context switching
coupled with a lower area overhead, but it may not be appropriate if several
different contexts are frequently reused.

• Another drawback of multi-context devices is a direct consequence of the ability to perform a reconfiguration of the full device in a single cycle: spikes in dynamic power consumption. All configuration points are loaded from context memory simultaneously, and potentially the majority of configuration locations may be changed from 0 to 1 or vice versa. Switching many locations in a single cycle results in a significant momentary increase in dynamic power, which may violate system power constraints.
Qn. Write a short note on Contexts
(136 Andrew Dehon Reconf. Arch. For General Purp. Comp.) (99 RC FPGA By Andrew Dehon)

• Finally, if any state-storing component of the FPGA is not connected to the configuration information, as may be true for flip-flops, its state will not be restored when switching back to the previous context. However, this issue can also be seen as a feature, because it facilitates communication between configurations in other contexts by leaving partial results in place across configurations.
Qn. Write a short note on Functional Capacity and Functional Density
OR
Qn. What is the computational density? Explain with example.
(33 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• Functional capacity is a space-time metric which tells us how much computational work a structure can do per unit time. Correspondingly, our “gate-evaluation” metric is a unit of space-time capacity.

• That is, if a device can provide capacity Dcap gate evaluations per second, optimally, to the application, a task requiring Nge gate evaluations can be completed in time:

Ttask = Nge / Dcap

• In practice, limitations on the way the device’s capacity may be employed will often cause the task to take longer, resulting in a lower yielded capacity. If the task takes time TTask-actual to perform NTask_ge gate evaluations, then the yielded capacity is:

Dyielded = NTask_ge / TTask-actual
Qn. Write a short note on Functional Capacity and Functional Density
OR
Qn. What is the computational density? Explain with example.
(33 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Qn. Write a short note on Functional Capacity and Functional Density
OR
Qn. What is the computational density? Explain with example.
(33 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• This density metric tells us the relative merits of dedicating a portion of our
silicon budget to a particular computing architecture.

• The density focus makes the implicit assumption that capacity can be
composed and scaled to provide the aggregate computational throughput
required for a task.

• To the extent this is true, architectures with the highest capacity density that
can actually handle a problem should require the least size and cost.

• Remember also that resource capacity is primarily interesting when the resource is limited. We look at computational capacity to the extent we are compute limited. When we are i/o limited, performance may be much more controlled by i/o bandwidth.
Qn. Write a short note on Functional Capacity and Functional Density
OR
Qn. What is the computational density? Explain with example.
(33 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

• Functional Diversity – Instruction Density

• Functional density alone only tells us how much raw throughput we get. It says nothing about how many different functions can be performed and on what time scale.
• For this reason, it is also interesting to look at the functional diversity or instruction density. We thus define functional diversity as:

Functional Diversity = Ninstr / A

Concluding Remarks on Functional Diversity:
• We use functional diversity to indicate how many distinct function descriptions are resident per unit of area.

• This tells us how many different operations can be performed within part or
all of an IC without going outside of the region or chip for additional
instructions.
Qn. Write a short note on Functional Capacity and Functional Density
OR
Qn. What is the computational density? Explain with example.
(33 Andrew Dehon Reconf. Arch. For General Purp. Comp.)

Data Density: - Space must also be allocated to hold data for the computation. This area also competes with logic, interconnect, and instructions for silicon area. Thus, we will also look at data density when examining the capacity of an architecture. If we put Nbits of data for an application into a space A, we get a data density:

Data Density = Nbits / A
Qn. Give mathematical analysis of switch, channel and wire growth.
OR
Qn. Give the mathematical model for Rent’s rule based channel and wire
growth.
( contains many equations 87 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Qn. What are the effects of Interconnect Granularity?
( contains many equations 112 andrew dehon general purpose computing)
Qn. What is single context and multi context FPGA? Discuss these
issues pertaining to present FPGA architecture. (7 An Introduction to Reconfigurable
Computing)

 Single Context FPGA
• They hold only one configuration at a time and are programmed using a serial stream of configuration information, because only sequential access to the configuration memory is supported.

• Any change to a configuration on this type of FPGA requires a complete reprogramming of the entire chip.

• Although this simplifies the reconfiguration hardware, it incurs a high overhead when only a small part of the configuration memory needs to be changed.

• A single context device is a serially programmed chip that requires a complete reconfiguration in order to change any of the programming bits. Most commercial FPGAs are of this variety.

• To implement runtime reconfiguration on this type of device, configurations must be grouped into full contexts, and the complete contexts are swapped in and out of the hardware as needed.
Qn. What is single context and multi context FPGA? Discuss these
issues pertaining to present FPGA architecture. (7 An Introduction to Reconfigurable
Computing)
Qn. What is single context and multi context FPGA? Discuss these
issues pertaining to present FPGA architecture. (7 An Introduction to Reconfigurable
Computing)
Multi Context FPGA
• It includes multiple memory bits for each programming location.

• These memory bits can be thought of as multiple planes of configuration information, only one of which can be active at a given moment, but the device can quickly switch between different planes, or contexts, of already programmed configuration.

• A multi-context device can be considered as a multiplexed set of single-context devices, which requires that a context be fully reprogrammed to perform any modification.
• This system does allow the background loading of a context, where one plane is active and in execution, while an inactive plane is in the process of being programmed.

• Fast switching between contexts makes the grouping of configurations into contexts slightly less critical, because if a configuration is on a different context than the one that is currently active, it can be activated within the order of nanoseconds, as opposed to milliseconds or longer.
Qn. What is single context and multi context FPGA? Discuss these
issues pertaining to present FPGA architecture.
• A multi-context device has multiple layers of programming bits, where each
layer can be active at a different point in time.

• An advantage of the multi-context FPGA over a single-context architecture is that it allows for an extremely fast context switch (on the order of nanoseconds), whereas the single-context device may take milliseconds or more to reprogram.

• The multi-context design does allow for background loading, permitting one context to be configuring while another is in execution.

• Each context of a multi-context device can be viewed as a separate single-context device.
Multi-FPGA Partitioning and Placement Approaches (676 andrew dehon
FPGA)
Design partitioning and placement play an important role in system
performance for dedicated-wire FPGA-based logic emulators.

Because FPGA pins are such critical resources for these systems, the primary
goal of partitioning is to minimize communication between partitions.

A large number of algorithms have been developed that split logic into two pieces (bipartitioning) and multiple pieces (multiway partitioning) based on both logic and I/O constraints.

Unfortunately, the need to satisfy dual constraints complicates their application to dedicated-wire emulation systems.

One way to address the partitioning and placement problem is to perform both
operations simultaneously.

For example, a multiway partitioning algorithm can be used to simultaneously generate multiple partitions while respecting inter-FPGA routing limitations.
Unfortunately, multiway partitioning algorithms are computationally
expensive (often exhibiting exponential runtime in the number of
partitions), which makes them infeasible for systems containing tens or
hundreds of FPGA devices.

As a result of inter-FPGA bandwidth limitations and the need for reasonable CAD tool runtime, most dedicated-wire FPGA emulation systems use iterative bipartitioning for combined partitioning and placement [6].

This approach has been effectively applied to both crossbar and mesh
topologies [31].

The use of recursive bipartitioning for dedicated-wire emulators creates several problems.

Although it can be used effectively to locate an initial cut, it is inherently greedy.

The bandwidth of the initial cut is optimized, but may not serve as an
effective start point for further cuts.
This issue may be resolved by ordering hierarchical bipartition cuts
based on criticality [5].

Partitioning for multiplexed-wire systems is simple compared to the dedicated-wire case, because it must meet only FPGA logic constraints, rather than both logic and pin constraints.

Unlike the dedicated-wire case, partitioning and placement are generally performed not simultaneously but rather sequentially [2].

First, recursive bipartitioning successively divides the original design into a series of logic partitions that meet the logic capacity requirements of the target FPGAs.

During partitioning, the amount of logic required to multiplex inter-FPGA signals must be estimated, because both design partition logic and multiplexing logic must be included in the logic capacity analysis.

Following partitioning, individual partitions are assigned to individual FPGAs.
Placement typically attempts to minimize system-wide communication
by minimizing inter-FPGA distance, particularly on critical paths.

To fully explore placement choices, simulated annealing is frequently used for multi-FPGA placement [2].
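As a brief reminder of how simulated annealing explores placements (this is the standard Metropolis acceptance rule, not a detail taken from the source), a candidate move of a partition to a different FPGA is accepted with probability:

\[
P(\text{accept}) =
\begin{cases}
1, & \Delta C \le 0\\
e^{-\Delta C / T}, & \Delta C > 0
\end{cases}
\]

Here ΔC is the change in the placement cost (for example, estimated inter-FPGA distance weighted toward critical paths) and T is the annealing temperature, which is gradually lowered so the search settles into a low-cost placement.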
