By
Prof. K. R. Saraf
Syllabus
Module I: - Types of computing and introduction to RC: General Purpose Computing,
Domain-Specific Processors, Application Specific Processors; Reconfigurable
Computing, Fields of Application; Reconfigurable Device Characteristics,
Configurable, Programmable, and Fixed-Function Devices; General-Purpose
Computing, General-Purpose Computing Issues
Module III: - Different architectures for fast computing viz. PDSPs, RALU,
VLIW, Vector Processors, Memories, CPLDs, FPGAs, Multi-context FPGA, Partial
Reconfigurable Devices; Structure and Composition of Reconfigurable Computing
Devices: Interconnect, Instructions, Contexts, Context switching, RP space model
For the past several years, industry and academia have focused largely on
the task of building the highest performance processor, instead of trying to
build the highest performance general-purpose computing engine. When
active area was extremely limited, this was a very sensible approach.
However, as silicon real estate continues to increase far beyond the space
required to implement a competent processor, it is time to re-evaluate
general-purpose architectures in light of shifting resource availability and
cost.
Reconfigurable Computing
In particular, an interesting space has opened between the
extremes of general-purpose processors and specialized ASICs.
This space is most easily seen by looking at the binding time for
device function.
Disadvantages of RC
Reconfiguration time might be critical in run-time reconfigurable systems.
Low utilization of hardware in configurable systems.
Qn. 1. State and explain characteristics of Reconfigurable
Devices. (20 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Reconfigurable Device Characteristics
Broadly considered, reconfigurable devices fill their silicon area with a
large number of computing primitives, interconnected via a configurable
network.
To first order, one can think about turning an entire task into hardware
dataflow and mapping it onto the reconfigurable substrate.
Reconfigurable computing generally provides spatially-oriented processing
rather than the temporally-oriented processing (limited by time) typical of
programmable architectures such as microprocessors.
If this effect is jointly typical with the instructions-per-issue-slot number
above, then we would yield at most a small fraction of the theoretical
functional density. For the HaL case, this reduces the yielded density
still further.
Qn: State and explain issues in reconfigurable network design.
4. Minimize delays – The delay through the routing network can easily be
the dominant delay in a programmable technology. Care is required to
minimize interconnect delays. Two significant factors of delay are:
(a) Propagation and fan-out delay – Interconnect delay on a wire is
proportional to distance and capacitive loading (fan-out). This makes
interconnect delay roughly proportional to the distance run, especially when
there are regular taps into the signal run. Consequently, small/short signal
runs are faster than long signal runs.
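A minimal first-order model (lumped RC with an assumed fixed driver resistance Rd; the symbols are ours, not the source's) makes this proportionality concrete: for a run of length L with wire capacitance cw per unit length and n regular taps each presenting a load Ctap,
$$t_{prop} \approx R_d \left( c_w L + n\,C_{tap} \right)$$
With taps spaced evenly, n grows in proportion to L, so the delay grows roughly linearly with the distance run.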
Delayed binding and temporal reuse work closely together and occur at many
scales to provide the characteristics we now expect from general-purpose
computing devices.
This basic theme recurs at many different levels in our conventional systems:
Market Level – Rather than dedicating a machine design to a single
application or application family, the design effort may be utilized for many
different applications.
System Level – Rather than dedicating an expensive machine to a single
application, the machine may perform different applications at different times
by running different sets of instructions.
Application Level – Rather than spending precious real estate to build a
separate computational unit for each different function required, central
resources may be employed to perform these functions in sequence with an
additional input, an instruction, telling it how to behave at each point in time.
Algorithm Level – Rather than fixing the algorithms which an application uses,
an existing general-purpose machine can be reprogrammed with new
techniques and algorithms as they are developed.
User Level – Rather than fixing the function of the machine at the supplier, the
instruction stream specifies the function, allowing the end user to use the
machine as best suits his needs.
Machines may be used for functions which the original designers did not
conceive. Further, machine behavior may be upgraded in the field without
incurring any hardware or hardware handling costs. In the past, processors
were virtually the only devices which had these characteristics and served as
general-purpose building blocks. Today, many devices, including
reconfigurable components, also exhibit the key properties and benefits
associated with general-purpose computing. These devices are economically
interesting for all of the above reasons.
As we will see, the handling of this additional input is one of the key
distinguishing features between different kinds of general-purpose
computing structures.
When the functional diversity is large and the required task throughput is
low, it is not efficient to build up the entire application dataflow spatially
in the device.
Regular tasks have few data dependent conditionals such that all data
processed requires essentially the same sequence of operations with highly
predictable flow of execution.
Nested loops with static bounds and minimal conditionals are the typical
example of regular computational tasks.
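As an illustrative sketch (the filter and sizes are hypothetical, not from the text), a fixed-length FIR filter shows what makes a task regular: static loop bounds and the same multiply-accumulate sequence for every sample.

# A regular computational task: static loop bounds and no
# data-dependent control flow -- every sample is processed by the
# identical sequence of operations, so the dataflow unrolls
# naturally onto a spatial (reconfigurable) substrate.
def fir(samples, taps):
    out = []
    for i in range(len(taps) - 1, len(samples)):  # static, predictable bounds
        acc = 0
        for j, t in enumerate(taps):              # same ops for every sample
            acc += t * samples[i - j]
        out.append(acc)
    return out

print(fir([1, 2, 3, 4, 5], [0.5, 0.5]))  # two-tap moving average: [1.5, 2.5, 3.5, 4.5]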
In contrast, for irregular tasks the operations required vary widely according
to the data processed, and the flow of operation control is difficult to predict.
Q. Explore design metrics such as density, diversity, and capacity.
Give mathematical analysis and circuit schematics to support.
(32 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
The first question is difficult since it places few bounds on the properties of
the computational tasks.
We thus end up addressing more focused questions, but ones which give us
insight into the properties and conditions under which various computational
structures are favorable.
We will address the second question – that of characterizing device capacity –
using a “gate evaluation” metric.
That is, we consider the minimum size circuit which would be required to
implement a task and count the number of gates which must be evaluated to
realize the computational task.
We assume the collection of gates available are all k-input logic gates, and use
k = 4.
This models our 4-LUT as discussed in Section 2.4, as well as more traditional
gate-array logic.
The results would not change characteristically using a different, finite value
as long as k ≥ 2.
Functional Density
Functional capacity is a space-time metric which tells us how much
computational work a structure can do per unit time.
That is, if a device can provide capacity Dcap gate evaluations per second,
optimally, to the application, a task requiring Nge gate evaluations can be
completed in time:
$$T_{compute} = \frac{N_{ge}}{D_{cap}}$$
In practice, limitations on the way the device's capacity may be employed will
often cause the task to take longer, resulting in a lower yielded capacity.
If the task takes time Ttask to perform the Nge gate evaluations, then the
yielded capacity is:
$$D_{yield} = \frac{N_{ge}}{T_{task}}$$
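A worked example with hypothetical numbers (ours, not the text's): a device delivering Dcap = 10^9 gate-evaluations/s completes a task of Nge = 10^6 gate evaluations in 1 ms; if limitations stretch the run to Ttask = 4 ms, only a quarter of the peak capacity is yielded:
$$T_{compute} = \frac{10^6}{10^9\,\mathrm{s^{-1}}} = 1\ \text{ms}, \qquad D_{yield} = \frac{10^6}{4\times 10^{-3}\,\mathrm{s}} = 2.5\times 10^8\ \text{gate-evaluations/s}$$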
The capacity which a particular structure can provide generally increases with
area.
To understand the relative benefits of each computing structure or architecture
independent of the area in a particular implementation, we can look at the
capacity provided per unit area.
We normalize area in units of λ², where λ is half the minimum feature size, to
make the results independent of the particular process feature size.
Our metric for functional density, then, is capacity per unit area, measured in
terms of gate evaluations per unit space-time, in units of
gate-evaluations/λ²·s:
$$D_{fd} = \frac{N_{ge}}{A \cdot T_{task}}$$
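Continuing the hypothetical numbers above: if that 10^9 gate-evaluations/s of capacity comes from a die of area A = 10^8 λ², the functional density is:
$$D_{fd} = \frac{10^9\ \text{gate-evaluations/s}}{10^8\,\lambda^2} = 10\ \text{gate-evaluations}/\lambda^2\cdot\text{s}$$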
This density metric tells us the relative merits of dedicating a portion of our
silicon budget to a particular computing architecture.
The density focus makes the implicit assumption that capacity can be
composed and scaled to provide the aggregate computational throughput
required for a task.
To the extent this is true, architectures with the highest capacity density that
can actually handle a problem should require the least size and cost.
Whether or not a given architecture or system can actually be composed and
scaled to deliver some aggregate capacity requirement is also an interesting
issue, but one which is not the focus here.
When we are I/O limited, performance may be controlled much more by I/O
bandwidth and by the buffer space used to reduce the need for I/O.
Functional density alone only tells us how much raw throughput we get. It says
nothing about how many different functions can be performed and on what
time scale.
This tells us how many different operations can be performed within part or all
of an IC without going outside of the region or chip for additional instructions.
We thus define functional diversity as the number of distinct instructions
resident per unit area:
$$\text{Functional Diversity} = \frac{N_{inst}}{A}$$
FPGA capacity has not changed dramatically over time, but the sample size is
small.
There is a slight upward trend which is probably representative of the relative
youth of the architecture.
Some effects which may prevent an application from achieving this peak
include:
Limited interconnect – When the network connectivity is inadequate, all of a
device’s capacity cannot be used.
This may require either that cells buffer and transmit data or simply that cells
go unused in the area.
Full connectivity is not supported even within the interconnect of a single tile.
Typically, the interconnect block includes:
A hierarchy of line lengths – some interconnect lines span (i.e., cover) a
single cell, some a small number of cells, and some an entire row or column.
Limited, but not complete, opportunity for corner turns.
Granularity can refer either to the extent to which a larger entity is
subdivided, or to the extent to which groups of smaller indistinguishable
entities have joined together to become larger distinguishable entities.
For example, a kilometer broken into centimeters has finer granularity than a
kilometer broken into meters.
output <= NOT input; -- output assumes new value in one delta cycle.
Intrinsic delay is the delay internal to the gate, from an input pin of the cell
to the output pin of the cell.
However, as the feature size is scaled down, the wire's RC product remains
constant, so the interconnect RC delay does not scale down with feature size.
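A minimal derivation under classical ideal-scaling assumptions (every dimension of a local wire, including its length L, width W, thickness T, and height H above the substrate, shrinks by the same factor s > 1) shows why:
$$R' = \frac{\rho\,(L/s)}{(W/s)(T/s)} = sR, \qquad C' \approx \frac{\varepsilon\,(L/s)(W/s)}{H/s} = \frac{C}{s}, \qquad R'C' = RC$$
So while gate delays improve with scaling, the delay of a scaled local wire stays constant, and interconnect increasingly dominates.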
Trusted Hardware
Problem Statement Due to the immense costs associated with fabricating chips
in the latest technology node, the design, manufacturing and testing of
integrated circuits has moved across a number of companies residing in
various countries. This makes it difficult to secure the supply chain and opens
up a number of possibilities for security threats.
The problem boils down to the question: how can these fabless companies
assure that the chip received from the foundry is exactly the same as the
design given to the foundry − nothing more, nothing less?
Physical attacks
Problem Statement Physical attacks monitor characteristics of the system in an
attempt to glean (collect) information about its operation.
Side channel attacks are less invasive attacks that exploit physical
phenomena such as power, timing, electromagnetic radiation, thermal profile,
and acoustic characteristics of the system
In the end, these different design tools and IP produce a set of inter-operating
cores, and as Ken Thompson said in his Turing Award Lecture, “You can’t
trust code that you did not totally create yourself”.
You can only trust your final system as much as your least-trusted design
path.
Subversion of the design tools could easily result in malicious hardware being
inserted into the final implementation.
Alternatively, you must develop ways in which you can trust the hardware and
the design tools.
Design Theft
Problem Statement The bit stream necessarily holds all the information that is
required to program the FPGA to perform a specific task. This data is often
proprietary information, and the result of many months/years of work.
Therefore, protecting the bit stream is the key to eliminating intellectual
property theft. Furthermore, the bit stream may contain sensitive data, e.g. the
cryptographic key.
System Security
Problem Statement The problem that we are dealing with here is the ability to
make assurances that during the deployment of the device it does not do
anything potentially harmful. This requires that we perform monitoring of the
system and provide the system with information on what it can and cannot do.
Qn. Write a short note on Instruction and interconnect growth
(109 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
This is one reason that single-context devices can afford to use sparse
interconnect encodings. Since the wires and switches are the dominant and
limiting resource, additional configuration bits are not costly. In the wire-
limited case, we may have free area under long routing channels for memory
cells. In fact, dense encoding of the configuration space has the negative
effect that control signals must be routed from the configuration memory cells
to the switching points. The closer we try to squeeze the bit stream encoding
to its minimum, the less locality we have available between configuration bits
and controlled switches. These control lines compete with network wiring,
exacerbating the routing problems on a wire-dominated layout.
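A small sketch of the trade-off (the switch and configuration counts are assumed for illustration): a sparse encoding spends one memory cell per switch, placed right at the switch, while a dense encoding stores only the minimal number of bits but must then route decoded control signals out to the switch points.

# Configuration bits for one routing point: sparse one-bit-per-switch
# versus a dense encoding of only the legal switch patterns.
# n_switches and legal_configs are hypothetical example values.
from math import ceil, log2

n_switches = 16         # switches at this routing point (assumed)
legal_configs = 24      # legal on/off patterns among them (assumed)

sparse_bits = n_switches                  # memory cell sits at each switch
dense_bits = ceil(log2(legal_configs))    # minimal bits, but needs decode + control wiring

print(sparse_bits, dense_bits)  # 16 vs 5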
Qn. Write a short note on RP space area model
(126 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Qn. Write a short note on Granularity
(132 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
• Example: FPGA
• Issues: Bit-level operation => many processing units => large routing
overhead => low silicon area efficiency => higher power consumption
• One of the drawbacks of coarse-grained architectures is that they tend to
lose some of their utilization and performance if they need to perform smaller
computations than their granularity provides; for example, a one-bit add on
a four-bit-wide functional unit would waste three bits. This problem can be
solved by having a coarse-grain array (reconfigurable datapath array, rDPA)
and an FPGA on the same chip.
• As their functional blocks are optimized for large computations and typically
comprise word-wide arithmetic logic units (ALUs), they will perform these
computations more quickly and more power-efficiently than a set of
interconnected smaller functional units; this is due to the connecting wires
being shorter, resulting in less wire capacitance and hence faster and lower
power designs.
• Often the type of applications to be run are known in advance allowing the
logic, memory and routing resources to be tailored to enhance the
performance of the device whilst still providing a certain level of flexibility for
future adaptation. Examples of this are domain specific arrays aimed at
gaining better performance in terms of power, area, throughput than their
more generic finer grained FPGA cousins by reducing their flexibility.
Qn. Write a short note on Contexts
(136 Andrew Dehon Reconf. Arch. For General Purp. Comp.) (99 RC FPGA By Andrew Dehon)
• For that reason, either all needed contexts must fit in the available
hardware or some control must determine when contexts should be loaded
in order to minimize the number of wasted cycles stalling while
reconfiguration completes.
• For example, a 4-context device has only 80 percent of the “active area”
(simultaneously usable logic/routing resources) that a single-context device
occupying the same fixed silicon area has. A multi-context device limited to
one active and one inactive context (a single SRAM plus a flip-flop) would
have the advantages of background loading and fast context switching
coupled with a lower area overhead, but it may not be appropriate if several
different contexts are frequently reused.
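The 80 percent figure is consistent with a simple area model (the per-context overhead fraction f below is our illustrative assumption, not a number from the text): if each extra stored context adds configuration memory costing a fraction f of a single-context cell's area, then for an N-context device in fixed silicon area:
$$\text{active fraction}(N) = \frac{1}{1+(N-1)f}, \qquad f = \frac{1}{12} \;\Rightarrow\; \text{active fraction}(4) = \frac{1}{1+3/12} = 0.8$$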
• That is, if a device can provide capacity Dcap gate evaluations per second,
optimally, to the application, a task requiring Nge gate evaluations can be
completed in time: $T_{compute} = N_{ge}/D_{cap}$
• This density metric tells us the relative merits of dedicating a portion of our
silicon budget to a particular computing architecture.
• The density focus makes the implicit assumption that capacity can be
composed and scaled to provide the aggregate computational throughput
required for a task.
• To the extent this is true, architectures with the highest capacity density that
can actually handle a problem should require the least size and cost.
• This tells us how many different operations can be performed within part or
all of an IC without going outside of the region or chip for additional
instructions.
Qn. Write a short note on Functional Capacity and Functional Density
OR
Qn. What is the computational density? Explain with example.
(33 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
Data Density: Space must also be allocated to hold data for the
computation. This area also competes with logic, interconnect, and
instructions for silicon area. Thus, we will also look at data density when
examining the capacity of an architecture. If we put Nbits of data for an
application into a space A, we get a data density:
$$D_{data} = \frac{N_{bits}}{A}$$
Qn. Give mathematical analysis of switch, channel and wire growth.
OR
Qn. Give the mathematical model for Rent’s rule based channel and wire
growth.
( contains many equations 87 Andrew Dehon Reconf. Arch. For General Purp. Comp.)
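The elided equations build on Rent's rule; as a reminder of its form (the constants here are generic, not the source's fitted values), Rent's rule relates the I/O bandwidth of a region containing N gates to its size:
$$IO = C \cdot N^{p}, \qquad 0 \le p \le 1$$
For p > 0.5 the bisection bandwidth grows as N^p, and the layout area needed to carry that wiring grows as Θ(N^{2p}), faster than the Θ(N) needed for the logic itself, so interconnect rather than logic comes to dominate area.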
Qn. What are the effects of Interconnect Granularity?
( contains many equations 112 andrew dehon general purpose computing)
Qn. What is single context and multi context FPGA? Discuss these
issues pertaining to present FPGA architecture. (7 An Introduction to Reconfigurable
Computing)
• This system does allow the background loading of the context, where one
plane is active and in execution, while an inactive plane is in the process of
being programmed.
• The multi-context design does allow for background loading, permitting one
context to be configuring while another is in execution.
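An illustrative timing model (the phase counts and times below are assumed, not measured): a single-context device stalls for the full configuration time before every phase, while a two-context device overlaps the next load with the current phase's execution, exposing only the first load.

# Compare total runtime across reconfiguration phases for a
# single-context FPGA (every load stalls execution) and a
# two-context FPGA (next context loads in the background).
t_exec = 1.0e-3      # execution time per phase, seconds (assumed)
t_config = 0.5e-3    # configuration load time, seconds (assumed)
phases = 4           # number of distinct configurations used (assumed)

single_context = phases * (t_config + t_exec)              # load, run, load, run, ...
multi_context = t_config + phases * max(t_exec, t_config)  # only first load exposed
print(single_context, multi_context)  # 6.0 ms vs 4.5 ms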
Because FPGA pins are such critical resources for these systems, the primary
goal of partitioning is to minimize communication between partitions.
A large number of algorithms have been developed that split logic into two
pieces (bipartitioning) and multiple pieces (multiway partitioning) based on
both logic and I/O constraints.
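A minimal sketch of the greedy move-based idea behind such bipartitioners (in the spirit of Fiduccia-Mattheyses; the toy graph, gain function, and crude balance tolerance are our simplifications, not the cited algorithms):

# Greedy bipartitioning sketch: repeatedly move the node whose move
# most reduces the cut size, subject to a simple balance tolerance.
def gain(node, edges, side):
    # +1 per incident cut edge (moving the node uncuts it),
    # -1 per incident uncut edge (moving the node cuts it).
    return sum((1 if side[u] != side[v] else -1)
               for u, v in edges if node in (u, v))

def bipartition(nodes, edges, tolerance=2, max_moves=100):
    side = {n: i % 2 for i, n in enumerate(nodes)}  # arbitrary initial split
    for _ in range(max_moves):
        n0 = sum(1 for n in nodes if side[n] == 0)
        def balanced(n):  # keep the halves within the tolerance after the move
            new_n0 = n0 - 1 if side[n] == 0 else n0 + 1
            return abs(2 * new_n0 - len(nodes)) <= tolerance
        candidates = [n for n in nodes if balanced(n) and gain(n, edges, side) > 0]
        if not candidates:
            break  # no balanced, cut-reducing move remains
        best = max(candidates, key=lambda n: gain(n, edges, side))
        side[best] = 1 - side[best]
    return side

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
print(bipartition(["a", "b", "c", "d"], edges))  # e.g. {'a': 0, 'b': 0, 'c': 0, 'd': 1}

Real FM-style partitioners add strict balance and I/O constraints, bucket structures for constant-time best-gain lookup, and tentative move sequences with rollback; only the gain-driven move loop is shown here.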
One way to address the partitioning and placement problem is to perform both
operations simultaneously.
This approach has been effectively applied to both crossbar and mesh
topologies [31].
The bandwidth of the initial cut is optimized, but it may not serve as an
effective starting point for further cuts.
This issue may be resolved by ordering hierarchical bipartitioning cuts
based on criticality [5].