SEMINAR ON

INTELLIGENT RAM

Seminar 2006, Department of ECE, VJEC


Abstract

As the name suggests, 'Intelligent RAM' is the integration of intelligence and RAM: intelligence stands for the microprocessor, and RAM is the random access memory, the volatile memory that is an essential part of computing systems from desktop computers to supercomputers. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency, as well as to allow more flexible selection of memory size and organization. In addition, IRAM promises savings in power and board area. This seminar is a genuine effort to introduce this new idea; it covers the inspirations, advantages, and architecture of IRAM, and the technologies that make the revolutionary idea of Intelligent RAM possible.


Contents

1. Introduction

2. Inspirations of IRAM

3. IRAM - Architecture

4. IRAM - Benchmarking

5. Advantages of IRAM

6. Disadvantages of IRAM

7. Conclusion

8. References


1. Introduction
The division of the semiconductor industry into microprocessor and memory
camps provides many advantages. First and foremost, a fabrication line can be tailored to
the needs of the device. Microprocessor fabrication lines offer fast transistors to make fast
logic and many metal layers to accelerate communication and simplify power distribution,
while DRAM fabrication lines offer many polysilicon layers to achieve both small DRAM cells
and low leakage current to reduce the DRAM refresh rate. Separate chips also mean
separate packages, allowing microprocessors to use expensive packages that dissipate high
power (5 to 50 watts) and provide hundreds of pins to make wide connections to external
memory, while allowing DRAMs to use inexpensive packages which dissipate low power (1
watt) and use only a few dozen pins. Separate packages in turn mean computer designers
can scale the number of memory chips independent of the number of processors: most
desktop systems have 1 processor and 4 to 32 DRAM chips, but most server systems have 2
to 16 processors and 32 to 256 DRAMs. Memory systems have standardized on the Single
In-line Memory Module (SIMM) or Dual In-line Memory Module (DIMM), which allows
the end user to scale the amount of memory in a system. Quantitative evidence of the
success of the industry is its size: in 1998 DRAMs were a $37B industry and
microprocessors were a $20B industry. In addition to financial success, the technologies of
these industries have improved at unparalleled rates. DRAM capacity has quadrupled on
average every 3 years since 1986, while microprocessor speed has done the same since
1996. The split into two camps has its disadvantages as well. Two trends call into
question the current practice of microprocessors and DRAMs being fabricated as different
chips on different fabrication lines: 1) the gap between processor and DRAM speed is
growing at 50% per year; and 2) the size and organization of memory on a single DRAM
chip is becoming awkward to use in a system, yet chip capacity is growing at 60% per year. These disadvantages prompted the search for a new technology, and as a result the
integrated technology 'Intelligent RAM' came into existence. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency, as well as to allow more flexible selection of memory size and organization. In addition, IRAM promises savings in power and board area. The IRAM technology was proposed by David Patterson of the University of California, Berkeley, USA, as an alternative to the current memory-processor combination, which has many pitfalls. It was implemented by a group of postgraduate students led by Patterson.
2. Inspirations of IRAM
Processor - Memory (DRAM) Gap or Latency
It is the gap between the speed of the microprocessor and that of the RAM. Since the processor and RAM are fabricated separately on two fabrication lines, their speeds improve at two different rates, creating a performance gap between them.
Figure 2.1 shows that while microprocessor performance has been improving at a
rate of 60% per year, the access time to DRAM has been improving at less than 10% per
year. Hence computer designers are faced with an increasing “Processor-Memory
Performance Gap”, which is now the primary obstacle to improved computer system
performance. System architects have attempted to bridge the processor-memory
performance gap by introducing deeper and deeper cache memory hierarchies;
unfortunately, this makes the memory latency even longer in the worst case. The main
memory latency in the system is a factor of four larger than the raw DRAM access time; this
difference is due to the time to drive the address off the microprocessor, the time to
multiplex the addresses to the DRAM, the time to turn around the bidirectional data bus, the
overhead of the memory controller, the latency of the SIMM connectors, and the time to
drive the DRAM pins first with the address and then with the return data.
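To make the factor-of-four concrete, here is a plain arithmetic sketch in C. The individual delay values are hypothetical illustration numbers chosen to match the overheads listed above, not measurements from the seminar.

    #include <stdio.h>

    /* Illustrative breakdown of main-memory latency. The numbers are
     * assumed, chosen only to show how the listed overheads multiply
     * the raw DRAM access time by roughly a factor of four. */
    int main(void) {
        double raw_dram   = 60.0;  /* ns: raw DRAM access time (assumed)      */
        double drive_addr = 30.0;  /* ns: drive address off chip + multiplex  */
        double bus_turn   = 30.0;  /* ns: turn around the bidirectional bus   */
        double controller = 60.0;  /* ns: memory controller + SIMM connectors */
        double pins       = 60.0;  /* ns: drive DRAM pins, address then data  */

        double total = raw_dram + drive_addr + bus_turn + controller + pins;
        printf("total = %.0f ns (%.1fx the raw access time)\n",
               total, total / raw_dram);   /* prints: total = 240 ns (4.0x) */
        return 0;
    }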


Figure 2.1

Processor - Memory Performance Gap Penalty

To overcome the performance gap, designers add cache memory, which requires additional investment, and the extra transistors also increase power consumption. Yet the cache is inefficient, because the performance gap is very large and is growing at a rate of 50% per year. These extraordinary delays in the memory hierarchy occur despite tremendous resources being spent trying to bridge the processor-memory performance gap. We call the percentage of die area and transistors dedicated to caches and other memory latency-hiding hardware the "Memory Gap Penalty". Table 2.1 quantifies the penalty; it has grown to 60% of the area and almost 90% of the transistors in several microprocessors. Moreover, as the degree of integration rises and transistor feature sizes shrink, the cost of production increases further, which implies the penalty will grow with new generations of microprocessors.


| Year | Processor                | On-Chip Cache                  | Memory Gap Penalty: % Die Area    | Memory Gap Penalty: % Transistors | Die Area (mm²)   | Total Transistors     |
|------|--------------------------|--------------------------------|-----------------------------------|-----------------------------------|------------------|-----------------------|
| 1994 | Digital Alpha 21164      | I: 8 KB, D: 8 KB, L2: 96 KB    | 37.4%                             | 77.4%                             | 298              | 9.3 M                 |
| 1996 | Digital StrongARM SA-110 | I: 16 KB, D: 16 KB             | 60.8%                             | 94.5%                             | 50               | 2.1 M                 |
| 1993 | Intel Pentium            | I: 8 KB, D: 8 KB               | 31.9%                             | ≈32%                              | ≈300             | 3.1 M                 |
| 1995 | Intel Pentium Pro        | I: 8 KB, D: 8 KB, L2: 512 KB   | P: 18.5%, +L2: 100% (total 52.2%) | P: 11.2%, +L2: 100% (total 76.5%) | P: 226, +L2: 186 | P: 3.5 M, +L2: 25.0 M |
| 2000 | Intel Pentium 4          | I: 8 KB, D: 12 KB, L2: 512 KB  | P: 22.5%, +L2: 100% (total 64.2%) | P: 18.2%, +L2: 100% (total 87.5%) | P: 242, +L2: 282 | P: 5.5 M, +L2: 31.0 M |
| 2001 | AMD Athlon               | I: 32 KB, D: 32 KB, L2: 512 KB | P: 20.2%, +L2: 100% (total 62.0%) | P: 20.2%, +L2: 100% (total 88.5%) | P: 268, +L2: 312 | P: 6.5 M, +L2: 33.0 M |

(I = instruction cache, D = data cache; P = processor die, +L2 = separate L2 cache die.)

Table 2.1

Memory Revenue
Memory revenue has been decreasing rapidly. Even though the need for DRAM chips is increasing, DRAM manufacturers are not reaping the benefit because of the high cost of production. This makes RAM manufacturers look for an alternative that can reduce the cost of production and so preserve their revenue and their business.
Figure 2.2 shows that DRAM revenue fell continuously from the first quarter of 1999, after reaching a maximum of 16 billion US dollars. It showed a slight rise in the first quarter of 2000, reaching 7 billion, after which it slid downward for three consecutive years, a decline that has not yet ceased.


Figure 2.2
I/O Bus Performance Lag
The parallel I/O bus is inefficient because it lags behind the processor and memory in bandwidth. If we scale the bus by increasing the clock speed and bus width to raise performance, the packaging cost increases, and scaling also raises the pin count. The performance lag of the parallel I/O bus therefore points to the need for a much more efficient implementation, one that fixes the bandwidth scarcity without increasing production cost and pin count when scaled for higher performance and efficiency.

| PCI Bits | Pin Count |
|----------|-----------|
| 16       | ~20       |
| 32       | ~50       |
| 64       | ~90       |

Table 2.2
Table 2.2 shows the rapid increase in pin count, and hence in production cost, when the PCI bus is widened to increase the performance of the I/O system.

Database Demand for Processing Power and Memory


Databases and other software demand more and more processing power and memory, yet both fall short of the actual demand. The demand is growing rapidly because of high-end workloads such as multimedia applications, which need great processing power and large amounts of RAM; their requirements keep rising with each new release, squeezing the final drop of performance from the computing system.

Figure 2.3
Figure 2.3 shows that the database demand for processing power and memory doubles every 9 months, according to Greg's Law, whereas microprocessor speed doubles only every 18 months (following Moore's Law) and DRAM speed only every 120 months. Both the microprocessor and the memory therefore fall short of the demand from databases and other software applications, creating a Database-Processor performance gap and a Database-Memory performance gap, respectively. These gaps are widening continuously and rapidly, since new software applications are ever hungrier for processing power and memory. The database demand for more processing power and memory thus prompted computer experts to look for a technology that narrows the gap between the database demand and the processor as well as the memory.
Fewer DRAMs/System over Time
While the Processor-Memory Performance Gap has widened to the point where it
is dominating performance for many applications, the cumulative effect of two decades of
60% per year improvement in DRAM capacity has resulted in huge individual DRAM
chips. This has put the DRAM industry in something of a bind: the capacity per DRAM chip is growing at a rate of 60% per year, but the minimum memory required per system is growing at only 25%-30% per year.

Figure 2.4
Figure 2.4 shows that over time the number of DRAM chips required for a
reasonably configured PC has been shrinking. The required minimum memory size,
reflecting application and operating system memory usage, has been growing at only about
half to three-quarters the rate of DRAM chip capacity. For example, consider a word
processor that requires 8MB; if its memory needs had increased at the rate of DRAM chip
capacity growth, that word processor would have had to fit in 80 KB in 1986 and 800 bytes in 1976. The result of the prolonged rapid improvement in DRAM capacity is fewer DRAM
chips needed per PC, to the point where soon many PC customers may require only a single
DRAM chip. Unused memory bits also increase the effective cost. So customers may no
longer automatically switch to the larger capacity DRAM as soon as the next generation
matches the same cost per bit in the same organization because 1) the minimum memory
increment may be much larger than needed, 2) the larger capacity DRAM will need to be in
a wider configuration that is more expensive per bit than the narrow version of the smaller
DRAM, or 3) the wider capacity does not match the width needed for error checking and
hence results in even higher costs.

3. IRAM - Architecture

Key Technologies
The key technologies behind IRAM are: 1) vector processing, 2) embedded DRAM, and 3) serial I/O.
Vector Processing
High-speed microprocessors rely on instruction-level parallelism (ILP) in programs, meaning the hardware can find short instruction sequences that execute in parallel. As mentioned above, these high-speed microprocessors rely on cache hits to supply instructions and operands at a rate sufficient to keep the processor busy.
An alternative model to exploiting ILP that does not rely on caches is vector
processing. It is a well established architecture and compiler model that was popularized by
supercomputers, and it is considerably older than superscalar. Vector processors have high-
level operations that work on linear arrays of numbers.
Advantages of vector computers and the vectorized programs running on them include:
1. Each result is independent of previous results, which enables deep pipelines and high clock rates.
2. A single vector instruction does a great deal of work, which means fewer instruction fetches in general and fewer branch instructions, and so fewer mispredicted branches.
3. Vector instructions often access memory a block at a time, which allows memory latency to be amortized over, say, 64 elements.
4. Vector instructions often access memory with regular (constant-stride) patterns, which allows multiple memory banks to supply operands simultaneously.

Figure 4.1
Figure 4.1 shows the vector processing model, schematically contrasting scalar and vector instructions. In scalar processing the instructions operate on one element at a time, while in vector processing a single instruction operates on many elements in parallel, up to the vector length of the processor. Vector processing is therefore much faster than scalar processing on data-parallel code.
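As a small illustration (our own, not from the seminar), the C loop below is the kind of data-parallel kernel a vector machine handles well. On a scalar processor each iteration is a separate instruction sequence; a vector unit would instead execute a handful of vector instructions (set vector length, vector load, vector multiply-add, vector store) per strip of elements.

    #include <stddef.h>

    /* SAXPY: y[i] = a*x[i] + y[i]. Each result is independent of the
     * others, the accesses are unit stride, and one vector instruction
     * can cover a whole strip of elements, so this loop vectorizes
     * cleanly in the model described above. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }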

IRAM - Vector Architecture


Since vector architecture deals with vector processing, it describes the processor side of IRAM and its instruction-level parallelism. The parallel processing is carried out by the virtual processors VP0, VP1, ..., VP[vlr-1], where vlr is the vector length register. Figure 4.2 shows the vector architectural model of IRAM. The machine provides 32 general-purpose vector registers, vr0 through vr31, which hold the operands for ordinary vector instructions.

Figure 4.2

The 32 flag registers, vf0 through vf31, hold per-element masks that support conditional (predicated) execution of vector instructions. There are also 32 control registers, vcr0 through vcr31, which control instruction execution (the vector length register among them).
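A rough sketch of this architectural state as a C structure is shown below. The sizes and field layout are our assumptions for illustration, not the actual ISA definition.

    #include <stdint.h>

    #define VLMAX 64  /* assumed maximum vector length, in elements */

    /* Illustrative model of the register state described above. */
    struct vector_state {
        uint64_t vr[32][VLMAX];  /* vr0..vr31: general-purpose vector registers  */
        uint8_t  vf[32][VLMAX];  /* vf0..vf31: per-element flag (mask) registers */
        uint64_t vcr[32];        /* vcr0..vcr31: control registers; one of them  */
                                 /* holds the current vector length (vlr)        */
    };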

Advantages of Vector Processing


The advantages of vector processing are:
1. High performance on demand for multimedia processing: parallel processing of whole arrays suits multimedia workloads, which play a key role in today's computing.
2. Low power for the issue and control logic: the control logic runs at 1.2 V, compared with the 3 V to 5 V of other processing approaches.
3. Low design complexity: the simple design makes implementation easy and cost effective.
4. A well-understood programming model: the vector instruction set is simple and easy to understand, yet efficient when compared with other assembly-level programming interfaces.
Embedded DRAM
Embedded DRAM means DRAM fabricated on the same die as logic. Embedding, in general, is the technique of building a chip into a device to control and carry out that device's operations: the user does not interact with the chip directly, but executes and controls the device through it.

Figure 4.3
Figure 4.3 shows how embedded technology is used in the manufacture of IRAM. During fabrication the memory is embedded on the same die as the microprocessor to produce IRAM. IRAM thus becomes a single chip in which both memory and processor are integrated, and their coexistence yields high performance.

Advantages of Embedded DRAM


The advantages of embedded DRAM are:
1. It offers high bandwidth for vector processing. The high memory bandwidth of on-chip DRAM feeds the vector processor, which needs heavy memory traffic because of the abundant parallelism in vector processing.
2. It has low latency, which makes memory accesses much faster and more efficient.
3. The energy required per memory access is very low, and off-chip accesses are rare, so the power consumption of the DRAM is low compared with external memory chips.
4. The memory flexibility of IRAM stems from the embedded technology used in its manufacture: designers can specify the exact length and width of the DRAM, since it is not restricted to powers of 2, so embedded DRAM offers system memory-size benefits.

Serial I/O
Because parallel I/O performs poorly in both bandwidth and scaling, the I/O system of IRAM uses a more efficient and cost-effective technology, serial I/O. It enhances the performance of IRAM without hindering the memory and processor, by offering a smooth, fast path for data transfer.

Figure 4.4
Figure 4.4 shows schematically how IRAM interacts with I/O devices through the serial I/O lines implemented on the chip. Thanks to the high bandwidth of the serial I/O lines, data transfer in IRAM is fast and efficient, which further enhances its performance.

Advantages of Serial I/O


The advantages of serial I/O are:
1. Serial I/O offers very high bandwidth, of the order of gigabits per second, which benefits both processing- and memory-intensive operations.
2. The pin count of serial I/O is much lower than that of parallel I/O: it requires only 1-2 pins per unidirectional link, while parallel I/O requires 5-10 pins per unidirectional link.
3. Serial I/O bandwidth can be scaled incrementally for greater bandwidth and efficiency, and scaling does not increase the pin count; scaling is therefore very cost effective for serial I/O, whereas scaling parallel I/O increases both the number of pins and the cost of production.
4. The power consumption of a serial I/O system is far lower than that of a parallel I/O system, so serial I/O improves IRAM's performance while reducing power consumption, making it suitable for low-power devices.
IRAM - Floor Plan
Every technological implementation needs a basic plan on which the device or circuit is built. Similarly, IRAM has a floor plan for its implementation, which includes the design specifications and the structure to be implemented. Figure 4.5 shows the floor plan of IRAM. It consists of 1024 1-Mbit memory modules split into two memory zones of 512 modules each, giving a capacity of 64 MB per zone. The 8 vector lanes carry the multiple parallel element operations of the vector processor. The CPU and IO blocks are the central processing unit and the input-output system of the IRAM chip. The crossbar switch acts as a link between the memory, the central processing unit, and the IO system.

Figure 4.5

Floor Plan Specification


The IRAM floor-plan specification is as follows:
• 0.13 micron - the transistor feature size
• 1 GHz - the clock speed of the IRAM chip
• 1 Gbit DRAM - the memory subsystem capacity (1024 1-Mbit modules)
• 128 MB - the total memory capacity
• 16 GFLOPS (64-bit) - peak floating-point operations per second
• 64 GOPS (16-bit) - peak operations per second
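A quick arithmetic check of these figures (plain arithmetic on the numbers above; the per-cycle breakdown is our inference, not stated in the seminar):

    #include <stdio.h>

    /* Sanity check of the floor-plan specification. */
    int main(void) {
        int modules_mbit = 1024;                 /* 1024 x 1-Mbit modules    */
        printf("memory: %d MB\n", modules_mbit / 8);       /* 128 MB         */
        printf("64-bit ops/cycle: %.0f\n", 16e9 / 1e9);    /* 16 at 1 GHz    */
        printf("16-bit ops/cycle: %.0f\n", 64e9 / 1e9);    /* 64 at 1 GHz:   */
        /* i.e. 4x more operations per cycle at 1/4 the data width. */
        return 0;
    }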

IRAM - Complete Architecture


IRAM is implemented from the floor plan, according to the floor-plan specifications. Figure 4.6 shows the complete architecture of the IRAM implementation. The 1024 1-Mbit modules are embedded in IRAM as shown in the figure. The 16-KB L1 cache is split into an 8-KB instruction cache and an 8-KB data cache: the instruction cache serves processor instruction fetches, and the data cache serves the various memory load and store operations.

Figure 4.6
The 2-way superscalar processor handles scalar processing. It is called 2-way because it issues two instructions per cycle, with separate issue logic for FPU (floating point unit) operations and LSC (load/store/coprocessor) operations. The vector registers hold the operands of the vector instructions issued by the vector processor. Vector instructions are queued from the 2-way superscalar processor into the vector instruction queue unit. The arithmetic and logic unit executes the arithmetic and logic instructions issued by both the superscalar processor and the vector processor. The load/store unit performs the various memory load and store operations. The serial I/O lines connect IRAM with the various input and output devices. The memory crossbar switch links the memory, processor, and input-output devices for their mutual interactions. The integrated architecture makes the memory and processor coexist and perform as a single unit. Since there is a high level of interaction between the memory, the processor, and the I/O devices, the performance of IRAM is far higher than that of a separate-chip processor-memory unit. This unified architecture from a single fabrication line is in fact the secret behind IRAM's excellent performance in both processing- and memory-intensive operations.

4. IRAM - Benchmarking
Benchmarking is a process, or group of processes, by which one can reach a sound judgment on the performance and efficiency of a product or technology in comparison with others. Benchmarking helps us find the product or technology appropriate for our requirements, and it provides solid evidence of relative performance and efficiency.

Benchmarking Environment
The benchmarking environment is the setting in which the whole set of processes is carried out. It may include various products based on the same or different technologies, depending on the type and requirements of the benchmark. Here we benchmark IRAM against other processor-memory combinations to show IRAM's real potential. The environment therefore consists of processors from different manufacturers paired with memory modules from different sources, operating together as systems, while IRAM is itself a single unit manufactured in a single fabrication process. Table 5.1 lists the processors and the memory modules used in the benchmarking.

| µP     | IRAM     | SPARC    | R10K     | P III   | P4      | EV6      |
|--------|----------|----------|----------|---------|---------|----------|
| Make   | Berkeley | Sun      | Origin   | Intel   | Intel   | Alpha    |
| Clock  | 1 GHz    | 833 MHz  | 900 MHz  | 950 MHz | 1.5 GHz | 966 MHz  |
| L1     | 8+8 KB   | 16+16 KB | 32+32 KB | 32 KB   | 12+8 KB | 64+64 KB |
| L2     | NA       | 2 MB     | 1 MB     | 256 KB  | 256 KB  | 2 MB     |
| Memory | 128 MB   | 256 MB   | 1 GB     | 256 MB  | 1 GB    | 512 MB   |

Table 5.1
The processors used in the benchmarking are the SPARC from Sun, the R10K from the Origin, the P III and P4 from Intel, the EV6 from Alpha, and IRAM from Berkeley. Their clock frequencies, L1 and L2 caches, and the memory modules used with them are all given in Table 5.1. Note that the processor with the smallest L1 cache, no L2 cache at all, and the smallest memory capacity is IRAM; all the other processors have larger L1 and L2 caches and more memory available for their operations.

Benchmarking Processes
The various benchmarking processes carried out are,

1. Transitive Closure: The first benchmark problem is to compute the transitive closure of a directed graph in a dense representation. The code taken from the DIS reference implementation used non-unit stride, but was easily changed to unit stride. This benchmark performs only 2 arithmetic operations (an addition and a subtraction) at each step, while it executes 2 loads and 1 store.

2. Giga Updates per Second (GUPS): This benchmark is a synthetic problem that measures giga-updates-per-second. It repeatedly reads and updates distinct, pseudo-random memory locations. The inner loop contains 1 arithmetic operation, 2 loads, and 1 store, but unlike transitive closure, the memory accesses are random. It contains abundant data parallelism because the addresses are pre-computed and free of duplicates.


3. Sparse Matrix-Vector Multiplication (SPMV): This problem also requires random memory access patterns and a low number of arithmetic operations. It is common in scientific applications, and appears in both the DIS and NPB suites in the form of a Conjugate Gradient (CG) solver. We have a CG implementation for IRAM, which is dominated by SPMV, but here we focus on the kernel to isolate the memory-system issues. The matrices contain a pseudo-random pattern of non-zeros built with a construction algorithm from the DIS specification, parameterized by the matrix dimension, n, and the number of non-zeros, m.

4. Histogram: Computing a histogram of a set of integers can be used for sorting and in some image-processing problems. Two important considerations govern the algorithmic choice: the number of buckets, b, and the likelihood of duplicates. For image processing, the number of buckets is large and collisions are common, because certain colors (e.g., white) typically occur many times in an image. Histogram is nearly identical to GUPS in its memory behavior, but differs due to the possibility of collisions, which limit parallelism and are particularly challenging in a data-parallel model.

5. Mesh Adaptation: The final benchmark is a two-dimensional unstructured mesh adaptation algorithm based on triangular elements. This benchmark is more complex than the others, and there is no single inner loop to characterize. The memory accesses include both random and unit stride, and the key problem is the complex control structure, since there are several different cases when inserting a new point into an existing mesh. Starting with a coarse-grained task-parallel program, significant code reorganization and data preprocessing were needed to allow vectorization.
These benchmarking processes were selected specifically to test both the processing and the memory handling of the various processor-memory combinations and of IRAM. They are both processing- and memory-intensive, squeezing the final drop of performance from the different processor-memory systems.


Transitive Closure
The best-case scenario for both caches and vectors is a unit stride memory access
pattern, as found in the transitive closure benchmark. In this case, the main advantage for
IRAM is the size of its on-chip memory, since DRAM is denser than SRAM. IRAM has 12
MB of on-chip memory compared to 10s of KB for the L1 caches on the cache-based
machines. IRAM is admittedly a large chip, but this is partly due to being an academic
research project with a very small design team - the 2-3 orders of magnitude advantage in
on-chip memory size is primarily due to the memory technology. Figure 5.1 shows the
performance of the transitive closure benchmark. Results confirm the expected advantage
for IRAM on a problem with abundant parallelism and a low arithmetic/memory operation
ratio. Performance is relatively insensitive to graph size, although IRAM performs better on
larger problems due to the longer average vector length. The Pentium 4 has a similar effect,
which may be due to improved branch prediction because of the sparse graph structure in
the test problem.
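The DIS code itself is not reproduced in the seminar; the following Floyd-Warshall-style inner loop is our sketch of the access pattern described above: 2 loads and 1 store per element, unit stride in j, and an add plus a compare (the compare being the "subtraction" in the operation count).

    #include <stddef.h>

    /* Sketch of a dense transitive-closure / shortest-path kernel.
     * For each k, the j-loop is unit stride: it loads d[i][j] and
     * d[k][j], stores d[i][j], and does one add and one compare. */
    void transitive(size_t n, int *d /* n*n dense matrix */) {
        for (size_t k = 0; k < n; k++)
            for (size_t i = 0; i < n; i++) {
                int dik = d[i*n + k];            /* scalar inside the j-loop */
                for (size_t j = 0; j < n; j++) { /* vectorizable, unit stride */
                    int via_k = dik + d[k*n + j];
                    if (via_k < d[i*n + j])
                        d[i*n + j] = via_k;
                }
            }
    }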

Figure 5.1


Giga Updates per Second (GUPS)


A more challenging memory access pattern is one with either non-unit strides or indexed loads and stores (scatter/gather operations). The first challenge for any machine is generating the addresses, since each address must be checked for validity and for collisions. IRAM can generate only 4 addresses per cycle, independent of the data width. For 64-bit data this is sufficient to load or store a value on every cycle, but if the data width is halved to 32 bits, the 4 64-bit lanes operate as 8 32-bit lanes, and the arithmetic units can more easily be starved for data. In addition, details of the memory bank structure can become apparent, as multiple accesses to the same DRAM bank require additional latency to precharge the DRAM. The frequency of these bank conflicts depends on the memory access pattern and on the number of banks in the memory system.
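The update loop at the heart of GUPS is tiny; the sketch below is our own illustration of it (names invented), showing why address generation rather than arithmetic is the bottleneck.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of the GUPS inner loop: 2 loads (index[i] and table[index[i]]),
     * 1 store, and 1 arithmetic operation, with random (gather/scatter)
     * addressing. The indices are precomputed and duplicate-free, so all
     * updates are independent and fully data-parallel. */
    void gups(size_t n, uint64_t *table, const size_t *index, uint64_t delta) {
        for (size_t i = 0; i < n; i++)
            table[index[i]] += delta;   /* indexed load + add + indexed store */
    }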
The GUPS benchmark results, shown in Figure 5.2, highlight the address-generation issue. Although performance improves slightly when moving from 64 to 32 bits, beyond that it is constant, limited by the 4 address generators. Overall, though, IRAM does very well on this benchmark, nearly doubling the performance of its nearest competitor, the Pentium 4, for 32- and 64-bit data. In fairness, GUPS was one of the benchmarks in which the authors hand-tuned the compiler-generated assembly for the inner loops, which produced a 20-60% speedup.


Figure 5.2
In addition to the MOP rate, it is interesting to observe the memory bandwidth
consumed in this problem. GUPS achieves 1.77, 2.36, 3.54, and 4.87 GB/s memory
bandwidth on IRAM at 8, 16, 32, and 64-bit data widths, respectively. This is relatively
close to the peak memory bandwidth of 6.4 GB/s.

Sparse Matrix-Vector Multiplication (SPMV)


For the SPMV benchmark, the matrix dimension was set to 10,000 and the number of non-zeros to 177,782, i.e., about 18 non-zeros per row. The computation is done in single-precision floating point. The pseudo-random pattern of non-zeros is particularly challenging; many matrices taken from real applications have some structure that gives better locality, which would especially benefit cache-based machines.
Four different algorithms were considered for SPMV, reflecting best practice for both cache-based and vector machines. The performance results are shown in Figure 5.3. Compressed Row Storage (CRS) is the most common sparse matrix format; it stores an array of column indices and non-zero values for each row, and SPMV is then performed as a series of sparse dot products. The performance on IRAM is better than on some cache-based machines, but it suffers from lack of parallelism: the dot product is performed by recursive halving, so vectors start with an average of 18 elements and drop from there. Both the P4 and EV6 exceed IRAM's performance for this reason. CRS-banded uses the same format and algorithm as CRS, but reflects a different non-zero structure that would likely result from bandwidth-reduction orderings, such as reverse Cuthill-McKee (RCM). This has little effect on IRAM, but improves the cache hit rate on some of the other machines.
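The CRS kernel can be sketched as follows (our illustration, not the benchmark code); the short row-wise dot products are exactly what limits the vector length on IRAM.

    #include <stddef.h>

    /* Sketch of SPMV in Compressed Row Storage (CRS) form.
     * row_ptr[i]..row_ptr[i+1] delimit row i's entries in col[] and val[];
     * each row is a short sparse dot product (~18 non-zeros on average
     * here), so vectors start small and shrink under recursive halving. */
    void spmv_crs(size_t n, const size_t *row_ptr, const size_t *col,
                  const float *val, const float *x, float *y) {
        for (size_t i = 0; i < n; i++) {
            float sum = 0.0f;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* indexed (gather) load of x */
            y[i] = sum;
        }
    }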

Figure 5.3
The Ellpack (or Itpack) format forces all rows to have the same length by padding them with zeros. It still has indexed memory operations, but it increases the available data parallelism through vectorization across rows. The raw Ellpack performance is excellent, and this format should be used on IRAM and the PIII for matrices whose longest row length is close to the average. If we instead measure the effective performance (eff), which discounts operations performed on padded zeros, the efficiency can be arbitrarily poor; indeed, the randomly generated DIS matrix suffers an enormous increase in matrix size and number of operations, making the format impractical there. The Segmented-sum algorithm was first proposed for the Cray PVP. Its data structure is an augmented form of the CRS format, and its computational structure is similar to Ellpack, although there is additional control complexity. The underlying Ellpack algorithm was modified to convert roughly two thirds of the memory accesses from a large stride to unit stride; the remaining third are still indexed references. This was important on IRAM, because 32-bit data is used and there are only 4 address generators, as discussed above.
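For reference, the Ellpack layout discussed above can be sketched as follows (our illustration, with a position-major array layout assumed so the inner loop is unit stride):

    #include <stddef.h>

    /* Sketch of SPMV in Ellpack format: every row is padded to the same
     * length L, so the loop can be vectorized across rows (the i-loop),
     * giving long vectors at the cost of work on the padded zeros.
     * val[] and col[] are stored position-major: entry k of row i lives
     * at index i + k*n, making the inner i-loop unit stride. */
    void spmv_ellpack(size_t n, size_t L, const size_t *col /* n*L */,
                      const float *val /* n*L */, const float *x, float *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = 0.0f;
        for (size_t k = 0; k < L; k++)        /* position within each row   */
            for (size_t i = 0; i < n; i++)    /* vectorizable across rows   */
                y[i] += val[i + k*n] * x[col[i + k*n]];
    }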

Histogram
This benchmark builds a histogram for the pixels in a 500x500 image from the
DIS Specification. The number of buckets depends on the number of bits in each pixel, so
we use the base 2 logarithm (i.e., the pixel depth) as the parameter in our study.
Performance results for pixel depths of 7, 11, and 15 are shown in Figure 5.4. The first five sets are for IRAM; all but the second (Retry 0%) use this image data set. The first set
(Retry) uses the compiler default vectorization algorithm, which vectorizes while ignoring
duplicates, and corrects the duplicates in a serial phase at the end. This works well if there
are few duplicates, but performs poorly for our case. The second set (Retry 0%) shows the
performance when the same algorithm is used on data containing no duplicates. The third
set (Priv) makes several private copies of the buckets with the copies merged at the end. It
performs poorly due to the large number of buckets and gets worse as this number increases
with the pixel depth.
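The scalar form of the computation is trivial; the sketch below (ours, not the benchmark code) shows why duplicates are the problem: two pixels with the same value update the same bucket, so a naive vector gather-add-scatter would lose updates.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of the scalar histogram loop. bucket[] has 1 << depth
     * entries and must be zero-initialized by the caller. Collisions
     * (duplicate pixel values) serialize the read-modify-write. */
    void histogram(size_t npixels, const uint16_t *pixel, uint32_t *bucket) {
        for (size_t i = 0; i < npixels; i++)
            bucket[pixel[i]]++;   /* read-modify-write; duplicates collide */
    }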


Figure 5.4
The fourth and fifth algorithms use a more sophisticated sort-diff-find-diff approach that performs in-register sorting. Bitonic sort was used because its communication requirements are regular, and it proved to be a good match for IRAM's "butterfly" permutation instructions, designed primarily for reductions and FFTs. The compiler automatically generates in-register permutation code for reductions, but the sorting algorithm used here was hand-coded. The two sort variants differ in the data width they allow: one works when the width is under 16 bits and the other when it is up to 32 bits. The narrower version takes advantage of IRAM's higher arithmetic performance on narrow data. Results show that on IRAM, the sort-based and privatized optimization methods consistently give the best performance over the range of bit depths, and they demonstrate the improvement obtainable when the algorithm is tailored to shorter bit depths.
Overall, IRAM does not do as well here as on the other benchmarks, because the presence of duplicates hurts vectorization but can actually improve cache hits on cache-based machines. We therefore see excellent timings for the histogram computation on those machines without any special optimizations. A memory-system advantage starts to become apparent at 15-bit pixels, where the histograms no longer fit in cache; at this point IRAM's performance is comparable to the faster microprocessors.

Mesh Adaptation
This benchmark performs a single level of refinement starting with a mesh of 4802
triangular elements, 2500 vertices, and 7301 edges. In this application, we use a different
algorithm organization for the different machines: The original code was designed for
conventional processors and is used for those machines, while the vector algorithm uses
more memory bandwidth but contains smaller loop bodies, which helps the compiler
perform vectorization. The vectorized code also pre-sorts the mesh points to avoid branches
in the inner loop, as in Histogram. Although the branches negatively affect superscalar
performance, presorting is too expensive on those machines. Mesh adaptation also requires
indexed memory operations, so address generation again limits IRAM.

Figure 5.5
Figure 5.5 shows the performance of the processors on Mesh Adaptation. It indicates that IRAM performed well in mesh adaptation compared with the other processors; the only real competitor was the Intel Pentium 4. IRAM thus emerged as a clear winner in this processing-intensive benchmark.


Summary of Benchmark Characteristics


An underlying goal of the benchmarking was to identify the limiting factor in these memory-intensive benchmarks. The graph in Figure 5.6 shows the memory bandwidth used on IRAM and the MOPS rate achieved on each of the benchmarks using the best algorithm on the most challenging input: GUPS uses the 64-bit version of the problem, SPMV uses the segmented-sum algorithm, and Histogram uses the 16-bit sort. While all of these problems have low operation counts per memory operation, the memory and operation rates are quite different in practice. Of these benchmarks, GUPS is the most memory-intensive, whereas Mesh Adaptation is the least. Histogram, SPMV, and Transitive Closure have roughly the same balance between computation and memory, although their absolute performance varies dramatically due to differences in parallelism. In particular, although GUPS and Histogram are nearly identical in their memory characteristics, the difference in available parallelism yields very different absolute performance as well as a different ratio of bandwidth to operation rate.
Figure 5.7 shows the summary of performance for each of the benchmarks across
machines. The y-axis is a log scale, and IRAM is significantly faster than the other
machines on all applications except SPMV and Histogram.
An even more dramatic picture is seen from measuring the MOPS/Watt ratio, as
shown in Figure 5.8. Most of the cache-based machines use a small amount of parallelism,
but spend a great deal of power on a high clock rate. Indeed a graph of Flops per machine
cycle is very similar. Only the Pentium III, designed for portable machines, has a
comparable power consumption of 4 Watts compared to IRAM’s 2 Watts. The Pentium III
cannot compete on performance, however, due to lack of parallelism.


Figure 5.6

Figure 5.7

Figure 5.8


5. Advantages of IRAM

Low Latency
To reduce latency, the wire length should be kept as short as possible. This
suggests the fewer bits per block the better. In addition, the DRAM cells furthest away from
the processor will be slower than the closest ones. Rather than restricting the access timing
to accommodate the worst case, the processor could be designed to be aware when it is
accessing “slow” or “fast” memory. Some additional reduction in latency can be obtained
simply by not multiplexing the address as there is no reason to do so on an IRAM. Also,
being on the same chip with the DRAM, the processor avoids driving the off chip wires,
potentially turning around the data bus, and accessing an external memory controller. In
summary, the access latency of an IRAM processor does not need to be limited by the same
constraints as a standard DRAM part. Much lower latency may be obtained by intelligent
floor planning, utilizing faster circuit topologies, and redesigning the address/data bussing
schemes.
A memory latency for random addresses of less than 30 ns is achievable with a latency-oriented DRAM design on the same chip as the processor; this is as fast as second-level caches. Recall that the memory latency of the AlphaServer 8400 is 253 ns. IRAM thus offers performance opportunities for applications with unpredictable memory accesses and very large memory "footprints", such as databases, which may take advantage of the potential 5X to 10X decrease in latency. The lower latency of IRAM comes from having no external DRAM parts, no separate memory controller, no long bidirectional bus to turn around, and far fewer pins. The latency of an IRAM chip is 10-15 ns for a 64-128 MB memory capacity, which is very low compared with other memory systems.


High Bandwidth
A DRAM naturally has extraordinary internal bandwidth, essentially fetching the
square root of its capacity each DRAM clock cycle; an on-chip processor can tap that
bandwidth. The potential bandwidth of the gigabit DRAM is even greater than indicated by
its logical organization. Since it is important to keep the storage cell small, the normal
solution is to limit the length of the bit lines, typically with 256 to 512 bits per sense amp.
This quadruples the number of sense amplifiers. To save die area, each block has a small
number of I/O lines, which reduces the internal bandwidth by a factor of about 5 to 10 but
still meets the external demand. One IRAM goal is to capture a larger fraction of the potential on-chip bandwidth. For example, two prototype 1-gigabit DRAMs were presented at ISSCC in 1996. As mentioned above, to cope with the long wires inherent in the 600 mm² dies of gigabit DRAMs, vendors are using more metal layers: 3 for Mitsubishi and 4 for Samsung. The totals are 512 2-Mbit modules and 1024 1-Mbit modules on chip, respectively. Thus a gigabit IRAM might have 1024 memory modules, each 1 Kbit wide. Not only would there be tremendous bandwidth at the sense amps of each block; the extra metal layers enable more cross-chip bandwidth. Assuming a 1-Kbit metal bus needs just 1 mm, a 600 mm² IRAM might have 16 buses running at 50 to 100 MHz. The internal IRAM bandwidth should therefore be as high as 200-300 GB/s. For comparison, the sustained memory bandwidth of the AlphaServer 8400, which includes a 75 MHz, 256-bit memory bus, is 1.2 GB/s. The crossbar switch in the IRAM architecture delivers only 1/3 to 2/3 of the theoretical bandwidth, so the actual bandwidth will be 100-200 GB/s. Applications with predictable memory accesses, such as matrix manipulations, may take advantage of the potential 50X to 100X increase in IRAM bandwidth.
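The arithmetic behind the bus estimate is simple enough to check directly (the upper end of the quoted range presumably assumes somewhat wider or more numerous buses):

    #include <stdio.h>

    /* Back-of-the-envelope check of the internal-bandwidth estimate above.
     * All numbers come from the text; this is plain arithmetic, not a model. */
    int main(void) {
        double buses = 16;                          /* 1-Kbit cross-chip buses */
        double bytes_per_transfer = 1024.0 / 8.0;   /* 1 Kbit = 128 bytes      */
        double clock_hz = 100e6;                    /* top of 50-100 MHz range */
        double peak = buses * bytes_per_transfer * clock_hz;
        printf("peak internal bandwidth ~ %.0f GB/s\n", peak / 1e9); /* ~205 */
        return 0;
    }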

High Energy Efficiency


Integrating a microprocessor and DRAM memory on the same die offers the potential for reducing the energy consumption of the memory system. DRAM is much denser than SRAM, which is traditionally used for on-chip memory; an IRAM therefore makes far fewer external memory accesses, which consume a great deal of energy driving high-capacitance off-chip buses. Even on-chip accesses are more energy efficient, since DRAM consumes less energy than SRAM. Finally, an IRAM has the potential for higher performance than a conventional approach, and since higher performance at fixed energy can be traded for equal performance at lower energy, the performance advantages of IRAM translate into lower energy consumption.
Besides reducing the frequency of off-chip memory accesses, IRAM also reduces the energy per memory access, which for a conventional memory hierarchy is given by:

Energy per memory access = AE_L1 + MR_L1 × AE_L2 + MR_L1 × MR_L2 × AE_off-chip

where AE is the access energy of each level and MR is the miss rate.

The dominant term in this equation is the off-chip access energy. For IRAM, the off-chip term vanishes, and so does the L2 term, since IRAM has no L2 cache; the energy per memory access reduces to the L1 access energy, which is itself lower than the L1 access energy of other microprocessor chips. The energy consumption of IRAM chips is therefore very low, which makes them well suited to low-power and portable devices.
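A worked example of the energy model makes the effect concrete. The access energies and miss rates below are hypothetical illustration values, not measured IRAM figures.

    #include <stdio.h>

    /* Worked example of the energy-per-access model above.
     * All numbers are assumed for illustration. */
    int main(void) {
        double ae_l1 = 0.5, ae_l2 = 5.0, ae_off = 50.0;  /* nJ, assumed        */
        double mr_l1 = 0.05, mr_l2 = 0.2;                /* miss rates, assumed */

        double conventional = ae_l1 + mr_l1 * ae_l2 + mr_l1 * mr_l2 * ae_off;
        double iram         = ae_l1;   /* no L2 term, no off-chip term */

        printf("conventional: %.2f nJ/access\n", conventional); /* 1.25 nJ */
        printf("IRAM:         %.2f nJ/access\n", iram);         /* 0.50 nJ */
        return 0;
    }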

Memory Flexibility
Another advantage of IRAM over conventional designs is the ability to adjust both
the size and width of the on-chip DRAM. Rather than being limited by powers of 2 in
length or width, as is conventional DRAM, IRAM designers can specify exactly the number
of words and their width. This flexibility can improve the cost of IRAM solutions versus
memories made from conventional DRAMs.

Low Cost of Production


The RAM and the processor are fabricated on a single fabrication line, so cost is reduced through unified fabrication, and tax is reduced since there is no need to tax the RAM and the processor individually. With both production cost and tax lower, the market price of IRAM should be below the combined price of a separate RAM and processor, making it suitable for budget-conscious customers. In fact it should be an obvious choice for performance-conscious customers as well, because it delivers high-quality performance at a bare minimum price.

Small Board Area


IRAM integrates several chips into one, so board area is greatly reduced. This makes it attractive in applications where board area is precious, such as cellular phones and portable computers. As devices keep shrinking while demands on their portability keep growing, IRAM offers a well-defined path toward those goals.

6. Disadvantages of IRAM

No product or technology is a hundred percent perfect; every one has its defects or drawbacks, and IRAM is no exception. The disadvantages of the IRAM chip are:
1. Completely New Architecture: IRAM is a new technology, entirely different from current implementations, since it integrates the processor and memory into a single chip. Accepting it means discarding current products and technologies and revamping the complete system from scratch. This may hinder wide acceptance of the technology even though its performance is excellent; but once it is widely accepted this ceases to be a problem, as it becomes the current technology and renders the others obsolete.
2. Non-upgradeability of Memory: Since the DRAM is embedded in the IRAM chip, the memory cannot be upgraded afterwards. This may limit the popularity of IRAM chips, since the demand for memory capacity is increasing rapidly. Research is ongoing to find a solution to this problem, and next-generation chips may provide for upgrading the memory capacity.
3. High Cost of Testing: The cost of testing an IRAM chip is high compared with other memory-testing processes, because the cost of testing during manufacturing is already significant for DRAMs, and adding a processor would significantly increase the test time on conventional DRAM testers. Once the technology is established, however, testing cost will not be a problem in the long run, since revenue will absorb it; a new, cost-effective method of testing such chips may also emerge in due course.
4. Overheating: The high level of integration shrinks the chip area considerably, so even though the total heat produced is less than that of current processors, the chip may overheat because of the small area. The effect can be remedied with an efficient heat sink or cooling system that exhausts the heat to the external environment.

7. Conclusion
Merging a microprocessor and DRAM on the same chip presents opportunities in performance, energy efficiency, and cost: a factor of 5 to 10 reduction in latency, a factor of 50 to 100 increase in bandwidth, a factor of 2 to 4 advantage in energy efficiency, and an unquantified cost saving from removing superfluous memory and reducing board area. The surprise is that these claims are not based on some exotic, unproven technology; they are based instead on tapping the potential of a technology in use for the last 10 years.
popularity of IRAM is only limited by the amount of memory on-chip, which should expand
by about 60% per year. A best case scenario would be for IRAM to expand its beachhead in
graphics, which requires about 10 Mbits, to the game, embedded, and personal digital
assistant markets, which require about 32 Mbits of storage. Such high volume applications
could in turn justify creation of a process that is more friendly to IRAM, with DRAM cells
that are a little bigger than in a DRAM fabrication but much more amenable to logic and
SRAM. As IRAM grows to 128 to 256 Mbits of storage, an IRAM might be adopted by the
network computer or portable PC markets. Such a success could in turn entice either
microprocessor manufacturers to include substantial DRAM on chip, or DRAM
manufacturers to include processors on chip. Hence IRAM presents an opportunity to
change the nature of the semiconductor industry. From the current division into logic and
memory camps, a more homogeneous industry might emerge with historical microprocessor
manufacturers shipping substantial amounts of DRAM - just as they ship substantial
amounts of SRAM today - or historical DRAM manufacturers shipping substantial numbers
of microprocessors. Both scenarios might even occur, with one set of manufacturers
oriented towards high performance and the other towards low cost. Also IRAM with its
potential can create a new generation of computers with increased portability, reduced size
and power consumption without compromising on performance and efficiency.

8. References

• IRAM - Chips that Remember and Compute, IEEE International Solid-State Circuits Conference
• A Case for Intelligent RAM, IEEE Micro
• IRAM - the Industrial Setting, Applications, and Architectures, Computer Science Division, University of California, Berkeley
• Vector IRAM - ISA and Micro-architecture, Computer Science Division, University of California, Berkeley
• Vector IRAM - A Media-oriented Vector Processor with Embedded DRAM, Computer Science Division, University of California, Berkeley
• A Media-enhanced Vector Architecture for Embedded Memory Systems, Computer Science Division, University of California, Berkeley
• Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines, Computer Science Division, University of California, Berkeley
• IRAM - Overcoming the I/O Bus Bottleneck, Denver, CO, USA
• The Energy Efficiency of IRAM Architectures, 24th Annual International Symposium on Computer Architecture
• http://iram.cs.berkeley.edu/
