high performance in image processing. However, programming with these SIMD instructions is very tricky, and the performance varies considerably with programming skill. We have implemented several applications in image processing on FPGAs, and have tried to achieve the highest performance by minimizing the number of operations and memory accesses. The methods used for those designs can also be applied to programming with SIMD instructions. In this paper, we try to make clear how fast an FPGA is compared with recent processors with SIMD instructions and multi-cores [9]. We compare the performance using three applications: two-dimensional filters [1], stereo vision [2][3][4], and k-means clustering [5][6][7][8]. In these applications, the performance of the FPGA can be improved using larger FPGAs. The comparison is discussed from the viewpoints of problem size, FPGA size and memory bandwidth.
1. INTRODUCTION
Many applications in image processing have high inherent parallelism, and the data width of many operations is less than 16 bits. FPGAs can execute those operations in parallel by configuring dedicated circuits for each application. The large number of internal memory banks on FPGAs also supports this parallel processing by enabling parallel access to several hundred data words cached in them. Because of this high parallelism, FPGAs show very high performance in image processing in spite of their low operational frequency. In order to achieve high performance using a hardware platform with a higher operational frequency, graphics processing units (GPUs) have also been used, and have shown very good performance in some applications. However, they were originally designed for a specific sequence of operations, and it is difficult to realize high parallelism in various applications.
Microprocessors also support SIMD instructions for parallel processing, and it has become possible to execute a SIMD instruction on 128-bit data in one clock cycle in recent processors. These processors also provide multiple cores, and each core can execute SIMD instructions independently. Furthermore, the cache size is large enough to store all image data for each core. Because of this progress in the processors, it has become possible to realize very high performance in image processing.
2. SIMD OPERATIONS
In recent microprocessors such as the Intel Core 2, SIMD instructions on 128-bit data (16 operations on 8-bit data, 8 operations on 16-bit data, 4 operations on 32-bit data, etc.) can be executed in one clock cycle (these SIMD instructions were supported in previous processors, but they took more than one clock cycle). Furthermore, these processors provide multiple cores, and a large cache memory which can hold all image data for each core. The maximum parallelism thus becomes 64 (16 × 4) in the current version with quad cores. With the very high operational frequency of these processors (3 GHz or more), this parallelism enables very high performance in image processing.
However, programming with these SIMD instructions is very tricky. Sequential parts of the programs dominate the total computation time (Amdahl's law), and we need to reduce those parts very carefully. This programming is very similar to hardware design, because we need to keep all stages of all pipelined circuits busy to realize higher performance. In FPGA design, we also need to minimize the number of accesses to external memory banks. This also helps the processors realize higher performance by reducing memory accesses.
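As a rough sketch of what Amdahl's law implies here, the following C function computes the attainable speedup for a given sequential fraction; the fractions discussed in the comments, and the parallelism of 64 (16 eight-bit SIMD lanes × 4 cores), are illustrative values, not measurements from this paper.

```c
/* Amdahl's law: with sequential fraction f and parallelism p,
 * the attainable speedup is 1 / (f + (1 - f) / p). */
double amdahl(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

/* With p = 64 (16 eight-bit SIMD lanes times 4 cores), a sequential
 * fraction of only 5% already limits the speedup to about 15x, which
 * is why the sequential parts must be reduced very carefully. */
```

A fully parallel program (f = 0) would reach the full factor of 64, but even a few percent of sequential code cuts the achievable speedup by a large margin.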
3. TWO-DIMENSIONAL FILTERS

A two-dimensional filter of size (2w+1) × (2w+1) computes

S(x, y) = Σ_{dy=−w..w} Σ_{dx=−w..w} G(dx, dy) × I(x+dx, y+dy)

and when the filter is separable (G(dx, dy) = Gx(dx) × Gy(dy)), this can be rewritten as

S(x, y) = Σ_{dx=−w..w} Gx(dx) × ( Σ_{dy=−w..w} Gy(dy) × I(x+dx, y+dy) )

[Fig. 1: block diagrams of the circuits for non-separable (left) and separable (right) filters (w = 2), showing the input I(x, y), the line buffers, the register array, the multipliers by G(dx, dy) (Gy and Gx for the separable filter), and the adder trees producing S(x, y).]
This equation means that S(x, y) can be obtained by applying Gy(dy) first to the pixels in the same column, and then Gx(dx) to the results.

Fig. 1 shows block diagrams of circuits for non-separable and separable filters (w = 2). In Fig. 1 left (non-separable), one pixel value I(x, y) is given to the circuit every clock cycle, and sent to the register array. At the same time, data in the line buffers (the number of line buffers is 2w) are read out in parallel (these data are I(x, y−k), k = 1, ..., 2w), and given to the register array. The read-out data and I(x, y) are held in the register array for 2w+1 clock cycles and multiplied by G(dx, dy). The products are summed up by an adder tree. The read-out data and I(x, y) are written back to the next line buffers for the calculation of I(x, y+1). This circuit is fully pipelined, and can apply the filter to one pixel in one clock cycle (as its throughput). The number of operations is (2w+1) × (2w+1) multiplications and (2w+1) × (2w+1) − 1 additions. As for separable filters, the outputs of the line buffers and I(x, y) are multiplied by Gy(dy) first, and then summed up. Then, the sums are held in a shift register for 2w+1 clock cycles, and multiplied by Gx(dx). The 2w+1 products are summed up by an adder tree, and S(x, y) is obtained. Therefore, the number of operations is (2w+1) + (2w+1) multiplications and 2w + 2w additions.
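The two passes of the separable scheme can be sketched in scalar C as follows; the function names, the clamped border handling, and the float data type are our own illustrative choices, not the paper's implementation.

```c
/* Separable (2w+1)x(2w+1) filter: apply gy along columns into tmp,
 * then gx along rows into out.  Per output pixel this costs
 * (2w+1)+(2w+1) multiplies and 2w+2w adds, versus (2w+1)*(2w+1)
 * multiplies for the non-separable form.  Borders are clamped. */
static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

void filter_separable(const float *in, float *out, float *tmp,
                      int width, int height,
                      const float *gx, const float *gy, int w)
{
    /* vertical pass: tmp(x, y) = sum_dy gy[dy+w] * in(x, y+dy) */
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float s = 0.0f;
            for (int dy = -w; dy <= w; dy++)
                s += gy[dy + w] * in[clampi(y + dy, 0, height - 1) * width + x];
            tmp[y * width + x] = s;
        }
    /* horizontal pass: out(x, y) = sum_dx gx[dx+w] * tmp(x+dx, y) */
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float s = 0.0f;
            for (int dx = -w; dx <= w; dx++)
                s += gx[dx + w] * tmp[y * width + clampi(x + dx, 0, width - 1)];
            out[y * width + x] = s;
        }
}
```

With a 3×3 all-ones kernel (w = 1) on an image of ones, every output pixel is 9, which is a quick way to check the index arithmetic.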
Fig. 2 shows an outline of the program for non-separable filters using SIMD instructions. The numbers following the variable names show their data widths, and the small number below an arrow shows the parallelism of the SIMD instruction. In the program, the values of eight pixels are multiplied by the eight coefficients, and the 2k-th and (2k+1)-th data in the eight products are added in parallel. Then, the four sums are added to four partial sums in one 128-bit datum, respectively. Finally, the four partial sums are summed up sequentially, and S(x, y) is obtained.
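The reduction pattern of Fig. 2 can be emulated in scalar C as follows; this is a sketch of the data flow only (the function and variable names are illustrative), not of the actual SIMD intrinsics, and it assumes the tap count is padded to a multiple of eight.

```c
#include <stdint.h>

/* Scalar emulation of the SIMD reduction in Fig. 2: eight 16-bit
 * products are reduced pairwise to four sums, accumulated into four
 * partial sums, and finally summed up sequentially. */
int32_t filter_dot8(const int16_t *pix, const int16_t *coef, int n8)
{
    int32_t partial[4] = { 0, 0, 0, 0 };
    for (int i = 0; i < n8; i++) {            /* one 8-wide SIMD step */
        int32_t prod[8], pair[4];
        for (int k = 0; k < 8; k++)           /* 8 parallel multiplies */
            prod[k] = (int32_t)pix[i * 8 + k] * coef[i * 8 + k];
        for (int k = 0; k < 4; k++)           /* 2k-th + (2k+1)-th data */
            pair[k] = prod[2 * k] + prod[2 * k + 1];
        for (int k = 0; k < 4; k++)           /* add to partial sums */
            partial[k] += pair[k];
    }
    /* final sequential sum of the four partial sums */
    return partial[0] + partial[1] + partial[2] + partial[3];
}
```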
4. STEREO VISION
In a stereo vision system, projections of the same location are searched for in two images (Ir and Il) taken by two cameras, and the distance to the location is obtained from the disparity. In order to find the projections, a small window centered at a given pixel in Ir is compared with windows in Il on the same line (epipolar restriction) in area-based matching algorithms. The sum of absolute differences (SAD) is widely used to compare the windows because of its simplicity. When SAD is used, the value of d in [0, D−1] which minimizes the following equation is searched for:

SADxy(x, y, d) = Σ_{dx=−w..w} Σ_{dy=−w..w} |Ir(x+dx, y+dy) − Il(x+dx+d, y+dy)|

In this equation, (2w+1) × (2w+1) is the size of the window centered at the pixel Ir(x, y), d is the disparity, and its range D decides how many windows in Il (their centers are Il(x+d, y)) are compared with the window.
Fig. 3 shows how to calculate SADxy(x, y, d) efficiently. The right half of Fig. 3(A) shows a window in Ir whose center is Ir(x, y). This window is compared with D windows in its target area (whose width is D+2w) in Il (the left half). Suppose that SADxy(x, y, d) has been calculated for all d. Then, the window in Ir is shifted by one pixel along the x axis, and compared with D windows in its target area (Fig. 3(B)). SADxy(x+1, y, d) can be calculated from SADxy(x, y, d) as follows:

SADxy(x+1, y, d) = SADxy(x, y, d) + SADy(x+1+w, y, d) − SADy(x−w, y, d)

SADy(x, y, d) = Σ_{dy=−w..w} |Ir(x, y+dy) − Il(x+d, y+dy)|
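The incremental update can be sketched in C as follows, with a brute-force window SAD for the initial value; the function names and the row-major image layout are illustrative assumptions, not the paper's code.

```c
#include <stdlib.h>

/* Column SAD over 2w+1 pixels:
 * SADy(x, y, d) = sum_dy |Ir(x, y+dy) - Il(x+d, y+dy)| */
static int sad_col(const unsigned char *ir, const unsigned char *il,
                   int width, int x, int y, int d, int w)
{
    int s = 0;
    for (int dy = -w; dy <= w; dy++)
        s += abs((int)ir[(y + dy) * width + x] -
                 (int)il[(y + dy) * width + x + d]);
    return s;
}

/* Full window SAD (brute force), used once per row to initialize. */
int sad_window(const unsigned char *ir, const unsigned char *il,
               int width, int x, int y, int d, int w)
{
    int s = 0;
    for (int dx = -w; dx <= w; dx++)
        s += sad_col(ir, il, width, x + dx, y, d, w);
    return s;
}

/* One incremental step along x:
 * SADxy(x+1,y,d) = SADxy(x,y,d) + SADy(x+1+w,y,d) - SADy(x-w,y,d) */
int sad_step(const unsigned char *ir, const unsigned char *il,
             int width, int x, int y, int d, int w, int sad_prev)
{
    return sad_prev + sad_col(ir, il, width, x + 1 + w, y, d, w)
                    - sad_col(ir, il, width, x - w, y, d, w);
}
```

Each step thus replaces a full (2w+1) × (2w+1) window computation by two column SADs, which is the saving the circuit in Fig. 3 exploits.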
[Fig. 3: calculation of SADxy(x, y, d). (A) a window in Ir centered at (x, y) and its target area of width D+2w in Il; (B) the window shifted by one pixel along the x axis (x+1); (C) the shift along the y axis (y+1), using Ir(x+1+w, y+1+w) and Ir(x−w, y−w); (D) the FIFOs and memory which hold the partial SADs.]

6. calculate |Ir(x−w, y−w) − Il(x−w+d, y−w)|, and subtract them from SADy(x−w, y, d) (SADs for the dark gray pixels on x−w are obtained).
5. K-MEANS CLUSTERING

[Fig. 6: block diagram of the circuit for k-means clustering, showing the memory banks (m0–m7), 96 units which compute the squared distances d², minimum selection (Min), counters for |Ci|, accumulators for Σ (x − center_i) over x in Ci, a divider (Div) for the new cluster centers, the convergence test, and the download of the next images to the FPGA.]
Figure 6 shows a block diagram of a circuit for the simple k-means clustering algorithm for 24-bit full-color RGB images. In Fig. 6, pixels in one image are stored in three memory banks (m0, m1 and m2), and four pixels in the three memory banks are read out at the same time (the data width of four pixels is 24b × 4, and the data width of three memory banks is 32b × 3) every clock cycle. The four pixels are processed in parallel on the fully pipelined circuit, and the results (cluster numbers for the four pixels) are stored in m3. After processing all pixels, four partial sums stored in internal memory banks are summed up to calculate new cluster centers. While processing one image using four memory banks, the next image can be downloaded to other memory banks. In order to find the closest center, squared Euclidean distances to K centers have to be calculated for each pixel. Suppose that the value of a pixel is (xR, xG, xB). Then, its squared Euclidean distance to center_i is

d² = (xR − center_iR)² + (xG − center_iG)² + (xB − center_iB)²

and we need three multipliers to calculate one distance. In Fig. 6, 96 units to calculate the squared distance (each using three multipliers) are used, which means that 96 squared Euclidean distances can be calculated in parallel (24 d² for each pixel, because four pixels are processed in parallel).
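Finding the closest center with three multiplies per squared distance can be sketched as follows; the function name and data types are illustrative, and this computes one pixel against K centers sequentially rather than with the 96-way parallelism of the hardware.

```c
#include <stdint.h>

/* Return the index of the cluster center with the minimum squared
 * Euclidean distance to an RGB pixel.  Each distance needs exactly
 * three multiplies, matching the three multipliers per d^2 unit. */
int closest_center(const uint8_t px[3], const int centers[][3], int k)
{
    int best = 0;
    int32_t best_d2 = INT32_MAX;
    for (int i = 0; i < k; i++) {
        int32_t dr = (int32_t)px[0] - centers[i][0];
        int32_t dg = (int32_t)px[1] - centers[i][1];
        int32_t db = (int32_t)px[2] - centers[i][2];
        int32_t d2 = dr * dr + dg * dg + db * db;  /* three multiplies */
        if (d2 < best_d2) {
            best_d2 = d2;
            best = i;
        }
    }
    return best;
}
```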
Fig. 7 shows an outline of the program for the k-means clustering using SIMD instructions.
[Figure: outline of the program for stereo vision using SIMD instructions. N×Mb denotes N pieces of M-bit data packed in one 128-bit variable; the number in parentheses is the parallelism of the SIMD instruction:]

    t8×16b ← …;                                                 (8)
    SADxy8×16b[d/8] ← SADxy8×16b[d/8] + t8×16b;                 (8)
        // 8 SADxy(x+1, y, d) are obtained
    t8×16b ← the 8 values in Vm16×8b;                           (8)
    s8×16b ← s8×16b − t8×16b;      // 8 partial SADy on x−w     (8)
    mem8×16b[x−w][d/8] ← s8×16b;   // store the 8 partial SADy  (8)
    buf4×32b[d/8×2] ← the 1st four data in SADxy8×16b[d/8];     (4)
    buf4×32b[d/8×2+1] ← the 2nd four data in SADxy8×16b[d/8];   (4)
    shift buf4×32b[d/8×2] and buf4×32b[d/8×2+1] to the left by 8b;
    fill the 8b fields with d;
    }
  }
  min4×32b ← buf4×32b[0];                                       (4)
  for each data in buf4×32b[k]
      min4×32b ← min { min4×32b, buf4×32b[k] };                 (4)
  min1×32b ← the minimum of the four 32b data in min4×32b;
  mind1×8b ← the lowest 8b of min1×32b;
  }
}
[Figure: computation time (sec) of the two-dimensional filters for filter sizes 11, 13 and 15: non-separable and separable, with 1 thread and 4 threads, compared with the FPGA (66 MHz, non-separable & separable) and its circuit size (K LUTs).]

[Figure: speedup of the FPGAs (XC2V6000, XC4VLX160) over the 1-thread and 4-thread SIMD implementations.]

[Figure: speedup and memory throughput (GB/sec) of k-means clustering (K = 48 and K = 256) and the non-separable 15×15 filter on XC2V6000 and XC4VLX160, for k = 16 to 256.]

[Figure: frame rate (fps) and memory throughput (MB/sec) of stereo vision (D = 241), 1 thread and 4 threads.]