
HOW FAST IS AN FPGA IN IMAGE PROCESSING?

Takashi Saegusa, Tsutomu Maruyama and Yoshiki Yamaguchi


Systems and Information Engineering, University of Tsukuba
1-1-1 Ten-ou-dai, Tsukuba, Ibaraki 305-8573, JAPAN
{saegusa,maruyama}@darwin.esys.tsukuba.ac.jp
ABSTRACT
In image processing, FPGAs have shown very high performance in spite of their low operational frequency. This high performance comes from (1) the high parallelism of image-processing applications, (2) the high ratio of 8-bit operations, and (3) the large number of internal memory banks on FPGAs, which can be accessed in parallel. In recent microprocessors, it has become possible to execute SIMD instructions on 128-bit data in one clock cycle. Furthermore, these processors provide multiple cores and cache memory large enough to hold all image data for each core. In this paper, we compare the performance of FPGAs with that of such processors using three applications in image processing: two-dimensional filters, stereo vision and k-means clustering, and make clear how fast an FPGA is in image processing, and how many hardware resources are required to achieve that performance.

1. INTRODUCTION
Many applications in image processing have high inherent parallelism, and the data width of many operations is less than 16 bits. An FPGA can execute those operations in parallel by configuring dedicated circuits for each application. The large number of internal memory banks on FPGAs also supports this parallel processing by enabling parallel access to the several hundreds of data words cached in them. Because of this high parallelism, FPGAs show very high performance in image processing in spite of their low operational frequency. In order to achieve high performance on a hardware platform with a higher operational frequency, graphics processing units (GPUs) have also been used, and have shown very good performance in some applications. However, they are originally designed for a specific sequence of operations, and it is difficult to realize high parallelism across various applications.

Microprocessors have also come to support SIMD instructions for parallel processing, and in recent processors it has become possible to execute a SIMD instruction on 128-bit data in one clock cycle. These processors also provide multiple cores, and each core can execute SIMD instructions independently. Furthermore, the cache size is large enough to store all image data for each core. Because of this progress, it has become possible to realize very high performance in image processing with the processors. However, programming with these SIMD instructions is very tricky, and the performance varies considerably with programming skill.

We have implemented several applications in image processing on FPGAs, and tried to achieve the highest performance by minimizing the number of operations and memory accesses. The methods used for those designs can also be used when programming with SIMD instructions. In this paper, we try to make clear how fast an FPGA is compared with recent processors with SIMD instructions and multiple cores [9]. We compare the performance using three applications: two-dimensional filters [1], stereo vision [2][3][4], and k-means clustering [5][6][7][8]. In these applications, the performance of an FPGA can be improved by using larger FPGAs. The comparison is discussed from the viewpoint of problem size, FPGA size and memory bandwidth.

978-1-4244-1961-6/08/$25.00 ©2008 IEEE.

2. SIMD OPERATIONS
In recent microprocessors such as the Intel Core 2, SIMD instructions on 128-bit data (16 operations on 8-bit data, 8 operations on 16-bit data, 4 operations on 32-bit data, etc.) can be executed in one clock cycle (these SIMD instructions were supported by previous processors, but took more than one clock cycle). Furthermore, these processors provide multiple cores and cache memory large enough to hold all image data for each core. The maximum parallelism becomes 4×16 = 64 in the current version with quad cores. With the very high operational frequency of these processors (3 GHz or more), this parallelism enables very high performance in image processing.

However, programming with these SIMD instructions is very tricky. Sequential parts of a program dominate the total computation time (Amdahl's law), and we need to reduce those parts very carefully. This programming is very similar to hardware design, because we need to keep all stages of all pipelined circuits busy to realize higher performance. In FPGA design, we also need to minimize the number of accesses to external memory banks; reducing memory accesses likewise helps the processors to realize higher performance.
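The "many narrow operations per instruction" idea behind these SIMD units can be imitated even without vector intrinsics. The following Python sketch (an illustration only, not the authors' code) packs eight 8-bit pixel values into one 64-bit word and adds all eight lanes with a handful of ordinary integer operations, masking the high bit of each lane so that carries cannot cross lane boundaries:

```python
# SWAR (SIMD-within-a-register) sketch: add eight 8-bit lanes at once.
# Illustrative only -- real SIMD code would use 128-bit SSE instructions.

def pack8(lanes):
    """Pack eight 8-bit values (lane 0 = lowest byte) into a 64-bit word."""
    assert len(lanes) == 8 and all(0 <= v < 256 for v in lanes)
    word = 0
    for i, v in enumerate(lanes):
        word |= v << (8 * i)
    return word

def unpack8(word):
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

def add8_swar(a, b):
    """Lane-wise 8-bit addition without inter-lane carries.
    The high bit of every lane is added separately so a carry out of
    one lane can never corrupt its neighbour."""
    H = 0x8080808080808080           # high bit of each 8-bit lane
    L = ~H & 0xFFFFFFFFFFFFFFFF      # low 7 bits of each lane
    # add the low 7 bits of every lane, then fold the high bits back in
    return ((a & L) + (b & L)) ^ ((a ^ b) & H)

xs = [10, 200, 255, 0, 17, 128, 64, 99]
ys = [5, 100, 1, 0, 17, 128, 64, 200]
result = unpack8(add8_swar(pack8(xs), pack8(ys)))
# each lane wraps modulo 256, exactly like a packed 8-bit SIMD add
assert result == [(x + y) % 256 for x, y in zip(xs, ys)]
```

The same masking trick underlies why 8-bit image data is so profitable on 128-bit SIMD units: sixteen such lanes fit into one register.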


Fig. 1. Circuits for non-separable and separable filters (figure: line buffers and a register array feed multipliers for G(dx, dy), or Gy(dy) followed by Gx(dx), and adder trees produce S(x, y))

for each (x, y) {
  sum4x32b  ←(4)  0;
  for dy in [y−w, y+w] {
    for each 8 dx in [x−w, x+w] {
      t8x16b  ←(8)  eight I(x+dx, y+dy);
      u8x16b  ←(8)  eight coefficients;  // stored in an 8x16b wide array
      m8x16b  ←(8)  t8x16b × u8x16b;
      v4x32b  ←(4)  the 2k-th value + the (2k+1)-th value in m8x16b;
      sum4x32b ←(4) sum4x32b + v4x32b;
    }
  }
  sum up the four 32b data in sum4x32b;
}

Fig. 2. A program for the non-separable filter


3. TWO-DIMENSIONAL FILTERS
Let w be the radius of the filter, and I(x, y) the value of the pixel at (x, y) in the image. Then the computation for applying a non-separable filter to I(x, y) becomes:

S(x, y) = Σ_{dx=−w..w} Σ_{dy=−w..w} I(x+dx, y+dy) · G(dx, dy)

where G(dx, dy) is the coefficient at (dx, dy).


If G(dx, dy) can be rewritten as Gx(dx) · Gy(dy) (this means that the coefficients of the filter can be decomposed along the x and y axes), the filter is called separable. Then S(x, y) can be calculated as follows:

S(x, y) = Σ_{dx=−w..w} { Σ_{dy=−w..w} I(x+dx, y+dy) · Gy(dy) } · Gx(dx)

This equation means that S(x, y) can be obtained by applying Gy(dy) first to the pixels in the same column, and then Gx(dx) to the results.
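The separability identity can be checked numerically. The sketch below (plain Python with made-up 5×5 kernel values, not the paper's circuit) applies a kernel G(dx, dy) = Gx(dx) · Gy(dy) both directly and as a column pass followed by a row pass, and confirms the results agree:

```python
# Check that the separable evaluation (column pass with Gy, then row pass
# with Gx) matches the direct 2-D sum
#   S(x,y) = sum_dx sum_dy I(x+dx, y+dy) * G(dx, dy),  G = Gx * Gy.
# Kernel and image values are hypothetical, chosen for the illustration.

w = 2                                 # filter radius, size (2w+1) x (2w+1)
gx = [1, 4, 6, 4, 1]                  # Gx(dx), dx = -w..w
gy = [1, 2, 4, 2, 1]                  # Gy(dy), dy = -w..w

W, H = 12, 9
img = [[(3 * x + 7 * y) % 13 for x in range(W)] for y in range(H)]

def non_separable(img, gx, gy, w):
    # direct form: (2w+1)^2 multiplies per output pixel
    out = {}
    for y in range(w, H - w):
        for x in range(w, W - w):
            s = 0
            for dy in range(-w, w + 1):
                for dx in range(-w, w + 1):
                    s += img[y + dy][x + dx] * gx[dx + w] * gy[dy + w]
            out[(x, y)] = s
    return out

def separable(img, gx, gy, w):
    # column pass with Gy first, then row pass with Gx:
    # only 2*(2w+1) multiplies per output pixel
    col = [[sum(img[y + dy][x] * gy[dy + w] for dy in range(-w, w + 1))
            for x in range(W)] for y in range(w, H - w)]
    out = {}
    for yi, row in enumerate(col):
        for x in range(w, W - w):
            out[(x, yi + w)] = sum(row[x + dx] * gx[dx + w]
                                   for dx in range(-w, w + 1))
    return out

assert non_separable(img, gx, gy, w) == separable(img, gx, gy, w)
```

The operation counts in the sketch mirror those of the circuits described next: (2w+1)² multiplies for the direct form against 2(2w+1) for the separable form.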
Fig. 1 shows block diagrams of circuits for non-separable and separable filters (w = 2). In the left of Fig. 1 (non-separable), one pixel value I(x, y) is given to the circuit every clock cycle, and sent to the register array. At the same time, data in the line buffers (the number of line buffers is 2w) are read out in parallel (these data are I(x, y−k), k = 1, ..., 2w) and given to the register array. The read-out data and I(x, y) are held in the register array for 2w+1 clock cycles and multiplied by G(dx, dy). The products are summed up by an adder tree. The read-out data and I(x, y) are also written back to the next line buffers for the calculation of I(x, y+1). This circuit is fully pipelined, and can apply the filter to one pixel per clock cycle (as its throughput). The number of operations is (2w+1)×(2w+1) multiplications and (2w+1)×(2w+1)−1 additions. As for separable filters, the outputs of the line buffers and I(x, y) are first multiplied by Gy(dy) and summed up. The sums are then held in a shift register for 2w+1 clock cycles and multiplied by Gx(dx). The 2w+1 products are summed up by an adder tree, and S(x, y) is obtained. Therefore, the number of operations is (2w+1)+(2w+1) multiplications and 2w+2w additions.
Fig. 2 shows an outline of the program for non-separable filters using SIMD instructions. The numbers following the variable names show their data width, and the small number below an arrow shows the parallelism of the SIMD instruction. In the program, the values of eight pixels are multiplied by eight coefficients, and the 2k-th and (2k+1)-th data of the eight products are added in parallel. Then the four sums are added to the four partial sums kept in one 128-bit word. Finally, the four partial sums are summed up sequentially, and S(x, y) is obtained.
4. STEREO VISION
In a stereo vision system, projections of the same location are searched for in two images (Ir and Il) taken by two cameras, and the distance to the location is obtained from the disparity. In area-based matching algorithms, in order to find the projections, a small window centered at a given pixel in Ir is compared with windows in Il on the same line (epipolar restriction). The sum of absolute differences (SAD) is widely used to compare the windows because of its simplicity. When SAD is used, the value of d in [0, D−1] which minimizes the following equation is searched for:

SADxy(x, y, d) = Σ_{dx=−w..w} Σ_{dy=−w..w} |Ir(x+dx, y+dy) − Il(x+dx+d, y+dy)|

In this equation, (2w+1)×(2w+1) is the size of the window centered at pixel Ir(x, y), d is the disparity, and its range D decides how many windows in Il (their centers are Il(x+d, y)) are compared with the window.
Fig. 3 shows how to calculate SADxy(x, y, d) efficiently. The right half of Fig. 3(A) shows a window in Ir whose center is Ir(x, y). This window is compared with D windows in its target area (whose width is D+2w) in Il (the left half). Suppose that SADxy(x, y, d) has been calculated for all d. Then the window in Ir is shifted by one pixel along the x axis, and compared with the D windows in its target area (Fig. 3(B)). SADxy(x+1, y, d) can be calculated from SADxy(x, y, d) as follows:

SADxy(x+1, y, d) = SADxy(x, y, d) + SADy(x+1+w, y, d) − SADy(x−w, y, d)

SADy(x, y, d) = Σ_{dy=−w..w} |Ir(x, y+dy) − Il(x+d, y+dy)|

Fig. 3. A computation method of the stereo vision (figure: (A) a window in Ir and its target area in Il; (B) the window shifted by one pixel; (C) moving to the next line; (D) the data flow with FIFOs and a memory for partial SADs)

In Fig. 3(B), the SAD of the pixels on column x+1+w (gray boxes) corresponds to SADy(x+1+w, y, d), and that on column x−w corresponds to SADy(x−w, y, d). The SAD of the other pixels was already calculated in Fig. 3(A) and can be reused. Therefore, we only need to calculate SADy(x+1+w, y, d), because SADy(x−w, y, d) was already calculated for SADxy(x−2w, y, d) (and stored in a temporary buffer).

When the window reaches the right end of the line, it is moved back to the left end of the next line. Then SADxy(x+1, y+1, d) is calculated in the same way (Fig. 3(C)). In this case, the SAD of the dark gray pixels was already calculated in Fig. 3(B). Therefore, by storing the SAD of the dark gray pixels during the computation of the previous line, SADy(x+1+w, y+1, d) can be obtained by just adding the absolute difference for Ir(x+1+w, y+1+w) (light gray box) to it.

Fig. 3(D) summarizes how to calculate SADxy(x+1, y, d). The following steps are executed for all d in [0, D−1] in parallel, starting from (x = w, y = w) (the initial values of SADxy(x, y, d) are zero).

1. Load D partial SADs stored in the memory (the SADs of the dark gray pixels on column x+1+w).
2. Calculate |Ir(x+1+w, y+w) − Il(x+1+w+d, y+w)| and add them to the partial SADs (SADy(x+1+w, y, d) are obtained).
3. Put SADy(x+1+w, y, d) into the FIFO, and add them to SADxy(x, y, d).
4. Get D SADy(x−w, y, d) from the FIFO (they were put there when SADxy(x−2w, y, d) were calculated), and subtract them from the values in step 3 (SADxy(x+1, y, d) are obtained).
5. Find the d which minimizes SADxy(x+1, y, d).
6. Calculate |Ir(x−w, y−w) − Il(x−w+d, y−w)| and subtract them from SADy(x−w, y, d) (the SADs of the dark gray pixels on column x−w are obtained).
7. Store those SADs in the memory (they are used for the calculation of SADxy(x−2w, y+1, d) on the next line).

The total number of operations is 2D absolute differences, 2D additions, 2D subtractions, and D−1 comparisons in step 5.
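The sliding-window recurrence is easy to validate against a brute-force SAD. The sketch below (plain Python with random test images, not the FPGA circuit) recomputes only the entering and leaving column sums SADy, updating the window sum with one addition and one subtraction per step:

```python
# Validate SADxy(x+1,y,d) = SADxy(x,y,d) + SADy(x+1+w,y,d) - SADy(x-w,y,d)
# against a brute-force window sum, for one line y and one disparity d.
import random

random.seed(0)
W, w, y, d = 32, 3, 5, 2
Ir = [[random.randrange(256) for _ in range(W)] for _ in range(16)]
Il = [[random.randrange(256) for _ in range(W)] for _ in range(16)]

def sad_y(x, y, d):
    # column sum of absolute differences over 2w+1 pixels
    return sum(abs(Ir[y + dy][x] - Il[y + dy][x + d])
               for dy in range(-w, w + 1))

def sad_xy_brute(x, y, d):
    # full (2w+1) x (2w+1) window sum
    return sum(sad_y(x + dx, y, d) for dx in range(-w, w + 1))

# incremental evaluation along the line
x0 = w
sad = sad_xy_brute(x0, y, d)          # first window computed in full
for x in range(x0, W - w - d - 1):
    assert sad == sad_xy_brute(x, y, d)
    # slide one pixel right: add the entering column, drop the leaving one
    sad = sad + sad_y(x + 1 + w, y, d) - sad_y(x - w, y, d)
```

As in the circuit, only the two column sums change per step; everything inside the window is reused.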
Fig. 4 shows an outline of the program for the stereo vision using SIMD instructions. First, 16 absolute differences are calculated in parallel for Ir(x+1+w, y+w) and Ir(x−w, y−w). Then 8 SADxy(x+1, y, d) are calculated in parallel using the first and second 8 of the absolute differences, because the data width of SADxy is 16 bits (w = 3 or 4 in general). In Fig. 3, FIFOs are used to hold SADy temporarily, but in this program SADy are stored in the memory. The 8 SADxy(x+1, y, d) are copied into two words of a 4×32b wide buffer (buf[]) by expanding the 16b data to 32b. The two words are shifted to the left by 8 bits, and d, d+1, ..., d+7 (up to 8 bits) are inserted into the eight 8b fields created by the shift operations. After calculating all SADxy(x+1, y, d), the four 32-bit data in buf[] are compared (the k-th 32-bit data in buf[i] is compared only with the k-th 32-bit data in buf[j]), and the four minimums are obtained. Then the minimum of the four minimums is chosen, and its lowest 8 bits give the d which minimizes SADxy(x+1, y, d). As shown in Fig. 4, the parallelism of most statements in the main loop is 8.
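Both SIMD programs in this paper end with the same argmin idiom: shift each score left by 8 bits, put the candidate index into the freed low byte, and take an ordinary minimum; the low byte of the winner is the argmin. A small Python sketch of the idiom (illustrative only, with hypothetical SAD values):

```python
# The (score << 8 | index) argmin trick used both for the disparity d
# (stereo vision) and for the cluster number (k-means): after packing,
# a plain min() yields the smallest score, and its low 8 bits carry the
# index that produced it. Scores must fit in 24 bits and indices in 8
# bits for everything to fit in a 32-bit word.

def argmin_packed(scores):
    assert all(0 <= s < (1 << 24) for s in scores) and len(scores) <= 256
    packed = [(s << 8) | i for i, s in enumerate(scores)]
    best = min(packed)                # one comparison chain, no index bookkeeping
    return best >> 8, best & 0xFF    # (minimum score, its index)

sads = [910, 455, 230, 230, 871]      # hypothetical SAD values for d = 0..4
score, d = argmin_packed(sads)
assert (score, d) == (230, 2)         # ties resolve to the smaller index
```

The trick removes the data-dependent branch that a conventional "track the best index" loop would need, which is exactly what makes it SIMD-friendly.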
5. K-MEANS CLUSTERING
Given a set S of D-dimensional points and an integer K, the goal of k-means clustering is to partition the points into K subsets Ci (i = 1, ..., K) so that the following error function is minimized:

E = Σ_{i=1..K} Σ_{x∈Ci} (x − center_i)²

where center_i = Σ_{x∈Ci} x / |Ci| (the mean of the points in Ci).

Fig. 5 shows one iteration of the simple k-means clustering algorithm. First, the squared distances to the K centers are calculated for each point in the data set, and the minimum of them is chosen (the point belongs to the cluster which gives the minimum distance). Then new centers are calculated from the points which belong to each cluster. These operations are repeated until no improvement of E is obtained.

By applying the k-means clustering algorithm to color images, we can reduce the number of colors in the images to K while maintaining the quality of the images.


OneIteration (PointSet S, Centers Z) {
  E ← 0;
  for each (x ∈ S) {
    z ← the closest point in Z to x;
    z.weightCentroid ← z.weightCentroid + x;
    z.count ← z.count + 1;
    E ← E + (x − z)²;
  }
  for each (z ∈ Z)
    z ← z.weightCentroid / z.count;
  return E;
}

Fig. 5. One iteration in the simple k-means algorithm
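A direct transcription of Fig. 5 into Python may make the data flow clearer (a sketch with 3-dimensional RGB points; variable names follow the figure, and empty clusters are simply left where they are):

```python
# One iteration of the simple k-means algorithm from Fig. 5: assign each
# point to its nearest center, accumulate per-cluster sums and counts,
# then move each center to the mean of its points. Returns (E, centers).

def one_iteration(points, centers):
    K = len(centers)
    sums = [[0.0, 0.0, 0.0] for _ in range(K)]   # z.weightCentroid
    counts = [0] * K                             # z.count
    E = 0.0
    for x in points:
        # squared Euclidean distance to every center, keep the closest
        d2 = [sum((a - b) ** 2 for a, b in zip(x, z)) for z in centers]
        i = d2.index(min(d2))
        sums[i] = [s + a for s, a in zip(sums[i], x)]
        counts[i] += 1
        E += d2[i]
    new_centers = [[s / counts[i] for s in sums[i]] if counts[i]
                   else list(centers[i]) for i in range(K)]
    return E, new_centers

# two obvious clusters of "pixels"
points = [(10, 10, 10), (12, 10, 9), (200, 200, 210), (198, 202, 205)]
E1, centers = one_iteration(points, [(0, 0, 0), (255, 255, 255)])
E2, centers = one_iteration(points, centers)
assert E2 <= E1   # the error never increases between iterations
```

In the color-quantization setting described above, `points` would be the image pixels and K the number of output colors.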


SADxy8x16b[D/8] ← {0, ..., 0};  // initialization
for (y = w; y < Y+w; y++) {
  for (x = w; x < X+w; x++) {
    for each 16 d in [0, D−1] {
      Vp16x8b ←(16) |Ir(x+1+w, y+w) − Il(x+1+w+d, y+w)|;
      Vm16x8b ←(16) |Ir(x−w, y−w) − Il(x−w+d, y−w)|;
      for the first and second 8 d in Vp16x8b and Vm16x8b {
        t8x16b ←(8) the 8 values in Vp16x8b;
        s8x16b ←(8) 8 partial SADy in mem8x16b[x+w+1][d/8];
        t8x16b ←(8) t8x16b + s8x16b;           // 8 SADy are obtained
        mem8x16b[x+w+1][d/8] ←(8) t8x16b;      // store the 8 SADy
        s8x16b ←(8) 8 SADy in mem8x16b[x−w][d/8];
        t8x16b ←(8) t8x16b − s8x16b;
        SADxy8x16b[d/8] ←(8) SADxy8x16b[d/8] + t8x16b;
                                 // 8 SADxy(x+1, y, d) are obtained
        t8x16b ←(8) the 8 values in Vm16x8b;
        s8x16b ←(8) s8x16b − t8x16b;           // 8 partial SADy on x−w
        mem8x16b[x−w][d/8] ←(8) s8x16b;        // store the 8 partial SADy
        buf4x32b[d/8×2]   ←(4) the 1st four data in SADxy8x16b[d/8];
        buf4x32b[d/8×2+1] ←(4) the 2nd four data in SADxy8x16b[d/8];
        shift buf4x32b[d/8×2] and buf4x32b[d/8×2+1] to the left by 8b;
        fill the 8b fields with d;
      }
    }
    min4x32b ←(4) buf4x32b[0];
    for each data in buf4x32b[k]
      min4x32b ←(4) min{min4x32b, buf4x32b[k]};
    min1x32b ← the minimum of the four 32b in min4x32b;
    mind1x8b ← the lowest 8b of min1x32b;
  }
}

Fig. 4. A program for the stereo vision

Figure 6 shows a block diagram of a circuit for the simple k-means clustering algorithm for 24-bit full-color RGB images. In Fig. 6, the pixels of one image are stored in three memory banks (m0, m1 and m2), and four pixels are read out of the three memory banks every clock cycle (the data width of four pixels is 24b×4, and the data width of three memory banks is 32b×3). The four pixels are processed in parallel by the fully pipelined circuit, and the results (the cluster numbers of the four pixels) are stored in m3. After processing all pixels, the four partial sums stored in internal memory banks are summed up to calculate the new cluster centers. While one image is being processed using four memory banks, the next image can be downloaded into the other banks. In order to find the closest center, the squared Euclidean distances to the K centers have to be calculated for each pixel. Suppose that the value of a pixel is (xR, xG, xB). Then its squared Euclidean distance to center_i is

d² = (xR − centeriR)² + (xG − centeriG)² + (xB − centeriB)²

and we need three multipliers to calculate one distance. In Fig. 6, 96 units that calculate a squared distance (three multipliers each) are used, which means that 96 squared Euclidean distances can be calculated in parallel (24 d² for each pixel, because four pixels are processed in parallel).

Fig. 6. A circuit for the simple k-means clustering algorithm

Fig. 7 shows an outline of the program for the k-means clustering using SIMD instructions. In the program, the distances to eight cluster centers are calculated in parallel for each pixel: the data width of the R, G and B components of I(x, y) and the cluster centers is 8b, the data width of their differences is 9b (signed), and the width of the squares of the differences is 16b (unsigned). The squares are summed up for R, G and B, and the squared distance is obtained. The width of the distance becomes larger than 16b; therefore, the eight squares are summed up into two variables s_low and s_high. We need to find the number of the cluster which is closest to I(x, y). The distances in the two variables are shifted to the left by 8b, and the cluster numbers (up to 8b) are inserted into the 8b fields (the data width of the distances is less than 24b). Then the minimum of (distance << 8 | cluster number) is searched for, and the number of the nearest cluster is obtained. As shown in Fig. 7, the parallelism of the statements in the main loop is 4 or 8.
for each I(x, y) {
  n ← 0;
  for each 8 centers (ck, ..., ck+7) in K centers {
    slow4x32b  ←(4) 0;
    shigh4x32b ←(4) 0;
    for r1x8b in {IR(x, y), IG(x, y), IB(x, y)} {
      t8x16b ←(8) {r1x8b, r1x8b, ..., r1x8b};  // copy 8 times
      u8x16b ←(8) the R, G or B values of the 8 centers;
                  // already stored in an 8x16b wide array
      t8x16b ←(8) t8x16b − u8x16b;
      t8x16b ←(8) t8x16b × t8x16b;
      slow4x32b  ←(4) slow4x32b  + the 1st four data in t8x16b;
      shigh4x32b ←(4) shigh4x32b + the 2nd four data in t8x16b;
    }
    shift slow4x32b and shigh4x32b to the left by 8b;
    fill the 8b fields with the center numbers (k, ..., k+7);
    mem4x32b[n++] ←(4) slow4x32b;
    mem4x32b[n++] ←(4) shigh4x32b;
  }
  min4x32b ←(4) mem4x32b[0];
  for all slow4x32b and shigh4x32b stored in mem4x32b[]
    min4x32b ←(4) min{min4x32b, mem4x32b[]};
  min1x32b ← the minimum of the four 32b in min4x32b;
  cn1x8b ← the lowest 8b of min1x32b;  // cluster number
  error1x32b ← error1x32b + (min1x32b >> 8);
  sumRGB4x32b[cn1x8b] +=(4) {IR(x, y), IG(x, y), IB(x, y)};
  count1x32b[cn1x8b] += 1;
}
update cluster centers;

Fig. 7. A program for the k-means clustering

6. EXPERIMENTAL RESULTS
We have implemented the programs with the SIMD instructions on an Intel Core 2 Extreme QX6850 (3 GHz, 8 MB L2 cache) with 4 GB main memory, and compiled them using the Intel C++ Compiler 10.0. This processor has four cores, and the performance can be improved by multi-thread execution (the image data can easily be divided into four sub-images, each processed by one thread). In the following comparison, the time to download images from main memory is not included, and the performance of the processor is the average over 1000 runs.

Fig. 8 compares the performance of the filter programs (separable and non-separable) on the processor and of a circuit on a Xilinx XC4VLX160 (66 MHz) for a 640×480-pixel grayscale image. In Fig. 8, (2w+1)×(2w+1) is the size of the filters. The performances of the FPGA for separable and non-separable filters are the same and almost constant for all w, because they are decided by the input speed of the image data (though the circuit size for non-separable filters is almost proportional to w×w). The performance of the processor for separable filters is more than 400 fps even with 1 thread, and faster than the FPGA for all w. As for non-separable filters, the performance with 4 threads is faster than the FPGA when the filter size is smaller than 7 (the computation time is almost proportional to w×w). When the filter size is 15×15, the performance is about 43 fps, which is fast enough for real-time applications but slower than the FPGA (about 217 fps). The performance gain by 4 threads is about 3.9 times.

Fig. 8. Performance of two-dimensional filters

Fig. 9. Performance of the stereo vision

Fig. 9 compares the performances for the stereo vision


of a 640×480-pixel grayscale image. In Fig. 9, the window size is 7×7. The performance of the processor with 4 threads is about 3.6 times that of 1 thread, and faster than 30 fps when D ≤ 224. The performances of the FPGAs (Xilinx XC2V6000 and XC4VLX160, 66 MHz) change stepwise. On the XC2V6000, a window in Ir can be compared with D = 121 windows in Il in parallel (this is limited by the number of block RAMs; the limit is 241 on the XC4VLX160). Therefore, when D is less than 62, two windows in Ir can each be compared with 61 windows in Il in parallel. When D is larger than 121, 121 windows are compared in the first pass, and the rest in the second pass. Therefore, the performance gain by the FPGAs looks like a sawtooth wave (only the gain over 4 threads is shown). The FPGAs are faster than the processor for all tested D, and the XC4VLX160 is two times faster than the XC2V6000 because it is two times larger.

Fig. 10 compares the performances of the k-means clustering for an image called monarch (a 768×512-pixel color image). The performance of the processor with 4 threads is about 3.7 times that of 1 thread. The performances of the FPGAs (82.7 MHz) again change stepwise, because the distances to up to 24 clusters can be calculated in parallel on the XC2V6000 (48 on the XC4VLX160), and for more centers the same operations are repeated (in this case the performances for k on the same step are not the same because of the iterations in the



Fig. 10. Performance of the k-means clustering algorithm


k-means clustering algorithm). The performance of the processor gradually decreases for larger k, and the performance gain by the FPGAs looks like a sawtooth wave (only the gain over 4 threads is shown). The XC4VLX160 is two times faster than the XC2V6000. The performance of the processor for k = 8 (4 threads) is 16.7 fps, a bit slower than the real-time requirement (20-30 fps), but the performance for a smaller image called lena (512×512 pixels) is 37.3 fps for k = 8, which satisfies the requirement.
7. DISCUSSION AND FUTURE WORKS
We have compared the performance of a processor with SIMD instructions and multiple cores against two FPGAs, using three simple problems in image processing. The performance gain by the FPGAs (over 4 threads) is not so large (5-15 times); we can no longer expect the drastic gains (hundreds of times) which were possible against previous processors with limited SIMD instructions and a single core. The performances of the FPGAs are limited by the size of the FPGAs and by the memory bandwidth (image data are too large to store on the FPGAs). The performance of the FPGAs can be improved by dividing an image into sub-images and processing them in parallel, in the same way as multi-thread execution on the microprocessor, if the memory throughput allows it. Fig. 11 shows the measured and estimated speedups (over 4 threads) for the three problems (in the k-means clustering, each pixel takes 6 clock cycles when K = 256 and one clock cycle when K = 48). With an FPGA twice as large as the XC4VLX160 (the circles in Fig. 11), we can expect twice the performance, though the required memory throughput also doubles.
Fig. 11. Estimated performances and memory throughput

The performance of the processor with quad cores is fast enough for real-time processing (more than 30 fps) when the image size is small. However, all the tested problems are preliminary stages of more sophisticated tasks. All hardware resources of the processor have to be used to satisfy the real-time requirement, and no resources are left for the rest of the work. The performance gain by FPGAs is limited, but a large amount of hardware resources is still available on a large FPGA, so the more sophisticated processing that builds on these tasks can also be executed on the FPGA. Even taking the performance improvement of processors in the near future into account, the real-time applications that the processors can handle in image processing will still be very limited, and we need FPGAs for practical real-time applications.
Several issues remain to be considered: we have compared the performances using only three problems; the programs on the processor are not fully tuned; and power consumption and cost are not taken into account in the comparison.
8. REFERENCES
[1] R. Turney, "Two dimensional linear filtering," http://www.xilinx.com
[2] H. Niitsuma, T. Maruyama, "Real-time Generation of Three-Dimensional Motion Fields," FPL 2005, pp. 179-184.
[3] A. Darabiha, et al., "Video-rate stereo depth measurement on programmable hardware," Computer Vision and Pattern Recognition, 2003, pp. 203-210.
[4] J. Diaz, E. Ros, F. Pelayo, E. M. Ortigosa, S. Mota, "FPGA-based real-time optical-flow system," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Vol. 16, Issue 2, 2006, pp. 274-279.
[5] T. Saegusa, T. Maruyama, "An FPGA Implementation of K-Means Clustering for Color Images Based on Kd-Tree," FPL 2006, pp. 567-572.
[6] T. Saegusa, T. Maruyama, "An FPGA implementation of real-time K-means clustering for color images," Journal of Real-Time Image Processing, Springer, Vol. 2, No. 4, 2007, pp. 309-318.
[7] M. Estlick, M. Leeser, J. Theiler, J. J. Szymanski, "Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware," FPGA 2001, pp. 103-110.
[8] B. Maliatski, O. Yadid-Pecht, "Hardware-driven adaptive k-means clustering for real-time video imaging," IEEE TCSVT, Vol. 15, Issue 1, 2005, pp. 164-166.
[9] T. Saegusa, T. Maruyama, Y. Yamaguchi, "How fast is an FPGA in image processing?," IEICE Technical Report, Vol. 108, No. 48, 2008, pp. 83-88.

