
ISSC 2007, Derry, Sept 13-14

Development of a FPGA Based Real-Time Blob Analysis Circuit

J. Treinφ, A. Th. Schwarzbacher*, B. Hoppe+, K.-H. Noffz¤ and T. Trenschel¤

*School of Electronic and Communications Engineering, Dublin Institute of Technology, IRELAND
+Department of Electronics and Computer Science, µSystems Research Group, Darmstadt University of Applied Sciences, GERMANY
¤Silicon SOFTWARE GmbH, Mannheim, GERMANY

E-mail: φjohannes.trein@web.de, *andreas.schwarzbacher@dit.ie, +hoppe@eit.h-da.de, ¤info@silicon-software.com

Abstract— Blob analysis has become a well-known method for the detection of objects in digital images and is an important part of the fields of image processing and computer vision. With the increasing resolutions and frame rates of recent digital video cameras, the analysis requires computationally intensive operations. Software implementations may not be able to achieve satisfactory performance. Furthermore, existing hardware solutions require processing the picture in multiple passes. This paper describes the development of an FPGA algorithm performing high-speed real-time blob analysis in only a single pass.

Keywords – blob analysis, object detection, region labelling, image processing, FPGA

I INTRODUCTION

In the fields of image processing and computer vision, blob analysis has become an important part of the validation and verification of objects [1]. The aim of blob analysis is to detect objects in an image which can be separated from the background. Furthermore, the properties of the found objects are calculated. Today's image processing is often performed on microprocessor-based systems such as personal computers (PCs). The growing demands on speed together with the increasing resolutions of digital video cameras require a real-time blob analysis [1]. The high performance of today's digital cameras can only be matched by microprocessor-based systems with serious performance constraints and high effort. The problem of a processor-based system is the serial processing of data, which cannot reach the performance of a parallel hardware implementation. Even modern processors operating in the gigahertz range cannot compensate for this performance gap because of limited RAM bandwidth. High-resolution image processing is well suited to parallel operations, because most of its algorithms can easily be realised in a pipelined architecture.

The rapid development of field programmable gate arrays (FPGAs) during recent years, however, offers new possibilities. Complex algorithms can now be implemented in hardware without a high increase in design effort and cost. Furthermore, FPGAs enable almost unrestricted parallel processing of data and are flexible to use because of their re-programmability. Modern FPGAs provide built-in memory blocks and multipliers which enable the realisation of time-critical and memory-intensive applications.

This paper describes the FPGA investigation and implementation of a real-time blob analysis algorithm. Digital images are transferred from the camera into the FPGA and are analysed in real time. The resulting object features are then passed to a host PC for further processing. The algorithm also facilitates the calculation of numerous object features such as area, centre of gravity, contour length, orientation or image moments [2]. The hardware is based on a Silicon SOFTWARE [3] frame grabber system where the FPGA is a XILINX Spartan II or Spartan III [4] device. It operates at a core frequency of 50 MHz. Through the parallel processing of up to 32 pixels and a pipelined architecture it is theoretically possible to process 1600 MPixels/s. This, however, is limited to 680 MPixels/s by the input interface. The implementation is carried out in a low-cost FPGA and therefore makes an economic realisation of the developed real-time single-pass blob analysis algorithm possible.
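The headline throughput figure above follows directly from the clock frequency and the degree of parallelism. As a quick plausibility check, the arithmetic can be sketched as follows (a back-of-the-envelope illustration only, not part of the original design flow):

```python
# Plausibility check of the throughput figure quoted above.
# Assumptions taken from the text: 50 MHz core clock, up to 32 pixels
# consumed in parallel per clock cycle.
core_clock_hz = 50e6
pixels_per_cycle = 32

# Theoretical pixel rate = clock frequency times pixels per cycle.
theoretical_rate = core_clock_hz * pixels_per_cycle

print(theoretical_rate / 1e6)  # 1600.0 MPixels/s, as stated in the text
```

The 680 MPixels/s practical limit quoted in the text stems from the input interface, not from this product.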
II BLOB ANALYSIS

A blob analysis detects and determines the properties of objects which can be separated from the background and from other objects. It should be mentioned that it is not the aim of the blob analysis to extract images of the found objects from the original image. Therefore, the outputs of the analysis are properties or object features rather than images or pixels.

A blob or object is defined by a set of pixels which differ from the background and are connected through a direct neighbourhood. A distinction is made between a four-pixel neighbourhood and an eight-pixel neighbourhood. The difference is illustrated in Figure 1. Here, every pixel is labelled with a number representing the respective object. The left part of the figure shows that diagonally connected pixels are not connected in a four-pixel neighbourhood. The right part shows that diagonally connected pixels belong to the same object if an eight-pixel neighbourhood is used.

Figure 1: Four and Eight Pixel Neighbourhood

The raw data for the blob analysis may be any greyscale two-dimensional image. Therefore, the first step of the analysis is a binarisation of the image. This means a conversion to a black and white image, or an image consisting of foreground and background pixels. One possible result of this binarisation is that two objects which overlap in the captured image can no longer be distinguished. Therefore, they are considered to be only one object when performing the blob analysis. Shadows in the image can have the same effect. They can be reduced by adequate pre-processing operations applied to the image in the FPGA, such as an adaptive threshold [1] or erosion and dilation [5]. It should be noted that this effect can also be a desired property of the blob analysis. An example is the overlapping or sticking together of parts in a production process which have to be filtered out. They can be separated by calculating the differing area with the blob analysis. An example of two overlapping objects is shown in Figure 2. The determined area allows the detection of this overlap.

Figure 2: Overlapping Objects in an Image with Their Respective Areas

In image processing literature such as [5] and [6] the blob analysis is also called object labelling. This refers to the first step of the analysis, which is necessary for the detection of the blobs. Here, an algorithm allocates a label number to all connected foreground pixels and marks the pixels with this number. Thus, all pixels have a specified label which associates them with an object. Afterwards, the features of the labelled pixels, and thus the features of the objects, are calculated. For microprocessor-based systems numerous algorithms exist to perform this labelling and feature calculation, such as an incremental scan through the picture [5] or iterative flood filling [5]. Unfortunately, it is not possible to transfer these algorithms into hardware because of FPGA constraints. In a PC the captured image data is stored in the system memory and the microprocessor then has random access to this data. This is not possible inside an FPGA, because the built-in block RAM is not big enough for the large data volumes. Furthermore, a pipelined architecture is not possible if the data has to be stored first before post-processing and performing the blob analysis. Consequently, the analysis has to be performed in a raster scan procedure at the rate of the camera transfer. Thus, random access to the image content is not possible. Furthermore, it is not possible to process the whole image twice because of the lack of available memory capacity. For example, it is not possible to process the image from the bottom after the first pass, as many algorithms require [6]. This is the main advantage of the presented algorithm: with this novel algorithm the blob analysis can be pipelined and performed in a single pass. Through a compression, multiple pixels can be processed in parallel.

III THE SINGLE PASS BLOB ANALYSIS ALGORITHM

The developed algorithm is based on the idea of directly capturing the frames from a digital camera, performing the blob analysis and then passing the results on to a host PC. The interface used to transmit the frames between the camera and the FPGA is the 'camera link' standard [7]. This interface allows the transmission of up to eight pixels in parallel at a clock frequency of up to 85 MHz. Furthermore, a timing synchronisation before data processing in the
FPGA is required. Thus, the pixels are first buffered in an SDRAM. From this buffer the pixels are read with a typical clock frequency of 50 MHz. As described, the clock frequency of the camera link is higher. To avoid a limitation of these data rates inside the FPGA, a set of antecedent pixels is read from the buffer in parallel. This may be up to 32 pixels, which are transmitted in one clock cycle.

The parallel processing of pixels in the blob analysis is rather unfavourable, because the algorithm works similarly to a human who is trying to detect an object. A random point in an object is focused with the eye. From this point all adjacent points which belong to the object are focused until the object can be seen in total. If this procedure is started at several points at the same time, it is possible that two objects merge together although they are in fact different objects. These merges would require time-intensive operations and are difficult to implement. To still utilise the parallelism and therefore the high data rates, the developed algorithm compresses the image data first. This is performed by a run length encoding (RLE) [8] and results in the transmission of multiple pixels in one clock cycle.

        pixel no.           run length code
   0 1 2 3 4 5 6 7
   0 0 0 0 1 0 0 0          (4/4)
   0 0 1 1 1 1 0 1          (2/5) (7/7)
   0 1 1 1 1 0 0 0          (1/4)
   0 0 0 0 0 1 0 0          (5/5)
   1 0 1 1 0 1 1 0          (0/0) (2/3) (5/6)
   0 0 1 1 1 0 0 0          (2/4)
   0 0 0 0 0 0 0 0

Figure 3: Pixels of an Image and its Corresponding Run Length Code

The run length encoding is based on the assumption that black and white pixels inside a row continue without a change of colour over longer sections. These sections, called runs, can be described by their start and end positions. Here, the runs of foreground pixels are described. An image and its corresponding run length code are shown in Figure 3. It illustrates that the amount of information is reduced by the compression. Thus, in one clock cycle only one start and one end value is transmitted, but the run may describe multiple pixels. This allows the transmission of multiple pixels in parallel and can maintain the parallel input speed.

An object is detected by the direct neighbourhood of pixels. Thus, the objects 'grow' until they consist of all pixels belonging to the object. The developed algorithm works similarly. Instead of detecting the neighbourhoods of pixels, the connection between runs is determined. Every row consists of a variable number of runs. Inside a row, runs cannot be in direct contact with other runs of the same row, because such runs would already have been merged into a single run by the run length encoding. Thus, each row is compared with its antecedent row. The vertical relations of the runs are determined by this method. If two runs in consecutive rows overlap in their start or end values, then they must belong to the same object. The algorithm always compares a current row with a previous row. As the image is processed in raster scan order, the algorithm has to proceed through the rows in the same way. The respective current row becomes the previous row when performing the next comparison with the following row. The comparison between two rows is shown in Figure 4. The runs are illustrated by bars for a better visualisation of their positions. Technically they consist only of start and end values.

Figure 4: Comparison Between two Rows

Many algorithms first allocate an object number to each pixel. During a second pass they then calculate the object features. However, this is not possible inside an FPGA. Therefore, a new method has been developed. First, every run is also allocated a specified object number. In contrast, this is only applied to the current row and the previous row. After the respective previous row has been compared with the current row, its data is discarded. Hence, the runs which have been compared with other runs are no longer available.

Next, it will be described how the objects can be determined if the runs have already been discarded. This is required to determine the properties of the objects rather than of the object pixels. These properties are determined step by step while comparing the runs. An example will clarify this. In Figure 4 one of the runs is labelled with '?'. The run is not yet associated with any object. While comparing the run with the previous row, it is obvious that it has no direct connection to any of the runs of the previous row. Hence, the run has to belong to an object which has not yet been detected. Therefore, the run is associated with a newly created object with the respective number 4. The algorithm is based on the idea that object properties are calculated while the procedure of comparing runs is in progress. The resulting properties are stored in a memory block called the object-feature-RAM. If run 4 is associated with a new object, its properties are calculated immediately. If object 4 is continued when comparing with the next row, the new properties are calculated immediately and the results are updated in the object-feature-RAM.
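The row-to-row run comparison just described can be sketched in software. The following Python fragment is an illustrative model only, not the authors' hardware state machine; representing runs as (start, end) pairs and using a plain list lookup for the previous row are assumptions of the sketch, and the case of two objects merging is deliberately left out here (it is treated later in the paper):

```python
def runs_connected(a, b, eight=True):
    """Two runs (start, end) in consecutive rows belong to the same
    object if they overlap; with an eight-pixel neighbourhood a
    diagonal touch (offset of one pixel) also counts."""
    slack = 1 if eight else 0
    return a[0] <= b[1] + slack and b[0] <= a[1] + slack

def label_row(prev, current, next_label):
    """prev: list of (run, label) pairs from the previous row.
    current: list of runs of the current row.
    Returns the labelled current row and the next free label."""
    labelled = []
    for run in current:
        label = None
        for prun, plabel in prev:
            if runs_connected(run, prun):
                label = plabel          # run continues an existing object
                break
        if label is None:               # no connection: create a new object
            label = next_label
            next_label += 1
        labelled.append((run, label))
    return labelled, next_label

# A run overlapping a previous-row run inherits its label; an isolated
# run (like the '?' run of Figure 4) starts a new object.
prev = [((0, 2), 1), ((6, 8), 3)]
labelled, nxt = label_row(prev, [(1, 3), (12, 14)], next_label=4)
# labelled == [((1, 3), 1), ((12, 14), 4)], nxt == 5
```

Once a row has been labelled, it becomes the previous row for the next comparison and its predecessor can be discarded, mirroring the two-row storage scheme of the paper.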
The procedure can be summarised as follows:

1. comparison of a run with the runs of the previous row
2. association with an existing object or creation of a new object
3. allocation of an object number
4. calculation of the properties of the run and update of the properties in the object-feature-RAM

The examples presented are based on simple objects. More complex objects may have shapes which merge or divide. Two objects which are assumed to be individual when comparing early rows can merge into one object later. An example of this is shown in Figure 5. While performing a blob analysis these merges cause problems. This is a reason why several algorithms need to proceed through the image a second time. These algorithms 'clean up' the object associations by proceeding through the image in the inverted direction [6].

Figure 5: The Merging of two Objects.

The developed algorithm solves this problem in a novel way. As the properties are calculated while detecting the objects, the problems of merges can be solved without a second pass over the image. The novel solution is now described by means of an example. In Figure 5 the merge of two objects can be seen in the middle of the 'M' shape, marked with an 'a'. If such a merge is detected, all properties of object two are merged into the properties of object one. Therefore, all properties determined so far are available in object one. Now the properties stored in object two are out-of-date and can be deleted. However, it is still possible that runs exist which are labelled with the object number two. In Figure 5 one of these runs is marked at the right end of the 'M' shape. The normal procedure for the run marked with '?' would be the detection of the contact with the run labelled '2' in the previous row. Hence, the run would be labelled with '2' and the properties of object two would be updated, but this object is out-of-date. To avoid this scenario, a pointer is set in the object-feature-RAM entry of object two during the merging process. The pointer is set to point to object one. If an update process tries to update object two, it is redirected to object one and the correct update is ensured. This example has shown that all object properties are determined correctly. A second pass is not necessary. Hence, the algorithm can be performed in only a single pass.

By comparing the rows, the properties of the objects are calculated and stored in the object-feature-RAM. Hence, the properties of the objects grow until an object is fully processed. Once the analysis of an image is completed, the resulting object properties could be passed to the host PC. This is not a very convenient approach, because the objects might be completed earlier at different positions in the image, and passing all object properties to the host at the same time would produce a peak data rate. The developed algorithm is therefore designed to pass object properties to the host as soon as they are fully detected. The completion of an object can be ensured if it is not further updated during the comparison of the next row. A side effect is the independence of an 'end of frame' state, which is advantageous if a line camera is used. Theoretically, with this novel method endless pictures may be processed.

IV IMPLEMENTATION

The implementation of the algorithm is performed on a XILINX Spartan FPGA [4] which is embedded on a frame grabber. This frame grabber is a PCI-express or PCI 64-bit PC extension card [3]. The card has two external camera link interfaces which allow the connection between the camera and the FPGA. After performing the blob analysis, the results are transferred over the PCI bus into the main memory of the host PC. The complete configuration of the frame grabber is shown in Figure 6.

Figure 6: Configuration of the Frame Grabber.

The algorithm developed fits into a low-cost FPGA, which allows an economic realisation of the blob analysis. If the algorithm has to determine image moments of a grade higher than two, an FPGA with built-in multipliers is required to maintain the timing constraints and avoid propagation problems. Figure 7 shows an overall block diagram of the developed algorithm explained in Section III.
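The pointer-based merge handling described above can be modelled with a small software sketch (an illustration under assumed data structures, not the hardware design): when two objects merge, the stale entry is replaced by a pointer to the surviving one, and any later update addressed to the stale object is redirected.

```python
# Illustrative model of the object-feature-RAM with merge pointers.
# Assumption for the sketch: the only tracked feature is an area count;
# 'ram' maps object numbers to {'area': n} or {'pointer': other_object}.
ram = {1: {'area': 10}, 2: {'area': 7}}

def resolve(obj):
    """Follow pointers until a live object entry is reached."""
    while 'pointer' in ram[obj]:
        obj = ram[obj]['pointer']
    return obj

def merge(survivor, stale):
    """Combine the properties of two merged objects; the stale
    entry becomes a pointer to the survivor."""
    ram[survivor]['area'] += ram[resolve(stale)]['area']
    ram[stale] = {'pointer': survivor}

def update(obj, pixels):
    """Updates addressed to a stale object are redirected."""
    ram[resolve(obj)]['area'] += pixels

merge(1, 2)            # object two is absorbed into object one
update(2, 5)           # a run still labelled '2' updates object one
print(ram[1]['area'])  # 22
```

This is essentially a flattened union-find: the redirection guarantees that all properties end up in the surviving object, which is why no second pass is needed.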
Figure 7: Overall Block Diagram of the Blob Analysis Algorithm.

The input of the run length encoder is a set of n antecedent pixels in parallel. Accordingly, the output is a stream of runs, i.e. start and end values. The encoder is limited to outputting one run per clock cycle whenever one or more runs start or end within the parallel incoming pixels.

The memory required to store the frame rows consists of two parts. Part one holds the runs of the respective current row and the other part stores the runs of the previous row. Furthermore, each memory holds the corresponding object number for each run. Inside the FPGA the memory is realised by the use of block RAMs. The RAM is a true dual-port RAM. Thus, two different components have access to the RAM at the same time. The blob analysis algorithm exploits this feature. The runs coming from the run length encoder are written into the RAM and at the same time runs are read and object numbers are written by the object detection mechanism.

The memory holding the object properties, i.e. the object-feature-RAM, is also realised by block RAM. Here, the memory address directly corresponds to the object number. The object detection mechanism itself is a state machine. It reads the runs from the current and previous row, checks for connections and updates the properties in the object-feature-RAM.

Furthermore, a component is necessary which allocates free addresses in the object-feature-RAM to store the properties of newly detected objects. This task is performed by the garbage collector. It collects completed objects in the object-feature-RAM, passes them to the host PC and deletes the entries in the RAM which are no longer required. For these read and write processes the second port of the object-feature-RAM is used. The addresses of the memory cells in the object-feature-RAM which become empty are stored in a stack. From here the object detection mechanism uses them for the allocation of new object numbers.

V PERFORMANCE

A program has been written to verify and benchmark the algorithm with respect to its speed and resource requirements. It is based on a Java program which has been written in an ImageJ environment [9]. The program simulates the hardware algorithm. Moreover, the use of several internal counters facilitates the calculation of the required clock cycles for read, write and calculation processes. A statistical output allows the derivation of the obtained performance and the verification of the functionality of the blob analysis. Numerous images have been tested. Especially the content of the image has an impact on the performance of the algorithm. Thus, the influence of image noise and shadows in natural images are important criteria. Image noise might result in multiple small blobs which are detected by the blob analysis and are determined to be new individual objects. Thus, they influence the performance and the required resources. For images with equal scenes the analysis is independent of the image resolution with respect to the performance. This is a feature of the run length encoding. When images have the same content they will have the same objects and therefore the same number of runs per row. Only the start and end values of the runs vary. Thus, the content and the number of objects, and not the image resolution, is the crucial factor for the required clock cycles of the algorithm.

The performance of the blob analysis is limited by the maximum speed of the transmission of the raw data between the camera link interface and the run length encoder. An image with a resolution of 8k pixels, i.e. an image with a resolution of 8192 by 8192 pixels, has 67 MPixels. At a clock frequency of 50 MHz and the parallel transmission of 32 pixels a frame rate of

    (32 · 50 MHz) / (67 MPixels) ≈ 24 fps    (1)

can be achieved. If images with a resolution of 4k are used, a frame rate of

    (32 · 50 MHz) / (17 MPixels) ≈ 94 fps    (2)

can be achieved. The determined frame rates depend on the picture content. For images with normal objects these frame rates can be reached. If the image contains strong noise, shadows or text with numerous characters, the object detection mechanism might require more computing time. In general, the algorithm is fast enough to handle the incoming data. The bottleneck is the transmission of the incoming
data and the camera link interface, not the blob analysis algorithm itself.

An example of the output of the simulation is shown in Figure 8. The 1 MPixel input image consists of three main objects. Very strong noise is added to simulate a worst-case scenario. A cut-out of the image after the binarisation is shown in the figure. The algorithm calculates the area, the bounding box, the centre of gravity as well as the orientation of the objects. The blob analysis detects a total of 60069 objects in the image, due to the high level of added noise. Only object properties with an area greater than 100 pixels are shown in the figure. The bounding boxes show that the objects are detected correctly. The algorithm requires 377023 clock cycles for the blob analysis. Therefore, the maximum frame rate is still 132 fps. This shows the ability of the algorithm to handle strong image noise while maintaining the high frame rates and still operating correctly.

Figure 8: Simulation Output of an Analysed Image.

Besides the performance, the simulations of numerous images have provided an estimation of the required resources. Here, the crucial factor is the memory required to hold the object properties and the memory requirements of the current frame row as well as the previous row. Tests have shown that the total amount of block RAM required is about 72 kBit, depending on the input image and the necessary object features. Today's low-cost FPGAs deliver between 64k and 280k block RAM bits. Therefore, the algorithm can easily be implemented without any constraints.

VI CONCLUSIONS

This paper has presented the development of a real-time blob analysis algorithm for an FPGA implementation. The incoming frames from a digital video camera are transferred to the FPGA via the camera link interface. After performing the blob analysis in the FPGA, the determined object features are passed to a host PC. By use of a run length encoding the data rates of the parallel incoming pixels can be maintained.

The memory inside an FPGA is not sufficient to hold a whole frame. Therefore, the image is processed in a raster scan procedure where only two antecedent rows are stored at the same time. By comparing these two rows, the objects and their properties are determined on the fly. This method minimises the amount of memory necessary in the FPGA. Object properties are passed to the host PC as soon as they are completed while scanning the picture, overcoming a possible communications bottleneck. The problem of merging objects is solved by combining their properties and setting pointers to objects which are out-of-date. Hence, a second pass through the image is not necessary and the detection of all objects is performed in only a single pass.

The consistent use of dual-port block RAM in the FPGA, together with the performance-focused design of the algorithm, allows the real-time analysis. Furthermore, the algorithm fits into a low-cost FPGA with a block RAM size between 64k and 280k bits. The implementation facilitates a maximum data rate of up to 1600 million pixels per second, depending on the number of objects and the picture noise.

VII ACKNOWLEDGEMENT

Thanks to Silicon SOFTWARE GmbH [3], which supports the research as well as providing the frame grabber hardware and software environment.

VIII REFERENCES

[1] E. Davies, Machine Vision, Academic Press, third edition, 2005
[2] M.K. Hu, "Visual Pattern Recognition by Moment Invariants," IEEE Transactions on Information Theory, Volume 8, February 1962
[3] Silicon SOFTWARE GmbH, www.silicon-software.com
[4] XILINX Spartan II/III, www.xilinx.com, (April 2007)
[5] W. Burger, M.J. Burge, Digitale Bildverarbeitung, Springer Verlag, March 2005
[6] R.V. Rachakonda, P.M. Athanas and A.L. Abbott, "High-Speed Region Detection and Labeling using an FPGA-based Custom Computing Platform," 5th International Workshop on Field Programmable Logic and Applications, Oxford, UK, Sep 1995
[7] Camera Link by the Automated Imaging Association, http://www.machinevisiononline.org/public/articles/index.cfm?cat=129, (April 2007)
[8] T. Trenschel, "Blob Analyse - Bestimmung von Formparametern beliebig geformter Objekte auf FPGAs in Echtzeit," Diploma Thesis, Ruprecht-Karls-Universität, Heidelberg, 2000
[9] ImageJ, http://rsb.info.nih.gov/ij, (April 2007)
