Vous êtes sur la page 1sur 56

Obstacle detection using stereo vision for self-driving cars

Perceiving the surroundings accurately and quickly is one of the most essential and
challenging tasks for autonomous systems such as self-driving cars. The most common sensing
systems such as RADAR and LIDAR that are used for perception on self-driving cars today give
a full 360 view to the car making it more informed about the environment than a human driver.
This article presents a methodology to employ two 360 cameras to perceive obstacles all around
the autonomous vehicle using stereo vision. Using vertical, rather than horizontal, camera
displacement allows the computation of depth information in all viewing directions, except
zenith and nadir which have the least useful information about the obstacles. The Key idea for
obstacle detection is to classify points in the 3D space based on height, width and traversable
slope relative to the neighboring points. The detected obstacle points can be mapped onto
convenient projection planes for motion planning.

Computer vision is one of the toughest problems in Artificial Intelligence (AI) that has been
challenging researchers and engineers for decades. This problem of extracting useful information
from images and videos finds application in a variety of fields such as robotics, remote sensing,
virtual reality, industrial automation, etc. The concept of making cars drive by themselves has
gained immense popularity today in the AI world, mainly motivated by the number of accidents
that occur due to driver errors/ negligence. Autonomous vehicles are designed to sense their
surroundings with techniques such as RADAR, LIDAR, GPS and computer vision. This array of
sensors working coherently to observe and record the surroundings constitute the perception
module of the vehicle. The next stage in the pipeline is the localization step which stitches
together the incomplete and disconnected information obtained from the sensors to identify the
position, velocity and other states of the vehicle and the obstacles (including dynamic obstacles).
The final module is the planning stage where the vehicle has to decide what it has to do given the
situation it is in. The present day research prototypes built by major players in the industry/
academia have LIDAR and RADAR as their primary perception systems. They generally provide
a very accurate full 360 view to the vehicle making it more informed about the environment than
a normal human driver. The downside to these systems is the cost involved in deploying them.
So an option is to use cameras and computer vision techniques to substitute these systems.

It has been shown in several cases that stereoscopic vision can be applied to extract useful
information about the surroundings that could assist the navigation of mobile robots [1], [2], [3].
But in most cases the field of view is limited to just the front of the robot. It is very critical that
the vision system we are targeting doesnt compromise on obtaining the 360 view provided by
LIDAR / RADAR systems. This can be achieved by employing special cameras that capture 360
by 180 image of a scene.


It is very essential for an autonomous vehicle to accurately and reliably perceive and
discriminate obstacles in the environment. To this end, many approaches have been presented for
different application areas and scenarios in past years using stereo vision or 2D/3D sensor
technologies. Each obstacle detection system is focused on a specific tessellation or clustering
strategy, hence they have been categorized into 4 main models [4]: (i) probabilistic occupancy
map, (ii) digital elevation map, (iii) scene flow segmentation and (iv) geometry-based clusters.
In probabilistic occupancy maps, the world is represented as a rigid grid of cells containing a
random variable whose outcome can be free, occupied, or undefined (not mapped) [5], [6]. The
goal here is to compute the associated joint distribution depending on a set of measurements
carried out on a certain discrete set of time moments. Digital elevation maps (DEM) is one of the
algorithms that try to detect obstacles relying on the fact that they protrude up from a dominant
ground surface. The obstacle detection algorithm proposed in [7] marks DEM cells as road or
obstacles, using the density of 3D points as a criterion. It also involves fitting a surface model to
the road surface. The scene flow segmentation or otherwise called as optical flow utilizes
temporal correlation between different frames of a scene captured by stereo cameras to classify
obstacles that are in motion [8], [9], [10]. This method thus naturally handles tracking dynamic
obstacles. Finally the geometry-based clustering involves classification based on geometric
structure of point clouds in the 3D space. The obstacle detection algorithm that will best suit this
category is [11] which is based on a search method that clusters points using a double cone
model. This algorithm became the basis for the obstacle detection module that went on the

The approach we present in this report is inspired by the method in [11]. Given our objective
of obtaining a 360 view, we present a few modifications to this algorithm that will use 360 stereo
pair to obtain obstacle information. The remainder of this report is structured as follows. Section
3 talks about the equipment setup that would best suit for this application. Section 4 describes the
disparity and depth estimation technique used. Section 5 talks about the pixel-level
implementation details of the obstacle detection algorithm. Section 6 discusses the experimental
results on a few test scenes. Section 7 outlines possible next steps from this project.


Reliably and accurately detecting obstacles is one of the core problems that need to be solved
to enable autonomous navigation for robots and vehicles. For many use cases such as micro
aerial vehicles (MAVs) or self-driving cars, obstacle detection approaches need to run in (near)
real-time so that evasive actions can be performed. At the same time, solutions to the obstacle
detection problem are often restricted by the type of vehicle and the available resources. For
example, a MAV has restricted computational capabilities and can carry only a certain payload
while car manufacturers are interested in using sensors already built into series vehicles in order
to keep self-driving cars affordable.

There essentially exist two approaches for obstacle de-tection. Active methods use sensors
such as laser scanners, time-of-flight, structured light or ultrasound to search for obstacles. In
contrast, passive methods try to detect obstacles based on passive measurements of the scene,
e.g., in camera images. They have the advantage that they work over a wide range of weather and
lighting conditions, offer a high resolution, and that cameras are cheap. At the same time, a wide
field of view can be covered using for example fisheye cameras. In this paper, we therefore
present an obstacle detection system for static objects based on camera images. We use stereo
vision techniques [1], [2], [3] to obtain a 3D model of the scene. Our main motivation is to
enable self-driving cars to detect static objects such as parked cars and signposts, determine the
amount of free space around them, and measure the distance between obstacles, e.g., to
determine the size of an empty parking spot. We therefore detect obstacles as objects obtruding
from the ground [4]. Most existing stereo vision-based techniques rely on classical forward
facing binocular stereo cameras with a relatively narrow field of view and visual (inertial)
odometry (VIO) systems to provide accurate vehicle poses. These systems are mainly targeted
for detecting objects which are in front of the car and are therefore used in standard on road
forward driving situations. For many maneuvers such as parking into a parking spot or
navigating in a narrow parking garage, a full surround view is very important. We show that for
such situations accurate obstacle detections can be obtained from a system that uses only
monocular fisheye cameras and the less accurate poses provided from the wheel odometry of the
car, if the noisy individual detections are properly fused over time. The resulting system does not
require complex VIO systems, but simply exploits information already available while running in
real-time on our test vehicle, we thus avoid any unnecessary delay a VIO system might

This paper makes the following contributions: We describe the overall obstacle detection
system and explain each part in detail, highlighting the rationale behind our design decisions. We
demonstrate experimentally that highly precise vehicle poses are not required for accurate
obstacle detection and show that a proper fusion of individual measurements can compensate for
pose errors. Self-driving cars are currently a very active field of research and we believe that the
proposed system and our results will be of interest to a significant part of researchers working in
this field. To our knowledge, ours is the first system that uses monocular fisheye cameras and
only relies on the wheel odometry.

The paper is structured as follows: The remainder of this section discusses related work. Sec. II
provides an overview over both our vehicle setup and our obstacle detection system. Sec. III
explains the computation of the depth maps. Obstacle extraction is described in Sec. IV, while
Sec. V details how to fuse detections from multiple depth maps. Sec. VI experimentally
evaluates the proposed method.

Fig. 1. Overview over the obstacle detection system proposed in this paper.

height map [4], [9], [13], or a volumetric data structure [14] into which the depth measurements
are fused. Occupancy grids are probably the most popular scene representation as they not only
provide information about the positions of the occupied space but also about free space [15].

In the case of ground-bound vehicles, e.g., cars moving on planar surfaces, obstacles
correspond to objects obtrud-ing from the ground: [9] compute a height profile from stereo
measurements and subsequently estimate for each 2D occupancy grid cell whether it corresponds
to free space or occupied space. While [9] use a probabilistic model, [16] employ dynamic
programming on the grid cells to determine free space. Given the free space, stixels can be used
to compactly represent obstacles as boxes standing on the ground [4]. Overhanging structures
and archways can be handled by allowing multiple height map layers [13].

In this paper, we follow a similar approach and model ob-stacles as objects extruding from the
ground. While existing work requires high-precision sensors [16] or visual odometry methods to
obtain precise vehicle poses before fusing the depth data [9], [17], we show that using the less
accurate wheel odometry readily available in every car is sufficient.
The disadvantage of most stereo cameras is their limited field of view (FOV). Thus, car
manufacturers are beginning to additionally integrate wide FOV [7] or fisheye cameras [18] to
cover the entire scene surrounding the vehicle. The extremely wide FOV of fisheye cameras
enables them to also observe objects close to the vehicle, i.e., they are well-suited for obstacle
detections. In this paper, we thus use such cameras and show experimentally that we are able to
accurately detect obstacles through motion stereo. [19] use motion stereo, based on very accurate
poses from GPS/INS measurements, to generate large-scale urban reconstructions by estimating
the 3D geometry of the scene. In contrast, our approach determines which parts of the
environment are occupied by obstacles and which are free space.

As illustrated in Fig. 1, our obstacle detection framework consists of three main stages. First,
we extract a depth map for each camera mounted on the car using multi-view stereo matching on
a sequence of camera frames recorded by the moving car. The camera poses required for this step
are directly obtained from the wheel odometry and the extrinsic calibration of the cameras. No
visual odometry system is employed to refine the poses. The depth maps provide a 3D
reconstruction of the surroundings of the car but do not offer any information about which
structures are obstacles that need to be avoided and which parts correspond to free space that the
car can move through. In the second stage, we thus detect and extract both obstacles and free
space for each individual depth map. Since obstacles are objects protruding from the ground,
obstacle detection is performed in 2D.

One of the shortcomings of using wheel odometry is that the rotation of the wheels is only
discretely sampled, which leads to a slight oscillation around the true traveled distance. This is
not a problem in terms of determining the position of the car but leads to uncertainty for the
estimated depth maps due to a slightly inaccurate baseline in the stereo matching. Therefore, the
third stage fuses the obstacle detections over several camera frames to obtain a more accurate
estimation of the occupied space around the car. Depending on the use case, the fusion can be
done independently per camera or as a single fusion that combines data from multiple cameras.

We integrated our obstacle detection system into two VW Golf VI cars that are equipped with a
system of four fisheye cameras mounted on the car with minimally overlapping FOVs. The
cameras are installed in the side mirrors as well as in the front and back of the car and each have
a 185 FOV, jointly covering the whole field of view around the car. They record at a frame rate
of 12.5Hz and are triggered syn-chronously. Besides the cameras, we also utilize the wheel
odometry of the car, which provides a 3 degrees-of-freedom pose of the car on the ground plane
by measuring the rotation of the wheels. The cameras are calibrated with respect to the odometry
frame using a simultaneous localization and mapping-based calibration pipeline [20]. The cars
are the test platforms of the V-Charge project [18], whose aims are fully automated navigation
and parking. As such they are equipped with other environment perception sensors which are not
used for our system. However, since the output of our method is simply a set of occupied and
free space estimations, our results can easily be fused into a combined obstacle map together
with similar data from other sensors. All used sensors are either already built into todays cars or
close-to-market. For the processing of the sensor data and to perform the navigational tasks
required for automated driving, the cars have a cluster of 6 PCs installed in the trunk.

Fig. 2. Results of our depth map computation procedure. (Left) image of the reference view, the
depth maps computed (middle) without ground direction plane-sweep and (right) the complete
depth map computation procedure including the ground directions. The depth maps are coloured
with respect to height above ground instead of depth (color pattern repeating every 0:5 meters),
to illustrate the quality of the ground. Without the ground direction plane-sweep, the ground is
reconstructed rather bumpy (cf. middle). In contrast, our complete procedure obtains a much
smoother reconstruction. motion blur due to the close proximity to the car. To reduce noise on
the areas where the ground plane is visible in the images, we introduce a two stage approach
which first checks if the ground plane matches sufficiently well in the images and only
reconstructs other structures if a matching patch is unlikely to belong to the ground plane.

Standard binocular stereo matching takes two stereo rec-tified images as an input and
computes a depth map for one of the two images. In our case, images are recorded with a
continuous frame rate. This means we have the option to match more than one image to a single
reference image to increase the quality of the computed depth maps by increasing the baseline.
For general camera configurations, it is not possible to stereo rectify a set of more than two
images. To avoid this problem, plane-sweeping stereo [1] is usually used since it does not require
any rectification. The main idea of plane-sweeping is to match a set of images to a reference
image by projecting them onto a plane hypothesis and then back to the reference image. The
images warped through this procedure and the reference image will then be compared using an
image dissimilarity measure, which is evaluated over a small matching window. If the tested
plane hypothesis is close to the true depth of a pixel in a reference image, the corresponding
dissimilarity value will be low. Testing many plane hypotheses and taking the depth induced by
the best matching plane for each pixel then produces a depth map for the reference image.
Originally, plane-sweeping has been proposed for pinhole images, where the images are warped
into the reference view through a planar homography. In order to cover a wider field of view
with each camera, and thus enable obstacle detection around the car, our setup uses fish eye
cameras. While this increases the complexity of the warping process slightly, depth maps can
still be computed in real-time on a graphics processing unit (GPU) [3].

The plane-sweeping algorithm locally approximates the reconstructed 3D structure as a plane.

If the normal direction of the plane hypothesis is not well aligned with the actual surface
direction, the warped image will be locally distorted with respect to the reference image, which
increases the dissimilarity score even for matching patches. This can be overcome by aligning
the sweeping plane directions with the predominant directions in the scene [2]. We therefore
sweep planes in two directions; Since we are interested in detecting obstacles on the ground, one
sweep direction uses planes parallel to the ground plane while the others are fronto-parallel to the
camera. The extrinsic camera calibration, which gives us the height of the cameras with respect
to the surface the car drives on, provides a good initial estimate of the ground plane. Hence, we
only need to test very few planes for the ground direction; we used 10 in our experiments. For
the fronto-parallel sweep, we space the planes inversely proportional to the depth, resulting in an
even sampling in disparity space. For each pixel in the reference image and each plane, we
obtain one dissimilarity score from each image that is matched against the reference image. To
obtain one single cost per plane for each pixel in the reference image, we simply average these
dissimilarity scores. Notice, this means we do no utilize any occlusion handling (c.f. [3]).
Occlusion handling could slightly improve the quality of the depth maps around occlusion
boundaries but this small gain comes with a set of disadvantages. Firstly, occlusion handling
slows down the depth map computation process. Secondly, doing occlusion handling the way it
was initially proposed in [21] would mean that we would compute the depth map for at least one
frame before the newest one available. This would lead to an unnecessary delay of the data. As
we could not observe any significant gains with it and since it comes with the aforementioned
problems, we refrained from using occlusion handling in our experiments. The depth map is
finally extracted as the depth of the plane with minimal cost, utilizing a winner takes all strategy.

The last step of our depth map computation procedure is to combine the potentially incomplete
depth map computed for each sweeping direction. For this, we need to define a bit more formally
the output of the plane-sweeping algorithm. For the sweeps in the ground and fronto-parallel
direction, we obtain a depth map Zg(x; y) and Zf (x; y), respectively. The computation of the
depth maps provides for each pixel the uniqueness of this match. Let C ;2(x; y) be the cost of the
second-best matching plane for pixel (x; y). The uniqueness ratio is then defined as the ratio C
(x; y)=C ;2(x; y). Both of these additional values can be used to filter the depth maps: A low
matching cost implies that the image patches match well. A large uniqueness ratio indicates
ambiguities, e.g., uniformly textured regions or repeating patterns will result in a ratio close to 1.
We use two sets of thresholds for the two sweeping direction T Cg , TUg and TCf , TUf , respectively.

dissimilarity cost and the uniqueness ratio for the ground-parallel and fronto-parallel sweeps,
respectively. The two depth maps are merged as

uses matches on the ground plane. In our experiments, we found this behavior to be desirable as
it leads to a smoother reconstruction of the ground which is important to avoid false positive
obstacle detections.
For our experiments, we use zero mean normalized cross correlation (ZNCC) scores over a 9 9
pixel window for measuring the image similarity. The matching cost C (x; y) are computed as the
negative ZNCC score normalized to the interval [0; 1], i.e., a matching cost of 1 corresponds to a
ZNCC score of 1 and a matching cost of 0 corresponds to a ZNCC score of 1. There are 10
planes used for sweeps parallel to the ground and 50 planes for the fronto-parallel direcion. The
filtering thresholds are set to TUg := 0:9925, TUf := 0:98, TCg := 0:18 and TCf := 0:17. The depth
maps are computed on a resolution of 640 400 pixels and we match two images against the
reference image, meaning 3 images contribute to every depth map. Fig 2 shows the results of our
depth maps computation procedure. The benefit of the additional ground direction sweeping is
mostly visible in the form of a smoother ground plane.

The output of the previous section is a depth map for each camera. In this section, we describe
how to extract obstacles from a single depth map. Following [4], [9], we make the simplifying
assumption that the car is moving on a plane, i.e., every object not lying on the plane is a
potential obstacle, and perform obstacle detection in 2D.

Obstacle detection consists of three steps. Given a depth map, we first create a 2D occupancy
grid that encodes which cells contain depth measurements. From this grid, we then extract
obstacles from the occupied cells and refine their positions to avoid discretization artifacts.
Finally, we estimate the uncertainty of each detection, which will be used in Sec. V to fuse
detections from multiple depth maps.

Occupancy Grid Generation

We define the local coordinate frame of the car to coincide with the vehicle odometry frame. It
has its origin on the ground plane at the point where the middle of the rear axle orthogonally
projects to the ground plane. The x-axis is pointing towards the front of the car, the y-axis to the
left and the z-axis is pointing vertically away from the ground (cf. Fig. 3). Due to this choice of
coordinate system, we can directly infer the position of the ground for each camera from the
known extrinsic calibration of the cars camera system.

For each camera, we define a separate occupancy grid that covers the relevant part of the area
observed by the camera.
Fig. 3. Illustration of the coordinate systems used in the paper. The x-,
y-, and z-axes are colored in red, green, and blue, respectively

The grid is placed in the ground plane, i.e., coincides with the x-y plane of the vehicle frame. The
cameras are mounted such that their x-axes are parallel to the ground. Thus, the x-and y-axes of
the occupancy grid are chosen to be parallel to the projections of the x- and z-axes of the camera
onto the ground plane. As illustrated in Fig. 3, the origin of the occupancy grid is the projection
of the camera center codo, expressed in the vehicle frame, to the x-y plane. We express
coordinates in the occupancy map by an (angle, disparity) pair (d; ), where disparity d = 1=y grid
corresponds to the inverse depth of the measurement.

We use disparities instead of the original y-coordinates since the depth maps are already
computed based on dispari-ties by spacing the planes inversely-proportional to the depth. As a
consequence, the occupancy grid has a higher resolution very close to the camera, which leads to
an unnecessary high grid resolution and thus memory and time consumption. In order to
guarantee a minimal cell size in depth direction we virtually shift the grid by y shift. The definition
of the final occupancy grid coordinates (dgrid; ) can now be stated in terms of the Cartesian

coordinates in the occupancy grid frame as:

The occupancy grid coordinates (dgrid; grid) are only de-fined for Cartesian coordinates with ygrid >
0. Notice that the resulting grid has the shape of an isosceles trapezoid, similar to the polar grid
in [16].

Let Rcam 2 R3 3
and odo
tcam 2 R3 1
be the rotation and translation that transform a depth
measurement xcam from the local camera coordinate system into the vehicle frame, i.e., x odo =
odo odo
Rcamxcam + tcam. As shown in Fig. 3, the corresponding grid cell is obtained by projecting x odo
codo onto the x-y plane of the grid frame, computing the angle and disparity, and performing the
shift by yshift.

For each grid cell (d; ), we store the number F (d; ) of points from the depth map that vote for
free space in this specific cell, the number O(d; ) of points that vote for occupied space, and
additionally the average disparity dgrid of the points voting for occupied space within a given cell.
If the vehicle frame position xodo of a depth measurement is close to (or below in case of errors in
the depth map) the ground, it votes for free space in its corresponding cell. If x odo lies above the
ground plane up to a maximum height, it votes for occupied space in its cell and the
corresponding average disparity value is updated. The maximum height
Fig. 4. Results of our obstacle detection approach: (Top row) Two input images. (Bottom row)
Obstacles detected in three different frames, projected to the ground plane. The red dots denote
the obstacle position and the cyan lines indicate the uncertainty area along the ray. The last image
contains all obstacles extracted in an interval of 1 seconds around the corresponding frame and
shows that simply adding individual detections produces significant noise.

each grid cell, we aim to determine whether it corresponds to free space or is occupied, i.e.,
contains an obstacle. In robotics, this is often achieved in terms of an occupancy grid using an
inverse sensor model [10]. We follow a different approach which originates from the fusion of
laser point clouds [24] and has since then been used successfully in many computer vision
systems [25], [26], [27]. Its main idea is to follow each ray originating from a camera center until
it hits the first obstacle. For all grid cells traversed by the ray, a negative weight is added to
denote free space. At the measurement, the ray enters the object and hence the space behind the
measurement should be occupied, which is denoted by adding a positive weight. Since we do not
know the thickness of each object, we only enter positive weights for a small region behind the
obstacle. After adding the measurements for all cameras that should be considered, a cell having
a negative accumulated weight is then considered as free space and a cell with a positive weight
as occupied space. For cells that have a weight of exactly zero, the space is considered as

For real-time processing, we need to limit the size of the grid. We achieve this by only
updating a 10 10 meter area which is placed such that it covers the area observed in the current
camera frame. Which part of the fusion grid needs to be updated is determined based on the
wheel odometry reading. For each cell inside the viewing area of the current camera frame, we
determine the obstacle which lies on the ray from the camera center to the grid cell. The weight
added to the current grid cell is defined by comparing the distance l obst of the obstacle to the
camera with the distance lcell between the cell and the camera. As explained in Sec. IV-B, there
are two types of obstacles, conventional obstacles and the free space obstacles placed at the
end of the definitively visible space along a ray. Consequently, two different types of weights are
used. The free space weight wf is used for both types while the obstacle weight wo is only
employed for the standard obstacles. For both types of obstacles, an

Fig. 6. Fusion results for an indoor sequence. Reflections on the ground cause erroneous
detections while backlight complicates precise depth esti-mation. Still, our system is able to
outline the positions of the obstacles.

uncertainty interval (lobst u1; lobst + u2) along the ray is given and we distribute the weight of the
obstacle along the interval. Thus, obstacles that are measured with low accuracy will only
contribute little weight to each grid cell.

The free space weight for the cells along the ray is
where the constant k 0 is added only to cells in front of the obstacle. The obstacle weight w o
should only be non-zero in the uncertainty regions and is thus defined as

Additionally, we impose a minimal uncertainty area to make sure that the weight is spread at
least into a few cells in the grid. We also discard measurements which have a very large
uncertainty interval of more than 4 meters.

All grid cells are initialized with weight zero and are updated whenever new obstacle
detections are available by adding the respective weight to the grid cells. The update is
constantly performed in a separate compute thread con-current to depth map generation and
obstacle detection. The area of the grid that is updated is set to 400 400 cells, resulting in a 2:5cm
resolution for the 10m 10m area. The constant for free space is set to k = 4. Figures 5 and 6
depict results of the fusion approach, where free space is colored green, occupied space is red,
and unobserved space is black.

Car accidents are the key problem of todays traffic world. Particularly this strongly felt in
urban areas where cars density is largest. According to Lithuanian 2011 year car accidents
statistic most common type of car accidents are collisions. It represents 44% of total car
accidents. Accidents happened with pedestrians and obstacles also could be assigning to similar
car accidents type. By summing of everything, we are getting 82% of total accidents [1]. Most of
car accidents happened due to human factor: somnolence, inattention, absence of mind and slow
reaction. All these factors are critical for a driving safety. Using stereo vision system is a way to
reduce number of car accidence. Stereo vision based traffic observation systems are in active
development period. Scientist and many famous car manufacturers are developing durable,
accurate and real time car driver assistance systems. Many of works done by researching and
developing vision system for lane tracking [2], pedestrian or bicyclists detection [3], car
detection [4] or driver monitoring [5].
Fig. 1. Typical structure of stereo vision [6].

The main purpose of using stereo vision system is capability to transform 2D view into 3D
information usually called depth map. Typical visual system consists of two digital video
cameras which are displaced horizontally from one another to obtain different views of a scene,
in a manner similar are used to human binocular vision (Fig. 1). 3D information derived by
finding the corresponding points across multiple cameras images. Correspondences lie to the
same epipolar line. To simplify the matching process, camera images are often rectified. The
result is epipolar line corresponds on image scan line. Before image rectification system has to
meet these requirements: cameras image planes are parallel, focal points height and lengths are
the same. To get a depth map a disparity value must be calculated for every image pixel. A
disparity value for a target point p calculated using following expression

High level is the last processing stage where depth and image data are used for object extraction
and ecognition. Further object list is used for front vehicle
Fig. 2. Stereo vision system architecture consist of three data processing levels.

Suggested approach based on dense edge map calculation and ROI filter. Fig. 3 illustrates a
block diagram of vision system data processing. Totally eight steps sequence to extract and
mark car from 2D image set. Car extraction process starts from capturing of view images.
Next, captured 2D images processed in the middle data processing level. Here 2D data
transformed into 3D information, well known as depth map. ROI filter eliminates redundant
3D information. The rest of data is processed by dense edge map estimation algorithm.

Fig. 3. Vision systems data processing block diagram.

Region of Interest filter is used to eliminate redundant 3D information. ROI can be represented
as a frustum in a 3D environment (Fig. 4). According to fact that system designed to observe
front vehicles a frustum is projected on a roadway and covers one or more drive lines. All data
outside this space are eliminated leaving only significant information. The main benefit of
applying ROI filter is saved extra computation resources for further object extraction processes.
Fig. 4. Illustration of region of interest filtering.

Edge dense map estimation basically is a depth maps transformation from the front view to the
top view perspective. A value of every new map element calculated using following expression:


where A is 3D data array; B is dense edge data array, i{0,1..h-1}, h is image height,
j{0,1..w-1}, w is image width, c is a factor calculated

This transformation is modified following the idea to analyze not whole object but only their
edges. The result of this transformation is an edges map in a perspective of top view. Fig. 5
depicts the transformation principle in a 3D environment. The most density of points is located
on objects edges. Edges are highlighted by summing points in a vertical direction. The more
objects edge is higher and sharper the more points will be counted in a vertical direction. Flat
and low altitude objects have a small amount of points; therefore, objects like a road can be
easily eliminated using simple threshold.

Fig. 5. Illustration of 3D object dense edge estimation principle

A virtual 3D computer reality was chosen as very first vision system and approached car
extraction algorithm test environment. Specially designed software allows simulating stereo
vision system behavior in a virtual 3D reality. Software based on OpenGL and OpenCV
programming libraries and is written in a C++ language. Software runs on Core 2 Duo 2.53GHz
CPU and 4 GB memory of RAM. Similar kind of test method has been used by other authors [9]
and [10] to test stereo vision systems for relative applications. In order that tested situations
would be more realistic, a segment of street with all following infrastructure: cars, road signs,
buildings, pedestrians were designed and rendered in a virtual 3D reality. In this test stage a
depth map was generated using OpenGL graphic API methods. This allows getting the best depth
map quality. Depth map can be estimated and using any kind of stereo matching algorithm.
Simulation software allows imitating stereo camera view and capture left and right camera
images. Estimated depth map can be represented as a point cloud converting 3D data to real
word coordinates. Different traffic situations, vision systems configurations can be emulated and
repeated many times. This is the main advantage of described simulation software. bottom. The
opposite situation presented in the situation no. 2. Detected car marked with a red colored
rectangle. Red color indicates that car is in the unsafe distance and driver must pay attention.
Close objects represent as big blobs located near to dense edge map bottom. The rest two
situations (no. 3, 4) shows system ability detect not only forward cars but and different kind of
obstacles, for instance pedestrians crossing the street. Another aspect of vision system is ability
to detect multi numbers of obstacles.
Fig. 6. Four traffic situation examples simulated using virtual 3D reality. Left image camera
view, right image edge dense map.
A pair of cameras are required to implement stereo vision. Since the application of this work is
to replace existing LIDAR-based perception systems, the system developed in this work needs to
detect obstacles all around the vehicle. Two Ricoh Theta 1 cameras were used to capture 360 by
180 spherical panoramas. The Ricoh Theta camera comprises of two opposite-facing 185 fish-
eye lens and sensors. The two images are stitched by the Ricoh application to form a 360 by 180
spherical image.

The two cameras were displaced vertically as this would result in loss of information directly
above and below the cameras, which are areas, not of interest to us. For the purposes of this
work, the camera was mounted on a tripod at one position to capture the first image. A spirit
level was used to ensure that the camera was horizontal. The camera was then vertically
displaced through a known height on the tripod to the second position to capture the second

\ Fig. 1: Vertically displaced Ricoh Theta cameras mounted on a vehicle


. Image pair capture

The spherical images captured by the ricoh cameras are in equirectangular format i.e. they are
uniformly smapled in azimuth and altitude angles. The azimuth variation is from

to along the x-axis of the image and the altitude variation is from =2 to =2 along the y-axis of
the image. The images are of resolution 2048 1024 pixels.
The disparity for the spherical image pair is the change in the altitude angle between the two
images, since the cameras are displaced vertically. There are several techniques to estimate
disparity values such as those outlined in [14], [15], [16]. The procedure adopted in this work
involves estimating the optical flow of each point. The optical flow estimation is done using the
optical flow software2 developed by [17]. As the cameras are vertically displaced, the optical
flow is in the vertical direction. The optical flow at every pixel gives us the disparity value in
number of pixels, at that pixel.

To speed up the disparity estimation step, the original image is down-sampled to 1022 512
pixels. To further speed up the computation, the top and bottom portions of

the image which dont contain important information for the vehicle are chopped out. An altitude
range of 30_ to 30_ is considered for the process. The resolution of the image so obtained is
1023 _ 171 pixels. The image is further down-sampled to 767 _ 129 pixels, on which the rest of
the algorithm is run. The black and white image of the obstacle map obtained as the output of the
obstacle detection algorithm is then up-sampled to 1023 _ 171 pixels.

Fig. 3: Chopped and down sampled original lower image. The

image is of resolution 767 by 129 pixels with an altitude range of 30 to 30 .

Fig. 4: Disparity map generated from optical flow estimation. Brighter pixels correspond to
greater disparity and so smaller depth. We observe uniform gradient of disparity along the
ground plane, in the bottom half of the image.

Depth estimation The next step is to determine the depth map. This is achieved through some
basic geometry and trigonometric transformations applied to the disparity map.

where l is the baseline distance, __ is the disparity value, _1 is the altitude angle of the point with
respect to the upper camera center and PA is the distance from the point to the upper camera
center. The depth is then calculated as

Using the above formulae, the depth value for every pixel can be calculated from the
corresponding disparity value.

Fig. 5: Depth map generated from the disparity map. Darker pixels correspond to smaller depth.

Geometry-based clustering technique has been proposed to detect obstacles in the scene.
Before we dive into the obstacle detection algorithm, we need to come up with a definition for an
obstacle. It is important to understand and visualize obstacles in the 3D space. Let us consider
the most general case of obstacles found above a ground plane and focus our analysis on this
case. We later discuss some of the other possibilities of obstacles or spaces in a scene that need to
be avoided by a vehicle. Thus, obstacles are points or areas in the scene which are at a height
from the dominant ground plane.

Mathematically, we will define obstacles in terms of two distinct points in space in the following

Two points P1 and P2 belong to the same obstacle and are said to be compatible if

1) HT < jP2Z P1Z j < Hmax. The difference between the elevations of the two points is within a
range defined by HT and Hmax.

2) ((P2 P1) (P3 P1)=jjP2 P1jjjjP3 P1jj) > cos T . The point P3 is obtained by displacing P1 through
Hmax in the z direction. This condition enforces that the angle between P 2 and P1 with respect
to the z direction or elevation direction is less than a threshold value.
In the above definition, HT is the minimum height of an object for it to be considered an
obstacle. Hmax is the maximum height of on obstacle. The value of T can be set based on the
accepted traversable slope to classify obstacles appropriately.

The definition is illustrated in Figure 6. We construct an upward cone from P1 in the 3D space
based on the values chosen for the three parameters HT , Hmax and _T . If any of the points in
the scene lie in the frustum as shown, those points are classified as obstacle points.
Fig. 6: Cone formed in the 3D space with P1 at the center. The point P2, if lies within the frustum
formed by HT , Hmax and T , is classified as an obstacle point. (Source: [11])

From the depth map generated before, we have information about the depth of all the pixels.
All the points or pixels are now 3D points in Azimuth-Altitude-Depth space. The points are
transformed from the Azimuth-Altitude-Depth space to the X-Y-Z space.

A naive algorithm would involve examining all point pairs which would have a complexity of
O(N2). The algorithm proposed is more efficient and compares a point with a limited set of
points. These set of points are the points contained within the trapezium formed by the projection
of the frustum or cone formed above the point (referred to as base point from here on) in the 3D
space, onto the image plane. The projected trapezium is scaled according to depth of the base
point. The parameters of the trapezium are
where H is the height of the image and depth is the depth of the base point. The first term is a
scaling factor to transform the height to the altitude angle and thus, the number of pixels.

Here, hT is height of the closer parallel side of the trapezium from the base point and h max is
height of the farther parallel side from the base point. The upper left angle of the trapezium is
same as T , the threshold angle chosen for the compatibility definition earlier.

We loop through all the points in the trapezium and form a point pair for each point with the base
point. If the pair of points satisy the definition of obstacles i.e. are compatible, then we classify
that point in the trapezium as an obstacle point. This algorithm has a better time complexity than
the Naive alogrithm. Let K denote the average number of points in the trapezium. The
complexity is O(KN).

Classify all points as non-obstacles.

Scan through all the pixels, P , in the image.

Determine the set of pixels, TP , in the projected trapezium of P on the 2D image plane.
Examine all points in TP and determine set OP of points Pi in TP , compatible with P .
If OP is not empty, classify all points of OP as obstacle points.
Fig. 7: Black and white image of the Obstacle map. The white pixels correspond to obstacle


Due to the lack of texture on the ground, we get fewer feature points in those regions and so
the estimated disparity values are inaccurate. The disparity gradient is discontinuous in those
regions on the ground, resulting in noise in the output. Median filtering and morphological
closing operation are applied to the output of the obstacle detection algorithm to close the small

Fig. 8: Black and white image of the Obstacle map after post-processing. The small holes in
Figure 7 are filled.

Fig. 9: Original lower image with obstacles overlaid on it. The pixels in yellow correspond to

Polar Map

The primary application of this work is to detect obstacles so as to plan the motion of the
vehicle. It is important to represent the obstacle points in a manner that might be useful for
motion planning. A polar map representation of the obstacle points is chosen. The polar map is
generated by projecting the 3D obstacle points onto the ground plane. In figure 10, the vehicle is
at the center of the circle facing 0 . The obstacle points are depicted in blue with the radius of the
circle indicating the depth from the vehicle. The small patch at 0 , close to the center corresponds
to the white sign board in the scene in figure 3 and the patch at 30 corresponds to the black
platform. The buildings and trees can be seen further away in the map. From the polar map, we
can plan a path for the vehicle, not obstructed by obstacles.

Fig. 10: The agent is at the center of the map, facing 0 . The blue points correspond to polar
positions of the obstacle points around the agent.

Figure 11 depcits a scene which has a good mix of obstacles at short range and long range. The
algorithm detects all the obstacles in scene. The process was tested on various outdoor settings
and gave positive results in all the cases. In figure 13, we can observe that the polar map
representation accurately captures the cars parked next to each other at angles 315 to 35 . Also, in
figure 13, we notice small erroneous detection at 90 on the ground. This is due to the
lack of texture on the ground, which results in inaccurate disparity estimation. Thus, the obstacle
detection algorithm classifies it as an obstacle point. The accuracy of the obstacle detection is
limited by the accuracy of disparity estimation techniques. The programs were developed in
MATLAB3. The average runtime of disparity estimation is 125s and that of obstacle detection is

It was mentioned earlier that obstacles above the ground were being considered for this work.
But the process developed

also works for potholes in the ground. When a point inside the hole is picked, the cone/frustum in
the 3D space will contain points on the ground around the pothole. These points will be classified
as obstacles which implies that these points will be avoided in the path planning process.
Potholes are areas that need to be avoided and the algorithm does exactly that.



MATLAB is a high-performance language for technical computing. It integrates
computation, visualization, and programming in an easy-to-use environment where problems and
solutions are expressed in familiar mathematical notation. Typical uses include

Math and computation

Algorithm development

Data acquisition

Modeling, simulation, and prototyping

Data analysis, exploration, and visualization

Scientific and engineering graphics

Application development, including graphical user interface building.

MATLAB is an interactive system whose basic data element is an array that does not
require dimensioning. This allows you to solve many technical computing problems, especially
those with matrix and vector formulations, in a fraction of the time it would take to write a
program in a scalar non interactive language such as C or FORTRAN.

The name MATLAB stands for matrix laboratory. MATLAB was originally written to
provide easy access to matrix software developed by the LINPACK and EISPACK projects.
Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the state of
the art in software for matrix computation.

MATLAB has evolved over a period of years with input from many users. In university
environments, it is the standard instructional tool for introductory and advanced courses in
mathematics, engineering, and science. In industry, MATLAB is the tool of choice for high-
productivity research, development, and analysis.
MATLAB features a family of add-on application-specific solutions called toolboxes. Very
important to most users of MATLAB, toolboxes allow you to learn and apply specialized
technology. Toolboxes are comprehensive collections of MATLAB functions (M-files) that
extend the MATLAB environment to solve particular classes of problems. Areas in which
toolboxes are available include signal processing, control systems, neural networks, fuzzy logic,
wavelets, simulation, and many others.

The MATLAB System:

The MATLAB system consists of five main parts:

Development Environment:

This is the set of tools and facilities that help you use MATLAB functions and files. Many
of these tools are graphical user interfaces. It includes the MATLAB desktop and Command
Window, a command history, an editor and debugger, and browsers for viewing help, the
workspace, files, and the search path.

The MATLAB Mathematical Function:

This is a vast collection of computational algorithms ranging from elementary functions

like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like matrix
inverse, matrix Eigen values, Bessel functions, and fast Fourier transforms.

The MATLAB Language:

This is a high-level matrix/array language with control flow statements, functions, data
structures, input/output, and object-oriented programming features. It allows both "programming
in the small" to rapidly create quick and dirty throw-away programs, and "programming in the
large" to create complete large and complex application programs.


MATLAB has extensive facilities for displaying vectors and matrices as graphs, as well as
annotating and printing these graphs. It includes high-level functions for two-dimensional and
three-dimensional data visualization, image processing, animation, and presentation graphics. It
also includes low-level functions that allow you to fully customize the appearance of graphics as
well as to build complete graphical user interfaces on your MATLAB applications.

The MATLAB Application Program Interface (API):

This is a library that allows you to write C and FORTRAN programs that interact with
MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking), calling
MATLAB as a computational engine, and for reading and writing MAT-files.



Mat lab Desktop is the main Mat lab application window. The desktop contains five sub
windows, the command window, the workspace browser, the current directory window, the
command history window, and one or more figure windows, which are shown only when the
user displays a graphic.

The command window is where the user types MATLAB commands and expressions at
the prompt (>>) and where the output of those commands is displayed. MATLAB defines the
workspace as the set of variables that the user creates in a work session. The workspace browser
shows these variables and some information about them. Double clicking on a variable in the
workspace browser launches the Array Editor, which can be used to obtain information and
income instances edit certain properties of the variable.
The current Directory tab above the workspace tab shows the contents of the current
directory, whose path is shown in the current directory window. For example, in the windows
operating system the path might be as follows: C:\MATLAB\Work, indicating that directory
work is a subdirectory of the main directory MATLAB; WHICH IS INSTALLED IN
DRIVE C. clicking on the arrow in the current directory window shows a list of recently used
paths. Clicking on the button to the right of the window allows the user to change the current

MATLAB uses a search path to find M-files and other MATLAB related files, which are
organize in directories in the computer file system. Any file run in MATLAB must reside in the
current directory or in a directory that is on search path. By default, the files supplied with
MATLAB and math works toolboxes are included in the search path. The easiest way to see
which directories are on the search path.

The easiest way to see which directories are soon the search paths, or to add or modify a
search path, is to select set path from the File menu the desktop, and then use the set path dialog
box. It is good practice to add any commonly used directories to the search path to avoid
repeatedly having the change the current directory.

The Command History Window contains a record of the commands a user has entered in
the command window, including both current and previous MATLAB sessions. Previously
entered MATLAB commands can be selected and re-executed from the command history
window by right clicking on a command or sequence of commands. This action launches a menu
from which to select various options in addition to executing the commands. This is useful to
select various options in addition to executing the commands. This is a useful feature when
experimenting with various commands in a work session.

Using the MATLAB Editor to create M-Files:

The MATLAB editor is both a text editor specialized for creating M-files and a graphical
MATLAB debugger. The editor can appear in a window by itself, or it can be a sub window in
the desktop. M-files are denoted by the extension .m, as in pixel up .m. The MATLAB editor
window has numerous pull-down menus for tasks such as saving, viewing, and debugging files.
Because it performs some simple checks and also uses color to differentiate between various
elements of code, this text editor is recommended as the tool of choice for writing and editing M-
functions. To open the editor , type edit at the prompt opens the M-file file name .m in an editor
window, ready for editing. As noted earlier, the file must be in the current directory, or in a
directory in the search path.

Getting Help:

The principal way to get help online is to use the MATLAB help browser, opened as a
separate window either by clicking on the question mark symbol (?) on the desktop toolbar, or by
typing help browser at the prompt in the command window. The help Browser is a web browser
integrated into the MATLAB desktop that displays a Hypertext Markup Language(HTML)
documents. The Help Browser consists of two panes, the help navigator pane, used to find
information, and the display pane, used to view the information. Self-explanatory tabs other than
navigator pane are used to perform a search.



Digital image processing is an area characterized by the need for extensive experimental
work to establish the viability of proposed solutions to a given problem. An important
characteristic underlying the design of image processing systems is the significant level of
testing & experimentation that normally is required before arriving at an acceptable solution.
This characteristic implies that the ability to formulate approaches &quickly prototype candidate
solutions generally plays a major role in reducing the cost & time required to arrive at a viable
system implementation.

What is DIP
An image may be defined as a two-dimensional function f(x, y), where x & y are spatial
coordinates, & the amplitude of f at any pair of coordinates (x, y) is called the intensity or gray
level of the image at that point. When x, y & the amplitude values of f are all finite discrete
quantities, we call the image a digital image. The field of DIP refers to processing digital image
by means of digital computer. Digital image is composed of a finite number of elements, each of
which has a particular location & value. The elements are called pixels.

Vision is the most advanced of our sensor, so it is not surprising that image play the
single most important role in human perception. However, unlike humans, who are limited to the
visual band of the EM spectrum imaging machines cover almost the entire EM spectrum, ranging
from gamma to radio waves. They can operate also on images generated by sources that humans
are not accustomed to associating with image.

There is no general agreement among authors regarding where image processing stops &
other related areas such as image analysis& computer vision start. Sometimes a distinction is
made by defining image processing as a discipline in which both the input & output at a process
are images. This is limiting & somewhat artificial boundary. The area of image analysis (image
understanding) is in between image processing & computer vision.

There are no clear-cut boundaries in the continuum from image processing at one end to
complete vision at the other. However, one useful paradigm is to consider three types of
computerized processes in this continuum: low-, mid-, & high-level processes. Low-level
process involves primitive operations such as image processing to reduce noise, contrast
enhancement & image sharpening. A low- level process is characterized by the fact that both its
inputs & outputs are images.
Mid-level process on images involves tasks such as segmentation, description of that
object to reduce them to a form suitable for computer processing & classification of individual
objects. A mid-level process is characterized by the fact that its inputs generally are images but
its outputs are attributes extracted from those images. Finally higher- level processing involves
Making sense of an ensemble of recognized objects, as in image analysis & at the far end of
the continuum performing the cognitive functions normally associated with human vision.

Digital image processing, as already defined is used successfully in a broad range of

areas of exceptional social & economic value.

What is an image?

An image is represented as a two dimensional function f(x, y) where x and y are spatial
co-ordinates and the amplitude of f at any pair of coordinates (x, y) is called the intensity of the
image at that point.

Gray scale image:

A grayscale image is a function I (xylem) of the two spatial coordinates of the image

I(x, y) is the intensity of the image at the point (x, y) on the image plane.

I (xylem) takes non-negative values assume the image is bounded by a rectangle [0, a] [0, b]I:
[0, a] [0, b] [0, info)

Color image:

It can be represented by three functions, R (xylem) for red, G (xylem) for green and B
(xylem) for blue.
An image may be continuous with respect to the x and y coordinates and also in
amplitude. Converting such an image to digital form requires that the coordinates as well as the
amplitude to be digitized. Digitizing the coordinates values is called sampling. Digitizing the
amplitude values is called quantization.

Coordinate convention:

The result of sampling and quantization is a matrix of real numbers. We use two
principal ways to represent digital images. Assume that an image f(x, y) is sampled so that the
resulting image has M rows and N columns. We say that the image is of size M X N. The values
of the coordinates (xylem) are discrete quantities. For notational clarity and convenience, we use
integer values for these discrete coordinates. In many image processing books, the image origin
is defined to be at (xylem)=(0,0).The next coordinate values along the first row of the image are
(xylem)=(0,1).It is important to keep in mind that the notation (0,1) is used to signify the second
sample along the first row. It does not mean that these are the actual values of physical
coordinates when the image was sampled. Following figure shows the coordinate convention.
Note that x ranges from 0 to M-1 and y from 0 to N-1 in integer increments.

The coordinate convention used in the toolbox to denote arrays is different from the
preceding paragraph in two minor ways. First, instead of using (xylem) the toolbox uses the
notation (race) to indicate rows and columns. Note, however, that the order of coordinates is the
same as the order discussed in the previous paragraph, in the sense that the first element of a
coordinate topples, (alb), refers to a row and the second to a column. The other difference is that
the origin of the coordinate system is at (r, c) = (1, 1); thus, r ranges from 1 to M and c from 1 to
N in integer increments. IPT documentation refers to the coordinates. Less frequently the toolbox
also employs another coordinate convention called spatial coordinates which uses x to refer to
columns and y to refers to rows. This is the opposite of our use of variables x and y.

Image as Matrices:

The preceding discussion leads to the following representation for a digitized image
f (0,0) f(0,1) .. f(0,N-1)

f (1,0) f(1,1) f(1,N-1)

f (xylem)= . . .

. . .

f (M-1,0) f(M-1,1) f(M-1,N-1)

The right side of this equation is a digital image by definition. Each element of this array
is called an image element, picture element, pixel or pel. The terms image and pixel are used
throughout the rest of our discussions to denote a digital image and its elements.

A digital image can be represented naturally as a MATLAB matrix:

f (1,1) f(1,2) . f(1,N)

f (2,1) f(2,2) .. f (2,N)

. . .

f= . . .

f (M,1) f(M,2) .f(M,N)

Where f (1,1) = f(0,0) (note the use of a moonscape font to denote MATLAB quantities).
Clearly the two representations are identical, except for the shift in origin. The notation f(p ,q)
denotes the element located in row p and the column q. For example f(6,2) is the element in the
sixth row and second column of the matrix f. Typically we use the letters M and N respectively
to denote the number of rows and columns in a matrix. A 1xN matrix is called a row vector
whereas an Mx1 matrix is called a column vector. A 1x1 matrix is a scalar.

Matrices in MATLAB are stored in variables with names such as A, a, RGB, real array
and so on. Variables must begin with a letter and contain only letters, numerals and underscores.
As noted in the previous paragraph, all MATLAB quantities are written using mono-scope
characters. We use conventional Roman, italic notation such as f(x ,y), for mathematical
Reading Images:

Images are read into the MATLAB environment using function imread whose syntax is

Imread (filename)

Format name Description recognized extension

TIFF Tagged Image File Format .tif, .tiff

JPEG Joint Photograph Experts Group .jpg, .jpeg

GIF Graphics Interchange Format .gif

BMP Windows Bitmap .bmp

PNG Portable Network Graphics .png

XWD X Window Dump .xwd

Here filename is a spring containing the complete of the image file(including any
applicable extension).For example the command line

>> f = imread (8. jpg);

Reads the JPEG (above table) image chest ray into image array f. Note the use of single
quotes () to delimit the string filename. The semicolon at the end of a command line is used by
MATLAB for suppressing output If a semicolon is not included. MATLAB displays the results of
the operation(s) specified in that line. The prompt symbol (>>) designates the beginning of a
command line, as it appears in the MATLAB command window.

Data Classes:

Although we work with integers coordinates the values of pixels themselves are not
restricted to be integers in MATLAB. Table above list various data classes supported by
MATLAB and IPT are representing pixels values. The first eight entries in the table are refers to
as numeric data classes. The ninth entry is the char class and, as shown, the last entry is referred
to as logical data class.

All numeric computations in MATLAB are done in double quantities, so this is also a
frequent data class encounter in image processing applications. Class unit 8 also is encountered
frequently, especially when reading data from storages devices, as 8 bit images are most common
representations found in practice. These two data classes, classes logical, and, to a lesser degree,
class unit 16 constitute the primary data classes on which we focus. Many ipt functions however
support all the data classes listed in table. Data class double requires 8 bytes to represent a
number uint8 and into 8 require one byte each, uint16 and int16 requires 2bytes and unit 32.

Name Description

Double precision, floating point numbers the Approximate.

Uint8 unsigned 8_bit integers in the range [0,255] (1byte per


Uint16 unsigned 16_bit integers in the range [0, 65535] (2byte per element).

Unit 32 unsigned 32_bit integers in the range [0, 4294967295](4 bytes per element).
Int8 signed 8_bit integers in the range [-128,127] 1 byte per element)

Int 16 signed 16_byte integers in the range [32768, 32767] (2 bytes per element).

Int 32 Signed 32_byte integers in the range [-2147483648, 21474833647] (4 byte per

Single _precision floating _point numbers with values

In the approximate range (4 bytes per elements)

Char characters (2 bytes per elements).

Logical values are 0 to 1 (1byte per element).

Int 32 and single required 4 bytes each. The char data class holds characters in Unicode
representation. A character string is merely a 1*n array of characters logical array contains only
the values 0 to 1,with each element being stored in memory using function logical or by using
relational operators.

Image Types:

The toolbox supports four types of images:

1 .Intensity images;

2. Binary images;

3. Indexed images;

4. R G B images.

Most monochrome image processing operations are carried out using binary or intensity
images, so our initial focus is on these two image types. Indexed and RGB colour images.

Intensity Images:

An intensity image is a data matrix whose values have been scaled to represent
intentions. When the elements of an intensity image are of class unit8, or class unit 16, they have
integer values in the range [0,255] and [0, 65535], respectively. If the image is of class double,
the values are floating point numbers. Values of scaled, double intensity images are in the range
[0, 1] by convention.
Binary Images:

Binary images have a very specific meaning in MATLAB.A binary image is a logical
array 0s and1s.Thus, an array of 0s and 1s whose values are of data class, say unit8, is not
considered as a binary image in MATLAB .A numeric array is converted to binary using function
logical. Thus, if A is a numeric array consisting of 0s and 1s, we create an array B using the

B=logical (A)

If A contains elements other than 0s and 1s.Use of the logical function converts all
nonzero quantities to logical 1s and all entries with value 0 to logical 0s.

Using relational and logical operators also creates logical arrays.

To test if an array is logical we use the I logical function: is logical (c).

If c is a logical array, this function returns a 1.Otherwise returns a 0. Logical array can be
converted to numeric arrays using the data class conversion functions.

Indexed Images:

An indexed image has two components:

A data matrix integer, x

A color map matrix, map

Matrix map is an m*3 arrays of class double containing floating point values in the range
[0, 1].The length m of the map are equal to the number of colors it defines. Each row of map
specifies the red, green and blue components of a single color. An indexed images uses direct
mapping of pixel intensity values color map values. The color of each pixel is determined by
using the corresponding value the integer matrix x as a pointer in to map. If x is of class double
,then all of its components with values less than or equal to 1 point to the first row in map, all
components with value 2 point to the second row and so on. If x is of class units or unit 16, then
all components value 0 point to the first row in map, all components with value 1 point to the
second and so on.

RGB Image:

An RGB color image is an M*N*3 array of color pixels where each color pixel is triplet
corresponding to the red, green and blue components of an RGB image, at a specific spatial
location. An RGB image may be viewed as stack of three gray scale images that when fed in to
the red, green and blue inputs of a color monitor

Produce a color image on the screen. Convention the three images forming an RGB color
image are referred to as the red, green and blue components images. The data class of the
components images determines their range of values. If an RGB image is of class double the
range of values is [0, 1].

Similarly the range of values is [0,255] or [0, 65535].For RGB images of class units or
unit 16 respectively. The number of bits use to represents the pixel values of the component
images determines the bit depth of an RGB image. For example, if each component image is an
8bit image, the corresponding RGB image is said to be 24 bits deep.

Generally, the number of bits in all component images is the same. In this case the
number of possible color in an RGB image is (2^b) ^3, where b is a number of bits in each
component image. For the 8bit case the number is 16,777,216 colors.

We have showed the working of an algorithm to detect obstacles. But the challenge always lies
in running it real-time on the vehicle with a quick update rate so as to react quickly
Fig. 13: Scene 3: (a) lower image; (b) depth map; (c) black and white image of obstacle map; (d)
lower image with obstacles overlaid; (e) polar map representation of obstacles

to changes in the environment. The first among our future steps should be optimizing the run
time of this algorithm and possibly explore efficient platforms and parallel architecture if
required to run it online. There are other quick depth-estimation algorithms [18], [16] in the
literature that are more suitable for real time applications at the cost of reduced accuracy. So a
sensible trade off has to be made on the choice of the depth estimation algorithm.

The obstacle detection algorithm is found to be decently robust in detecting obstacles sticking
out of the ground. But it does not particularly consider holes or cliffs. Given the depth

map and 3D location of points in view, it is easy to build a few more features in the pipeline to
seamlessly handle these kinds of obstacles as well. For example, we could classify the road
based on the fact that it has a steady gradient in depth value and plan a path for the vehicle only
along the definite road, if exists, to avoid falling off cliffs. And of course, besides just obstacles,
the vision system should also detect lane markings, sign boards, bicyclists hand signals, etc to
complete the whole perception package of the vehicle.

[1] Don Murray and James J Little. Using real-time stereo vision for mobile robot navigation.
Autonomous Robots, 8(2):161171, 2000.
[2] Hans P Moravec. Rover visual obstacle avoidance. In IJCAI, pages 785790, 1981.
[3] Don Murray and Cullen Jennings. Stereo vision based mapping and navigation for mobile
robots. In Robotics and Automation, 1997. Proceedings., 1997 IEEE International
Conference on, volume 2, pages 16941699. IEEE, 1997.
[4] Nicola Bernini, Massimo Bertozzi, Luca Castangia, Marco Patander, and Mario Sabbatelli.
Real-time obstacle detection using stereo vision for autonomous ground vehicles: A survey.
In Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on,
pages 873878. IEEE, 2014.
[5] Alberto Elfes. Using occupancy grids for mobile robot perception and navigation. Computer,
22(6):4657, 1989.
[6] Hernan Badino, Uwe Franke, and Rudolf Mester. Free space com-putation using stochastic
occupancy grids and dynamic programming. In Workshop on Dynamical Vision, ICCV, Rio
de Janeiro, Brazil, volume 20, 2007.
[7] Florin Oniga and Sergiu Nedevschi. Processing dense stereo data using elevation maps: Road
surface, traffic isle, and obstacle detection.
Vehicular Technology, IEEE Transactions on, 59(3):11721182, 2010.

[8] Andreas Wedel, Annemarie Meiner, Clemens Rabe, Uwe Franke, and Daniel Cremers.
Detection and segmentation of independently moving objects from dense scene flow. In
Energy minimization methods in computer vision and pattern recognition, pages 1427.
Springer, 2009.
[9] Uwe Franke, Clemens Rabe, Hernan Badino, and Stefan Gehrig. 6d-vision: Fusion of stereo
and motion for robust environment perception. In Pattern Recognition, pages 216223.
Springer, 2005.
[10] Philip Lenz, Julius Ziegler, Andreas Geiger, and Martin Roser. Sparse scene flow
segmentation for moving object detection in urban environ-ments. In Intelligent Vehicles
Symposium (IV), 2011 IEEE, pages 926 932. IEEE, 2011.
[11] A Talukder, R Manduchi, A Rankin, and L Matthies. Fast and reliable obstacle detection
and segmentation for cross-country navigation. In
Intelligent Vehicle Symposium, 2002. IEEE, volume 2, pages 610618. IEEE, 2002.

[12] Massimo Bertozzi, Luca Bombini, Alberto Broggi, Michele Buzzoni, Elena Cardarelli,
Stefano Cattani, Pietro Cerri, Alessandro Coati, Stefano Debattisti, Andrea Falzoni, et al.
Viac: An out of ordinary experiment. In Intelligent Vehicles Symposium (IV), 2011 IEEE,
pages 175180. IEEE, 2011.
[13] Alberto Broggi, Michele Buzzoni, Mirko Felisa, and Paolo Zani. Stereo obstacle
detection in challenging environments: the viac experience. In
Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages
15991604. IEEE, 2011.

[14] Kyung-Hoon Bae, Dong-Sik Yi, Seung Cheol Kim, and Eun-Soo Kim. A bi-directional
stereo matching algorithm based on adaptive matching window. In Optics & Photonics 2005,
pages 590929590929. Interna-tional Society for Optics and Photonics, 2005.
[15] HanSung Kim and Kwanghoon Sohn. Hierarchical depth estimation for image synthesis
in mixed reality. In Electronic Imaging 2003, pages 544553. International Society for Optics
and Photonics, 2003.
[16] Jochen Schmidt, Heinrich Niemann, and Sebastian Vogt. Dense disparity maps in real-
time with an application to augmented reality. In Applica-tions of Computer Vision, 2002.
(WACV 2002). Proceedings. Sixth IEEE Workshop on, pages 225230. IEEE, 2002.
[17] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and
their principles. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE
Conference on, pages 24322439. IEEE, 2010.
[18] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching
and mutual information. In Computer Vision and Pat-tern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, volume 2, pages 807814. IEEE, 2005.