
PhD Research proposal

Enabling robust visual navigation

Department of Computer Science and Software Engineering
Geospatial Research Centre (NZ) Ltd.
University of Canterbury, Christchurch, New Zealand

Visual Navigation is the key to enabling useful mobile robots to work autonomously in unmapped dynamic environments. It involves positioning a robot by tracking the world as it moves past a camera. Visual Navigation is not yet widely used in the real world because current implementations have limitations, including drift (accumulated errors), poor performance in featureless, self-similar or dynamic environments or during rapid motion, and the need for significant processing power. In my PhD I propose to address some of these issues, including improving the accuracy and robustness of feature-based relative positioning, and reducing drift without damaging reliability by adapting emerging loop closure techniques. I will aim to implement a 3d Visual Navigation system to demonstrate improvements in robustness and accuracy over existing technology.

1 Introduction
2 Current Research and Technology
  2.1 Accumulated Errors
    2.1.1 Bundle Adjustment
    2.1.2 Sensor Integration
      Inertial Navigation Systems
    2.1.3 Loop Closure
      Scene Recognition
  2.2 Motion Models, Frame Rates and Robustness
3 PhD Plan
  3.1 Progress to-date
    3.1.1 Developing techniques to reduce errors in initial solution used to start Bundle Adjustment
      Translation extraction
      Unbiased Estimator of stereo point position
      Future development work
    3.1.2 Loop Closure
    3.1.3 Demonstrating improved Visual Navigation
  3.2 Future development work
  3.3 Proposed research timeline and targets
4 Bibliography

1 Introduction
Visual Navigation (VN) involves working out the position of, and path taken by, a human, robot or vehicle from a series of photographs taken en route. This is potentially a useful navigation tool in environments without satellite positioning, for example indoors, in rugged terrain or cities, underwater, on other planets or near the poles. It could also be used to augment satellite or inertial navigation sensors where more precision is required, or to provide a backup navigation system.

VN is a major component of Visual Simultaneous Localisation and Mapping (V-SLAM). SLAM is the process by which a mobile robot can autonomously map and position itself within an unknown environment. Solving this problem is a major challenge in robotics and is heavily researched. It will enable robots to perform tasks requiring them to navigate unknown environments (for example so they can drive or deliver items, for domestic tasks, in military operations, or in hazardous environments).

VN is likely to become a widely used navigation solution once solved because cameras (and microprocessors) are already common on robots and portable computers (e.g. phones). Cameras have many additional uses, are relatively cheap, don't interfere with other sensors (they are passive if additional lighting is not needed), are not easily misled, and require no additional infrastructure (such as base stations or satellites). In theory VN will work in any environment where there is enough light and texture that static features can be identified. Humans manage to navigate using predominantly stereo vision whenever there is enough light to see our surroundings (although we also integrate acceleration information from our ears, feedback from our limbs and understanding of what we are seeing; for example, we know the scale of objects in our environment and can identify those that are moving with respect to others). These applications usually require a near real-time response to what has been seen.
Ideally we need each photo or video frame to be processed rapidly enough that the robot does not have to stop while calculation of its position catches up, and so that it can react to its change in position (to stop moving, having reached a goal, or to re-plan its path given new information about its environment). To avoid losing our position the frame rate must be high enough that there is significant overlap between consecutive frames. Real-time VN systems have been demonstrated in controlled environments (mainly two-dimensional and indoors), which limits their usefulness. The aim of my research will be to improve current VN technology so that it can be used in less constrained environments, with the goal of developing a real-time VN system robust enough for real-world applications.

2 Current Research and Technology

The best existing VN implementations (without an inertial sensor) can position a vehicle in 2d in stationary outdoor environments with an accuracy of about 3% [1]. This system ran at 5-13 fps and moved slowly enough that sufficient features moved less than 30% of the frame-width in every frame (typically 3-10%). These results are substantially better than [2], who used V-SLAM with a single camera to position a person in, and map, a 3d urban environment. After about 300m the error in position was about 100m, before the error was greatly reduced by loop closure. The algorithm did not quite run in real-time and did not use bundle adjustment. A video of the work in [3] shows visual navigation and re-localisation indoors using a mono handheld camera. This works in real-time in 3d but can only track 6-10 features and regularly loses its position (having tracked fewer than 3 points between two frames) until it recognises features it has seen before. [write about Oxford V-SLAM video]

The V-SLAM and VN implementations described above work like this:

1. Take photograph(s)
2. Extract 2d or 3d features from a mono image, or stereo image pair
3. Find correspondences between these features and features in previous frame(s)
4. If this image is different to recent ones, and sufficiently distinctive, attempt Loop Closure:
   a. Search for similar scenes seen in the past
   b. Test whether they are likely to be the same (could we be in the same position?)
5. Estimate displacement from previous frame(s)

Stage 2 usually uses Harris corners or similar point features. Stage 3 either tracks features or uses a projective transform-invariant feature descriptor (SIFT [4] or SURF [5]) to find correspondences. Stage 5 involves a point-alignment algorithm (some form of Procrustes alignment, or the 3- or n-point algorithm), or a motion model, to estimate the translation and rotation. RANSAC may be used to remove outliers, and Bundle Adjustment [6] may be used to refine the position estimate. Measurements from an odometer or INS may be incorporated (usually using an Extended Kalman Filter). Repeating this algorithm for every frame gives us our position.

An alternative approach, from Structure from Motion research, is to use Optical Flow [7]. This requires a high frame rate and depends on tracking many features across an image. Optical flow has been combined with feature tracking to make use of distant features by [8], and for mobile robot navigation [9, 10]. It is generally used because it is fast to compute rather than for its accuracy; however, when stereo depth disparities are an issue it may outperform feature-based algorithms, as depth is not used. [Neural nets???]
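The descriptor matching in stage 3 can be sketched as a nearest-neighbour search with a ratio test (as popularised with SIFT). This is only an illustrative sketch: real descriptors are 64- or 128-dimensional SURF/SIFT vectors, and the 2-vectors and frames below are invented.

```python
# A toy sketch of Stage 3, assuming descriptors are already extracted:
# nearest-neighbour matching with a ratio test. The ratio test accepts a
# match only when the best candidate is clearly better than the runner-up,
# which rejects ambiguous (self-similar) features.

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j]."""
    def dist2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    matches = []
    for i, d in enumerate(desc_a):
        scored = sorted((dist2(d, e), j) for j, e in enumerate(desc_b))
        best, second = scored[0], scored[1]
        # Compare squared distances, so the ratio threshold is squared too.
        if best[0] < (ratio ** 2) * second[0]:
            matches.append((i, best[1]))
    return matches

frame1 = [(0.0, 1.0), (5.0, 5.0), (9.0, 2.0)]
frame2 = [(9.1, 2.2), (0.1, 1.1), (5.0, 5.1), (5.1, 5.0)]  # reordered + noise
print(match_descriptors(frame1, frame2))  # the ambiguous (5, 5) feature is dropped
```

The repeated feature near (5, 5) in the second frame is rejected by the ratio test, mimicking how self-similar structure is filtered out before alignment.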


2.1 Accumulated Errors

VN implementations suffer badly from small errors accumulated over many steps. There are several ways of reducing this error: we can use Bundle Adjustment to refine our position estimate, we can integrate a complementary sensor (such as an inertial sensor), and/or we can recognise and use the position of places we've seen before (loop closure).

2.1.1 Bundle Adjustment

As a first approximation we estimate the transformation from the previous frame. There are closed-form (i.e. constant-time) linear methods that give us a best estimate (in some sense) of the transformation between two sets of features. These include the N-point algorithm [11] and point pattern (Procrustes) alignment [12]. However, we can improve on this estimate if we have matched features across more than two frames.

Bundle Adjustment (BA) [6] is the technique used in photogrammetry to estimate the scene structure and extrinsic camera parameters (position and orientation) given a sequence of frames; BA was shown to improve VN accuracy by [13]. BA estimates the scene structure and extrinsic camera parameters that minimise a function of the reprojection error. Reprojection error is the distance within the image between features we observed and the corresponding features projected from our estimated structure into the image. We can explicitly calculate this error, whereas we cannot explicitly calculate the error in our structure/camera position. BA is an iterative algorithm: each step consists of estimating the local gradient (Jacobian) of the objective function, and hence choosing a vector in the direction of this gradient that we can add to our previous solution to reduce the value of the objective function. The slow part of this is inverting a function of the Jacobian. The objective function is a function of all points and camera locations being estimated. Note that most of these are uncorrelated (e.g. if a camera doesn't see a point then the point's position has no effect on our estimated camera position), so the Jacobian is sparse.

Bad correspondences have a big effect on the error function if it assumes Gaussian or similar errors (a reasonable assumption otherwise). We can deal with a few mismatched points as outliers by choosing a Gaussian-plus-outliers (i.e. heavy-tailed) error model, but we are better off removing mismatches first. A RANSAC method is normally used to choose a large set of inliers.

BA is usually used in mapping/structure-from-motion calculations with large numbers of overlapping frames and large numbers of features. This takes much too long for real-time applications, as would applying BA to all frames and correspondences at once [14]. For a real-time implementation (i.e. each frame is processed in constant time) we must limit the number of frames. This is in theory a very good approximation to BA over the entire sequence for VN, as we are unlikely to have correspondences over more than a few consecutive frames (e.g. 2-10) when moving; therefore (if we sort the camera positions and points appropriately, e.g. by order of appearance) the whole Jacobian matrix will have a band-diagonal structure. Using BA on short sequences of frames corresponds to inverting short sections of this diagonal at a time. Errors will still accumulate as we gain and lose features over the series of images in the BA; however, in some ways BA gives the best motion estimate we can make from a sequence of images of tracked points.

To start BA it is desirable to have a reasonably good approximation to the actual solution, to minimise the number of iterations needed to get close enough to the minimum and to reduce the probability of converging to false minima (which are common, especially if there are mismatched points).
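As a toy illustration of the objective that BA minimises, the sketch below refines a single camera translation (identity rotation, structure held fixed) by numerically descending the summed squared reprojection error. This is not real BA, which jointly refines structure and all poses with Levenberg-Marquardt on the sparse Jacobian; the points, pose and parameters here are all invented.

```python
# Minimising reprojection error for one camera translation, as a sketch of
# the BA objective. Observations are perfect projections from a known true
# pose, so gradient descent should recover that pose.

def project(point, cam_t):
    """Ideal pinhole projection with the camera at cam_t, axes aligned."""
    x, y, z = (p - t for p, t in zip(point, cam_t))
    return (x / z, y / z)

def reprojection_error(points, obs, cam_t):
    err = 0.0
    for p, o in zip(points, obs):
        u, v = project(p, cam_t)
        err += (u - o[0]) ** 2 + (v - o[1]) ** 2
    return err

def refine(points, obs, cam_t, steps=5000, lr=1.0, h=1e-6):
    t = list(cam_t)
    for _ in range(steps):
        grad = []
        for k in range(3):  # central-difference gradient in each axis
            t[k] += h
            e_plus = reprojection_error(points, obs, t)
            t[k] -= 2 * h
            e_minus = reprojection_error(points, obs, t)
            t[k] += h
            grad.append((e_plus - e_minus) / (2 * h))
        t = [tk - lr * g for tk, g in zip(t, grad)]
    return t

points = [(0.0, 0.0, 10.0), (2.0, -1.0, 8.0), (-3.0, 2.0, 12.0), (1.0, 1.0, 9.0)]
true_t = (0.5, -0.2, 1.0)
obs = [project(p, true_t) for p in points]    # perfect observations
guess = refine(points, obs, (0.0, 0.0, 0.0))  # converges close to true_t
print([round(v, 3) for v in guess])
```

Even in this tiny case the depth component converges far more slowly than the lateral ones, hinting at why a good initial solution matters for the full problem.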

2.1.2 Sensor Integration

A different navigation sensor can be integrated with VN. Normally two position estimates would be combined using a Kalman filter. This gives a best estimate of the new position given the previous position and multiple motion estimates (assuming Gaussian errors with known covariance), and keeps track of the estimated accumulated error. Kalman filters are widely used for maintaining SLAM position and map estimates; incorporating an additional sensor is straightforward. A good implementation would cope with position estimate failures (e.g. we can't track enough points between two images, or we lose our GNSS signal). Particle filters [15] are another, less popular, option that can be better at dealing with erroneous input, or with data that is not well approximated by a Normal distribution; it appears to be hard to make real-time implementations that provide the accuracy of Kalman Filter SLAM [16]. Odometry (for example on the Mars Rover [17]), GNSS (for example to navigate missiles), laser scanners (to identify profiles as recognisable descriptors [18]) and inertial navigation sensors have been integrated with VN.

Inertial Navigation Systems

An INS is the most popular positioning system used to augment vision. Like vision it suffers from drift over time; currently it is much more accurate than VN. As vision measures motion and an INS measures acceleration, vision can be used to measure and correct INS drift if it is detected that the camera is stationary (or moving very slowly); this is one situation where VN is highly accurate compared to inertial navigation. Various teams have shown that inertial position estimates can be improved using vision, including [19], who showed that integrating vision with a low-quality IMU reduced errors to those of an expensive, higher-quality tactical IMU. The position error is still unbounded in the long term.
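In the simplest scalar case, the vision/INS fusion described above reduces to the standard Kalman update. The sketch below uses made-up numbers and assumes independent Gaussian errors with known variances; it shows that the fused variance is always below either input variance.

```python
# A minimal scalar Kalman update fusing two position estimates, e.g. one
# from an INS and one from vision. Assumes independent Gaussian errors.

def kalman_fuse(x1, var1, x2, var2):
    """Combine two noisy estimates of the same quantity."""
    k = var1 / (var1 + var2)   # Kalman gain: weight given to the second estimate
    x = x1 + k * (x2 - x1)     # fused estimate (inverse-variance weighting)
    var = (1 - k) * var1       # fused variance, smaller than var1 and var2
    return x, var

# e.g. INS says 10.0 m (variance 4.0), vision says 10.6 m (variance 1.0)
x, var = kalman_fuse(10.0, 4.0, 10.6, 1.0)
print(round(x, 3), round(var, 3))  # 10.48 0.8
```

The fused estimate sits closer to the more certain sensor, and the variance bookkeeping is what lets a filter track the accumulated error mentioned above.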

2.1.3 Loop Closure

A useful way of bounding position errors (within a finite region) is to recognise places we have been before; we then know our position relative to a previous position (and if we are using an INS we can correct for drift over the previous loop). This is known as Loop Closure, and is the key to building accurate maps in SLAM. The problem is usually tackled in SLAM by looking (visually, or with range-finders) for landmarks that we might be able to see from our current pose and pose uncertainty. This appears to work well when the robot position is reasonably well known (e.g. using a wheeled robot with an odometer). However, it is most useful to be able to re-localise ourselves when there is the greatest uncertainty in our position. Therefore it would be more useful to recognise places we've been before without needing to know the approximate position of that place, as a lost person would when they recognise somewhere familiar. It is very easy to lose our orientation entirely with vision (for example if all we can see for a frame is a blank wall, or a light is switched off), so re-localisation is particularly useful when we can see again (although our orientation may be lost entirely, a bound on our velocity gives only a rapidly growing bound on our position).

If we perform loop closure for VN using scene similarity, we don't need to maintain a large map of features; updating the map is the main computational cost in SLAM. Maps have advantages for VN though: a small local map can help us keep track of nearby features that may re-enter our field of view [20] (which helps reduce problems caused by occlusion), and the positions of many previous scenes can be refined after closing multiple loops. A database of scenes and their positions is essentially the same as a map, although updating the positions of many scenes, as we would using a Kalman Filter in SLAM on loop closure, would be less straightforward.

Scene Recognition

Object Recognition is a key problem in Artificial Intelligence. We are concerned with the related problem of recognising two scenes that are identical, possibly from different viewpoints (i.e. identical up to a perspective transformation and occlusion) and distinctive. The basic method used for scene recognition is to encode some property of the scene as a descriptor that can be added to a database. These should (ideally) be invariant to occlusion and perspective transformations. We recognise a scene when we find a scene in the database with a descriptor that is close enough to the current image by some metric. SIFT feature descriptors [21], words on signs read using OCR [22], and 2d stereo/laser scan profiles [18] have been used as descriptors for navigation.

Loop closure can also cause us to lose our position if we recognise a scene incorrectly. Therefore it is important that what we see is distinctive and discriminative, so we don't incorrectly recognise scenes that are common in our environment. We would normally do this by clustering descriptors; descriptors in a large cluster are less distinctive than those in a small cluster [23]. The Bag of Words algorithm has been applied to SIFT descriptors to identify discriminative combinations of descriptors [24, 25]. For example, when navigating indoors, window corners are common so are not good features with which to identify scenes. Features found on posters or signs are much better, although even these may be repeated elsewhere. Scenes with very little detail are more likely to be falsely recognised than those with more detail (which are less likely to happen to have the same property that we are recognising). [26] demonstrates real-time loop detection using a hand-held mono camera, using SIFT features and histograms (of intensity and hue) combined using a Bag of Words approach. [27] also demonstrated real-time loop closure outdoors using SIFT features and laser scan profiles. Much work was needed to remove visually ambiguous scenes, and more complex profiles were preferred as they provide more discriminative features.

The geometry of scenes has not often been used to recognise them, even though it is likely that people use scene geometry. This is probably because of a lack of geometric properties that are invariant under perspective transformations. In a 2d projection, ratios of distances, angles, and ordering along lines are not preserved. The epipolar constraint does define an invariant feature, but this is defined by seven feature matches (up to a small number of possibilities), so eight or more points are needed for it to validate or invalidate a possible correspondence.
Simple geometric constraints have been used to eliminate triples of bad correspondences [28], but the key assumption here is that points lie on a convex surface, i.e. there is no occlusion, which is not a good assumption for real world navigation applications. If suitable descriptors can be found then geometric constraints would be very useful for identifying distinctive scenes in an environment made up of different arrangements of similar components.
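The clustering and Bag-of-Words ideas above can be sketched as follows: each scene is a multiset of quantised "visual words", words that are common across the database (like indoor window corners) are down-weighted by an inverse document frequency term, and scenes are compared by cosine similarity. The word names and scenes below are invented for illustration.

```python
import math
from collections import Counter

# Bag-of-Words scene matching in miniature: common words get near-zero
# weight, so distinctive words (posters, signs) dominate the similarity.

def idf_weights(scenes):
    n = len(scenes)
    df = Counter(w for s in scenes for w in set(s))  # document frequency
    return {w: math.log(n / df[w]) for w in df}

def similarity(a, b, idf):
    ca, cb = Counter(a), Counter(b)
    va = {w: ca[w] * idf.get(w, 0.0) for w in ca}
    vb = {w: cb[w] * idf.get(w, 0.0) for w in cb}
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

database = [
    ["window", "window", "door"],   # scene 0: only common words
    ["window", "poster", "sign"],   # scene 1: has distinctive words
    ["window", "door", "plant"],    # scene 2
]
idf = idf_weights(database)
query = ["window", "poster", "sign"]  # revisiting scene 1
scores = [similarity(query, s, idf) for s in database]
best = max(range(len(scores)), key=lambda i: scores[i])
print(best)
```

Because "window" occurs in every scene its weight is zero, so the query matches scene 1 cleanly even though all three scenes share that word.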


2.2 Motion Models, Frame Rates and Robustness

A high frame rate is desirable for VN so that there is always overlap between consecutive frames (and preferably sequences of frames), so that feature correspondences can be found between frames. For example, a camera held by a human may be swung through 180 degrees in approximately a second; if the frame is 45 degrees wide then this requires a frame rate higher than four frames per second for there to be any overlap between frames at all, and higher than eight frames per second if any features are to be tracked across more than two images. Omni-directional cameras may partially solve this, although they provide a larger but less detailed view of the scene, and are more expensive and less generic.

Often the camera motion is modelled and used as an initial estimate of the translation between frames. This works well in many applications (such as UAVs) where the acceleration in the interval between frames is small; however, for cameras attached to humans (or even fast robots, or robots travelling through difficult terrain) jerky movement and rapid rotation are likely, and the algorithm must be able to cope reliably with unpredictable motion (in other words, it should be able to estimate motion from what it sees alone). Unpredictable motion is also when errors accumulate most rapidly; fortunately it is less critical to keep the frame rate high when motion is smooth, which is exactly when the motion model is most effective at speeding up the algorithm.
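The frame-rate bound is a one-line calculation, under the strict geometric assumption that overlap only requires the rotation per frame interval to stay below the field of view:

```python
# Minimum frame rates for a camera rotating at omega (deg/s) with a
# horizontal field of view fov (deg). Any overlap requires the rotation per
# frame interval to be below fov; a feature visible in `frames` consecutive
# frames spans frames-1 intervals, tightening the bound accordingly.

def min_fps_for_overlap(omega_deg_s, fov_deg):
    return omega_deg_s / fov_deg

def min_fps_for_tracking(omega_deg_s, fov_deg, frames=3):
    return (frames - 1) * omega_deg_s / fov_deg

print(min_fps_for_overlap(180, 45), min_fps_for_tracking(180, 45))  # 4.0 8.0
```

In practice reliable matching needs substantial overlap, not merely non-zero overlap, so real systems want a comfortable margin above these figures.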

3 PhD Plan:
I will develop ways of speeding up VN that uses BA to refine position estimates, by identifying existing algorithms and developing new algorithms that lead to more accurate initial solutions and robust outlier rejection (reducing the number of iterations needed and the probability of finding false minima). I will compare different approaches using simulated data to determine the best approach.

I will investigate ways of improving the reliability of fast loop closure algorithms. I would like to incorporate image geometry into algorithms that match feature points, either as part of a descriptor or to validate matches. I will also investigate conditioning.

I will implement VN software that I hope will provide real-world verification of the algorithms I have developed. Real-time navigation systems and loop closure have been demonstrated before, but always with severe restrictions, e.g. on the number of points tracked or maximum angular velocities, or restricted to 2d or relying on a level ground plane.


3.1 Progress to-date

3.1.1 Developing techniques to reduce errors in initial solution used to start Bundle Adjustment.
It is advantageous to start Bundle Adjustment from a good approximation to the actual motion, so that false minima are more likely to be avoided and fewer iterations are needed to reach a good enough solution. To determine relative motion from a set of correspondences between 3d points we first recover the rotation between the point sets, then the translation; this gives us the camera motion. This process is known as Procrustes Alignment and is related to Point-Pattern Matching.

Sometimes the initial solution used is either the solution obtained by Bundle Adjustment on previous frames, or the assumption that the motion between frames is close to zero. This assumes that the frame rate is high enough that motion or acceleration between frames is small, which is a bad approximation when there is substantial motion between frames. This is precisely when we most need an accurate motion estimate, as smaller movements (with the same relative error) contribute proportionally less to the global error.

Various methods have been proposed for extracting the rotation:

1. [12] derived expressions for the translation and rotation (and scale) between a set of point correspondences that minimise the square of the error.
2. [29] take the Singular Value Decomposition of a matrix formed from matrix products of points and discard the diagonal factor to give an orthogonal rotation matrix.
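Method (2) can be sketched as follows (this is the Kabsch form of the SVD approach, assuming correspondences are known and using numpy; the point sets are synthetic and noise-free). The determinant check guards against returning a reflection instead of a rotation.

```python
import numpy as np

# SVD-based rotation extraction between two corresponding 3d point sets:
# form the cross-covariance of the centred sets, take its SVD and discard
# the diagonal factor, correcting the sign so the result is a rotation.

def extract_rotation(p, q):
    """Least-squares rotation taking point set p onto q (n x 3 arrays)."""
    p0 = p - p.mean(axis=0)
    q0 = q - q.mean(axis=0)
    h = p0.T @ q0                      # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    s = np.diag([1.0, 1.0, d])         # reflection guard
    return vt.T @ s @ u.T

rng = np.random.default_rng(0)
p = rng.normal(size=(10, 3))
angle = 0.3
r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
q = p @ r_true.T + np.array([1.0, -2.0, 0.5])   # rotate then translate
r_est = extract_rotation(p, q)
print(np.allclose(r_est, r_true))
```

With noise-free data the rotation is recovered exactly; the interesting comparisons in my simulations are between methods under realistic measurement noise.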

An alternative approach would be to use the SVD to give a matrix that is the best least-squares transformation matrix between the two point sets. Then we can either use the method in (2) (taking a second SVD of this 3x3 matrix), or use the algorithm by [30] to find the closest rotation to this matrix. After getting identical simulated results when comparing these methods I have proved that they are equivalent.

I have implemented a MATLAB program to simulate noisy stereo image data (projecting points onto images, adding Gaussian measurement noise, calculating 3d structure), with which I can compare these approaches. Initial results show that the first method is slightly better (mean error approximately 5% lower) than the second given 5-15 points with a reprojection error of mean 0.01 radians (approximately 2 pixels for a typical camera), but the second is significantly better given more exact data (mean reprojection error 0.002 radians).

Translation extraction

The three widely used methods of extracting translation are:

1. The 3-point algorithm [31]: three non-collinear point correspondences determine a small finite set of possible new camera positions. We can solve this exactly, then disambiguate between solutions using a fourth correspondence.
2. The N-point algorithm [11]: the generalisation of (1) to n > 3 points, making use of the fact that the problem is overdetermined.
3. Procrustes Alignment (difference of centroids of the point sets) between 3d point sets after the rotation. Unlike (1) and (2), this method is directly affected by the accuracy of rotation estimates.

Algorithms (1) and (2) can position a camera relative to known 3d structure given one 2d image alone; the only information about the second point set used is the angles between points. If we have a stereo pair we can use angles between 3d points (this is effectively an average of the angles from the two images, so we would expect a slightly reduced error), and also improve on distance estimates between point pairs (again by taking some sort of average).
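Method (3) is a one-liner once the rotation is known: the translation is the difference between the second centroid and the rotated first centroid. The data below is synthetic and noise-free for illustration (identity rotation for clarity); errors in the rotation estimate would feed straight into the translation.

```python
import numpy as np

# Centroid-difference translation extraction: given rotation r with
# q ≈ p @ r.T + t, the least-squares t is mean(q) - r @ mean(p).

def extract_translation(p, q, r):
    """Translation t such that q ≈ p @ r.T + t, given rotation r."""
    return q.mean(axis=0) - r @ p.mean(axis=0)

p = np.array([[0.0, 0.0, 5.0], [1.0, 2.0, 6.0],
              [-1.0, 1.0, 7.0], [2.0, -1.0, 5.5]])
r = np.eye(3)                        # identity rotation for clarity
t_true = np.array([0.3, -0.1, 0.8])
q = p @ r.T + t_true
print(extract_translation(p, q, r))
```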

My simulations can currently use (1) or (3) and I am working on incorporating (2).

Unbiased Estimator of stereo point position

3d point positions are normally calculated from stereo correspondences by finding the intersection of rays projected through each point in the image, as described by [32]. As there is noise in the image measurements these rays do not generally intersect at a point, so instead we take the midpoint of the points where the rays are closest together. In general this is a good approximation, and it is an unbiased estimator of the position if the rays are perpendicular. However, for more distant points the PDF of the true position is highly asymmetric and is not centred on this position.

The proofs that algorithms for Procrustes alignment are optimal assume that errors are independent Gaussian random variables with zero mean. However, this does not appear to be a good assumption when points come from stereo. Independence is a reasonable assumption if calibration errors are small, so that errors come mainly from extracting points from pixels in a photograph. Algorithms to extract points are generally symmetric and extract individual points independently. If we assume these errors are Gaussian then the error in the plane perpendicular to the direction the stereo rig is pointing will be approximately Gaussian with zero mean (it is determined by the point where the rays intersect this plane at whatever depth we have calculated for our point).

If we adjust the calculated depth so that the adjusted point is at the expected value of its depth, it will be an unbiased estimator of the true position. The distribution of possible positions about the adjusted position is then closer, in some sense, to a Gaussian distribution, so it is intuitive that the algorithms described above, which assume this, will perform better given adjusted points. I have calculated the expected depth of points given their reconstructed positions:

    d̂ = ∫_l^∞ [ b d_m / (b + 2 e d_m) ] Φ_{0,σ}(e) de

where:
  b is the baseline length
  d_m is the measured depth
  e is the error in measuring one pixel (in radians)
  Φ_{0,σ} is the normal PDF with zero mean and standard deviation σ (the image noise in radians)
  l > −b / (2 d_m) is a lower limit high enough that we are not considering points that we couldn't possibly see, as they would be behind the camera

[TODO: should that be root 2? Sum of 2 NDs]

This can be pre-computed and stored in a lookup table, so is relatively fast. Several small-angle assumptions are made, so simulated data would possibly give better estimates; I will try this approach.
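The expected-depth integral can be evaluated numerically. The sketch below uses simple trapezoidal quadrature with illustrative, invented parameters (10 cm baseline, 2 m point, angular noise of roughly 2 pixels); it shows the adjusted depth exceeding the measured depth, as the asymmetric error distribution predicts.

```python
import math

# Numerically evaluate d_hat = integral over e of
#   [b * d_m / (b + 2 * e * d_m)] * phi(e; 0, sigma),
# truncated below so that only physically possible depths contribute.

def adjusted_depth(b, d_m, sigma, n=20001):
    lower = -0.9 * b / (2.0 * d_m)   # must satisfy l > -b/(2 d_m)
    upper = 8.0 * sigma              # Gaussian tail beyond 8 sigma is negligible
    de = (upper - lower) / (n - 1)
    total = 0.0
    for i in range(n):
        e = lower + i * de
        phi = math.exp(-e * e / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))
        w = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end weights
        total += w * (b * d_m / (b + 2.0 * e * d_m)) * phi * de
    return total

b, d_m, sigma = 0.1, 2.0, 0.002   # 10 cm baseline, 2 m point, ~2 px noise
print(adjusted_depth(b, d_m, sigma) > d_m)
```

The expected true depth exceeds the measured depth (a Jensen-inequality effect: depth is a convex function of the disparity error), and the gap shrinks towards zero as the noise goes to zero.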

3d points can be shifted by this amount in the direction the stereo rig is pointing, so that the distribution of their true position is centred on the adjusted point. Preliminary results suggest a small improvement in accuracy that is insignificant (about 2.5%) until very noisy data is used, or until the point depth is more than about ten times the stereo baseline, when it gives a significant improvement in accuracy. More simulations are necessary to see when this is useful.

Future development work

1) I will use and extend my MATLAB program to compare different first approximations to the transformations between images.
2) I will investigate incorporating Bundle Adjustment into my MATLAB program, to test whether accurate starting positions really do save time and reduce the likelihood of finding false minima.
3) I will investigate whether the reconstructed structure given by BA also suffers from skewed stereo distributions. It clearly does in the case of points occurring in only two images, as the same reconstructed point described above is the one that minimises reprojection error. A function of the reprojection error that is minimised at the adjusted point would have to be decreasing in increasing total reprojection error (close to where the optimal reconstruction lies under the current approach), so this is unlikely to be a suitable objective; post-processing the results may be possible.

3.1.2 Loop Closure

I will investigate using the geometry of scenes either to search for matching scenes or to validate matches based on distinctive point combinations. I have already started investigating ways of validating matches without aligning points, by examining the directions of vectors between correspondences. I will attempt to analyse projective-invariant features of point sets in order to describe scenes in terms of their geometry. The aim is to find a descriptor (such as a set of orderings) that will discriminate between different scenes. I will investigate conditioning based on either descriptor frequencies or descriptor combination frequencies, and ways of partitioning distinctive descriptor sets. This may involve finding existing techniques that are suitable for loop closure, or developing new ones.

3.1.3 Demonstrating improved Visual Navigation

I have started to implement a navigation algorithm using C++, the OpenCV library and a stereo webcam pair. [Insert screenshot] At the moment features are detected and good correspondences are found within stereo pairs. The epipolar constraint and match condition numbers are used to speed up matching and eliminate bad matches. 3d positions are calculated, but these are not yet working. Features are matched with the previous frame. A RANSAC/Procrustes alignment algorithm for relative motion is implemented but untested, due to the lack of good 3d points.
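The epipolar constraint used to eliminate bad matches can be sketched for a calibrated pair: matching normalised image points satisfy x2ᵀ E x1 = 0, where E = [t]× R is the essential matrix, so a large residual flags a mismatch. The geometry below is invented for illustration (numpy is used in place of the OpenCV calls in my implementation).

```python
import numpy as np

# Checking candidate stereo matches against the epipolar constraint.
# A true correspondence gives a residual of (numerically) zero; a wrong
# pairing gives a clearly non-zero residual and can be rejected.

def skew(t):
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def project(p, r, t):
    """Normalised homogeneous image point of world point p in camera (r, t)."""
    pc = r @ p + t
    return np.array([pc[0] / pc[2], pc[1] / pc[2], 1.0])

r = np.eye(3)                   # identity rotation: a pure sideways shift
t = np.array([0.12, 0.0, 0.0])  # stereo baseline along x
e = skew(t) @ r                 # essential matrix

p_world = np.array([0.4, -0.3, 5.0])
x1 = project(p_world, np.eye(3), np.zeros(3))  # left camera at origin
x2 = project(p_world, r, t)                    # right camera
wrong = project(np.array([-1.0, 0.8, 4.0]), r, t)

print(abs(x2 @ e @ x1), abs(wrong @ e @ x1))
```

The constraint is one scalar test per candidate pair, which is why it is cheap enough to use as a pre-filter before the more expensive alignment stage.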


3.2 Future development work

I will fix and extend my navigation software to give motion estimates using the navigation algorithm identified by simulation. I will aim to show that this gives a good enough initial estimate to allow BA to refine the position estimate in real-time; fast BA code is available in the sba library for this purpose. Other potential areas to research include the trade-off between a high frame rate (features are tracked over many frames but not much time is available to process each frame) and spending longer refining estimates from less frequent, possibly higher resolution, frames.

Generic relative-positioning techniques across multiple frames will enable me to incorporate loop closure into this software, by positioning relative to frames from the same position in the past. The most accurate algorithms for navigation at the moment do not incorporate BA; hopefully, by identifying and refining the most appropriate algorithms, it will be possible to do this in real-time.

I will need to learn more Bayesian statistics and statistical geometry, which will allow me to understand and select existing methods and to develop new solutions to the loop closure problem. I will do this primarily through reading. I will also attend undergraduate mathematics lecture courses on optimisation (MATH412-07S1), geometry (MATH407-07S1) and calculus (MATH264-07S1) to extend my pure mathematical knowledge into more applied fields. This will help me understand the concepts underlying BA and other geometric algorithms, and to develop and adapt them. For example, I am currently investigating whether it is beneficial for VN to adapt BA to minimise reconstruction errors rather than reprojection errors, which requires an understanding of robust optimisation.

As the field of VN is moving rapidly and significant advances are likely, I will continue my literature review, paying particular attention to forthcoming conference proceedings and the activities of groups working on VN, V-SLAM and image recognition, including the following.

Key computer vision conferences:
International Conference on Robotics and Automation 2008
European Conference on Computer Vision 2008
International Conference on Computer Vision 2008
Computer Vision and Pattern Recognition 2008
SLAM Summer School 2008 (unconfirmed at the moment)
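To make the reprojection/reconstruction distinction concrete, the sketch below (Python, with arbitrary example camera parameters) computes both residuals for a single point:

```python
import math

def project(P, f=700.0, cx=320.0, cy=240.0):
    # Pinhole projection of a 3d point in camera coordinates.
    X, Y, Z = P
    return (f * X / Z + cx, f * Y / Z + cy)

def reprojection_error(P, observation):
    # Image-space residual: the quantity standard BA minimises.
    u, v = project(P)
    return math.hypot(u - observation[0], v - observation[1])

def reconstruction_error(P, P_true):
    # World-space residual: distance between estimated and reference 3d points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(P, P_true)))
```

The two objectives genuinely differ: a point moved along the viewing ray (for example (0.24, 0, 2.4) versus (0.12, 0, 1.2)) projects to the same pixel, so its reprojection error against that observation is zero while its reconstruction error exceeds a metre. This is why adapting BA to minimise reconstruction error could lead the optimisation to a different answer.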

Leading research groups in the field:
ROBOTVIS: Computer Vision and Robotics, INRIA, Grenoble (localisation, VN, SLAM)
Robotics Research Group, Oxford University (loop closure, SLAM, image registration)
Center for Visualization & Virtual Environments, University of Kentucky (VN, photogrammetry applied to CV)
The Australian Centre for Field Robotics, University of Sydney (SLAM and UAVs)


Proposed research timeline and targets

2008
January: Complete investigation of approximate transformation extraction techniques from point sets.
February: Learn sufficient Bayesian statistics to be able to adapt categorisation and discrimination (conditioning) techniques to the problems of VN.
March: Complete analysis of, and publish results from, transformation extraction experiments.
April: Develop VN software to a stage where it can infer reasonable motion estimates from point sets. Aim for a real-time implementation.
June: Complete incorporation of BA into VN software to refine position. Aim for a near real-time implementation, identifying bottlenecks to help guide future work.
August: Complete preliminary research into suitable registration algorithms and map formats for loop closure.
October: Decide whether to extend research to monocular vision or to stay with stereo.
November: Complete addition of mapping to VN software (either a database-of-descriptors approach or a traditional SLAM landmark map).

2009
January: Decide whether to focus research efforts on sensor integration involving VN, or on localisation.
March: Exhibit VN system for 3d indoor or outdoor positioning.
April: Publish details of any improvements of my VN implementation over existing technology.
June: Complete research into registration/recognition in loop closure.
November: Start writing thesis.

2010
April: Complete experimental work.
July: Submit PhD thesis. Write papers based on thesis.
