
Distributed Visual Processing for Augmented Reality

Winston Yii
Monash University
winston.yii@monash.edu

Wai Ho Li
Monash University
wai.ho.li@monash.edu

Tom Drummond
Monash University
tom.drummond@monash.edu

ABSTRACT

Recent advances have made augmented reality on smartphones possible, but these applications are still constrained by the limited computational power available. This paper presents a system which combines smartphones with networked infrastructure and fixed sensors, and shows how these elements can be combined to deliver real-time augmented reality. A key feature of this framework is the asymmetric nature of the distributed computing environment. Smartphones have high bandwidth video cameras but limited computational ability. Our system connects multiple smartphones through relatively low bandwidth network links to a server with large computational resources connected to fixed sensors that observe the environment. By contrast to other systems that use preprocessed static models or markers, our system has the ability to rapidly build dynamic models of the environment on the fly at frame rate. We achieve this by processing data from a Microsoft Kinect to build a trackable point cloud model of each frame. The smartphones process their video camera data on-board to extract their own set of compact and efficient feature descriptors which are sent via WiFi to a server. The server runs computationally intensive algorithms including feature matching, pose estimation and occlusion testing for each smartphone. Our system demonstrates real-time performance for two smartphones.

1 INTRODUCTION

This work is motivated by the goal of creating a collaborative multi-user Augmented Reality system which combines three elements:

- Handheld devices such as smartphones or tablets which contain video cameras that image the environment and display that image with overlaid computer graphics on their screens.
- Fixed hardware that can communicate with the mobile devices via a wireless network and can provide large computing resources.
- Fixed sensors in the environment that provide information which can be used to localise the mobile devices.

This kind of architecture creates an asymmetric computing environment which raises the key issues of how these components can collaborate to provide an Augmented Reality experience, how the (somewhat) limited bandwidth of the wireless network can best be used, and how the computational burden should be distributed between the smartphones and the fixed infrastructure. This has to be organised so as to overcome the constraint of limited computational capability on the mobile devices and also to maximise their battery life.

1.1 Contributions

We present a system comprising multiple (initially two) smartphones and a desktop PC to which a Microsoft Kinect is attached. The Kinect is used to provide a coordinate frame in which virtual content and the smartphones' positions can be expressed. It also senses the environment as 3D textured surfaces and processes this to generate a trackable point cloud model consisting of an indexed set of visual descriptors. This trackable model is dynamic, rebuilt at 30Hz frame rate, and is used to localise the smartphones at interactive frame rates. To our knowledge this is the first time this has been achieved; it is made possible by using a novel rotationally covariant descriptor, RHIPS, which is based on HIPS [20]. Because we remodel the world dynamically at 30Hz, our system is not only robust to moving and possibly unmodelled elements of the scene, it is able to use them to aid localisation. We show how to transform the Kinect's depth map into the viewpoint of a smartphone, thus turning it into a virtual Kinect. This is used to correctly render virtual content with occlusions in the smartphones' viewpoints. We also use the Kinect's depth map to capture interactions between real and virtual elements (e.g. when a user's hand touches a virtual object).

Figure 1: Distributed visual processing for Multi-user Augmented Reality. Here the system demonstrates a real-time AR service for two phones, tracking both phones and rendering virtual content with occlusions. Note the virtual model of the earth rendered behind the Kinect packaging.

2 BACKGROUND

Markerless visual tracking of real scenes is an attractive option [3, 10, 7, 15, 1] for obtaining pose information for Augmented Reality applications, for several reasons:

- Video cameras are information-rich sensors and are already needed where AR is mediated by video feed-through, such as on a handheld tablet or smartphone. The use of additional sensors can be avoided by calculating the camera's location directly from its images and then using it to accurately register virtual elements to the real world.
- By accurately determining the world 3D location of the pixels in the camera image, it becomes possible to augment the image with virtual objects to similar (i.e. pixel) precision.
- The use of fiducial markers (such as ARToolKit [6]) may be undesirable for aesthetic or practical reasons, or may not even be permitted.

Such systems rely on having a 3D model of the operating environment which is usually generated offline beforehand [3, 7, 15] or, at the very least, a partial model [2] or keyframes which correctly represent the scene [10]. This implies that the 3D model must be valid for the duration of AR operation, which in turn suggests that the majority of the operating environment must be static. Many kinds of model can be used in these systems: [7] used line features, while [15] used a textured polyhedral model.

Dynamic environments such as those which include moving human users and physical components (e.g. for tangible user interfaces) are particularly problematic for augmented reality systems based on markerless visual tracking. While moveable components can be modelled [14, 19] and independently tracked, such scenes often contain unmodelled components (such as the user) which occlude the scene and confuse markerless visual trackers. This has been partially solved by using a second camera [21]; however, this requires that other parts of the scene can be modelled as being static.

Sensors that directly recover the three-dimensional structure of the world are a natural fit for Augmented Reality, and numerous researchers have exploited them to obtain several benefits:

- By sensing the 3D world directly, it is easier to localise the sensors.
- Otherwise unmodelled components of the scene (such as the user's hands) can be detected and correctly handled (although this can also be handled in special cases by skin detection tools [9]).
- Visual effects such as relighting are easier to calculate.

Early work [5] used a calibrated multi-camera rig to densely recover scene structure. Time-of-flight cameras have also been used in conjunction with colour cameras to which they are rigidly mounted to provide a joint colour and depth representation of the world [4]. More recently the availability of cheap RGB-D sensors in the form of the Microsoft Kinect and the Asus Xtion has created new opportunities. KinectFusion [12] fuses continuous views from a Kinect to obtain a voxel model of the world. This is then used to provide an Augmented Reality experience in which virtual objects are overlaid on the live view from the colour camera in the Kinect.

A constraint that applies to all of these approaches is that they are only able to augment the view provided by the complex sensor (multi-camera, time-of-flight, or projector-camera), and this has prevented their use on smartphones because the sensors are not available there. ShareZ [13] is similar to our work in that it showed how to transform sensed depth into the frame of another viewer and provide this as a service in a client-server architecture for use in head-mounted displays. This relied on an external tracker or fixed, surveyed fiducial markers to localise the HMDs.

3 FRAMEWORK

The system consists of a PC server, WiFi base station, Microsoft Kinect and smartphone clients.
A key driver of the design is the desire to minimise computation on the smartphones, both to enable operation on resource-limited devices and to minimise their battery consumption. The natural consequence of this is that it is advantageous to get the processing chain off the device as quickly as possible. However, while it would be possible to compress video from the phone and transmit that to the server, this results in a large consumption of bandwidth that would make it difficult to support multiple phones. Video compression is also relatively expensive in terms of computation and battery use. Hence a compromise is found whereby the phone performs the initial processing of its image by finding keypoints and computing descriptors, and these are conveyed to the server rather than the image that they came from.

The server contains a continuously updated visual 3D model of the environment. It is responsible for calculating the smartphones' camera poses, handling WiFi communications, detecting user interactions, calculating augmented reality physics and performing occlusion testing for virtual objects with respect to the mobile phones. The server is connected to a Kinect sensor to provide up-to-date visual models of the environment. Users are able to point their smartphone cameras at the modelled environment and view the server's computer graphic objects. Figure 2 shows a data flow overview of the system.

To achieve low latency pose estimation, the system is based upon efficient feature matching techniques. At every frame, processing is distributed between the smartphone and the server in the following procedures; a sketch of the query-set wire format follows the lists.

Smartphone on-board processing:
1. A new frame is acquired from the video camera.
2. Extract keypoints in the frame and build a query set of keypoint locations and descriptors.
3. Send the query set to the server over WiFi.
4. Receive a camera pose estimate relative to the Kinect along with the positions of virtual objects and an occlusion mask for rendering.
5. Render the virtual objects with the occlusion mask over the live camera feed.

PC server processing:
1. Capture colour and depth images from the Kinect.
2. Extract keypoints from the Kinect images and build an indexed reference set of keypoint descriptors.
3. Wait for an incoming request from a smartphone client and receive a query set of keypoint descriptors.
4a. Predict where each reference keypoint from the Kinect should appear in the viewpoint of the smartphone and perform an active search for the smartphone's query set.
4b. Should the active search fail, perform full feature matching between the query set and the reference set.
5. Estimate the pose of the smartphone.
6. Transform the Kinect's 3D model data to the phone's viewpoint.
7. Update the virtual object physics and test for occlusion of virtual objects from the phone's point of view using the transformed depth map.
8. Send the pose, virtual object locations and occlusion mask back to the phone.
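To make the per-frame exchange concrete, the sketch below shows one plausible wire layout for the query set sent in smartphone step 3, sized to match the figures given in section 4.1 (8 bytes of position plus 32 bytes of descriptor per keypoint, roughly 20 kB for 500 keypoints). The struct and helper names are illustrative assumptions rather than the actual implementation.

// Hypothetical wire format for the smartphone -> server query set.
// Sizes follow the paper: 8 bytes of position + 32 bytes of descriptor
// per keypoint, so ~500 keypoints cost about 20 kB per frame.
#include <cstdint>
#include <cstring>
#include <vector>

#pragma pack(push, 1)
struct QueryKeypoint {
    float   u, v;            // 8 bytes: sub-pixel image position
    uint8_t descriptor[32];  // 32 bytes: RHIPS descriptor (4 x 64 bits)
};
#pragma pack(pop)
static_assert(sizeof(QueryKeypoint) == 40, "40 bytes per keypoint as in the paper");

// Serialise a frame's keypoints into a byte buffer ready to be sent over WiFi.
std::vector<uint8_t> packQuerySet(uint32_t frameId,
                                  const std::vector<QueryKeypoint>& keypoints) {
    std::vector<uint8_t> buffer(sizeof(frameId) +
                                keypoints.size() * sizeof(QueryKeypoint));
    std::memcpy(buffer.data(), &frameId, sizeof(frameId));
    if (!keypoints.empty())
        std::memcpy(buffer.data() + sizeof(frameId), keypoints.data(),
                    keypoints.size() * sizeof(QueryKeypoint));
    return buffer;
}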

Figure 2: System overview: distributed processing dataflow

4 SMARTPHONE PROCESSING

The smartphone captures colour images from its video camera in YUV422 format at 15Hz. The camera's intrinsic parameters have been pre-calibrated and are stored on the server. Captured images are converted to greyscale for image processing, while the colour image is directly rendered as the background for AR. The greyscale image is downsampled by half to construct a two-layer image pyramid. FAST-9 corners [17] are extracted on both pyramid levels.
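As a minimal illustration of the pyramid construction described above, the following sketch builds the half-resolution level by averaging 2x2 blocks of an 8-bit greyscale image. It is a plain C++ sketch of the idea only; the on-phone implementation uses NEON-optimised routines and its exact filtering is not specified here.

#include <cstdint>
#include <vector>

struct GreyImage {
    int width = 0, height = 0;
    std::vector<uint8_t> pixels;  // row-major, 8-bit greyscale
};

// Build the second pyramid level by averaging each 2x2 block of the
// full-resolution greyscale image (a simple box filter plus decimation).
GreyImage halfSample(const GreyImage& src) {
    GreyImage dst;
    dst.width  = src.width / 2;
    dst.height = src.height / 2;
    dst.pixels.resize(static_cast<size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const int sx = 2 * x, sy = 2 * y;
            const int sum = src.pixels[sy * src.width + sx] +
                            src.pixels[sy * src.width + sx + 1] +
                            src.pixels[(sy + 1) * src.width + sx] +
                            src.pixels[(sy + 1) * src.width + sx + 1];
            dst.pixels[y * dst.width + x] = static_cast<uint8_t>(sum / 4);
        }
    }
    return dst;
}
// FAST-9 corners [17] are then detected on both src and halfSample(src).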

4.1 RHIPS

For each corner, we build a descriptor using a novel variant of Histogram of Intensity PatcheS (HIPS) [20]. Our descriptor is rotation-covariant (rather than invariant) in the sense, taken from [11], that it is designed so that the approximate global orientation of all descriptors in a query image relative to those in a reference image can be quickly computed. All the descriptors in the query image can then be rotated by this value with minimal computational effort so that they can be matched to those in the reference image. The query and reference descriptors are built differently so that the reference descriptors are invariant to small affine deformations (including rotation), but not to large rotations, since these have already been normalised out. The advantage of this approach over using rotation-invariant descriptors is that rotation-covariant descriptors offer greater discriminative power. Where the four corners of a box may look very similar to a rotation-invariant descriptor, they will look different to our descriptor and, provided the global image orientation has been correctly computed, they will be correctly matched.

Gravity-aware descriptors [8] show how viewpoint correction can be applied to feature points to provide more accurate feature correspondences. These descriptors use the gravity vector from an accelerometer to warp a locally planar patch around a feature and compensate for a different view. This requires a potentially costly warping process on the smartphone. Unlike the gravity-aware descriptor, RHIPS offloads viewpoint correction onto the server. The client is merely required to sample a fixed ring pattern around the point; the server performs the viewpoint correction by finding a globally consistent rotation that aligns a set of query descriptors with a set of reference descriptors.

Like its predecessor, this rotation-covariant version (RHIPS) quantises a patch of 64 pixels centred around the corner. However, the sampling pattern of the patch is altered from a regular square grid to four concentric circles (see Figure 3). The rotational symmetry of the sampling pattern is important because it enables the change to a descriptor under in-plane rotation to be quickly calculated by manipulating the descriptor bits. RHIPS reduces the keypoint representation to 8 bytes for image position and 32 bytes for the descriptor. The smartphone typically extracts 500 features which are sent to the server via WiFi, where the rest of the processing takes place. The amount of data to be sent is 500*40 bytes, about 20 times less than the 800*480 bytes required to send an uncompressed image.

The 64-pixel feature patch is sampled along 4 concentric Bresenham circles of radius 3, 6, 8 and 9. The support region for the patch requires the scale to be determined at the detector stage; our system builds features according to the scale at which the FAST feature was found. The sampled pixel intensities are normalised for variance and mean before quantising each pixel's normalised intensity into one of 4 bins. Although using 4 bins incurs a slight loss of distinctiveness compared with 5 bins, 4 levels are convenient for optimisations with SIMD instruction sets. The bin boundaries are placed at -0.675σ, 0 and +0.675σ, providing an even distribution across the four bins if the patch intensity distribution is Gaussian. The descriptor, D, is built in the same manner as HIPS, creating a descriptor D[i].bit(j) of 4 x 64 bits, where D[i].bit(j) = 1 if pixel j has quantised intensity i and 0 otherwise. The pixels are sampled in the order shown in Figure 3 (i.e. a radial sector at a time, moving around the circle). This order is chosen so that rotating all four D[i] by 4 bits corresponds to rotating the sample patch by 1/16 of a full revolution, hence any rotation can be approximated by a rotation by a multiple of 4 bits. This allows the descriptor of an in-plane rotated feature to be calculated without any resampling of rotated image patches. In a small deviation from HIPS, we express the error function, E, between a query descriptor, q, and a reference descriptor, r, as an efficient bitwise operation:

E = \mathrm{Error}(r, q) = \mathrm{bitcount}(q \wedge \neg r)    (1)

Figure 3: Left: HIPS sample patch, Right: RHIPS sample patch

4.2 Affine Viewpoint Invariance

The original HIPS descriptor obtains affine viewpoint invariance by sampling from a large number of artificially warped feature patches covering in-plane rotation, scale and affine transformations. The large number of warps prohibits building a database of affine invariant descriptors in real time. By contrast, RHIPS takes a computationally efficient approach. Reference feature descriptors, which form the database to be searched, are built with a tolerance to viewpoint variations by bitwise ORing the descriptor with the descriptors of the 8 neighbouring pixels:

D_R(x, y) = \bigvee_{-1 \le i, j \le 1} D(x + i, y + j)    (2)

The effect of ORing a number of descriptors together to form the reference descriptor r used in equation 1 is to make matching more permissive. The error score counts the number of pixels in the query descriptor q that have a quantised intensity never seen in the nine descriptors used to build r. Because RHIPS treats the pixels in the patch as independent for the purposes of matching, this operation means that D_R directly represents all possible deformations in which each pixel lands within one pixel of its canonical position. This includes small rotations (where the outer ring of pixels moves around the ring by up to 1 pixel tangentially), small changes in scale (where the outer ring of pixels moves in or out by up to 1 pixel radially), and similar translations and shears. In practice, these descriptors also continue to match well under changes of up to about double this amount because neighbouring pixel values are well correlated. This approximation of affine invariance is many orders of magnitude faster than sampling hundreds of warped image patches for each feature.

4.3 Active Search

The server retrieves colour and depth images from the Kinect to build a set of reference keypoints. A 5-layer greyscale image pyramid with downsample factors of 1, 1.5, 2, 3 and 4 is used, and FAST corners with RHIPS descriptors are extracted at all levels. This provides some scale invariance to the system. Each keypoint has an associated depth from the Kinect's depth image.

Assuming good quality tracking from the previous frames, the server predicts the smartphone's next pose using a decaying alpha-beta velocity model. The reference keypoints are transformed according to the predicted pose and projected onto the phone's image plane. Such transformations are performed in uvq coordinates as described later in section 4.5. The associated reference descriptors are rotated according to the predicted in-plane rotation. The advantage of RHIPS here is that descriptor prediction only involves a few bit shift operations, as opposed to resampling from an image. Query keypoints search within a fixed radius for a match with a predicted keypoint. A match is deemed to be the smallest descriptor error from equation 1 under a fixed threshold.

From the set of keypoint matches, the smartphone camera pose is estimated as described in section 4.5. The quality of the pose is assessed to determine whether it is necessary to perform full keypoint matching between the query and reference descriptor sets. The metric for pose quality is based on the number of inlier matches from RANSAC and the magnitude of the difference between the estimated pose and the previous pose. Should the number of keypoint matches fall below a threshold, or the frame-to-frame motion translate beyond a threshold, tracking has failed and relocalisation from feature matching is required.

4.4 Orientation Correction and Matching

Feature matching with in-plane rotational invariance enables matching between two rotated views; however, the invariance comes with a trade-off in descriptor distinctiveness. A better approach is to adjust the descriptors for rotation covariance. RHIPS determines a globally consistent orientation between a set of query descriptors and the reference database by first matching rotationally invariant descriptors. A database of rotationally invariant features is built from canonically rotated descriptors. All query descriptors are likewise rotated to their canonical orientation before matching. Each match votes for a global orientation correction based on the difference between the query and database features' canonical orientations. Query descriptors can then be matched against a second database built from the orientation-corrected descriptors.
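The descriptor operations described in sections 4.1, 4.2 and 4.4 reduce to a handful of bitwise manipulations. The sketch below is our own illustrative reconstruction (type and function names are assumptions): it quantises normalised samples into the four bins, packs them into four 64-bit planes, rotates a descriptor by whole sectors using 4-bit circular shifts, ORs descriptors as in equation (2), and evaluates the error of equation (1).

#include <bitset>
#include <cstdint>

// RHIPS descriptor: four 64-bit planes; plane i has bit j set when sample j of
// the 64-pixel ring pattern falls into quantised intensity bin i (section 4.1).
struct RhipsDescriptor {
    uint64_t plane[4] = {0, 0, 0, 0};
};

// Quantise a mean/variance-normalised sample into one of 4 bins with
// boundaries at -0.675, 0 and +0.675 (equal-mass bins for a Gaussian patch).
inline int quantiseIntensity(float normalised) {
    if (normalised < -0.675f) return 0;
    if (normalised < 0.0f)    return 1;
    if (normalised < 0.675f)  return 2;
    return 3;
}

// Build a descriptor from 64 normalised samples taken sector by sector around
// the four concentric Bresenham circles.
RhipsDescriptor buildDescriptor(const float samples[64]) {
    RhipsDescriptor d;
    for (int j = 0; j < 64; ++j)
        d.plane[quantiseIntensity(samples[j])] |= (uint64_t{1} << j);
    return d;
}

inline uint64_t rotl64(uint64_t x, int s) {
    s &= 63;
    return s == 0 ? x : (x << s) | (x >> (64 - s));
}

// Rotating the sampling pattern by k/16 of a revolution is a circular rotation
// of every plane by 4*k bits -- no image resampling is required (rotation
// direction chosen arbitrarily for this sketch).
RhipsDescriptor rotateDescriptor(const RhipsDescriptor& d, int sixteenths) {
    RhipsDescriptor out;
    for (int i = 0; i < 4; ++i)
        out.plane[i] = rotl64(d.plane[i], 4 * (sixteenths & 15));
    return out;
}

// ORing descriptors builds the permissive reference descriptor of equation (2);
// in the system this is done over the 8-neighbourhood of the feature pixel.
RhipsDescriptor orDescriptors(const RhipsDescriptor& a, const RhipsDescriptor& b) {
    RhipsDescriptor r;
    for (int i = 0; i < 4; ++i) r.plane[i] = a.plane[i] | b.plane[i];
    return r;
}

// Matching error of equation (1): count query samples whose quantised intensity
// was never seen in the (ORed) reference descriptor.
int matchError(const RhipsDescriptor& q, const RhipsDescriptor& r) {
    int e = 0;
    for (int i = 0; i < 4; ++i)
        e += static_cast<int>(std::bitset<64>(q.plane[i] & ~r.plane[i]).count());
    return e;
}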

Figure 4: RHIPS keypoint correspondences

Figure 5: RHIPS keypoint correspondences under rotation
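The global orientation correction of section 4.4 can be sketched as a simple voting scheme over tentative matches between canonically rotated descriptors: each match votes for the relative rotation between the two canonical orientations (quantised to 16 directions) and the most popular bin is taken as the global in-plane rotation. The structure and names below are assumptions made for illustration only.

#include <algorithm>
#include <array>
#include <iterator>
#include <vector>

// One tentative match between a query and a reference keypoint, each with a
// canonical orientation quantised to one of 16 directions (section 4.4.1).
struct OrientationMatch {
    int queryOrientation;      // 0..15
    int referenceOrientation;  // 0..15
};

// Each match votes for the rotation that maps its query orientation onto its
// reference orientation; the most popular bin is the global correction.
int estimateGlobalRotation(const std::vector<OrientationMatch>& matches) {
    std::array<int, 16> votes{};  // zero-initialised histogram
    for (const auto& m : matches) {
        const int delta =
            ((m.referenceOrientation - m.queryOrientation) % 16 + 16) % 16;
        ++votes[delta];
    }
    return static_cast<int>(std::distance(
        votes.begin(), std::max_element(votes.begin(), votes.end())));
}
// All query descriptors are then rotated by the returned number of
// sixteenth-turns (4-bit shifts) before matching against the
// orientation-corrected reference database.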

4.4.1 Canonical Orientation

A number of approaches to determining canonical orientation are discussed by Rosin [16]. The original HIPS descriptor determines the orientation of the patch by summing the gradient differences between diametrically opposed pixels in the FAST Bresenham circle; the angle of the final vector sum is taken as the canonical orientation. ORB's oFAST [18] explores various ways of determining canonical orientation, including the direction of the largest gradient within a circular support patch and the largest orientation bin in a SIFT-style histogram of gradient orientations. The most stable method used by oFAST was the intensity centroid of pixels within a circular support patch. Due to the circular symmetry of the RHIPS descriptor, we only require a canonical orientation of the descriptor, not of the feature support region. The canonical orientation used in RHIPS is determined by the intensity centroid of the quantised sampled feature patch. The quantised intensity centroid can be derived directly from the descriptor, avoiding any further image pixel accesses:

C = (m_{10} / m_{00}, \; m_{01} / m_{00})    (3)

where m_{pq} are the moments of the quantised feature,

m_{pq} = \sum_{i,j} \mathrm{offset}_j.x^p \cdot \mathrm{offset}_j.y^q \cdot D[i].\mathrm{bit}(j) \cdot i    (4)

The orientation is quantised into one of 16 directions separated by 2π/16.

4.4.2 Reference Descriptor Tree Datastructure

To improve matching efficiency, a binary tree index for the reference descriptors similar to that described in [19] is used; however, we use a faster method of building the tree. We sort the reference descriptors by bit count (number of 1s) to produce an initial working set. Starting with the lowest bit count, we find a preferred partner for this descriptor by scanning the rest of the working set for the descriptor which, when ORed with it, gives the lowest resulting bit count. We then find the preferred partner for this descriptor and repeat the process until we find two descriptors that wish to partner with each other. These descriptors are then removed from the working set and merged by ORing their bit patterns together, and the result is inserted back into the working set in sort order.
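A sketch of the greedy pairing step used to build the reference descriptor tree is given below. It treats each reference descriptor as a flat 256-bit pattern and performs one merge of a mutually preferred pair; the BitPattern type and the omission of the tree bookkeeping (child links) are our own simplifications, not the actual data structures.

#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A reference descriptor viewed as a flat 256-bit pattern (4 x 64-bit planes).
struct BitPattern {
    uint64_t bits[4] = {0, 0, 0, 0};
    int bitCount() const {
        int c = 0;
        for (uint64_t w : bits) c += static_cast<int>(std::bitset<64>(w).count());
        return c;
    }
    BitPattern operator|(const BitPattern& o) const {
        BitPattern r;
        for (int i = 0; i < 4; ++i) r.bits[i] = bits[i] | o.bits[i];
        return r;
    }
};

// One greedy merge of the tree-building scheme in section 4.4.2: starting from
// the lowest-bit-count descriptor, follow "preferred partner" links (the
// partner minimising the ORed bit count) until two descriptors prefer each
// other, then merge them and re-insert the result in sorted order.
void mergeOnePair(std::vector<BitPattern>& working) {
    assert(working.size() >= 2);
    std::sort(working.begin(), working.end(),
              [](const BitPattern& a, const BitPattern& b) {
                  return a.bitCount() < b.bitCount();
              });
    auto preferredPartner = [&](size_t i) {
        size_t best = (i == 0) ? 1 : 0;
        int bestCount = (working[i] | working[best]).bitCount();
        for (size_t j = 0; j < working.size(); ++j) {
            if (j == i) continue;
            const int c = (working[i] | working[j]).bitCount();
            if (c < bestCount) { bestCount = c; best = j; }
        }
        return best;
    };
    size_t a = 0, b = preferredPartner(a);
    // Walk the preference chain; bounded so the sketch cannot loop forever on ties.
    for (size_t step = 0; step < working.size() && preferredPartner(b) != a; ++step) {
        a = b;
        b = preferredPartner(a);
    }
    const BitPattern merged = working[a] | working[b];
    working.erase(working.begin() + static_cast<std::ptrdiff_t>(std::max(a, b)));
    working.erase(working.begin() + static_cast<std::ptrdiff_t>(std::min(a, b)));
    const auto pos = std::lower_bound(
        working.begin(), working.end(), merged,
        [](const BitPattern& x, const BitPattern& y) { return x.bitCount() < y.bitCount(); });
    working.insert(pos, merged);
}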

4.5 Pose Estimation

The pose of the mobile camera relative to the coordinate frame of the Kinect is computed in a conventional manner, with matched points expressed in inverse depth coordinates (u, v, 1, q) = (1/z)(x, y, z, 1). To summarise, the pose can be expressed as a rigid body transformation in SE(3):

T_{ck} = \begin{pmatrix} R & t \\ 0 \; 0 \; 0 & 1 \end{pmatrix}, \quad R \in SO(3), \; t \in \mathbb{R}^3    (5)

where the subscript ck indicates the camera coordinate frame relative to the Kinect frame. A point at a known position in the Kinect frame, p_k = (x, y, z, 1)^T, can be expressed in the camera frame by applying the rigid transformation to obtain p_c = T_{ck} p_k. p_k is projectively equivalent to (x/z, y/z, 1, 1/z)^T = (u, v, 1, q)^T, and it is more convenient to solve for pose using this formulation because (u, v) are a linear function of pixel position and errors in q from the Kinect measurement are independent of depth. Our cameras exhibit almost no radial distortion, so we assume a linear camera model for both cameras, and the value reported in the depth image from the Kinect is linearly related to q_k. This gives:

\begin{pmatrix} u_c \\ v_c \\ 1 \\ q_c \end{pmatrix} \simeq T_{ck} \begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix} = \begin{pmatrix} R (u_k, v_k, 1)^T + q_k t \\ q_k \end{pmatrix}    (6)

q_c is unobserved, so only two constraints are provided per correspondence and thus three correspondences are needed to compute T_{ck}. When this has been computed, the same equation can be used to calculate q_c, thus making it possible to transform the Kinect depth map into the mobile camera frame.

Pose is solved iteratively from an initial assumption T_{ck}^0 = I, i.e. that the mobile and Kinect coordinate frames are coincident. At each iteration, the error in T_{ck} is parameterised using the exponential map and T_{ck} is updated by:

T_{ck}^{j+1} = \exp\left( \sum_i \alpha_i G_i \right) T_{ck}^{j}    (7)

where G_i are the generators of SE(3). The six coefficients \alpha_i are then estimated by considering the projection of the Kinect coordinates into the mobile camera frame:

\begin{pmatrix} u_c \\ v_c \\ 1 \\ q_c \end{pmatrix} \simeq \exp\left( \sum_i \alpha_i G_i \right) T_{ck}^{j} \begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix}    (8)

Writing

\begin{pmatrix} u_k' \\ v_k' \\ 1 \\ q_k' \end{pmatrix} \simeq T_{ck}^{j} \begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix}    (9)

then gives the Jacobian J of partial derivatives of (u_c, v_c)^T with respect to the six \alpha_i as:

J = \begin{pmatrix} q_k' & 0 & -q_k' u_k' & -u_k' v_k' & 1 + u_k'^2 & -v_k' \\ 0 & q_k' & -q_k' v_k' & -(1 + v_k'^2) & u_k' v_k' & u_k' \end{pmatrix}    (10)

where the first three columns of J correspond to derivatives with respect to translations in x, y and z (and thus depend on q) and the last three columns correspond to rotations about the x, y and z axes respectively. This simple form of J is a nice consequence of working in (u, v, q) coordinates rather than (x, y, z). From three correspondences between (u_k, v_k, q_k) and (u_c, v_c) it is possible to stack the three Jacobians to obtain a 6x6 matrix which can be inverted to find the linearised solution for the \alpha_i from the error between the observed coordinates in the mobile camera frame and the transformed Kinect coordinates, (u_c, v_c) - (u_k', v_k'). With more than three correspondences, it is possible to compute a least squares solution for the \alpha_i. Once the \alpha_i have been computed, iteration proceeds by setting T_{ck}^{j+1} = \exp( \sum_i \alpha_i G_i ) T_{ck}^{j} and recomputing the \alpha_i. In practice, we found that 6 iterations are enough for this to converge to acceptable accuracy.
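The linearised step of this iteration can be written compactly in code. The sketch below (our own reconstruction, with assumed type names) accumulates the 6x6 normal equations from the Jacobian rows of equation (10) and the reprojection residuals, and solves them by Gaussian elimination; the surrounding loop that applies the SE(3) exponential map and repeats for about six iterations is omitted. At least three correspondences are required for the system to be solvable.

#include <array>
#include <cmath>
#include <utility>
#include <vector>

// One correspondence: a Kinect keypoint already transformed by the current
// pose estimate into the mobile frame, expressed in (u', v', q') inverse-depth
// coordinates, together with the observed position (u_c, v_c) in the phone image.
struct Correspondence {
    double u, v, q;   // transformed Kinect keypoint (primed quantities of eq. 9)
    double uc, vc;    // observed position in the mobile camera image
};

// Accumulate the 6x6 normal equations J^T J x = J^T r using the Jacobian rows
// of equation (10), then solve them by Gaussian elimination with partial
// pivoting. This is one linearised step of the iteration in section 4.5.
std::array<double, 6> solvePoseStep(const std::vector<Correspondence>& matches) {
    double A[6][6] = {}, b[6] = {};
    for (const auto& m : matches) {
        const double Ju[6] = { m.q, 0.0, -m.q * m.u, -m.u * m.v, 1.0 + m.u * m.u, -m.v };
        const double Jv[6] = { 0.0, m.q, -m.q * m.v, -(1.0 + m.v * m.v), m.u * m.v, m.u };
        const double ru = m.uc - m.u, rv = m.vc - m.v;  // reprojection residual
        for (int i = 0; i < 6; ++i) {
            b[i] += Ju[i] * ru + Jv[i] * rv;
            for (int j = 0; j < 6; ++j)
                A[i][j] += Ju[i] * Ju[j] + Jv[i] * Jv[j];
        }
    }
    // Forward elimination with partial pivoting.
    for (int col = 0; col < 6; ++col) {
        int pivot = col;
        for (int r = col + 1; r < 6; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        std::swap(b[col], b[pivot]);
        for (int j = 0; j < 6; ++j) std::swap(A[col][j], A[pivot][j]);
        for (int r = col + 1; r < 6; ++r) {
            const double f = A[r][col] / A[col][col];
            for (int j = col; j < 6; ++j) A[r][j] -= f * A[col][j];
            b[r] -= f * b[col];
        }
    }
    // Back substitution yields the six pose coefficients.
    std::array<double, 6> x{};
    for (int i = 5; i >= 0; --i) {
        double s = b[i];
        for (int j = i + 1; j < 6; ++j) s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }
    return x;
}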

Because the correspondences obtained using RHIPS contain many outliers, RANSAC was used (sampling on triples of correspondences) to obtain the set of inliers. The final pose was then calculated through an iteratively reweighted least squares optimisation over all inliers. Figures 4 and 5 show all the inlier correspondences between the Kinect and mobile camera images computed in this way.
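A generic skeleton of this RANSAC stage is sketched below, assuming a caller-supplied minimal solver over triples of correspondences and a per-correspondence inlier test; both callables and the type names are placeholders, not the system's actual interfaces.

#include <cstdlib>
#include <utility>
#include <vector>

// Generic RANSAC skeleton over minimal samples of three correspondences, as in
// section 4.5. The minimal solver and the reprojection-error test are supplied
// by the caller; the returned indices are the largest inlier set found.
template <typename Correspondence, typename SolveTriple, typename IsInlier>
std::vector<int> ransacInliers(const std::vector<Correspondence>& matches,
                               SolveTriple solveFromTriple, IsInlier isInlier,
                               int iterations = 200) {
    std::vector<int> best;
    if (matches.size() < 3) return best;
    const int n = static_cast<int>(matches.size());
    for (int it = 0; it < iterations; ++it) {
        const int a = std::rand() % n, b = std::rand() % n, c = std::rand() % n;
        if (a == b || b == c || a == c) continue;   // need three distinct matches
        const auto pose = solveFromTriple(matches[a], matches[b], matches[c]);
        std::vector<int> inliers;
        for (int i = 0; i < n; ++i)
            if (isInlier(pose, matches[i])) inliers.push_back(i);
        if (inliers.size() > best.size()) best = std::move(inliers);
    }
    return best;  // the final pose is then re-estimated on these inliers (IRLS)
}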


4.6 Occlusion

Because virtual content may be occluded by real objects (both animate and inanimate), it is desirable to detect these occluders and avoid rendering the virtual objects where elements of the mobile camera image are in front of them. Occlusion testing is calculated on the server using the dense depth data from the Kinect. Occlusions will be different in the Kinect's view than in that of the mobile camera, so it is necessary to transform the Kinect's depth image into the mobile view, as shown in Figure 6.
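A simplified version of this depth-map transformation is sketched below. For clarity it works in metric (x, y, z) coordinates with explicit pinhole intrinsics, whereas the system performs the equivalent operation in (u, v, q) coordinates; the type names and the nearest-point resolution of collisions are our assumptions.

#include <cmath>
#include <vector>

// Minimal pinhole model: focal lengths and principal point.
struct Intrinsics { double fx, fy, cx, cy; };

// Row-major 4x4 rigid transform from the Kinect frame to the phone frame.
struct Transform4x4 { double m[4][4]; };

// Reproject a Kinect depth map (metres, 0 = no reading) into the phone's
// viewpoint, producing a depth buffer in the phone image. Where several
// Kinect points land on the same phone pixel, the nearest one wins.
std::vector<float> transformDepthMap(const std::vector<float>& kinectDepth,
                                     int kw, int kh,
                                     const Intrinsics& kin, const Intrinsics& cam,
                                     const Transform4x4& Tck,
                                     int cw, int ch) {
    std::vector<float> phoneDepth(static_cast<size_t>(cw) * ch, 0.0f);
    for (int y = 0; y < kh; ++y) {
        for (int x = 0; x < kw; ++x) {
            const float z = kinectDepth[y * kw + x];
            if (z <= 0.0f) continue;                      // no depth reading
            // Back-project the Kinect pixel to a 3D point in the Kinect frame.
            const double X = (x - kin.cx) / kin.fx * z;
            const double Y = (y - kin.cy) / kin.fy * z;
            const double P[4] = { X, Y, z, 1.0 };
            double Q[3] = { 0.0, 0.0, 0.0 };
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 4; ++c) Q[r] += Tck.m[r][c] * P[c];
            if (Q[2] <= 0.0) continue;                    // behind the phone camera
            const int u = static_cast<int>(std::lround(cam.fx * Q[0] / Q[2] + cam.cx));
            const int v = static_cast<int>(std::lround(cam.fy * Q[1] / Q[2] + cam.cy));
            if (u < 0 || u >= cw || v < 0 || v >= ch) continue;
            float& d = phoneDepth[v * cw + u];
            if (d == 0.0f || Q[2] < d) d = static_cast<float>(Q[2]);  // keep nearest
        }
    }
    return phoneDepth;
}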

Figure 6: Transformation of the Kinect's depth map into the mobile camera's viewpoint.

Figure 7: Occlusion test with a virtual sphere to produce a binary occlusion mask

Figure 8: Poster scene experimental setup

Virtual objects are tested for occlusion against the transformed depth image (see Figure 7) to produce a binary occlusion mask. Only the necessary data, a bitmask rather than the full transformed depth data, is sent, reducing the latency over the limited-bandwidth WiFi link. The camera pose and virtual object positions are finally sent along with the occlusion mask to the smartphone for rendering.
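The mask construction can be sketched as a per-pixel depth comparison between the transformed real-world depth and the rendered depth of the virtual objects, both in the phone's viewpoint. The buffer layout, the small depth margin and the packing comment below are our own assumptions.

#include <cstdint>
#include <vector>

// Build the binary occlusion mask sent to the phone: a virtual object's pixel
// is marked occluded when the transformed real-world depth at that pixel is
// valid and closer to the phone than the virtual surface. Both buffers are in
// the phone's viewpoint; zero means "no depth" in either buffer.
std::vector<uint8_t> buildOcclusionMask(const std::vector<float>& realDepth,
                                        const std::vector<float>& virtualDepth,
                                        float margin = 0.01f) {
    std::vector<uint8_t> mask(realDepth.size(), 0);
    for (size_t i = 0; i < realDepth.size(); ++i) {
        const bool haveReal = realDepth[i] > 0.0f;
        const bool haveVirtual = virtualDepth[i] > 0.0f;
        if (haveReal && haveVirtual && realDepth[i] + margin < virtualDepth[i])
            mask[i] = 1;  // real surface in front: do not draw the virtual pixel
    }
    return mask;  // presumably packed to one bit per pixel before transmission
}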

Figure 9: Cluttered scene experimental setup



5 EXPERIMENTS AND RESULTS

A number of experiments were conducted to measure the tracking performance and latency of the system.

5.1 Tracking Performance

In general, visual tracking systems perform best when the two cameras have similar viewpoints and degrade as the viewpoints diverge. In particular, this depends upon the angle subtended at the keypoints in the scene by the two cameras. To assess the accuracy and stability of our pose estimation, we conducted the following turntable experiments. A smartphone was mounted on a turntable and a stationary Kinect was positioned to view the same scene, as shown in Figure 8. Figures 8 and 9 show the setup for viewing a poster and a cluttered scene. The smartphone was moved relative to the Kinect in 5 degree increments and performance measurements were taken at each relative position. Using this experimental setup, we measured both matching performance and localisation accuracy.

Matching Performance

The number of keypoints successfully matched as inliers between the smartphone and Kinect images was counted. This shows how the matching performance of RHIPS degrades as the two viewpoints diverge. Figure 10 shows how the performance changes for these two scenes as the viewing angles diverge. For small viewing angle differences, the number of inlier matches is a function of the scene geometry and occlusions. As the angle becomes larger, this is dominated by the degrading performance of the RHIPS matcher. For both the poster and cluttered scenes, when the relative viewing angle becomes larger than 70 degrees, tracking becomes intermittent and fails altogether shortly thereafter.


Figure 10: Matching performance measured as the number of inliers found as a function of the difference in viewing angle between the smartphone and Kinect

Localisation Accuracy

The most important measure of localisation accuracy is not the positional accuracy of the smartphone, but the reprojection error of world points into the smartphone image, measured in pixels. This is what governs the accuracy with which virtual content is registered to the smartphone view. To calculate this accuracy, the residual sum-squared error of inlying keypoint matches was calculated. Because calculating pose requires three matches (and thus three matches would generate zero residual error), the sum-squared error is divided by the number of inliers minus 3 and the square root taken to give an unbiased estimate of reprojection error (in a similar way to how unbiased estimates of standard deviation are calculated). Figure 11 shows this unbiased estimate of reprojection error as a function of the relative viewing angle between the smartphone and Kinect. As can be seen from the figure, the reprojection error degrades slightly, increasing from about 1.5 pixels to just under 2 pixels as the relative viewing angle increases.

Figure 11: Localisation accuracy measured as the reprojection error of world points into the smartphone view as a function of the difference in viewing angle between the smartphone and Kinect

Another consideration in tracking performance is the required overlap of views between the smartphone and Kinect. Due to the system's feature-based approach, the overlap required depends on the scene, in particular on the distribution of detected feature points. To test for the minimum overlap limit, an experiment was conducted using the poster scene because it provided a relatively even distribution of features to track. The smartphone camera was moved so as to only view a percentage of the poster. The phone was successfully tracked until it viewed only 25% of the poster.

5.2 Latency

The system was tested using an Intel i7 desktop running Linux and two HTC Desires running Android 2.3. To optimise performance on the phone, we take advantage of ARM's NEON SIMD instruction set to detect FAST corners and build RHIPS descriptors. The time taken by the phone to extract features and build feature descriptors is typically 14.3ms. The phone takes another 1.2ms to send the data via WiFi to the server. Typical timings on the server for each section are as follows:

Section                                            Time (ms)
Extract Reference Keypoint Set                     5.6
Build Invariant and Unrotated Keypoint Database    6.6
Receive Phone Keypoints and Store Query Set        1.6
Active Search and Pose Estimation (with RANSAC)    4.5
Global Orientation                                 2.7
Covariant Feature Matching                         2.9
Pose Estimation (with RANSAC)                      2.5
Depth Transform and Occlusion test                 4.2
Send back data to phone                            <1

Table 1: Processing times on the server

If active search is successful, the latency from the phone starting to process a frame to receiving the pose information from the server is as low as 14.3 + 1.2 + 1.6 + 4.5 + 2.7 + 1 = 25.3ms (the reference keypoint set can be extracted and indexed prior to receiving data from the phone). If the active search fails, full covariant feature matching is required, adding 14ms of latency on the server side. This amount of latency is low enough to allow a single phone to operate at 15Hz (the rate at which images arrive from the camera) and give a smooth user experience. The system is able to support two phones at 12Hz, which is still smooth enough for comfortable use (e.g. as in Figure 1). The drop in framerate was caused by suboptimal handling of UDP packets in our code. We believe this could be mitigated by improving the way we handle dropped packets.

5.3 Discussion and Limitations

The system shows good tracking under smooth motions; however, the camera phones exhibit strong motion blur under quick motions. This is certain to cause tracking failure in almost any feature-based matching. The system is quick to relocalise the phone in the next unblurred frame. Figure 12 shows the system continuing to operate successfully when the phone occludes part of the Kinect view. The occlusions on the smartphone appear correctly, as shown in Figure 13, when the phone is at a similar distance to the scene as the Kinect or further away. However, when the phone moves closer to the scene, holes in the occlusion mask become apparent due to the sparser depth map. Possible future work to remedy this would be to build 3D surfaces from the Kinect rather than using a point cloud for occlusion.

Figure 12: Kinect view showing virtual content despite smartphone occluding part of the scene. Note that the virtual content is also correctly rendered on the smartphone

Figure 13: Smartphone view showing virtual content (model of Earth) being correctly occluded by objects in the real world scene

6 CONCLUSION

This paper presents a distributed AR system with support for two phones. The system uses a desktop PC to provide real-time AR services including feature-based tracking and occlusion testing. The phone gleans the most distinct information from its video camera and summarises the data into a set of binary feature descriptors. The efficient and compact feature descriptor Histogram of Intensity PatcheS was extended to provide quick feature matching, even under in-plane rotation. The compact representation is a key feature that enables real-time performance over a limited WiFi link. The server also calculates a binary occlusion mask to correctly render virtual content in a real scene. Results show that the system is able to support an Android phone at its full video camera framerate (15Hz) and two phones at 12Hz, limited by the latency of the WiFi links. The system shows adequate tracking, even under large viewpoint variations between the smartphone and the Kinect, for a variety of scenes.

REFERENCES
[1] G. Bleser and D. Stricker. Advanced tracking through efficient image processing and visual-inertial sensor fusion. Computers & Graphics, 33(1):59-72, 2009.

[2] G. Bleser, H. Wuest, and D. Stricker. Online camera pose estimation in partially known and dynamic scenes. In Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 56-65. IEEE Computer Society, 2006.
[3] A. Comport, E. Marchand, and F. Chaumette. A real-time tracker for markerless augmented reality. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality, page 36. IEEE Computer Society, 2003.
[4] F. de Sorbier, Y. Takaya, Y. Uematsu, and H. Saito. Augmented reality for 3D TV using depth camera input. pages 117-123, 2010.
[5] G. Gordon, M. Billinghurst, M. Bell, J. Woodfill, B. Kowalik, A. Erendi, and J. Tilander. The use of dense stereo range data in augmented reality. In Proceedings of the 1st International Symposium on Mixed and Augmented Reality, page 14. IEEE Computer Society, 2002.
[6] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In IWAR, page 85. IEEE Computer Society, 1999.
[7] G. Klein and T. Drummond. Robust visual tracking for non-instrumented augmented reality. In Mixed and Augmented Reality (ISMAR), 2003 2nd IEEE and ACM International Symposium on, pages 113-122. IEEE Computer Society, 2003.
[8] D. Kurz and S. Benhimane. Gravity-aware handheld augmented reality. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 111-120. IEEE, 2011.
[9] W. Lee and J. Park. Augmented foam: A tangible augmented reality for product design. In Mixed and Augmented Reality, 2005. Proceedings. Fourth IEEE and ACM International Symposium on, pages 106-109. IEEE, 2005.
[10] V. Lepetit, L. Vacchetti, D. Thalmann, and P. Fua. Fully automated and stable registration for augmented reality applications. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality, page 93. IEEE Computer Society, 2003.
[11] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1):43-72, 2005.
[12] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127-136. IEEE, 2011.
[13] Y. Ohta, Y. Sugaya, H. Igarashi, T. Ohtsuki, and K. Taguchi. ShareZ: Client/server depth sensing for see-through head-mounted displays. Presence: Teleoperators & Virtual Environments, 11(2):176-188, 2002.
[14] Y. Park, V. Lepetit, et al. Multiple 3D object tracking for augmented reality. In Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 117-120. IEEE Computer Society, 2008.
[15] G. Reitmayr and T. Drummond. Going out: robust model-based tracking for outdoor augmented reality. In Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 109-118. IEEE Computer Society, 2006.
[16] P. Rosin. Measuring corner properties. Computer Vision and Image Understanding, 73(2):291-307, 1999.
[17] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. Computer Vision - ECCV 2006, pages 430-443, 2006.
[18] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: an efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564-2571. IEEE, 2011.
[19] S. Taylor and T. Drummond. Multiple target localisation at over 100 fps. In British Machine Vision Conference, volume 4. Citeseer, 2009.
[20] S. Taylor, E. Rosten, and T. Drummond. Robust feature matching in 2.3µs. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, pages 15-22. IEEE, 2009.
[21] J. Yu and J. Kim. Camera motion tracking in a dynamic scene. In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE International Symposium on, pages 285-286. IEEE, 2010.
