The use of a depth map in 2D to 3D conversion has several advantages:
The resolution of the depth map may be lower than that of the associated 2D image;
It can be highly compressed;
2D compatibility is maintained; and
Real time generation of stereo, or multiple stereo pairs, is possible.
The main disadvantage has been the laborious nature of the manual conversion techniques used to
create depth maps from existing 2D images, which results in a slow and costly process. An alternative,
highly productive technique has been developed based upon the use of Machine Learning Algorithms
(MLAs). This paper describes the application of MLAs to the generation of depth maps and presents
the results of the commercial application of this approach.
Keywords: 2D to 3D conversion, autostereoscopic displays, machine learning
1. INTRODUCTION
The last few years have seen a dramatic increase in the demand for stereo content. This has largely
been driven by the commercial availability of multiview autostereoscopic displays, such as those
manufactured by Stereographics [1], 4D-Vision [2] and Philips [3].
Such displays require a number of adjacent views of the scene, typically eight or nine, rather than the
simple left and right eye views of previous stereoscopic display technologies. Whilst original content
can be created for such displays using CGI based material, consumer demand appears strongest in
video formats. Recording multiple views live using synchronised cameras has been attempted but,
particularly for inside shots, has proven both cumbersome and time consuming.
We have previously presented the advantages of 2D to 3D conversion by generating a depth map from
the original 2D image [4]. This technique enables the conversion of existing content, as well as live
broadcasting and recording, to be undertaken at a commercial level of service.
The use of a depth map as part of the 2D to 3D conversion process has a number of desirable
characteristics. Consumer testing has indicated that, with currently available autostereoscopic displays,
the resolution of the depth map may be substantially less than that of the associated 2D image before
any degradation of the stereo image becomes apparent. Typically, for NTSC video resolution 2D
images, a reduction of 4:1 may be used.
Since the depth map is of lower resolution and only contains luminance information, its bandwidth and
storage requirements are lower than the associated 2D image. Optimum compression techniques can
be used to reduce the depth map to less than 2% of the size of its associated 2D image [4]. This
enables the depth map to be embedded in the original 2D image with minimal overhead and with the
ability to deliver a 2D compatible 3D™ image.
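As a rough illustration of why the depth map adds so little overhead, the sketch below compares the losslessly compressed size of a luminance-only, 4:1-reduced depth map against an NTSC-resolution frame. The data is synthetic and zlib merely stands in for whatever codec is actually used, so the exact ratio is illustrative only:

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical NTSC-resolution RGB frame (720x480) full of detail
# (modelled here as noise), and a 4:1 reduced, luminance-only depth
# map (180x120) modelled as a smooth vertical depth ramp.
frame = rng.integers(0, 256, size=(480, 720, 3), dtype=np.uint8)
depth = np.repeat(np.linspace(0, 255, 120).astype(np.uint8)[:, None], 180, axis=1)

# The small, smooth depth map compresses far better than the frame.
frame_bytes = len(zlib.compress(frame.tobytes(), 9))
depth_bytes = len(zlib.compress(depth.tobytes(), 9))
print(f"depth map is {100 * depth_bytes / frame_bytes:.2f}% of the compressed frame")
```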
Software or hardware decoders can subsequently render, in real time, either a single stereo pair, or a
series of perspective images, suitable for driving a wide range of stereoscopic displays [4][2][5].
Existing approaches to creating depth maps from 2D images have included:
Hand drawn object outlines manually associated with an artistically chosen depth value; and
Semi-automatic outlining with corrections made manually by an operator.
Each of these has a number of drawbacks. Hand drawing produces high quality depth maps but is very
time consuming and expensive. Semi-automatic outlining is generally unreliable where complex
outlines are encountered.
Although the fully automated recovery of depth from monocular image sequences is possible under
certain conditions, the operational constraints associated with such techniques limit their commercial
viability. These approaches generally fall into one of the following two categories:
1. Depth from motion: The relationship between the motion of an object (relative to the camera) and its distance from the camera can be used to calculate depth maps by analyzing optic flow [8]. This technique can only recover relative depth accurately if the motion of all objects is directly proportional to their distance from the camera. This assumption only holds in a relatively small proportion of the footage encountered (for example, a camera panning across a stationary scene). This principle, which exploits motion parallax, is also the basis of single lens stereo systems [9] and stereopsis by binocular delay [10].
2. Structure from motion (SFM): SFM is an active area of computer vision research in which correspondences between subsequent frames (or similar views of the same scene) are used to determine depth and recover camera parameters [11][12]. A restriction of this approach is that the 3D scene must be predominantly static; that is, objects must remain stationary. Furthermore, the camera must be moving relative to this static scene. Although this technique is used in the special effects industry for compositing live action footage with CGI, its application to depth recovery appears limited.
It should also be noted that these techniques rely on finding correspondences between frames, a process
that is unreliable in the presence of low textured, fast moving objects. These fully automated
techniques cannot recover depth in the absence of any motion.
3.1.1 Training
During training, samples of the form (x, y, r, g, b) are presented as inputs to the MLA together with an associated depth value z.
The MLA will adjust its internal configuration to learn the relationships between the samples and
their associated depth. The details of this learning process vary according to the algorithm used.
Popular algorithms include Decision Trees and Neural Networks [13], the specifics of which are
beyond the scope of this paper.
3.1.2 Classification
During classification samples with unknown depth values are presented to the MLA, which uses the
relationship established during training to determine an output depth value.
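As a concrete sketch of these two phases, the snippet below uses a one-nearest-neighbour regressor as a hypothetical stand-in MLA (the decision trees and neural networks cited above are beyond the scope of this paper): training stores (x, y, r, g, b) samples with their depths, and classification returns the depth of the closest stored sample.

```python
import numpy as np

def train(samples: np.ndarray, depths: np.ndarray):
    """'Training' for a 1-nearest-neighbour stand-in MLA: store the
    (x, y, r, g, b) samples and their associated depths."""
    return samples.astype(float), depths.astype(float)

def classify(model, queries: np.ndarray) -> np.ndarray:
    """Classification: for each unseen (x, y, r, g, b) sample, output the
    depth of the closest training sample in feature space."""
    samples, depths = model
    # Pairwise squared distances between queries and training samples.
    d2 = ((queries[:, None, :].astype(float) - samples[None, :, :]) ** 2).sum(axis=2)
    return depths[d2.argmin(axis=1)]

# Toy scene: a sky pixel (blue, near the top) is far, a ground pixel is near.
train_x = np.array([[10, 5, 40, 60, 200],    # sky sample
                    [50, 90, 80, 140, 30]])  # ground sample
train_z = np.array([255.0, 40.0])            # far, near

model = train(train_x, train_z)
z = classify(model, np.array([[12, 8, 45, 55, 190]]))  # sky-like query
print(z)  # depth of the nearest (sky) training sample
```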
The learning process is applied in two related phases of the rapid 2D to 3D conversion process:
1. Key frame generation, in which depth maps are produced for selected key frames from operator-supplied training samples; and
2. Depth tweening, in which the trained MLAs generate depth maps for the frames between key frames.
Figure 1: (left) An example source frame; the dots indicate the position of training samples, and the colour of each dot indicates the depth associated with that pixel. A horizon line may be used to add depth ramps. (right) The completed depth map derived from the MLA with added depth ramp.
The images used in this test are taken from "Ultimate Gs: Zac's Flying Dream", Copyright 1999, Sky High Entertainment. All rights reserved.
Figure 1 indicates how an MLA provided with a relatively small number of training samples, as
indicated by the depth coloured dots on the source frame, can generate an accurate depth map. In this
instance 623 samples were used; this represents approximately 0.2% of the total number of pixels in
the image. In more complex scenes additional training data is required, but it is rarely necessary to
supply more than 5% of the image as training samples to achieve an acceptable result.
In this example, the results from the MLA are composited on top of a perspective depth ramp by
adding a horizon line. Depth maps are median filtered and smoothed to reduce stereoscopic rendering
artifacts.
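A minimal sketch of this post-processing, assuming 8-bit depth values in which larger means farther; the per-pixel minimum used to blend the MLA output with the ramp, and the plain 3x3 median filter, are illustrative assumptions rather than the production method:

```python
import numpy as np

def composite_ramp(depth: np.ndarray, horizon_row: int) -> np.ndarray:
    """Blend a perspective depth ramp below the horizon line into the
    MLA depth map: far (255) at the horizon, near (0) at the bottom.
    The per-pixel minimum (nearer value wins) is an assumed rule."""
    out = depth.astype(float).copy()
    h = depth.shape[0]
    ramp = np.linspace(255.0, 0.0, h - horizon_row)
    out[horizon_row:, :] = np.minimum(out[horizon_row:, :], ramp[:, None])
    return out

def median3(depth: np.ndarray) -> np.ndarray:
    """3x3 median filter to suppress isolated misclassified pixels that
    would otherwise cause stereoscopic rendering artifacts."""
    pad = np.pad(depth, 1, mode="edge")
    stack = [pad[y:y + depth.shape[0], x:x + depth.shape[1]]
             for y in range(3) for x in range(3)]
    return np.median(np.stack(stack), axis=0)

depth = np.full((6, 4), 200.0)
depth[2, 1] = 0.0                        # isolated misclassified pixel
smoothed = median3(composite_ramp(depth, horizon_row=3))
```

After filtering, the isolated outlier at (2, 1) is replaced by the surrounding depth value.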
3.1.4 Depth Tweening
Depth maps are generated for key frames using the process described above. These frames are
strategically located at points in an image sequence where there is significant change in the colour
and/or position of objects. Key frames may be identified manually, or techniques used for detecting
shot transitions [14] may be used to automate this process.
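A crude way to automate key frame placement is to flag frames whose colour statistics differ markedly from the previous key frame. The histogram-difference detector below is a hypothetical stand-in, far simpler than the shot-transition techniques of [14]:

```python
import numpy as np

def keyframes(frames: np.ndarray, threshold: float = 0.25) -> list:
    """Pick key frames where the normalised grey-level histogram changes
    by more than `threshold` (L1 distance) from the previous key frame."""
    def hist(frame):
        h, _ = np.histogram(frame, bins=16, range=(0, 256))
        return h / h.sum()

    keys = [0]                      # the first frame is always a key frame
    ref = hist(frames[0])
    for i in range(1, len(frames)):
        cur = hist(frames[i])
        if np.abs(cur - ref).sum() > threshold:
            keys.append(i)
            ref = cur
    return keys

frames = np.zeros((5, 8, 8), dtype=np.uint8)
frames[3:] = 255                    # abrupt content change at frame 3
print(keyframes(frames))  # [0, 3]
```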
Figure 2: An illustration of the depth tweening process. At each key frame an MLA is trained
using the known depth map of the source image. At any given tween frame the results of these
MLAs are combined to generate a tweened depth map.
During the Depth Tweening phase of the rapid conversion process MLAs are used to generate depth
maps for each frame between any two existing key frames. This process is illustrated in figure 2. As
indicated, a separate MLA is trained for each key frame source and depth pair. For any other frame in
the sequence the x,y,r,g,b values are input into both MLAs and the resulting depths (z1 and z2) are
combined using a normalised time-weighted sum:
w1 = 1 / (f - k1)^P
w2 = 1 / (k2 - f)^P
Depth = (w1 z1 + w2 z2) / (w1 + w2)
f is the timecode of the frame under consideration, k1 is the timecode of the first key frame and k2 is the
timecode of the second key frame. The parameter P is used to control the rate at which the influence of
an MLA decays with time. Figure 3 illustrates an example of the MLA weighting functions for P = 2.
Figure 3: Plot showing the relative MLA weighting over time; in this example P = 2.
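The weighting scheme above can be written directly in code; the clamping at the key frames themselves (where the weights would otherwise divide by zero) is an assumption about how the limit is handled:

```python
import numpy as np

def tween_depth(f: float, k1: float, k2: float,
                z1: np.ndarray, z2: np.ndarray, P: float = 2.0) -> np.ndarray:
    """Normalised time-weighted combination of the two key-frame MLA
    outputs, following w1 = 1/(f-k1)^P and w2 = 1/(k2-f)^P."""
    if f <= k1:           # at (or before) key frame 1, MLA 1 dominates
        return z1
    if f >= k2:           # at (or after) key frame 2, MLA 2 dominates
        return z2
    w1 = 1.0 / (f - k1) ** P
    w2 = 1.0 / (k2 - f) ** P
    return (w1 * z1 + w2 * z2) / (w1 + w2)

z1 = np.array([100.0])
z2 = np.array([200.0])
mid = tween_depth(5.0, 0.0, 10.0, z1, z2)   # midway: equal weights
print(mid)  # [150.]
```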
4. RESULTS
The rapid conversion process was tested on a short sequence consisting of 43 frames. This sequence is
relatively challenging as it contains fast motion and overlapping regions of similar colour (i.e. the
oarsman's head and the cliffs on the left hand side of the image). Three key frames (at frames 1, 14 and
43) were depth mapped and the remaining frames were converted by depth tweening.
Figure 4: Source (left) and depth map (right) generated by depth tweening at frame 6.
Figure 4 shows the depth map generated from tweening at frame 6 using the key frames at frame
positions 1 and 14. The frames that are furthest away from a key frame generally contain the most
errors as the difference between the source at training and classification is highest. The depth map in
figure 4 accurately represents the major structure of the scene, although there are misclassification
errors between the oarsman's head and the background. Similarly, figure 5 shows the depth map
generated by tweening at frame 32 using the key frames at frames 14 and 43.
Figure 5: Source (left) and depth map (right) generated by depth tweening at frame 32.
This 43 frame sequence was successfully depth mapped by providing around 8,000 training samples
over the 3 key frames. This represents only 0.05% of the total number of pixels depth mapped in this
sequence.
Ground truth depth maps can be generated directly for CGI scenes using most commercial CG packages. We used these depths to measure the root mean square error of the depth tweening results.
The graph in figure 6 shows the RMS analysis on a 30 frame sequence with 4 key frames. The RMS
error is shown as a percentage of the total depth range. At the key frames the RMS error drops to zero
and as expected the error increases as the distance to the nearest key frame increases. The higher RMS
errors at the end of the sequence are due to the presence of very fast motion in the scene.
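The error measure used here is straightforward to reproduce; the sketch below assumes 8-bit depth maps (a depth range of 255) and simply normalises the RMS error by that range:

```python
import numpy as np

def rms_percent(depth: np.ndarray, truth: np.ndarray,
                depth_range: float = 255.0) -> float:
    """Root mean square error between a tweened depth map and ground
    truth, expressed as a percentage of the total depth range."""
    err = np.sqrt(np.mean((depth.astype(float) - truth.astype(float)) ** 2))
    return 100.0 * err / depth_range

truth = np.zeros((4, 4))
depth = np.full((4, 4), 25.5)          # uniform error of 10% of the range
print(rms_percent(depth, truth))  # 10.0
```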
[Figure 6 plot: depth tweening error cf. CG ground truth; RMS error (0-12% of the depth range) plotted against frame number, frames 639-669.]
Figure 6: Root mean square error as a percentage of the depth resolution for a 30-frame
sequence with 4 key frames
The average root mean square error for this sequence was 7.5% of the total depth range. These results
indicate that we can reconstruct the depth maps with better than 90% accuracy given just over 10%
training data. It should be noted that although the frame-by-frame RMS error is a useful tool for
evaluating such techniques, other factors such as edge accuracy, smoothness and temporal consistency
are crucial for generating an effective stereoscopic display.
The rapid conversion process, as described, has been successfully deployed in a commercial content
conversion service for the last two years.
5. CONCLUSIONS
The application of MLAs to the generation of depth maps has resulted in a substantial reduction in both
the time and manual effort required for converting 2D images to 3D. Training MLAs for content
conversion is simple, intuitive and easy to learn. The depth maps generated by this rapid conversion
process are of a high enough quality for commercial applications using autostereoscopic displays.
REFERENCES
1.
2.
3. C. van Berkel, D. W. Parker and A. R. Franklin, "Multiview 3D-LCD", Proc. of SPIE Vol. 2653, Stereoscopic Displays and Virtual Reality Systems III, ed. S. S. Fischer, J. O. Merritt and M. T. Bolas, pp 32-39, Apr. 1996.
4.
4.
5.
6. J. Berg, "3D Vision for Autonomous Robot-Based Security Operations", Advanced Imaging, pp 20-24, Jan. 2001.
7.
8.
9.
10. J. Ross, "Stereopsis by binocular delay", Nature, Vol. 248, pp 363-364, 1974.
11. C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: A
factorization approach", International Journal of Computer Vision (IJCV), 9(2), pp 137-154,
November 1992.
12. A. Zisserman, A. W. Fitzgibbon and G. Cross, "VHS to VRML: 3D Graphical Models from Video Sequences", Proc. International Conference on Multimedia Systems, pp 51-57, 1999.
13. T. M. Mitchell, "Machine Learning", McGraw Hill, 1997.
14. S. Porter, M. Mirmehdi and B. Thomas, "Detection and Classification of Shot Transitions",
British Machine Vision Conference (BMVC), pp 73-82, 2001.
P. V. Harman, Chief Technology Officer, Dynamic Digital Depth Research Pty Ltd, 6a Brodie Hall
Drive, Bentley, Western Australia 6102. Tel: +61 8 9355688. Fax: +61 8 93556988. Email:
pharman@ddd.com. Web: www.ddd.com