
Rapid 2D to 3D Conversion

Phil Harman*, Julien Flack, Simon Fox, Mark Dowley


Dynamic Digital Depth Research Pty Ltd
Perth, Western Australia
ABSTRACT
The conversion of existing 2D images to 3D is proving commercially viable and fulfills the growing
need for high quality stereoscopic images. This approach is particularly effective when creating
content for the new generation of autostereoscopic displays that require multiple stereo images. The
dominant technique for such content conversion is to develop a depth map for each frame of 2D
material. The use of a depth map as part of the 2D to 3D conversion process has a number of desirable
characteristics:
1. The resolution of the depth map may be lower than that of the associated 2D image;
2. It can be highly compressed;
3. 2D compatibility is maintained; and
4. Real time generation of stereo, or multiple stereo pairs, is possible.

The main disadvantage has been the laborious nature of the manual conversion techniques used to
create depth maps from existing 2D images, which results in a slow and costly process. An alternative,
highly productive technique has been developed based upon the use of Machine Learning Algorithms
(MLAs). This paper describes the application of MLAs to the generation of depth maps and presents
the results of the commercial application of this approach.
Keywords: 2D to 3D conversion, autostereoscopic displays, machine learning

1. INTRODUCTION
The last few years have seen a dramatic increase in the demand for stereo content. This has largely
been driven by the commercial availability of multiviewer autostereoscopic displays, such as those
manufactured by Stereographics [1], 4D-Vision [2] and Philips [3].
Such displays require a number of adjacent views of the scene, typically eight or nine, rather than the
simple left and right eye views of previous stereoscopic display technologies. Whilst original content
can be created for such displays using CGI-based material, consumer demand appears strongest in
video formats. Recording multiple views live using synchronised cameras has been attempted but,
particularly for inside shots, has proven both cumbersome and time-consuming.
We have previously presented the advantages of 2D to 3D conversion by generating a depth map from
the original 2D image [4]. This technique enables the conversion of existing content, as well as live
broadcasting and recording, to be undertaken at a commercial level of service.
The use of a depth map as part of the 2D to 3D conversion process has a number of desirable
characteristics. Consumer testing has indicated that, with currently available autostereoscopic displays,
the resolution of the depth map may be substantially less than that of the associated 2D image before

any degradation of the stereo image becomes apparent. Typically, for NTSC video resolution 2D
images, a reduction of 4:1 may be used.
Since the depth map is of lower resolution and only contains luminance information, its bandwidth and
storage requirements are lower than the associated 2D image. Optimum compression techniques can
be used to reduce the depth map to less than 2% of the size of its associated 2D image [4]. This
enables the depth map to be embedded in the original 2D image with minimal overhead and with the
ability to deliver a 2D compatible 3D™ image.
Software or hardware decoders can subsequently render, in real time, either a single stereo pair, or a
series of perspective images, suitable for driving a wide range of stereoscopic displays [4][2][5].

2. DEPTH MAP GENERATION


A number of devices capable of capturing depth maps in real time, in synchronism with the 2D source,
are now commercially available. These include 3DV's Z-Cam and other sensors based on scanning
lasers [6][7]. These systems both enable live broadcasting and eliminate the need for content
conversion. Whilst live recording will most certainly be the dominant process in the future, there are
still significant challenges in educating the existing 2D content creators in this new art as well as the
costs associated with equipping studios with such technology.
In the meantime the conversion of 2D content, either pre-existing or recorded specifically for the
purpose of display on a 3D screen, is a commercially viable alternative. Given the vast library of
existing 2D material, the consumer is assured of both compelling and current content. Conversion from
2D to 3D, of pre-existing content, based on the generation of depth maps is now an established process
[4]. The main disadvantage has been the manual nature of the majority of techniques used to create
depth maps, which resulted in a slow and costly process.
There are a number of manual techniques currently used to produce depth maps, including:

- Hand drawn object outlines manually associated with an artistically chosen depth value; and
- Semi-automatic outlining with corrections made manually by an operator.

Each of these has a number of drawbacks. Hand drawing produces high quality depth maps but is very
time consuming and expensive. Semi-automatic outlining is generally unreliable where complex
outlines are encountered.
Although the fully automated recovery of depth from monocular image sequences is possible under
certain conditions, the operational constraints associated with such techniques limit their commercial
viability. These approaches generally fall into one of the following two categories:
1. Depth from motion: The relationship between the motion of an object (relative to the
camera) and its distance from the camera can be used to calculate depth maps by analyzing
optic flow [8]. This technique can only recover relative depth accurately if the motion of all
objects is directly proportional to their distance from the camera. This assumption only holds
in a relatively small proportion of footage encountered (for example, a camera panning across
a stationary scene). This principle, which exploits motion parallax, is also the basis of single
lens stereo systems [9] and stereopsis by binocular delay [10].

2. Structure from motion (SFM): SFM is an active area of computer vision research in which
correspondences between subsequent frames (or similar views of the same scene) are used to
determine depth and recover camera parameters [11][12]. A restriction of this approach is that
the 3D scene must be predominantly static; that is, objects must remain stationary.
Furthermore, the camera must be moving relative to this static scene. Although this technique
is used in the special effects industry for compositing live action footage with CGI, its
application to depth recovery appears limited.

It should also be noted that these techniques rely on finding correspondences between frames, a process
that is unreliable in the presence of low-textured, fast-moving objects. These fully automated
techniques cannot recover depth in the absence of any motion.

3. IMPROVED DEPTH MAP GENERATION


The research presented in this paper describes a more pragmatic approach to the problem of 2D to 3D
conversion. We have developed an efficient interactive or semi-automated process in which a special
effects artist guides the generation of depth maps using a Machine Learning Algorithm (MLA).
3.1 Machine Learning Algorithms
An MLA can be considered as a black box that is trained to learn the relationships between a set of
inputs and a set of outputs. As such, most MLAs consist of two stages: training and classification. In
our application of MLAs the inputs relate to the position and colour of individual pixels. For the
purpose of this paper, we define the five inputs of a pixel as x, y, r, g, b, where x and y represent the
Cartesian coordinates and r, g, b respectively represent the red, green and blue colour components of
any given pixel. The output of the MLA is the depth of a pixel, which we denote by z.
3.1.1 Training
During the training stage, samples are presented to the MLA along with the known depth:

[Diagram: training samples (x, y, r, g, b) and the associated known depth z are presented to the MLA.]

The MLA will adjust its internal configuration to learn the relationships between the samples and
their associated depth. The details of this learning process vary according to the algorithm used.
Popular algorithms include Decision Trees and Neural Networks [13], the specifics of which are
beyond the scope of this paper.
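
As a concrete illustration, the sketch below trains a decision-tree regressor (one of the MLA families mentioned above) on a handful of (x, y, r, g, b) samples. The paper does not prescribe a specific library or algorithm, so scikit-learn is used here purely for illustration; the sample values and the depth convention (larger values nearer the camera) are assumptions.

```python
# A minimal sketch of the training stage, assuming a decision tree as the
# MLA. The sample values are hypothetical; in practice they come from
# operator-placed training points on a key frame.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each row is one training sample: (x, y, r, g, b).
X_train = np.array([
    [120,  40, 200, 180, 150],   # e.g. a pale sky pixel near the top of frame
    [300, 400,  30,  60,  20],   # e.g. a dark foreground pixel near the bottom
    [ 50, 350,  90, 110,  40],
], dtype=float)
z_train = np.array([0.0, 250.0, 200.0])  # known depths (assumed: larger = nearer)

mla = DecisionTreeRegressor()
mla.fit(X_train, z_train)  # learn the (x, y, r, g, b) -> z relationship
```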

3.1.2 Classification
During classification, samples with unknown depth values are presented to the MLA, which uses the
relationship established during training to determine an output depth value.

[Diagram: classification samples (x, y, r, g, b) are input to the trained MLA, which outputs a depth z for each sample.]
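
Continuing the sketch above, classification amounts to converting every pixel of a frame into an (x, y, r, g, b) sample and querying the trained MLA. The helper below is an illustrative assumption, not the authors' implementation.

```python
# Classify every pixel of an H x W RGB frame with a trained MLA,
# producing an H x W depth map. Assumes `mla` was trained as in the
# previous snippet.
import numpy as np

def classify_frame(mla, frame):
    h, w, _ = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]             # pixel coordinates
    samples = np.column_stack([
        xs.ravel(), ys.ravel(),
        frame[..., 0].ravel(),               # r
        frame[..., 1].ravel(),               # g
        frame[..., 2].ravel(),               # b
    ]).astype(float)
    return mla.predict(samples).reshape(h, w)  # depth z per pixel
```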

The learning process is applied in two related phases of the rapid 2D to 3D conversion process:
1. Depth mapping: assigning depths to key frames; and
2. Depth tweening: generating depth maps for frames between the key frames mapped in the
previous phase.

3.1.3 Depth Mapping


During the depth mapping phase of rapid 2D to 3D conversion the MLA is applied to a single key
frame. Manual depth mapping techniques traditionally require the user to associate a depth with every
pixel of the source image, typically by manipulating some geometric objects (such as Bezier curves).
By using an MLA we can significantly reduce the amount of effort required to produce a depth map.
[Figure 1 image: a source frame annotated with a horizon line and training samples, alongside the derived depth map.]

Figure 1: (left) An example source frame¹; the dots indicate the position of training samples, and the
colour of each dot indicates the depth associated with that pixel. A horizon line may be used to add
depth ramps. (right) The completed depth map derived from the MLA with added depth ramp.

¹ The images used in this test are taken from "Ultimate G's: Zac's Flying Dream", Copyright 1999, Sky High
Entertainment. All rights reserved.

Figure 1 indicates how an MLA provided with a relatively small number of training samples, as
indicated by the depth coloured dots on the source frame, can generate an accurate depth map. In this
instance 623 samples were used; this represents approximately 0.2% of the total number of pixels in
the image. In more complex scenes additional training data is required, but it is rarely necessary to
supply more than 5% of the image as training samples to achieve an acceptable result.
In this example, the results from the MLA are composited on top of a perspective depth ramp by
adding a horizon line. Depth maps are median filtered and smoothed to reduce stereoscopic rendering
artifacts.
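
The snippet below sketches this post-processing under stated assumptions: depth values in [0, 255] with larger values nearer, a horizon row supplied by the operator, and a simple per-pixel maximum as the compositing rule. The paper does not specify these details.

```python
# Composite an MLA depth map over a perspective depth ramp, then filter.
# Conventions (assumed): depth in [0, 255], larger = nearer to the camera.
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def postprocess(depth, horizon_row, far=0.0, near=255.0):
    h, w = depth.shape
    ramp = np.zeros((h, w))
    # Ramp from far at the horizon line to near at the bottom of the frame.
    ramp[horizon_row:] = np.linspace(far, near, h - horizon_row)[:, None]
    out = np.maximum(depth, ramp)         # one plausible compositing rule
    out = median_filter(out, size=5)      # suppress isolated outliers
    return gaussian_filter(out, sigma=1)  # smooth to reduce rendering artifacts
```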
3.1.4 Depth Tweening
Depth maps are generated for key frames using the process described above. These frames are
strategically located at points in an image sequence where there is significant change in the colour
and/or position of objects. Key frames may be identified manually, or techniques used for detecting
shot transitions [14] may be used to automate this process.
[Figure 2 diagram: MLA 1 is trained on key frame 1 and MLA 2 on key frame 2, each from (x, y, r, g, b) inputs and known depths z1 and z2; at a tween frame both MLAs are evaluated and their outputs are combined.]

Figure 2: An illustration of the depth tweening process. At each key frame an MLA is trained
using the known depth map of the source image. At any given tween frame the results of these
MLAs are combined to generate a tweened depth map.
During the Depth Tweening phase of the rapid conversion process MLAs are used to generate depth
maps for each frame between any two existing key frames. This process is illustrated in figure 2. As
indicated, a separate MLA is trained for each key frame source and depth pair. For any other frame in
the sequence the x,y,r,g,b values are input into both MLAs and the resulting depths (z1 and z2) are
combined using a normalised time-weighted sum:

$$
w_1 = \frac{1}{(f - k_1)^P}
\qquad
w_2 = \frac{1}{(k_2 - f)^P}
\qquad
\mathrm{Depth} = \frac{w_1 z_1 + w_2 z_2}{w_1 + w_2}
$$

Here f is the timecode of the frame under consideration, k1 is the timecode of the first key frame and k2
is the timecode of the second key frame. The parameter P controls the rate at which the influence of
an MLA decays with time. Figure 3 illustrates an example of the MLA weighting functions for P = 2.
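
This combination rule translates directly into code. The sketch below follows the formula above; the explicit key-frame cases are an assumed boundary treatment, since the weights are undefined when f equals k1 or k2 and the key frames already have known depth maps.

```python
import numpy as np

def tween_depth(z1, z2, f, k1, k2, P=2.0):
    """Combine key-frame MLA outputs z1, z2 (depth arrays) at tween frame f."""
    if f == k1:
        return z1  # at a key frame, its known depth map is used directly
    if f == k2:
        return z2
    w1 = 1.0 / (f - k1) ** P   # weight of the first key frame's MLA
    w2 = 1.0 / (k2 - f) ** P   # weight of the second key frame's MLA
    return (w1 * z1 + w2 * z2) / (w1 + w2)
```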

[Figure 3 plot: "MLA weights between key frames"; w1 decreases and w2 increases between key frame 1 and key frame 2, with weights shown on a 0 to 1.2 scale.]

Figure 3: Plot showing the relative MLA weighting over time; in this example P = 2.

4. RESULTS
The rapid conversion process was tested on a short sequence consisting of 43 frames. This sequence is
relatively challenging as it contains fast motion and overlapping regions of similar colour (i.e. the
oarsman's head and the cliffs on the left hand side of the image). Three key frames (at frames 1, 14 and
43) were depth mapped and the remaining frames were converted by depth tweening.

Figure 4: Source (left) and depth map (right) generated by depth tweening at frame 6.
Figure 4 shows the depth map generated by tweening at frame 6 using the key frames at positions
1 and 14. The frames that are furthest away from a key frame generally contain the most
errors, as the difference between the source at training and classification is highest. The depth map in
figure 4 accurately represents the major structure of the scene, although there are misclassification
errors between the oarsman's head and the background. Similarly, figure 5 shows the depth map
generated by tweening at frame 32 using the key frames at frames 14 and 43.

Figure 5: Source (left) and depth map (right) generated by depth tweening at frame 32.
This 43 frame sequence was successfully depth mapped by providing around 8,000 training samples
over the 3 key frames. This represents only 0.05% of the total number of pixels depth mapped in this
sequence.

4.1 Quantitative Analysis


In order to evaluate the accuracy of depth maps generated from the tweening process, a CGI sequence
was used as ground truth in a quantitative comparison. Pixel-accurate depth maps can be generated
for CGI scenes using most commercial CG packages. We used these depths to measure the root mean
square error of the depth tweening results.
The graph in figure 6 shows the RMS analysis on a 30-frame sequence with 4 key frames. The RMS
error is shown as a percentage of the total depth range. At the key frames the RMS error drops to zero
and, as expected, the error increases as the distance to the nearest key frame increases. The higher RMS
errors at the end of the sequence are due to the presence of very fast motion in the scene.
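
For reference, the sketch below computes the quantity plotted: RMS error between a tweened depth map and the CGI ground truth, as a percentage of the total depth range (assumed here to be 0 to 255, which the paper does not state).

```python
import numpy as np

def rms_error_percent(z_tween, z_truth, depth_range=255.0):
    """RMS depth error as a percentage of the total depth range."""
    diff = z_tween.astype(float) - z_truth.astype(float)
    return 100.0 * np.sqrt(np.mean(diff ** 2)) / depth_range
```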
[Figure 6 plot: "Error: Depth Tweening cf. CG ground truth"; RMS error (0 to 12) plotted against frame number (639 to 669).]

Figure 6: Root mean square error as a percentage of the depth resolution for a 30-frame
sequence with 4 key frames.
The average root mean square error for this sequence was 7.5% of the total depth range. These results
indicate that we can reconstruct the depth maps with better than 90% accuracy given just over 10%
training data. It should be noted that although the frame-by-frame RMS error is a useful tool for
evaluating such techniques, other factors such as edge accuracy, smoothness and temporal consistency
are crucial for effective stereoscopic display.
The rapid conversion process, as described, has been successfully deployed in a commercial content
conversion service for the last two years.

5. CONCLUSIONS
The application of MLAs to the generation of depth maps has resulted in a substantial reduction in both
the time and manual effort required for converting 2D images to 3D. Training MLAs for content
conversion is simple, intuitive and easy to learn. The depth maps generated by this rapid conversion
process are of a high enough quality for commercial applications using autostereoscopic displays.

REFERENCES
1. L. Lipton, "SynthaGram™: an autostereoscopic display technology", to be published in Proc. of
SPIE Vol. 4660, Stereoscopic Displays and Virtual Reality Systems IX, Jan. 2002.
2. 4D-Vision GmbH homepage: http://www.4d-vision.de
3. C. van Berkel, D. W. Parker and A. R. Franklin, "Multiview 3D-LCD", Proc. of SPIE Vol. 2653,
Stereoscopic Displays and Virtual Reality Systems III, ed. S. S. Fisher, J. O. Merritt, M. T. Bolas,
pp 32-39, Apr. 1996.
4. P. V. Harman, "Home-based 3D Entertainment - An Overview", Proc. of the IEEE Intl.
Conference on Image Processing, pp 1-4, Vancouver, 2000.
5. J. Eichenlaub, "A Lightweight, Compact 2D/3D Autostereoscopic LCD Backlight for Games,
Monitor and Notebook Applications", Proc. of SPIE, Stereoscopic Displays and Applications,
San Jose, California, pp 180-185, 1998.
6. J. Berg, "3D Vision for Autonomous Robot-Based Security Operations", Advanced Imaging,
pp 20-24, Jan. 2001.
7. J. A. Beraldin, F. Blais, L. Cournoyer, G. Godin and M. Rioux, "Active 3D Sensing", Modelli
e Metodi per lo studio e la conservazione dell'architettura storica, Scuola Normale Superiore,
Pisa, NRC 44159, pp 22-46, 2000.
8. Y. Matsumoto, H. Terasaki, K. Sugimoto and T. Arakawa, "Conversion System of Monocular
Image Sequence to Stereo Using Motion Parallax", Proc. of SPIE Vol. 3012, Stereoscopic
Displays and Virtual Reality Systems IV, ed. S. S. Fisher, J. O. Merritt, M. T. Bolas,
pp 108-112, May 1997.
9. B. J. Garcia, "Approaches to stereoscopic video based on spatio-temporal interpolation", Proc.
of SPIE Vol. 2653, Stereoscopic Displays and Virtual Reality Systems III, ed. S. S. Fisher,
J. O. Merritt, M. T. Bolas, pp 85-95, Apr. 1996.
10. J. Ross, "Stereopsis by binocular delay", Nature, Vol. 248, pp 363-364, 1974.
11. C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: A
factorization approach", International Journal of Computer Vision (IJCV), 9(2), pp 137-154,
Nov. 1992.
12. A. Zisserman, A. W. Fitzgibbon and G. Cross, "VHS to VRML: 3D Graphical Models from
Video Sequences", Proc. International Conference on Multimedia Systems, pp 51-57, 1999.
13. T. M. Mitchell, "Machine Learning", McGraw Hill, 1997.
14. S. Porter, M. Mirmehdi and B. Thomas, "Detection and Classification of Shot Transitions",
British Machine Vision Conference (BMVC), pp 73-82, 2001.

P. V. Harman, Chief Technology Officer, Dynamic Digital Depth Research Pty Ltd, 6a Brodie Hall
Drive, Bentley, Western Australia 6102. Tel: +61 8 9355688. Fax: +61 8 93556988.
Email: pharman@ddd.com, www.ddd.com
