




"3D Aeroplane Shooting Game Using Hand Gesture Recognition"

Submitted in the VII Semester

Bachelor of Technology
For the Academic year
Submitted by,
GAURAV SINGH (IT A-14120025)
Submitted To,
Ms. Sariga Raj,
Division of Information Technology,



This is to certify that the project report entitled 3D Aeroplane
Shooting Game Using Hand Gesture Recognition is a bonafide record of
work done by Bhawesh Kumar Singh (Reg No. 14120018) and Gaurav
Singh (Reg No. 14120025) of VII semester, in partial fulfillment of the
requirements for the award of the B.Tech degree in Information Technology during the year
Ms. Sariga Raj

Project Guide

Mr. Santosh Kumar M.B.

Head of Department
Information Technology


We would like to thank the almighty God for blessing us with his grace. The whole process
of developing the project has been quite an experience, and a lot of new and interesting things
helped us in making this project a success.
We would like to express our deepest gratitude to Ms. Sariga Raj (Sr. Lecturer), IT
Division, for all the help she provided us throughout the project. We wish to
express our heartiest regards to Mr. Santosh Kumar M.B., Head of Department,
Information Technology, and we thank all the faculty members of the Information
Technology Division for their support and help.
We express our profound gratitude to all our friends for their innumerable contributions,
affection and support towards completing this project successfully and effectively.
Last but not least, we thank all others who helped us in one way or another with their
support and encouragement.





1. Introduction
1.1 Aim
1.2 Project Goals
1.3 Project Description

2. User Guidelines
2.1 System Requirements
2.2 File Description
2.3 Compilation and Execution
2.4 Usage description

3. Key Frame Description

3.1 Overall Scenario
3.2 Actor Design
3.3 Animation Design
3.4 Algorithms and codes
3.4.1 Choice of Visual formats
3.4.2 Edge Detection
3.4.3 Color Conversion
3.4.4 HAAR Cascade Classifier
3.4.5 System Overview
3.4.6 System Design

4. Experimental Results


4.1 Target Detection

4.2 Gesture Detection and Control of Aeroplane
5. Conclusion


6. Limitations


7. References


3D Aeroplane Shooting Game Using Hand Gesture Recognition

IT 707 Multimedia Based Project Report

This report has been submitted for the IT 707 multimedia-based mini project.
1.1 Aim
We aim to build a 3D aeroplane shooting game controlled by hand gestures recognised from webcam input.
1.2 Project Goals
The project essentially consists of four separate modules:
1. Developing the 3D aeroplane game.
2. Detecting gestures from the video feed input using Haar classifiers.
3. Streaming the control messages over a TCP port.
4. Reading the control messages and controlling the game.

1.3 Project Description

We have used the OpenCV library for handling and manipulating input from the webcam. The
relative positions of the detected elements (face and fist) are used to recognise gestures. The
gesture is found using an algorithm written in Python, and the corresponding action, either a
system command or a keystroke, is then performed. The project has two phases: the first being the
development of the game and the second being gesture recognition and control passing.
Movements of the head made while performing the gesture are filtered out. Hand movements are
converted into simple controls in 3D computer space for the purposes of gaming.

Page | 1


2.1 System Requirements
Hardware Specification:

Processor: Pentium IV or higher.
RAM: 1 GB or higher.
Hard Disk: 80 GB or higher.

Software Specification:

Operating System: Windows 8 / Windows 7
Platform: OpenCV, Unity3D
Programming Language: Python


2.2 File Description

2.3 Compilation and Execution

The system should have OpenCV and Python 2.7 installed. The gesture detection code is
interpreted by Python, and Unity3D is used for game development. The gesture detection
program is run using the command python face_detect_sv.py.
The game is then run from the Unity editor by opening testScene.unity.


2.4 Usage Description

Fig 2.1: Game Rocket Shooting Scene

In the above screenshot, the enemy plane has been detected and the target lock is
displayed as a green bullseye. The player's plane is shooting rockets at the enemy in rapid succession.

Fig 2.2: Game Gesture Control Feedback


Here the player's gestures are being detected and the resulting controls are shown. The zero
zone is the area within which no controls are passed. After a gesture is recognised, its relative
position across the zero zone is converted into positive and negative x and y values, which are
transmitted to the game and act as the control axes.
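The zero-zone mapping described above can be sketched as a small function. The 50-pixel half-width matches the rangeOfError value used in the game script; the function name itself is illustrative:

```python
def axis_from_offset(offset, dead_zone=50):
    """Map a pixel offset (fist position relative to the pivot) to an
    axis value: 0 inside the zero zone, otherwise -1 or +1."""
    if offset > dead_zone:
        return 1
    if offset < -dead_zone:
        return -1
    return 0

# Offsets inside the zero zone produce no control input.
print(axis_from_offset(30))    # 0
print(axis_from_offset(120))   # 1
print(axis_from_offset(-75))   # -1
```

Applying this independently to the horizontal and vertical offsets yields the two control axes transmitted to the game.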


3.1 Overall Scenario
The code continuously streams video from the webcam and processes the frames of the video
to recognise the gesture. Frames are extracted from the video and converted to grayscale,
which makes most of the pixels grey or white. A trained Haar classifier is used for fist and
face detection. As the user comes in front of the screen, rectangles are drawn around the face
and fist. The position relative to the pivot point is then taken, and these coordinates become
the x and y controls. The controls are streamed over TCP to the game by a server, where they
are read by the game client and used to control the game.
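The control messages streamed over TCP are plain space-separated strings. A minimal sketch of the round-trip, assuming the "1 x y …" layout that appears in the code listings in this report (the function names are ours, not from the project):

```python
def build_control_message(x, y, valid=True):
    """Serialise one control update in the space-separated format the
    game client splits on: validity flag, x offset, y offset, padding."""
    flag = "1" if valid else "0"
    padding = "1 " * 11          # trailing filler words, as in the project code
    return "{} {} {} {}".format(flag, x, y, padding.strip())

def parse_control_message(msg):
    """Client-side parse: returns (x, y), or None if the flag is not '1'."""
    words = msg.split(' ')
    if words[0][0] != '1':
        return None
    return int(words[1]), int(words[2])

msg = build_control_message(60, -20)
print(parse_control_message(msg))   # (60, -20)
```

The server thread simply keeps sending the latest such string; the Unity client splits it on spaces and quantises the offsets into axis values.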

3.2 Actor Design


3.3 Animation Design

#F16 Fighter Plane Model

#Rocket Model

These models were created in 3ds Max and imported into the project in FBX format.
#Target Locked

3.4 Algorithms and Codes

Phase 1: The Aeroplane Fighter Game
The fighter plane game was developed using Unity3D. An F16 fighter plane model was used
along with a ballistic missile model. A target region to detect enemies was set up. Enemies
entering the target region are highlighted by the target visual, and shooting happens
automatically once the green lock is in range.
#Input Control for the player
public class FalconMotor : MonoBehaviour {
    public Transform player;
    public float speed;

    // Use this for initialization
    void Start () {
        speed = 20;
    }

    // Update is called once per frame
    void FixedUpdate () {
        #region HorizontalRotation
        // Reading Horizontal Control
        if (CustomInput.GetAxis("Horizontal") > 0) {
            player.Rotate(0, 0, -40 * Time.deltaTime); // Rolling the plane
        } else if (CustomInput.GetAxis("Horizontal") < 0) {
            player.Rotate(0, 0, 40 * Time.deltaTime);
        }
        #endregion

        #region VerticalRotation
        // Reading Vertical Control
        if (CustomInput.GetAxis("Vertical") < 0) {
            player.Rotate(40 * Time.deltaTime, 0, 0); // Moving the plane up
        } else if (CustomInput.GetAxis("Vertical") > 0) {
            player.Rotate(-40 * Time.deltaTime, 0, 0);
        }
        #endregion

        player.Translate(0, 0, speed * Time.deltaTime); // Constant forward motion
    }
}

# Linking the gesture controls

The gesture controls were mapped to the game via this script.
public class InputControl : MonoBehaviour {
    TcpClient tcp;
    string s, xtemp, ytemp;
    StreamReader sr;
    int xfac, yfac;
    Vector3 newpos;
    int x, y;
    int rangeOfError;

    void Start () {
        tcp = new TcpClient("127.0.0.1", 7045); // host elided in the original; local gesture server assumed
        sr = new StreamReader(tcp.GetStream());
        x = 0;
        y = 0;
        rangeOfError = 50;
    }

    void Update () {
        s = sr.ReadLine();
        string[] words = s.Split(' ');
        if (words[0][0] == '1') { // a valid control message starts with "1"
            xtemp = words[1];
            ytemp = words[2];
            x = int.Parse(xtemp);
            y = int.Parse(ytemp);
        }

        // Quantise the raw pixel offsets into -1/0/+1 axis values;
        // offsets within rangeOfError (the zero zone) are ignored.
        if (x > rangeOfError) { xfac = 1; }
        else if (x < -rangeOfError) { xfac = -1; }
        else { xfac = 0; }

        if (y > rangeOfError) { yfac = 1; }
        else if (y < -rangeOfError) { yfac = -1; }
        else { yfac = 0; }

        // setting horizontal controls
        CustomInput.SetAxis("Horizontal", xfac);
        // setting vertical controls
        CustomInput.SetAxis("Vertical", yfac);
    }
}

public static class CustomInput {
    // Dictionary structure for axes control
    static Dictionary<string, float> inputs = new Dictionary<string, float>();

    static public float GetAxis(string _axis) {
        if (!inputs.ContainsKey(_axis))
            inputs.Add(_axis, 0);
        return inputs[_axis];
    }

    static public void SetAxis(string _axis, float _value) {
        if (!inputs.ContainsKey(_axis))
            inputs.Add(_axis, 0);
        inputs[_axis] = _value;
    }
}

Phase 2: Gesture Detection and Control Passing

To implement gesture control, we had to detect the face and the fist. Both are detected using
Haar classifiers, and their positions relative to the pivot are converted into controls streamed
over a TCP port.

# ----- gesture recognition and control passing algorithm -----
import socket
import threading
import cv2

class data:
    # structure for passing data over TCP
    valid = "0 "
    x = "0 "
    y = "0 "
    end = ''

def __server(threadName, delay):
    # server thread definition
    host = ''
    port = 7045
    backlog = 5
    size = 50
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket registration
    s.bind((host, port))
    s.listen(backlog)
    while 1:
        # server listening for connections
        client, address = s.accept()
        running = 1
        while running:
            # stream the latest control message to the game client
            client.send(data.valid + data.x + data.y + data.end)

def __gesture(threadName, delay):
    # gesture recognition thread
    vid_cap = cv2.VideoCapture(0)  # capture frames from the webcam

    # Get user supplied values
    cascPath = "haarcascade_frontalface_default.xml"
    cascd = "fist.xml"
    # Create the haar cascades
    faceCascade = cv2.CascadeClassifier(cascPath)
    palmCascade = cv2.CascadeClassifier(cascd)
    while 1:
        # Read the image
        ret, image = vid_cap.read()
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Detect faces in the image
        faces = faceCascade.detectMultiScale(
            gray,
            minSize=(30, 30),
            flags=cv2.cv.CV_HAAR_SCALE_IMAGE
        )
        # Detect palms in the image
        palm = palmCascade.detectMultiScale(
            gray,
            minSize=(20, 15),
            flags=cv2.cv.CV_HAAR_SCALE_IMAGE
        )
        # Draw a rectangle around the faces
        for (xf, yf, wf, hf) in faces:
            cv2.rectangle(image, (xf, yf), (xf+wf, yf+hf), (0, 255, 0), 2)
            if wf >= bgWdth:  # bgWdth: calibrated face width (set elsewhere, elided here)
                pass  # extend base line of face
        for (x1, y1, w1, h1) in palm:
            # rectangle around palm
            cv2.rectangle(image, (x1, y1), (x1+w1, y1+h1), (255, 0, 0), 2)
        if len(palm) > 0:
            # line from the extended base of the face to the centre of the hand
            cv2.line(image, (xf-bgWdth, yf+hf), (x1+bdWdth/2, y1+h1/2), (0, 0, 255), 2)
            # zero zone around the pivot point
            cv2.rectangle(image, (xf-bgWdth-50, yf+hf-50), (xf-bgWdth+50, yf+hf+50), (255, 0, 0), -1)
            data.valid = "1 "
            data.x = str(xAxis) + " "   # xAxis/yAxis: offsets of the fist from
            data.y = str(yAxis) + " "   # the pivot (computation elided in the report)
            data.end = "1 1 1 1 1 1 1 1 1 1 1 "
        # set the heading of the window
        cv2.imshow("Faces and hands found", image)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

3.4.1 Choice of visual data format

An important trade-off when implementing a computer vision system is to select whether to
differentiate objects using colour or black and white and, if colour, to decide what colour space
to use (red, green, blue or hue, saturation, luminosity). For the purposes of this project, the
detection of skin and marker pixels is required, so the colour space chosen should best facilitate this.
Colour or black and white: The camera and video card available permitted the detection of
colour information. Although using intensity alone (black and white) reduces the amount of data
to analyse, and therefore decreases processor load, it also makes differentiating skin and markers
from the background much harder (since black-and-white data exhibits less variation than colour
data). Therefore it was decided to use colour differentiation.
RGB or HSL: The raw data provided by the video card was in the RGB (red, green, blue) format.
However, since the detection system relies on changes in colour (or hue), it could be an
advantage to use HSL (hue, saturation, luminosity- see Glossary) to permit the separation of the
hue from luminosity (light level).
Hue, when compared with saturation and luminosity, is surprisingly bad at skin differentiation
(with the chosen background) and thus HSL shows no significant advantage over RGB.
Moreover, since conversion of the colour data from RGB to HSL took considerable processor
time, it was decided to use RGB, desaturating it into greyscale.
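The RGB-to-greyscale desaturation can be sketched per pixel using the standard luma weights (the weighting OpenCV's cvtColor applies for RGB to GRAY):

```python
def rgb_to_gray(r, g, b):
    """Convert one RGB pixel to a grey intensity using the standard
    luma weights: Y = 0.299 R + 0.587 G + 0.114 B."""
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))

print(rgb_to_gray(255, 255, 255))  # 255 (white stays white)
print(rgb_to_gray(0, 0, 0))        # 0   (black stays black)
print(rgb_to_gray(255, 0, 0))      # 76  (pure red maps to a dark grey)
```

Applying this to every pixel collapses the three colour channels to one intensity channel, which is what the Haar classifiers operate on.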
3.4.2 Edge Detection
An improved algorithm based on frame difference and edge detection is presented for moving
object detection. First, it detects the edges of each pair of consecutive frames with a Canny
detector and takes the difference between the two edge images. Then it divides the edge
difference image into several small blocks and decides whether each is a moving area by
comparing its number of non-zero pixels to a threshold. Finally, it performs block
connected-component labeling to get the smallest rectangle that contains the moving object.

Fig 3.1: Edge Detection Process
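The per-block decision step can be illustrated on toy binary edge maps (a pure-Python sketch, not the project code; Canny edge extraction is assumed to have already produced the 0/1 maps):

```python
def moving_blocks(edges_prev, edges_cur, block=2, min_nonzero=2):
    """Frame-difference on two binary edge maps (nested lists of 0/1):
    XOR the maps, tile into block-by-block cells, and flag cells whose
    non-zero count reaches min_nonzero as moving areas."""
    h, w = len(edges_cur), len(edges_cur[0])
    diff = [[edges_prev[y][x] ^ edges_cur[y][x] for x in range(w)] for y in range(h)]
    flagged = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            count = sum(diff[y][x]
                        for y in range(by, min(by + block, h))
                        for x in range(bx, min(bx + block, w)))
            if count >= min_nonzero:
                flagged.append((bx, by))  # top-left corner of a moving block
    return flagged

# A 2x2 edge square shifts one pixel to the right between frames.
prev_e = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
cur_e  = [[0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]
print(moving_blocks(prev_e, cur_e, block=2, min_nonzero=1))
```

The smallest rectangle containing the moving object is then the bounding box of the flagged blocks.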

3.4.3 Color conversion
Thresholding is an image segmentation method used to convert a grayscale image to a binary
image. During the thresholding process, individual pixels in an image are marked as object
pixels if their value is greater than some threshold value (assuming the object is brighter than
the background) and as background pixels otherwise. This convention is known as threshold
above. Variants include threshold below, which is the opposite of threshold above; threshold
inside, where a pixel is labeled "object" if its value lies between two thresholds; and threshold
outside, which is the opposite of threshold inside. Typically, an object pixel is given a value of
1 while a background pixel is given a value of 0. Finally, a binary image is created by coloring
each pixel white or black depending on its label.
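The four threshold conventions can be sketched for a single pixel (an illustrative function, not from the project code):

```python
def threshold_pixel(value, t1, t2=None, mode="above"):
    """Label one grayscale pixel 1 (object) or 0 (background).
    Modes mirror the text: 'above', 'below', 'inside', 'outside'."""
    if mode == "above":
        return 1 if value > t1 else 0
    if mode == "below":
        return 1 if value < t1 else 0
    if mode == "inside":
        return 1 if t1 <= value <= t2 else 0
    if mode == "outside":
        return 1 if value < t1 or value > t2 else 0
    raise ValueError("unknown mode: " + mode)

row = [12, 130, 200, 90]
print([threshold_pixel(v, 100) for v in row])                # [0, 1, 1, 0]
print([threshold_pixel(v, 80, 150, "inside") for v in row])  # [0, 1, 0, 1]
```

Mapping every pixel of the grayscale frame through such a function produces the binary image shown in Fig 3.2.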

Fig 3.2: Result image of grayscale to binary color conversion process.


3.4.4 HAAR Cascade Classifier

The core basis for Haar classifier object detection is the Haar-like feature. These features, rather
than using the intensity values of a pixel, use the change in contrast values between adjacent
rectangular groups of pixels. The contrast variances between the pixel groups are used to
determine relative light and dark areas. Two or three adjacent groups with a relative contrast
variance form a Haar-like feature. Haar-like features are used to detect objects in an image, and
can easily be scaled by increasing or decreasing the size of the pixel group being examined. This
allows features to be used to detect objects of various sizes.

Integral Image
The simple rectangular features of an image are calculated using an intermediate representation
of the image, called the integral image. The integral image is an array containing the sums of
the pixel intensity values located directly to the left of and directly above the pixel at
location (x, y), inclusive. So if A[x,y] is the original image and AI[x,y] is the integral image, then
the integral image is computed as shown in equation 1 and illustrated in Figure 2. The features
rotated by forty-five degrees, like the line feature shown in Figure 2(e), as introduced by
Lienhart and Maydt, require another intermediate representation called the rotated integral
image or rotated sum auxiliary image [5]. The rotated integral image is calculated by finding
the sum of the pixel intensity values located at a forty-five degree angle to the left and
above for the x value and below for the y value. So if A[x,y] is the original image and AR[x,y]
is the rotated integral image, then it is computed as shown in equation 2 and
illustrated in Figure 3.

It only takes two passes to compute both integral image arrays, one for each array. Using the
appropriate integral image and taking the difference between six to eight array elements forming
two or three connected rectangles, a feature of any scale can be computed. Thus calculating a
feature is extremely fast and efficient. It also means calculating features of various sizes requires
the same effort as a feature of only two or three pixels. The detection of various sizes of the
same object requires the same amount of effort and time as objects of similar sizes since scaling
requires no additional effort.
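Equations 1 and 2 did not survive in this copy of the report, so as a sketch, the standard upright integral-image recurrence is AI[x,y] = A[x,y] + AI[x-1,y] + AI[x,y-1] - AI[x-1,y-1], and any rectangle sum then comes from four lookups:

```python
def integral_image(img):
    """Build the integral image of a 2D list: ii[y][x] holds the sum of
    all pixels above and to the left of (x, y), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ii[y][x] = (img[y][x]
                        + (ii[y-1][x] if y > 0 else 0)
                        + (ii[y][x-1] if x > 0 else 0)
                        - (ii[y-1][x-1] if y > 0 and x > 0 else 0))
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the rectangle with top-left (x0, y0) and bottom-right
    (x1, y1), inclusive, from at most four integral-image lookups."""
    total = ii[y1][x1]
    if x0 > 0: total -= ii[y1][x0-1]
    if y0 > 0: total -= ii[y0-1][x1]
    if x0 > 0 and y0 > 0: total += ii[y0-1][x0-1]
    return total

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))   # 5 + 6 + 8 + 9 = 28
```

This is why the cost of evaluating a rectangular feature is independent of the feature's size, as the text notes.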

Classifiers Cascaded
Although calculating a feature is extremely efficient and fast, calculating all 180,000 features
contained within a 24x24 sub-image is impractical. Fortunately, only a tiny fraction of those
features are needed to determine if a sub-image potentially contains the desired object. In order
to eliminate as many sub-images as possible, only a few of the features that define an object are
used when analyzing sub-images. The goal is to eliminate a substantial amount, around 50%, of
the sub-images that do not contain the object. This process continues, increasing the number of
features used to analyze the sub-image at each stage. The cascading of the classifiers allows
only the sub-images with the highest probability to be analyzed for all Haar-features that
distinguish an object. It also allows one to vary the accuracy of a classifier. One can increase
both the false alarm rate and positive hit rate by decreasing the number of stages. The inverse
of this is also true. Viola and Jones were able to achieve a 95% accuracy rate for the detection
of a human face using only 200 simple features. Using a 2 GHz computer, a Haar classifier
cascade could detect human faces at a rate of at least five frames per second.
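The early-rejection behaviour can be sketched with toy stage classifiers; the stage tests below (mean intensity, contrast) are hypothetical stand-ins for real Haar-feature stages, not values from any trained cascade:

```python
def run_cascade(window, stages):
    """Evaluate cascade stages in order; reject the sub-image (window)
    at the first failing stage, so later, costlier stages never run."""
    for i, stage in enumerate(stages):
        if not stage(window):
            return (False, i)       # rejected at stage i
    return (True, len(stages))      # survived every stage: candidate detection

# Hypothetical stages: each cheap early test is meant to discard ~50% of windows.
stages = [
    lambda w: w["mean"] > 50,       # first stage: a few coarse features
    lambda w: w["contrast"] > 0.2,  # later stage: more, finer features
]
print(run_cascade({"mean": 30, "contrast": 0.9}, stages))  # (False, 0)
print(run_cascade({"mean": 80, "contrast": 0.5}, stages))  # (True, 2)
```

Removing stages from the end of the list raises both the hit rate and the false-alarm rate, which is the accuracy trade-off described above.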



Detecting human facial features, such as the mouth, eyes, and nose, requires that Haar classifier
cascades first be trained. In order to train the classifiers, the gentle AdaBoost algorithm and
Haar feature algorithms must be implemented. Fortunately, Intel developed an open source
library devoted to easing the implementation of computer vision related programs, called the
Open Computer Vision Library (OpenCV). The OpenCV library is designed to be used in
conjunction with applications that pertain to the fields of HCI, robotics, biometrics, image
processing, and other areas where visualization is important, and it includes an
implementation of Haar classifier detection and training. To train the classifiers, two sets of
images are needed. One set contains images or scenes that do not contain the object, in this
case a facial feature, which is going to be detected. This set of images is referred to as the
negative images. The other set of images, the positive images, contains one or more instances of
the object. The location of the objects within the positive images is specified by image name,
the upper-left pixel, and the height and width of the object. For training facial features, 5,000
negative images with at least a megapixel resolution were used. These images
consisted of everyday objects, like paperclips, and of natural scenery, like photographs of forests
and mountains. In order to produce the most robust facial feature detection possible, the
positive set of images needs to be representative of the variance between different people,
including race, gender, and age. A good source for these images is the National Institute of
Standards and Technology's (NIST) Facial Recognition Technology (FERET) database. This
database contains over 10,000 images of over 1,000 people under different lighting conditions,
poses, and angles. In training each facial feature, 1,500 images were used. These images were
taken at angles ranging from zero to forty-five degrees from a frontal view. This provides the
variance required to allow detection even if the head is turned slightly.
The classifiers have a high rate of detection; however, the false positive rate is also quite high.

Since it is not possible to reduce the false positive rate of the classifier without also reducing the
positive hit rate, a method besides modifying the classifier training attributes is needed to increase
accuracy. The method proposed is to limit the region of the image that is analyzed for the facial
features. By reducing the area analyzed, accuracy increases since less area exists to produce
false positives. It also increases efficiency, since fewer features need to be computed and the
area of the integral images is smaller. In order to regionalize the image, one must first determine
the likely area where a facial feature might exist. The simplest method is to perform face
detection on the image first: the area containing the face will also contain the facial features.
However, the facial feature cascades often detect other facial features as well. The best method to
eliminate such extra detections is to further regionalize the area for facial feature detection.
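Regionalisation can be sketched as deriving, from a detected face rectangle, the sub-rectangle to scan for a given feature. The band fractions below are illustrative assumptions, not values from any cascade:

```python
def feature_roi(face, part="eyes"):
    """Given a detected face rectangle (x, y, w, h), return the
    sub-rectangle to scan for a facial feature. The fractions used
    here are illustrative assumptions."""
    x, y, w, h = face
    if part == "eyes":
        return (x, y + h // 4, w, h // 4)                    # upper-middle band
    if part == "mouth":
        return (x + w // 4, y + 2 * h // 3, w // 2, h // 3)  # lower-centre band
    return face                                              # unknown part: whole face

print(feature_roi((100, 100, 80, 80), "eyes"))   # (100, 120, 80, 20)
```

Running the eye or mouth cascade only inside its band both speeds up detection and suppresses the spurious cross-detections described above.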


3.4.5 System Overview



[Flowchart: capture frame - detect portion of head and hand - draw rectangle over head and hand - get control values - emulate these inputs in the game]

The flowchart shows the overview of the flow of controls in the game using gesture recognition.


3.4.6 System Design
The design of the system is shown here.


4.1 Target Detection
Enemies entering the target region are highlighted by the target visual, and shooting
happens automatically once the green lock is in range.
Fig 4.1: Shooting Target

4.2 Gesture Detection and Control of Aeroplane

Simultaneous detection of the face and fist generates the controls for gameplay.

Fig 4.2: Gameplay and Control


5. Conclusion

This project aimed to build a 3D aeroplane game controlled via hand gestures, and it has been
successful in this initiative.
This project has vast scope for further development, along the lines of Pranav Mistry's SixthSense
project, which completely revolutionises the digital world. The code can be extended to control
mouse movements, and it can also be used as a plugin to add gesture controls to any other game.

6. Limitations

Hand and face detection works best in ambient light. For best results, the fist has to be held
parallel to the webcam.

7. References

1. Adolf, F. How-to build a cascade of boosted classifiers based on Haar-like features.
June 20, 2003.
2. Cristinacce, D. and Cootes, T. Facial feature detection using AdaBoost with shape
constraints. British Machine Vision Conference, 2003.
5. Lienhart, R. and Maydt, J. An extended set of Haar-like features for rapid object
detection. IEEE ICIP 2002, Vol. 1, pp. 900-903, Sep. 2002.
