
Final Project MIS 6324 Data Mining Techniques

Group 2 Page 1

PROJECT REPORT

MIS 6324: Business Intelligence Software & Techniques



Topic: Data Mining from Bird Strikes Data
Presented by: Group 2
Aditi Saluja
Divya Vanacharla
Ishan Dindorkar
Rohan Patil
Valay Raval
Under the guidance of Prof. Kelly Slaughter





1. Introduction
Our aim is to apply data mining techniques to a large dataset (around 20
attributes and 20,000 instances) and look for interesting patterns that can
aid decision making.
For our study, we carry out the analysis on the dataset "Bird Strike.xlsx",
which purports to represent all the bird strikes reported from 2000 to 2011.
This dataset is public and listed on the Tableau Software Community; it was
extracted by the Federal Aviation Administration (FAA) (link). As the FAA puts
it: "The FAA Wildlife Strike Database contains records of reported wildlife
strikes since 1990. Strike reporting is voluntary. Therefore, this database only
represents the information we have received from airlines, airports, pilots, and
other sources."
By definition, a bird strike is a collision between an aircraft and an airborne
animal (generally a bird). Our interest in this database was sparked by the fact
that bird strikes are a major concern for airlines and air traffic control
around the world. Some major incidents and figures related to bird strikes are
listed below (source: Wikipedia):
a) The Federal Aviation Administration (FAA) estimates the problem costs US
aviation 400 million dollars annually and has resulted in over 200 worldwide
deaths since 1988.
b) NASA astronaut Theodore Freeman was killed when a goose shattered the
Plexiglas cockpit canopy of his Northrop T-38 Talon, resulting in shards being
ingested by the engines, leading to a fatal crash.
c) In 1988 Ethiopian Airlines Flight 604 sucked pigeons into both engines
during take-off and then crashed, killing 35 passengers.
d) On September 22, 1995, a U.S. Air Force Boeing E-3 Sentry AWACS aircraft
(callsign Yukla 27, serial number 77-0354) crashed shortly after takeoff
from Elmendorf AFB. The aircraft lost power in both port side engines after
these engines ingested several Canada Geese during takeoff. It crashed
about two miles (3 km) from the runway, killing all 24 crew members on
board.

2. Knowing the Data
Before carrying out any data mining activity, it is important to know the data
and understand exactly what it represents. Each instance of our dataset
represents one reported bird strike and the details related to it. The table
below explains what each attribute says about a bird strike instance:
Bird Strikes
Attribute                       Explanation                                                    Type
Aircraft_Type                   What kind of aircraft was involved in the bird strike          Nominal
Airport_Name                    At which airport the strike was detected                       Nominal
Altitude_bin                    At what altitude the strike occurred (< or > 1000 ft)          Binomial
Aircraft_Model                  Model of the aircraft struck                                   Nominal
Wildlife_Number_struck          Number of wildlife struck in the instance                      Range
Effect_Impact_to_flight         Impact of the strike on the flight, if any                     Nominal
Record_ID                       Unique record ID for each incident                             Numerical
Effect_Indicated_Damage         Whether the strike caused damage to the aircraft               Binomial
Aircraft_Number_of_engines.     Number of engines of the aircraft struck                       Numerical
Aircraft_Airline.Operator       Airline operator of the aircraft struck                        Nominal
Origin_State                    US state in which the strike occurred                          Nominal
When_Phase_of_flight            Phase of flight during which the strike occurred               Nominal
Conditions_Precipitation        Precipitation condition during the strike                      Nominal
Remains_of_wildlife_collected   Whether the wildlife remains were collected                    Binomial
Wildlife_Size                   Size of the wildlife struck (Small, Medium, Large)             Range
Conditions_Sky                  Condition of the sky during the strike                         Nominal
Wildlife_Species                Species of the wildlife struck                                 Nominal
Pilot_warned                    Whether the pilot was warned of a possible strike              Binomial
Feet_above_ground               Height of the aircraft above the ground at the strike (feet)   Numerical
Speed (in Knots)                Speed of the aircraft during the strike                        Numerical



By graphing the attribute values of the database, we can answer some basic
questions from the dataset:

a) Frequency of wildlife species where damage was caused (most common)



b) Frequency of strikes for each state




c) Plotting of Species & States for each strike



The graphs give us the following results about the dataset:

1) The species that most frequently strikes aircraft is the Canada Goose
(after the Unknown category).

2) Florida has the highest frequency of bird strikes, followed by Colorado and
Texas.

3) Even though the Canada Goose has the highest overall frequency of bird
strikes (distributed across all the states), the State vs. Species plot shows
that Turkey Vulture strikes are especially prominent in Florida.
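
The counts behind such graphs can be obtained with simple frequency tables. The following is a minimal sketch, not the code used for the report: the data frame `toy` and its values are made up for illustration and only reuse two column names from the attribute table.

```r
# Hypothetical toy data (made-up values) reusing two attribute names.
toy <- data.frame(
  Origin_State     = c("Florida", "Florida", "Florida", "Texas", "Texas", "Colorado"),
  Wildlife_Species = c("Turkey vulture", "Turkey vulture", "Canada goose",
                       "Canada goose", "Mourning dove", "Canada goose")
)

# Strikes per state, most frequent first (this is what a bar chart would show).
state_counts <- sort(table(toy$Origin_State), decreasing = TRUE)
print(state_counts)

# Strikes per species, most frequent first.
species_counts <- sort(table(toy$Wildlife_Species), decreasing = TRUE)
print(species_counts)

# State-by-species cross-tabulation, the basis of a State vs. Species plot.
species_by_state <- table(toy$Origin_State, toy$Wildlife_Species)
print(species_by_state)
```

Passing `state_counts` to `barplot()` would draw the corresponding bar chart. In this toy data one species leads overall while another dominates within a single state, which is the kind of pattern described in result 3.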








3. Data Cleaning
The next important step, after knowing the data and before carrying out data
mining activities, is to "clean" the data. As the name suggests, cleaning the
data means getting rid of instances that are incomplete or otherwise unusable.
It helps improve data quality and reduce the negative impact of errors. We
clean our dataset by:
a) Removing instances having NULL/blank values
By manually scanning the data, we can see that a number of instances have blank
values. We address this by first converting the blank values to "NA" and then
removing all the instances which have one or more "NA" values.
Note: We do not replace the NULL values with mean values because the
attributes having NULL values are all nominal.
R commands:
> Bird_Strikes <- read.csv("Birds.csv") #read the CSV dataset
> Bird_Strikes[Bird_Strikes == ""] <- NA #convert blank values into NA
> Bird_Strikes$Origin_State[Bird_Strikes$Origin_State == "N/A"] <- NA
> Bird_Strikes<- na.omit(Bird_Strikes) #delete NA values
After removing the NULL values, we can see that our dataset still has "clean"
19,375 out of
b) Removing outliers, if required
By drawing box plots of all attributes, we observe that the dataset does not
contain any outliers which have to be deleted. Some extreme values are present,
but they are important for the analysis and are therefore retained.
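
The two cleaning steps can be tried on a small example. This is only a sketch under stated assumptions: the data frame `toy` and the vector `speeds` are made up, and `boxplot.stats()` is used here as one possible way to list the points a box plot would draw as extreme values.

```r
# Hypothetical toy data (made-up values) with blank and "N/A" entries.
toy <- data.frame(
  Origin_State  = c("Florida", "", "Texas", "N/A", "Colorado"),
  Wildlife_Size = c("Large", "Small", "", "Medium", "Small"),
  stringsAsFactors = FALSE
)

toy[toy == ""] <- NA                               # blank values become NA
toy$Origin_State[toy$Origin_State == "N/A"] <- NA  # "N/A" strings become NA
clean <- na.omit(toy)                              # drop rows with any NA
print(nrow(clean))                                 # rows that survive cleaning

# Extreme values in a numeric attribute: points beyond 1.5 times the
# hinge spread from the box, as drawn by a box plot.
speeds <- c(120, 130, 135, 140, 145, 150, 400)     # made-up speeds in knots
extreme <- boxplot.stats(speeds)$out
print(extreme)
```

Listing the extreme values first makes it possible to judge, as done above, whether they are errors to delete or genuine observations to keep.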



4. Apriori Algorithm - Extracting Interesting Association Rules from Dataset
Using the Apriori algorithm, we look for interesting association rules whose
right-hand side is the attribute value "Effect_Indicated_Damage = Caused
damage". These rules highlight the factors that most often accompany damage to
the aircraft in a bird strike.
R commands:
> library(arules) #provides the "transactions" class, apriori() and inspect()
> Bird_Strikes$Feet_above_ground <- as.factor(gsub(",", "", Bird_Strikes$Feet_above_ground))
#removes "," from attribute Feet_above_ground; as.factor because gsub returns character
> Bird_Strikes$Record_ID <- as.factor(Bird_Strikes$Record_ID)
> Bird_Strikes$Speed <- as.factor(Bird_Strikes$Speed)
> Bird_strikes_trans <- as(Bird_Strikes, "transactions")
> BirdtypeRules <- apriori(Bird_strikes_trans, parameter=list(support=.01,
confidence=.6))
parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.6    0.1    1 none FALSE            TRUE    0.01      1     10  rules FALSE

algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[43255 item(s), 20375 transaction(s)] done [0.08s].
sorting and recoding items ... [192 item(s)] done [0.01s].
creating transaction tree ... done [0.01s].

checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [9.01s].
writing ... [2605356 rule(s)] done [0.95s].
creating S4 object ... done [3.72s]

> BirdRules_caused <- subset(BirdtypeRules, subset = rhs %in%
"Effect_Indicated_Damage=Caused damage" & lift > 1.0)

> inspect(sort(BirdRules_caused, by = "confidence")[1:10])
   lhs                                                rhs                                         support    confidence lift
1  {Aircraft_Type=Airplane,
    Effect_Impact_to_flight=Precautionary Landing,
    Wildlife_Size=Large}                           => {Effect_Indicated_Damage=Caused damage} 0.01099387 0.8265683  7.218743
2  {Aircraft_Type=Airplane,
    Effect_Impact_to_flight=Precautionary Landing,
    Conditions_Precipitation=None,
    Wildlife_Size=Large}                           => {Effect_Indicated_Damage=Caused damage} 0.01035583 0.8210117  7.170216
3  {Effect_Impact_to_flight=Precautionary Landing,
    Wildlife_Size=Large}                           => {Effect_Indicated_Damage=Caused damage} 0.01207362 0.8200000  7.161380
4  {Effect_Impact_to_flight=Precautionary Landing,
    Conditions_Precipitation=None,
    Wildlife_Size=Large}                           => {Effect_Indicated_Damage=Caused damage} 0.01138650 0.8140351  7.109286
5  {Aircraft_Airline.Operator=BUSINESS,
    Remains_of_wildlife_collected=FALSE,
    Wildlife_Size=Large,
    Pilot_warned=N}                                => {Effect_Indicated_Damage=Caused damage} 0.01006135 0.7620818  6.655558
6  {Aircraft_Type=Airplane,
    Aircraft_Airline.Operator=BUSINESS,
    Conditions_Precipitation=None,
    Wildlife_Size=Large,
    Pilot_warned=N}                                => {Effect_Indicated_Damage=Caused damage} 0.01060123 0.7578947  6.618991
7  {Aircraft_Type=Airplane,
    Aircraft_Airline.Operator=BUSINESS,
    Wildlife_Size=Large,
    Pilot_warned=N}                                => {Effect_Indicated_Damage=Caused damage} 0.01168098 0.7555556  6.598562
8  {Aircraft_Airline.Operator=BUSINESS,
    Conditions_Precipitation=None,
    Wildlife_Size=Large,
    Pilot_warned=N}                                => {Effect_Indicated_Damage=Caused damage} 0.01128834 0.7516340  6.564313
9  {Aircraft_Airline.Operator=BUSINESS,
    Wildlife_Size=Large,
    Pilot_warned=N}                                => {Effect_Indicated_Damage=Caused damage} 0.01241718 0.7507418  6.556522
10 {Aircraft_Type=Airplane,
    Aircraft_Number_of_engines.=1,
    Wildlife_Size=Large}                           => {Effect_Indicated_Damage=Caused damage} 0.01011043 0.7304965  6.379711
From the rules above, we can conclude that damage to the aircraft during bird
strikes is most often associated with:
a) Wildlife_Size = Large
b) Pilot_warned = N
c) Effect_Impact_to_flight = Precautionary Landing
These results can be useful for airline operators and ATC when planning
precautionary measures. For example, several of the high-confidence rules
involve pilots who were not warned by ATC about the possibility of a bird
strike.
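
To make the three columns of the rule listing concrete: support is the share of all instances that match both sides of a rule, confidence is the share of left-hand-side matches that also match the right-hand side, and lift is the confidence divided by the overall share of the right-hand side. The sketch below computes them by hand in base R. The data frame `strikes` is made up for illustration; only the last two numbers are taken from rule 1 above.

```r
# Hypothetical toy data (made-up values): one row per strike.
strikes <- data.frame(
  Wildlife_Size = c("Large", "Large", "Large", "Small", "Small",
                    "Small", "Medium", "Medium", "Small", "Large"),
  Damage        = c(TRUE, TRUE, FALSE, FALSE, FALSE,
                    FALSE, TRUE, FALSE, FALSE, TRUE)
)

lhs <- strikes$Wildlife_Size == "Large"   # left-hand side of the rule
rhs <- strikes$Damage                     # right-hand side of the rule

support    <- mean(lhs & rhs)             # share of rows matching both sides
confidence <- sum(lhs & rhs) / sum(lhs)   # share of lhs rows that also match rhs
lift       <- confidence / mean(rhs)      # confidence relative to the overall rhs rate
print(c(support = support, confidence = confidence, lift = lift))

# Because lift = confidence / P(rhs), the overall damage rate implied by
# rule 1 in the listing above can be recovered from its reported values.
implied_damage_rate <- 0.8265683 / 7.218743
print(round(implied_damage_rate, 4))
```

Read this way, a lift of about 7 says that the rule's confidence is roughly seven times the overall share of strikes that caused damage.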
To see which species cause damage most often, we run another, smaller Apriori
analysis on Wildlife_Species and Effect_Indicated_Damage:
> Bird_Type <- data.frame(Bird_Strikes$Effect_Indicated_Damage,
Bird_Strikes$Wildlife_Species)
> Bird_Type_trans <- as(Bird_Type,"transactions")
> BirdtypeRules <- apriori(Bird_Type_trans, parameter=list(support=.005,
confidence=.4))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.4    0.1    1 none FALSE            TRUE   0.005      1     10  rules FALSE

algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[312 item(s), 19705 transaction(s)] done [0.00s].
sorting and recoding items ... [19 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [18 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].


> BirdRules_caused <- subset(BirdtypeRules, subset = rhs %in%
"Bird_Strikes.Effect_Indicated_Damage=Caused damage" & lift > 1.0)
> inspect(BirdRules_caused)

  lhs                                               rhs                                                      support     confidence lift
1 {Bird_Strikes.Wildlife_Species=Turkey vulture} => {Bird_Strikes.Effect_Indicated_Damage=Caused damage} 0.005531591 0.6158192  5.466089
2 {Bird_Strikes.Wildlife_Species=Canada goose}   => {Bird_Strikes.Effect_Indicated_Damage=Caused damage} 0.009489977 0.6032258  5.354308

The above results show that a bird strike causes damage to the aircraft in
roughly 60% of the cases when Wildlife_Species = Turkey vulture or Canada goose
(confidence of about 0.62 and 0.60 respectively), which is more than five times
the overall damage rate (lift above 5). The ATC can run special operations to
relocate these species from the areas surrounding the airport.
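
With a single attribute on the left-hand side, the confidence of a rule is simply the per-species damage rate, so the same figures can be cross-checked with a contingency table. A minimal sketch follows; the data frame `toy` and its values are made up for illustration.

```r
# Hypothetical toy data (made-up values) reusing two attribute names.
toy <- data.frame(
  Wildlife_Species = c("Canada goose", "Canada goose", "Canada goose",
                       "Turkey vulture", "Turkey vulture",
                       "Mourning dove", "Mourning dove", "Mourning dove",
                       "Mourning dove", "Mourning dove"),
  Effect_Indicated_Damage = c("Caused damage", "Caused damage", "No damage",
                              "Caused damage", "No damage",
                              "No damage", "No damage", "No damage",
                              "Caused damage", "No damage")
)

# Counts of damage / no damage per species.
counts <- table(toy$Wildlife_Species, toy$Effect_Indicated_Damage)

# Row-wise proportions: share of each species' strikes that caused damage.
# This equals the confidence of the rule {species} => {Caused damage}.
damage_rate <- prop.table(counts, margin = 1)[, "Caused damage"]
print(sort(damage_rate, decreasing = TRUE))
```

A table like this also shows how many strikes each rate is based on, which helps when a species has a high damage rate but only a handful of records.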


5. Logistic Regression
To predict whether a bird strike will cause damage, we use logistic regression
on the dataset. Logistic regression is generally used when the dependent
variable is binomial, such as True/False or Yes/No.
In our case, the dependent variable is "Effect_Indicated_Damage" and the
independent variables are the rest of the attributes.
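
As an illustration of the technique, a logistic regression with a binomial dependent variable can be fitted in base R with `glm(..., family = binomial)`. The sketch below is not the model fitted for this report: the data are simulated, the variable `Damage` and the 0.5 cut-off are assumptions made for the example, and only two predictors are used.

```r
set.seed(1)  # make the simulated data reproducible
n <- 200

# Simulated toy data (not the real dataset): two nominal predictors.
toy <- data.frame(
  Wildlife_Size = factor(sample(c("Small", "Medium", "Large"), n, replace = TRUE),
                         levels = c("Small", "Medium", "Large")),
  Pilot_warned  = factor(sample(c("N", "Y"), n, replace = TRUE))
)

# Assumption for the simulation: large wildlife causes damage more often.
p_damage   <- ifelse(toy$Wildlife_Size == "Large", 0.6, 0.1)
toy$Damage <- rbinom(n, size = 1, prob = p_damage)

# Logistic regression: binomial family with the default logit link.
fit <- glm(Damage ~ Wildlife_Size + Pilot_warned, data = toy, family = binomial)
print(summary(fit)$coefficients)

# Predicted probability of damage for each row, then a 0/1 prediction.
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
print(mean(pred == toy$Damage))  # share of rows predicted correctly
```

The coefficients are on the log-odds scale, so `exp(coef(fit))` gives odds ratios relative to the reference levels.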
























6. Classification: Decision Tree

Through this technique, we tried to predict the value (Cause Damage, No Damage) of the
dependent variable Effect_Indicated_Damage. We divided our dataset into two parts: training
data and test data. Using the training data, we built a predictor model with the tree function.
We then predicted the values of Effect_Indicated_Damage in the test dataset by applying the
predictor model with the predict command. This exercise helped us classify strike cases by
whether damage occurred and gauge the efficiency (prediction accuracy) of the predictor model
when it is applied to an unseen dataset.
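
One common way to obtain the two parts is a random split by row index. The sketch below is an assumption about how such a split can be done, not a record of how the training and test files for this report were produced. The data frame `toy` and the 50/50 ratio are made up for illustration.

```r
set.seed(42)  # make the random split reproducible

# Hypothetical toy data (made-up values).
toy <- data.frame(
  Record_ID = 1:10,
  Effect_Indicated_Damage = rep(c("Caused damage", "No damage"), times = c(2, 8))
)

# Draw half of the row indices at random for training; the rest is test data.
train_idx <- sample(nrow(toy), size = floor(0.5 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(nrow(train))
print(nrow(test))
```

Keeping the two parts disjoint is what allows the test data to stand in for an unseen dataset.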

R commands:
> install.packages("rpart")
> install.packages("tree")
> library(rpart)
> library(tree)
> Bird_Training_DataSet$Aircraft_Type <- as.numeric(Bird_Training_DataSet$Aircraft_Type)
> Bird_Training_DataSet$Altitude_bin <- as.numeric(Bird_Training_DataSet$Altitude_bin)
> Bird_Training_DataSet$Aircraft_Model <- as.numeric(Bird_Training_DataSet$Aircraft_Model)
> Bird_Training_DataSet$Wildlife_Number_struck <-
as.numeric(Bird_Training_DataSet$Wildlife_Number_struck)
> Bird_Training_DataSet$Effect_Impact_to_flight <-
as.numeric(Bird_Training_DataSet$Effect_Impact_to_flight)
> Bird_Training_DataSet$Effect_Indicated_Damage <-
as.numeric(Bird_Training_DataSet$Effect_Indicated_Damage)
> Bird_Training_DataSet$Aircraft_Number_of_engines. <-
as.numeric(Bird_Training_DataSet$Aircraft_Number_of_engines.)
> Bird_Training_DataSet$Origin_State <- as.numeric(Bird_Training_DataSet$Origin_State)
> Bird_Training_DataSet$When_Phase_of_flight <-
as.numeric(Bird_Training_DataSet$When_Phase_of_flight)
> Bird_Training_DataSet$Conditions_Precipitation <-
as.numeric(Bird_Training_DataSet$Conditions_Precipitation)
> Bird_Training_DataSet$Wildlife_Size <- as.numeric(Bird_Training_DataSet$Wildlife_Size)
> Bird_Training_DataSet$Conditions <- as.numeric(Bird_Training_DataSet$Conditions)
> Bird_Training_DataSet$Wildlife_Species <-
as.numeric(Bird_Training_DataSet$Wildlife_Species)
> Bird_Training_DataSet$Pilot_warned <- as.numeric(Bird_Training_DataSet$Pilot_warned)
> Bird_Training_DataSet$Feet_above_ground <-
as.numeric(Bird_Training_DataSet$Feet_above_ground)
> Bird_Training_DataSet$Speed <- as.numeric(Bird_Training_DataSet$Speed)

> Bird_DTModel <- tree(Effect_Indicated_Damage~.-Aircraft_Airline_Operator-Record_ID-
Origin_State-Conditions_Precipitation-Aircraft_Type-X.1-Conditions-Speed-
X,Bird_Training_DataSet)
> summary(Bird_DTModel)
Record_ID - Origin_State - Conditions_Precipitation - Aircraft_Type -
X.1 - Conditions - Speed - X, data = Bird_Training_DataSet)
Variables actually used in tree construction:
[1] "Wildlife_Size" "Effect_Impact_to_flight" "Wildlife_Species"
Number of terminal nodes: 6
Residual mean deviance: 0.08454 = 844.2 / 9985
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.96440 0.03564 0.03564 0.00000 0.14980 0.80840

After creating the model with the training data, we apply it to the test data to determine its
efficiency:

> Bird_Test_DataSet <- read.csv('Bird_Test_Data.csv')
> Bird_Test_DataSet$Aircraft_Type <- as.numeric(Bird_Test_DataSet$Aircraft_Type)
> Bird_Test_DataSet$Altitude_bin <- as.numeric(Bird_Test_DataSet$Altitude_bin)
> Bird_Test_DataSet$Aircraft_Model <- as.numeric(Bird_Test_DataSet$Aircraft_Model)
> Bird_Test_DataSet$Wildlife_Number_struck <-
as.numeric(Bird_Test_DataSet$Wildlife_Number_struck)
> Bird_Test_DataSet$Effect_Impact_to_flight <-
as.numeric(Bird_Test_DataSet$Effect_Impact_to_flight)
> Bird_Test_DataSet$Effect_Indicated_Damage <-
as.numeric(Bird_Test_DataSet$Effect_Indicated_Damage)
> Bird_Test_DataSet$Aircraft_Number_of_engines. <-
as.numeric(Bird_Test_DataSet$Aircraft_Number_of_engines.)
> Bird_Test_DataSet$Origin_State <- as.numeric(Bird_Test_DataSet$Origin_State)
> Bird_Test_DataSet$When_Phase_of_flight <-
as.numeric(Bird_Test_DataSet$When_Phase_of_flight)
> Bird_Test_DataSet$Conditions_Precipitation <-
as.numeric(Bird_Test_DataSet$Conditions_Precipitation)
> Bird_Test_DataSet$Wildlife_Size <- as.numeric(Bird_Test_DataSet$Wildlife_Size)
> Bird_Test_DataSet$Conditions <- as.numeric(Bird_Test_DataSet$Conditions)
> Bird_Test_DataSet$Wildlife_Species <- as.numeric(Bird_Test_DataSet$Wildlife_Species)
> Bird_Test_DataSet$Pilot_warned <- as.numeric(Bird_Test_DataSet$Pilot_warned)
> Bird_Test_DataSet$Feet_above_ground <-
as.numeric(Bird_Test_DataSet$Feet_above_ground)
> Bird_Test_DataSet$Speed <- as.numeric(Bird_Test_DataSet$Speed)

> Bird_PredModel <- predict(Bird_DTModel,Bird_Test_DataSet)
> summary(Bird_PredModel)

> Bird_PredModel <- cut(Bird_PredModel, breaks=c(1,1.895,2), labels=c("Cause Damage","No
Damage"), include.lowest=TRUE) #include.lowest keeps predictions of exactly 1

To calculate the efficiency, we assigned the results of Bird_PredModel to a new variable,
Damage:
> Damage <- Bird_PredModel

We also assigned the values of the column 'Effect_Indicated_Damage' to another variable, Results:
> Results <- Bird_Test_DataSet$Effect_Indicated_Damage

To compute the efficiency of the predictor model (the share of correct predictions), the
following command is executed:
> mean(Results == as.numeric(Damage)) #Results holds numeric codes, so compare with the factor's codes
[1] 0.7425304
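
A single accuracy figure is easier to interpret next to a confusion matrix and a simple baseline, especially when one class is much rarer than the other. The sketch below uses made-up vectors `actual` and `predicted`; it is not the output of the model above.

```r
lv <- c("Cause Damage", "No Damage")

# Hypothetical toy labels (made-up values): 3 damaging strikes out of 10.
actual <- factor(c("No Damage", "No Damage", "Cause Damage", "No Damage",
                   "Cause Damage", "No Damage", "No Damage", "No Damage",
                   "Cause Damage", "No Damage"), levels = lv)
predicted <- factor(c("No Damage", "Cause Damage", "Cause Damage", "No Damage",
                      "No Damage", "No Damage", "No Damage", "Cause Damage",
                      "Cause Damage", "No Damage"), levels = lv)

# Rows are predictions, columns are actual values.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Baseline: accuracy of always predicting the most frequent actual class.
baseline <- max(table(actual)) / length(actual)
print(c(accuracy = accuracy, baseline = baseline))
```

In this toy case the model and the always-predict-the-majority baseline reach the same accuracy, which is why comparing the two is a useful check before relying on the accuracy figure alone.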


Explanation for converting columns into numeric types

When the tree command was executed without converting the columns to numeric type, we got the
following error:

> Bird_DTModel_Tree <- tree(Effect_Indicated_Damage~.-Aircraft_Airline_Operator-Record_ID-
Origin_State-Conditions_Precipitation-Aircraft_Type-X.1-Conditions-Speed-
X,Bird_Training_DataSet)
Error in tree(Effect_Indicated_Damage ~ . - Aircraft_Airline_Operator - :
factor predictors must have at most 32 levels

Therefore, we converted all the columns holding nominal data to numeric before executing the
tree command or the predict command.
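
One point to keep in mind with this conversion: as.numeric() on a factor returns the position of each value in the factor's level list, and that list is built separately for each data frame. If the training and test files do not contain exactly the same set of values, the same category can receive different codes in the two sets. The sketch below illustrates this with made-up vectors and shows one possible remedy, a shared level list; it is an illustration, not the procedure used in this report.

```r
# Hypothetical toy vectors (made-up values): the test data lacks one species.
train_species <- factor(c("Canada goose", "Mourning dove", "Turkey vulture"))
test_species  <- factor(c("Mourning dove", "Turkey vulture"))

# Converted separately, the codes follow each factor's own level order,
# so "Mourning dove" gets a different code in the two sets.
train_codes_separate <- as.numeric(train_species)
test_codes_separate  <- as.numeric(test_species)
print(train_codes_separate)
print(test_codes_separate)

# Possible remedy: build one level list from both sets and reuse it.
all_levels  <- union(levels(train_species), levels(test_species))
train_codes <- as.numeric(factor(train_species, levels = all_levels))
test_codes  <- as.numeric(factor(test_species,  levels = all_levels))
print(train_codes)
print(test_codes)
```

Note also that the resulting codes only reflect the order of the level list, so a tree split such as "code < 2.5" groups categories by that order rather than by any property of the species.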

An efficiency of about 74% on the test data suggests that the predictor model can be used to
predict the dependent variable Effect_Indicated_Damage for an unseen dataset.









