
Abstract

With an expected shipment of 1 billion Android devices in 2017, cyber criminals have
naturally extended their malicious activities to Google’s mobile operating system:
threat researchers report an alarming increase in detected Android malware from 2014
to 2015. To keep some control over the estimated 700 new Android applications that
are released every day, there is a need for a form of automated analysis that quickly
detects and isolates new malware instances.
Android is an open-source, Linux-based mobile operating system distributed by
Google. According to the latest statistics, Android powers hundreds of millions of
mobile devices in more than 190 countries [26]. Google Play [51] is the official
centralized Android marketplace maintained by Google, where any independent
developer can submit an Android app and make it available to users. The growing
popularity of the Android ecosystem also makes it a worthwhile target for security
and privacy violations. Highly sensitive and confidential information such as text
messages, private and business contacts, and calendar data may be leaked through an
application. Sensors such as GPS allow applications to provide a context-sensitive
user experience, but they also create additional privacy concerns, because an
application can exploit the data for tracking or monitoring. Apart from these issues,
smartphones are also susceptible to various malware threats such as viruses, Trojan
horses, and worms [50].
The Android security model relies heavily on a permission-based mechanism. About
130 permissions govern access to different resources. Whenever a user tries to
install a new application, he/she is prompted to approve or reject all the
permissions requested by the application. The application is installed only after
the user accepts all the requested permissions.
In this work, we use permissions and API-level information from the apps as
features to detect malicious applications. Further, we observe that the Android store
[51] defines a category for every published application. Our extensive studies show
that certain categories are far more prone to malicious acts than others. We
explicitly incorporate this information in our model and learn a naive Bayes
classifier for each category, using features that encode information about
permissions and API calls. Given a new test application with a known category, we
apply the appropriate classifier to detect whether the application is malicious. We
created a large data set of Android applications and achieve an improvement of 3-4%
by incorporating category-level information.
Secondly, we combine association rule mining and classification rule mining
techniques to build a classifier. The integration is done by focusing on mining a
special subset of association rules, called class association rules (CARs). To select
the best features that distinguish malware from benign apps, we rely on API-level
information within the bytecode, since it conveys substantial semantics about an
app’s behaviour. More specifically, we focus on critical API calls and their
package-level information.
Rather than simply treating individual API calls as items, we represent an item as
a combination of caller and callee API. We capture one level of control flow and
context between caller and callee. Each item in our model is of the form A%B, where
A is the caller and B is the callee. We use Androguard [8], a reverse engineering
tool, to perform API-level feature extraction and data flow analysis. In summary,

• We combine association rule mining and classification rule mining for Android
  malware detection.

• We achieved a detection rate of 85% over the baseline classifier of 0.69%.


Contents

1 Introduction
   1.1 Motivation
   1.2 Problem Statement
   1.3 Contributions
   1.4 Outline
2 Background Information
   2.1 Android System Architecture
      2.1.1 Linux kernel
      2.1.2 Libraries
      2.1.3 Android runtime
      2.1.4 Application framework
      2.1.5 Applications
   2.2 Dalvik Virtual Machine
      2.2.1 Hardware constraints
      2.2.2 Bytecode
   2.3 Apps
      2.3.1 Application components
      2.3.2 Manifest
      2.3.3 Native code
      2.3.4 Distribution
   2.4 Malware
      2.4.1 Types of malware
      2.4.2 Malware distribution
      2.4.3 Malware data sets
   2.5 Summary
3 Related Work
4 Android Malware Detection Using Permissions and API Calls
   4.1 Android Application Structure
      4.1.1 Android Security Approach
      4.1.2 Android Permission Setting
   4.2 Static analysis
      4.2.1 Androguard
      4.2.2 APK Tool
      4.2.3 Monkey and Monkeyrunner
      4.2.4 AndroViewClient
   4.3 Dynamic Analysis
      4.3.1 Droidbox
      4.3.2 Taintdroid
   4.4 Reverse engineering Android App
      4.4.1 Static Feature Extraction and Refinement
   4.5 Proposed Method
      4.5.1 Application categories
      4.5.2 Bayesian Classification Model
   4.6 Experiment Results and Discussions
      4.6.1 Data set
      4.6.2 Evaluation measures
   4.7 Summary
5 Android Malware Detection using Association Rule based Classification
   5.1 Introduction
      5.1.1 Problem Statement
   5.2 Background Information
      5.2.1 Association Rules
      5.2.2 Apriori Algorithm
      5.2.3 The Frequent Itemset Mining Stage
      5.2.4 The Rule Generation Stage
      5.2.5 Reverse engineering Android App
      5.2.6 Generation of items
   5.3 Proposed Method
      5.3.1 Feature Extraction and Refinement
      5.3.2 Classification rule mining
      5.3.3 Generating the Complete Set of CARs
      5.3.4 The CBA-RG algorithm
      5.3.5 Building a Classifier
   5.4 Experimental Results and Discussions
      5.4.1 Data set
      5.4.2 Evaluation measures
   5.5 Conclusion
6 Conclusions and Future Work

Bibliography

List of Figures

1.1 Android malware growth in 2014
2.1 Android low level system architecture
2.2 Android application build process
4.1 Android Folder Structure
4.2 AndroidManifest
4.3 Taintdroid Architecture
4.4 Different Stages in Feature Extraction
4.5 Top 20 permissions & API calls with the Highest Difference Between Malware and
    Benign Apps
4.6 Category
4.7 Error Rate & Accuracy
4.8 True Negative & False Positive Rate
4.9 True Positive & False Negative Rate
4.10 Precision
4.11 Recall
5.1 Apriori algorithm
5.2 Different Stages in Feature Extraction
5.3 Top 20 APIs with the Highest Difference Between Malware and Benign Apps
5.4 Precision & Recall
5.5 TP Rate & FP Rate
5.6 Android malware detection analysis with different classifiers from the
    Precision-Recall view

Chapter 1

Introduction

1.1 Motivation

With an estimated market share of 70% to 80%, Android has become the
most popular operating system for smartphones and tablets [4]. Within the
past several years, the popularity of smartphones and other mobile devices
such as tablets has risen significantly. This rise is accompanied by a large
number and variety of mobile applications (typically abbreviated as apps)
and by the increased functionality of the mobile devices themselves.
Several mobile operating systems are available, with iOS and Android being
the most popular ones according to the latest studies. As a side effect of this
popularity, centralized application marketplaces like Google Play and Apple’s
App Store have grown massively. Such marketplaces enable developers to
upload their own applications conveniently, and users can download these
apps directly to their mobile devices. Besides the official markets from
platform vendors (e.g., Google and Apple) and manufacturers (e.g., Samsung
and HTC), a large number of unofficial third-party marketplaces have
emerged. Most of these markets contain thousands of apps and serve millions
of app downloads per month.
This fast growth also has a downside: attackers have realized that rogue
apps can be used to target smartphones, and in the recent past malicious
software for smartphones has become widespread. Mobile threat researchers
indeed recognize an alarming increase of Android malware from 2013 to 2014
and estimate that the number of detected malicious apps is now in the range
of 120,000 to 718,000 [3]. In the summer of 2012, the sophisticated
Eurograbber attack showed that mobile malware may be a very lucrative
business by stealing an estimated €36,000,000 from bank customers in Italy,
Germany, Spain and the Netherlands [1].
Android’s open design allows users to install applications that do not
necessarily originate from the Google Play Store. With over 1 million apps
available for download via Google’s official channel, and possibly another
million spread among third-party app stores, we can estimate that there are
over 20,000 new applications being released every month. This requires
malware researchers and app store administrators to have access to a scalable
solution for quickly analyzing new apps and identifying and isolating
malicious applications.

Google reacted to the growing interest of miscreants in Android by
revealing Bouncer in February 2012, a service that checks apps submitted to
the Google Play Store for malware. However, research has shown that
Bouncer’s detection rate is still fairly low and that it can easily be bypassed
[32]. A large body of similar research on Android malware has been
published, but none of these works provides a comprehensive solution for
obtaining a thorough understanding of unknown applications: Blasing et al.
[13] limit their research to system call analysis, Enck et al. [22] focus on
taint tracking, Rastogi et al. [42] and Spreitzenbarth et al. [49] track only
specific API invocations, and the work by Yan and Yin [56] is bound to an
emulator.

1.2 Problem Statement


A recent report has shown that about 850,000 Android apps are currently
available on the market, 22% of which are low-quality apps [55]. The
popularity of the Android system has led to a huge increase in the spread of
Android malware, as shown in Figure 1.1, which illustrates the growth of
Android malware up to 2014. This malware is mainly distributed in markets
operated by third parties, but even the Google Android Market cannot
guarantee that all of its listed applications are threat free. Examples of
Android malware include phishing apps, banking Trojans, spyware, bots, root
exploits, SMS fraud, premium dialers and fake installers. Download Trojans
are apps that download their malicious code after installation, which means
that they cannot easily be detected by Google’s technology during
publication in the Google Android Market.
Most malware detection methods are based on traditional content-signature
approaches: they maintain a list of malware signature definitions and
compare each application against this database of known malware
signatures. The disadvantage of this detection method is that users are
protected only from malware covered by the most recently updated
signatures, not from new malware (i.e., zero-day attacks). A previous study
of malicious patterns concluded that “signature-based approaches never keep
up with the speed at which malware is created and evolved” [15]. In this
thesis, our goal is to find a solution that can process an application, extract
features, and predict whether the application is malware or benign.
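
The limitation of signature matching can be illustrated with a minimal sketch. It assumes, purely for illustration, that a "signature" is the SHA-256 digest of the package bytes; real scanners use richer signature definitions, and the sample contents below are hypothetical.

```python
import hashlib


def signature(apk_bytes):
    """Assumed signature: the SHA-256 digest of the raw package bytes."""
    return hashlib.sha256(apk_bytes).hexdigest()


def is_known_malware(apk_bytes, signature_db):
    """Flag a package only if its signature is already in the database."""
    return signature(apk_bytes) in signature_db


# Hypothetical samples: a known malicious package and a slightly changed variant.
known_sample = b"malicious payload v1"
variant = b"malicious payload v2"
signature_db = {signature(known_sample)}

if __name__ == "__main__":
    print(is_known_malware(known_sample, signature_db))  # known sample is flagged
    print(is_known_malware(variant, signature_db))       # unseen variant is missed
```

Any change to the package yields a different digest, so the variant passes unnoticed until the database is updated. This is the zero-day gap described above.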

1.3 Contributions
Given the rapid growth of malicious apps and the disappointing results of
current security software [60], there is a pressing need to develop an
effective solution to deal with malware. Instead of using static signatures, an
effective alternative is to use characteristic- and heuristic-based methods,
which try to detect malware by observing the statistical characteristics and
features of mobile applications. One of the most popular heuristic methods is
malware detection based on statically requested permissions, which checks
what types of resources, such as Wi-Fi network, user location, and user
contact information, an app requests at installation (Android provides over
130 permissions for developers to control the resources that an app can
request [30]).

Figure 1.1: Android malware growth in 2014

Although purely permission-based methods are simple and have shown
moderate results, their performance is not reliable, mainly because
developers can freely request any permission they want and can thus mimic
the requested permissions of benign applications. On the other hand,
observing dynamic behaviors of apps, such as dynamic API calls, captures the
runtime activities of an app far more accurately than permission-based
methods. However, analyzing an app’s runtime behavior is not simple and
requires a large volume of resources, implying overhead and processing time.
Motivated by these observations, we propose a framework for analyzing and
classifying Android applications based on learning techniques. The
framework combines requested permissions and static API call behaviors,
extracts features from them, and builds classifiers to detect malicious
applications.
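
As a rough illustration of building a classifier from such features, the sketch below trains a naive Bayes model on binary features (the presence or absence of a permission or API call). The Bernoulli variant with Laplace smoothing and the four toy training samples are assumptions made for this example; they are not the data set or the exact model evaluated in this thesis, which trains one classifier per application category.

```python
import math
from collections import defaultdict


def train(samples):
    """samples: list of (set_of_features, label) pairs. Returns a model dict."""
    vocab = set()
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    for feats, label in samples:
        vocab |= feats
        class_count[label] += 1
        for f in feats:
            feat_count[label][f] += 1
    return {"vocab": vocab, "class_count": dict(class_count),
            "feat_count": feat_count, "n": len(samples)}


def predict(model, feats):
    """Return the label with the highest log-posterior for the feature set."""
    best_label, best_score = None, -math.inf
    for label, n_c in model["class_count"].items():
        score = math.log(n_c / model["n"])  # log prior
        for f in model["vocab"]:
            # Laplace-smoothed probability that feature f is present in class.
            p = (model["feat_count"][label][f] + 1) / (n_c + 2)
            score += math.log(p if f in feats else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (invented for illustration only).
TRAINING = [
    ({"SEND_SMS", "READ_CONTACTS", "INTERNET"}, "malware"),
    ({"SEND_SMS", "INTERNET"}, "malware"),
    ({"INTERNET"}, "benign"),
    ({"INTERNET", "ACCESS_FINE_LOCATION"}, "benign"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(predict(model, {"SEND_SMS", "INTERNET"}))
```

Keeping one such model per application category would amount to storing the trained models in a dictionary keyed by category and selecting the model by the category of the app under test.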
1.4 Outline

The remainder of this document is organized as follows. In Chapter 2, we
provide an introduction to the Android architecture and outline the techniques
used by mobile malware authors. Chapter 3 deals with related research
efforts; we look closely at related work that uses static and dynamic analysis.
In Chapter 4, we discuss how to distinguish between a genuine app and a
malware app by exploiting the category information. In Chapter 5, we discuss
classifying Android apps through association rule mining. Finally, in Chapter
6, we propose a number of future research directions and possible extensions
to our implementations and conclude our work.

Chapter 2

Background Information

Before we discuss the details of our analysis framework, it is important to
understand how Android and Android applications work. In this chapter, we
provide a short introduction to the Android architecture.
We start with a high-level overview of the Android system architecture in
Section 2.1. In this section, we describe the implementation design of
Android and discuss its various component layers.
Section 2.2 discusses in detail the virtual machine that is responsible for
executing Android applications.
An overview of the core components found in Android applications is
given in Section 2.3. This section discusses activities, services, receivers,
and intents, the building blocks of Android applications.
Finally, in Section 2.4, we briefly discuss how Android malware takes
advantage of the Android platform.

2.1 Android System Architecture

The Android software stack is illustrated in Figure 2.1. In this figure, green
items are components written in native code (C/C++), while blue items are
Java components interpreted and executed by the Dalvik Virtual Machine.
The bottom red layer represents the Linux kernel components and runs in
kernel space.
In the following subsections, we briefly discuss the various abstraction
layers using a bottom-up approach. For a more detailed overview, we refer to
existing studies [20].

2.1.1 Linux kernel

Android uses a specialized version of the Linux Kernel with a few special
additions. These include wakelocks (mechanisms to indicate that apps need to
have the device stay on), a memory management system that is more
aggressive in preserving memory, the Binder IPC driver, and other features
that are important for a mobile embedded platform like Android.

Figure 2.1: Android low level system architecture

2.1.2 Libraries
A set of native C/C++ libraries is exposed to the Application Framework
and Android Runtime via the Libraries component. These are mostly external
libraries with only very minor modifications, such as OpenSSL, WebKit and
bzip2. The essential C library, codenamed Bionic, was ported from BSD’s
libc and rewritten to support ARM hardware and Android’s own
implementation of pthreads based on Linux futexes.

2.1.3 Android runtime


The middleware component called Android Runtime consists of the Dalvik
Virtual Machine (Dalvik VM or DVM) and a set of Core Libraries. The
Dalvik VM is responsible for the execution of applications that are written in
the Java programming language and is discussed in more detail in Section
2.2. The core libraries are an implementation of general purpose APIs and can
be used by the applications executed by the Dalvik VM. Android
distinguishes two categories of core libraries.

1. Dalvik VM-specific libraries.

2. Java programming language interoperability libraries.

The first set allows processing or modifying VM-specific information
and is mainly used when bytecode needs to be loaded into memory. The
second category provides the familiar environment for Java programmers and
comes from Apache Harmony. It implements most of the popular Java
packages such as java.lang and java.util.

2.1.4 Application framework


The Application Framework provides high-level building blocks to
applications in the form of various android.* packages. Most components in
this layer are implemented as applications and run as background processes
on the device.
Some components are responsible for managing basic phone functions like
receiving phone calls or text messages or monitoring power usage. A couple
of components deserve a bit more attention:

1. Activity Manager: The Activity Manager (AM) is a process-like
   manager that keeps track of active applications. It is responsible for
   killing background processes if the device is running out of memory. It
   also has the capability to detect unresponsive applications when an app
   does not respond to an input event within 5 seconds (such as a key press
   or screen touch). It then prompts an Application Not Responding (ANR)
   dialog.

2. Content Providers: Content Providers are one of the primary building
   blocks for Android applications. They are used to share data between
   multiple applications. Contact list data, for example, can be accessed by
   multiple applications and must thus be stored in a content provider.

3. Telephony Manager: The Telephony Manager provides access to
   information about the telephony services on the device such as the
   phone’s unique device identifier (IMEI) or the current cell location. It is
   also responsible for managing phone calls.

4. Location Manager: The Location Manager provides access to the
   system location services which allow applications to obtain periodic
   updates of the device’s geographical location by using the device’s GPS
   sensor.

2.1.5 Applications

Applications or apps are built on top of the Application Framework and are
responsible for the interaction between end-users and the device. It is unlikely
that an average user ever has to deal with components outside this layer.
Pre-installed applications cover a number of basic tasks a user would like to
perform (making phone calls, browsing the web, reading e-mail, etc.), but
users are free to install third-party applications to use other features (e.g.,
play games, watch videos, read news, use GPS navigation, etc.). We discuss
Android applications in more detail in Section 2.3.

2.2 Dalvik Virtual Machine

Android’s design encourages software developers to write applications that
offer users extra functionality. Google decided to use Java as the platform’s
main programming language as it is one of the most popular languages: Java
has been the number one programming language almost continuously over
the last decade, and a large number of development tools are available for it
(e.g., Eclipse and NetBeans). Java source code is normally compiled to and
distributed as Java bytecode which, at run-time, is interpreted and executed
by a Virtual Machine (VM). For Android, however, Google decided to use a
different bytecode and VM format named Dalvik. During the compilation
process of Android applications, Java bytecode is converted to Dalvik
bytecode, which can later be executed by the specially designed Dalvik VM.
Since a large part of our contributions involves modifying the Dalvik VM,
we now discuss it in a bit more detail.

2.2.1 Hardware constraints

The Android platform was specifically designed to run on mobile devices
and thus has to overcome some challenging hardware restrictions when
compared to regular desktop operating systems: mobile phones are limited in
size and are powered only by a battery. Due to this mobile character, initial
devices contained a relatively slow CPU and had only a small amount of RAM
left once the system was booted. Despite these limited specifications, the
Android platform does rely on modern OS principles: each application is
supposed to run in its own process and has its own memory space, which
means that each application should run in its own VM.
It was argued that the hardware constraints made it hard to fulfill the
security requirements using existing Java virtual machines [2]. To overcome
these issues, Android uses the Dalvik VM. A special instance of the DVM is
started at boot time and becomes the parent of all future VMs. This VM is
called the Zygote process; it preloads and pre-initializes all system classes
(the core libraries discussed in Section 2.1.3). Once started, it listens on a
socket and fork()s on command whenever a new application start is
requested. Using fork() instead of starting a new VM from scratch shortens
the startup time, and by sharing the memory pages that contain the preloaded
system classes, Android also reduces the memory footprint of running
applications.
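
The preload-then-fork idea can be sketched on any Unix-like system. The example below is only an analogy under stated assumptions: the `PRELOADED` dictionary is a hypothetical stand-in for the preloaded system classes, and the parent forks directly on a function call, whereas the real Zygote waits for requests on a socket.

```python
import os

# Hypothetical stand-in for the system classes that Zygote preloads once.
PRELOADED = {}


def preload():
    """Expensive initialization, done once in the parent process."""
    PRELOADED["core"] = "loaded once in the parent"


def spawn_app(name):
    """Fork a child that inherits the preloaded state and reports it back."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the "new application" process. It sees PRELOADED without
        # repeating the initialization work.
        os.close(read_fd)
        os.write(write_fd, f"{name}: {PRELOADED['core']}".encode())
        os.close(write_fd)
        os._exit(0)  # leave immediately, skipping the parent's cleanup handlers
    # Parent: read the child's report and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return data.decode()


if __name__ == "__main__":
    preload()
    print(spawn_app("mail"))
```

Because fork() duplicates the parent's address space with copy-on-write pages, the child can read the preloaded data immediately, and pages that are never written stay shared between processes.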
Furthermore, as opposed to regular stack-based virtual machines (a
mechanism that can be ported to any platform), the DVM is register-based and
is designed specifically to run on ARM processors. This allowed the VM
developers to add more speed optimizations.

2.2.2 Bytecode

The bytecode interpreted by the DVM is so-called DEX bytecode (Dalvik
EXecutable code). DEX code is obtained by converting Java bytecode using
the dx tool. The main difference between the DEX file format and Java
bytecode is that all code is repacked into one output file (classes.dex), while
duplicate function signatures, string values and code blocks are removed.
Naturally, this results in the use of more pointers within DEX bytecode than
in Java .class files. In general, however, .dex files are about 5% smaller than
their compressed .jar counterparts.
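
The effect of removing duplicates while repacking can be sketched with a shared constant pool. This is a simplified model, not the actual DEX file layout: the class names and constants are hypothetical, and the pool simply keeps strings in first-seen order.

```python
def build_shared_pool(class_constants):
    """Merge per-class constants into one pool of unique strings.

    class_constants maps a class name to the list of strings it uses.
    Returns (pool, refs), where refs maps each class name to a list of
    indices ("pointers") into the shared pool.
    """
    pool, index, refs = [], {}, {}
    for cls, constants in class_constants.items():
        refs[cls] = []
        for s in constants:
            if s not in index:       # store each distinct string only once
                index[s] = len(pool)
                pool.append(s)
            refs[cls].append(index[s])
    return pool, refs


# Hypothetical constants of two classes that share two strings.
CLASSES = {
    "A": ["Ljava/lang/String;", "toString", "hello"],
    "B": ["Ljava/lang/String;", "toString", "world"],
}

if __name__ == "__main__":
    pool, refs = build_shared_pool(CLASSES)
    print(pool)
    print(refs)
```

The two classes use six strings in total, but the shared pool stores only four; each class then refers to its strings by index, which is the extra use of pointers mentioned above.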
It is worth mentioning that during the installation of an Android
application, the included classes.dex file is verified and optimized by the OS.
Verification is done to reduce runtime bugs and to make sure that the program
cannot misbehave. Optimization involves static linking, inlining of special
(native) methods (e.g. calls to equals()), and pruning empty methods.

2.3 Apps

Android applications are distributed as Android Package (APK) files. APK
files are signed ZIP files that contain the app’s bytecode along with all its
data, resources, third-party libraries and a manifest file that describes the
app’s capabilities. Figure 2.2 shows the simplified process of how Java
source code projects are translated to APK files (image source: Stack
Overflow).
To improve security, apps run in a sandboxed environment. During
installation, applications receive a unique Linux user ID from the Android
OS. Permissions for files in an application are then set so that only the
application itself has access to them. Additionally, when started, each
application is granted its own VM, which means that code is isolated from
other applications.
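
Because an APK is a ZIP archive, its contents can be listed with standard ZIP tooling. The sketch below builds a toy archive in memory and lists its entries; the archive is unsigned and its file contents are placeholders, so it only mirrors the layout of an APK and is not an installable package.

```python
import io
import zipfile


def make_toy_apk():
    """Build an in-memory ZIP with APK-like entry names (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"<manifest/>")
        archive.writestr("classes.dex", b"placeholder bytecode")
        archive.writestr("res/layout/main.xml", b"<LinearLayout/>")
    return buf.getvalue()


def list_entries(apk_bytes):
    """Return the entry names stored in the archive, in archive order."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    for entry in list_entries(make_toy_apk()):
        print(entry)
```

The same `zipfile.ZipFile` call also opens an APK file from disk when given its path, which is usually the first step of static feature extraction.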

Figure 2.2: Android application build process

2.3.1 Application components

We now outline a number of core application components that are used to
build Android apps. For more information on Android application
fundamentals, we refer to the official documentation.

Activities

An activity represents a single screen with a particular user interface. Apps
are likely to have a number of activities, each with a different purpose. A
music player, for instance, might have one activity that shows a list of
available albums and another activity that shows the song currently being
played, with buttons to pause, enable shuffle, or fast forward. Each activity is
independent of the others and, if allowed by the app, can be started by other
applications. An e-mail client, for example, might start the music app’s play
activity to begin playback of a received audio file.

Services

Services are components that run in the background to perform
long-running operations and do not provide a user interface. The music
application, for example, will have a music service that is responsible for
playing music in the background while the user is in a different application.
Services can be started by other components of the app such as an activity or
a broadcast receiver.

Content providers

Content providers are used to share data between multiple applications.
They manage a shared set of application data. Contact information, for
example, is stored in a content provider so that other applications can query
it when necessary. A music player may use a content provider to store
information about the current song being played, which could then be used by
a social media app to update a user’s “current listening” status.

Broadcast receivers

A broadcast receiver listens for specific system-wide broadcast
announcements and can react to them. Most broadcasts
are initiated by the system and announce that, for example, the system
completed the boot procedure, the battery is low, or an incoming SMS text
message was received. Broadcast receivers do not have a user interface and
are generally used to act as a gateway to other components. They might, for
example, initiate a background service to perform some work based on a
specific event. Two types of broadcasts are distinguished: non-ordered and
ordered. Non-ordered broadcasts are sent to all interested receivers at the same
time. This means that a receiver cannot interfere with other receivers. An
example of such a broadcast is the battery low announcement. Ordered broadcasts, on the
other hand, are first passed to the receiver with the highest priority, before
being forwarded to the receiver with the second highest priority, and so on. An
example is the incoming SMS text message announcement. Broadcast
receivers that receive ordered broadcasts can, when done processing the
announcement, decide to abort the broadcast so that it is not forwarded to
other receivers. In the example of incoming text messages, this allows
vendors to develop an alternative text message manager that can disable the
existing messaging application by simply using a higher priority receiver and
aborting the broadcast once it has finished handling the incoming message.
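
The difference between the two delivery modes can be illustrated with a small simulation. This is only a sketch in plain Python, not Android code: the Receiver class, the priorities, and the receiver names are hypothetical and chosen to mirror the text message example above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Receiver:
    name: str
    priority: int
    # The handler returns True to request an abort (honoured only when ordered).
    handle: Callable[[str], bool]


def send_broadcast(receivers: List[Receiver], message: str) -> List[str]:
    """Non-ordered: every receiver gets the message; nobody can abort it."""
    delivered = []
    for receiver in receivers:      # iteration order carries no meaning here
        receiver.handle(message)    # the return value is deliberately ignored
        delivered.append(receiver.name)
    return delivered


def send_ordered_broadcast(receivers: List[Receiver], message: str) -> List[str]:
    """Ordered: highest priority first; a receiver may abort the chain."""
    delivered = []
    for receiver in sorted(receivers, key=lambda r: r.priority, reverse=True):
        delivered.append(receiver.name)
        if receiver.handle(message):  # True means "abort the broadcast"
            break
    return delivered


# Hypothetical receivers: the alternative SMS app outranks the stock one.
stock_sms_app = Receiver("stock_sms_app", priority=0, handle=lambda m: False)
alt_sms_app = Receiver("alt_sms_app", priority=100, handle=lambda m: True)

print(send_broadcast([stock_sms_app, alt_sms_app], "battery low"))
print(send_ordered_broadcast([stock_sms_app, alt_sms_app], "incoming SMS"))
```

In the ordered case only the higher priority receiver sees the message, which is exactly the behaviour an SMS-intercepting app relies on.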

Intents

Activities, services and broadcast receivers are activated by an
asynchronous message called an intent. For activities and services, intents
define an action that should be performed (e.g., view or send). They may
include additional data that specifies what to act on. A music player
application, for example, may send a view intent to a browser component to
open a webpage with information on the currently selected artist. For
broadcast receivers, the intent simply defines the current announcement that
is being broadcast. For an incoming SMS text message, the additional data
field will contain the content of the message and the sender’s phone number.

2.3.2 Manifest

Each Android application comes with an AndroidManifest.xml file that
informs the system about the app’s components. Activities and services that
are not declared in the manifest can never run. Broadcast receivers, however,
can either be declared in the manifest or be registered dynamically via
the registerReceiver() method. The manifest also specifies application
requirements such as special hardware requirements (e.g., having a camera or
GPS sensor), or the minimal API version necessary to run this app.
In order to access protected components (e.g., camera access, or access to
the user’s contact list), an application needs to be granted permission. All
necessary permissions must be declared in the app’s AndroidManifest.xml.
This way, during installation, the Android OS can prompt the user with an
overview of the requested permissions, after which the user explicitly has to grant the
app access to use these components.
Within the OS, each protected component is associated with a unique Linux group
ID. When an app is granted permissions, its VM becomes a member of the
accompanying groups and can thus access the restricted components.

2.3.3 Native code

It may be helpful for certain types of applications to use native code
languages like C and C++ so that they can reuse existing code libraries
written in these languages. Typical good candidates for native
code usage are self-contained, CPU intensive operations such as signal
processing, game engines, and so on. Unlike Java bytecode, native code runs
directly on the processor and is thus not interpreted by the Dalvik VM.

2.3.4 Distribution
Android users are free to install any (third-party) application via the
Google Play Store (previously known as the Android Market). Google Play is
an online application distribution platform where users can download and
install free or paid applications from various developers (including Google
itself). To protect the Play Store from malicious applications, Google uses an
in-house developed automated anti-virus system named Google Bouncer.
Users can also install applications from sources other than
Google Play. For this, a user must enable the unknown sources option in the
device’s settings overview and explicitly accept the risks of doing so. By
using external installation sources, users can install APK files downloaded
from the web directly, or choose to use third-party markets. These third-party
markets sometimes offer a specialized type of application, such as
MiKandi’s Adult app store, or target users from specific countries, like the
Chinese app stores Anzhi and Xiaomi (a popular Chinese phone
manufacturer).

2.4 Malware

Recent reports focusing on mobile malware trends estimate that the
number of malicious Android apps is now in the range of 120,000 to 718,000
[3][34]. In this section, we take a closer look at mobile malware
characteristics, how malware is distributed, and what data sets are publicly
available for malware researchers.

2.4.1 Types of malware


The majority of Android malware can be categorized in two types, both
using social engineering to trick users into installing the malicious software.
Fake install/SMS trojan: The majority of Android malware is classified as
fake installers or SMS trojans. These apps pretend to be an installer for
legitimate software and trick users into installing them on their devices.
When executed, the app may display a service agreement and, once the user
has agreed, sends premium-rate text messages. The promised functionality is
almost never available. Variants include repackaged applications that provide
the same functionality as the original (often paid) app, but have additional
code to secretly send SMS messages in the
background. SMS trojans are relatively easy to implement: only a single
main activity with a button that initiates the sending of an SMS message
when clicked is required. It is estimated that, on average, each deployed
sample generates an immediate profit of around $10 [34]. This type of
attack is also referred to as toll fraud. High profit and easy manufacturing
make toll fraud apps popular among malware authors.
Spyware/Botnet: Another observed type of Android malware is classified as spyware and has
capabilities to forward private data to a remote server. In a more complex
form, the malware could also receive commands from the server to start
specific activities, in which case it is part of a botnet. Spyware is likely to use
some of the components described in Section 2.3.1. Broadcast receivers are
of particular interest as they can be used to secretly intercept and forward
incoming SMS messages to a remote server or to wait for BOOT_COMPLETED
to start a background service as soon as the device is started.
In the summer of 2012, the sophisticated Eurograbber attack showed that
this type of malware may be very lucrative by stealing an estimated €36,000,000
from bank customers in Italy, Germany, Spain and the Netherlands [1].

2.4.2 Malware distribution

A problem with the third-party marketplaces described in Section 2.3.4 is the
lack of accountability. There are often no entry limitations for mobile app
developers, which results in poor and unreliable applications being pushed to
these stores and making it to Android devices. Juniper Networks finds that
malicious applications often originate from these marketplaces, with China
(173 stores hosting some malware) and Russia (132 ’infected’ stores) being
the world’s leading suppliers [34].
One of the issues Android has to deal with in respect to malware
distribution is the loose management of the devices. Over the past few years,
Android versions have become fragmented, with only 6.5% of all devices
running the latest Android version 4.2 (codename Jelly Bean). More than two
years after its first release in February 2011, the largest share of Android devices
(33.0%) is still running Android 2.3.3-2.3.7 (codename Gingerbread). This
fragmentation makes new security features available only to the small group of
users who happen to use the latest Android release. Any technique invented to
prevent malicious behavior will not reach the majority of Android users
until they buy a new device.
One of the security enhancements in Android 4.2, for example, is the ’more
control of premium SMS’ feature. This feature notifies the user when an
application tries to send an SMS message that might cause additional charges.
It would stop a large portion of the previously discussed SMS
trojans, but is unfortunately not available to the majority of Android users.
New Android releases also come with bugfixes for core components to
protect against arbitrary code execution exploits. Android versions prior to
2.3.7 are especially vulnerable to these root exploits (examples include rage
against the cage, exploid and zergRush). While these exploits were originally
developed to overcome limitations that carriers and hardware manufacturers
put on some devices, they have also been used by malware to obtain a higher
privilege level without a user’s consent. This approach allows malware to
request only a few permissions during app installation, but still access the
entire system once the app is started.

2.4.3 Malware data sets
Public access to known Android malware samples is mainly provided via
the Android Malware Genome Project [60] and Contagio Mobile [5]. The
Malgenome project resulted from the work done by Zhou and Jiang [60] and
contains over 1200 Android malware samples, classified in 49 malware
families and collected in the period of August 2010 to October 2011.
Contagio Mobile offers an upload dropbox to share mobile malware samples
among security researchers and currently hosts 114 items.

2.5 Summary

In this section, we discussed the Android system architecture and the core
components of the Android ecosystem. We also discussed activities,
services, receivers, and intents, the building blocks of Android applications.
Finally, we discussed malware and public access to malware data sets.
Chapter 3

Related Work

In this section, we describe some of the previous approaches employed by
researchers for detecting malicious applications. The
methods in the literature adopt different strategies to detect malware
applications. These can be roughly grouped into static and dynamic analysis.
Below, we give a brief review of various approaches belonging to these
categories. We then discuss recent work on signature-based and behavior-based
malware detection.
One of the important methods for analysing malware is static
analysis, which detects malware applications before they are installed
or run on the device. Various approaches have been proposed for malware
detection based on static analysis. Comdroid was proposed by Chin et al. [14]
for detecting application communication based vulnerabilities in Android.
ProfileDroid [54] and Risk Ranker [27] leverage static analysis for profiling
and analyzing Android applications. ScanDroid [25], proposed by Fuchs et al.,
analyses the data policies in application manifests and data flows across
content providers. Barrera et al. [12] propose a methodology for identifying
application clusters based on requested permissions. Dicerbo et al. [19] use
Android permissions in the manifest file to identify malicious Android
applications. Zhou et al. [62] proposed a permission-based behavioral
footprinting scheme and a heuristics-based filtering scheme to detect
malware. Other static analysis approaches exploit the information present in
the bytecode of the Android application to predict its behavior [29]. Using the
bytecode, they retrieve information ranging from coarse-grained levels such as
packages to fine-grained levels such as instructions. However, this approach is
computationally expensive, and we thus focus on extracting permissions and
API level information in our work, as they clearly capture the application’s
behavior.
A different direction for detecting Android malware relies on dynamic
analysis, where malware is detected at run time. Zhao et al. [59]
propose AntiMalDroid, which detects Android malware by using logged behavior
sequences as features and constructing models that detect
malware and its variants at runtime. Enck et al. [21] perform
dynamic taint analysis to track the flow of private and sensitive data through
third-party applications and detect any leakage to remote servers.
Nowadays, most anti-malware products use pattern matching, and signature-based
detection is one of the most popular methods in this area. Signature-based
methods [47] [35], introduced in the mid-90s, are commonly used in malware
detection. Most anti-virus programs detect the
presence of a virus by using short identifiers called signatures, which consist
of sequences of bytes in the machine code of the infected program. A
suitable signature is one that is found in every program infected by the malware,
but is significantly less likely to be found in programs where the malware is not
present. The major weakness of this approach is that it cannot detect
metamorphic or unseen malware. Several related works
apply this approach to detect malware. Kim et al. [36] propose a power-aware
malware detection framework that
monitors, detects and analyzes previously unknown energy-depletion threats.
Their framework is composed of:

1. A power monitor that collects power samples and builds a consumption
history from the collected samples, and

2. A data analyzer that generates a power signature from the constructed
history.

Desnos et al. [17] develop an algorithm to help them construct the rules.
They convert an app to its bytecode, which contains semantic information that
allows a better analysis and the extraction of useful
information on variables, fields, and methods. They propose a signature-based
method and also use the permission properties. The final step is to
build control flow graphs from the collected data for malware detection.
Enck et al. [23] proposed Kirin, a security service that performs certification
of applications. They define a variety of potentially dangerous permission
combinations as rules to block the installation of potentially unsafe
applications. Our approach differs from these techniques, which do not
adapt to new Android malware and require continuous updates of the
signatures.
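
A rule-based check of the kind described for Kirin can be sketched as a subset test over the requested permissions. The two rules below are purely illustrative assumptions and are not copied from Kirin [23]; the function names are hypothetical as well.

```python
from typing import FrozenSet, Iterable, List

# Illustrative rules only (assumptions, not Kirin's actual rule set): each
# rule is a set of permissions treated as risky when requested together.
RULES: List[FrozenSet[str]] = [
    frozenset({"android.permission.RECEIVE_SMS", "android.permission.SEND_SMS"}),
    frozenset({"android.permission.RECORD_AUDIO", "android.permission.INTERNET"}),
]


def violated_rules(requested: Iterable[str]) -> List[FrozenSet[str]]:
    """Return every rule whose permissions are all present in the request."""
    requested_set = set(requested)
    return [rule for rule in RULES if rule <= requested_set]


def may_install(requested: Iterable[str]) -> bool:
    """Allow installation only if no rule is fully matched."""
    return not violated_rules(requested)


print(may_install(["android.permission.INTERNET"]))
print(may_install(["android.permission.RECEIVE_SMS",
                   "android.permission.SEND_SMS",
                   "android.permission.INTERNET"]))
```

The sketch also shows the limitation noted above: a malicious app that avoids the listed combinations passes unnoticed until someone adds a new rule.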
Behavior-based malware detection techniques focus on analyzing the
behavior of a program to conclude whether it is malicious or not. Behavior-based
detection usually applies machine learning algorithms to learn the behavior
and patterns of known malware in order to predict unknown or novel malware.
Instead of using predefined signatures for malware detection, data mining
and machine learning techniques provide an effective way to dynamically
extract malware patterns [46] [53]. A study by Sami et al. [44]
employed data mining techniques with features generated from Windows
executable API calls. They achieved acceptable results on a very large
dataset of about 35,000 portable executable files. Another behavioral
footprinting method, devised by Jiang et al. [33], also provides a dynamic approach
to detect self-propagating malware. For smartphone-based mobile computing
platforms, recent years have witnessed an increasing number of more
sophisticated malware attacks such as repackaging. Recent research by
Zhou et al. [60] systematically characterizes existing Android malware from
various aspects, including installation methods, activation mechanisms and the
nature of malicious payloads. A study conducted with four representative
mobile security tools and over 1200 malware samples showed that
current malware detection solutions are outdated and need to be upgraded to
the next generation.
From another perspective, Shabtai et al. [47] propose a behavior-based
Android malware detection approach, Andromaly, to protect the smartphone.
They test a series of feature selection approaches for
finding the most representative sets of features. Andromaly applies several
different machine learning algorithms such as Logistic Regression and
Bayesian Networks to classify the collected applications as benign or
malicious. In [45], the authors extract the function calls from binaries of
applications and apply their clustering mechanism, called Centroid, for
detecting unknown malware. In contrast, our approach is based on automated
analysis of Android packages. A recent paper by Sahs and Khan [43]
proposes a machine learning approach to Android malware detection based
on support vector machines (SVM). They use the Android permissions in the manifest files as
features and learn a single-class SVM model using benign samples alone.
This is in contrast to our approach, which uses API calls, permissions and
categories as features for training the naive Bayes model.
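
To make the idea of training naive Bayes on permission and API call features concrete, here is a minimal sketch of a Bernoulli naive Bayes classifier over binary features. It is not the model evaluated in this work: the four toy samples, the feature names, and the choice of Laplace smoothing are assumptions made for illustration.

```python
import math
from typing import Dict, List, Set, Tuple


def train(samples: List[Tuple[Set[str], str]], vocabulary: List[str]):
    """Estimate class priors and per-feature probabilities (Laplace smoothed)."""
    labels = sorted({label for _, label in samples})
    priors: Dict[str, float] = {}
    likelihood: Dict[str, Dict[str, float]] = {}
    for label in labels:
        rows = [features for features, lab in samples if lab == label]
        priors[label] = len(rows) / len(samples)
        likelihood[label] = {
            f: (sum(f in row for row in rows) + 1) / (len(rows) + 2)
            for f in vocabulary
        }
    return priors, likelihood


def predict(features: Set[str], priors, likelihood) -> str:
    """Return the label with the highest log-posterior for a feature set."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for f, p in likelihood[label].items():
            # Present features contribute p, absent features contribute 1 - p.
            score += math.log(p if f in features else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): permissions and one API call as binary features.
VOCAB = ["SEND_SMS", "INTERNET", "READ_CONTACTS", "sendTextMessage"]
TOY = [
    ({"SEND_SMS", "INTERNET", "sendTextMessage"}, "malware"),
    ({"SEND_SMS", "READ_CONTACTS", "sendTextMessage"}, "malware"),
    ({"INTERNET"}, "benign"),
    ({"INTERNET", "READ_CONTACTS"}, "benign"),
]
priors, likelihood = train(TOY, VOCAB)
print(predict({"SEND_SMS", "sendTextMessage"}, priors, likelihood))
print(predict({"INTERNET"}, priors, likelihood))
```

Working in log space avoids numerical underflow once the vocabulary grows to hundreds of permissions and API calls.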
Table 3.1 shows the description and results published by various authors
whose work is similar to ours.
Table 3.1: Existing approaches

Publication | Description | Results
Comdroid | Detecting application communication based vulnerabilities in Android | -
DroidMat | Android Malware Detection through Manifest and API Calls Tracing | F-measure: 0.9183
Androguard | Powerful tool to disassemble and to decompile Android apps | F-measure: 0.6611
PUMA | Permission Usage to detect Malware in Android | Accuracy: 83.32%
AMDA | Automated Malware Detection for Android; the study utilizes behavior analysis of applications as basis for malware | Accuracy: 71.1538%
Permission-Based Detection for Android Malware | Detecting malicious applications in Android system based on permissions | F-measure: 0.735849
DroidAPIMiner | Mining API-Level Features for Robust Malware Detection in Android | Accuracy: 99% and an FPR as low as 2.2% using KNN classifier
Drebin | Performs a broad static analysis, gathering as many features of an application as possible | Accuracy: 94%
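
The three measures reported in Table 3.1 (accuracy, F-measure, and false positive rate) are all derived from the same confusion matrix. The sketch below uses the standard definitions with malware as the positive class; the counts in the example are invented and do not correspond to any row of the table.

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, F-measure and FPR from confusion-matrix counts.

    tp: malware flagged as malware     fp: benign flagged as malware
    tn: benign classified as benign    fn: malware classified as benign
    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }


# Invented counts for illustration only.
print(metrics(tp=90, fp=10, tn=90, fn=10))
```

Because the publications report different measures on different data sets, the values in the table are not directly comparable with one another.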

Chapter 4

Android Malware Detection Using Permissions and Api calls

4.1 Android Application Structure

In this section, we briefly describe the Android application structure with a focus
on its important files and folders. This description serves as preliminary
knowledge for understanding apps on the Android mobile platform, from
which our algorithm can extract useful patterns for malware detection.
Android uses several partitions (including boot, system, recovery, data, etc.)
to organize files and folders on the device, with each partition having its own
functionality. Because our research focuses mainly on the identification of
malware rather than on studying the Android system, we only consider the data partition,
which contains the user’s data such as contacts, SMS, settings and all Android
applications that have been installed on the system. Figure 4.1 represents the
folder structure of an Android app.
We briefly describe the Android app structure below:
/Src: The src folder contains the Java source code files of the application
organized into packages. Once an app is installed, the source files, services,
etc. will be placed in the src folder.
/Gen: Files in the gen folder are automatically generated by the Android
Development Tools (ADT). Inside this folder, the R.java file contains
references (indices) to all resources used in a user program (i.e., app). Each time
the developer adds a new resource to the project, ADT automatically
regenerates the R.java file with a reference to the newly added resource.
/Android: The Android folder is also called the Android target library in the
Android project structure. The android.jar file contains all the essential
libraries for the user program as well as the version number/build target.
/Assets: The assets folder is used to store raw asset files, which can be
accessed through the asset manager in the Android platform.
/Res: The res folder contains all external resources for the application such as
images, layout XML files, strings, animations, audio files, etc.
/Bin: The bin folder contains the application files once an application has
been compiled, which include Java class files, apk archives, and dex files,
which are executables for the Dalvik virtual machine.

Figure 4.1: Android Folder Structure

/Android Manifest file: AndroidManifest.xml is the most important file
because it works as the road map of the application, ensuring the proper
functioning of the application in the Android system. It contains all the
necessary information about the application which the Android system may
need.
apk file: The apk file (Android Application Package file) is produced by compiling
and packaging the project into a single file which includes all the application
code (i.e., .dex files), resources, assets, and the manifest file. The
AndroidManifest.xml file must be read by the Android system prior to
launching an application in order to verify that all components exist. The
AndroidManifest.xml can be found at the root of the project directory, which
is also the root of the .apk file. The .apk file can have any name but must
have the .apk extension.
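
Since an apk file is a ZIP-compatible archive, its contents can be listed with a standard ZIP reader. The following sketch uses Python's zipfile module and builds a tiny stand-in archive in memory so that it runs without a real APK; the helper names and the entries of the stand-in archive are assumptions for illustration.

```python
import io
import zipfile


def list_apk_entries(apk_file) -> list:
    """Return the names of all entries stored in an APK (a ZIP archive)."""
    with zipfile.ZipFile(apk_file) as apk:
        return apk.namelist()


def has_required_entries(entries) -> bool:
    """Check for the manifest and at least one .dex file at the archive root."""
    has_manifest = "AndroidManifest.xml" in entries
    has_dex = any(n.endswith(".dex") and "/" not in n for n in entries)
    return has_manifest and has_dex


# Stand-in archive with empty files; a real APK path could be passed instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_apk:
    fake_apk.writestr("AndroidManifest.xml", b"")
    fake_apk.writestr("classes.dex", b"")
    fake_apk.writestr("res/layout/main.xml", b"")
buffer.seek(0)

entries = list_apk_entries(buffer)
print(entries, has_required_entries(entries))
```

Note that the manifest inside a real APK is stored in a binary XML encoding, so it has to be decoded (for example with the tools discussed in Section 4.2) before it can be read as text.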
Figure 4.2 represents the manifest file. The manifest, or
AndroidManifest.xml, includes:

• Identifying any permission that the application requires, such as Internet
access or read-access to the user’s contacts.

• Declaring the minimum API Level required by the application.

• Declaring hardware and software features used or required by the
application, such as a camera, Bluetooth services, or a multi-touch screen.

• API libraries that the application needs to be linked against (other than the
Android framework APIs), such as the Google Maps library.

AndroidManifest.xml provides first-hand information for understanding the
characteristics and security settings of each app.
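
The four kinds of declarations listed above can be read from a decoded manifest with an ordinary XML parser. The sketch below assumes the manifest has already been converted from binary to textual XML; the sample manifest, its package name, and the summarize_manifest helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the android: attribute prefix in manifest files.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical, already decoded manifest used only for this example.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.player">
    <uses-sdk android:minSdkVersion="8" />
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.READ_CONTACTS" />
    <uses-feature android:name="android.hardware.camera" />
    <application>
        <uses-library android:name="com.google.android.maps" />
    </application>
</manifest>
""".strip()


def summarize_manifest(xml_text: str) -> dict:
    """Collect permissions, minimum API level, features and libraries."""
    root = ET.fromstring(xml_text)
    name = ANDROID_NS + "name"
    uses_sdk = root.find("uses-sdk")
    min_sdk = uses_sdk.get(ANDROID_NS + "minSdkVersion") if uses_sdk is not None else None
    return {
        "package": root.get("package"),
        "min_sdk": min_sdk,
        "permissions": [e.get(name) for e in root.iter("uses-permission")],
        "features": [e.get(name) for e in root.iter("uses-feature")],
        "libraries": [e.get(name) for e in root.iter("uses-library")],
    }


print(summarize_manifest(SAMPLE_MANIFEST))
```

The resulting permission list is the kind of input from which permission-based features can be built.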

Figure 4.2: AndroidManifest

4.1.1 Android Security Approach

The Android security model relies heavily on permission-based mechanisms.
Android applications require several permissions to work, with over 130
permissions available at the developer’s disposal. Consequently, an essential
step in installing an Android application on a mobile device is to allow all
permissions requested by the application. Before an application is
installed, the system shows a list of permissions requested by the
application and asks the user to confirm the settings for installation. Even
though users have the ability to deny permission requests, a lack of
knowledge makes it possible for an app to misuse resources. For
example, requests for network access, including wifi, and short message
service (SMS) are quite normal for generic apps, whereas some malware
misuses these services to steal bandwidth or other useful information. It is therefore
very difficult for users to determine up front whether an app is
malware by using the permission requests only.
At the system level, Google announced that a security check mechanism is
applied to each application uploaded to their market. The open design of
the Android operating system still allows a user to install any application
downloaded from an untrusted source. Nevertheless, the permission list is
still the minimal defence for a user to detect whether an application could be
harmful. Users should not install applications that request unnecessary
permissions to access their personal information (e.g., phone book, access to
SMS).
Google also categorizes Android permissions into four threat levels:
Normal permission: includes lower-risk permissions which control access
to API calls that are not particularly harmful. The system automatically grants
this type of permission (e.g., SET_ALARM) to a requesting application at installation, without
asking for the user’s explicit approval.
Dangerous permission: regulates access to potentially harmful API calls that
would give access to private user data. For example, the permissions to read the
location of a user (ACCESS_FINE_LOCATION) or to write contacts (WRITE_CONTACTS) are
classified as dangerous.
Signature permission: protects access to the most dangerous privileges. The
system grants the permission only if the requesting application is signed with
the same certificate as the application that declared the permission.
Signature/System permission: a permission that the system grants only to
applications that are in the Android system image.
A simple, straightforward idea for identifying a harmful application is to
check whether the app requests permissions at the dangerous level or higher.
Although Android adopts a permission-based authorization model to control access to
its components, there is no clear evidence demonstrating how well
malicious applications can be detected based on permissions or combinations of
permissions. It should be noted that the permissions shown to a user during
the installation process are requested permissions rather than required
permissions. The requested permissions are declared manually by the application
developer. However, not all declared permissions are required by the
application. In addition to Google’s methods to protect Android from
malicious applications, many security software companies have launched their
own security apps.
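
Both ideas in this paragraph, flagging dangerous-level requests and separating requested from required permissions, reduce to simple set operations. In the sketch below the level table covers only the three permissions named in the text, and the required set is assumed to come from some separate code analysis; both function names are hypothetical.

```python
from typing import Dict, Iterable, List

# Levels for the three permissions named in the text; a complete table
# would cover every permission defined by the platform.
PROTECTION_LEVEL: Dict[str, str] = {
    "SET_ALARM": "normal",
    "ACCESS_FINE_LOCATION": "dangerous",
    "WRITE_CONTACTS": "dangerous",
}


def dangerous_requests(requested: Iterable[str]) -> List[str]:
    """Return the requested permissions classified as dangerous."""
    return sorted(p for p in requested if PROTECTION_LEVEL.get(p) == "dangerous")


def over_requested(requested: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return permissions that are declared but never needed by the code."""
    return sorted(set(requested) - set(required))


requested = ["SET_ALARM", "ACCESS_FINE_LOCATION", "WRITE_CONTACTS"]
required = ["SET_ALARM", "ACCESS_FINE_LOCATION"]  # assumed analysis result
print(dangerous_requests(requested))
print(over_requested(requested, required))
```

Neither check decides on its own whether an app is malicious, which is why permissions are combined with other features in this work.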

4.1.2 Android Permission Setting


”Permission is a restriction limiting access to a part of the code or to data
on the device. The limitation is imposed to protect critical data and code that
could be misused to distort or damage the user experience. Each permission
is identified by a unique label. Often the label indicates the action that’s
restricted”. For example, here are some permissions defined by Android:

• android.permission.CALL_EMERGENCY_NUMBERS

• android.permission.READ_OWNER_DATA

• android.permission.SET_WALLPAPER

• android.permission.DEVICE_POWER

Every Android application package (APK) has an AndroidManifest.xml
file in its root directory, as shown in Figure 4.2. The manifest file provides
essential information about the application to the Android system and
the application’s user. The Android system must have and process this information
before it can run any of the application’s code. Among other things, the
manifest file does the following, which is closely relevant to the behavior
and security settings of the app. The manifest file declares which permissions the
application must have in order to access protected parts of the API and
interact with other applications. It also declares the permissions that others
are required to have in order to interact with the application’s components.

4.2 Static analysis

Apps are statically analysed using several techniques aimed at
unpackaging and disassembling apps. This process is mainly performed using
Androguard [8]. For unpackaging apps and repackaging them into a modified app,
we use ApkTool [10] and the dex2jar tools [18]. Monkey [38] and
AndroidViewClient [9] are used to generate a common sequence of events to
interact with the apps. These events should be generated specifically for each
test to intelligently drive the GUI exploration, i.e., to test code implementing
different functionalities of the app. Culebra [9] is used to create
AndroidViewClient scripts for further automating the analysis. Below, we
describe in some detail the most popular static tools used in our work. Further
information about the particularities of each tool can be found in the
references given throughout the document.

4.2.1 Androguard
Androguard [8] is an interactive static analysis tool for third-party
Android applications. It can disassemble apps and gives access to their
components through its API. Androguard’s API also provides access to
each attribute of the binary code, such as classes, methods, and variables. The
main features of its API are:

1. APK: As already explained, the Android Application Package (APK) is a
file format used to distribute Android apps from the markets to the
devices. This package is an archive in JAR format containing a number
of files and a well-structured directory hierarchy.

2. DVM: The Dalvik Virtual Machine (DVM) is the component of the Android
OS responsible for running the apps on the device. Each Android APK
packages a DVM file, known as the Dalvik Executable Format (DEX),
containing the compiled Android application code. This component of
Androguard disassembles the DEX file and provides access to its
components. More precisely, it allows retrieving Java annotations
(metadata) about a program, the name and size of its classes, methods,
and variables, among other static features from the DVM.

3. Analysis: This library interprets Dalvik’s code and provides a semantic
analysis of the DVM. It allows identifying where permissions are used in
a specific app and when special libraries (such as crypto or reflection
libs) are used. Additionally, it also provides a Control Flow Graph (CFG)
representation of the Dalvik code flow.

4. Bytecode: The Dalvik code executed by the DVM is a compact and
efficient instruction set (numeric codes, constants, and references) that
encodes executable programs into a portable language called bytecode.
This bytecode is translated into native machine code at run time. This
facilitates the portability of the bytecode itself across different hardware-
specific platforms. However, it also makes the reverse engineering
of Android apps easier. This component of Androguard provides a
number of methods that aid bytecode analysis.
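
As a rough illustration of how the APK component can be used from a script, the sketch below relies on the AnalyzeAPK helper from androguard.misc, which in recent Androguard releases returns the APK object, the DEX object(s), and the analysis object. The summarize_apk and to_feature_vector helpers and the vocabulary are hypothetical; the import is done lazily so that the second helper can be tried without Androguard installed.

```python
from typing import Dict, Iterable, List


def summarize_apk(apk_path: str) -> Dict[str, List[str]]:
    """Extract declared permissions and components from an APK file."""
    # Imported here so the rest of the sketch runs without Androguard.
    from androguard.misc import AnalyzeAPK

    apk, dex_files, analysis = AnalyzeAPK(apk_path)
    return {
        "permissions": sorted(apk.get_permissions()),
        "activities": sorted(apk.get_activities()),
        "services": sorted(apk.get_services()),
        "receivers": sorted(apk.get_receivers()),
    }


def to_feature_vector(present: Iterable[str], vocabulary: List[str]) -> List[int]:
    """Encode the presence of each vocabulary entry as 0 or 1."""
    present_set = set(present)
    return [1 if name in present_set else 0 for name in vocabulary]


# Hypothetical vocabulary of permissions used as binary features.
vocabulary = ["android.permission.INTERNET", "android.permission.SEND_SMS"]
print(to_feature_vector(["android.permission.SEND_SMS"], vocabulary))
```

A call such as summarize_apk("sample.apk") would then provide the permission list to encode; the file name here is only a placeholder.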

4.2.2 APK Tool
ApkTool [10] is a reverse engineering tool for third-party Android
applications. This tool decodes Android apps into Smali code [48]. It
also facilitates the modification of the app or the injection of new code before
repackaging it. Smali is a DEX code disassembler that transforms bytecode
into a syntax similar to the one used in the Jasmin [31] and dedexer [16]
projects. This syntax aims at alleviating the complexity of exploring Java
Virtual Machine binaries. Thus, ApkTool reconstructs the original
resources in a human-friendly format to facilitate reverse engineering of the
code.
The main functions of ApkTool are:

1. Decompile: Performs the inverse operation to that of Dalvik’s bytecode
compiler and the APK packaging. The resulting folder contains the
manifest of the app, all Java classes in the Smali language, as well as
assembled resources, libraries, and assets.

2. Recompile: Transforms the Smali source code classes resulting from the
previous step, together with any other resources contained in the app, into an
Android APK file ready to be executed on the device. This new APK may
well be different from the original one, e.g., it can contain piggybacked
functionality.
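
The two functions map onto ApkTool's d (decode) and b (build) subcommands. The sketch below only assembles the command lines and leaves execution to a separate helper; it assumes an apktool executable is available on the PATH, and the file and folder names are placeholders.

```python
import subprocess
from typing import List


def decode_command(apk_path: str, out_dir: str) -> List[str]:
    """Command that decodes an APK into Smali code and readable resources."""
    return ["apktool", "d", apk_path, "-o", out_dir]


def build_command(src_dir: str, out_apk: str) -> List[str]:
    """Command that rebuilds a (possibly modified) folder into an APK."""
    return ["apktool", "b", src_dir, "-o", out_apk]


def run(command: List[str]) -> None:
    """Execute a command and raise if the tool reports an error."""
    subprocess.run(command, check=True)


# Placeholder names; nothing is executed until run() is called.
print(decode_command("app.apk", "app_decoded"))
print(build_command("app_decoded", "app_rebuilt.apk"))
```

A rebuilt APK generally has to be signed again before a device will accept it, which is a step outside this sketch.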

4.2.3 Monkey and Monkeyrunner


Monkey and Monkeyrunner [38] are two Android developer tools for
automatically testing Android apps. Monkey generates random
events to interact with the operating system. These events typically include
GUI actions such as touching the screen, pressing a button, etc. Monkeyrunner provides the
developer with a Python API to interact with the running apps and control the
device from the command line. The main components of Monkeyrunner are:

• Runner. This component provides a number of utility methods such as
communicating with the device, creating user interfaces, and displaying
built-in help.

• Device. This component facilitates the installation and removal of
Android packages. It also provides the appropriate interface for starting
Android Activities, sending keyboard or touch events to an app, etc.

• Image. This component provides access to the device for capturing
screenshots, converting bitmap images to various formats, and
comparing two MonkeyImage images. This component is very useful for
monitoring changes in an Activity running at a given time instant.
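
Monkey itself is usually launched through adb. The sketch below assembles such a command, assuming adb is on the PATH and a device or emulator is connected; the package name is a placeholder. Fixing the seed (-s) makes the pseudo-random event sequence repeatable, which helps when the same sequence of events should be replayed against several apps.

```python
from typing import List, Optional


def monkey_command(package: str, events: int,
                   seed: Optional[int] = None,
                   throttle_ms: Optional[int] = None) -> List[str]:
    """Build an adb command that sends random UI events to one package."""
    command = ["adb", "shell", "monkey", "-p", package]
    if seed is not None:
        command += ["-s", str(seed)]              # repeatable event sequence
    if throttle_ms is not None:
        command += ["--throttle", str(throttle_ms)]  # delay between events
    command += ["-v", str(events)]                # verbose, then event count
    return command


# Placeholder package name; pass the list to subprocess.run() to execute it.
print(monkey_command("com.example.player", 500, seed=42, throttle_ms=100))
```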

4.2.4 AndroidViewClient
AndroidViewClient is a Python tool that facilitates the creation of scripts for
interacting with the device. A remarkable feature of AndroidViewClient is its
ability to retrieve a tree view of the UI components
displayed on the device at any given moment. For instance, given an Activity,
AndroidViewClient can retrieve the clickable views nested
inside it. It then allows the user to interact with those components by,
for instance, clicking them or inserting text into a TextBox.
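
The tree view can be pictured as a nested data structure that is walked depth-first to find clickable elements. The sketch below does not use the AndroidViewClient API; the ViewNode class and the view identifiers are hypothetical stand-ins for the hierarchy such a tool returns.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ViewNode:
    view_id: str
    clickable: bool = False
    children: List["ViewNode"] = field(default_factory=list)


def clickable_views(root: ViewNode) -> Iterator[ViewNode]:
    """Depth-first walk yielding every clickable view nested under root."""
    for child in root.children:
        if child.clickable:
            yield child
        yield from clickable_views(child)


# Hypothetical screen of a music player activity.
screen = ViewNode("activity_root", children=[
    ViewNode("play_button", clickable=True),
    ViewNode("toolbar", children=[ViewNode("shuffle_button", clickable=True)]),
    ViewNode("title_label"),
])
print([view.view_id for view in clickable_views(screen)])
```

A GUI exploration script would iterate over this result and trigger each view in turn.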

4.3 Dynamic Analysis

In this section we discuss popular dynamic analysis tools. We
used an open source dynamic analysis tool called Droidbox [17] to monitor
various activities that can be used to characterize app behavior and tell
benign from suspicious behaviour. Later in this section we also discuss
Taintdroid [22].

4.3.1 Droidbox
Droidbox is a dynamic analysis tool that allows the execution of Android
apps and provides a variety of data about how an app is behaving. More
precisely, Droidbox monitors the execution of 11 different activities:

• crypto: generated when calls to the cryptographic API are invoked.

• netopen, netread, netwrite: associated with network I/O activities
(opening a connection, receiving, and sending data).

• fileopen, fileread, filewrite: associated with file system I/O activities
(opening, reading, and writing a file).

• sms: generated whenever a text message is sent or received.

• call: generated whenever a call is made or received from the device.

• leak: generated when a leakage of private information has occurred. This
is determined using tainting analysis [22].

• dexload: generated when native code is loaded dynamically.

We have extended Droidbox to allow the extraction of these activities
programmatically.

4.3.2 Taintdroid
Taintdroid [22] uses dynamic taint analysis to track sensitive information
throughout a program’s execution. Taintdroid instruments the DVM interpreter
to provide the device with a variable-level tracking system, as well as
message-level and file-level tracking. This enhancement offers valuable
insight into an app’s information flow during its execution. Figure 4.3
depicts Taintdroid’s architecture as illustrated by Enck et al. in [22] (image
source: techrepublic.com).

Figure 4.3: Taintdroid Architecture

Apart from traditional static and dynamic analysis techniques, a number of
recent works have opted for a radically different approach based on
maintaining a synchronized replica of the device in the cloud. Paranoid
Android [39], Secloud [63] and CloudShield [11] are illustrative examples of
such systems. In these cases, all security-related tasks, including monitoring,
analysis, and detection, can be performed in an environment not exposed to
battery or computational constraints. Furthermore, multiple detection
techniques can be applied simultaneously.

4.4 Reverse engineering Android App

Reverse engineering is a process by which one can discover and
understand the complete working of an application by learning its operation,
structure and functions. In this work, we use tools like ApkTool [10],
Smali/Baksmali [48], Dex2Jar [18] and the Android SDK for reverse engineering
an Android application.
ApkTool [10] is a third-party tool that is used to analyze closed Android
application binaries. We show the steps involved in Figure 4.4. To parse the
.dex file, we use a tool called Baksmali [48], which is a disassembler for the
dex format used by Dalvik. Baksmali disassembles .dex files into multiple
files with the .smali extension. Each .smali file contains the information of
exactly one class and is equivalent to a Java .class file.

[Figure 4.4 (diagram): Android apps (benign and malware) -> convert
manifest.xml to a readable manifest.xml and disassemble class.dex into
.smali files -> extract features from manifest.xml and from each .smali
file -> build profile(s)]

Figure 4.4: Different Stages in Feature Extraction

4.4.1 Static Feature Extraction and Refinement

In order to build features for an application, we unpacked the (binary) APK
file using ApkTool [10] and the Smali disassembler [48]. We extracted all the
relevant API invocations in the Android application along with the
permissions in the manifest file. Instead of using all the API calls, we use only
a subset of them, namely sensitive API calls, which are governed by
Android permission settings. To obtain the set of sensitive APIs, we relied on
the work of Felt et al. [24], who identified and used the mapping between
permissions and Android methods. Further, a sensitive API is considered only
if it is declared in the binary and if its corresponding permission is requested
in the manifest file. This resulted in the elimination of a large number of API
calls. We used the Android Asset Packaging Tool (aapt), provided by the
Android SDK, to extract and decode the data from the AndroidManifest.xml file.
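
The filtering rule described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name select_sensitive_apis and the three-entry API_TO_PERMISSION excerpt are assumptions made for the example, standing in for the full permission map of Felt et al. [24].

```python
def select_sensitive_apis(invoked_apis, requested_permissions, api_to_permission):
    """Keep an API call only if it is permission-governed AND the governing
    permission is requested in the manifest (hypothetical helper)."""
    requested = set(requested_permissions)
    return sorted(
        api
        for api in set(invoked_apis)
        if api in api_to_permission and api_to_permission[api] in requested
    )


# Illustrative excerpt only; the real map would come from Felt et al. [24].
API_TO_PERMISSION = {
    "TelephonyManager.getDeviceId": "android.permission.READ_PHONE_STATE",
    "SmsManager.sendTextMessage": "android.permission.SEND_SMS",
    "LocationManager.getLastKnownLocation": "android.permission.ACCESS_FINE_LOCATION",
}

invoked = ["TelephonyManager.getDeviceId", "SmsManager.sendTextMessage", "String.length"]
permissions = ["android.permission.READ_PHONE_STATE", "android.permission.INTERNET"]

features = select_sensitive_apis(invoked, permissions, API_TO_PERMISSION)
print(features)
```

In this example SmsManager.sendTextMessage is dropped because SEND_SMS is not requested, and String.length is dropped because it is not governed by any permission.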

4.5 Proposed Method

In this section, we explain the feature extraction and our proposed Naive
Bayes classifier which exploits the category information of an application.

4.5.1 Application categories

When a developer publishes an application in Google Play, he/she needs to
pick the category under which the application will be published. Currently,
Google Play has around 30 categories, which are shown in Table 4.1. For each
category, applications are ranked based on a combination of ratings, reviews,
downloads, country, and other factors. We studied the data extensively and
found that the number of malware samples is not uniform across the categories.
Certain categories such as Entertainment, Games and
Tools are highly prone to malware, while categories such as Medical and
Social contain few malware samples. In our work, we explicitly learn a model that
exploits this information.

4.5.2 Bayesian Classification Model

The Bayes classifier is one of the simplest yet most effective machine
learning techniques, probably because of its simplicity, linear computational
complexity and accuracy. It is also referred to as the Naive Bayesian Classifier
because it makes the naive assumption that all the features representing the
data are conditionally independent given the class one is trying to
learn.
The naive Bayesian classifier consists of training and testing phases. In the
training phase, a model learns from a sufficient number of training samples
containing both benign and malicious Android apps. Then, during the testing or
detection phase, the learnt model infers whether a given test app is benign or
malign.
We extract the desired features from each application in the corpus. The
feature set is further reduced by a feature reduction function. Each application
is represented as a vector $X = [x_1, \ldots, x_m]$, where the
$x_i \in \{0, 1\}$, $\forall i = 1, \ldots, m$, are random variables indicating
a particular characteristic feature of the
Android application. We consider the API calls and permissions as the
characteristic features. If a particular API call or permission is present in the
application, then the corresponding feature $x_i$ is set to 1, otherwise to 0.
Let $Y \in \{\text{malign}, \text{benign}\}$ denote the label of each
application. We define the application category as
$C \in \{1, 2, \ldots, K\}$, where $K$
denotes the number of categories available in the Android store. We exploit the
information from both API calls and categories, and thus we define the
posterior probability of $Y$ using Bayes’ rule as

$$P(Y = y_i \mid X = X_j, C = c_k) = \frac{P(X = X_j \mid C = c_k, Y = y_i)\, P(C = c_k \mid Y = y_i)\, P(Y = y_i)}{P(X = X_j \mid C = c_k)\, P(C = c_k)} \tag{4.1}$$

where the probabilities $P(X \mid C, Y)$, $P(C \mid Y)$, $P(C)$,
$P(X \mid C)$ and $P(Y)$ are estimated from the training data.
During inference, a test app is classified as malign if
$P(Y = \text{malign} \mid X, C) > P(Y = \text{benign} \mid X, C)$.
Since the category of a test application is known
a priori, we classify the app using the classifier trained on the corresponding
category. This makes sense because apps belonging to the same category use
similar kinds of API calls and permissions.
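
The decision rule of Equation 4.1 can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis code: the class name CategoryNaiveBayes, the Laplace smoothing constant alpha, and the toy data are all introduced here for illustration. Because the denominator of Equation 4.1 is the same for both labels, only the numerator is compared, and it is evaluated in log space with the usual naive independence assumption over the binary features.

```python
import math
from collections import defaultdict


class CategoryNaiveBayes:
    """Bernoulli naive Bayes with one set of counts per (category, label)."""

    labels = ("benign", "malign")

    def __init__(self, alpha=1.0):
        self.alpha = alpha           # Laplace smoothing (an assumption of this sketch)
        self.n = defaultdict(int)    # (category, label) -> number of training apps
        self.ones = {}               # (category, label) -> per-feature counts of x_i = 1
        self.m = 0                   # number of features

    def fit(self, X, categories, y):
        self.m = len(X[0])
        for x, c, label in zip(X, categories, y):
            key = (c, label)
            self.n[key] += 1
            counts = self.ones.setdefault(key, [0] * self.m)
            for i, xi in enumerate(x):
                counts[i] += xi
        return self

    def _log_score(self, x, c, label):
        n = self.n.get((c, label), 0)
        total_c = sum(self.n.get((c, l), 0) for l in self.labels)
        # P(C=c | Y) P(Y) = P(C=c, Y); for a fixed category this is
        # proportional to the (smoothed) share of the label in that category.
        score = math.log((n + self.alpha) / (total_c + self.alpha * len(self.labels)))
        counts = self.ones.get((c, label), [0] * self.m)
        for xi, k in zip(x, counts):
            p1 = (k + self.alpha) / (n + 2 * self.alpha)   # P(x_i = 1 | C, Y)
            score += math.log(p1 if xi else 1.0 - p1)
        return score

    def predict(self, x, c):
        return max(self.labels, key=lambda label: self._log_score(x, c, label))


# Toy data: three binary features, a single category.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
cats = ["Tools"] * 4
y = ["malign", "malign", "benign", "benign"]
model = CategoryNaiveBayes().fit(X, cats, y)
print(model.predict([1, 1, 0], "Tools"), model.predict([0, 0, 1], "Tools"))
```

Keeping separate counts per category is what "using the classifier trained on the corresponding category" amounts to in this sketch; a category never seen in training falls back to uniform smoothed estimates.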

4.6 Experiment Results and Discussions

In this section, we describe the dataset and discuss the experimental results.

4.6.1 Data set
Our data set consists of 25865 apps collected from Google Play [?] and the
Android Malware Genome Project [61]. We collected 24335 apps from
Google Play and 1530 applications from the Genome Project, as shown in Table
4.1. We collected only the top free apps in each category to create the benign
set. For the benign applications, we used VirusTotal [52] to make sure that
they are genuinely benign. Each of these benign and malware applications
belongs to one of the 30 categories defined in the Android market (Table 4.1).
Table 4.1: Benign & Malware App Categories

Category             Genuine  Malware    Category          Genuine  Malware
Arcade                  1409      123    Medical               499        5
Books & References       884       10    Music & Audio        1287       30
Brain                   1342      117    News & Magazine       545       20
Business                 574       13    Personalization      2131       16
Cards                    545       21    Photography           324       37
Casual                  1658      140    Productivity          728       85
Comics                   517       19    Racing                615       90
Communication            280       83    Shopping              169       10
Education                959       30    Social                683        5
Entertainment           1546      173    Sport                 800        4
Finance                  403       20    Sport Games           633       30
Health & Fitness         703       27    Tools                1227      275
Libraries & Demo         564       32    Transportation        397       17
Lifestyle               1112       26    Travel & Local        602       23
Media & Video            827       37    Weather               404       12

4.6.2 Evaluation measures


To evaluate the effectiveness of the proposed approach, we calculate the true
positive, true negative, false positive and false negative rates, precision,
recall, accuracy and F-measure in our experiments. These measures are
defined as follows. Let TP (true positive) be the number of Android malware
apps that are correctly detected, FN (false negative) be the number of
malware apps that are detected as benign, TN (true negative) be the number
of benign apps that are correctly classified, and FP (false positive) be the
number of benign apps that are incorrectly detected as Android malware. In
terms of classification error, two cases can occur: (a) a benign app may be
misclassified as suspicious, and (b) a suspicious app may be misclassified as
benign. For our problem, the latter case is more critical, as it is more important
to prevent a malicious app from reaching the end device than to avoid excluding a benign
app from the distribution chain. We use the following measures to check the
performance of our proposed approach.
Accuracy: The accuracy is the proportion of true results (both true
positives and true negatives) among the total number of cases examined.
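
As an illustration, the measures named above can be computed from the four counts as in the sketch below. The function name evaluation_measures and the example counts are assumptions for illustration; the formulas are the standard definitions, with malware as the positive class.

```python
def evaluation_measures(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts (malware = positive)."""

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # identical to the true positive rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "tpr": recall,
        "tnr": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


# Made-up counts, used only to exercise the function.
measures = evaluation_measures(tp=90, fp=20, tn=180, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```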

Figure 4.5 shows the frequently occurring API calls and permissions in the
Android applications. We consider only the API calls and permissions that are
highly frequent as features. We adopted a ten-fold cross validation strategy
for our experiments: we trained our model using 9 folds and tested on the
remaining fold. We repeated the experiment 10 times, each time holding out a
different fold, and report the average
accuracy. This ensures that a wider range of samples is used for testing the classifier.
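
A ten-fold protocol of this kind can be sketched as follows. This is only a sketch: the helper names k_fold_indices and cross_validate, the fixed shuffling seed, and the train_fn/predict_fn callbacks are assumptions for illustration, and the sketch assumes at least k samples.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, average the accuracy."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Toy usage with a trivial "model" that already knows the labelling rule.
data = list(range(20))
truth = [s % 2 for s in data]
average_accuracy = cross_validate(data, truth, lambda X, y: None, lambda m, s: s % 2)
print(average_accuracy)
```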
We also conducted our experiments following Aafer et al. [6], but with four
different sets of top features. These features are selected based on
how frequently they occur in our samples, as shown in Figure 4.5. We refer to
the top 10, 15 and 25 ranked features as 10Tf, 15Tf and 25Tf respectively, and
to the five lowest ranked features as 5Lf.
Figure 4.7 shows the error rates and accuracy for different feature sets with
and without category information. We observe increasing accuracy and
decreasing error rates when a larger number of features is used to train the
classifier. It is also evident that exploiting the category information yields a
clear improvement in accuracy and error rates. Also, note that there is
a difference of almost 20% in performance between the 5Lf and 10Tf feature
sets, indicating the importance of ranking features by the frequency of
API calls.
Figure 4.8 shows the true negative and false positive rates, and Figure 4.9
shows the true positive and false negative rates, using different sets of
features. In both cases we observe an improvement in
performance when the category information is included in the model. Finally,
we show the precision and recall in Figures 4.10 and 4.11 for a varying
number of features. As the number of features increases, both precision and recall
improve, and when the category information is included in the model, the
performance is even better.
We summarize the results of the various measures without category
information in Table 3.2 and with category information in Table 3.3. It can be
observed from Table 3.3 that an average improvement of 3-4% across
all the categories is achieved.
We also report the Area Under the Curve (AUC), defined as the
total area under the Receiver Operating Characteristic (ROC) curve, for
different numbers of features. The AUC for 10Tf, 15Tf, and 25Tf is
very close to 1, implying very good performance.
4.7 Summary

We proposed a naive Bayes approach for detecting Android malware
applications. Unlike the previous approach, which uses only API calls for
prediction, we combine information from API calls, permissions and
the category of an application. This is based on the observation that
every application in the Android market has a category assigned to it. We
created a large dataset of 25865 applications from Google Play and the Genome
Project. We demonstrated the effectiveness of our approach on the dataset and
showed that exploiting category information indeed improves the
performance. As future work, we plan to explore whether static analysis can be
combined with dynamic analysis to achieve better performance.
Naive Bayes is very straightforward: training amounts to counting feature
occurrences. When the
training set is small, high bias/low variance classifiers (e.g., Naive Bayes)
have an advantage over low bias/high variance classifiers (e.g., kNN or
logistic regression), since the latter tend to overfit. A Naive Bayes classifier also
converges more quickly than discriminative models such as logistic regression and
works well with less data.
Chapter 5

Android Malware Detection using Association Rule based Classification

5.1 Introduction

Knowledge Discovery and Data Mining (KDD) is playing an important
role in extracting knowledge in this era of data overflow. KDD consists of
many methods and techniques that can be applied to different data to extract
knowledge. Some of the methods include association, classification, and
clustering. In this work, we primarily focus on association and classification.
Association rule mining is the discovery of association relationships among
a set of items in a dataset. Association rule mining has become an important
data mining technique due to the descriptive and easily understandable
nature of the rules. Although association rule mining was introduced to
extract associations from market basket data [7], it has proved useful in many
other domains (e.g. microarray data analysis, recommender systems, and
network intrusion detection). In the domain of market basket analysis, the data
consists of transactions, where each transaction is a set of items purchased by a customer.
A common way of measuring the usefulness of association rules is to use the
support-confidence framework introduced in [7]. The support of a rule is the
percentage of transactions that contain all the items in the rule, and the
confidence is the percentage of transactions that contain all the items in the
rule among those transactions that contain the items in the antecedent of the
rule.
The problem of association rule mining can be stated as follows: given a dataset of
transactions, a threshold support (minsupport), and a threshold confidence
(minconfidence), generate all association rules from the set of transactions
that have support greater than or equal to minsupport and confidence greater
than or equal to minconfidence.
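
The two measures can be illustrated with a short sketch. The function names and the four toy market-basket transactions are assumptions for illustration; values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent."""
    antecedent = frozenset(antecedent)
    whole_rule = antecedent | frozenset(consequent)
    covered = sum(antecedent <= t for t in transactions)
    return sum(whole_rule <= t for t in transactions) / covered if covered else 0.0


# Toy market-basket data.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here the rule bread -> milk appears in 2 of the 4 baskets (support 0.5) and holds in 2 of the 3 baskets containing bread (confidence 2/3), so it would be reported for, say, minsupport = 0.5 and minconfidence = 0.6.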
Classification is another method of data mining. Classification can be
defined as learning a function that maps (classifies) a data instance into one
of several predefined class labels. The data from which a classification
function or model is learned is known as the training set. A separate testing
set is used to test the classifying ability of the learned model or function.
Examples of classification models include decision trees, Bayesian models,
and neural nets. When classification models are constructed from rules, they
are often represented as a decision list (a list of rules where the order of rules
corresponds to
the significance of the rules). Classification rules are of the form $P \rightarrow c$,
where $P$ is a pattern in the training data and $c$ is a predefined class label
(target).
As part of this thesis, we study and build classifiers from association rules.
Given that association rules are descriptive in nature, they are useful in
learning about relationships in the data. The learned relationships can be
helpful in analyzing the domain. But the usefulness of the rules can be further
extended if predictive models can be extracted from the rules. Given that the
number of rules produced is a function of the minsupport and the
minconfidence thresholds, the challenge is to generate an appropriate number
of rules that can be useful in developing predictive models.
We combine the association rule mining and classification rule mining
techniques to build a classifier. The integration is done by focusing on mining
a special subset of association rules, called class association rules (CARs).
To select the best features that distinguish malware from benign
apps, we rely on API level information within the bytecode, since it conveys
substantial semantics about the app’s behaviour. More specifically, we focus
on critical API calls and their package level information.
Rather than simply treating the individual API calls as items, we represent
an item as a combination of caller and callee API. We capture one level of
control flow and context between caller and callee. Each item in our model is
of the form A%B, where A is the caller and B is the callee. We use
Androguard [8], a reverse engineering tool, to perform API level feature
extraction and data flow analysis.
Association rule based classification was introduced in [37]. The authors propose an
Apriori-like algorithm called CBA-RG for generating rules and another
algorithm called CBA-CB for building the classifier. The rules generated by
CBA-RG are called classification association rules (CARs), as they have a
predefined class label or target. From the generated CARs, a subset is
selected based on the heuristic criterion that the subset of rules can classify
the training set accurately.
Many other classification systems have been built based on association
rules [58] and [57]. In our work, we have implemented an association rule-
based classifier system in the WEKA framework. WEKA is a data mining
system developed at the University of Waikato and has become very popular
among the academic community working on data mining. We have chosen to
develop this system in WEKA as we realize the usefulness of having such a
classifier in the WEKA environment. To generate classification association
rules, we make use of the CBA algorithm [40]. CBA is an extended version of
the Apriori algorithm that is capable of mining associations from set-valued
and temporal datasets.
More generally, we have adapted the algorithm to generate only rules that
satisfy user-specified constraints. We achieve this by integrating these
constraints into the mining phase, so that we can use the constraints to prune
itemsets that would not yield rules of the type that the user desires.

5.1.1 Problem Statement


Our problem can be further broken down as follows:

• Adapt the Apriori algorithm to generate classification association rules
(CARs) efficiently.

• Build classification models:
– Build a framework to generate models from CARs.
– Build a classifier using the CARs to distinguish malware apps from
genuine apps.

5.2 Background Information

5.2.1 Association Rules


Association rule mining was introduced in [7] as a way to find associative
patterns in market basket data. The market basket data consist of
transactions, where a transaction is a set of items purchased by a customer.
The motivation for applying this data mining approach to market basket data
was to learn about buying patterns and use that information in catalog design
and store layout design.
Many association rule mining algorithms have been proposed in the data
mining literature. Apriori
[7] and FP-growth [28] are two of them. Apriori uses the property that all
nonempty subsets of a frequent itemset must also be frequent [7] to prune
the search space. Apriori follows a breadth-first search strategy, while
FP-growth follows a depth-first search strategy. Several extensions of the basic
association rule mining algorithm have been published. One of them is the
CBA-RG algorithm [37], which adapts Apriori to generate classification
association rules efficiently. The generated rules are used in CBA-CB
[37] to extract a classification model. We have implemented CBA-CB as part
of our model building system.

5.2.2 Apriori Algorithm


The Apriori algorithm was introduced in [7] as a way to generate
association rules from market basket data. The Apriori algorithm is a
two-stage process: a frequent itemset mining stage (which finds the itemsets
that satisfy the minimum support threshold) and a rule generation stage (which
outputs the rules that satisfy the minimum confidence threshold).

5.2.3 The Frequent Itemset Mining Stage


In the first iteration of Apriori’s frequent itemset mining stage, each item
becomes part of the 1-item candidate set $C_1$. The algorithm makes a pass
over the data set to count support for $C_1$ (see Figure 5.1). Those itemsets
satisfying the minimum support form $L_1$, the set of frequent itemsets of
size 1. To generate the candidate itemsets of size 2 ($C_2$), the level-1 collection
of frequent itemsets is joined with itself. This join is denoted by
$L_1 \bowtie L_1$ and is
equal to the collection of all set unions of different itemsets in $L_1$. The
algorithm scans the database for the support of the itemsets in $C_2$. Those itemsets
satisfying the minimum support condition form $L_2$.
When generating candidates of size 3 ($C_3$), $L_2 \bowtie L_2$ is performed, but with a
condition. Apriori assumes that the items in an itemset are sorted according to
a predefined order (e.g. lexicographic order). The join $L_k \bowtie L_k$ for
$k > 1$ has
the condition that, for two itemsets from $L_k$ to be joined, the first $k-1$
items must be the same in both itemsets. This ensures that the generated
candidate is of size $k+1$ and that most of its subsets are frequent.
Before counting support for the itemsets in $C_3$, the Apriori property is
applied. The Apriori property [7] states that all nonempty subsets of an
itemset must be frequent for this itemset to be frequent. A candidate with an
infrequent subset is therefore discarded without counting, which
prunes the search space. The Apriori algorithm continues to generate frequent
itemsets until it cannot generate any more candidate itemsets.

Figure 5.1: Apriori algorithm
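
The level-wise procedure described above can be sketched as follows. This is a compact illustration rather than an optimized implementation: the function name apriori_frequent_itemsets and the toy transactions are assumptions, itemsets are represented as lexicographically sorted tuples, and support is counted by a full scan per level.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, minsupport):
    """Return {itemset (sorted tuple): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One pass over the data to count the support of each candidate.
        counts = {c: sum(frozenset(c) <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= minsupport}

    items = sorted({item for t in transactions for item in t})
    level = frequent([(item,) for item in items])      # L_1
    result = dict(level)
    while level:
        previous = sorted(level)
        candidates = []
        for a, b in combinations(previous, 2):
            if a[:-1] != b[:-1]:
                continue                               # join condition: shared prefix
            candidate = a + (b[-1],)
            # Apriori property: every subset one item smaller must be frequent.
            if all(sub in level for sub in combinations(candidate, len(candidate) - 1)):
                candidates.append(candidate)
        level = frequent(candidates)                   # next level of frequent itemsets
        result.update(level)
    return result


toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent_itemsets = apriori_frequent_itemsets(toy, minsupport=0.6)
for itemset, sup in sorted(frequent_itemsets.items()):
    print(itemset, sup)
```

On this toy data every pair is frequent, so (A, B, C) survives the join and the subset check, but it is rejected by the support count because it occurs in only 2 of the 5 transactions.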

5.2.4 The Rule Generation Stage

The frequent itemsets produced are used to generate association rules that
satisfy minimum support and minimum confidence. For each frequent
itemset, all possible splits of the itemset into two parts (antecedent and
consequent) are generated, and each resulting rule is output by
Apriori if it satisfies the minimum confidence condition.
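
The rule generation stage can be sketched as below, assuming the frequent itemsets are available as a dictionary from sorted item tuples to their supports (the name generate_rules and the toy supports are assumptions for illustration). The confidence of a split is the support of the whole itemset divided by the support of its antecedent, which is guaranteed to be in the dictionary by the Apriori property.

```python
from itertools import combinations


def generate_rules(frequent, minconfidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                consequent = tuple(i for i in itemset if i not in antecedent)
                conf = sup / frequent[antecedent]
                if conf >= minconfidence:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


# Toy supports: A occurs in 75% of transactions, B in 50%, both together in 50%.
toy_frequent = {("A",): 0.75, ("B",): 0.5, ("A", "B"): 0.5}
toy_rules = generate_rules(toy_frequent, minconfidence=0.8)
print(toy_rules)
```

With these numbers the split B -> A has confidence 0.5 / 0.5 = 1.0 and is kept, while A -> B has confidence 0.5 / 0.75 and is dropped.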

5.2.5 Reverse engineering Android App

The information in an Android app’s bytecode can be used to describe its
behavior. We can extract information ranging from coarse-grained levels such as
packages to fine-grained levels such as API calls from the bytecode. In this
work we focus on extracting API level information, since it clearly captures
the app’s behavior. More specifically, we consider the class name and method name
of the callee and the package name of the caller. The bytecode also contains
user defined functions; we represent them as USERFUNC.
Reverse engineering is a process by which we discover and understand the
complete working of an app. We use tools like Androguard [8], ApkTool [10],
Smali/Baksmali [48] and the Android SDK for reverse engineering an Android
application.
Androguard [8] is a Python based tool which is used to disassemble and to
decompile Android apps. It decompiles the bytecode to smali code.
Smali/Baksmali [48] is an assembler/disassembler for the dex format used by
Dalvik, Android’s Java VM implementation.

5.2.6 Generation of items

After extracting the API calls using Androguard, we identify the caller and
the callee parts from the smali code.
For example, consider the following hypothetical code. Let A() be a user
defined function; getDeviceId and getActiveNetworkInfo are the API calls
made inside A().

.method public A([Ljava/lang/String;)V
    invoke getDeviceId()             // api call
    invoke getActiveNetworkInfo()    // api call
    return-void
.end method

Our items in this case will be

• USERFUNC%getDeviceId

• USERFUNC%getActiveNetworkInfo

where A() is the user defined function and getDeviceId() and
getActiveNetworkInfo() are the API calls.
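
The item construction can be sketched as follows, assuming the caller/callee pairs have already been extracted from the smali code. The function make_items, the call_edges input format and the api_names set are hypothetical names introduced for this illustration; they are not part of Androguard.

```python
USERFUNC = "USERFUNC"   # placeholder used for any user defined caller


def make_items(call_edges, api_names):
    """Turn (caller, callee) pairs into items of the form caller%callee.

    Only calls whose callee is a known API are kept; a caller that is not
    itself a known API is treated as user defined and replaced by USERFUNC.
    """
    items = set()
    for caller, callee in call_edges:
        if callee not in api_names:
            continue
        caller_name = caller if caller in api_names else USERFUNC
        items.add(f"{caller_name}%{callee}")
    return sorted(items)


# The hypothetical function A() from the example, plus one call to a helper.
edges = [("A", "getDeviceId"), ("A", "getActiveNetworkInfo"), ("A", "helper")]
known_apis = {"getDeviceId", "getActiveNetworkInfo"}
items = make_items(edges, known_apis)
print(items)
```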

5.3 Proposed Method

The different stages involved in our approach are shown in Figure 5.2. We
extract the API level information using Androguard [8]. We then generate the
items as explained in Section 5.2.6. We use a variation of association rule
mining [7] to generate the CARs. Finally, classification rule mining [37] is
used to classify an app as either malware or genuine.
Classification rule mining and association rule mining are two important
data mining techniques. Classification rule mining aims to discover a small
set of rules in the database to form an accurate classifier [41]. Association
rule mining finds all rules in the dataset that satisfy some minimum support
and minimum confidence constraints [7]. The integration is done by focusing
on a special subset of association rules whose right-hand side is restricted to
the classification class attribute. We refer to this subset of rules as the class
association rules (CARs). An existing association rule mining algorithm [7] is
adapted to mine all the CARs that satisfy the minimum support and minimum
confidence constraints.

[Figure 5.2 (diagram): Android apps -> disassemble each class.dex file to
.smali -> extract API level information from .smali -> profile -> CBA
classifier -> Genuine / Malware]

Figure 5.2: Different Stages in Feature Extraction

5.3.1 Feature Extraction and Refinement

In this section, we aim to systematically determine and extract the features
necessary for malware functioning. We follow a heuristic based approach for
identifying these critical features: we statically analysed a
large set of malware and benign apps and generated a list of distinct API
calls within each set. A distinct API refers to a distinct combination of class
name, method name, and descriptor. We then conducted a frequency analysis
to select those APIs which are used more in the malware set than in the benign
set. We further refined the API list to include only those with a usage
difference higher than or equal to a certain threshold. Figure 5.3 shows the top 20
APIs that produce the highest difference in usage between malware and
benign apps.
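
The frequency analysis can be sketched as below. The text does not state how usage is measured, so this sketch assumes it is the fraction of apps in a set that call the API at least once; the function name select_discriminative_apis, the threshold value and the toy data are likewise assumptions for illustration.

```python
def select_discriminative_apis(malware_apps, benign_apps, threshold):
    """Rank APIs by (usage in malware - usage in benign) and keep those whose
    difference reaches `threshold`. Each app is given as a set of API names."""

    def usage(apps):
        counts = {}
        for apis in apps:
            for api in set(apis):
                counts[api] = counts.get(api, 0) + 1
        return {api: count / len(apps) for api, count in counts.items()}

    malware_usage, benign_usage = usage(malware_apps), usage(benign_apps)
    differences = {
        api: malware_usage[api] - benign_usage.get(api, 0.0) for api in malware_usage
    }
    selected = [(api, diff) for api, diff in differences.items() if diff >= threshold]
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))


malware = [
    {"sendTextMessage", "getDeviceId"},
    {"sendTextMessage"},
    {"getDeviceId", "openConnection"},
    {"sendTextMessage", "openConnection"},
]
benign = [
    {"openConnection"},
    {"openConnection", "getDeviceId"},
    {"openConnection"},
    {"openConnection"},
]
ranked = select_discriminative_apis(malware, benign, threshold=0.25)
print(ranked)
```

In the toy data openConnection is used by every benign app but only half of the malware apps, so its difference is negative and it is not selected.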

5.3.2 Classification rule mining

Let $D$ represent the dataset of Android apps. Let $I$ be the set of all items in
$D$, where each item is of the form A%B, with A the caller and B the
callee API, and let $Y = \{\text{Benign}, \text{Malware}\}$ be the set of class labels.
We say that a data case $d \in D$ contains $X \subseteq I$, a
subset of items, if $X \subseteq d$. A class association rule (CAR) is an implication of
the form $X \rightarrow y$, where $X \subseteq I$ and $y \in Y$. A rule
$X \rightarrow y$ holds in $D$ with
confidence $c$ if $c\%$ of the cases in $D$ that contain $X$ are labelled with class $y$. The
rule $X \rightarrow y$ has support $s$ in $D$ if $s\%$ of the cases in $D$ contain $X$ and are
labelled with class $y$.
Our objectives are

1. To generate the complete set of CARs that satisfy the user specified
minimum support (called minsup) and minimum confidence (called
minconf ) constraints, and

2. to build a classifier from the CARs.

5.3.3 Generating the Complete Set of CARs
The CBA (Classification Based on Associations) approach [37] consists of two
parts: a rule generator (called CBA-RG), which is based on the Apriori algorithm for
finding association rules [7], and a classifier builder (called CBA-CB). This
section discusses CBA-RG. The next section discusses CBA-CB.

5.3.4 The CBA-RG algorithm


The CBA-RG algorithm generates all the frequent ruleitems by making
multiple passes over the data. In the first pass, it counts the support of
each individual ruleitem and determines whether it is frequent. In each subsequent
pass, it starts with the seed set of ruleitems found to be frequent in the
previous pass. It uses this seed set to generate new possibly frequent
ruleitems, called candidate ruleitems. The actual supports for these
candidate ruleitems are calculated during the pass over the data. At the end of
the pass, it determines which of the candidate ruleitems are actually frequent.
From this set of frequent ruleitems, it produces the rules (CARs).
The key operation of CBA-RG is to find all ruleitems that have support
above minsup. A ruleitem is of the form $\langle condset, y \rangle$, where
$condset$ is a set of items and $y \in Y$ is a class label.
Let k-ruleitem denote a ruleitem whose condset has $k$ items. Let $F_k$ denote
the set of frequent k-ruleitems. Let $C_k$ be the set of candidate k-ruleitems.
 1: procedure CBA-RG
 2:   F_1 = {large 1-ruleitems};
 3:   CAR_1 = genRules(F_1);
 4:   prCAR_1 = pruneRules(CAR_1);
 5:   for (k = 2; F_{k-1} ≠ ∅; k++) do
 6:     C_k = candidateGen(F_{k-1});
 7:     for each data case d ∈ D do
 8:       C_d = ruleSubset(C_k, d);
 9:       for each candidate c ∈ C_d do
10:         c.condsupCount++;
11:         if d.class = c.class then c.rulesupCount++;
12:       end for
13:     end for
14:     F_k = {c ∈ C_k | c.rulesupCount ≥ minsup};
15:     CAR_k = genRules(F_k);
16:     prCAR_k = pruneRules(CAR_k);
17:   end for
18:   CARs = ∪_k CAR_k;
19:   prCARs = ∪_k prCAR_k;
20: end procedure

Lines 2-4 represent the first pass of the algorithm. It counts the item and
class occurrences to determine the frequent 1-ruleitems (line 2). From this
set of 1-ruleitems, a set of CARs (called $CAR_1$) is generated by genRules
(line 3). $CAR_1$ is subjected to a pruning operation (line 4). Pruning is also
applied in each subsequent pass to $CAR_k$ (line 16). The function pruneRules
uses the pessimistic error rate based pruning method of C4.5 [41]. It prunes a
rule as follows: if rule $r$’s pessimistic error rate is higher than the pessimistic
error rate of rule $r^-$ (obtained by deleting one condition from the conditions
of $r$), then rule $r$ is pruned. This pruning can cut down the number of rules
generated substantially.
For each subsequent pass, say pass $k$, the algorithm performs 4 major
operations. First, the frequent ruleitems $F_{k-1}$ found in the $(k-1)$th pass are used
to generate the candidate ruleitems $C_k$ using the candidateGen function. It
then scans the database and updates the support counts of the candidates
in $C_k$. After the new frequent ruleitems have been identified to form $F_k$, the
algorithm produces the rules $CAR_k$ using the genRules function. Finally,
rule pruning is performed on these rules.
The candidateGen function is similar to the function Apriori-gen in
the Apriori algorithm.
The final set of class association rules is in CARs. The rules remaining
after pruning are in prCARs.
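
The passes of CBA-RG can be sketched in simplified form as below. This sketch is not the thesis implementation and makes several assumptions: pruneRules is omitted entirely, candidates are formed by joining condsets that are frequent for any class (rather than per class), minsup and minconf are fractions, and genRules keeps, for each condset, the class with the highest confidence. The function name cba_rg and the short item names are illustrative.

```python
from itertools import combinations


def cba_rg(cases, minsup, minconf):
    """cases: list of (items, class_label).
    Returns CARs as (condset, label, support, confidence) tuples."""
    cases = [(frozenset(items), label) for items, label in cases]
    n = len(cases)
    labels = sorted({label for _, label in cases})

    def count(condsets):
        # Frequent ruleitems: (condset, y) -> (condsupCount, rulesupCount).
        frequent = {}
        for cond in condsets:
            covered = [label for items, label in cases if frozenset(cond) <= items]
            for y in labels:
                rulesup = covered.count(y)
                if rulesup / n >= minsup:
                    frequent[(cond, y)] = (len(covered), rulesup)
        return frequent

    def gen_rules(frequent):
        best = {}
        for (cond, y), (condsup, rulesup) in frequent.items():
            conf = rulesup / condsup
            if conf >= minconf and (cond not in best or conf > best[cond][3]):
                best[cond] = (cond, y, rulesup / n, conf)
        return list(best.values())

    all_items = sorted({item for items, _ in cases for item in items})
    frequent = count([(item,) for item in all_items])    # first pass
    cars = gen_rules(frequent)
    while frequent:                                      # pass k = 2, 3, ...
        previous = sorted({cond for cond, _ in frequent})
        candidates = [
            a + (b[-1],) for a, b in combinations(previous, 2) if a[:-1] == b[:-1]
        ]
        frequent = count(candidates)
        cars.extend(gen_rules(frequent))
    return cars


# Toy data: s, g, o stand for three caller%callee items.
training = [
    ({"s", "g"}, "Malware"), ({"s", "g"}, "Malware"), ({"s"}, "Malware"),
    ({"g"}, "Benign"), ({"o"}, "Benign"), ({"o", "g"}, "Benign"),
]
cars = cba_rg(training, minsup=0.3, minconf=0.6)
for car in cars:
    print(car)
```

On the toy data the item g alone yields no rule because it is split evenly between the two classes, whereas the condset (g, s) yields a Malware rule with confidence 1.0.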

5.3.5 Building a Classifier


Here we describe the CBA-CB algorithm for building a classifier using
CARs (or prCARs). Our proposed algorithm is a heuristic one. We define a
total order on the generated rules. This order is used in selecting the rules for our
classifier.
Definition: Given two rules $r_i$ and $r_j$, $r_i > r_j$ (also read as $r_i$
precedes $r_j$, or $r_i$ has a higher precedence than $r_j$) if
1. the confidence of $r_i$ is greater than that of $r_j$, or
2. their confidences are the same, but the support of $r_i$ is greater than that of
$r_j$, or
3. both the confidences and supports of $r_i$ and $r_j$ are the same, but $r_i$ is
generated earlier than $r_j$.
Let $R$ be the set of generated rules (i.e., CARs or prCARs), and $D$ the
training data. The basic idea of the algorithm is to choose a set of high
precedence rules in $R$ to cover $D$. Our classifier $C$ is of the following format:
$\langle r_1, r_2, \ldots, r_n, \text{default\_class} \rangle$,
where $r_i \in R$ and $r_a > r_b$ if $b > a$; default_class is the default class. In classifying
an unseen case, the first rule that satisfies the case classifies it. If there is
no rule that applies to the case, it takes on the default class.
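
The precedence order and the first-match classification can be sketched as follows. Rules are represented here as (condset, label, support, confidence) tuples listed in generation order; this representation, the function names and the toy rules are assumptions for illustration.

```python
def sort_rules(rules):
    """Order rules by confidence (descending), then support (descending),
    then generation order (earlier first)."""
    indexed = sorted(enumerate(rules), key=lambda p: (-p[1][3], -p[1][2], p[0]))
    return [rule for _, rule in indexed]


def classify(items, ordered_rules, default_class):
    """Return the class of the first rule whose condset is contained in items."""
    items = set(items)
    for condset, label, _support, _confidence in ordered_rules:
        if set(condset) <= items:
            return label
    return default_class


toy_cars = [
    (("g",), "Benign", 0.4, 0.8),
    (("s",), "Malware", 0.5, 1.0),
    (("g", "s"), "Malware", 0.3, 1.0),
]
ordered = sort_rules(toy_cars)
print(classify({"g"}, ordered, "Benign"),
      classify({"g", "s"}, ordered, "Benign"),
      classify({"o"}, ordered, "Benign"))
```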
procedure CBA-CB (classification)
  R = sort(R);
  for each rule r ∈ R in sequence do
    temp = ∅;
    for each case d ∈ D do
      if d satisfies the conditions of r then
        store d.id in temp and mark r if it correctly classifies d;
    end for
    if r is marked then
      insert r at the end of C;
      delete all the cases with the ids in temp from D;
      select the default class for the current C;
      compute the total number of errors of C;
    end if
  end for
  find the first rule p in C with the lowest total number of errors and drop
    all the rules after p in C;
  add the default class associated with p to the end of C, and return C (our
    classifier).
end procedure
Our algorithm for building the classifier has the following three steps.

1. Sort the set of generated rules R according to the relation >. This is to
ensure that we will choose the highest precedence rules for our classifier.

2. Select rules for the classifier from R following the sorted sequence. For
each rule r, we go through D to find those cases covered by r (they
satisfy the conditions of r). We mark r if it correctly classifies a case d.
d.id is the unique identification number of d. If r can correctly classify at
least one case (i.e., if r is marked), it will be a potential rule in our
classifier. The cases it covers are then removed from D. A default class
is also selected (the majority class in the remaining data), which means
that if we stop selecting more rules for our classifier C, this class will be
the default class of C. We then compute and record the total number of
errors that are made by the current C and the default class. This is the
sum of the number of errors that have been made by all the selected rules
in C and the number of errors to be made by the default class in the
training data. When there is no rule or no training case left, the rule
selection process is completed.

3. Discard those rules in C that do not improve the accuracy of the
classifier. The first rule at which there is the least number of errors
recorded on D is the cutoff rule. All the rules after this rule can be
discarded because they only produce more errors. The undiscarded rules
and the default class of the last rule in C form our classifier.
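
The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation evaluated in this chapter: the name build_classifier, the (conditions, label) rule format and the toy data are hypothetical, the rules are assumed to be sorted by precedence already, the training data is assumed to be non-empty, and the default class chosen when no uncovered case remains is an arbitrary choice made for this sketch.

```python
from collections import Counter

def build_classifier(rules, data):
    """Sketch of CBA-CB rule selection.

    rules: list of (conditions, label) pairs, sorted by precedence.
    data:  non-empty list of (items, label) training cases.
    Returns (selected_rules, default_class).
    """
    remaining = list(data)
    selected = []      # rules inserted into C so far
    snapshots = []     # (total_errors, default_class) after each insertion
    rule_errors = 0    # errors made by the rules already in C

    for conditions, label in rules:
        covered = [case for case in remaining if conditions <= case[0]]
        # A rule is "marked" if it classifies at least one covered case correctly.
        if not any(true_label == label for _, true_label in covered):
            continue
        selected.append((conditions, label))
        rule_errors += sum(true_label != label for _, true_label in covered)
        remaining = [case for case in remaining if not conditions <= case[0]]

        # Default class: majority class of the cases that are still uncovered.
        counts = Counter(true_label for _, true_label in remaining)
        # Assumption: fall back to the rule's own label when nothing remains.
        default = counts.most_common(1)[0][0] if counts else label
        default_errors = len(remaining) - counts[default]
        snapshots.append((rule_errors + default_errors, default))
        if not remaining:
            break

    if not selected:
        return [], Counter(lbl for _, lbl in data).most_common(1)[0][0]

    # Cut at the first rule with the lowest recorded total number of errors.
    totals = [total for total, _ in snapshots]
    cut = totals.index(min(totals))
    return selected[:cut + 1], snapshots[cut][1]

# Hypothetical training cases and pre-sorted rules.
data = [
    ({"a", "b"}, "malware"),
    ({"a"}, "malware"),
    ({"a", "c"}, "benign"),
    ({"c"}, "benign"),
    ({"d"}, "benign"),
]
rules = [
    (frozenset({"a"}), "malware"),
    (frozenset({"c"}), "benign"),
    (frozenset({"d"}), "malware"),
]
classifier, default_class = build_classifier(rules, data)
```

In this toy run the second rule does not reduce the total error below that of the first, so it is dropped at the cutoff, and the third rule is never marked because it misclassifies the only case it covers.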

5.4 Experimental Results and Discussions

In this section, we describe the dataset and discuss the experimental results.

5.4.1 Data set
Our data set consists of 1449 apps in total. We collected 1008 top free apps
across different categories from Google Play [51] to create a benign set. Our
malware set consists of 441 apps taken from the Android Malware Genome
Project [61]. We used VirusTotal [52] to make sure that our benign set is free
from any malware.

5.4.2 Evaluation measures


To evaluate the effectiveness of the proposed approach, we calculate the true
positive, true negative, false positive and false negative rates, the precision
and recall rates, and the F-measure in our experiments. These measures are defined as
follows. Let TP (true positive) be the number of Android malware apps that
are correctly detected, FN (false negative) be the number of malware apps
that are classified as benign, TN (true negative) be the number of benign apps
that are correctly classified, and FP (false positive) be the number of
benign apps that are incorrectly detected as Android malware. In terms of
classification error, two cases can occur: (a) a benign app may be
misclassified as suspicious and (b) a suspicious app may be misclassified as
benign. For our problem, the latter case is more critical, as it is more important
to prevent a malicious app from reaching the end device than to avoid excluding a benign
app from the distribution chain. We use the following measures to check the
performance of our proposed approach.
\[ \text{TruePositiveRate (TPR)} = \frac{TP}{TP + FN} \tag{5.1} \]
\[ \text{FalsePositiveRate (FPR)} = \frac{FP}{TN + FP} \tag{5.2} \]
\[ \text{Recall (Rec)} = \frac{TP}{TP + FN} \tag{5.3} \]
\[ \text{Precision (Prec)} = \frac{TP}{TP + FP} \tag{5.4} \]
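
As a small illustration, the measures in equations (5.1) to (5.4) can be computed from the four counts as in the Python sketch below. The function name detection_metrics and the counts passed to it are hypothetical and are not results from our experiments; the sketch also assumes that no denominator is zero.

```python
def detection_metrics(tp, fn, tn, fp):
    """Compute the measures of equations (5.1)-(5.4) from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # (5.1), identical to recall (5.3)
    fpr = fp / (tn + fp)          # (5.2)
    precision = tp / (tp + fp)    # (5.4)
    return {"tpr": tpr, "fpr": fpr, "recall": tpr, "precision": precision}

# Hypothetical counts, chosen only to exercise the formulas.
m = detection_metrics(tp=90, fn=10, tn=80, fp=20)
```

Note that the true positive rate and the recall are the same quantity, so the two always coincide.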
We define two more metrics, support and confidence. The support supp(X) of
an itemset X is defined as the proportion of transactions in the database
that contain the itemset. The confidence of a rule X ⇒ Y is defined as

\[ \mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \tag{5.5} \]
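
The two definitions can be made concrete with a short Python sketch. The function names and the four transactions below are hypothetical; each transaction is assumed to hold the items of one app together with its class label.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(X => Y) = supp(X u Y) / supp(X), following equation (5.5)."""
    x = frozenset(antecedent)
    return support(x | frozenset(consequent), transactions) / support(x, transactions)

# Hypothetical transactions.
transactions = [
    {"USERFUNC%getDeviceId", "class=malware"},
    {"USERFUNC%getDeviceId", "class=malware"},
    {"USERFUNC%getDeviceId", "class=benign"},
    {"USERFUNC%openConnection", "class=benign"},
]
s = support({"USERFUNC%getDeviceId"}, transactions)
c = confidence({"USERFUNC%getDeviceId"}, {"class=malware"}, transactions)
```

Here the item occurs in three of the four transactions, and two of those three are malware, so the support is 3/4 and the confidence of the rule is 2/3.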


Figure 5.4 plots the precision and recall at different support
values. We conducted experiments by varying the minimum support used in association rule
mining and observed that the classifier performed better at higher support values.
Figure 5.5 plots the true positive rate and the false positive rate against the
support. Again, we observe that as we increase the support, both the TPR and
the FPR increase.
Figure 5.6 shows the experimental results for three different classification
algorithms. The baseline classifier (BC) uses a naive classification rule:
always classify to the largest class, in other words, classify according to the
prior. The accuracy of our baseline classifier is 0.69. The random forest (RF)
classifier fits a number of decision tree classifiers on various sub-samples of the
dataset and uses averaging to improve the predictive accuracy and control
over-fitting. We observed that the RF classifier has an accuracy of 0.75. Finally, our
approach, CBA, gives an accuracy of 0.85 when the support value during
association rule mining is set at 0.7.

Figure 5.6: Android malware detection analysis with different classifiers from
the Precision-Recall view.
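
The majority-class baseline can be sketched in Python as below. The function name majority_baseline is hypothetical, and the labels are built from the class counts of the full data set in Section 5.4.1 (1008 benign, 441 malware). Applying the baseline to the full data set is an assumption of this sketch, since the exact evaluation split is not restated here.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority, hits = counts.most_common(1)[0]
    return majority, hits / len(labels)

# Class counts taken from Section 5.4.1.
labels = ["benign"] * 1008 + ["malware"] * 441
majority, accuracy = majority_baseline(labels)
```

On these counts the baseline always predicts the benign class and is correct for 1008 of the 1449 apps, roughly 0.696, which is in line with the baseline accuracy of 0.69 reported above.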
Table 5.1 lists some interesting rules that were used by the classifier. Each
rule gives the combination of items that our classifier used
to distinguish between a genuine app and a malware app, together with the confidence
associated with the rule. For example, Rule 1 in Table 5.1 states that the
combination of USERFUNC%getSubscriberId, USERFUNC%openConnection
and SendSmsMessage%sendTextMessage classifies an app
as malware with a confidence of 0.71.
Table 5.1: Popular Rules

No      Rule                                                                                                   Confidence
Rule 1  USERFUNC%getSubscriberId, USERFUNC%openConnection, SendSmsMessage%sendTextMessage ==> class=malware    0.71
Rule 2  USERFUNC%getConnectionInfo, None%getDeviceId ==> class=malware                                         0.73
Rule 3  USERFUNC%getActiveNetworkInfo, USERFUNC%openConnection, USERFUNC%getDeviceId ==> class=malware         0.78
Rule 4  None%getDeviceId, USERFUNC%getDeviceId, USERFUNC%getActiveNetworkInfo ==> class=malware                0.78
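
To show how such a rule can be checked against an app, the Python sketch below parses the textual form of Rule 2 into caller and callee pairs and tests whether all of them occur in an app. The helper names parse_rule and rule_fires and the example app are hypothetical; the sketch relies only on each item having the form caller%callee and does not interpret what a caller such as None stands for.

```python
def parse_rule(text):
    """Split 'item1,item2 ==> class=label' into (set of (caller, callee) pairs, label)."""
    lhs, rhs = text.split("==>")
    items = set()
    for item in lhs.split(","):
        caller, callee = item.strip().split("%")
        items.add((caller, callee))
    label = rhs.strip().split("=", 1)[1]
    return items, label

def rule_fires(rule_items, app_items):
    """A rule applies when all of its caller%callee items occur in the app."""
    return rule_items <= app_items

# Rule 2 from Table 5.1.
rule2, label = parse_rule("USERFUNC%getConnectionInfo,None%getDeviceId ==> class=malware")

# Hypothetical app, described by the caller and callee pairs extracted from it.
app = {
    ("USERFUNC", "getConnectionInfo"),
    ("None", "getDeviceId"),
    ("USERFUNC", "openConnection"),
}
fires = rule_fires(rule2, app)
```

An app that contains only one of the two items does not satisfy the rule, so the rule says nothing about it and later rules or the default class decide.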

We summarize the evaluation metrics for CBA with support and
confidence values of 0.7 in Table 5.2.
Table 5.2: Evaluation measures for support and confidence of 0.7
TP rate   FP rate   Precision   Recall   F-measure
0.954     0.617     0.780       0.954    0.858
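
The F-measure is not written out as an equation in Section 5.4.2. Assuming the standard definition as the harmonic mean of precision and recall, F = 2 · Prec · Rec / (Prec + Rec), the following Python sketch checks that the values reported in Table 5.2 are consistent with one another. The function name f_measure is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition, assumed here)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in Table 5.2.
f = f_measure(0.780, 0.954)
```

Under that assumption the computed value agrees with the reported F-measure of 0.858 to three decimal places.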

5.5 Conclusion
We proposed a novel approach to distinguish Android malware
from genuine apps. The proposed technique aims to generate the
complete set of potential classification rules. We represent an item as a
combination of caller and callee API. We capture one level of control flow and
context between caller and callee. Classification based on association rule
mining helps in establishing the classification rules and, in our experiments,
achieved a higher accuracy than the other classifiers we compared against.
As future work, we plan to further reduce the false positives and negatives
by analysing the samples that were not correctly classified and finding
out the reasons behind the misclassification. We also want to develop
more sophisticated techniques to mine the CARs.

Chapter 6

Conclusions and Future Work

In this thesis, we proposed to use the permissions and API calls of Android
applications to detect malware and malicious code on the Android mobile
platform. Further, we conducted extensive studies and found that certain
categories are highly prone to malicious acts compared to other categories.
We explicitly incorporate this information in our model and learn a naive
Bayes classifier for each category using features that encode information
about permissions and API calls. Experiments on real data demonstrate that
this framework performs well in malware detection. In summary:

1. We validate that using category information together with permissions and API
calls is effective for malware detection, achieving an average malware
detection AUC of 93%.

2. Our framework does not involve any complicated dynamic tracing of
Android applications.

3. We use data mining techniques, which scale easily to large volumes of
data (applications) and make it easy to identify new malware.

We also proposed another novel approach to distinguish
Android malware from genuine apps. We combine association rule
mining and classification rule mining techniques to build a classifier. We
conduct a thorough analysis to extract features relevant to malware behavior
captured at the API level. We use classification rule mining, which aims to discover
a small set of rules from the APIs that forms an accurate classifier. Association
rule mining finds all the rules in the API dataset that satisfy some
minimum support and minimum confidence constraints. The integration is
done by focusing on mining a special subset of association rules, called class
association rules (CARs). To select the best features that distinguish
malware from benign apps, we rely on API level information within the
bytecode, since it conveys substantial semantics about the app's behaviour.
Rather than simply treating the individual API calls as items, we represent
an item as a combination of caller and callee API. We capture one level of
control flow and context between caller and callee. Each item in our model is
of the form A%B, where A is the caller and B is the callee. The experimental
results on our dataset are encouraging and show the effectiveness of
our simple yet productive approach. In summary:

• We combine association rule mining and classification rule mining for
Android malware detection.

• We achieved an accuracy of 0.85, compared with 0.69 for the baseline classifier.

As future work, we plan to further reduce the false positives and negatives
through analysing the samples that were not correctly classified and finding
out the reasons behind the misclassification.
Related Publications

• Category based Android malware detection. Vijayendra Grampurohit,
Vijay Kumar, Sanjay Rawat, Shatrunjay Rawat. In Proceedings of
ICACCI 2014, 3rd International Conference on Advances in
Communication, Computing and Informatics, Delhi.

• Android Malware Detection using Association Rule based Classification.
Vijayendra Grampurohit, Sanjay Rawat, Shatrunjay Rawat, Shubhum
Tripathi. In submission to PST 2016, 14th Privacy, Security and Trust,
Auckland, NZ.
Bibliography

[1] Eran Kalige and Darrel Burkey. A case study of Eurograbber: How 36
million euros was stolen via malware. 2012.
[2] Dan Bornstein. Dalvik VM internals. Google I/O. 2013.
[3] Kindsight Security Labs malware report - Q2 2013. Alcatel-Lucent. 2013.
[4] Over 1 billion Android-based smart phones to ship in 2017. Canalys. 2013.
[5] Contagio Mini Dump. http://contagiominidump.blogspot.in/.
[6] Y. Aafer, W. Du, and H. Yin. Droidapiminer: Mining api-level features for
robust malware detection in android. In Security and Privacy in
Communication Networks. 2013.
[7] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association
rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215,
pages 487–499, 1994.
[8] Androguard. https://code.google.com/p/androguard/.
[9] AndroidViewClient. https://github.com/dtmilano/androidviewclient.
[10] apktool. http://ibotpeaches.github.io/apktool/.
[11] M. V. Barbera, S. Kosta, J. Stefa, P. Hui, and A. Mei. Cloudshield:
Efficient anti-malware smartphone patching with a p2p network on the
cloud. In Peer-to-Peer Computing (P2P), 2012 IEEE 12th International
Conference on, pages 50–56. IEEE, 2012.
[12] D. Barrera, H. G. Kayacik, P. C. van Oorschot, and A. Somayaji. A
methodology for empirical analysis of permission-based security models
and its application to android. In ACM conference on Computer and
communications security, 2010.
[13] T. Bläsing, L. Batyuk, A.-D. Schmidt, S. A. Camtepe, and S. Albayrak.
An android application sandbox system for suspicious software detection.
In Malicious and unwanted software (MALWARE), 2010 5th
international conference on, pages 55–62. IEEE, 2010.
[14] E. Chin, A. P. Felt, K. Greenwood, and D. Wagner. Analyzing inter-
application communication in android. In International conference on
Mobile systems, applications, and services, 2011.
[15] M. Christodorescu and S. Jha. Static analysis of executables to detect
malicious patterns. Technical report, DTIC Document, 2006.
[16] Dedexer. http://dedexer.sourceforge.net/.

[17] A. Desnos and P. Lantz. Droidbox: An android application sandbox for
dynamic analysis (2011). https://code.google.com/p/droidbox,
2014.
[18] Dex2jar. https://github.com/pxb1988/dex2jar.
[19] F. Di Cerbo, A. Girardello, F. Michahelles, and S. Voronkova. Detection
of malicious applications on android os. In Computational Forensics.
2011.
[20] David Ehringer. The Dalvik virtual machine architecture.
[21] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A.
Sheth. Taintdroid: An information-flow tracking system for realtime
privacy monitoring on smartphones. In OSDI, 2010.
[22] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox, J.
Jung, P. McDaniel, and A. N. Sheth. Taintdroid: an information-flow
tracking system for realtime privacy monitoring on smartphones. ACM
Transactions on Computer Systems (TOCS), 32(2):5, 2014.
[23] W. Enck, M. Ongtang, and P. McDaniel. On lightweight mobile phone
application certification. In ACM conference on Computer and
communications security, 2009.
[24] A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner. Android
permissions demystified. In ACM conference on Computer and
communications security, 2011.
[25] A. P. Fuchs, A. Chaudhuri, and J. S. Foster. Scandroid: Automated
security certification of android applications. Manuscript, Univ. of
Maryland, http://www.cs.umd.edu/~avik/projects/scandroidascaa, 2009.
[26] Google. http://developer.android.com/about/index.html.
[27] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang. Riskranker: scalable
and accurate zero-day android malware detection. In International
conference on Mobile systems, applications, and services, 2012.
[28] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. In ACM SIGMOD Record, volume 29, pages 1–12. ACM,
2000.
[29] H. Hao, V. Singh, and W. Du. On the effectiveness of api-level access
control using bytecode rewriting in android. In ACM SIGSAC
symposium on Information, computer and communications security,
2013.
[30] Android developer page for android manifest permission group.
http://developer.android.com/reference/packages.html.
[31] Jasmine. https://github.com/guilhermechapiewski/titanium-jasmine.
[32] X. Jiang. An evaluation of the application ("app") verification service in
android 4.2. http://www.cs.ncsu.edu/faculty/jiang/appverify, Dec. 2012.
[33] X. Jiang and X. Zhu. veye: behavioral footprinting for self-propagating
worm detection and profiling. Knowledge and information systems,
18(2):231–262, 2009.
[34] Juniper. Third annual mobile threats report.
[35] J. O. Kephart and W. C. Arnold. Automatic extraction of computer virus
signatures. In 4th virus bulletin international conference, pages 178–184,
1994.

[36] H. Kim, J. Smith, and K. G. Shin. Detecting energy-greedy anomalies
and mobile malware variants. In Proceedings of the 6th international
conference on Mobile systems, applications, and services, pages 239–252.
ACM, 2008.
[37] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule
mining. pages 80–86, 1998.
[38] A. Monkey. http://developer.android.com/tools/help/monkey.html.
[39] G. Portokalidis, P. Homburg, K. Anagnostakis, and H. Bos. Paranoid
android: versatile protection for smartphones. In Proceedings of the 26th
Annual Computer Security Applications Conference, pages 347–356.
ACM, 2010.
[40] K. Pray. Mining association rules from time sequence attributes.
Master's thesis, Dept. of Computer Science, WPI, 2004.
[41] J. R. Quinlan et al. Learning with continuous classes. In 5th Australian
joint conference on artificial intelligence, volume 92, pages 343–348.
Singapore, 1992.
[42] V. Rastogi, Y. Chen, and W. Enck. Appsplayground: automatic security
analysis of smartphone applications. In Proceedings of the third ACM
conference on Data and application security and privacy, pages 209–220.
ACM, 2013.
[43] J. Sahs and L. Khan. A machine learning approach to android malware
detection. In Intelligence and Security Informatics Conference, 2012.
[44] A. Sami, B. Yadegari, H. Rahimi, N. Peiravian, S. Hashemi, and A.
Hamze. Malware detection based on mining api calls. In Proceedings of
the 2010 ACM Symposium on Applied Computing, pages 1020–1025.
ACM, 2010.
[45] A.-D. Schmidt, R. Bye, H.-G. Schmidt, J. Clausen, O. Kiraz, K. A.
Yuksel, S. A. Camtepe, and S. Albayrak. Static analysis of executables for
collaborative malware detection on android. In ICC, 2009.
[46] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data mining
methods for detection of new malicious executables. In Security and
Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages
38–49. IEEE, 2001.
[47] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss. andromaly:
a behavioral malware detection framework for android devices. Journal
of Intelligent Information Systems, 38(1):161–190, 2012.
[48] Smali. http://code.google.com/p/smali/.
[49] M. Spreitzenbarth, F. Freiling, F. Echtler, T. Schreck, and J. Hoffmann.
Mobile-sandbox: having a deeper look into android applications. In
Proceedings of the 28th Annual ACM Symposium on Applied
Computing, pages 1808–1815. ACM, 2013.
[50] G. P. Store. http://en.wikipedia.org/wiki/mobile-virus.
[51] G. P. Store. https://play.google.com/store?hl=en.
[52] V. Total. https://www.virustotal.com/.

[53] J.-H. Wang, P. S. Deng, Y.-S. Fan, L.-J. Jaw, and Y.-C. Liu. Virus
detection using data mining techniques. In Security Technology, 2003.
Proceedings. IEEE 37th Annual 2003 International Carnahan Conference
on, pages 71–76. IEEE, 2003.
[54] X. Wei, L. Gomez, I. Neamtiu, and M. Faloutsos. Profiledroid: Multi-
layer profiling of android applications. In International conference on
Mobile computing and networking, 2012.
[55] Appbrain report, google android application number in the market.
www.appbrain.com/stats/number-of-android-apps.
[56] L.-K. Yan and H. Yin. Droidscope: Seamlessly reconstructing the os and
dalvik semantic views for dynamic android malware analysis. In
USENIX security symposium, pages 569–584, 2012.
[57] Q. Yang, T. Li, and K. Wang. Building association-rule based sequential
classifiers for web-document prediction. Data mining and knowledge
discovery, 8(3):253–273, 2004.
[58] O. R. Zaïane, M.-L. Antonie, and A. Coman. Mammography
classification by an association rule-based classifier. MDM/KDD, pages
62–69, 2002.
[59] M. Zhao, F. Ge, T. Zhang, and Z. Yuan. Antimaldroid: An efficient svm-
based malware detection framework for android. In Information
Computing and Applications. 2011.
[60] Y. Zhou and X. Jiang. Dissecting android malware: Characterization and
evolution. In Security and Privacy (SP), 2012 IEEE Symposium on,
pages 95–109. IEEE, 2012.
[61] Y. Zhou and X. Jiang. Dissecting android malware: Characterization and
evolution. In Security and Privacy, 2012.
[62] Y. Zhou, X. Zhang, X. Jiang, and V. W. Freeh. Taming information-
stealing smartphone applications (on android). In Trust and Trustworthy
Computing. 2011.
[63] S. Zonouz, A. Houmansadr, R. Berthier, N. Borisov, and W. Sanders.
Secloud: A cloud-based comprehensive and lightweight security solution
for smartphones. Computers & Security, 37:215–227, 2013.
